Submitting Jobs
Interactive Jobs
Submitting an interactive job to the GPU cluster gives you direct terminal access to a GPU node, where you can test and develop your code before submitting it as a batch job.
To launch an interactive job, issue the "sinteractive" command:
itambol89@raad2-gfx:~$ sinteractive
itambol89@gfx1:~$
You will notice that the hostname in your terminal prompt has changed from raad2-gfx to gfx1. This means that you are now on a GPU node.
Interactive Python Job
Load the Python and Anaconda modules and initialize conda:
[itambol89@gfx1 ~]$ module load python39
[itambol89@gfx1 ~]$ module load anaconda/2024.10
[itambol89@gfx1 ~]$ source /cm/shared/apps/anaconda/2024.10/etc/profile.d/conda.sh
Let us activate the sample dlproject virtual environment we created and start testing:
[itambol89@gfx1 ~]$ conda activate dlproject
(dlproject) [itambol89@gfx1 ~]$ python dl.py
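The contents of dl.py are not shown in this tutorial. As a hypothetical illustration, a minimal test script (assuming PyTorch is installed in the dlproject environment) that simply confirms the GPU is visible might look like this:
# dl.py -- hypothetical minimal GPU sanity check (assumes PyTorch is installed)
import torch

def main():
    # Report whether CUDA is available and which device was assigned
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU found:", torch.cuda.get_device_name(0))
    else:
        device = torch.device("cpu")
        print("No GPU found, falling back to CPU")

    # Run a tiny computation on the selected device as a sanity check
    x = torch.randn(1024, 1024, device=device)
    y = x @ x
    print("Checksum:", y.sum().item())

if __name__ == "__main__":
    main()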
Once everything in the code is working fine, deactivate your virtual environment and exit to the login node; you are then ready to make a batch submission.
To deactivate your virtual environment:
(dlproject) [itambol89@gfx1 ~]$ conda deactivate
To exit to the login node:
[itambol89@gfx1 ~]$ exit
Interactive CUDA Job
A sample CUDA code is placed at "/ddn/share/examples/gpu-tutorial/01_cuda/add.cu". A very nice C++/CUDA tutorial covering this example can be found here.
1. Copy the sample CUDA code to your home directory
[itambol89@raad2-gfx ~]$ cp /ddn/share/examples/gpu-tutorial/01_cuda/add.cu .
2. Submit an interactive job to compile the CUDA code
[itambol89@raad2-gfx ~]$ sinteractive
[itambol89@gfx1 ~]$
3. Load the CUDA module
[itambol89@gfx1 ~]$ module load cuda12.8
4. Compile the sample code using the NVIDIA CUDA compiler (nvcc)
[itambol89@gfx1 ~]$ which nvcc
/cm/shared/apps/cuda12.8/toolkit/12.8.0/bin/nvcc
[itambol89@gfx1 ~]$ nvcc add.cu -o add_cuda
5. Run the executable
[itambol89@gfx1 ~]$ ./add_cuda
Max error: 0
[itambol89@gfx1 ~]$
6. Profile your code
[itambol89@gfx1 ~]$ nvprof ./add_cuda
==886903== NVPROF is profiling process 886903, command: ./add_cuda
Max error: 0
==886903== Profiling application: ./add_cuda
==886903== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.5970ms 1 2.5970ms 2.5970ms 2.5970ms add(int, float*, float*)
API calls: 87.62% 127.20ms 2 63.601ms 79.714us 127.12ms cudaMallocManaged
10.19% 14.794ms 1 14.794ms 14.794ms 14.794ms cudaLaunchKernel
1.79% 2.5983ms 1 2.5983ms 2.5983ms 2.5983ms cudaDeviceSynchronize
...
...
==886903== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 797.4010us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 344.1280us Device To Host
12 - - - - 1.723315ms Gpu page fault groups
Total CPU Page faults: 36
[itambol89@gfx1 ~]$
7. Exit from the interactive job
[itambol89@gfx1 ~]$ exit
Batch Jobs
Sample PBS Job file for Python
To run the sample job, you have to:
- Copy the sample job file to your working directory.
[itambol89@raad2-gfx ~]$ cp -r /ddn/share/examples/gpu-tutorial/02_python .
- In the job file, line 21, change <env_name> to the name of your virtual environment. (e.g. gpu-test)
- In the job file, line 23, change gpu_test.py to the name of your python file. (e.g. dl.py)
Your sample Python job file will then look like this:
#!/bin/bash
# Set the job name
#PBS -N gpu_test
# Set the wall time for the job (1 hour)
#PBS -l walltime=01:00:00
# Request resources: 1 node, 4 CPUs, 1 GPU, and 8GB of memory
# This line is crucial for requesting the GPU
#PBS -l select=1:ncpus=4:ngpus=1:mem=8gb
echo "Running on node(s): $PBS_NODEFILE"
cat $PBS_NODEFILE
module load anaconda/2024.10
source /cm/shared/apps/anaconda/2024.10/etc/profile.d/conda.sh
conda activate gpu-test
python gpu_test.py
Then you can submit it:
[itambol89@raad2-gfx 02_python]$ qsub job_gpu1.pbs
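The job script runs gpu_test.py, whose contents are not included in this tutorial. A hypothetical minimal gpu_test.py (again assuming PyTorch is installed in the activated environment) that exercises the requested GPU could be:
# gpu_test.py -- hypothetical minimal GPU test for the batch job (assumes PyTorch)
import torch

# The PBS job requested one GPU, so exactly one device should be visible
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small matrix multiplication as a smoke test on the selected device
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
c = a @ b
if device.type == "cuda":
    torch.cuda.synchronize()  # make sure the kernel has actually completed
print("Matmul finished on", device, "- checksum:", c.sum().item())
Anything the script prints goes to the job's standard output file, which PBS places in the submission directory.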
Sample PBS Job file for CUDA
Below is a sample PBS job file for a CUDA program. The source file "add.cu" can be found at "/ddn/share/examples/gpu-tutorial/01_cuda/add.cu".
#!/bin/bash
# Set the job name
#PBS -N gpu_test
# Set the wall time for the job (1 hour)
#PBS -l walltime=01:00:00
# Request resources: 1 node, 4 CPUs, 1 GPU, and 8GB of memory
# This line is crucial for requesting the GPU
#PBS -l select=1:ncpus=4:ngpus=1:mem=8gb
echo "Running on node(s): $PBS_NODEFILE"
cat $PBS_NODEFILE
module load cuda12.8
nvcc add.cu -o add_cuda
./add_cuda
Now submit the batch job:
[itambol89@raad2-gfx 01_cuda]$ qsub job_cuda1.pbs
The output of this job will be placed in the same directory.
PBS Command-Line Cheat Sheet
This guide provides the most commonly used PBSPro commands to submit, monitor, and manage jobs on the HPC cluster.
Submitting Jobs
Submit a job script to the queue.
[itambol89@raad2-gfx]$ qsub jobscript.pbs
Monitoring Jobs
#List all jobs in the system.
[itambol89@raad2-gfx]$ qstat
#Show jobs belonging to a specific user.
[itambol89@raad2-gfx]$ qstat -u username
#Show detailed job information, including assigned node(s).
[itambol89@raad2-gfx]$ qstat -n -1 jobid
#Check finished/completed job details.
[itambol89@raad2-gfx]$ qstat -x jobid
Job Control
#Delete (cancel) a job.
[itambol89@raad2-gfx]$ qdel jobid
#Place a job on hold.
[itambol89@raad2-gfx]$ qhold jobid
#Release a held job.
[itambol89@raad2-gfx]$ qrls jobid
Queue & Node Information
At the moment, only workq is available to users.
#Show all available queues.
[itambol89@raad2-gfx]$ qstat -q
#Show all nodes and their status.
[itambol89@raad2-gfx]$ pbsnodes -a
#Show nodes with resources summary.
[itambol89@raad2-gfx]$ pbsnodes -avS
Checking Resource Usage
#Full job information (resources, status, queue, node, etc.).
[itambol89@raad2-gfx]$ qstat -f jobid
#View job execution history (start, run, finish).
[itambol89@raad2-gfx]$ tracejob jobid