Submitting Jobs
Interactive Jobs
Submitting an interactive job to the GPU cluster gives you direct terminal access to a GPU node, where you can test and develop your code before submitting it as a batch job.
To launch an interactive job, issue the "sinteractive" command:
itambol89@raad2-gfx:~$ sinteractive
itambol89@gfx1:~$
You will notice that the hostname in your terminal prompt has changed from raad2-gfx to gfx1. This means that you are now on a GPU node.
Interactive Python Job
Load the Python and Anaconda modules and initialize conda:
[itambol89@gfx1 ~]$ module load python39
[itambol89@gfx1 ~]$ module load anaconda/2024.10
[itambol89@gfx1 ~]$ source /cm/shared/apps/anaconda/2024.10/etc/profile.d/conda.sh
Let us activate the sample dlproject virtual environment we created and start testing:
[itambol89@gfx1 ~]$ conda activate dlproject
(dlproject) [itambol89@gfx1 ~]$ python dl.py
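The contents of dl.py are not shown in this tutorial. As a hypothetical illustration, a minimal test script (assuming PyTorch is installed in the dlproject environment) that simply confirms the GPU is visible might look like this:
# dl.py -- hypothetical minimal GPU sanity check (assumes PyTorch is installed)
import torch

def main():
    # Report whether CUDA is available and which device was assigned
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("GPU found:", torch.cuda.get_device_name(0))
    else:
        device = torch.device("cpu")
        print("No GPU found, falling back to CPU")

    # Run a tiny computation on the selected device as a sanity check
    x = torch.randn(1024, 1024, device=device)
    y = x @ x
    print("Checksum:", y.sum().item())

if __name__ == "__main__":
    main()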
Once everything in the code is working fine, deactivate your virtual environment and exit to the login node; you are then ready to make a batch submission.
To deactivate your virtual environment:
(dlproject) [itambol89@gfx1 ~]$ conda deactivate
To exit to the login node:
[itambol89@gfx1 ~]$ exit
Interactive CUDA Job
A sample CUDA code is placed at "/ddn/share/examples/gpu-tutorial/01_cuda/add.cu". A very nice C++/CUDA tutorial covering this example can be found here.
1. Copy the sample CUDA code to your home directory
[itambol89@raad2-gfx ~]$ cp /ddn/share/examples/gpu-tutorial/01_cuda/add.cu .
2. Submit an interactive job to compile the CUDA code
[itambol89@raad2-gfx ~]$ sinteractive
[itambol89@gfx1 ~]$
3. Load the CUDA module
[itambol89@gfx1 ~]$ module load cuda12.8
4. Compile the sample code using the NVIDIA CUDA compiler (nvcc)
[itambol89@gfx1 ~]$ which nvcc
/cm/shared/apps/cuda12.8/toolkit/12.8.0/bin/nvcc
[itambol89@gfx1 ~]$ nvcc add.cu -o add_cuda
5. Run the executable
[itambol89@gfx1 ~]$ ./add_cuda
Max error: 0
[itambol89@gfx1 ~]$
6. Profile your code
[itambol89@gfx1 ~]$ nvprof ./add_cuda
==886903== NVPROF is profiling process 886903, command: ./add_cuda
Max error: 0
==886903== Profiling application: ./add_cuda
==886903== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.5970ms 1 2.5970ms 2.5970ms 2.5970ms add(int, float*, float*)
API calls: 87.62% 127.20ms 2 63.601ms 79.714us 127.12ms cudaMallocManaged
10.19% 14.794ms 1 14.794ms 14.794ms 14.794ms cudaLaunchKernel
1.79% 2.5983ms 1 2.5983ms 2.5983ms 2.5983ms cudaDeviceSynchronize
...
...
==886903== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 797.4010us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 344.1280us Device To Host
12 - - - - 1.723315ms Gpu page fault groups
Total CPU Page faults: 36
[itambol89@gfx1 ~]$
7. Exit from the interactive job
[itambol89@gfx1 ~]$ exit
Batch Jobs
Sample PBS Job file for Python
To run the sample job, you have to:
- Copy the sample job file to your working directory.
[itambol89@raad2-gfx ~]$ cp -r /ddn/share/examples/gpu-tutorial/02_python .
- In the job file, line 21, change <env_name> to the name of your virtual environment. (e.g. gpu-test)
- In the job file, line 23, change gpu_test.py to the name of your python file. (e.g. dl.py)
Your sample Python job file will then look like this:
#!/bin/bash
# Set the job name
#PBS -N gpu_test
# Set the wall time for the job (1 hour)
#PBS -l walltime=01:00:00
# Request resources: 1 node, 4 CPUs, 1 GPU, and 8GB of memory
# This line is crucial for requesting the GPU
#PBS -l select=1:ncpus=4:ngpus=1:mem=8gb
echo "Running on node(s): $PBS_NODEFILE"
cat $PBS_NODEFILE
module load anaconda/2024.10
source /cm/shared/apps/anaconda/2024.10/etc/profile.d/conda.sh
conda activate gpu-test
python gpu_test.py
Then you can submit it:
[itambol89@raad2-gfx 02_python]$ qsub job_gpu1.pbs
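The job script runs gpu_test.py, whose contents are not included in this tutorial. A hypothetical minimal gpu_test.py (again assuming PyTorch is installed in the activated environment) that exercises the requested GPU could be:
# gpu_test.py -- hypothetical minimal GPU test for the batch job (assumes PyTorch)
import torch

# The PBS job requested one GPU, so exactly one device should be visible
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small matrix multiplication as a smoke test on the selected device
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
c = a @ b
if device.type == "cuda":
    torch.cuda.synchronize()  # make sure the kernel has actually completed
print("Matmul finished on", device, "- checksum:", c.sum().item())
Anything the script prints goes to the job's standard output file, which PBS places in the submission directory.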
Sample PBS Job file for CUDA
Below is a sample PBS job file for a CUDA program. The source file "add.cu" can be found at "/ddn/share/examples/gpu-tutorial/01_cuda/add.cu".
#!/bin/bash
# Set the job name
#PBS -N gpu_test
# Set the wall time for the job (1 hour)
#PBS -l walltime=01:00:00
# Request resources: 1 node, 4 CPUs, 1 GPU, and 8GB of memory
# This line is crucial for requesting the GPU
#PBS -l select=1:ncpus=4:ngpus=1:mem=8gb
echo "Running on node(s): $PBS_NODEFILE"
cat $PBS_NODEFILE
module load cuda12.8
nvcc add.cu -o add_cuda
./add_cuda
Now submit the batch job:
[itambol89@raad2-gfx 01_cuda]$ qsub job_cuda1.pbs
The output of this job will be placed in the same directory.
PBS Command-Line Cheat Sheet
This guide provides the most commonly used PBSPro commands to submit, monitor, and manage jobs on the HPC cluster.
Submitting Jobs
Submit a job script to the queue.
[itambol89@raad2-gfx]$ qsub jobscript.pbs
Monitoring Jobs
#List all jobs in the system.
[itambol89@raad2-gfx]$ qstat
#Show jobs belonging to a specific user.
[itambol89@raad2-gfx]$ qstat -u username
#Show detailed job information, including assigned node(s).
[itambol89@raad2-gfx]$ qstat -n -1 jobid
#Check finished/completed job details.
[itambol89@raad2-gfx]$ qstat -x jobid
Job Control
#Delete (cancel) a job.
[itambol89@raad2-gfx]$ qdel jobid
#Place a job on hold.
[itambol89@raad2-gfx]$ qhold jobid
#Release a held job.
[itambol89@raad2-gfx]$ qrls jobid
Queue & Node Information
At the moment, only workq is available to users.
#Show all available queues.
[itambol89@raad2-gfx]$ qstat -q
#Show all nodes and their status.
[itambol89@raad2-gfx]$ pbsnodes -a
#Show nodes with resources summary.
[itambol89@raad2-gfx]$ pbsnodes -avS
Checking Resource Usage
#Full job information (resources, status, queue, node, etc.).
[itambol89@raad2-gfx]$ qstat -f jobid
#View job execution history (start, run, finish).
[itambol89@raad2-gfx]$ tracejob jobid