class: center, top, title-slide

# Slurm Job Management

### Center for Advanced Research Computing <br> University of Southern California <br>

#### Last updated on 2024-06-20

---

## Outline

1. What is Slurm?
2. Cluster info
3. Submitting jobs
4. Monitoring jobs
5. Other commands
6. Job dependencies
7. Job arrays

---

## First some terminology

- **compute node** — individual computer with certain specs (CPUs, RAM, etc.)
- **login (head) node** — special node for accessing cluster and submitting jobs
- **cluster** — set of interconnected compute nodes
- **partition** — distinct group of compute nodes within a cluster
- **terminal** — program for running shells
- **shell** — program for running commands
- **script** — text file with a sequence of commands to execute
- **job** — distinct set of computational work to be performed
- **scheduler** — program that controls when and where jobs are run on a cluster

---

## What is Slurm?

- Open-source management and job scheduling system for Linux computing clusters
- Three main functions
  1. Allocates access to resources (compute nodes)
  2. Provides a framework to run and monitor jobs on allocated nodes
  3. Manages a job queue for competing resource requests
- Configuration may differ across clusters
- [Official docs](https://slurm.schedmd.com)

---

## Overview of Slurm commands

| Category | Command | Purpose |
|---|---|---|
| Cluster info    | sinfo    | Display compute partition and node information |
| Submitting jobs | salloc   | Allocate resources for an interactive job |
|                 | sbatch   | Submit a job script for remote execution |
|                 | srun     | Launch parallel tasks (job steps), typically for MPI jobs |
|                 | squeue   | Display status of jobs and job steps |
|                 | sprio    | Display job priority information |
|                 | scancel  | Cancel pending or running jobs |
| Monitoring jobs | sstat    | Display status information for running jobs |
|                 | sacct    | Display accounting information for jobs |
|                 | sreport  | Generate reports from job accounting data |
| Other           | scontrol | Display or modify Slurm configuration and state |

---

## Custom CARC Slurm commands

| Command | Purpose |
|---|---|
| myaccount | View account information for user |
| nodeinfo | View node information by partition, CPU/GPU models, and state |
| noderes | View available resources on nodes |
| jobqueue | View job queue information |
| myqueue | View job queue information for user |
| jobhist | View compact history of jobs |
| jobinfo | View detailed job information |

---

## Typical batch job workflow

1. Create job script
2. Submit job script with `sbatch`
3. Check job status with `myqueue` or `jobhist`
4. When job completes, check log file
5. If job failed, modify job and resubmit (back to step 1)
  - Look for warning and error messages in log file
  - If needed, check job information with `jobinfo`
6. If job succeeded, check job information with `jobinfo`
  - Review efficiency information
  - If possible, use fewer resources next time for similar jobs

---

## Discovery cluster partitions

| Partition | Purpose |
|---|---|
| main       | Serial and small-to-medium parallel jobs (single node or multiple nodes) |
| epyc-64    | Medium-to-large parallel jobs (single node or multiple nodes) |
| gpu        | Jobs requiring GPU nodes |
| largemem   | Jobs requiring larger amounts of memory (up to 1 TB) |
| debug      | Short-running jobs for testing or debugging purposes (up to 1 hour) |
| oneweek    | Long-running jobs (up to 7 days) |

---

## sinfo

- Display compute partition and node information
- Default output lists information by partition and node status
- `nodeinfo` lists information by partition, node type, and node status
- `sinfo --help`
- [https://slurm.schedmd.com/sinfo.html](https://slurm.schedmd.com/sinfo.html)

```bash
$ sinfo

$ sinfo -lNp debug

$ nodeinfo

$ nodeinfo -s idle
```

---

## sinfo notes

- Useful options include `--Node (-N)`, `--partition (-p)`, and `--states`
- Many formatting options with `--format (-o)` or `--Format (-O)`
  - Can also set a default format with the `SINFO_FORMAT` environment variable
  - `export SINFO_FORMAT=...`
  - Add to `~/.bashrc` to automatically set when logging in
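
As a rough sketch of the environment-variable approach, the lines below set a default format and only call `sinfo` where it is installed. The field selection is just an illustration (partition, availability, time limit, node count, state, node list), not a CARC recommendation.

```bash
# Illustrative default format; adjust the fields and widths to taste
export SINFO_FORMAT="%12P %.5a %.11l %.6D %.8t %N"

# Only call sinfo where it is actually available (e.g., on a login node)
if command -v sinfo >/dev/null 2>&1; then
    sinfo --partition=debug
else
    echo "sinfo not found; SINFO_FORMAT is set to: $SINFO_FORMAT"
fi
```

Placing the `export` line in `~/.bashrc` would make the format the default for every new login shell.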

---

## Codes for common node states

- allocated
  - The node has been allocated to one or more jobs
- down
  - The node is unavailable for use
- draining
  - The node is currently executing a job, but will not be allocated additional jobs
- idle
  - The node is not allocated to any jobs and is available for use
- maint
  - The node is currently in a reservation with a flag value of "maintenance"
- mixed
  - The node has some of its CPUs ALLOCATED while others are IDLE
- reserved
  - The node is in an advanced reservation and not generally available
- planned
  - The node is planned for a job

---

## Submitting jobs

- `salloc` — request an interactive job
- `sbatch` — submit a batch job
- `srun` — run parallel (MPI) tasks (job steps) within batch jobs

---

## salloc

- Allocate resources for an interactive job
- Useful for development and debugging
- Similar options to `sbatch`
- `salloc --help`
- [https://slurm.schedmd.com/salloc.html](https://slurm.schedmd.com/salloc.html)

```bash
$ salloc

$ salloc --partition=debug --cpus-per-task=8

$ salloc -p debug -c 8
```

---

## sbatch

- Submit a job script for remote execution
- A job script is a special type of Bash script
- `sbatch --help`
- [https://slurm.schedmd.com/sbatch.html](https://slurm.schedmd.com/sbatch.html)

```bash
$ sbatch julia.job

$ sbatch julia.slurm

$ sbatch -p debug julia.job
```

---

## Generic structure of a job script

1. Specify SBATCH options (resource requests)
2. Load software and set up environment
3. Run command(s)

---

## Example job script for single-threaded job

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.3

julia script.jl
```

---

## Example job script for multi-threaded job

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.3

julia --threads 16 script.jl
```

---

## Commonly used salloc/sbatch options

| Option | Description | Default values |
|---|---|---|
| `--nodes=<number>`             | Number of nodes to use | 1 |
| `--ntasks=<number>`            | Number of processes to run | 1 |
| `--cpus-per-task=<number>`     | Number of CPUs (cores) per task | 1 |
| `--mem=<number>`               | Memory per node | n/a |
| `--mem-per-cpu=<number>`       | Memory per CPU (core) | 2G |
| `--time=<D-HH:MM:SS>`          | Maximum run time | 1:00:00 |
| `--partition=<partition_name>` | Request nodes on specified partition | main |
| `--constraint=<attribute>`     | Node property to request (e.g., `xeon-4116`) | n/a |
| `--account=<project_id>`       | Account to charge resources to | default |
| `--mail-type=<value>`          | E-mail notifications (e.g., `all`) | n/a |
| `--mail-user=<address>`        | E-mail address | n/a |
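
When `--mem` is omitted, the memory a job receives follows from the per-CPU default in the table. A minimal sketch of that arithmetic, assuming the 2G default applies to every requested CPU (the function name is hypothetical):

```bash
# Estimate the default memory in GB when neither --mem nor --mem-per-cpu is given.
# Assumes the 2G-per-CPU default listed in the table.
default_mem_gb() {
    ntasks=$1
    cpus_per_task=$2
    mem_per_cpu_gb=2
    echo $(( ntasks * cpus_per_task * mem_per_cpu_gb ))
}

# One task with 16 CPUs
echo "$(default_mem_gb 1 16)G"
```

Under this assumption, one task with 16 CPUs would receive 32G, which matches the explicit `--mem=32G` request in the multi-threaded example.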

---

## sbatch notes

- Shell environment is transferred to job when submitted
- Use `module purge` to clear modules
- Use `--mem=0` to request all available memory on a node
- For submitting lots of similar jobs, use job arrays
- For submitting lots of short-running jobs, pack them together in one job

---

## Example job script for GPU job

- Request gpu partition and one or more GPUs (model is optional)
- [Using GPUs on CARC clusters](https://carc.usc.edu/user-information/user-guides/software-and-programming/using-gpus)

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=a100:1
#SBATCH --mem=16G
#SBATCH --time=1:00:00

module purge
module load nvhpc/23.11

./cuda_program
```

---

## Job log files

- By default, output log files are named `slurm-<job_id>.out` and saved to the submit directory
- By default, both output and error messages are printed to the same output file
- Use `--output` and/or `--error` options to customize filenames or split output
- Filename patterns can be used (e.g., %x = job name → `#SBATCH --output=%x.out`)
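
A minimal sketch of splitting output and error messages into separate files. The job name `julia-test` is made up for illustration; `%x` expands to the job name and `%j` to the job ID.

```bash
#!/bin/bash

#SBATCH --job-name=julia-test
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Standard output is written to julia-test-<job_id>.out
echo "This line goes to the .out file"

# Standard error is written to julia-test-<job_id>.err
echo "This line goes to the .err file" >&2
```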

---

## Job limits on Discovery

| Partition | Maximum run time | Maximum concurrent CPUs | Maximum concurrent GPUs | Maximum concurrent memory | Maximum number of jobs queued | Maximum number of jobs running |
|---|---|---|---|---|---|---|
| main     |  48 hours   | 2,000 | 36  | n/a  | 5,000 | 100 |
| epyc-64  |  48 hours   | 2,000 | n/a | n/a  | 5,000 | 100 |
| gpu      |  48 hours   | 400   | 36  | n/a  | 100   | 36  |
| largemem | 168 hours   | 64    | n/a | 1 TB | 10    | 3   |
| debug    |   1 hour    | 48    | 4   | n/a  | 5     | 5   |
| oneweek  | 168 hours   | 208   | n/a | n/a  | 50    | 50  |

- *By default, Endeavour condo partitions will inherit the main partition limits*
- *But with a two-week maximum run time*
- *Custom limits can be requested for Endeavour condo partitions*

---

## Environment variables for sbatch

- Output variables can be used in job or application scripts
- Some examples:

| Variable | Description |
|---|---|
| SLURM_JOB_ID         | The ID of the job allocation |
| SLURM_JOB_NODELIST   | List of nodes allocated to the job |
| SLURM_JOB_NUM_NODES  | Total number of nodes in the job's resource allocation |
| SLURM_NTASKS         | Number of tasks requested |
| SLURM_CPUS_PER_TASK  | Number of CPUs requested per task |
| SLURM_SUBMIT_DIR     | The directory from which `sbatch` was invoked |
| SLURM_ARRAY_TASK_ID  | Job array ID (index) number |

---

## Example job script using Slurm variables

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.3

cd "$SLURM_SUBMIT_DIR/scripts"

julia --threads "$SLURM_CPUS_PER_TASK" script.jl
```
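
Slurm sets these variables only inside a job, so a script tested elsewhere will find them unset. One possible pattern, sketched here with hypothetical helper names, is to fall back to defaults with shell parameter expansion:

```bash
# Number of threads: the Slurm value inside a job, otherwise 1
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Working directory: the submit directory inside a job, otherwise the current one
workdir_for_job() {
    echo "${SLURM_SUBMIT_DIR:-$PWD}"
}

echo "Threads: $(threads_for_job)"
echo "Working directory: $(workdir_for_job)"
```

The helpers could then replace the bare variables, for example `julia --threads "$(threads_for_job)" script.jl`.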

---

## srun

- Launch parallel tasks (job steps) for MPI jobs
- Can also be used for job packing
- `srun --help`
- [https://slurm.schedmd.com/srun.html](https://slurm.schedmd.com/srun.html)
- [Using MPI on CARC clusters](https://carc.usc.edu/user-information/user-guides/software-and-programming/mpi)

---

## Example job script for MPI job

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=epyc-64
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=1:00:00

module purge
module load gcc/11.3.0
module load openmpi/4.1.4

ulimit -s unlimited

srun --mpi=pmix_v2 -n $SLURM_NTASKS ./mpi_program
```

---

## Example job packing script

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.3

srun --ntasks=1 --exclusive julia script1.jl &
srun --ntasks=1 --exclusive julia script2.jl &
srun --ntasks=1 --exclusive julia script3.jl &
srun --ntasks=1 --exclusive julia script4.jl &
wait
```

---

## Additional commands related to job submission

- `squeue` — view the job queue
- `sprio` — check job priority
- `scancel` — cancel jobs

---

## squeue

- Display status of jobs and job steps
- Completed jobs will not show up in `squeue`; use `jobhist` instead
- Use `myqueue` or `jobqueue` for better output formats
- `squeue --help`
- [https://slurm.schedmd.com/squeue.html](https://slurm.schedmd.com/squeue.html)

```bash
$ squeue

$ squeue --me

$ myqueue

$ jobqueue -p epyc-64
```

---

## Common codes for job state

- PD / PENDING
  - Job is awaiting resource allocation
- R / RUNNING
  - Job currently has an allocation
- CG / COMPLETING
  - Job is in the process of completing. Some processes on some nodes may still be active
- CD / COMPLETED
  - Job has terminated all processes on all nodes with an exit code of zero
- CA / CANCELLED
  - Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated

---

## Common codes for PD / PENDING reason

- Resources
  - The job is waiting for resources to become available
- Priority
  - One or more higher priority jobs exist for this partition or advanced reservation
- ReqNodeNotAvail
  - Some node specifically required by the job is not currently available
- QOSMaxJobsPerUserLimit
  - The job has reached the maximum jobs per user limit
- QOSMaxCpuPerUserLimit
  - The job has reached the maximum CPU per user limit
- QOSMaxGRESPerUserLimit
  - The job has reached the maximum GRES (e.g., GPU) per user limit
- AssocGrpCPUMinutesLimit
  - The project account has run out of CPU time
- InvalidAccount
  - The project account is invalid

---

## squeue notes

- Useful options include `--start` and `--partition`
- Many formatting options with `--format` or `--Format`
  - Can also set a default format with the `SQUEUE_FORMAT` environment variable
  - `export SQUEUE_FORMAT=...`
  - Add to `~/.bashrc` to automatically set when logging in
- Create shorter alias
  - `alias myq="squeue --me"`
  - Add to `~/.bashrc` to automatically set when logging in

---

## Job priorities

- Based on a fairshare algorithm and job age
- Fairshare values depend on a number of factors
  - Number of jobs submitted
  - Resources used
  - Project account activity
- [https://slurm.schedmd.com/fair_tree.html](https://slurm.schedmd.com/fair_tree.html)

---

## sprio

- Display job priority information
- `sprio --help`
- Can be difficult to interpret
- If normalized, a priority value closer to 1 means a higher priority
- [https://slurm.schedmd.com/sprio.html](https://slurm.schedmd.com/sprio.html)

```bash
$ sprio -nj <job_id>

$ sprio -nu $USER

$ sprio -nlp gpu
```

---

## scancel

- Cancel pending or running jobs
- `scancel --help`
- [https://slurm.schedmd.com/scancel.html](https://slurm.schedmd.com/scancel.html)

```bash
$ scancel <job_id>

$ scancel -u $USER
```

---

## Monitoring jobs

- `sstat` — view job status for currently running job
- `sacct` — view information about current or past jobs
- `jobhist` — view compact history of jobs
- `jobinfo` — view detailed job information

---

## sstat

- Display status information for running jobs
- `sstat --help`
- [https://slurm.schedmd.com/sstat.html](https://slurm.schedmd.com/sstat.html)

```bash
$ sstat -j <job_id>

$ sstat -j <job_id> --format=JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite
```

---

## sacct

- Display accounting information for jobs
- `sacct --help`
- [https://slurm.schedmd.com/sacct.html](https://slurm.schedmd.com/sacct.html)

```bash
$ sacct

$ sacct -j <job_id>

$ sacct --format=JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite,State,ExitCode
```

---

## sacct notes

- By default, only displays jobs since midnight of the current day
- Some useful options are `--starttime`, `--endtime`, `--brief`, and `--state`
- Many formatting options with `--format`
  - Can also set a default format with the `SACCT_FORMAT` environment variable
  - `export SACCT_FORMAT=...`
  - Add to `~/.bashrc` to automatically set when logging in
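
For scripting, `sacct` can emit pipe-delimited records with `--parsable2` and `--noheader`. The sketch below filters such records for jobs whose state is not `COMPLETED`; the function name and the sample records are made up for illustration.

```bash
# Read "JobID|State|ExitCode" records on stdin and print those not COMPLETED
list_not_completed() {
    while IFS='|' read -r job_id state exit_code; do
        [ -n "$job_id" ] || continue          # skip empty lines
        if [ "$state" != "COMPLETED" ]; then
            echo "$job_id $state $exit_code"
        fi
    done
}

# Made-up sample records in the same layout
printf '%s\n' '1001|COMPLETED|0:0' '1002|FAILED|1:0' '1003|TIMEOUT|0:0' | list_not_completed

# On a cluster, the same filter could be fed by sacct:
#   sacct --noheader --parsable2 --format=JobID,State,ExitCode | list_not_completed
```

Note that running and pending jobs would also be listed, since the filter only checks for `COMPLETED`.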

---

## Job exit codes

- 0 → success
- non-zero → failure
- Exit code 1 indicates a general failure
- Exit code 2 indicates incorrect use of shell builtins
- Exit codes 3-124 indicate some error in job (check software exit codes)
- Exit code 125 indicates out of memory
- Exit code 126 indicates command cannot execute
- Exit code 127 indicates command not found
- Exit code 128 indicates invalid argument to exit
- Exit codes 129-192 indicate jobs terminated by Linux signals
  - For these, subtract 128 from the number and match to signal code
  - Enter `kill -l` to list signal codes
  - Enter `man signal` for more information
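
The subtraction rule can be sketched as a small shell function. The function name is hypothetical, and the ranges simply follow the list above.

```bash
# Describe a job exit code using the ranges listed above
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -ge 129 ] && [ "$code" -le 192 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name; some numbers have no name
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "terminated by signal $sig ($name)"
    else
        echo "failure with exit code $code"
    fi
}

describe_exit_code 137
describe_exit_code 143
```

On Linux, signal 9 is `KILL` and signal 15 is `TERM`, so exit codes 137 and 143 correspond to those two signals.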

---

## jobhist and jobinfo

- Convenience commands with better output formats
- Alternatives to `sstat` and `sacct`
- `jobinfo` provides CPU and memory efficiency

```bash
$ jobhist <days>

$ jobinfo <job_id>
```

---

## sreport

- Generate reports from job accounting data
- `sreport --help`
- [https://slurm.schedmd.com/sreport.html](https://slurm.schedmd.com/sreport.html)

```bash
$ sreport cluster accountutilizationbyuser start=2023-04-01 end=2023-04-30 users=$USER

$ sreport user topusage start=2023-04-01 end=2023-04-30 accounts=<project_id>
```

---

## scontrol

- Display or modify Slurm configuration and state
- `scontrol --help`
- Mostly for admins, some commands for users
- [https://slurm.schedmd.com/scontrol.html](https://slurm.schedmd.com/scontrol.html)

```bash
$ scontrol show partition <partition>

$ scontrol show node <node_id>

$ scontrol show job <job_id>

$ scontrol hold <job_id>

$ scontrol release <job_id>
```

---

## Job dependencies

- Use when one job depends on another job
- Defers the start of a job until the specified dependencies have been satisfied
- Some examples
  - Workflows
  - Checkpointing and splitting up long-running jobs
- Setting up a job dependency
  - Need job IDs (use `squeue` or `jobhist`)
  - Add `#SBATCH --dependency=<type>` to job script
  - Or use `sbatch --dependency=<type> my.job`
  - Different dependency types available

---

## Commonly used job dependency types

`--dependency=afterok:<job_id>[:<job_id>...]`

- This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of 0)

`--dependency=afternotok:<job_id>[:<job_id>...]`

- This job can begin execution after the specified jobs have terminated in some failed state (non-0 exit code, node failure, timed out, etc.)

`--dependency=after:<job_id>[[+time][:<job_id>[+time]...]]`

- This job can begin execution once the specified jobs have started or been cancelled and `time` minutes have passed since that start or cancellation. If no time is given, there is no delay

`--dependency=afterany:<job_id>[:<job_id>...]`

- This job can begin execution after the specified jobs have terminated

---

## Job dependency example

```bash
$ sbatch jl.job
Submitted batch job 3353548

$ sbatch --dependency=afterok:3353548 jl2.job
Submitted batch job 3353549

$ squeue --me
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3353549      main  jl2.job  ttrojan PD       0:00      1 (Dependency)
3353548      main   jl.job  ttrojan  R       0:32      1 b22-31
```
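
Copying job IDs by hand can be avoided in scripts. The sketch below builds the dependency option from one or more job IDs and uses `sbatch --parsable`, which prints only the job ID (plus `;cluster` on multi-cluster setups). The helper names are hypothetical.

```bash
# Build a --dependency option from a type and one or more job IDs
dependency_option() {
    dep_type=$1
    shift
    ids=$(printf ':%s' "$@")      # ":id1:id2..."
    echo "--dependency=${dep_type}${ids}"
}

# Strip the optional ";cluster" suffix that `sbatch --parsable` may append
job_id_only() {
    echo "${1%%;*}"
}

dependency_option afterok 3353548

# On a cluster with both job scripts present, the two steps could be chained
if command -v sbatch >/dev/null 2>&1 && [ -f jl.job ] && [ -f jl2.job ]; then
    first=$(job_id_only "$(sbatch --parsable jl.job)")
    sbatch "$(dependency_option afterok "$first")" jl2.job
fi
```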

---

## Job arrays

- For submitting and managing collections of similar jobs quickly and easily
- Some examples
  - Varying simulation or model parameters
  - Running the same processing steps on different datasets
  - Running the same statistical models on different datasets
- Setting up a job array
  - Add `#SBATCH --array=<index>` option to job script
  - Each job array task will use the same resources requested
  - Modify job or application script to use array index
  - Multiple implementation methods exist
- [https://slurm.schedmd.com/job_array.html](https://slurm.schedmd.com/job_array.html)
- For lots of short-running jobs, pack them in a single Slurm job using [HyperQueue](https://it4innovations.github.io/hyperqueue/stable/)

---

## Job array example

```bash
#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-3

module purge
module load gcc/11.3.0
module load openblas/0.3.20
module load r/4.4.0

echo "Task ID: $SLURM_ARRAY_TASK_ID"

Rscript script.R
```

---

## Job array example (continued)

```r
# R script to process and model data

library(data.table)

files <- list.files("./data", full.names = TRUE)
task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
file <- files[task]
print(file)

data <- fread(file)

summary(data)

...
```
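
Another possible method is to pick the input file in the job script itself. A minimal sketch, assuming the files in `./data` sort the same way in the shell as in R; the function name is hypothetical:

```bash
# Print the Nth entry (1-based) of a directory, in the shell's glob order
nth_file() {
    dir=$1
    index=$2
    set -- "$dir"/*               # positional parameters now hold the file list
    if [ ! -e "$1" ] || [ "$index" -lt 1 ] || [ "$index" -gt "$#" ]; then
        echo "no file at index $index in $dir" >&2
        return 1
    fi
    shift $((index - 1))
    printf '%s\n' "$1"
}

# Inside an array task SLURM_ARRAY_TASK_ID is set; default to 1 elsewhere
if [ -d ./data ]; then
    input=$(nth_file ./data "${SLURM_ARRAY_TASK_ID:-1}") && echo "Input file: $input"
fi
```

The selected path could then be passed to the application as a command-line argument, provided the application script is written to read one.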

---

## Job array example (continued)

```bash
$ sbatch array.job
Submitted batch job 3355483

$ squeue --me
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3355483_1      main array.jo  ttrojan  R       0:05      1 d05-35
3355483_2      main array.jo  ttrojan  R       0:05      1 e16-15
3355483_3      main array.jo  ttrojan  R       0:05      1 d18-29
```

---

## Additional resources

- [Slurm documentation](https://slurm.schedmd.com/)
- [Slurm cheatsheet](https://slurm.schedmd.com/pdfs/summary.pdf)
- [Running jobs on CARC clusters](https://carc.usc.edu/user-information/user-guides/hpc-basics/running-jobs)
- [CARC Slurm job script templates](https://carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates)
- [CARC Slurm cheatsheet](https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-cheatsheet)
- [CARC Slurm tools](https://github.com/uschpc/slurm-tools)

---

## CARC support

- [Submit a support ticket](https://carc.usc.edu/user-information/ticket-submission)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Find Zoom link [here](https://www.carc.usc.edu/education-and-resources/office-hours)

---

## Exercises

1. View configured resources for nodes on the largemem partition
2. Request an interactive job on the debug partition
3. Submit a "Hello world" job to the debug partition
4. Check efficiency of completed jobs