class: center, top, title-slide

# Tools for Many Task Computing
### Center for Advanced Research Computing
University of Southern California
#### Last updated on 2026-03-20

---

## Outline

1. Many task computing
2. Slurm job arrays
3. HyperShell
4. Other tools
5. Resources and support
6. Exercises

---

class: center, middle, inverse

## Section 1

### Many task computing (MTC)

---

## Key terms and definitions

- **job** - a workload composed of one or more tasks to perform
- **task** - a distinct unit of work to perform
- **many task computing (MTC)** - running a large number of (similar) computing tasks

---

## What is many task computing?

- Running a large number of (similar) computing tasks
  - Repeating tasks with different inputs
  - Typically tasks are independent
  - For example, data processing workflows
  - For example, model parameter sweeps
- Parallel processing of tasks
  - Across multiple CPUs or GPUs
  - Across one or more compute nodes

---

## Tools for MTC

- Use tools designed for MTC
  - Slurm job arrays
  - HyperShell
- Easily submit thousands of tasks
- Various benefits
  - Automation and scaling
  - Fault tolerance and retries
  - Job packing

---

class: center, middle, inverse

## Section 2

### Slurm job arrays

---

## Slurm job arrays

- Slurm feature for submitting and managing collections of similar batch jobs quickly and easily
- Useful for repetitive workloads that follow a common job pattern
- Submit many similar jobs using one job script
  - All jobs in the array use the same options (e.g., nodes, CPUs, time, etc.)
- For long-running or short-running tasks
- [Slurm docs for job arrays](https://slurm.schedmd.com/job_array.html)

---

## Array index is key

- Use the `#SBATCH --array=` option with some index range (e.g., 1-10)
- Slurm creates the environment variable `$SLURM_ARRAY_TASK_ID`
  - Evaluates to the array index value (e.g., 1, 2, 3, etc.)
- Use this variable to select different inputs
  - Within the script/program or as a script/program argument
  - For example, select a file to use from a list of files
  - For example, select parameters to use from a list of model parameter sets
- Can also use this variable to save to different outputs

---

## Simple Slurm job array example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=5:00
#SBATCH --array=1-10

echo "Hello world from array task $SLURM_ARRAY_TASK_ID"
```

---

## Simple Slurm job array example (continued)

Submit via sbatch:

```bash
$ sbatch array.slurm
Submitted batch job 111222
```

Multiple jobs are created with unique output files:

```bash
$ cat slurm-111222_1.out
==========================================
SLURM_JOB_ID = 111223
SLURM_JOB_NODELIST = b22-16
TMPDIR = /tmp/SLURM_111223
==========================================
Hello world from array task 1
```

---

## File processing example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-10

echo "Task ID = $SLURM_ARRAY_TASK_ID"

module purge
module load rstats/4.5.2

Rscript files.R
```

---

## File processing example (continued)

```r
# R script to process data files (job array)

library(data.table)

# Select file to process
files <- list.files("data/raw", full.names = TRUE)
task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
file <- files[task]
file

# Read file
data <- fread(file)

# Filter rows
data <- data[col1 > 3]

# Write processed data to file
out <- paste0("data/derived/file-", task, ".csv")
fwrite(data, out)
```

---

## Array index option arguments

- `#SBATCH --array=1-100`
  - Continuous sequence
- `#SBATCH --array=1,5,12-20`
  - Specific index values
  - Useful for running specific tasks in an array
- `#SBATCH --array=0-100:10`
  - Sequence with steps
  - For example, step size of 10: 0, 10, 20, etc.
- `#SBATCH --array=1-100%10`
  - Continuous sequence with run limit
  - For example, only run 10 jobs at the same time

---

## Slurm environment variables for job arrays

- `SLURM_ARRAY_TASK_ID` - the job array index value
- `SLURM_ARRAY_TASK_COUNT` - the number of tasks in the job array
- `SLURM_ARRAY_TASK_MAX` - the highest job array index value
- `SLURM_ARRAY_TASK_MIN` - the lowest job array index value
- `SLURM_ARRAY_JOB_ID` - the first job ID of the array

---

## Additional notes

- Programming languages use different index bases
  - 0-based indexing (Bash, Python)
  - 1-based indexing (Julia, R, MATLAB)
- User job limits restrict the size of the array
  - Submitting a large number of jobs at the same time negatively impacts Slurm
  - Limits vary by Slurm partition
    - main partition = 5000 jobs can be in the queue at one time
    - gpu partition = 100 jobs can be in the queue at one time
  - May need to submit the array in batches

---

class: center, middle, inverse

## Section 3

### HyperShell

---

## HyperShell utility

- Cross-platform utility for processing shell commands from a task queue
- Use HyperShell to run multiple tasks (jobs) within a single Slurm job
  - Also known as job packing
- Functions like a mini-scheduler within the Slurm job
  - Schedules tasks on CPUs or GPUs
- Useful for many short-running tasks (e.g., < 1 hr each)
- Tasks can be run in sequence or in parallel
  - Depends on available resources (CPUs, memory, GPUs)
  - Depends on how much memory each task needs
  - Depends on whether tasks are serial or parallel
- [HyperShell website](https://hypershell.readthedocs.io)

---

## HyperShell example 1

Running serial tasks in parallel

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load hypershell/2.7.2

yes 'echo "Hello world from task $TASK_ID"' | head -n 100 > tasks.txt

hs cluster tasks.txt --no-db --launcher=srun \
    --template="{} >& task-\$TASK_ID.log" \
    --failures=slurm-$SLURM_JOB_ID-failures.log
```

---

## HyperShell example 2

Running parallel tasks in sequence

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=1:00:00

module purge
module load hypershell/2.7.2
module load julia/1.12.5

export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

hs cluster tasks.txt --no-db --launcher=srun \
    --template="{} >& task-\$TASK_ID.log" \
    --failures=slurm-$SLURM_JOB_ID-failures.log
```

---

## HyperShell example 3

Running parallel tasks in parallel

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --mem=0
#SBATCH --time=1:00:00

module purge
module load hypershell/2.7.2
module load julia/1.12.5

export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK

hs cluster tasks.txt --no-db --launcher=srun \
    --template="{} >& task-\$TASK_ID.log" \
    --failures=slurm-$SLURM_JOB_ID-failures.log
```

---

## HyperShell example 4

Running single-GPU tasks in parallel

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gpus-per-task=a100:1
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module purge
module load hypershell/2.7.2
module load ver/2506
module load gcc/14.3.0
module load cuda/12.9.1

hs cluster tasks.txt --no-db --launcher=srun \
    --template="{} >& task-\$TASK_ID.log" \
    --failures=slurm-$SLURM_JOB_ID-failures.log
```

---

## Additional notes

- Node resource limits will determine the HyperShell setup
  - Number of CPUs
  - Amount of memory
  - Number of GPUs
- May need to run tasks in batches in separate Slurm jobs
  - Can use a Slurm job array!

---

class: center, middle, inverse

## Section 4

### Other tools

---

## Other tools for MTC

- Shell command processors
  - [`xargs -P`](https://www.man7.org/linux/man-pages/man1/xargs.1.html)
  - [GNU parallel](https://www.gnu.org/software/parallel/)
- Language-specific tools
  - Python: [Dask](https://www.dask.org)
  - R: [future](https://www.futureverse.org), [clustermq](https://mschubert.github.io/clustermq/)
  - Julia: [Dagger.jl](https://juliaparallel.org/Dagger.jl/stable/), [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl)
  - MATLAB: [Parallel Computing Toolbox](https://www.mathworks.com/products/parallel-computing.html)
- General workflow tools
  - [HyperQueue](https://it4innovations.github.io/hyperqueue/stable/)
  - [Makeflow](https://ccl.cse.nd.edu/software/makeflow/)
  - [Snakemake](https://snakemake.github.io)
  - [Pegasus](https://pegasus.isi.edu)

---

class: center, middle, inverse

## Section 5

### Resources and support

---

## CARC support

- [Workshop materials](https://github.com/uschpc/workshop-mtc)
- [Submit a support ticket](https://www.carc.usc.edu/user-support/submit-ticket)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Get the Zoom link [here](https://www.carc.usc.edu/user-support/office-hours-and-consultations)

---

class: center, middle, inverse

## Section 6

### Exercises

---

## Exercises

1. Submit a Slurm job array using index 0-1000 with step size 100
2. Submit a HyperShell job using 8 nodes with 1 CPU core each
3. Submit a Slurm job array that uses HyperShell
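
---

## Bonus: array index in Bash

The indexing note in practice: Bash arrays are 0-based, so an `--array=0-9` range maps directly onto Bash array subscripts with no offset. A minimal sketch (the `data/raw` file layout is assumed for illustration):

```shell
# Collect input files into a 0-based Bash array
files=(data/raw/*.csv)

# Default to 0 so the snippet also runs outside a Slurm job
task_id=${SLURM_ARRAY_TASK_ID:-0}

# Select this task's input file by index
infile=${files[$task_id]}
echo "Processing $infile"
```

Compare the R file processing example, which needs no offset either, but only because its `--array=1-10` range matches R's 1-based indexing.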
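
---

## Bonus: xargs -P

For quick parallelism over a task list on a single node, `xargs -P` from the "Other tools" slide is often enough. A minimal sketch (the task list here is hypothetical):

```shell
# Generate a small task list, one argument per line
printf '%s\n' 1 2 3 4 > tasks.txt

# Run up to 4 tasks at a time; {} is replaced with each input line
xargs -P 4 -I {} echo "Hello world from task {}" < tasks.txt
```

Output order is not deterministic under `-P`, which is fine for independent tasks; write per-task log files or sort the output if order matters.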