class: center, top, title-slide # R Programming for HPC
### Center for Advanced Research Computing
University of Southern California
#### Last updated on 2026-05-08 --- ## Outline 1. HPC overview 2. Parallel programming 3. Using memory efficiently 4. Processing data efficiently 5. Profiling and benchmarking 6. Resources and support --- class: center, middle, inverse ## Section 1 ### HPC overview --- ## What is HPC? - High-performance computing - Relative to laptop and desktop computers - More and better compute resources - More processors (cluster of compute nodes) - More memory (shared and distributed memory) - High-capacity, fast-access storage - High-speed, high-bandwidth networks --- ## What does HPC enable? - **Scaling** — using more compute resources (CPUs, memory, GPUs, etc.) - **Speedup** — faster run times - **Throughput** — running many jobs at the same time - **Duration** — running jobs for days or weeks --- ## Computing constraints - A compute node (individual computer) has four main components - Central Processing Unit (CPU with multiple cores/logical CPUs) - Random Access Memory (RAM) - Storage drives (direct-attached and/or network-attached storage) - Input/Output (I/O) buses - Four types of constraints (not exclusive) - CPU cores = CPU-bound jobs - Memory = Memory-bound jobs - I/O speed = I/O-bound jobs - Network speed = Network-bound jobs --- ## Compute node specs - HPC clusters may have a mix of different compute nodes - Different types of CPUs (e.g., Intel vs. AMD) - Varying numbers of **cores per node** (time) - Varying amounts of **memory per node** (space) - Different memory bandwidth rates - Some nodes may also have GPUs - On CARC clusters - Use `noderes` command to display available node resources - 1 logical CPU = 1 core = 1 thread (`--cpus-per-task`) --- ## CPU specs - Different CPU models have different capabilities and features - Instruction set architectures (e.g., x86-64, ARM64, etc.) - Microarchitecture designs (e.g., AMD Zen 3, Intel Cascade Lake, etc.) - Vector extensions (e.g., AVX, AVX2, AVX-512, etc.) - Clock speed (aka frequency) - Local memory cache size - Simply running jobs on a better CPU will typically reduce run time - Use `lscpu` to display CPU specs on a node --- ## Requesting vs. using HPC resources - For Slurm jobs, requesting resources does not mean that the job actually uses them - Likely need to modify your R code to make use of multiple cores or nodes - There are different ways to do this - Requesting more cores/nodes does not necessarily lead to speedups - There is an optimal number or cores depending on the computations --- ## Limitations of R - R is an interpreted language - Can be slow relative to compiled languages (C/C++/Fortran) - But easier to program - R uses a single core by default - R stores objects in memory by default - Maximum length of a vector is 2^31 - 1 (2,147,483,647) (32-bit integer) - Some of these limitations can be overcome --- ## Some recommendations for performance - Use a better CPU - Use existing optimized solutions - Code first, optimize later (if needed) - Simplify when possible (do less) - Parallelize when appropriate - Modify-in-place (avoid duplicating data in memory) - Profile code to identify bottlenecks - Consult package and function documentation --- ## Using R on CARC clusters - Via [CARC OnDemand](https://ondemand.carc.usc.edu) interactive apps - RStudio Server, JupyterLab, Code Server, etc. - Apps run as Slurm jobs - Via command line within Slurm jobs (interactive or batch) - R loaded via software module - Which uses R container built and run using Apptainer ```bash # Load R module module purge module load r/4.5.3 # Interactive mode R # Batch mode Rscript script.R ``` --- ## Prereqs for examples Start interactive Slurm job via [RStudio Server app](https://ondemand.carc.usc.edu/pun/sys/dashboard/batch_connect/sys/rstudio/session_contexts/new) Or via salloc: ```bash # Request resources salloc --cpus-per-task=8 --time=2:00:00 # Load and start R module purge module load r/4.5.3 R ``` Install packages: ```bash install.packages("RhpcBLASctl") install.packages("data.table") install.packages("parallelly") install.packages("future") install.packages("future.apply") install.packages("proftools") install.packages("boot") install.packages("bench") ``` --- class: center, middle, inverse ## Section 2 ### Parallel programming --- ## Parallel processing and programming - Parallel processing is the simultaneous execution of different parts of a larger computation - Parallel programming implements parallel processing with code - Different kinds of parallel processing - Data parallelism (e.g., split data into chunks and process) - Task parallelism (e.g., model parameter sweeps) - Loosely coupled parallelism (e.g., independent tasks with no communication) - Tightly coupled parallelism (e.g., interdependent tasks with communication) - Parallel programming for one compute node is easier than for multiple nodes - **Scaling** to more processors can lead to a **speedup** (decrease in run time) --- ## Some costs to parallelizing - Some computations are not worth parallelizing - Incurs overhead - Changing code - Spawning child processes - Copying data and environment - Communications - Load balancing - Speedup is not necessarily proportional to number of cores (Amdahl's law) - There is an optimal number of cores to use - Depends on specific data and computations - Must experiment to find --- ## Parallel processing in R - Easy to do in most cases using existing packages - Requires changing only a few lines of code - Simply specify number of cores or threads to use - Functions with a cores or threads argument - Parallel programming details are abstracted away for end user - Typically limited to one compute node (so maximize cores on node) - Set less than or equal to physical cores - Various examples - Multi-threading linear algebra library (e.g., OpenBLAS) - Multi-threading `data.table` package (via OpenMP) - Multi-processing `parallel` package - Multi-processing `future` packages --- ## Multi-threaded BLAS/LAPACK library - R uses external library for linear algebra operations - BLAS = Basic Linear Algebra Subprograms - LAPACK = Linear Algebra Package - On CARC clusters, the R modules use multi-threaded OpenBLAS - By default, multiple threads will be used as needed (if data is large enough) - Also uses dynamic CPU architecture optimizations - Can control number of OpenBLAS threads - Statically via environment variable OPENBLAS_NUM_THREADS - Dynamically via `RhpcBLASctl` package --- ## OpenBLAS threads example Compare run time with varying threads and data sizes ```r library(RhpcBLASctl) omp_get_max_threads() # Start with 1 thread, then repeat with 8 threads omp_set_num_threads(1) # Small data mat <- matrix(rexp(1000000), ncol = 1000) mat2 <- matrix(rexp(1000000), ncol = 1000) system.time(mat %*% mat2) system.time(eigen(mat)) system.time(svd(mat)) # Large data mat <- matrix(rexp(25000000), ncol = 5000) mat2 <- matrix(rexp(25000000), ncol = 5000) system.time(mat %*% mat2) system.time(eigen(mat)) system.time(svd(mat)) ``` --- ## data.table threads example Compare time to write files with varying threads ```r library(data.table) getDTthreads() mat <- matrix(rexp(200000000), ncol = 10000) # Start with 1 thread setDTthreads(1) system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat.csv"))) # Repeat with multiple threads threads <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")) setDTthreads(threads) system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat2.csv"))) ``` --- ## Parallel iteration - Repeat operations with different inputs and process in parallel - Various use cases - Running same data processing workflow on many data files - Running same statistical model on many datasets - Running many alternative models on same dataset - Set up cluster of cores to parallelize over (on one or more nodes) - Use high-level functions to do this (e.g., `mclapply()`) - Easier to set up with single node compared to multiple nodes --- ## The parallel package - `library(parallel)` - Included with base R - Various functions for parallel processing - [Manual page](https://cran.r-project.org/doc/manuals/r-release/packages/parallel/refman/parallel.html) --- ## mclapply() and parLapply() - Parallel versions of `lapply()` - Apply same function to different inputs in parallel using multiple cores - Return list with same length as vector of inputs - For independent tasks (no communication needed between tasks) - `mclapply()` - Parallel version of `lapply()` using forking - Single node - Works on Linux or macOS; uses 1 core on Windows - May be faster than `parLapply()` - May not work well within RStudio or other GUIs - `parLapply()` - Parallel version of `lapply()` using sockets - Single node or multiple nodes - Works on Linux, macOS, and Windows - Requires setting up workers, exporting data, etc. --- ## mclapply() example ```r # R parallel mclapply() test (bootstrapping a GLM) library(parallel) data <- iris[iris$Species != "setosa", c(1, 5)] data$Species <- factor(data$Species) model <- function(i, samp = data) { ind <- sample(nrow(samp), nrow(samp), replace = TRUE) results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit")) coef(results) } trials <- 100000 cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")) # Start with 1 core system.time(lapply(1:trials, model)) # Then repeat with multiple cores system.time(mclapply(1:trials, model, mc.cores = cores)) ``` --- ## Slurm job script for mclapply() example ```bash #!/bin/bash #SBATCH --account=
#SBATCH --partition=main #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=8 #SBATCH --mem=2G #SBATCH --time=5:00 module purge module load r/4.5.3 Rscript mclapply.R ``` --- ## The futureverse packages - A unified parallelization framework for R - Makes parallel programming easier by abstracting away technical details - Enhances base `parallel` package - **future** – abstraction for a value that may be available in the future - Programming construct for synchronizing program execution - Designed for concurrent and asynchronous evaluation of code - Various packages - `parallelly` — for utility functions - `future` — for setting up parallel workers - `future.apply` — for parallel versions of apply functions - `furrr` — for parallel versions of mapping functions - `futurize` — for one magic function - [Futureverse website](https://www.futureverse.org) --- ## Setting up parallel workers - `library(future)` - Set up parallel workers with `plan()` - `plan(sequential)` - Not parallel; uses 1 core - `plan(multicore)` - Single node - Works on Linux or macOS; uses 1 core on Windows - Not supported within RStudio or other GUIs - `plan(multisession)` - Single node - Works on Linux, macOS, and Windows - `plan(cluster)` - Single node or multiple nodes - Works on Linux, macOS, and Windows - Then use functions for parallel processing (e.g., `future_lapply()`) --- ## future plan(multisession) example ```r # R future multisession test (bootstrapping a GLM) library(parallelly) library(future) library(future.apply) data <- iris[iris$Species != "setosa", c(1, 5)] data$Species <- factor(data$Species) model <- function(i, samp = data) { ind <- sample(nrow(samp), nrow(samp), replace = TRUE) results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit")) coef(results) } trials <- 100000 cores <- availableCores() # Start with 1 core plan(sequential) system.time(future_lapply(1:trials, model, future.seed = TRUE)) # Then repeat with multiple cores plan(multisession, workers = cores) system.time(future_lapply(1:trials, model, future.seed = TRUE)) ``` --- ## Slurm job script for plan(multisession) example ```bash #!/bin/bash #SBATCH --account=
#SBATCH --partition=main #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=8 #SBATCH --mem=2G #SBATCH --time=5:00 module purge module load r/4.5.3 Rscript future-multisession.R ``` --- ## future plan(cluster) example ```r # R future cluster test (bootstrapping a GLM) library(parallelly) library(future) library(future.apply) data <- iris[iris$Species != "setosa", c(1, 5)] data$Species <- factor(data$Species) model <- function(i, samp = data) { ind <- sample(nrow(samp), nrow(samp), replace = TRUE) results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit")) coef(results) } trials <- 100000 cores <- availableWorkers() cl <- makeClusterPSOCK(cores, revtunnel = FALSE, rscript = c("/apps/generic/apptainer/1.4.5/bin/apptainer", "exec", Sys.getenv("R_CONTAINER"), "Rscript")) plan(cluster, workers = cl) system.time(future_lapply(1:trials, model, future.seed = TRUE)) ``` --- ## Slurm job script for plan(cluster) example ```bash #!/bin/bash #SBATCH --account=
#SBATCH --partition=main #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=4 #SBATCH --mem=2G #SBATCH --time=5:00 module purge module load r/4.5.3 Rscript future-cluster.R ``` --- ## Nested parallelism - Multiple levels of parallelism - Level 1 - iterations running in parallel - Level 2 - each iteration is parallelized in some way - May cause resource contention and degrade performance - So make sure you have enough CPU cores - Or remove nested parallelism - For example, parallel iteration of multi-threaded BLAS or `data.table` operations - 8 cores available, but running 8 tasks in parallel where each task uses 8 threads - So restrict to 1 BLAS or `data.table` thread - Or run tasks in sequence instead - Or get 64 CPU cores --- ## Job parallelism and many-task computing - Running many jobs/tasks in parallel (independent or dependent) - `rslurm`, `slurmR`, or `batchtools` for parallel processing by submitting Slurm jobs - `clustermq` or `mirai` for parallel processing of many tasks (locally or as Slurm jobs) - For defined workflows, use `targets` package with parallel processing - For running many similar jobs, use Slurm job arrays - For running many short-running jobs, use [HyperShell](https://hypershell.readthedocs.io/) --- ## Slurm job arrays - For submitting many similar Slurm jobs using a single job script - Some use cases - Varying simulation or model parameters - Running the same processing steps on different datasets - Running the same models on different datasets - Setting up a job array - Add `#SBATCH --array=
` option to job script - Each job task will use the same resource requests - Modify job script or R script to use the array index - Slurm job limits may require submitting array in chunks - [Slurm job array docs](https://slurm.schedmd.com/job_array.html) --- ## Slurm job script for job array example ```bash #!/bin/bash #SBATCH --account=
#SBATCH --partition=main #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=8 #SBATCH --mem=16G #SBATCH --time=1:00:00 #SBATCH --array=1-3 module purge module load r/4.5.3 echo "Task ID: $SLURM_ARRAY_TASK_ID" Rscript script.R ``` --- ## Slurm job array example (continued) ```r # R script to process data (job array) library(data.table) files <- list.files("data", full.names = TRUE) task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID")) file <- files[task] file data <- fread(file) summary(data) ``` --- ## Many-task computing with HyperShell - [HyperShell](https://hypershell.readthedocs.io/) can be used like a mini-scheduler within a Slurm job - Submitting lots of jobs (e.g., > 1000) negatively impacts the Slurm job scheduler - And there are limits on the number of jobs a user can submit at the same time - So pack short-running jobs (e.g., < 1 hour) into one Slurm job - Lots of serial jobs that could be run in parallel on different cores - Lots of parallel jobs that could be run sequentially or in parallel ```bash # Load modules module purge module load hypershell/2.7.2 module load r/4.5.3 ``` --- ## Interfacing to a compiled program - Primarily for package developers - If your R code is not fast enough, consider rewriting it in a compiled language - Which language to use depends on the data types, computations, etc. - R has a native interface for C and Fortran programs - Interface packages for different languages - C++ — `Rcpp` family of packages - Rust — `rextendr` - Julia — `JuliaCall` --- class: center, middle, inverse ## Section 3 ### Using memory efficiently --- ## Using memory efficiently - If R runs out of memory - Within R, error message like "cannot allocate vector" - For Slurm jobs, error message like "oom-kill event" (out-of-memory) - May just need to request more memory if available - Avoid copying data and modify-in-place when possible - Remove objects from environment when no longer needed - Store in simpler formats (e.g., use matrix instead of data frame when possible) - Store data in alternative efficient formats --- ## Avoiding object duplication - R tries not to copy objects (copy-on-modify) - Copying slows down run time and uses memory - Functions that modify objects will typically copy objects before modifying - Working with large data objects can lead to large memory use because of this - Growing objects will duplicate objects in memory - Pre-allocate objects when possible - Use modify-in-place operations when possible --- ## Object duplication example ```r # Create object a <- c(1, 2, 3, 4) tracemem(a) # Same object, different name b <- a tracemem(b) # Modify object with different name, so gets duplicated b[1] <- 5 # Modify same object in place, so no duplication b[3] <- 6 tracemem(b) ``` --- ## When does R copy objects? - Depends on how objects are modified and which functions are used - Can be difficult to predict when copies are made - Use `tracemem()` and memory profiling to collect data --- ## Memory-efficient data formats - Alternative data formats can reduce memory use - Smaller object size - Faster data I/O and data operations - For on-disk formats, performance limited by read/write speeds - Packages for data frames - `data.table` for modify-in-place operations - `duckdb` for on-disk database format - `arrow` for on-disk binary format (columnar format) - `fst` for on-disk binary format (efficient reading and subsetting) - Packages for other data structures - `bigmemory` for big matrices - `bigstatsr` for on-disk big matrices - `pbdDMAT` for big matrices in MPI jobs - `ncdf4` or `RNetCDF` for NetCDF files (arrays) - `pbdNCDF4` for NetCDF files (arrays) in MPI jobs - `hdf5r` for HDF5 files --- class: center, middle, inverse ## Section 4 ### Processing data efficiently --- ## Fast data input/output (I/O) - Base R functions for I/O are relatively slow - Packages for faster I/O (based on format) - `data.table` for tabular data - `vroom` for tabular data - `readr` for tabular data - `duckdb` for on-disk database format - `arrow` for binary Arrow files - `fst` for binary fst files (data frames) - `qs` for binary qs files - Minimize I/O when possible to improve performance - On CARC HPC clusters, `/project2` and `/scratch1` are high-performance, parallel I/O file systems --- ## Fast data processing - Base R and `tidyverse` packages are relatively slow - Packages for faster data processing - `data.table` in general - `dtplyr` for drop-in `dplyr` substitute (`data.table` backend) - `tidytable` for tidy data manipulation (`data.table` backend) - `tidyfst` for tidy data manipulation (`data.table` backend) - `multidplyr` for big data `dplyr` substitute (> 10M obs) - `bigstatsr` for big matrices (larger than memory) - `collapse` for data transformation and exploration --- class: center, middle, inverse ## Section 5 ### Profiling and benchmarking --- ## Profiling and benchmarking - Aim for code that is *fast enough* - Programmer time is more expensive than compute time - Basic profiling workflow: 1. Profile code to understand run time and memory use 2. Identify bottlenecks (i.e., parts of code that take the most time) 3. Try to improve performance of bottlenecks by modifying code 4. Benchmark alternative code to identify best alternative --- ## Profiling R code - Base R: `Rprof()`, `summaryRprof()`, and `Rprofmem()` - `profvis` for RStudio (uses `Rprof()` output file) - `proftools` (more features, better interface) - `profmem` for memory profiling - `pbdPROF`, `pbdPAPI`, and `hpcvis` for MPI and low-level profiling - Note that C/C++/Fortran code is not profiled (except with `pbdPAPI`) - Profiling output can be difficult to interpret - Use [CARC OnDemand](https://ondemand.carc.usc.edu) apps like RStudio Server to view graphics --- ## Profiling using proftools - Use `profileExpr()` to profile R code or script - Profiles line-by-line and saves output - Then use other functions to summarize output - `srcSummary()` - `flatProfile()` - `hotPaths()` --- ## proftools example ```r library(proftools) library(boot) srcfile <- system.file("samples", "bootlmEx.R", package = "proftools") system(paste("cat", srcfile)) pd <- profileExpr(source(srcfile)) srcSummary(pd) ``` --- ## Benchmarking R code - Base R: `system.time()` - `bench` for more features - `microbenchmark` for short-running code --- ## bench::mark() example ```r data <- data.frame(x = rexp(100000000), y = rexp(100000000)) bench::mark( data[data$x > 1, ], data[which(data$x > 1), ], subset(data, x > 1), iterations = 5 ) ``` --- class: center, middle, inverse ## Section 6 ### Resources and support --- ## Additional resources - [R Project](https://www.r-project.org) - [R Manuals](https://cran.r-project.org/manuals.html) - [R package documentation](https://rdrr.io) - [R cheatsheets](https://posit.co/resources/cheatsheets/) - [CRAN Task View on High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html) - [Fastverse](https://fastverse.github.io/fastverse/) - [Futureverse](https://www.futureverse.org) - [Programming with Big Data in R](https://pbdr.org) - [HPCRAN](https://hpcran.org) - [rOpenSci](https://ropensci.org) - [Bioconductor](https://www.bioconductor.org) --- ## Web books - [The R Inferno](https://www.burns-stat.com/documents/books/the-r-inferno/) - [Advanced R](https://adv-r.hadley.nz/) - [Efficient R Programming](https://csgillespie.github.io/efficientR/) --- ## CARC support - [Workshop materials](https://github.com/uschpc/workshop-r-hpc) - [Submit a support ticket](https://www.carc.usc.edu/user-support/submit-ticket) - Office Hours - Every Tuesday 2:30-5pm - Get Zoom link [here](https://www.carc.usc.edu/user-support/office-hours-and-consultations)