class: center, top, title-slide

# R Programming for HPC
### Center for Advanced Research Computing
University of Southern California
#### Last updated on 2025-08-14

---

## Outline

1. HPC overview
2. Parallel programming
3. Vectorizing code
4. Using memory efficiently
5. Processing data efficiently
6. Profiling and benchmarking
7. Resources and support

---

class: center, middle, inverse

## Section 1

### HPC overview

---

## What is HPC?

- High-performance computing
- Relative to laptop and desktop computers
  - More processors (cluster of compute nodes)
  - More memory (shared and distributed memory)
  - High-capacity, fast-access storage
  - High-speed, high-bandwidth networks

---

## What does HPC enable?

- **Scaling** — using more compute resources (CPUs, memory, GPUs, etc.)
- **Speedup** — faster run times
- **Throughput** — running many jobs at the same time
- **Duration** — running jobs for days or weeks

---

## Computing constraints

- A compute node (individual computer) has four main components
  - Central Processing Unit (CPU with multiple cores/logical CPUs)
  - Random Access Memory (RAM)
  - Storage drives (direct-attached and/or network-attached storage)
  - Input/Output (I/O) buses
- Four types of constraints (not exclusive)
  - CPU cores = CPU-bound jobs
  - Memory = memory-bound jobs
  - I/O speed = I/O-bound jobs
  - Network speed = network-bound jobs

---

## Compute node specs

- HPC clusters may have a mix of different compute nodes
  - Different types of CPUs (e.g., Intel vs. AMD)
  - Varying numbers of **cores per node** (time)
  - Varying amounts of **memory per node** (space)
  - Some nodes may also have GPUs
- On CARC clusters
  - Use the `noderes` command to display available node resources
  - 1 logical CPU = 1 core = 1 thread (`--cpus-per-task`)

---

## CPU specs

- Different CPU models have different capabilities and features
  - Instruction set architectures (e.g., x86-64, ARM64, etc.)
  - Microarchitecture designs (e.g., AMD Zen 3, Intel Cascade Lake, etc.)
  - Vector extensions (e.g., AVX, AVX2, AVX-512, etc.)
  - Clock speed (aka frequency)
  - Local memory cache size
- Simply running jobs on a better CPU will typically reduce run time
- Use `lscpu` to display CPU specs on a node

---

## Requesting vs. using HPC resources

- For Slurm jobs, requesting resources does not mean that the job actually uses them
- Likely need to modify your R code to make use of multiple cores or nodes
  - There are different ways to do this
- Requesting more cores/nodes does not necessarily lead to speedups
  - There is an optimal number of cores depending on the computations

---

## Limitations of R

- R is an interpreted language
  - Can be slow relative to compiled languages (C/C++/Fortran)
  - But easier to program
- R uses a single core by default
- R stores objects in memory by default
- Maximum length of a vector is 2^31 - 1 (2,147,483,647) (32-bit integer)
- Some of these limitations can be overcome

---

## Software optimization

- R itself is mostly written in compiled languages (C/C++/Fortran)
- Many R functions and packages are actually interfaces to compiled programs
- Compiled programs can be optimized based on CPU architectures
- Compiler optimization flags can be used to improve performance of R and some R packages
  - Primarily for installing packages on Linux from source
  - Specify optimization flags in `~/.R/Makevars` (sketch on the next slide)
  - If targeting a specific CPU type, the program will only run on that CPU type
  - Examples: `data.table`, `RStan`
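---

## Makevars example

A minimal sketch of compiler flags in `~/.R/Makevars`, assuming GCC on Linux; the specific flags are illustrative defaults, not CARC-specific settings, so adjust them for your compiler and target CPU.

```make
# ~/.R/Makevars
# Flags applied when R builds packages from source
CFLAGS = -O3 -march=native
CXXFLAGS = -O3 -march=native
FFLAGS = -O3 -march=native
FCFLAGS = -O3 -march=native
```

Note that `-march=native` targets the CPU of the node where the package is built, so the resulting binaries may not run on nodes with other CPU types.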
---

## General recommendations to improve performance of R code

- Code first, optimize later (if needed)
- Simplify when possible (do less)
- Use existing optimized solutions
- Consult package and function documentation
- Parallelize when appropriate
- Use vectorized functions
- Modify-in-place (avoid duplicating data in memory)
- Profile code to identify bottlenecks

---

## Improving performance of serial (single-core) jobs

- Use optimized software (e.g., OpenBLAS, data.table, etc.)
- Use vectorized functions
- Modify-in-place (avoid duplicating data in memory)
- Use a better CPU
  - Higher frequency
  - Larger cache sizes

---

class: center, middle, inverse

## Section 2

### Parallel programming

---

## Parallel programming

- Simultaneous execution of different parts of a larger computation
  - Data vs. task parallelism
  - Tightly coupled (interdependent) vs. loosely coupled (independent) computations
- Using one (multicore) compute node is easier than using multiple nodes
- **Scaling** computation to more processors (cores)
- Focus on **speedup** (decrease in run time)
---

## Costs of parallelizing

- Some computations are not worth parallelizing
- Some costs to parallelizing (overhead)
  - Changing code
  - Spawning child processes
  - Copying data and environment
  - Communications
  - Load balancing
- Speedup not proportional to number of cores (Amdahl's law)
- Optimal number of cores
  - Depends on specific computations
  - Experiment to find

---

## Easy parallelism

- Just specify number of cores or threads for function to use
  - Parallel programming details are abstracted away
  - Typically limited to one compute node (so maximize cores)
  - Less than or equal to physical cores or `$SLURM_CPUS_PER_TASK`
- Multi-threaded BLAS/LAPACK library (instead of the default reference library)
  - BLAS = Basic Linear Algebra Subprograms
  - LAPACK = Linear Algebra Package
  - For example, OpenBLAS, Intel MKL, or AMD BLIS
- Packages with multi-threaded programs
  - Typically programs written in C/C++ and using OpenMP for multi-threading
  - For example, `data.table` is multi-threaded via OpenMP

---

## Multi-threaded OpenBLAS

- On CARC HPC clusters, R modules use multi-threaded OpenBLAS (via OpenMP)
  - Also built with dynamic CPU architecture optimizations
- Set number of threads to use with environment variable `OMP_NUM_THREADS`
- Or use the `openblasctl` package
  - `install.packages("openblasctl", repos = "https://hpcran.org")`
  - Set threads dynamically using `openblasctl::openblas_set_num_threads()`

---

## OpenBLAS example

```r
library(openblasctl)

mat <- matrix(rexp(4000000), ncol = 2000)

openblas_set_num_threads(1)
system.time(eigen(mat))
system.time(svd(mat))

openblas_set_num_threads(4)
system.time(eigen(mat))
system.time(svd(mat))
```

---

## data.table example

```r
library(data.table)

getDTthreads()
setDTthreads(16)

threads <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
setDTthreads(threads)
```

---

## Parallel iteration

- Repeat operations with different inputs and process in parallel
- Various use cases
  - Running same data processing workflow on many data files
  - Running same statistical model on many datasets
  - Running many alternative models on same dataset
  - Processing and analyzing data larger than memory available on single node
- Set up cluster of cores or nodes to parallelize over
  - Use high-level functions to do this (e.g., `mclapply()`)
  - Easier to set up with single node compared to multiple nodes
- Nested parallelism can cause resource contention
  - So set BLAS and OpenMP threads to 1 in these cases

---

## Base R parallel package

- `library(parallel)`
- Various functions for parallel computing
  - `mclapply()`
  - `mcmapply()`
  - `pvec()`
  - `makeCluster()`
  - `parApply()`
  - `parLapply()`
  - `parSapply()`
  - `mcparallel()`
  - `mccollect()`

---

## Using mclapply()

- Parallel version of `lapply()` using forking
  - Works on Linux or macOS
  - Will only use 1 core on Windows
  - Forking is faster than creating socket clusters
- Use on a single node with multiple cores
- For loosely coupled (independent) tasks (no communication needed between tasks)
- Apply same function to multiple inputs simultaneously using multiple cores

---

## mclapply() example

```r
# R multicore test (bootstrapping a GLM)
library(parallel)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- mclapply(1:trials, model, mc.cores = cores)
```
---

## Slurm job script for mclapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript multicore.R
```
---

## Parallel with the futureverse

- A unified parallelization framework for R
  - Abstracts away the technical aspects of parallel processing
  - Packages include `future`, `future.apply`, `furrr`, etc.
- Set up parallel resources with `plan()`
  - Single node: `plan(multicore)`
  - Multiple nodes: `plan(cluster)`
  - Does not scale well beyond a few nodes
- Various functions for parallel computing
  - `future_lapply()`
  - `future_map()`
- [Futureverse website](https://www.futureverse.org/)

---

## future plan(multicore) example

```r
# R future multicore test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

plan(multicore, workers = cores)

coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

---

## Job script for future plan(multicore) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-multicore.R
```
---

## future plan(cluster) example

```r
# R future cluster test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

cl <- makeClusterPSOCK(availableWorkers(), revtunnel = FALSE)
plan(cluster, workers = cl)

coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

---

## Job script for future plan(cluster) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-cluster.R
```
---

## Parallel with pbdMPI

- `pbdMPI` package provides an interface to an MPI library
  - Message Passing Interface (MPI)
  - Optimized for multi-node communications
  - On CARC HPC clusters, R modules use OpenMPI
- Provides best performance when scaling to many nodes
  - For using many CPU cores across multiple nodes
  - Or using few CPU cores but lots of memory on multiple nodes
- Main function to use is `pbdLapply()`

---

## pbdLapply() example

```r
# R pbdMPI multinode test (bootstrapping a GLM)
library(pbdMPI)

init()

trials <- 400000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- pbdLapply(1:trials, model, pbd.mode = "spmd")

comm.print(coefs[[1]])

finalize()
```

---

## Job script for pbdLapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=1:00:00

module purge
module load rstats/4.5.1

srun Rscript script.R
```
---

## Other relevant packages

- `foreach` for parallel for loops (see the example on the next slide)
- `BiocParallel` for Bioconductor objects
- `rslurm` or `slurmR` for submitting Slurm jobs from within R
- `targets` for defining workflows and parallel execution of steps
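---

## foreach example

A minimal sketch of a parallel `foreach` loop using the `doParallel` backend on a single node; the bootstrap-style loop body mirrors the earlier `mclapply()` example and is illustrative only.

```r
# Parallel for loop with foreach + doParallel (single node)
library(foreach)
library(doParallel)

# Register as many workers as Slurm allocated to the job
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
registerDoParallel(cores = cores)

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

# Each iteration runs on a worker; rbind combines the returned coefficients
coefs <- foreach(i = 1:1000, .combine = rbind) %dopar% {
  ind <- sample(nrow(data), nrow(data), replace = TRUE)
  coef(glm(data[ind, 2] ~ data[ind, 1], family = binomial(link = "logit")))
}
```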
---

## Slurm job arrays

- For submitting and managing collections of similar jobs quickly and easily
- Some use cases
  - Varying simulation or model parameters
  - Running the same processing steps on different datasets
  - Running the same models on different datasets
- Setting up a job array
  - Add `#SBATCH --array=` option to job script
  - Each job task will use the same resource requests
  - Modify job script or R script to use the array index
- [Slurm job array docs](https://slurm.schedmd.com/job_array.html)

---

## Slurm job array example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-3

module purge
module load rstats/4.5.1

echo "Task ID: $SLURM_ARRAY_TASK_ID"

Rscript script.R
```
---

## Slurm job array example (continued)

```r
# R script to process data (job array)
library(data.table)

files <- list.files("data", full.names = TRUE)

task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
file <- files[task]
file

data <- fread(file)

summary(data)
```

---

## Many-task computing

- Lots of short-running jobs (< 15 minutes)
- Lots of serial jobs that could be run in parallel on different cores
- Lots of parallel jobs that could be run sequentially or in parallel
- Submitting lots of jobs (> 1000) negatively impacts the job scheduler
  - Pack short-running jobs into one job
  - Use a program like [HyperShell](https://hypershell.readthedocs.io/)

---

## Using GPUs

- Few well-maintained R packages for general-purpose GPU computing
  - May not be worth using GPUs (compared to multi-threaded BLAS)
- Mostly useful for machine learning packages
  - `torch`
  - `keras`
  - `tensorflow`

---

## Interfacing to a compiled language

- If your R code is still not fast enough, consider rewriting it in a compiled language
- R has a native interface for C and Fortran programs
- Use the `Rcpp` family of packages as an interface for C++ programs (sketch on the next slide)
- Could also interface to Julia via `JuliaCall`
- Which language to use depends on the data types, computations, etc.
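---

## Rcpp example

A minimal sketch of calling C++ from R with `Rcpp::cppFunction()`, which compiles a C++ function and makes it callable from the R session; the sum-of-squares function here is illustrative only, since `sum(vec^2)` is already vectorized in R.

```r
library(Rcpp)

# Compile a small C++ function and expose it to the R session
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) {
      total += x[i] * x[i];
    }
    return total;
  }
')

vec <- rexp(1000000)
sumSquares(vec)

# Equivalent vectorized R code for comparison
sum(vec^2)
```

For larger projects, `Rcpp::sourceCpp()` on a standalone `.cpp` file (or a package with compiled code) is the more typical workflow.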
---

class: center, middle, inverse

## Section 3

### Vectorizing code

---

## Vectorizing code

- R is designed around vectors (objects stored in column-major order)
- Vectorize code when possible to improve performance
  - Think in terms of vectors (or columns), not scalars
  - Perform operations on vectors, not individual elements
- Use vectorized functions that already exist
  - These functions are typically loops implemented in C/C++/Fortran
  - Examples: arithmetic and matrix operators, `colMeans()`, `ifelse()`
- Vectorized R code is also easier to read and write

---

## Vectorizing example

```r
vec <- rexp(200000000)

system.time({
  test <- numeric(length(vec))
  for (i in seq_along(vec)) {
    test[i] <- vec[i] * 3
  }
})

system.time(test2 <- vec * 3)
```

---

class: center, middle, inverse

## Section 4

### Using memory efficiently

---

## Using memory efficiently

- If R runs out of memory
  - Within R, error message like "cannot allocate vector"
  - For Slurm jobs, error message like "oom-kill event" (out-of-memory)
  - May just need to request more memory if available
- Avoid copying data and modify-in-place when possible
- Remove objects from environment when no longer needed
- Store in simpler formats (e.g., use matrix instead of data frame when possible)
- Store data in alternative efficient formats

---

## Avoiding object duplication

- R tries not to copy objects (copy-on-modify)
  - Copying slows down run time and uses memory
- Functions that modify objects will typically copy objects before modifying
  - Working with large data objects can lead to large memory use because of this
- Growing objects will duplicate objects in memory
  - Pre-allocate objects when possible
- Use modify-in-place operations when possible

---

## Object duplication example

```r
# Create object
a <- c(1, 2, 3, 4)
tracemem(a)

# Same object, different name
b <- a
tracemem(b)

# Modify object with different name, so it gets duplicated
b[1] <- 5

# Modify same object in place, so no duplication
b[3] <- 6
tracemem(b)
```

---

## When does R copy objects?

- Depends on how objects are modified and which functions are used
- Can be difficult to predict when copies are made
- Use `tracemem()` and memory profiling to collect data

---

## Memory-efficient data formats

- Alternative data formats can reduce memory use
  - Smaller object size
  - Faster data I/O and data operations
  - For on-disk formats, performance limited by read/write speeds
- For data frames (see the sketch on the next slide)
  - `data.table` package for modify-in-place operations
  - `arrow` package for on-disk binary format (columnar format)
  - `fst` package for on-disk binary format (efficient reading and subsetting)
- For other data structures
  - `bigmemory` for big matrices
  - `bigstatsr` for on-disk big matrices
  - `pbdDMAT` for big matrices in MPI jobs
  - `ncdf4` or `RNetCDF` for NetCDF files (arrays)
  - `pbdNCDF4` for NetCDF files (arrays) in MPI jobs
  - `hdf5r` for HDF5 files
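---

## data.table and fst example

A minimal sketch of the modify-in-place and on-disk approaches listed above, assuming a data frame that fits in memory; the column names and file path are illustrative only.

```r
library(data.table)
library(fst)

df <- data.frame(x = rexp(1e7), y = rexp(1e7))

# Convert to a data.table by reference (no copy of the data)
setDT(df)

# Add and modify columns in place with := (no copy of the full table)
df[, ratio := x / y]
df[x > 1, flag := TRUE]

# Write to and read from an on-disk binary format
write_fst(df, "df.fst")
df2 <- read_fst("df.fst", columns = c("x", "ratio"), as.data.table = TRUE)
```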
---

class: center, middle, inverse

## Section 5

### Processing data efficiently

---

## Fast data input/output (I/O)

- Base R functions for I/O are relatively slow
- For faster I/O (based on format)
  - `data.table` for tabular data
  - `vroom` for tabular data
  - `readr` for tabular data
  - `arrow` for binary Arrow files
  - `fst` for binary fst files (data frames)
  - `qs` for binary qs files
- Minimize I/O when possible to improve performance
- On CARC HPC clusters, `/project2` and `/scratch1` are high-performance, parallel I/O file systems

---

## Fast data processing

- Base R and `tidyverse` packages are relatively slow
- For faster data processing
  - `data.table` in general
  - `dtplyr` for drop-in `dplyr` substitute (`data.table` backend)
  - `tidytable` for tidy data manipulation (`data.table` backend)
  - `tidyfst` for tidy data manipulation (`data.table` backend)
  - `multidplyr` for big data `dplyr` substitute (> 10M obs)
  - `bigstatsr` for big matrices (larger than memory)
  - `collapse` for data transformation and exploration

---

class: center, middle, inverse

## Section 6

### Profiling and benchmarking

---

## Profiling and benchmarking

- Aim for code that is *fast enough*
  - Programmer time is more expensive than compute time
- Basic profiling workflow:
  1. Profile code to understand run time and memory use
  2. Identify bottlenecks (i.e., parts of code that take the most time)
  3. Try to improve performance of bottlenecks by modifying code
  4. Benchmark alternative code to identify best alternative

---

## Profiling R code

- Base R: `Rprof()`, `summaryRprof()`, and `Rprofmem()`
- `profvis` for RStudio (uses `Rprof()` output file)
- `proftools` (more features, better interface)
- `profmem` for memory profiling
- `pbdPROF`, `pbdPAPI`, and `hpcvis` for MPI and low-level profiling
- Profiling output can be difficult to interpret
- Note that C/C++/Fortran code is not profiled (except with `pbdPAPI`)
- Use [CARC OnDemand](https://ondemand.carc.usc.edu) apps like RStudio Server to view graphics

---

## Profiling using proftools

- Use `profileExpr()` to profile R code or script
  - Profiles line-by-line and saves output
- Then use other functions to summarize output
  - `srcSummary()`
  - `flatProfile()`
  - `hotPaths()`

---

## proftools example

```r
library(proftools)

srcfile <- system.file("samples", "bootlmEx.R", package = "proftools")
system(paste("cat", srcfile))

pd <- profileExpr(source(srcfile))

srcSummary(pd)
```

---

## Benchmarking R code

- Base R: `system.time()`
- `bench` (more features)
- `microbenchmark` for short-running code
- `benchmarkme` for benchmarking hardware

---

## system.time() examples

```r
data <- data.frame(x = rexp(200000000), y = rexp(200000000))

system.time(data[data$x > 1, ])
system.time(data[which(data$x > 1), ])
system.time(subset(data, x > 1))

mat <- matrix(rexp(200000000), ncol = 800000)
data <- as.data.frame(mat)

system.time(for (i in seq_along(data)) mean(data[[i]]))
system.time(apply(data, 2, FUN = mean))
system.time(sapply(data, mean))
system.time(colMeans(data))
system.time(colMeans(mat))
```

---

class: center, middle, inverse

## Section 7

### Resources and support

---

## Additional resources

- [R Project](https://www.r-project.org)
- [R Manuals](https://cran.r-project.org/manuals.html)
- [R package documentation](https://rdrr.io)
- [CRAN Task View on High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
- [HPCRAN](https://hpcran.org/)
- [Programming with Big Data in R](https://pbdr.org/)
- [Futureverse](https://www.futureverse.org/)
- [Fastverse](https://fastverse.github.io/fastverse/)
- [rOpenSci](https://ropensci.org/)
- [Bioconductor](https://www.bioconductor.org/)

---

## Web books

- [The R Inferno](https://www.burns-stat.com/documents/books/the-r-inferno/)
- [Advanced R](https://adv-r.hadley.nz/)
- [Efficient R Programming](https://csgillespie.github.io/efficientR/)

---

## CARC support

- [Workshop materials](https://github.com/uschpc/workshop-r-hpc)
- [Submit a support ticket](https://www.carc.usc.edu/user-support/submit-ticket)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Get Zoom link [here](https://www.carc.usc.edu/user-support/office-hours-and-consultations)