class: center, top, title-slide

# R Programming for HPC

### Center for Advanced Research Computing
University of Southern California
#### Last updated on 2025-10-09

---

## Outline

1. HPC overview
2. Parallel programming
3. Using memory efficiently
4. Processing data efficiently
5. Profiling and benchmarking
6. Resources and support

---

class: center, middle, inverse

## Section 1

### HPC overview

---

## What is HPC?

- High-performance computing
- Relative to laptop and desktop computers
- More/better compute resources
  - More processors (cluster of compute nodes)
  - More memory (shared and distributed memory)
  - High-capacity, fast-access storage
  - High-speed, high-bandwidth networks

---

## What does HPC enable?

- **Scaling** — using more compute resources (CPUs, memory, GPUs, etc.)
- **Speedup** — faster run times
- **Throughput** — running many jobs at the same time
- **Duration** — running jobs for days or weeks

---

## Computing constraints

- A compute node (individual computer) has four main components
  - Central Processing Unit (CPU with multiple cores/logical CPUs)
  - Random Access Memory (RAM)
  - Storage drives (direct-attached and/or network-attached storage)
  - Input/Output (I/O) buses
- Four types of constraints (not exclusive)
  - CPU cores = CPU-bound jobs
  - Memory = Memory-bound jobs
  - I/O speed = I/O-bound jobs
  - Network speed = Network-bound jobs

---

## Compute node specs

- HPC clusters may have a mix of different compute nodes
  - Different types of CPUs (e.g., Intel vs. AMD)
  - Varying numbers of **cores per node** (time)
  - Varying amounts of **memory per node** (space)
  - Different memory bandwidth rates
  - Some nodes may also have GPUs
- On CARC clusters
  - Use `noderes` command to display available node resources
  - 1 logical CPU = 1 core = 1 thread (`--cpus-per-task`)

---

## CPU specs

- Different CPU models have different capabilities and features
  - Instruction set architectures (e.g., x86-64, ARM64, etc.)
  - Microarchitecture designs (e.g., AMD Zen 3, Intel Cascade Lake, etc.)
  - Vector extensions (e.g., AVX, AVX2, AVX-512, etc.)
  - Clock speed (aka frequency)
  - Local memory cache size
- Simply running jobs on a better CPU will typically reduce run time
- Use `lscpu` to display CPU specs on a node

---

## Requesting vs. using HPC resources

- For Slurm jobs, requesting resources does not mean that the job actually uses them (see the sketch on the next slide)
- Likely need to modify your R code to make use of multiple cores or nodes
  - There are different ways to do this
- Requesting more cores/nodes does not necessarily lead to speedups
- There is an optimal number of cores depending on the computations

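---

## Checking usable cores (sketch)

A minimal sketch of the difference between the node's hardware and the job's allocation; it assumes the `parallelly` package (used in later examples) is installed.

```r
# parallel::detectCores() reports all logical CPUs on the node,
# while parallelly::availableCores() respects the Slurm allocation
# (e.g., SLURM_CPUS_PER_TASK), which is what the job can actually use
library(parallelly)

parallel::detectCores()
availableCores()
```
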
---

## Limitations of R

- R is an interpreted language
  - Can be slow relative to compiled languages (C/C++/Fortran)
  - But easier to program
- R uses a single core by default
- R stores objects in memory by default
- Maximum length of a vector is 2^31 - 1 (2,147,483,647) (32-bit integer)
- Some of these limitations can be overcome

---

## Software optimization

- R itself is mostly written in compiled languages (C/C++/Fortran)
- Many R functions and packages are actually interfaces to compiled programs
- Compiled programs can be optimized based on CPU architectures
- Compiler optimization flags can be used to improve performance of R and some R packages
  - Primarily for installing packages on Linux from source
  - Specify optimization flags in `~/.R/Makevars` (see the sketch on the next slide)
  - If targeting a specific CPU type, the program will only run on that CPU type
  - Examples: `data.table`, `RStan`

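---

## Setting compiler flags (sketch)

A minimal sketch of creating `~/.R/Makevars` from R; the flags shown are common GCC options, not a CARC recommendation, and `-march=native` ties compiled packages to the CPU type of the build node.

```r
# Write optimization flags to ~/.R/Makevars so that packages
# installed from source are compiled with them
# Note: this overwrites any existing ~/.R/Makevars
dir.create("~/.R", showWarnings = FALSE)
writeLines(c(
  "CFLAGS = -O3 -march=native",
  "CXXFLAGS = -O3 -march=native",
  "FFLAGS = -O3 -march=native"
), con = "~/.R/Makevars")
```
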
---

## Some recommendations for performance

- Use a better CPU
- Use existing optimized solutions
- Code first, optimize later (if needed)
- Simplify when possible (do less)
- Parallelize when appropriate
- Modify-in-place (avoid duplicating data in memory)
- Profile code to identify bottlenecks
- Consult package and function documentation

---

class: center, middle, inverse

## Section 2

### Parallel programming

---

## Parallel processing and programming

- Parallel processing is the simultaneous execution of different parts of a larger computation
- Parallel programming implements parallel processing with code
- Different kinds of parallel processing
  - Data vs. task parallelism
  - Tightly coupled (interdependent) vs. loosely coupled (independent) computations
- Using one (multicore) compute node is easier than using multiple nodes
- **Scaling** computation to more processors (cores)
- Focus on **speedup** (decrease in run time)

---

## Costs of parallelizing

- Some computations are not worth parallelizing
- Some costs to parallelizing (overhead)
  - Changing code
  - Spawning child processes
  - Copying data and environment
  - Communications
  - Load balancing
- Speedup is not proportional to the number of cores (Amdahl's law)
- Optimal number of cores
  - Depends on specific computations
  - Experiment to find

---

## Easy parallelism

- Simply specify number of cores or threads for function to use
- Parallel programming details are abstracted away
- May be limited to one compute node (so maximize cores)
- Set less than or equal to physical cores or `$SLURM_CPUS_PER_TASK`
- Various examples
  - Multi-threading linear algebra library
  - Multi-threading `data.table` (via OpenMP)
  - Multi-processing `mclapply()`
  - Multi-processing `BiocParallel`

---

## Multi-threaded BLAS/LAPACK library

- R uses external library for linear algebra operations
  - BLAS = Basic Linear Algebra Subprograms
  - LAPACK = Linear Algebra Package
- On CARC clusters, the R modules use multi-threaded OpenBLAS (via OpenMP)
  - Default setting is one thread
  - Also built with dynamic CPU architecture optimizations
- Set number of threads to use with environment variable `OMP_NUM_THREADS`

```bash
# Allocate resources on compute node
salloc -c 8

# Load R, set threads, then start R
module purge
module load rstats/4.5.1
export OMP_NUM_THREADS=8
R
```

---

## OpenBLAS threads example

Compare time for operations with varying threads and data sizes

```r
# Smaller data
mat <- matrix(rexp(1000000), ncol = 1000)
mat2 <- matrix(rexp(1000000), ncol = 1000)

system.time(mat %*% mat2)
system.time(eigen(mat))
system.time(svd(mat))

# Larger data
mat <- matrix(rexp(25000000), ncol = 5000)
mat2 <- matrix(rexp(25000000), ncol = 5000)

system.time(mat %*% mat2)
system.time(eigen(mat))
system.time(svd(mat))
```

---

## Guidance for OpenBLAS threads

- The default of one thread is fine in many cases
- If working with large data sets, consider using multiple threads
  - Could save a lot of time if doing many BLAS operations
- If doing parallel iteration with BLAS operations, use one thread (the default) or restrict iterations
  - Otherwise, resource contention can degrade performance
  - For example, 8 cores total but running 8 parallel iterations trying to use 8 threads each
  - So restrict to 1 thread or get 64 cores in that case

---

## data.table threads example

Compare time to write files with varying threads

```r
library(data.table)
getDTthreads()

mat <- matrix(rexp(100000000), ncol = 10000)
system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat.csv")))

threads <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
setDTthreads(threads)
system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat2.csv")))
```

---

## Parallel iteration

- Repeat operations with different inputs and process in parallel
- Various use cases
  - Running same data processing workflow on many data files
  - Running same statistical model on many datasets
  - Running many alternative models on same dataset
  - Processing and analyzing data larger than memory available on single node
- Set up cluster of cores or nodes to parallelize over
  - Use high-level functions to do this (e.g., `mclapply()`)
  - Easier to set up with single node compared to multiple nodes
- Nested parallelism can cause resource contention
  - So set BLAS and OpenMP threads to 1 in these cases (the default); see the sketch on the next slide

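---

## Restricting BLAS/OpenMP threads (sketch)

A minimal sketch of keeping implicit threading at one thread during parallel iteration; it assumes the `RhpcBLASctl` package is installed (exporting `OMP_NUM_THREADS=1` in the job script works just as well).

```r
# Avoid nested parallelism: one BLAS/OpenMP thread per parallel worker
library(parallel)
library(RhpcBLASctl)

blas_set_num_threads(1)
omp_set_num_threads(1)

cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
results <- mclapply(1:100, function(i) {
  m <- crossprod(matrix(rexp(2500), ncol = 50))
  sum(solve(m))
}, mc.cores = cores)
```
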
---

## Base R parallel package

- `library(parallel)`
- Various functions for parallel computing
  - `mclapply()`
  - `mcmapply()`
  - `pvec()`
  - `makeCluster()`
  - `parApply()`
  - `parLapply()`
  - `parSapply()`
  - `mcparallel()`
  - `mccollect()`

---

## Using mclapply()

- Parallel version of `lapply()` using forking
  - Works on Linux or macOS
  - Will only use 1 core on Windows
  - Forking is faster than making socket clusters
- Use on a single node with multiple cores
- For loosely coupled (independent) tasks (no communication needed between tasks)
- Apply same function to multiple inputs simultaneously using multiple cores

---

## mclapply() example

```r
# R multicore test (bootstrapping a GLM)
library(parallel)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- mclapply(1:trials, model, mc.cores = cores)
```

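---

## Reproducible RNG with mclapply() (sketch)

For reproducible bootstrap results across forked workers, select a parallel-safe random number generator before calling `mclapply()`; a minimal sketch (the seed value is arbitrary).

```r
# The L'Ecuyer-CMRG generator gives each forked worker its own
# independent random-number stream, so results are reproducible
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(42)
draws <- mclapply(1:4, function(i) rnorm(3), mc.cores = 2)
```
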
---

## Slurm job script for mclapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript multicore.R
```

---

## Parallel with the futureverse

- A unified parallelization framework for R
- Abstracts away the technical aspects of parallel processing
- Packages include `future`, `future.apply`, `furrr`, etc.
- Set up parallel resources with `plan()`
  - Single node: `plan(multicore)`
  - Multiple nodes: `plan(cluster)`
  - Does not scale well beyond a few nodes
- Various functions for parallel computing
  - `future_lapply()`
  - `future_map()`
- [Futureverse website](https://www.futureverse.org/)

---

## future plan(multicore) example

```r
# R future multicore test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

plan(multicore, workers = cores)
coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

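---

## plan(multisession) alternative (sketch)

`plan(multicore)` relies on forking, which is unavailable on Windows and disabled in RStudio; `plan(multisession)` is the portable single-node alternative. A minimal sketch, changing only the plan.

```r
# multisession launches separate background R sessions instead of
# forked processes, so it works in any environment on a single node
library(future)
library(future.apply)

plan(multisession, workers = 4)
res <- future_lapply(1:8, function(i) i^2)
plan(sequential)  # shut down the workers
```
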
---

## Job script for future plan(multicore) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-multicore.R
```

---

## future plan(cluster) example

```r
# R future cluster test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

cl <- makeClusterPSOCK(availableWorkers(), revtunnel = FALSE)
plan(cluster, workers = cl)
coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

---

## Job script for future plan(cluster) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-cluster.R
```

---

## Parallel with pbdMPI

- `pbdMPI` package provides an interface to an MPI library
  - Message Passing Interface (MPI)
  - Optimized for multi-node communications
  - On CARC HPC clusters, R modules use OpenMPI
- Provides best performance when scaling to many nodes
- For using many CPU cores across multiple nodes
  - Or using few CPU cores but lots of memory on multiple nodes
- Main function to use is `pbdLapply()`

---

## pbdLapply() example

```r
# R pbdMPI multinode test (bootstrapping a GLM)
library(pbdMPI)

init()

trials <- 400000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- pbdLapply(1:trials, model, pbd.mode = "spmd")
comm.print(coefs[[1]])

finalize()
```

---

## Job script for pbdLapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=1:00:00

module purge
module load rstats/4.5.1

srun Rscript script.R
```

---

## Other relevant packages

- `foreach` for parallel for loops (see the sketch on the next slide)
- `BiocParallel` for parallel processing with Bioconductor objects
- `Rmpi` for parallel processing using MPI

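---

## foreach example (sketch)

A minimal `foreach` sketch with a multicore backend; it assumes the `doParallel` package is installed (not otherwise covered in this workshop).

```r
# Parallel for loop: registerDoParallel() sets up the backend
# that %dopar% dispatches iterations to
library(foreach)
library(doParallel)

cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
registerDoParallel(cores = cores)

res <- foreach(i = 1:100, .combine = rbind) %dopar% {
  c(i = i, sq = i^2)
}

stopImplicitCluster()
```
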
---

## Job parallelism and many-task computing

- Running many jobs/tasks in parallel (independent or dependent)
- `rslurm`, `slurmR`, or `batchtools` for parallel processing by submitting Slurm jobs
- `clustermq` or `mirai` for parallel processing of many tasks (locally or as Slurm jobs)
- For defined workflows, use `targets` package with parallel processing
- For running many similar jobs, use Slurm job arrays
- For running many short-running jobs, use [HyperShell](https://hypershell.readthedocs.io/)

---

## Slurm job arrays

- For submitting many similar Slurm jobs using a single job script
- Some use cases
  - Varying simulation or model parameters
  - Running the same processing steps on different datasets
  - Running the same models on different datasets
- Setting up a job array
  - Add `#SBATCH --array=` option to job script
  - Each job task will use the same resource requests
  - Modify job script or R script to use the array index
  - Slurm job limits may require submitting array in chunks
- [Slurm job array docs](https://slurm.schedmd.com/job_array.html)

---

## Slurm job array example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-3

module purge
module load rstats/4.5.1

echo "Task ID: $SLURM_ARRAY_TASK_ID"

Rscript script.R
```

---

## Slurm job array example (continued)

```r
# R script to process data (job array)
library(data.table)

files <- list.files("data", full.names = TRUE)
task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
file <- files[task]
file

data <- fread(file)
summary(data)
```

---

## Many-task computing with HyperShell

- [HyperShell](https://hypershell.readthedocs.io/) can be used like a mini-scheduler within a Slurm job
- Submitting lots of jobs (e.g., > 1000) negatively impacts the Slurm job scheduler
  - And there are limits on the number of jobs a user can submit at the same time
- So pack short-running jobs (e.g., < 1 hour) into one Slurm job
  - Lots of serial jobs that could be run in parallel on different cores
  - Lots of parallel jobs that could be run sequentially or in parallel

```bash
# Load modules
module purge
module load hypershell/2.7.2
module load rstats/4.5.1
```

---

## Using GPUs

- No well-maintained R packages for using GPUs
- May not be worth using GPUs (compared to multi-threaded BLAS)
- Mostly useful for machine learning tasks
  - `torch`
  - `keras`
  - `tensorflow`

---

## Interfacing to a compiled language

- If your R code is not fast enough, consider rewriting it in a compiled language
- R has a native interface for C and Fortran programs
- Use `Rcpp` family of packages as interface for C++ programs (see the sketch on the next slide)
- Could also interface to Julia via `JuliaCall`
- Which language to use depends on the data types, computations, etc.

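---

## Rcpp example (sketch)

A minimal illustration of the `Rcpp` route: `cppFunction()` compiles a C++ function and binds it as an R function. Real use cases would move loops or other hot spots into C++.

```r
# Compile a small C++ function from R with Rcpp
library(Rcpp)

cppFunction("
double sum_squares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];
  }
  return total;
}
")

sum_squares(c(1, 2, 3))  # 14
```
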
---

class: center, middle, inverse

## Section 3

### Using memory efficiently

---

## Using memory efficiently

- If R runs out of memory
  - Within R, error message like "cannot allocate vector"
  - For Slurm jobs, error message like "oom-kill event" (out-of-memory)
  - May just need to request more memory if available
- Avoid copying data and modify-in-place when possible
- Remove objects from environment when no longer needed
- Store in simpler formats (e.g., use matrix instead of data frame when possible)
- Store data in alternative efficient formats

---

## Avoiding object duplication

- R tries not to copy objects (copy-on-modify)
  - Copying slows down run time and uses memory
- Functions that modify objects will typically copy objects before modifying
  - Working with large data objects can lead to large memory use because of this
- Growing objects will duplicate objects in memory
  - Pre-allocate objects when possible (see the sketch after the example)
- Use modify-in-place operations when possible

---

## Object duplication example

```r
# Create object
a <- c(1, 2, 3, 4)
tracemem(a)

# Same object, different name
b <- a
tracemem(b)

# Modify object with different name, so gets duplicated
b[1] <- 5

# Modify same object in place, so no duplication
b[3] <- 6
tracemem(b)
```

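---

## Pre-allocation example (sketch)

To see why growing objects is costly, compare growing a vector inside a loop with pre-allocating it; a minimal sketch (timings will vary).

```r
n <- 50000

# Growing: the vector is re-allocated and copied on many iterations
system.time({
  grown <- numeric(0)
  for (i in 1:n) grown <- c(grown, i^2)
})

# Pre-allocating: the vector is created once, then filled in place
system.time({
  pre <- numeric(n)
  for (i in 1:n) pre[i] <- i^2
})
```
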
---

## When does R copy objects?

- Depends on how objects are modified and which functions are used
- Can be difficult to predict when copies are made
- Use `tracemem()` and memory profiling to collect data

---

## Memory-efficient data formats

- Alternative data formats can reduce memory use
  - Smaller object size
  - Faster data I/O and data operations
  - For on-disk formats, performance limited by read/write speeds
- For data frames
  - `data.table` package for modify-in-place operations
  - `arrow` package for on-disk binary format (columnar format)
  - `fst` package for on-disk binary format (efficient reading and subsetting)
- For other data structures
  - `bigmemory` for big matrices
  - `bigstatsr` for on-disk big matrices
  - `pbdDMAT` for big matrices in MPI jobs
  - `ncdf4` or `RNetCDF` for NetCDF files (arrays)
  - `pbdNCDF4` for NetCDF files (arrays) in MPI jobs
  - `hdf5r` for HDF5 files

---

class: center, middle, inverse

## Section 4

### Processing data efficiently

---

## Fast data input/output (I/O)

- Base R functions for I/O are relatively slow
- For faster I/O (based on format)
  - `data.table` for tabular data
  - `vroom` for tabular data
  - `readr` for tabular data
  - `arrow` for binary Arrow files
  - `fst` for binary fst files (data frames)
  - `qs` for binary qs files
- Minimize I/O when possible to improve performance
- On CARC HPC clusters, `/project2` and `/scratch1` are high-performance, parallel I/O file systems

---

## Fast data processing

- Base R and `tidyverse` packages are relatively slow
- For faster data processing
  - `data.table` in general (see the modify-in-place sketch on the next slide)
  - `dtplyr` for drop-in `dplyr` substitute (`data.table` backend)
  - `tidytable` for tidy data manipulation (`data.table` backend)
  - `tidyfst` for tidy data manipulation (`data.table` backend)
  - `multidplyr` for big data `dplyr` substitute (> 10M obs)
  - `bigstatsr` for big matrices (larger than memory)
  - `collapse` for data transformation and exploration

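---

## Modify-in-place with data.table (sketch)

A minimal sketch of what modify-in-place looks like with `data.table`: the `:=` operator adds or updates columns by reference, without copying the table (column names here are illustrative).

```r
library(data.table)

dt <- data.table(x = rexp(1e6), y = rexp(1e6))

# Add a column by reference (no copy of dt)
dt[, ratio := x / y]

# Update a subset of rows by reference
dt[x > 1, ratio := NA_real_]

# Compare with a data frame, where the same change copies the object
df <- data.frame(x = rexp(1e6), y = rexp(1e6))
tracemem(df)
df$ratio <- df$x / df$y  # tracemem() reports a copy of df
```
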
---

class: center, middle, inverse

## Section 5

### Profiling and benchmarking

---

## Profiling and benchmarking

- Aim for code that is *fast enough*
- Programmer time is more expensive than compute time
- Basic profiling workflow:
  1. Profile code to understand run time and memory use
  2. Identify bottlenecks (i.e., parts of code that take the most time)
  3. Try to improve performance of bottlenecks by modifying code
  4. Benchmark alternative code to identify best alternative

---

## Profiling R code

- Base R: `Rprof()`, `summaryRprof()`, and `Rprofmem()`
- `profvis` for RStudio (uses `Rprof()` output file)
- `proftools` (more features, better interface)
- `profmem` for memory profiling
- `pbdPROF`, `pbdPAPI`, and `hpcvis` for MPI and low-level profiling
- Profiling output can be difficult to interpret
- Note that C/C++/Fortran code is not profiled (except with `pbdPAPI`)
- Use [CARC OnDemand](https://ondemand.carc.usc.edu) apps like RStudio Server to view graphics

---

## Profiling using proftools

- Use `profileExpr()` to profile R code or script
  - Profiles line-by-line and saves output
- Then use other functions to summarize output
  - `srcSummary()`
  - `flatProfile()`
  - `hotPaths()`

---

## proftools example

```r
library(proftools)

srcfile <- system.file("samples", "bootlmEx.R", package = "proftools")
system(paste("cat", srcfile))

pd <- profileExpr(source(srcfile))
srcSummary(pd)
```

---

## Benchmarking R code

- Base R: `system.time()`
- `bench` (more features)
- `microbenchmark` for short-running code
- `benchmarkme` for benchmarking hardware

---

## system.time() examples

```r
data <- data.frame(x = rexp(200000000), y = rexp(200000000))
system.time(data[data$x > 1, ])
system.time(data[which(data$x > 1), ])
system.time(subset(data, x > 1))

mat <- matrix(rexp(200000000), ncol = 800000)
data <- as.data.frame(mat)
system.time(for (i in seq_along(data)) mean(data[[i]]))
system.time(apply(data, 2, FUN = mean))
system.time(sapply(data, mean))
system.time(colMeans(data))
system.time(colMeans(mat))
```

---

class: center, middle, inverse

## Section 6

### Resources and support

---

## Additional resources

- [R Project](https://www.r-project.org)
- [R Manuals](https://cran.r-project.org/manuals.html)
- [R package documentation](https://rdrr.io)
- [CRAN Task View on High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
- [HPCRAN](https://hpcran.org/)
- [Programming with Big Data in R](https://pbdr.org/)
- [Futureverse](https://www.futureverse.org/)
- [Fastverse](https://fastverse.github.io/fastverse/)
- [rOpenSci](https://ropensci.org/)
- [Bioconductor](https://www.bioconductor.org/)

---

## Web books

- [The R Inferno](https://www.burns-stat.com/documents/books/the-r-inferno/)
- [Advanced R](https://adv-r.hadley.nz/)
- [Efficient R Programming](https://csgillespie.github.io/efficientR/)

---

## CARC support

- [Workshop materials](https://github.com/uschpc/workshop-r-hpc)
- [Submit a support ticket](https://www.carc.usc.edu/user-support/submit-ticket)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Get Zoom link [here](https://www.carc.usc.edu/user-support/office-hours-and-consultations)