class: center, top, title-slide

# R Programming for HPC

### Center for Advanced Research Computing
University of Southern California
#### Last updated on 2025-10-09

---

## Outline

1. HPC overview
2. Parallel programming
3. Using memory efficiently
4. Processing data efficiently
5. Profiling and benchmarking
6. Resources and support

---

class: center, middle, inverse

## Section 1

### HPC overview

---

## What is HPC?

- High-performance computing
- Relative to laptop and desktop computers
- More/better compute resources
  - More processors (cluster of compute nodes)
  - More memory (shared and distributed memory)
  - High-capacity, fast-access storage
  - High-speed, high-bandwidth networks

---

## What does HPC enable?

- **Scaling** — using more compute resources (CPUs, memory, GPUs, etc.)
- **Speedup** — faster run times
- **Throughput** — running many jobs at the same time
- **Duration** — running jobs for days or weeks

---

## Computing constraints

- A compute node (individual computer) has four main components
  - Central Processing Unit (CPU with multiple cores/logical CPUs)
  - Random Access Memory (RAM)
  - Storage drives (direct-attached and/or network-attached storage)
  - Input/Output (I/O) buses
- Four types of constraints (not exclusive)
  - CPU cores = CPU-bound jobs
  - Memory = Memory-bound jobs
  - I/O speed = I/O-bound jobs
  - Network speed = Network-bound jobs

---

## Compute node specs

- HPC clusters may have a mix of different compute nodes
  - Different types of CPUs (e.g., Intel vs. AMD)
  - Varying numbers of **cores per node** (time)
  - Varying amounts of **memory per node** (space)
  - Different memory bandwidth rates
  - Some nodes may also have GPUs
- On CARC clusters
  - Use `noderes` command to display available node resources
  - 1 logical CPU = 1 core = 1 thread (`--cpus-per-task`)

---

## CPU specs

- Different CPU models have different capabilities and features
  - Instruction set architectures (e.g., x86-64, ARM64, etc.)
  - Microarchitecture designs (e.g., AMD Zen 3, Intel Cascade Lake, etc.)
  - Vector extensions (e.g., AVX, AVX2, AVX-512, etc.)
  - Clock speed (aka frequency)
  - Local memory cache size
- Simply running jobs on a better CPU will typically reduce run time
- Use `lscpu` to display CPU specs on a node

---

## Requesting vs. using HPC resources

- For Slurm jobs, requesting resources does not mean that the job actually uses them (see the sketch on the next slide)
- Likely need to modify your R code to make use of multiple cores or nodes
  - There are different ways to do this
- Requesting more cores/nodes does not necessarily lead to speedups
- There is an optimal number of cores depending on the computations

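---

## Checking usable cores (sketch)

A minimal sketch of the difference between the node's hardware and the job's allocation; it assumes the `parallelly` package (used in later examples) is installed.

```r
# parallel::detectCores() reports all logical CPUs on the node,
# while parallelly::availableCores() respects the Slurm allocation
# (e.g., SLURM_CPUS_PER_TASK), which is what the job can actually use
library(parallelly)

parallel::detectCores()
availableCores()
```
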
---

## Limitations of R

- R is an interpreted language
  - Can be slow relative to compiled languages (C/C++/Fortran)
  - But easier to program
- R uses a single core by default
- R stores objects in memory by default
- Maximum length of a vector is 2^31 - 1 (2,147,483,647) (32-bit integer)
- Some of these limitations can be overcome

---

## Software optimization

- R itself is mostly written in compiled languages (C/C++/Fortran)
- Many R functions and packages are actually interfaces to compiled programs
- Compiled programs can be optimized based on CPU architectures
- Compiler optimization flags can be used to improve performance of R and some R packages
  - Primarily for installing packages on Linux from source
  - Specify optimization flags in `~/.R/Makevars` (see the sketch on the next slide)
  - If targeting a specific CPU type, the program will only run on that CPU type
  - Examples: `data.table`, `RStan`

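---

## Setting compiler flags (sketch)

A minimal sketch of creating `~/.R/Makevars` from R; the flags shown are common GCC options, not a CARC recommendation, and `-march=native` ties compiled packages to the CPU type of the build node.

```r
# Write optimization flags to ~/.R/Makevars so that packages
# installed from source are compiled with them
# Note: this overwrites any existing ~/.R/Makevars
dir.create("~/.R", showWarnings = FALSE)
writeLines(c(
  "CFLAGS = -O3 -march=native",
  "CXXFLAGS = -O3 -march=native",
  "FFLAGS = -O3 -march=native"
), con = "~/.R/Makevars")
```
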
---

## Some recommendations for performance

- Use a better CPU
- Use existing optimized solutions
- Code first, optimize later (if needed)
- Simplify when possible (do less)
- Parallelize when appropriate
- Modify-in-place (avoid duplicating data in memory)
- Profile code to identify bottlenecks
- Consult package and function documentation

---

class: center, middle, inverse

## Section 2

### Parallel programming

---

## Parallel processing and programming

- Parallel processing is the simultaneous execution of different parts of a larger computation
- Parallel programming implements parallel processing with code
- Different kinds of parallel processing
  - Data vs. task parallelism
  - Tightly coupled (interdependent) vs. loosely coupled (independent) computations
- Using one (multicore) compute node is easier than using multiple nodes
- **Scaling** computation to more processors (cores)
- Focus on **speedup** (decrease in run time)

---

## Costs of parallelizing

- Some computations are not worth parallelizing
- Some costs to parallelizing (overhead)
  - Changing code
  - Spawning child processes
  - Copying data and environment
  - Communications
  - Load balancing
- Speedup is not proportional to the number of cores (Amdahl's law)
- Optimal number of cores
  - Depends on specific computations
  - Experiment to find

---

## Easy parallelism

- Simply specify number of cores or threads for function to use
- Parallel programming details are abstracted away
- May be limited to one compute node (so maximize cores)
- Set less than or equal to physical cores or `$SLURM_CPUS_PER_TASK`
- Various examples
  - Multi-threading linear algebra library
  - Multi-threading `data.table` (via OpenMP)
  - Multi-processing `mclapply()`
  - Multi-processing `BiocParallel`

---

## Multi-threaded BLAS/LAPACK library

- R uses external library for linear algebra operations
  - BLAS = Basic Linear Algebra Subprograms
  - LAPACK = Linear Algebra Package
- On CARC clusters, the R modules use multi-threaded OpenBLAS (via OpenMP)
  - Default setting is one thread
  - Also built with dynamic CPU architecture optimizations
- Set number of threads to use with environment variable `OMP_NUM_THREADS`

```bash
# Allocate resources on compute node
salloc -c 8

# Load R, set threads, then start R
module purge
module load rstats/4.5.1
export OMP_NUM_THREADS=8
R
```

---

## OpenBLAS threads example

Compare time for operations with varying threads and data sizes

```r
# Smaller data
mat <- matrix(rexp(1000000), ncol = 1000)
mat2 <- matrix(rexp(1000000), ncol = 1000)

system.time(mat %*% mat2)
system.time(eigen(mat))
system.time(svd(mat))

# Larger data
mat <- matrix(rexp(25000000), ncol = 5000)
mat2 <- matrix(rexp(25000000), ncol = 5000)

system.time(mat %*% mat2)
system.time(eigen(mat))
system.time(svd(mat))
```

---

## Guidance for OpenBLAS threads

- The default of one thread is fine in many cases
- If working with large data sets, consider using multiple threads
  - Could save a lot of time if doing many BLAS operations
- If doing parallel iteration with BLAS operations, use one thread (the default) or restrict iterations
  - Otherwise, resource contention can degrade performance
  - For example, 8 cores total but running 8 parallel iterations trying to use 8 threads each
  - So restrict to 1 thread or get 64 cores in that case

---

## data.table threads example

Compare time to write files with varying threads

```r
library(data.table)
getDTthreads()

mat <- matrix(rexp(100000000), ncol = 10000)
system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat.csv")))

threads <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
setDTthreads(threads)
system.time(fwrite(mat, paste0(Sys.getenv("TMPDIR"), "/mat2.csv")))
```

---

## Parallel iteration

- Repeat operations with different inputs and process in parallel
- Various use cases
  - Running same data processing workflow on many data files
  - Running same statistical model on many datasets
  - Running many alternative models on same dataset
  - Processing and analyzing data larger than memory available on single node
- Set up cluster of cores or nodes to parallelize over
  - Use high-level functions to do this (e.g., `mclapply()`)
  - Easier to set up with single node compared to multiple nodes
- Nested parallelism can cause resource contention
  - So set BLAS and OpenMP threads to 1 in these cases (the default); see the sketch on the next slide

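---

## Restricting BLAS/OpenMP threads (sketch)

A minimal sketch of keeping implicit threading at one thread during parallel iteration; it assumes the `RhpcBLASctl` package is installed (exporting `OMP_NUM_THREADS=1` in the job script works just as well).

```r
# Avoid nested parallelism: one BLAS/OpenMP thread per parallel worker
library(parallel)
library(RhpcBLASctl)

blas_set_num_threads(1)
omp_set_num_threads(1)

cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
results <- mclapply(1:100, function(i) {
  m <- crossprod(matrix(rexp(2500), ncol = 50))
  sum(solve(m))
}, mc.cores = cores)
```
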
---

## Base R parallel package

- `library(parallel)`
- Various functions for parallel computing
  - `mclapply()`
  - `mcmapply()`
  - `pvec()`
  - `makeCluster()`
  - `parApply()`
  - `parLapply()`
  - `parSapply()`
  - `mcparallel()`
  - `mccollect()`

---

## Using mclapply()

- Parallel version of `lapply()` using forking
  - Works on Linux or macOS
  - Will only use 1 core on Windows
  - Forking is faster than making socket clusters
- Use on a single node with multiple cores
- For loosely coupled (independent) tasks (no communication needed between tasks)
- Apply same function to multiple inputs simultaneously using multiple cores

---

## mclapply() example

```r
# R multicore test (bootstrapping a GLM)
library(parallel)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- mclapply(1:trials, model, mc.cores = cores)
```

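---

## Reproducible RNG with mclapply() (sketch)

For reproducible bootstrap results across forked workers, select a parallel-safe random number generator before calling `mclapply()`; a minimal sketch (the seed value is arbitrary).

```r
# The L'Ecuyer-CMRG generator gives each forked worker its own
# independent random-number stream, so results are reproducible
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(42)
draws <- mclapply(1:4, function(i) rnorm(3), mc.cores = 2)
```
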
---

## Slurm job script for mclapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript multicore.R
```

---

## Parallel with the futureverse

- A unified parallelization framework for R
- Abstracts away the technical aspects of parallel processing
- Packages include `future`, `future.apply`, `furrr`, etc.
- Set up parallel resources with `plan()`
  - Single node: `plan(multicore)`
  - Multiple nodes: `plan(cluster)`
  - Does not scale well beyond a few nodes
- Various functions for parallel computing
  - `future_lapply()`
  - `future_map()`
- [Futureverse website](https://www.futureverse.org/)

---

## future plan(multicore) example

```r
# R future multicore test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

plan(multicore, workers = cores)
coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

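---

## plan(multisession) alternative (sketch)

`plan(multicore)` relies on forking, which is unavailable on Windows and disabled in RStudio; `plan(multisession)` is the portable single-node alternative. A minimal sketch, changing only the plan.

```r
# multisession launches separate background R sessions instead of
# forked processes, so it works in any environment on a single node
library(future)
library(future.apply)

plan(multisession, workers = 4)
res <- future_lapply(1:8, function(i) i^2)
plan(sequential)  # shut down the workers
```
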
---

## Job script for future plan(multicore) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-multicore.R
```

---

## future plan(cluster) example

```r
# R future cluster test (bootstrapping a GLM)
library(parallelly)
library(future)
library(future.apply)

trials <- 100000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

cl <- makeClusterPSOCK(availableWorkers(), revtunnel = FALSE)
plan(cluster, workers = cl)
coefs <- future_lapply(1:trials, model, future.seed = TRUE)
```

---

## Job script for future plan(cluster) example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=2G
#SBATCH --time=5:00

module purge
module load rstats/4.5.1

Rscript future-cluster.R
```

---

## Parallel with pbdMPI

- `pbdMPI` package provides an interface to an MPI library
  - Message Passing Interface (MPI)
  - Optimized for multi-node communications
  - On CARC HPC clusters, R modules use OpenMPI
- Provides best performance when scaling to many nodes
- For using many CPU cores across multiple nodes
  - Or using few CPU cores but lots of memory on multiple nodes
- Main function to use is `pbdLapply()`

---

## pbdLapply() example

```r
# R pbdMPI multinode test (bootstrapping a GLM)
library(pbdMPI)

init()

trials <- 400000

data <- iris[iris$Species != "setosa", c(1, 5)]
data$Species <- factor(data$Species)

model <- function(i, samp = data) {
  ind <- sample(nrow(samp), nrow(samp), replace = TRUE)
  results <- glm(samp[ind, 2] ~ samp[ind, 1], family = binomial(link = "logit"))
  coef(results)
}

coefs <- pbdLapply(1:trials, model, pbd.mode = "spmd")
comm.print(coefs[[1]])

finalize()
```

---

## Job script for pbdLapply() example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --time=1:00:00

module purge
module load rstats/4.5.1

srun Rscript script.R
```

---

## Other relevant packages

- `foreach` for parallel for loops (see the sketch on the next slide)
- `BiocParallel` for parallel processing with Bioconductor objects
- `Rmpi` for parallel processing using MPI

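---

## foreach example (sketch)

A minimal `foreach` sketch with a multicore backend; it assumes the `doParallel` package is installed (not otherwise covered in this workshop).

```r
# Parallel for loop: registerDoParallel() sets up the backend
# that %dopar% dispatches iterations to
library(foreach)
library(doParallel)

cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
registerDoParallel(cores = cores)

res <- foreach(i = 1:100, .combine = rbind) %dopar% {
  c(i = i, sq = i^2)
}

stopImplicitCluster()
```
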
---

## Job parallelism and many-task computing

- Running many jobs/tasks in parallel (independent or dependent)
- `rslurm`, `slurmR`, or `batchtools` for parallel processing by submitting Slurm jobs
- `clustermq` or `mirai` for parallel processing of many tasks (locally or as Slurm jobs)
- For defined workflows, use `targets` package with parallel processing
- For running many similar jobs, use Slurm job arrays
- For running many short-running jobs, use [HyperShell](https://hypershell.readthedocs.io/)

---

## Slurm job arrays

- For submitting many similar Slurm jobs using a single job script
- Some use cases
  - Varying simulation or model parameters
  - Running the same processing steps on different datasets
  - Running the same models on different datasets
- Setting up a job array
  - Add `#SBATCH --array=` option to job script
  - Each job task will use the same resource requests
  - Modify job script or R script to use the array index
  - Slurm job limits may require submitting array in chunks
- [Slurm job array docs](https://slurm.schedmd.com/job_array.html)

---

## Slurm job array example

```bash
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-3

module purge
module load rstats/4.5.1

echo "Task ID: $SLURM_ARRAY_TASK_ID"

Rscript script.R
```

---

## Slurm job array example (continued)

```r
# R script to process data (job array)
library(data.table)

files <- list.files("data", full.names = TRUE)
task <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
file <- files[task]
file

data <- fread(file)
summary(data)
```

---

## Many-task computing with HyperShell

- [HyperShell](https://hypershell.readthedocs.io/) can be used like a mini-scheduler within a Slurm job
- Submitting lots of jobs (e.g., > 1000) negatively impacts the Slurm job scheduler
  - And there are limits on the number of jobs a user can submit at the same time
- So pack short-running jobs (e.g., < 1 hour) into one Slurm job
  - Lots of serial jobs that could be run in parallel on different cores
  - Lots of parallel jobs that could be run sequentially or in parallel

```bash
# Load modules
module purge
module load hypershell/2.7.2
module load rstats/4.5.1
```

---

## Using GPUs

- No well-maintained R packages for using GPUs
- May not be worth using GPUs (compared to multi-threaded BLAS)
- Mostly useful for machine learning tasks
  - `torch`
  - `keras`
  - `tensorflow`

---

## Interfacing to a compiled language

- If your R code is not fast enough, consider rewriting it in a compiled language
- R has a native interface for C and Fortran programs
- Use `Rcpp` family of packages as interface for C++ programs (see the sketch on the next slide)
- Could also interface to Julia via `JuliaCall`
- Which language to use depends on the data types, computations, etc.

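---

## Rcpp example (sketch)

A minimal illustration of the `Rcpp` route: `cppFunction()` compiles a C++ function and binds it as an R function. Real use cases would move loops or other hot spots into C++.

```r
# Compile a small C++ function from R with Rcpp
library(Rcpp)

cppFunction("
double sum_squares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];
  }
  return total;
}
")

sum_squares(c(1, 2, 3))  # 14
```
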
---

class: center, middle, inverse

## Section 3

### Using memory efficiently

---

## Using memory efficiently

- If R runs out of memory
  - Within R, error message like "cannot allocate vector"
  - For Slurm jobs, error message like "oom-kill event" (out-of-memory)
  - May just need to request more memory if available
- Avoid copying data and modify-in-place when possible
- Remove objects from environment when no longer needed
- Store in simpler formats (e.g., use matrix instead of data frame when possible)
- Store data in alternative efficient formats

---

## Avoiding object duplication

- R tries not to copy objects (copy-on-modify)
  - Copying slows down run time and uses memory
- Functions that modify objects will typically copy objects before modifying
  - Working with large data objects can lead to large memory use because of this
- Growing objects will duplicate objects in memory
  - Pre-allocate objects when possible (see the sketch after the example)
- Use modify-in-place operations when possible

---

## Object duplication example

```r
# Create object
a <- c(1, 2, 3, 4)
tracemem(a)

# Same object, different name
b <- a
tracemem(b)

# Modify object with different name, so gets duplicated
b[1] <- 5

# Modify same object in place, so no duplication
b[3] <- 6
tracemem(b)
```

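---

## Pre-allocation example (sketch)

To see why growing objects is costly, compare growing a vector inside a loop with pre-allocating it; a minimal sketch (timings will vary).

```r
n <- 50000

# Growing: the vector is re-allocated and copied on many iterations
system.time({
  grown <- numeric(0)
  for (i in 1:n) grown <- c(grown, i^2)
})

# Pre-allocating: the vector is created once, then filled in place
system.time({
  pre <- numeric(n)
  for (i in 1:n) pre[i] <- i^2
})
```
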
---

## When does R copy objects?

- Depends on how objects are modified and which functions are used
- Can be difficult to predict when copies are made
- Use `tracemem()` and memory profiling to collect data

---

## Memory-efficient data formats

- Alternative data formats can reduce memory use
  - Smaller object size
  - Faster data I/O and data operations
  - For on-disk formats, performance limited by read/write speeds
- For data frames
  - `data.table` package for modify-in-place operations
  - `arrow` package for on-disk binary format (columnar format)
  - `fst` package for on-disk binary format (efficient reading and subsetting)
- For other data structures
  - `bigmemory` for big matrices
  - `bigstatsr` for on-disk big matrices
  - `pbdDMAT` for big matrices in MPI jobs
  - `ncdf4` or `RNetCDF` for NetCDF files (arrays)
  - `pbdNCDF4` for NetCDF files (arrays) in MPI jobs
  - `hdf5r` for HDF5 files

---

class: center, middle, inverse

## Section 4

### Processing data efficiently

---

## Fast data input/output (I/O)

- Base R functions for I/O are relatively slow
- For faster I/O (based on format)
  - `data.table` for tabular data
  - `vroom` for tabular data
  - `readr` for tabular data
  - `arrow` for binary Arrow files
  - `fst` for binary fst files (data frames)
  - `qs` for binary qs files
- Minimize I/O when possible to improve performance
- On CARC HPC clusters, `/project2` and `/scratch1` are high-performance, parallel I/O file systems

---

## Fast data processing

- Base R and `tidyverse` packages are relatively slow
- For faster data processing
  - `data.table` in general (see the modify-in-place sketch on the next slide)
  - `dtplyr` for drop-in `dplyr` substitute (`data.table` backend)
  - `tidytable` for tidy data manipulation (`data.table` backend)
  - `tidyfst` for tidy data manipulation (`data.table` backend)
  - `multidplyr` for big data `dplyr` substitute (> 10M obs)
  - `bigstatsr` for big matrices (larger than memory)
  - `collapse` for data transformation and exploration

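---

## Modify-in-place with data.table (sketch)

A minimal sketch of what modify-in-place looks like with `data.table`: the `:=` operator adds or updates columns by reference, without copying the table (column names here are illustrative).

```r
library(data.table)

dt <- data.table(x = rexp(1e6), y = rexp(1e6))

# Add a column by reference (no copy of dt)
dt[, ratio := x / y]

# Update a subset of rows by reference
dt[x > 1, ratio := NA_real_]

# Compare with a data frame, where the same change copies the object
df <- data.frame(x = rexp(1e6), y = rexp(1e6))
tracemem(df)
df$ratio <- df$x / df$y  # tracemem() reports a copy of df
```
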
---

class: center, middle, inverse

## Section 5

### Profiling and benchmarking

---

## Profiling and benchmarking

- Aim for code that is *fast enough*
- Programmer time is more expensive than compute time
- Basic profiling workflow:
  1. Profile code to understand run time and memory use
  2. Identify bottlenecks (i.e., parts of code that take the most time)
  3. Try to improve performance of bottlenecks by modifying code
  4. Benchmark alternative code to identify best alternative

---

## Profiling R code

- Base R: `Rprof()`, `summaryRprof()`, and `Rprofmem()`
- `profvis` for RStudio (uses `Rprof()` output file)
- `proftools` (more features, better interface)
- `profmem` for memory profiling
- `pbdPROF`, `pbdPAPI`, and `hpcvis` for MPI and low-level profiling
- Profiling output can be difficult to interpret
- Note that C/C++/Fortran code is not profiled (except with `pbdPAPI`)
- Use [CARC OnDemand](https://ondemand.carc.usc.edu) apps like RStudio Server to view graphics

---

## Profiling using proftools

- Use `profileExpr()` to profile R code or script
  - Profiles line-by-line and saves output
- Then use other functions to summarize output
  - `srcSummary()`
  - `flatProfile()`
  - `hotPaths()`

---

## proftools example

```r
library(proftools)

srcfile <- system.file("samples", "bootlmEx.R", package = "proftools")
system(paste("cat", srcfile))

pd <- profileExpr(source(srcfile))
srcSummary(pd)
```

---

## Benchmarking R code

- Base R: `system.time()`
- `bench` (more features)
- `microbenchmark` for short-running code
- `benchmarkme` for benchmarking hardware

---

## system.time() examples

```r
data <- data.frame(x = rexp(200000000), y = rexp(200000000))
system.time(data[data$x > 1, ])
system.time(data[which(data$x > 1), ])
system.time(subset(data, x > 1))

mat <- matrix(rexp(200000000), ncol = 800000)
data <- as.data.frame(mat)
system.time(for (i in seq_along(data)) mean(data[[i]]))
system.time(apply(data, 2, FUN = mean))
system.time(sapply(data, mean))
system.time(colMeans(data))
system.time(colMeans(mat))
```

---

class: center, middle, inverse

## Section 6

### Resources and support

---

## Additional resources

- [R Project](https://www.r-project.org)
- [R Manuals](https://cran.r-project.org/manuals.html)
- [R package documentation](https://rdrr.io)
- [CRAN Task View on High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
- [HPCRAN](https://hpcran.org/)
- [Programming with Big Data in R](https://pbdr.org/)
- [Futureverse](https://www.futureverse.org/)
- [Fastverse](https://fastverse.github.io/fastverse/)
- [rOpenSci](https://ropensci.org/)
- [Bioconductor](https://www.bioconductor.org/)

---

## Web books

- [The R Inferno](https://www.burns-stat.com/documents/books/the-r-inferno/)
- [Advanced R](https://adv-r.hadley.nz/)
- [Efficient R Programming](https://csgillespie.github.io/efficientR/)

---

## CARC support

- [Workshop materials](https://github.com/uschpc/workshop-r-hpc)
- [Submit a support ticket](https://www.carc.usc.edu/user-support/submit-ticket)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Get Zoom link [here](https://www.carc.usc.edu/user-support/office-hours-and-consultations)