class: middle, center, title-slide

# HPC with Julia
#### Derek Strong / dstrong[at]usc.edu

#### Center for Advanced Research Computing
University of Southern California
##### Last updated on 2024-04-05

---

## Outline

1. What is HPC?
2. Julia on CARC HPC clusters
3. Julia's high-performance design
4. Profiling and benchmarking
5. Parallel computing
   - Multi-threading
   - Multi-processing and distributed computing
   - GPU computing

---

## What is HPC?

- High-performance computing
  - Relative to laptop and desktop computers
  - More processors (cluster of compute nodes)
  - More memory (shared and distributed memory)
  - High-capacity, fast-access storage
  - High-speed, high-bandwidth networks

---

## What does HPC enable?

- **Scaling** (using more compute resources)
- **Speedup** (faster run times)
- **Throughput** (running many jobs at the same time)
- **Duration** (running jobs for days or weeks)

---

## Why use Julia on HPC clusters?

- Limitations of a laptop or workstation computer
  - Space (data does not fit in memory or storage)
  - Time (program takes too long to run)
- A high-level, high-performance dynamic programming language for numerical computing
  - In general, faster than other dynamic languages (Python, R, Matlab, etc.)
  - Easier to learn and use than static, compiled languages (C, C++, Fortran)

---

## Simplified computing model

```sh
CPU <--> RAM <--> Storage
```

- A compute node (individual computer) has four main components
  - Central Processing Unit (CPU with multiple cores/logical CPUs)
  - Random Access Memory (RAM)
  - Storage drives (direct-attached and/or network-attached storage)
  - Buses (memory, I/O)
- Some nodes may have Graphics Processing Units (GPUs) attached (via I/O bus)

```sh
Node <--> Node <--> Node <--> ...
```

- A computing cluster has multiple compute nodes networked together
  - Different types of networks are possible

---

## Compute node specs

- HPC clusters may have a mix of different compute nodes
  - Different types of CPUs (e.g., Intel vs. AMD)
  - Varying numbers of **cores per node** (time)
  - Varying amounts of **memory per node** (space)
  - Different types of GPUs (e.g., A100 vs. A40)
  - Varying numbers of **GPUs per node** (time)
  - Varying amounts of **memory per GPU** (space)
- On CARC HPC clusters
  - Use the `nodeinfo` command to display cluster partition and node status
  - 1 logical CPU = 1 core = 1 thread (`--cpus-per-task`)
  - See more node details with: `module load gcc/11.3.0 hwloc && lstopo`
  - See Discovery cluster resources [here](https://www.carc.usc.edu/user-information/user-guides/hpc-basics/discovery-resources)

---

## CPU specs

- Different CPU models have different capabilities and features
  - Instruction set architectures (e.g., x86-64, ARM64, etc.)
  - Microarchitecture designs (e.g., AMD Zen 3, Intel Cascade Lake, etc.)
  - Vector extensions (e.g., AVX, AVX2, AVX-512, etc.)
  - Clock speed (aka frequency)
  - Local memory cache size
- Simply running jobs on a better CPU will typically reduce run time
- Use `lscpu` to display CPU specs on a node
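From within Julia, a few built-ins report similar information about the node you are running on (a small sketch; output varies by node):

```julia
using InteractiveUtils   # provides versioninfo(); loaded automatically in the REPL

versioninfo()            # Julia build info, including the CPU model and word size
println(Sys.CPU_NAME)    # microarchitecture detected by LLVM, e.g., "znver3"
println(Sys.CPU_THREADS) # number of logical CPUs visible to Julia
```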
---

## GPU specs

- Different GPU models have different capabilities and features
  - Architecture (Volta vs. Ampere)
  - Clock speed (aka frequency)
  - Amount of GPU memory
  - GPU memory bandwidth
  - Number of CUDA cores
  - Number of tensor cores
- Simply running jobs on a better GPU will typically reduce run time

---

## Computing constraints for jobs

- Different types of constraints
  - CPU cores = CPU-bound jobs
  - Memory = Memory-bound jobs
  - I/O speed = I/O-bound jobs
  - Network speed = Network-bound jobs
- In general, focus on the following
  - Number of CPU cores
  - CPU clock speed
  - Amount of memory
  - GPU model
  - Number of GPUs
  - Amount of GPU memory

---

## Reserving vs. using HPC resources

- For Slurm jobs, reserving resources does not mean that the job actually uses them
  - Your Julia code has to make use of multiple cores and nodes
  - Some kind of parallel programming
- Requesting more cores and nodes does not necessarily lead to speedups
  - There is an optimal number of cores depending on the computations

---

## Julia on CARC HPC clusters

- Julia is available via the software module system

```sh
module purge
module load julia/1.10.2
```

- Modules for new stable versions are added as they are released
- [CARC OnDemand](https://ondemand.carc.usc.edu/) interactive apps provide a graphical interface
  - JupyterLab
  - VS Code
  - Terminal (e.g., [Pluto.jl](https://plutojl.org/))

---

## Where to store Julia depot?

- Julia depot = packages, artifacts, compiled cache, etc.
- By default, the depot is stored in your home directory: `~/.julia`
  - The /home1 file system is NFS (no parallel I/O)
  - The /project and /scratch1 file systems are BeeGFS (parallel I/O)
- For large parallel jobs, performance may be improved by moving the depot to /project or /scratch1
  - Specify the depot path by setting the environment variable `JULIA_DEPOT_PATH`
  - Make sure to use a unique path per Julia minor version
  - Add to `~/.bashrc` or `~/.zshrc` to automatically set it every time you log in

```sh
export JULIA_DEPOT_PATH=/scratch1/ttrojan/julia/depot
```

---

## Julia's high-performance design

- Designed to solve the two-language problem
  1. Prototype in a high-level, interpreted language (Python, R, Matlab, etc.)
  2. Improve performance by re-writing in a low-level, compiled language (C, C++, Fortran)
- A dynamically typed language that runs fast
  - "walks like Python, runs like C"
  - Compiled, but feels like a scripting language
  - Can be used interactively
  - Productive for developing and running code
- Two important design features
  - Multiple dispatch
  - Just-in-time (JIT) compiling (using [LLVM](https://www.llvm.org/))

---

## Other Julia features

- Package manager and ecosystem
  - Diverse ecosystem of packages
  - Efficient and reproducible package environments
  - Parallel pre-compilation of packages
  - Pre-built binaries for external dependencies (e.g., FFTW, HDF5, MPI, etc.)
- Metaprogramming
- Asynchronous programming
- Native logging, debugging, profiling, and benchmarking
- Open source ([MIT license](https://github.com/JuliaLang/julia/blob/master/LICENSE.md))

---

## Multiple dispatch

- Ability to define generic functions with different methods applied to different argument types

```julia
function cube(x)
    x^3
end

cube(5)
cube(5.5)
cube("julia")
```

- Type inference occurs at runtime; no explicit or restrictive typing needed
- Julia will JIT compile different versions of the same function to efficient, specialized machine code based on different combinations of argument types
- If a type does not work with the function's operations, you will get a MethodError
- Can restrict function inputs using subtyping (abstract types); see the sketch on the next slide
- Enables a high degree of code reuse
  - Ability to define new types that can be used with existing functions
  - Ability to define new functions that apply to existing types

---

## Types in Julia

- Abstract and concrete types
  - Any
    - Number
      - Real
        - AbstractFloat
          - Float64, Float32
        - Integer
          - Int64, Int32, Bool
      - Complex
    - AbstractChar
      - Char
    - AbstractString
      - String
      - SubString
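A quick way to explore this hierarchy in the REPL, and to restrict a method to part of it (a small sketch):

```julia
Int64 <: Integer        # true; <: tests "is a subtype of"
Integer <: Real         # true
supertype(Float64)      # AbstractFloat
isa(1.5, Real)          # true
isa("julia", Number)    # false

# Restrict a method to an abstract type via subtyping
square(x::Real) = x^2
square(2)               # works, because Int64 <: Real
# square("julia")       # would throw a MethodError: String is not a subtype of Real
```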
---

## Just-in-time (JIT) compiling

- A compiler translates code from one language to another
- How Julia's JIT compiling works
  1. Call Julia function with specific argument inputs
  2. Infers type information from function inputs (e.g., Int64)
  3. Translates Julia code into LLVM intermediate representation (IR) code (using specific types)
  4. Then LLVM compiles IR code into static program (efficient, specialized machine code)
- Compiling incurs an initial startup cost (first function call)
- Useful macros for compiler introspection
  - Use `@code_warntype` to see inferred types and any issues
  - Use `@code_llvm` to see LLVM IR code
  - Use `@code_native` to see native assembly code

---

## Examples for compiler introspection

```julia
function cube(x)
    x^3
end

@code_warntype cube(5)
@code_warntype cube(5.5)
@code_warntype cube("julia")

@code_llvm cube(5)
@code_llvm cube(5.5)
@code_llvm cube("julia")

@code_native cube(5)
@code_native cube(5.5)
@code_native cube("julia")
```

---

## General recommendations to improve performance of Julia code

- Code first, optimize later (if needed)
  - Profile code to identify bottlenecks
- Use existing optimized solutions
  - Consult package and function documentation
- Simplify when possible (do less)
- [Official Julia performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/)
  - Write generic functions (modular and testable)
  - Keep types stable within function operations
  - Access arrays in memory order (columns)
  - Pre-allocate output objects
  - "Fuse" vectorized operations
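A small illustration of two of these tips, pre-allocating the output and fusing vectorized operations (a sketch; the function names are made up for the example, and `@time` is used since BenchmarkTools.jl is covered later):

```julia
x = rand(10^6)

f_alloc(x) = 2 .* x .+ sqrt.(x)    # allocates a new output array on every call

function f_fused!(out, x)
    @. out = 2 * x + sqrt(x)       # @. fuses the whole expression into a single loop over out
    return out
end

out = similar(x)                   # pre-allocate the output once

f_alloc(x); f_fused!(out, x);      # run once so compilation time is not measured
@time f_alloc(x);
@time f_fused!(out, x);
```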
---

## Profiling and benchmarking

- Aim for code that is *fast enough*
  - Programmer time is more expensive than compute time
- Basic profiling workflow:
  1. Profile code to understand execution time and memory use
  2. Identify bottlenecks (i.e., parts of code that take the most time)
  3. Try to improve performance of bottlenecks by modifying code
  4. Benchmark alternative code to identify best alternative

---

## Profiling Julia code

- `@profile` macro with `using Profile` (sampling/statistical profiler)
- Packages for visualizing `@profile` output
  - [FlameGraphs.jl](https://github.com/timholy/FlameGraphs.jl)
  - [ProfileView.jl](https://github.com/timholy/ProfileView.jl)
  - [ProfileSVG.jl](https://github.com/kimikage/ProfileSVG.jl)
  - [ProfileVega.jl](https://github.com/davidanthoff/ProfileVega.jl)
  - [StatProfilerHTML.jl](https://github.com/tkluck/StatProfilerHTML.jl)
  - [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
- Can also use external profiling tools: perf, OProfile, or Intel VTune
- Profiling output can be difficult to interpret

---

## Profiling example

```julia
using Profile
using LinearAlgebra

function test()
    A = rand(2000, 2000)
    B = inv(A)
    svd(B)
end

Profile.clear()
@profile test()
Profile.print()
```

---

## Benchmarking Julia code

- `@time` or `@timev` macros for elapsed time and memory allocations
- To measure memory allocation line by line, start Julia with the `--track-allocation=` option
  - The setting can be `none` (default), `user` (everything except Julia's core code), or `all` (everything)
  - Upon exit, results are written to `.mem` text files
- Alternatives to consider
  - `@btime` or `@benchmark` macros from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl)
  - `@timeit` macro from [TimerOutputs.jl](https://github.com/KristofferC/TimerOutputs.jl)
  - [BaseBenchmarks.jl](https://github.com/JuliaCI/BaseBenchmarks.jl) for benchmarking suites
  - [PkgBenchmark.jl](https://github.com/JuliaCI/PkgBenchmark.jl) for tracking package performance

---

## Benchmarking example

```julia
using LinearAlgebra

function test()
    A = rand(2000, 2000)
    B = inv(A)
    svd(B)
end

@time test()
@timev test()

using BenchmarkTools

@btime test()
@benchmark test()
```

---

## Parallel computing

- Simultaneous execution of different parts of a larger computation
- Different kinds of parallel computing
  - Data vs. task parallelism
  - Tightly coupled (interdependent) vs. loosely coupled (independent) computations
  - Multi-threading vs. multi-processing
  - Shared memory vs. distributed memory
  - CPU vs. GPU computing
- Using one (multicore) compute node is easier than using multiple nodes
- **Scaling** computation to more processors (cores)
- Focus on **speedup** (decrease in run time)
- [Official docs on parallel computing](https://docs.julialang.org/en/v1/manual/parallel-computing/)

---

## Some use cases for parallel computing

- Looping over a large number of objects and applying the same operations
- Running the same model on many datasets
- Running many alternative models on the same dataset
- Processing and analyzing data larger than the memory available on a single node

---

## Costs of parallelizing

- Some computations are not worth parallelizing
- Some costs to parallelizing (overhead)
  - Changing code
  - Spawning child processes
  - Copying data and environment
  - Communications
- Speedup not proportional to number of cores (Amdahl's law)
- Optimal number of cores
  - Depends on specific computations
  - Experiment with scaling tests to find

---

## Implicit parallel programming

- Parallel details are abstracted away from the end user (low-effort parallelism)
- Typically based on multi-threading, sometimes multi-processing
- Just need to specify how many threads or cores to use
- Typically limited to one compute node (so maximize cores if needed)

---

## Existing solutions with multi-threading

- When using packages with threaded functions
- End user simply sets the number of threads to use
  - `julia --threads 4`
  - `export JULIA_NUM_THREADS=4`
- Check number of threads in use
  - `Threads.nthreads()`

---

## DataFrames.jl example

- Functions in [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl) are multi-threaded
- Just need to specify number of threads
- Optimal number of threads may depend on specific function, data, etc.

```sh
julia data.jl
```

vs.

```sh
julia --threads 4 data.jl
```
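A minimal sketch of what a script like `data.jl` might contain (the dataset is made up for the example; grouping operations in DataFrames.jl use the available Julia threads automatically):

```julia
using DataFrames
using Statistics

# Synthetic data: 10 million rows in 1,000 groups
df = DataFrame(group = rand(1:1000, 10^7), x = randn(10^7))

# groupby/combine can run across the threads Julia was started with
result = combine(groupby(df, :group), :x => mean => :mean_x)
println(first(result, 5))
```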
---

## Multi-threaded BLAS/LAPACK library

- BLAS = Basic Linear Algebra Subprograms
- LAPACK = Linear Algebra PACKage
- By default, Julia uses the multi-threaded [OpenBLAS](https://github.com/JuliaBinaryWrappers/OpenBLAS_jll.jl) library
  - Can switch to alternatives via [libblastrampoline](https://github.com/JuliaLinearAlgebra/libblastrampoline)
- BLAS/LAPACK threads (external library) are different from Julia threads
- Check library and threads in use

```julia
using LinearAlgebra

BLAS.get_config()
BLAS.get_num_threads()
Sys.CPU_THREADS
```

- Good idea to set the number of threads

```sh
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK
julia
```

```julia
using LinearAlgebra

BLAS.set_num_threads(8)
```

---

## Conflicts between BLAS and Julia parallelism

- Nested parallelism
  - BLAS threads vs. Julia threads
  - Combining multi-threading and multi-processing
- Set OPENBLAS_NUM_THREADS=1

---

## Explicit parallel programming

- End user explicitly sets up parallelism
  - Multi-threading
  - Multi-processing or distributed computing
  - GPU computing
- Which kind to use depends on specific computations
  - Single node (shared memory) vs. multiple nodes (distributed memory)
  - Easier to set up with a single (multicore) node

---

## Multi-threading

- `using Base.Threads`
- `@threads` macro
- Composable task parallelism using shared memory (similar to OpenMP)
- Try to balance workload between threads (static vs. dynamic scheduling)
- Need to avoid data races
  - Problems typically stem from thread interactions
- Limited to a single compute node
- Set number of threads to use
  - `julia --threads 4`
  - `export JULIA_NUM_THREADS=4`
- Check number of threads in use
  - `Threads.nthreads()`
- [Official docs on multi-threading](https://docs.julialang.org/en/v1/manual/multi-threading/)
- Also see [FLoops](https://github.com/JuliaFolds/FLoops.jl) for an alternative approach

---

## Multi-threading examples

```julia
using Base.Threads

# For loop
@threads for i = 1:16
    println("From thread $(threadid()) // i = $i")
end

# Task spawning
function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end

@timev fib(5)
@timev fib(32)
```

---

## Thread pinning

- [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl/)
- May improve performance by pinning threads to cores (task affinity)
- Compact vs. scattered pinning
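A minimal sketch of how this looks in practice (assuming ThreadPinning.jl is installed in your environment and Julia was started with multiple threads):

```julia
using ThreadPinning

pinthreads(:cores)   # pin each Julia thread to its own core
threadinfo()         # print which cores the threads are pinned to
```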
---

## Multi-threading job script example

```sh
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.2

julia --threads $SLURM_CPUS_PER_TASK script.jl
```

---

## Multi-processing and distributed computing

- Enables scaling to multiple compute nodes
- Different approaches available
  - [Distributed](https://docs.julialang.org/en/v1/stdlib/Distributed/) library in base Julia
  - [MPI.jl](https://github.com/JuliaParallel/MPI.jl) for traditional MPI operations
  - [Dagger.jl](https://github.com/JuliaParallel/Dagger.jl) for dynamic task scheduling
  - [Pipelines.jl](https://github.com/cihga39871/Pipelines.jl) for running pipelines and workflows
  - [JobSchedulers.jl](https://github.com/cihga39871/JobSchedulers.jl) for managing tasks and workloads
- [Official docs on distributed computing](https://docs.julialang.org/en/v1/manual/distributed-computing/)

---

## Distributed computing via Distributed

- `using Distributed`
- Manager process and worker processes
- One-sided message passing that allows programs to run on multiple processes in separate memory domains at the same time
- Try to minimize sending messages and moving data for best performance
- Can also use [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl) to submit Slurm jobs from within Julia

---

## Setting up Distributed

- On a single node

```sh
julia --procs 4
```

- On multiple nodes

```sh
julia --machine-file nodes.txt
```

- Uses passwordless `ssh` key login
- TCP/IP transport for communications
- By default, all workers are connected to each other
  - Does not scale well
  - Optimal number of nodes is 2 or 3
  - If you need more, use MPI.jl instead

---

## Distributed example on a single node

```julia
using Distributed

@everywhere function test(n::Int)
    A = rand(n, n)
    B = inv(A)
end

proc2 = remotecall(test, 2, 5000)
proc3 = remotecall(test, 3, 5000)
proc4 = remotecall(test, 4, 5000)

sum = @spawnat :any fetch(proc2) + fetch(proc3) + fetch(proc4)

fetch(sum)
```

---

## Parallel loops and map examples

- For small operations, use the `@distributed` macro
- For large operations, use `pmap`

```julia
using Distributed

@everywhere function test(n::Int)
    A = rand(n, n)
    B = inv(A)
end

total = @distributed (+) for i = 1:16
    test(5000)
end

@everywhere using LinearAlgebra

@everywhere function test(x::Matrix)
    A = inv(x)
    B = svdvals(A)
end

M = Matrix{Float64}[rand(5000, 5000) for i = 1:4]

pmap(test, M)
```

---

## Distributed example on multiple nodes

- Need a machine file that lists hostnames (e.g., nodes.txt)

```sh
a01-02
a01-03
```

- Then launch

```sh
julia --machine-file nodes.txt
```

- For Slurm jobs, the machine file can be auto-generated

```sh
scontrol show hostnames $SLURM_NODELIST > $TMPDIR/nodes.txt

julia --machine-file $TMPDIR/nodes.txt script.jl
```
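Workers can also be added from within Julia instead of at launch time; a sketch that reads the same auto-generated node list (the choice of 4 workers per host is arbitrary for the example):

```julia
using Distributed

# Hostnames written by `scontrol show hostnames` in the job script
hosts = readlines(joinpath(ENV["TMPDIR"], "nodes.txt"))

# Start 4 workers per host over ssh; adjust to match the Slurm request
addprocs([(h, 4) for h in hosts])

nworkers()
```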
---

## Distributed job script example

```sh
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7513
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=1:00:00

module purge
module load julia/1.10.2

ulimit -s unlimited

scontrol show hostnames $SLURM_NODELIST > $TMPDIR/nodes.txt

julia --machine-file $TMPDIR/nodes.txt distributed.jl
```

---

## Distributed computing via MPI.jl

- [MPI.jl](https://github.com/JuliaParallel/MPI.jl) provides an interface to an MPI implementation (OpenMPI, MPICH, MVAPICH2, Intel MPI)
- By default, uses a pre-built MPICH binary
  - May improve performance by linking to a CARC-provided MPI module
- Single program, multiple data (SPMD) model
  - Multiple processes running independent programs, communicating as necessary via messages

---

## Installing MPI.jl

1 - Load modules

```sh
module purge
module load julia/1.10.2
module load gcc/12.3.0
module load openmpi/4.1.6
```

2 - Set preferences to use the system binary

```julia
pkg> add MPIPreferences

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary()
```

3 - **Restart Julia** and then install

```julia
pkg> add MPI
```

---

## MPI hello world example

- Develop MPI code

```julia
using MPI

MPI.Init()

comm = MPI.COMM_WORLD

println("Hello world from rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))")

MPI.Barrier(comm)

MPI.Finalize()
```

- Then launch with `srun`

```sh
srun --mpi=pmix_v2 -n $SLURM_NTASKS julia mpi.jl
```

---

## MPI reduce example

```julia
using MPI

MPI.Init()

comm = MPI.COMM_WORLD
root = 0

function test(n::Int)
    A = rand(n, n)
    B = inv(A)
end

res = MPI.Reduce(test(5000), MPI.SUM, root, comm)

if MPI.Comm_rank(comm) == root
    println(size(res))
end

MPI.Finalize()
```
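As a variation on the reduce example, `MPI.Allreduce` combines values and returns the result on every rank (a minimal sketch, launched the same way with `srun`):

```julia
using MPI

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Each rank contributes its rank number; every rank receives the sum
total = MPI.Allreduce(rank, MPI.SUM, comm)
println("Rank $rank sees total = $total")

MPI.Finalize()
```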
---

## MPI job script example

```sh
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7513
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=1:00:00

module purge
module load julia/1.10.2
module load gcc/12.3.0
module load openmpi/4.1.6

ulimit -s unlimited

srun --mpi=pmix_v2 -n $SLURM_NTASKS julia mpi.jl
```

---

## GPU computing

- A few options for programming interfaces
  - [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)
  - [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)
  - [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl)
  - [OpenCL.jl](https://github.com/JuliaGPU/OpenCL.jl)
- Packages for specific applications
  - [Flux.jl](https://fluxml.ai/) for machine learning
  - [DiffEqGPU.jl](https://github.com/SciML/DiffEqGPU.jl) for neural differential equations
  - [Turing.jl](https://github.com/TuringLang/Turing.jl) for probabilistic programming
  - [Metalhead.jl](https://github.com/FluxML/Metalhead.jl) for computer vision

---

## Installing CUDA.jl

1 - Log on to a GPU node and load the module

```sh
module purge
module load julia/1.10.2
```

2 - Then install

```julia
pkg> add CUDA
```

3 - Check CUDA

```julia
julia> using CUDA

julia> CUDA.versioninfo()
```

- Installs a compatible CUDA toolkit based on the NVIDIA driver version installed
- Same NVIDIA driver version installed across GPU nodes

---

## CUDA example

```julia
using CUDA

W = cu(rand(1000, 1000))
b = cu(rand(1000))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = cu(rand(1000)), cu(rand(1000))

loss(x, y)
```
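Beyond array-style programming, CUDA.jl also supports writing custom kernels; a minimal sketch of an element-wise addition kernel (the function name and sizes are made up for the example):

```julia
using CUDA

# y .+= x, written as a custom CUDA kernel
function gpu_add!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

N = 2^20
x = CUDA.fill(1.0f0, N)   # vector of Float32 ones on the GPU
y = CUDA.fill(2.0f0, N)

@cuda threads=256 blocks=cld(N, 256) gpu_add!(y, x)

@assert all(Array(y) .== 3.0f0)
```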
---

## GPU job script example

```sh
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-task=a100:1
#SBATCH --mem=16G
#SBATCH --time=1:00:00

module purge
module load julia/1.10.2

julia --threads $SLURM_CPUS_PER_TASK cuda.jl
```
---

## Slurm job arrays

- For submitting and managing collections of similar jobs quickly and easily
- Some use cases
  - Varying simulation or model parameters
  - Running the same models on different datasets
- Setting up a job array
  - Add the `#SBATCH --array=` option to the job script
  - Each job task will use the same resource requests
  - Modify the job or Julia script to use the array index
- [Slurm job array docs](https://slurm.schedmd.com/job_array.html)
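Some common forms of the index list for the `--array` option (standard Slurm syntax):

```sh
#SBATCH --array=1-10        # tasks 1 through 10
#SBATCH --array=1,3,5,7     # an explicit list of task IDs
#SBATCH --array=1-100%10    # 100 tasks, at most 10 running at a time
```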
---

## Slurm job array script example

```sh
#!/bin/bash

#SBATCH --account=
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
#SBATCH --array=1-5

module purge
module load julia/1.10.2

julia --threads $SLURM_CPUS_PER_TASK array.jl
```

---

## Julia script example for Slurm job array

```julia
using CSV
using DataFrames

files = readdir("./data", join = true)

task = parse(Int64, ENV["SLURM_ARRAY_TASK_ID"])

file = files[task]

data = CSV.read(file, DataFrame)

describe(data)
```

---

## Many-task computing

- Lots of short-running jobs (< 15 minutes)
- Lots of serial jobs that could be run in parallel on different cores
- Lots of parallel jobs that could be run sequentially or in parallel
- Submitting lots of jobs (> 1000) negatively impacts the job scheduler
- Pack short-running jobs into one job
  - Use a program like [Launcher](https://www.carc.usc.edu/user-information/user-guides/software-and-programming/launcher)

---

## Additional resources

- [Julia language](https://www.julialang.org)
- [Julia documentation](https://docs.julialang.org/en/v1/)
- [Julia style guide](https://docs.julialang.org/en/v1/manual/style-guide/)
- [Julia performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/)
- [JuliaHub](https://juliahub.com/ui/Home)
- [JuliaParallel](https://github.com/JuliaParallel)
- [JuliaGPU](https://juliagpu.org/)
- [Flux](https://fluxml.ai/)

Tutorials:

- [Julia Learning](https://julialang.org/learning/)
- [JuliaAcademy](https://juliaacademy.com/courses)

Web books:

- [Think Julia: How to Think Like a Computer Scientist](https://benlauwens.github.io/ThinkJulia.jl/latest/book.html)
- [Julia Data Science](https://juliadatascience.io/)

---

## CARC support

- [Using Julia on CARC HPC clusters](https://www.carc.usc.edu/user-information/user-guides/software-and-programming/julia)
- [Submit a support ticket](https://www.carc.usc.edu/user-information/ticket-submission)
- Office Hours
  - Every Tuesday 2:30-5pm
  - Get Zoom link [here](https://www.carc.usc.edu/education-and-resources/office-hours)