class: left, top, title-slide .title[ # Introduction to R ] .author[ ### Derek Strong
dstrong[at]usc.edu
Center for Advanced Research Computing
University of Southern California
] .date[ ### Last updated on 2023-02-09 ] --- ## Outline 1. What is R? 2. Base R and packages 3. Objects and functions 4. Importing and exporting data 5. Summarizing data 6. Visualizing data 7. Modeling data 8. Control flow 9. Iteration 10. Scripting --- ## What is R? ![R](./images/Rlogo.png) - A programming environment and language for statistical computing and graphics - A multi-paradigm programming language - A high-level (human-friendly) interpreted scripting language --- ## History of R - Initially developed by statisticians at University of Auckland (1993) - Imitates syntax of older S language developed at Bell Labs (1976) - Uses open-source licenses (GNU GPL) - Intended as a high-level interface to numerous algorithms - Designed to make data analysis easier - Relative to compiled languages (C/C++/Fortran) - Emphasis on interactive data analysis - Rapid prototyping, high productivity - Ability to move from user to developer - Base R is mostly written in C as well as Fortran and R - R packages are mostly written in R and C/C++ --- ## Why R? - Main alternatives - Python - Julia - Stata - SAS - SPSS - Advantages - Designed for statistical computing - Free/libre/open source - Diverse community of users - Interpreted language (easy to write code compared to compiled language) - Disadvantages - Interpreted language (performance penalty) - Two-language problem (improving performance with compiled language) --- ## Installing R - Find instructions on [CRAN](https://cran.r-project.org/) - Multiple versions installed on CARC HPC clusters - `module purge` - `module load gcc/11.3.0 openblas/0.3.20 r/4.2.1` - Save a custom default module collection with `module save` - Then load it with `module restore` when you log in - Also consider using an integrated development environment (IDE) - [RStudio](https://posit.co/products/open-source/rstudio/) - [JupyterLab](https://jupyter.org/) - Available to use via [CARC OnDemand](https://ondemand.carc.usc.edu) --- ## Using R Two modes for the command-line interface (CLI): | Mode | Command | |---|---| | Interactive (console) | `R` | | Batch | `Rscript` | --- ## R console Focus on the R console for learning R: ```r $ R R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid" Copyright (C) 2022 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > ``` --- ## Arithmetic operators | Description | Operator | |---|---| | Addition | `+` | | Subtraction | `-` | | Multiplication | `*` | | Division | `/` | | Exponentiation | `^` or `**` | | Integer division | `%/%` | | Modulo | `%%` | --- ## Base R and packages - A **package** is a formal collection of related R functions, objects, classes, etc. - Base R is a stable set of core packages - Some base packages are loaded automatically when starting a new R session - Other packages that extend base R need to be installed - On Linux, most packages are compiled (if they contain C/C++/Fortran programs) - On macOS or Windows, binary versions are downloaded by default - Over 18,000 packages available - See [`renv`](https://rstudio.github.io/renv/) package for reproducible package environments *There's more than one way to do something in R* --- ## Installing and using packages Main repositories: - [CRAN](https://cran.r-project.org/) - [Bioconductor](https://www.bioconductor.org/) - Development versions from GitHub, GitLab, etc. Install with: ```r install.packages("package") BiocManager::install("package") remotes::install_github("org/package") ``` Load with: ```r library(package) ``` Remove with: ```r remove.packages("package") ``` --- ## Some R tips - R has case sensitive syntax - Use up/down arrow keys to cycle through command history - Use `tab` for auto-completion - Spacing can be important (mostly for readability) --- ## Objects and functions Two fundamental principles: 1. Everything that exists in R is an **object** 2. Everything that happens in R is a **function** call --- ## Objects - An object is anything stored in the R environment - Typically a data structure with associated attributes - A data structure is composed of data types - Could also be a function, a plot, etc. --- ## Data types Atomic types: - double (aka numeric / double precision float) - integer (aka numeric) - complex - character (aka string) - logical (`TRUE` or `FALSE`) - raw Other types based on atomic types: - factor (categorical, based on integer) - Date, POSIXct, POSIXlt, difftime, ts (based on double) Some packages provide other types: - [float](https://cran.r-project.org/web/packages/float/index.html) for single precision float - [lubridate](https://lubridate.tidyverse.org/) and [hms](https://hms.tidyverse.org/) for dates and times --- ## Data structures | Dimensions | Homogeneous data type | Heterogeneous data types | |---|---|---| | 1d | atomic vector | list | | 2d | matrix | data frame | | nd | array | | --- ## Functions for creating data structures | Data structure | Function | Coercion | Test | |---|---|---|---| | atomic vector | `c()` or `vector()` | `as.vector()` | `is.vector()` | | matrix | `matrix()` | `as.matrix()` | `is.matrix()` | | array | `array()` | `as.array()` | `is.array()` | | list | `list()` or `c()` or `vector()` | `as.list()` | `is.list()` | | data frame | `data.frame()` | `as.data.frame()` | `is.data.frame()` | --- ## Creating objects Create an object by assigning a value to a name Example for a simple vector: ```r vec <- c(1, 2, 3, 4) ``` - `vec` is the name of the object - `<-` is the assignment operator - `c()` creates a vector object - `c(1, 2, 3, 4)` is the value of the object - Names and values are different - Names are assigned to values --- ## Inspecting objects Various functions: - `ls()` to list objects - `rm()` to remove objects - `class()` to print class of object - `str()` to display the structure of an object Example for a simple vector: ```r vec <- c(1, 2, 3, 4) ls() class(vec) str(vec) rm(vec) ``` --- ## Missing values Missing values are represented by `NA` (considered logical) ```r missing <- c(1, 2, 3, NA) is.na(missing) anyNA(missing) missing[!is.na(missing)] ``` *Many functions have an `na.rm` argument to ignore/remove missing values* --- ## The data frame - Probably the most commonly used data structure - 2d with a mix of numeric, character, factor, etc. columns - Technically implemented as a list (set of vectors with the same length) - Enhanced versions of data frames - data table from `data.table` - tibble from `tidyverse` Example: ```r iris head(iris) tail(iris) str(iris) is.data.frame(iris) is.list(iris) ``` --- ## The list - A generic vector for a collection of objects - Useful for iteration purposes Example: ```r mix <- list(iris, uspop) str(mix) is.list(mix) as.list(iris) ``` --- ## The functional perspective R is mostly used as a functional programming language Inputs and outputs: .center[input object (argument) → `func()` → output object (value)] Example: .center[data frame → `tail()` → last rows of data frame] --- ## Functions A **function** is a unit of code that takes an input and produces an output Generic structure of functions in R: ```r func(arg1, arg2, ...) ``` - `func` is the function name - `arg1`, `arg2` are arguments passed to function - `...` represents other optional arguments Notes on using functions: - Arguments can be optional and may have default values - Argument names may need to be made explicit (e.g., `arg3 = value`) - Can also use syntax `package::func()` (if package is not loaded) - `print()` is the default function --- ## Function example Variations of a `tail()` function call: ```r tail(iris) tail(x = iris) tail(iris, 10) tail(iris, n = c(5, 2)) utils::tail(iris) ``` --- ## Nesting functions Function within a function example: ```r str(tail(iris)) ``` Or explicitly create intermediate objects: ```r last <- tail(iris) str(last) rm(last) ``` Or use pipe operator `|>` (may improve code readability): ```r iris |> tail() |> str() ``` --- ## Creating functions Define a function and assign it a name Example: ```r math <- function(x, y) { z <- x + y z * x } ``` - Create functions when reusing a set of code frequently - DRY principle (do not repeat yourself) --- ## Getting help For specific function: ```r ?func ``` For general search: ```r ??keyword ``` For specific package: ```r library(help = "pkg") ``` --- ## Working with files Main functions: - `getwd()` - `setwd()` - `list.files()` *Best practice is to organize R project files in unique directories, use relative paths* --- ## Importing and exporting data - Best to avoid base R and use other packages (better performance and features) - Some options depending on data format: | Data format | Package | |---|---| | Tabular | `readr`, `vroom`, `data.table` | | Spreadsheets | `xlsx`, `readxl`, `writexl`, `googlesheets4` | | Binary | `arrow`, `fst`, `blob` | | Foreign | `foreign`, `haven` | | Database | `odbc`, `DBI`, `RMariaDB`, `RPostgreSQL`, `RSQLite` | | HDF5 | `hdf5r` | | NetCDF | `ncdf4`, `RNetCDF` | | Spatial | `raster`, `sf`, `terra`, `stars` | --- ## Indexing operators | Description | Operator | |---|---| | Select one or more elements or rows/columns | `x[ ]` | | Select one element or column | `x[[ ]]` | | Select one element or column by name | `x$name` | - Use indexing operators for replacing values or subsetting (aka slicing) - Subsetting will typically simplify data structure by default Some examples: ```r iris[1, 2] <- 3 iris[ , 1:2] iris[1:2, ] iris[1:2, 1:2] iris[[1]] iris[[2]][1:4] iris$Sepal.Length iris$Sepal.Length <- NULL iris[["Species"]] ``` --- ## Relational operators | Relation | Operator | |---|---| | Equal to | `==` | | Not equal to | `!=` | | Lesser than | `<` | | Greater than | `>` | | Lesser than or equal to | `<=` | | Greater than or equal to | `>=` | Some examples: ```r iris[1, 2] == 2 iris[1, 2] != 2 iris[1, 2] >= 2 ``` --- ## Logical operators | Logic | Operator | |---|---| | NOT (negation) | `!` | | Element-wise AND (multiple TRUE) | `&` | | AND (first element) | `&&` | | Element-wise OR (one TRUE) | | | | OR (first element) | || | Some examples: ```r !TRUE TRUE & TRUE TRUE & FALSE FALSE & FALSE TRUE | FALSE FALSE | FALSE ``` --- ## Subsetting examples ```r iris[iris$Species == "setosa", ] iris[iris$Species != "setosa", ] iris[iris$Sepal.Length > 5 & iris$Sepal.Width < 3, ] iris[iris$Sepal.Length > 5 & iris$Sepal.Width < 3, 1:2] iris[iris$Sepal.Length > 5 | iris$Sepal.Width < 3, ] ``` Other packages with subsetting functions: - `data.table` - `dplyr` or `dtplyr` --- ## Summarizing data Some examples: ```r str(iris) summary(iris) summary(iris[ , 1:2]) library(skimr) skim(iris) ``` --- ## Visualizing data Many graphics functions: - `plot()`, `pairs()`, `boxplot()`, `hist()`, `dotchart()`, `mosaicplot()`, `contour()` - `png()`, `jpeg()`, `tiff()`, `pdf()` - Many options for customizing graphics - Other packages: - `ggplot2` — for advanced graphics - `leaflet` — for web-based interactive maps - `plotly` — for web-based interactive visualizations - `shiny` — for web-based interactive data apps --- ## Scatter plot example ```r plot(iris$Sepal.Length, iris$Sepal.Width) ``` <img src="index_files/figure-html/unnamed-chunk-24-1.png" width="400px" height="400px" /> --- ## Line plot example ```r plot(uspop) ``` <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="400px" height="400px" /> --- ## Histogram example ```r hist(iris$Sepal.Length) ``` <img src="index_files/figure-html/unnamed-chunk-26-1.png" width="400px" height="400px" /> --- ## Saving graphics For base R: ```r png("uspop.png") plot(uspop) dev.off() ``` --- ## Modeling data The base `stats` package has many statistical functions: - `lm()` for linear models - `glm()` for generalized linear models - See more: `library(help = "stats")` - Many other packages for statistical modeling --- ## Model example Simple linear model: ```r model <- lm(Sepal.Length ~ Sepal.Width + Species, iris) summary(model) AIC(model) BIC(model) ``` --- ## Control flow Controlling the flow of execution If statements: ```r if (cond) {expr} if (cond) {expr} else {alt.expr} ``` Loops: ```r for (var in seq) {expr} while (cond) {expr} repeat {expr} break next ``` --- ## If statement example Within a function: ```r math <- function(x, y) { z <- x + y if (z > 10) { z * 2 } else { z / 2 } } math(2, 2) math(7, 7) ``` --- ## While loop example ```r i <- 1 while (i < 100) { print(i) i = i * 2 } ``` --- ## Iteration Loop over multiple inputs and perform some task Two approaches: 1. `for` loops 2. `*apply()` family of functions - `lapply()`, `sapply()`, `vapply()`, `tapply()`, `mapply()` - different use cases --- ## Iteration examples ```r mats <- list(matrix(rnorm(25), ncol = 5), matrix(rnorm(25), ncol = 5), matrix(rnorm(25), ncol = 5)) for (i in seq_along(mats)) { print(eigen(mats[[i]])) } lapply(mats, eigen) mats2 <- list(matrix(rnorm(25), ncol = 5), matrix(rnorm(25), ncol = 5), matrix(rnorm(25), ncol = 5)) mapply(crossprod, mats, mats2, SIMPLIFY = FALSE) ``` --- ## Scripting - A script is a file with a set of commands to be run in sequence - Text file saved with `.R` extension - Documents your analysis and workflows - Enables reproducibility - Write comments by starting line with `#` - Typically best to organize scripts in a modular way - To run R scripts: - Within R console: `source()` - In the shell: `Rscript` --- ## Script example ```r # Summarize subset of iris data setosa <- iris[iris$Species == "setosa", ] summary(setosa) ``` Save as `setosa.R` In the shell, run with `Rscript setosa.R` --- ## Packages to consider - `data.table` for fast, efficient data processing - `tidyverse` family of packages - `readr`, `tidyr`, `dplyr`, `dtplyr`, `purrr` - `tidymodels` family of packages - `ggplot2` for advanced graphics - `sf` for spatial data - `renv` for reproducible package environments --- ## Additional resources - [R Project](https://www.r-project.org) - [R Manuals](https://cran.r-project.org/manuals.html) - [R package documentation](https://rdrr.io/) - [R cheatsheets](https://posit.co/resources/cheatsheets/) - Tutorials - [Swirl - Learn R in R](https://swirlstats.com/) - [R for Reproducible Scientific Analysis lessons](https://swcarpentry.github.io/r-novice-gapminder/) - [Posit Primers](https://posit.cloud/learn/primers) - Web books - [Hands-on Programming with R](https://rstudio-education.github.io/hopr/) - [An Introduction to R](https://intro2r.com/) - [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/) - [R for Data Science](https://r4ds.had.co.nz/) --- ## CARC support - [Using R on CARC HPC clusters](https://www.carc.usc.edu/user-information/user-guides/software-and-programming/r) - [R package installation guide](https://hpc-discourse.usc.edu/t/r-package-installation-guide/653) - [Submit a support ticket](https://www.carc.usc.edu/user-information/ticket-submission) - [User Forum](https://hpc-discourse.usc.edu/) - Office Hours - Every Tuesday 2:30-5pm - Get Zoom link [here](https://www.carc.usc.edu/education-and-resources/office-hours) --- ## Exercises 1. Install and load the `data.table` package 2. Get R session info with `sessionInfo()` 3. Create a list of two atomic vectors of different types and lengths 4. Display the structure of that list 5. Create a function that performs some sequence of arithmetic 6. Import and export a CSV file using the `vroom` package 7. Summarize the airquality dataset 8. Plot the discoveries dataset 9. Loop over a list of datasets and summarize them