This document is a copy of my personal cookbook of R data processing and manipulation recipes. Most recipes are built on the tidyverse packages dplyr and tidyr.
Examples use the following variables:
mytbl
: Arbitrary data.frame or tibble table with one or more variables

date
: Date variable (only for certain examples)

var1, var2, etc.
: Arbitrary table variables

PROCESSING_STEPS
: Any number of dplyr processing steps such as select(), mutate(), filter(), etc.

Each example subsection includes the package versions and R version used at the time of writing.
Last updated: 08/17/2021
Useful, for example, when we need the first or last event by date or time per grouping variable.
# Get the first row by date per groups defined by variables 'var1' and 'var2'.
mytbl %>%
group_by(var1, var2) %>%
arrange(date) %>%
filter(row_number()==1)
# As above but get the last row
mytbl %>%
group_by(var1, var2) %>%
arrange(date) %>%
filter(row_number()==n())
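With dplyr 1.0 and later, the same result can be written with slice_min() and slice_max(). A self-contained sketch with made-up data (the column names are illustrative):

```r
library(dplyr)

# Illustrative data: two groups with unordered dates
mytbl <- tibble(
  var1  = c("a", "a", "b", "b"),
  date  = as.Date(c("2021-01-02", "2021-01-01", "2021-03-01", "2021-02-01")),
  value = 1:4
)

# First row by date per group; with_ties = FALSE keeps exactly one row per
# group even when dates tie
first_rows <- mytbl %>%
  group_by(var1) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()

# Last row by date per group
last_rows <- mytbl %>%
  group_by(var1) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()
```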
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.1
Use case: extract a vector at the end of a dplyr %>% pipeline (if we just need to extract a variable from a table, we have the obvious mytbl$var solution).
Method 1: Use dplyr::pull() (preferred)
mytbl %>% PROCESSING_STEPS %>% pull(var1)
Method 2: Use the magrittr dot placeholder (.) and the R extraction and replacement operator ($).
mytbl %>% PROCESSING_STEPS %>% .$var1
The dot (.) is commonly used as a placeholder for the LHS in a processing step, e.g. x %>% f(y, .) is equivalent to f(y, x), and x %>% f(y, z, option = .) is equivalent to f(y, z, option = x). In the example above, the dot represents the table produced by mytbl %>% PROCESSING_STEPS (everything to the LHS of the operator). For additional detail, see this StackOverflow response.
Method 3: Use the magrittr exposition operator %$%
require(magrittr)
mytbl %>% PROCESSING_STEPS %$% var1
The exposition operator %$%
exposes variables on the LHS of the
operator to the RHS. For further information, see the magrittr vignette
page.
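A minimal illustration of the exposition operator, using a made-up two-column data.frame:

```r
library(magrittr)

df <- data.frame(x = 1:3, y = c(2, 4, 6))

# %$% exposes df's columns (x and y) to the cor() call on the RHS
r <- df %$% cor(x, y)
# r is 1, since y is a perfect linear function of x
```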
Method 4: Use the R extraction and replacement operator ($
)
# Option 1
mytbl_new <- mytbl %>% PROCESSING_STEPS
mytbl_new$var1
# Option 2 (not preferred)
(mytbl %>% PROCESSING_STEPS)$var1
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.1 ◊ magrittr: 1.5
Use floor_date()
from the
lubridate
package.
library(lubridate)
dataset %>%
# 'month' below can be replaced with any unit accepted by floor_date():
# 'second', 'minute', 'hour', 'day', 'week', 'month', 'bimonth', 'quarter',
# 'season', 'halfyear', or 'year'.
group_by(month=lubridate::floor_date(date, "month")) %>%
summarize(nevents = n())
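To see what floor_date() does to a single value before grouping on it (the date is made up):

```r
library(lubridate)

d <- as.Date("2021-08-17")
floor_date(d, "month")  # rounds down to the first day of the month
floor_date(d, "week")   # rounds down to the start of the containing week
```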
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.1 ◊ lubridate: 1.7.9
Left join multiple tables using purrr::reduce()
. The library
purrr (functional programming tools) is
loaded by the standard import of tidyverse.
# Tables (tbl_a, etc.) share the common variable "date"
merged_tbl <- list(tbl_a, tbl_b, tbl_c, tbl_d) %>%
purrr::reduce(left_join, by = "date")
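A self-contained sketch of the same recipe with toy tables (all names are illustrative):

```r
library(dplyr)
library(purrr)

# Three toy tables sharing the key variable "date"
tbl_a <- tibble(date = as.Date("2021-08-01") + 0:2, a = 1:3)
tbl_b <- tibble(date = as.Date("2021-08-01") + 0:2, b = 4:6)
tbl_c <- tibble(date = as.Date("2021-08-01") + 0:2, c = 7:9)

# reduce() folds left_join over the list: ((tbl_a ⋈ tbl_b) ⋈ tbl_c)
merged_tbl <- list(tbl_a, tbl_b, tbl_c) %>%
  reduce(left_join, by = "date")
# merged_tbl has columns date, a, b, c
```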
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.1 ◊ purrr: 0.3.4
Apply the same filter to a list of tables and then join them. The recipe proceeds in two steps:
1. Use lapply to apply the same filtering function to each dataset in a list of datasets. We use lapply rather than some other apply function (e.g. vapply) because it returns a list, which is required by Step 2.
2. left_join() the filtered datasets using purrr::reduce() on the list. See the recipe left join multiple dataframes from a list.
# datasets have the variables DAY_DATE, GROUP_NAME, and N_COUNT
datasets <- list(dataset1, dataset2, dataset3, dataset4)
dataset_names <- list("name1", "name2", "name3", "name4")
group_name <- "my_group"
# Filter out all groups except the group of interest, group_name
# Use a lambda function to do the filtering
datasets_joined <- lapply(
  datasets,
  function(dataset){
    dataset %>%
      filter(GROUP_NAME == group_name)}
) %>%
  # Join the datasets
  purrr::reduce(full_join, by = "DAY_DATE") %>%
  # Optionally put data in longer format
  pivot_longer(-DAY_DATE, names_to = "Dataset", values_to = "N_COUNT")
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.1 ◊ purrr: 0.3.4
Use do.call to row bind (rbind) each element in the processed list. The method do.call(FCN, ARGS_LIST) executes the function FCN, passing it the arguments in the list ARGS_LIST. FCN can be either the function object or a character string naming the function ("FCN"). do.call unnests each element of the list, allowing rbind to operate on a set of dataframes rather than on a single list argument holding them.
do.call("rbind",
# Apply function my_fn to each element in the list
lapply(list(tbl_a, tbl_b, tbl_c, tbl_d), my_fn))
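A base-R-only sketch of the pattern; my_fn here is a stand-in for any per-table processing function:

```r
# Toy tables and a stand-in processing function that adds a squared column
tbl_a <- data.frame(x = 1:2)
tbl_b <- data.frame(x = 3:4)
my_fn <- function(tbl) transform(tbl, x2 = x^2)

# lapply returns a list of processed data.frames; do.call unnests the list
# so rbind sees them as individual arguments
stacked <- do.call("rbind", lapply(list(tbl_a, tbl_b), my_fn))
# stacked is a single 4-row data.frame with columns x and x2
```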
⇧ Back to TOC ◊ R: 4.0.3
The base function expand.grid() can be used to create a dataframe from all possible combinations of the supplied vectors or factors.
I find this useful when creating a complete, multiple-variable index to LEFT JOIN onto incomplete observation data.
# Create a two-column index of date and household room to LEFT JOIN potentially
# incomplete temperature data.
daily_house_temp <- expand.grid(
# List of dates between 8/1/2021 and 9/1/2021 (inclusive)
date = seq(as.Date('2021-08-01'), as.Date('2021-09-01'), by="day"),
room = c('kitchen', 'bedroom', 'garage')
)
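The LEFT JOIN step can be sketched end to end with a smaller index and a made-up, incomplete readings table:

```r
library(dplyr)

# Complete two-column index: 3 dates x 2 rooms = 6 rows
# (stringsAsFactors = FALSE keeps 'room' as character for the join)
index <- expand.grid(
  date = seq(as.Date("2021-08-01"), as.Date("2021-08-03"), by = "day"),
  room = c("kitchen", "bedroom"),
  stringsAsFactors = FALSE
)

# Incomplete observations: only two kitchen readings exist
readings <- data.frame(
  date = as.Date(c("2021-08-01", "2021-08-02")),
  room = c("kitchen", "kitchen"),
  temp = c(21.5, 22.0)
)

# Every (date, room) combination is kept; temp is NA where no reading exists
complete_temps <- index %>% left_join(readings, by = c("date", "room"))
```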
⇧ Back to TOC ◊ R: 4.1.1
Use sapply() to apply a function over each element of a vector (the result is simplified to a vector or matrix where possible).
sapply(var1, FN)
sapply(var2, FN)
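A concrete base-R example:

```r
# Apply sqrt to each element; sapply simplifies the result to a numeric vector
roots <- sapply(c(1, 4, 9), sqrt)
# roots is c(1, 2, 3)
```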
⇧ Back to TOC ◊ R: 4.1.1
Substitute one or more values for other values in single dataframe variable. This can be done explicitly or by providing a list of mappings.
Use dplyr::recode()
Example 1: Find and replace observations for a single variable
# Explicit method
recoded_data <- original_data %>%
mutate(myvar = dplyr::recode(myvar,
oldvalue1 = newvalue1,
oldvalue2 = newvalue2,
oldvalue3 = newvalue3))
# List method. Use !!! (unquote-splice) to insert the mapping.
value_map <- c(oldvalue1 = newvalue1,
oldvalue2 = newvalue2,
oldvalue3 = newvalue3)
recoded_data <- original_data %>%
mutate(myvar = dplyr::recode(myvar, !!!value_map))
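A concrete run of both methods on a plain character vector (the values are made up):

```r
library(dplyr)

x <- c("a", "b", "c", "a")

# Explicit method; unmatched values ("c") pass through unchanged
recode(x, a = "apple", b = "banana")

# List method: splice a named mapping vector in with !!!
value_map <- c(a = "apple", b = "banana")
recoded <- recode(x, !!!value_map)
# recoded is c("apple", "banana", "c", "apple")
```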
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.0
Replace NA values for a single variable
Example: Replace NA values with 0 for a single variable
dataset %>% mutate(myvar = tidyr::replace_na(myvar, 0))
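The same call on a tiny made-up dataset:

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(myvar = c(1, NA, 3))

# replace_na() substitutes the given value for every NA in the column
cleaned <- dataset %>% mutate(myvar = replace_na(myvar, 0))
# cleaned$myvar is c(1, 0, 3)
```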
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.0
Both examples apply the outlier detection and replacement algorithm forecast::tsclean() to several variables (note that the forecast package has been superseded by the fable1 package, but fable doesn’t seem to include an updated version of tsclean()).
Example 1: Identify and replace outliers in a timeseries using
tsclean()
and mutate_at()
# dataset has variables 'date', `group_var`, 'incoming', 'outgoing', and
# 'difference'. Apply `forecast::tsclean` to all variables besides 'date' and
# `group_var`
dataset %>% mutate_at(vars(-date,-group_var), forecast::tsclean)
Example 2: Identify and replace outliers in a timeseries using
tsclean()
, mutate()
, and across()
# Apply `forecast::tsclean` to all variables besides 'date' and `group_var`
dataset %>%
mutate(across(
.cols = c(-date, -group_var),
.fns = forecast::tsclean
))
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.0
Uses the slider package.
# Symmetrical (centered) moving average
dataset %>%
mutate(across(
.cols = c(myvar_1, myvar_2),
# All arguments after `.fns` (in this example `slider::slide_dbl`) are
# passed on to that function
.fns = slider::slide_dbl, ~ mean(.x), .before = 3, .after = 3
))
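To see the windowing on its own, slide_dbl() can be called directly on a vector (values made up):

```r
library(slider)

# Centered window of up to 1 element before and after each position:
# windows are [2,4], [2,4,6], [4,6,8], [6,8]
smoothed <- slide_dbl(c(2, 4, 6, 8), mean, .before = 1, .after = 1)
# smoothed is c(3, 4, 6, 7)
```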
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.0 ◊ slider: 0.1.5
Example 1: Convert several variables to factors
Use the dplyr across()
function to apply the same transformation to
multiple columns.
factor_varnames <- c("myvar1", "myvar2", "myvar3")
df <- df %>%
mutate(across(.cols = all_of(factor_varnames), .fns = as.factor))
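The same pattern run end to end on a made-up tibble:

```r
library(dplyr)

df <- tibble(myvar1 = c("x", "y"), myvar2 = c("p", "q"), n = 1:2)
factor_varnames <- c("myvar1", "myvar2")

# all_of() selects columns from an external character vector; columns not
# named in factor_varnames (here 'n') are left untouched
df <- df %>% mutate(across(all_of(factor_varnames), as.factor))
```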
⇧ Back to TOC ◊ R: 4.1.0 ◊ dplyr: 1.0.6
Using diff() directly won’t work because diff() returns a vector one element shorter than the input. Use the lag() function instead.
dataset %>%
mutate(dmyvar = myvar - lag(myvar))
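On a tiny made-up dataset, this shows why lag() fits better than diff(): the NA padding keeps the result the same length as the input column:

```r
library(dplyr)

dataset <- data.frame(myvar = c(10, 13, 19))

d <- dataset %>% mutate(dmyvar = myvar - lag(myvar))
# d$dmyvar is c(NA, 3, 6): lag() shifts myvar down by one, padding with NA
```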
⇧ Back to TOC ◊ R: 4.0.2 ◊ dplyr: 1.0.0
The fable package “provides methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling. Data, model and forecast objects are all stored in a tidy format.” (excerpt from fable documentation) The difference with respect to the forecast package is the second sentence: all fable objects are stored in a tidy format. ↩