R data processing cookbook


About this document

This document is a copy of my personal cookbook of R data processing and manipulation recipes. Most recipes are built upon the tidyverse dplyr and tidyr packages.

Examples use the following variables:

  • mytbl: Arbitrary data.frame or tibble table with one or more variables
  • date (only for certain examples), var1, var2, etc.: arbitrary table variables
  • PROCESSING_STEPS: Any number of dplyr processing steps such as select(), mutate(), filter(), etc.

Each example subsection includes the package versions and R version used at the time of writing.

Last updated: 08/17/2021

Table of contents

Click to expand table of contents



Filter and extract dataframe data

Get the first or last row per grouping variable(s)

Useful, for example, when we need the first or last event by date or time per grouping variable.

# Get the first row by date per groups defined by variables 'var1' and 'var2'. 
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%
  filter(row_number()==1)

# As above but get the last row
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%
  filter(row_number()==n())

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1

Extract a single single variable from a dplyr pipeline

Use case: Extract a vector at the end of a dplyr %>% pipeline (if we just need to extract a variable from a table, we have the obvious mytbl$var solution).

Method 1: Use dplyr::pull() (preferred)

mytbl %>% PROCESSING_STEPS %>% pull(var1)

Method 2: Use dplyr period (.) and the R extraction and replacement operator ($).

mytbl %>% PROCESSING_STEPS %>% .$var1

The dot (.) is commonly used as a placeholder to refer to the LHS in an processing step, e.g. x %>% f(y, .) is equivalent to x %>% f(y, x) and x %>% f(y, z, option = .) is equivalent to x %>% f(y, z, option = x). In the example above, the dot represents the table mytbl %>% PROCESSING_STEPS (the stuff to the LHS of the operator). For additional detail, see this StackOverflow response.

Method 3: Use the magtrittr exposition operator %$%

require(magrittr)

mytbl %>% PROCESSING_STEPS %$% var1

The exposition operator %$% exposes variables on the LHS of the operator to the RHS. For further information, see the magrittr vignette page.

Method 4: Use the R extraction and replacement operator ($)

# Option 1
mytbl_new <- mytbl %>% PROCESSING_STEPS
mytbl_new$var1

# Option 2 (not preferred)
(mytbl %>% PROCSSING_STEPS)$myvar

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ magrittr: 1.5


Group and summarize dataframe data

Group dataframe data by week, month, or year

Use floor_date() from the lubridate package.

library(lubridate)

dataset %>%
  # We could substitute in 'week' or 'year'. Other accepted units are
  # 'seconds', 'minute', 'hour', 'day', 'bimonth', 'quarter', 'season',
  # 'halfyear', and 'year'.
  group_by(month=lubridate::floor_date(date, "month")) %>%
  summarize(nevents = n())

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ lubridate: 1.7.9


Create and join dataframe data

Left join multiple dataframes from a list

Left join multiple tables using purr::reduce(). The library purrr (functional programming tools) is loaded by the standard import of tidyverse.

# Tables (tbl_a, etc.) share the common variable "date"
merged_tbl <- list(tbl_a, tbl_b, tbl_c, tbl_d) %>%
  purrr::reduce(left_join, by = "date")

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4

Left join multiple tables from a list after filtering

Apply the same filter to a list of tables and then join them. The recipe proceeds in two steps:

  1. Use lapply to apply the same filtering function to each dataset in a list of datasets. We use lapply rather than some other apply function (e.g. vapply) because it returns a list. A list is required by Step 2.
  2. Apply left_join() using purrr::reduce() on the list of filtered datasets. See recipe left join multiple dataframes from a list.
# datsets have the variables DAY_DATE, GROUP_NAME, and N_COUNT
datasets <- list(dataset1, dataset2, dataset3, dataset4)
dataset_names <- list("name1", "name2", "name3", "name4")

group_name <- "my_group"

# Filter out all groups except the group of interest, "my_group"
# Define a lambda function to do the filtering
datasets_joined <- lapply(
  datasets,
  function(dataset){
    dataset %>%
    filter(GROUP_NAME == "my_group")}
  ) %>%

  # Join the datasets
  purrr::reduce(full_join, by = "DAY_DATE") %>%
  # Optionally put data in longer format
  pivot_longer(-day_date, names_to = "Dataset", values_to = "N_COUNT")

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4

Process then concatenate multiple dataframes from a list

Use method do.call to row bind rbind each element in the processed list. The method do.call(FCN, ARGS_LIST) executes function FCN and passes to it additional arguments list ARGS_LIST. FCN can either be the function object or a character string naming the function "FCN".

do.call unnests each element of the list allowing rbind to operate on a set of dataframes and not a set of dataframes as list elements.

do.call("rbind",
        # Apply function my_fn to each element in the list
        lapply(list(tbl_a, tbl_b, tbl_c, tbl_d), my_fn))

⇧ Back to TOCR: 4.0.3

Create a dataframe from all possible combinations of array variables

The base function expand.grid() can be used to create a dataframe from all possible combinations the supplied vectors or factors.

I find this useful when creating a complete, multiple variable index to LEFT JOIN in incomplete observation data.

# Create a two-column index of date and household room to LEFT JOIN potentially
# incomplete temperature data.
daily_house_temp <- expand.grid(
    # List of dates between 8/1/2021 and 9/1/2021 (inclusive)
    date = seq(as.Date('2021-08-01'), as.Date('2021-09-01'), by="day"),
    room = c('kitchen', 'beddroom', 'garadge')
)

⇧ Back to TOCR: 4.1.1

Apply a function to a row or column

Use the sapply

sapply(var1, 1, FN)
sapply(var2, 2, FN)

⇧ Back to TOCR: 4.1.1


Calculate, replace, or modify add dataframe data

Re-code observations for a single variable

Substitute one or more values for other values in single dataframe variable. This can be done explicitly or by providing a list of mappings.

Use dplyr::recode()

Example 1: Find and replace observations for a single variable

# Explicit method
recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar,
                   oldvalue1 = newvalue1,
                   oldvalue2 = newvalue2,
                   oldvalue3 = newvalue3))

# List method. Use !!! (unquote-splice) to insert the mapping.
valuemap <- c(oldvalue1 = newvalue1,
              oldvalue2 = newvalue2,
              oldvalue3 = newvalue3)

recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar, !!!value_map))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Replace NA values for a single variable

Example: Replace NA values with 0 for a single variable

dataset %>% mutate(myvar = tidyr::replace_na(myvar, 0))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Apply a function to modify several variables

Both examples use outlier detection and replacement algorithm forecast::tsclean() to several variables (note that the forecast package has been superseded by the fable1 package but fable doesn’t seem to include an updated version of tsclean()).

Example 1: Identify and replace outliers in a timeseries using tsclean() and mutate_at()

# dataset has variables 'date', `group_var`, 'incoming', 'outgoing', and
# 'difference'. Apply `forecast::tsclean` to all variables besides 'date' and
# `group_var`
dataset %>% mutate_at(vars(-date,-group_var), forecast::tsclean)

Example 2: Identify and replace outliers in a timeseries using tsclean(), mutate(), and across()

# Symmetrical moving average mean
dataset %>%
  mutate(across(
    .cols = c(-date, -group_var),
    .funs = forecast::tsclean 
  ))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Apply a moving average to a variable

Uses the slider package.

# Symmetrical moving average mean
dataset %>%
  mutate(across(
    .cols = c(myvar_1, myvar_2),
    # All arguments after the function `.fns` (in this example
    # `slider::slide_dbl`) are passed to the function
    .fns = slider::slide_dbl, ~ mean(.x), .before = 3, .after = 3
  ))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0 ◊ slider: 0.1.5

Change the datatype of several variables

Example 1: Convert several variables to factors

Use the dplyr across() function to apply the same transformation to multiple columns.

factor_vanames <- c("myvar1", "myvar2", "myvar3")
df <- df %>% 
    mutate(across(.cols = factor_varnames, .fns = as.factor))

⇧ Back to TOCR: 4.1.0 ◊ dplyr: 1.0.6

Calculate the derivative or difference

Using diff() directly won’t work because diff() returns a vector of length 1 smaller than the input vector. Use the lag() function instead.

dataset %>%
  mutate(dmyvar = myvar - lag(myvar))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0 slider: 0.1.5


  1. The fable package “provides methods and tools for displaying and analysing univariate time series forecasts including exponential smoothing via state space models and automatic ARIMA modelling. Data, model and forecast objects are all stored in a tidy format.” (excerpt from fable documentation) The difference with respect to the forecast package is the second sentense: all fable objects are stored in a tidy format.