R Data Manipulation Cookbook

This document is a copy of my personal cookbook of R data manipulation and wrangling recipes for tasks I have had to repeatedly look up. Most recipes are built upon the tidyverse dplyr and tidyr packages.

Examples use the following variables:

  • mytbl: Arbitrary data.frame or tibble table with one or more variables
  • date (only for certain examples), var1, var2, etc.: arbitrary table variables
  • PROCESSING_STEPS: Any number of dplyr processing steps such as select(), mutate(), filter(), etc.

Each example subsection includes the package versions and R version used at the time of writing.

This document will be updated periodically.

Last updated: 2020-07-31


General lessons and observations

  • dplyr::filter() conflicts with many other filter() functions such as stats::filter(). It may be necessary to explicitly use the namespace, i.e. dplyr::filter() instead of filter().

Extract data

Get the first or last items (by datetime) per grouping variable(s)

# Get the first row per group defined by variables 'var1' and 'var2'
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%
  filter(row_number()==1)

# Get the last row defined by variables 'var1' and 'var2'
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%
  filter(row_number()==n())

R: 4.0.2 ◊ dplyr: 1.0.1

Extract a single single variable (vector) from a dplyr pipeline

Use case: extracting a vector at the end of a long processing chain, i.e. the end of a dplyr %>% pipeline. Otherwise, if we just need to extract a variable from a table, we have the obvious mytbl$var solution.

Method 1: Using dplyr::pull() (preferred)

mytbl %>% PROCESSING_STEPS %>% pull(var1)

Method 2: Using dplyr period (.) and the R extraction and replacement operator ($).

The dot (.) is commonly used as a placeholder to refer to the LHS in an processing step, e.g. x %>% f(y, .) is equivalent to x %>% f(y, x) and x %>% f(y, z, option = .) is equivalent to x %>% f(y, z, option = x). This is not the only use of the dot. For additional detail, see this StackOverflow response.

mytbl %>% PROCESSING_STEPS %>% .$var1

If we let mytbl_new <- mytbl %>% PROCESSING_STEPS, then the above is equivalent to mytbl_new$var1.

Method 3: Using magtrittr

require(magrittr)

mytbl %>% PROCESSING_STEPS %$% var1

Method 4: Use the R extration and replacement operator ($)

# Option 1
mytbl_new <- mytbl %>% PROCESSING_STEPS
mytbl_new$var1

# Option 2 (not preferred)
(mytbl %>% PROCSSING_STEPS)$myvar

R: 4.0.2 ◊ dplyr: 1.0.1 ◊ magrittr: 1.5


Group and summarize data

Group data by week, month, year

Use floor_date() from the lubridate package.

library(lubridate)

dataset %>%
  # We could substitute in "week" or "year". Other accepted units are
  # 'seconds', 'minute', 'hour', 'day', 'bimonth', 'quarter', 'season',
  # 'halfyear', and 'year'.
  group_by(month=lubridate::floor_date(date, "month")) %>%
  summarize(nevents = n())

R: 4.0.2 ◊ dplyr: 1.0.1 ◊ lubridate: 1.7.9


Combine data

Join multiple tables from a list

Left join multiple tables using purr::reduce(). The library purrr is loaded by the standard import of tidyverse.

# Tables (tbl_a, etc.) share the common variable "date"
merged_tbl <- list(tbl_a, tbl_b, tbl_c, tbl_d) %>%
  purrr::reduce(left_join, by = "date")

R: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4

Join multiple tables from a list after filtering

datasets <- list(dataset1, dataset2, dataset3, dataset4)
dataset_names <- list("name1", "name2", "name3", "name4")

group_name <- "my_group"

# Table has variables DAY_DATE, GROUP_NAME, and N_COUNT
datasets_joined <- lapply(
  datasets,
  function(dataset){
    dataset %>%
    filter(GROUP_NAME == "my_group")}
  ) %>%

  purrr::reduce(full_join, by = "DAY_DATE") %>%
  # Optionally put data in longer format
  pivot_longer(-day_date, names_to = "Dataset", values_to = "N_COUNT")

R: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4


Process table data

Re-map values for a single variable

Replace characters and strings

Use dplyr::recode()

Example 1: Find and replace string entries for a single variable

# Define the mapping with the old value we are replacing on the left side of
# each named vector and its replacement value on the right side
valuemap <- c("old_char_obs_1" = "New_Value_1",
              "old_char_obs_2" = "New_Value_2",
              "old_char_obs_3" = "New_Value_3",
              "old_char_obs_4" = "New_Value_4")

# 'myvar' is the variable we want to recode. Use !!! (unquote-splice) to insert
# the mapping.
recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar, !!!value_map))
  mutate(myvar = dplyr::recode(myvar, !!!value_map))

Alternatively, we could have done (the !!! unravels the valuemap named vector into the below code):

recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar,
                   "old_char_obs_1" = "New_Value_1",
                   "old_char_obs_2" = "New_Value_2",
                   "old_char_obs_3" = "New_Value_3",
                   "old_char_obs_4" = "New_Value_4")

R: 4.0.2 ◊ dplyr: 1.0.0

Replace NA values

Example 1: Replace NA values with 0 for a single variable

dataset %>% mutate(myvar = replace_na(myvar, 0))

R: 4.0.2 ◊ dplyr: 1.0.0

Apply a function to several variables

Both examples apply outlier detection and replacement algorithm forecast::tsclean to several variables.

Example 1: Use mutate_at

# dataset has variables 'date', `group_var`, 'incoming', 'outgoing', and
# 'difference'. Apply `forecast::tsclean` to all variables besides 'date' and
# `group_var`
dataset %>% mutate_at(vars(-date,-group_var), forecast::tsclean)

Example 2: Use mutate and across

# Symmetrical moving average mean
dataset %>%
  mutate(across(
    .cols = c(-date, -group_var),
    .funs = forecast::tsclean 
  ))

R: 4.0.2 ◊ dplyr: 1.0.0

Apply a moving window

Apply a moving average

Use the slider package

# Symmetrical moving average mean
dataset %>%
  mutate(across(
    .cols = c(myvar_1, myvar_2),
    # All arguments after the function `.fns` (in this example
    # `slider::slide_dbl`) are passed to the function
    .fns = slider::slide_dbl, ~ mean(.x), .before = 3, .after = 3
  ))

dplyr: 1.0.0 ◊ slider: 0.1.5