R data manipulation cookbook

About this document

This document is a copy of my personal cookbook of R data manipulation and wrangling recipes for tasks I have had to repeatedly look up. Most recipes are built upon the tidyverse dplyr and tidyr packages.

Examples use the following variables:

  • mytbl: Arbitrary data.frame or tibble table with one or more variables
  • date (only for certain examples), var1, var2, etc.: arbitrary table variables
  • PROCESSING_STEPS: Any number of dplyr processing steps such as select(), mutate(), filter(), etc.

Each example subsection includes the package versions and R version used at the time of writing.

Last updated: 2020-11-03

Table of contents

Click to expand table of contents

Filter and extract data from a dataframe

Get the first or last rows per grouping variable(s)

# Get the first row by date per groups defined by variables 'var1' and 'var2'. 
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%

# As above, but get the last row
mytbl %>%
  group_by(var1, var2) %>%
  arrange(date) %>%

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1

Extract a single single variable from a dplyr pipeline

Use case: Extract a vector at the end of a dplyr %>% pipeline (if we just need to extract a variable from a table, we have the obvious mytbl$var solution).

Method 1: Use dplyr::pull() (preferred)

mytbl %>% PROCESSING_STEPS %>% pull(var1)

Method 2: Use dplyr period (.) and the R extraction and replacement operator ($).

mytbl %>% PROCESSING_STEPS %>% .$var1

The dot (.) is commonly used as a placeholder to refer to the LHS in an processing step, e.g. x %>% f(y, .) is equivalent to x %>% f(y, x) and x %>% f(y, z, option = .) is equivalent to x %>% f(y, z, option = x). In the example above, the dot represents the table mytbl %>% PROCESSING_STEPS (the stuff to the LHS of the operator). For additional detail, see this StackOverflow response.

Method 3: Use the magtrittr exposition operator %$%


mytbl %>% PROCESSING_STEPS %$% var1

The exposition operator %$% exposes variables on the LHS of the operator to the RHS. For further information, see the magrittr vignette page.

Method 4: Use the R extration and replacement operator ($)

# Option 1
mytbl_new <- mytbl %>% PROCESSING_STEPS

# Option 2 (not preferred)
(mytbl %>% PROCSSING_STEPS)$myvar

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ magrittr: 1.5

Group and summarize dataframe data

Group dataframe data by week, month, year

Use floor_date() from the lubridate package.


dataset %>%
  # We could substitute in "week" or "year". Other accepted units are
  # 'seconds', 'minute', 'hour', 'day', 'bimonth', 'quarter', 'season',
  # 'halfyear', and 'year'.
  group_by(month=lubridate::floor_date(date, "month")) %>%
  summarize(nevents = n())

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ lubridate: 1.7.9

Combine and join dataframe data into a new dataframe

Left join multiple dataframes from a list

Method 1: Using purr::reduce()

Left join multiple tables using purr::reduce(). The library purrr is loaded by the standard import of tidyverse.

# Tables (tbl_a, etc.) share the common variable "date"
merged_tbl <- list(tbl_a, tbl_b, tbl_c, tbl_d) %>%
  purrr::reduce(left_join, by = "date")

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4

Left join multiple tables from a list after filtering

datasets <- list(dataset1, dataset2, dataset3, dataset4)
dataset_names <- list("name1", "name2", "name3", "name4")

group_name <- "my_group"

# Table has variables DAY_DATE, GROUP_NAME, and N_COUNT
datasets_joined <- lapply(
    dataset %>%
    filter(GROUP_NAME == "my_group")}
  ) %>%

  purrr::reduce(full_join, by = "DAY_DATE") %>%
  # Optionally put data in longer format
  pivot_longer(-day_date, names_to = "Dataset", values_to = "N_COUNT")

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.1 ◊ purr: 0.3.4

Process then concatenate (rbind) multiple dataframes from a list

Use method do.call. The method do.call(FCN, ADGS_LIST) executes function FCN and passes additional arguments ARGS_LIST, as a list, to it. FCN can either be the function or a character string naming the function "FCN".

do.call unnests each element of the list allowing rbind to operate on a set of dataframes not a set of dataframes as list elements.

do.call("rbind", lapply(list(tbl_a, tbl_b, tbl_c, tbl_d), my_fn))

⇧ Back to TOCR: 4.0.3

Replacing, modifying, and adding dataframe data

Replace characters and strings in a dataframe variable

Use dplyr::recode()

Example 1: Find and replace string entries for a single variable

# Define the mapping with the old value we are replacing on the left side of
# each named vector and its replacement value on the right side
valuemap <- c("old_char_obs_1" = "New_Value_1",
              "old_char_obs_2" = "New_Value_2",
              "old_char_obs_3" = "New_Value_3",
              "old_char_obs_4" = "New_Value_4")

# 'myvar' is the variable we want to recode. Use !!! (unquote-splice) to insert
# the mapping.
recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar, !!!value_map))

Alternatively, we could have done (the !!! unravels the valuemap named vector into the below code):

recoded_data <- original_data %>%
  mutate(myvar = dplyr::recode(myvar,
                   "old_char_obs_1" = "New_Value_1",
                   "old_char_obs_2" = "New_Value_2",
                   "old_char_obs_3" = "New_Value_3",
                   "old_char_obs_4" = "New_Value_4")

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Replace NA values in a dataframe variable

Example 1: Replace NA values with 0 for a single variable

dataset %>% mutate(myvar = replace_na(myvar, 0))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Apply a function and modify several dataframe variables

Both examples apply outlier detection and replacement algorithm forecast::tsclean to several variables.

Example 1: Use mutate_at

# dataset has variables 'date', `group_var`, 'incoming', 'outgoing', and
# 'difference'. Apply `forecast::tsclean` to all variables besides 'date' and
# `group_var`
dataset %>% mutate_at(vars(-date,-group_var), forecast::tsclean)

Example 2: Use mutate and across

# Symmetrical moving average mean
dataset %>%
    .cols = c(-date, -group_var),
    .funs = forecast::tsclean 

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0

Apply a moving average to a dataframe variable

Use the slider package

# Symmetrical moving average mean
dataset %>%
    .cols = c(myvar_1, myvar_2),
    # All arguments after the function `.fns` (in this example
    # `slider::slide_dbl`) are passed to the function
    .fns = slider::slide_dbl, ~ mean(.x), .before = 3, .after = 3

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0 ◊ slider: 0.1.5

Calculate the derivative or difference

Using diff() directly won't work because diff() returns a vector of length 1 smaller than the input vector. Use the lag() function instead.

dataset %>%
  mutate(dmyvar = myvar - lag(myvar))

⇧ Back to TOCR: 4.0.2 ◊ dplyr: 1.0.0 slider: 0.1.5