8 Working with dataframes

For an accompanying video, see df.mp4.

8.1 Motivation

While base R is powerful, its syntax can be verbose and inconsistent for everyday data manipulation. The tidyverse offers a suite of packages that work seamlessly together, providing a coherent and intuitive framework for your workflow.

Instead of installing individual packages like dplyr or tidyr separately, the tidyverse metapackage installs the core packages and recommended dependencies all at once:

if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Key Advantages:

Readable Syntax: Tidyverse functions replace cumbersome expressions (e.g., test_df[test_df[["column"]] == "value", ]) with cleaner, more intuitive code.
Pipe-Friendly: Designed with the data object as the first argument, these functions work seamlessly with the pipe operator for streamlined chaining.
Consistent Interfaces: Uniform parameter names and positions across functions reduce confusion and help prevent errors.
Predictable Behavior: Standardized return types and design make outcomes more reliable and debugging easier.

At the heart of many analyses is the dataframe, and the tidyverse is built to simplify working with and transforming dataframes effectively.

8.2 Displaying Dataframes in R

Working with large dataframes can be challenging if you inadvertently print all rows to the console. Consider the following example:

example_df <- data.frame(
  id = rep(paste0("id_", 1:500), each = 20),
  timepoint = rep(1:20, 500),
  disease_status = rep(
    sample(c("healthy", "diseased"), 20, replace = TRUE),
    each = 500
  ),
  blood_pressure = c(
    runif(20, 120, 140) |> round(),
    sample(c(NA, runif(19, 120, 140)), 9980, replace = TRUE)
  )
)
# Introduce a typo for demonstration
example_df[50, "disease_status"] <- "Healthy"

This dataframe contains 10,000 rows:

nrow(example_df)

[1] 10000

Printing the entire dataframe (e.g., simply typing example_df) would display all rows, which is both inconvenient and time-consuming. Instead, we typically use the head() function to preview just the first few rows:

head(example_df)

    id timepoint disease_status blood_pressure
1 id_1         1       diseased            121
2 id_1         2       diseased            133
3 id_1         3       diseased            137
4 id_1         4       diseased            140
5 id_1         5       diseased            127
6 id_1         6       diseased            131

However, constantly having to call head() is not ideal. A better solution is to convert the dataframe into a tibble.

8.2.1 Enhancing Dataframe Display with `tibble`

The tibble package (a core part of the tidyverse) offers a more concise and informative display. When you print a tibble, it only shows the first 10 rows and includes data type information for each column. This is particularly useful because it lets you quickly verify, for example, whether timepoint is numeric or character—information that can be obscured in the default dataframe printout.

8.2.1.1 Installation and Conversion

First, ensure that the tibble package is installed and loaded:

if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
library(tibble)

Convert your dataframe to a tibble using as_tibble():

# Check the class before conversion
class(example_df)

[1] "data.frame"

# Convert to tibble
example_df <- as_tibble(example_df)

# Check the class after conversion
class(example_df)

[1] "tbl_df"     "tbl"        "data.frame"

Now, simply typing example_df will display a neat summary of your data:

example_df

# A tibble: 10,000 × 4
   id    timepoint disease_status blood_pressure
   <chr>     <int> <chr>                   <dbl>
 1 id_1          1 diseased                  121
 2 id_1          2 diseased                  133
 3 id_1          3 diseased                  137
 4 id_1          4 diseased                  140
 5 id_1          5 diseased                  127
 6 id_1          6 diseased                  131
 7 id_1          7 diseased                  123
 8 id_1          8 diseased                  134
 9 id_1          9 diseased                  138
10 id_1         10 diseased                  135
# ℹ 9,990 more rows

This tidy display helps you catch issues like typos or unexpected data types early on. I generally load both tibble and ggplot2 at the top of my scripts, and you can also create new dataframes directly as tibbles:

new_tbl <- tibble(x = 1:5, y = rnorm(5))
new_tbl

# A tibble: 5 × 2
      x      y
  <int>  <dbl>
1     1  1.17 
2     2  1.43 
3     3 -0.406
4     4  1.37 
5     5  0.844

Note that for a tibble object (i.e. a dataframe that has class tbl_df) to display as a tibble, the tibble package needs to be attached. So always add library(tibble) to the top of your script.

A final, very nice feature of tibbles is that selecting one column using df[, "column"] will return a tibble, rather than a vector. This is typically what we would expect:

example_df[, "id"]

# A tibble: 10,000 × 1
   id   
   <chr>
 1 id_1 
 2 id_1 
 3 id_1 
 4 id_1 
 5 id_1 
 6 id_1 
 7 id_1 
 8 id_1 
 9 id_1 
10 id_1 
# ℹ 9,990 more rows

class(example_df[, "id"])

[1] "tbl_df"     "tbl"        "data.frame"

8.2.2 Exploring Data Columns with `view_cols`

Whilst tibbles provide a more informative display, they still only show the first few rows of each column.

Another useful tool is the view_cols function from the UtilsDataRSV package. This function displays unique entries for each column—always showing any missing values (NAs)—so you can quickly identify anomalies such as typos or unexpected values.

To install the UtilsDataRSV package, use the following code:

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("SATVILab/UtilsDataRSV")

Once installed, you can apply view_cols to your tibble to inspect the unique values in each column:

UtilsDataRSV::view_cols(example_df)

[1] "id"
 [1] "id_332" "id_488" "id_299" "id_15"  "id_48"  "id_97"  "id_131" "id_229"
 [9] "id_94"  "id_82"  "id_235" "id_140" "id_37"  "id_135" "id_306" "id_370"
[17] "id_426" "id_51"  "id_437" "id_449"
[1] "480 unique entries not displayed"
[1] "_____________________"
[1] "timepoint"
[1] 15  6  7 16 17
[1] "_____________________"
[1] "disease_status"
[1] "Healthy"  "diseased" "healthy" 
[1] "_____________________"
[1] "blood_pressure"
[1] 134.0000 124.3354 131.0000 130.4272       NA
[1] "_____________________"

Warning: Not all unique entries displayed for these non-numeric cols: id

view_cols is particularly helpful when you cannot easily inspect all the rows or columns of a dataframe. For example, with 10,000 rows, it’s impractical to scroll through the entire dataset to:

Identify Typos: In our example, the first ten rows may not reveal that disease_status contains three unique entries (e.g., "healthy", "diseased", and the typo "Healthy").
Detect Missing Data: It’s easy to overlook missing values (NAs).
Verify Expected Values: For instance, the id column should have 500 unique entries, and view_cols can help you confirm th

For a possibly more polished alternative, consider exploring the skimr package, which (I think) offers similar functionality.

You can also display the dataframe on its side, using dplyr::glimpse():

dplyr::glimpse(example_df)

Rows: 10,000
Columns: 4
$ id             <chr> "id_1", "id_1", "id_1", "id_1", "id_1", "id_1", "id_1",…
$ timepoint      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ disease_status <chr> "diseased", "diseased", "diseased", "diseased", "diseas…
$ blood_pressure <dbl> 121.0000, 133.0000, 137.0000, 140.0000, 127.0000, 131.0…

8.3 Pipe

Essentially, the pipe is a rewrite of code, from

f(x,y)

x |> f(y)

In other words, the object to the left of the pipe (x) becomes the first argument to the function on the right (f) and y becomes its second parameter.

As as silly example, this:

test_vec <- 1:5
mean(test_vec[test_vec > 3], trim = 0.5)

[1] 4.5

is equivalent to this:

test_vec[test_vec > 3] |>
  mean(trim = 0.5)

[1] 4.5

In terms of f(x,y), f is mean, x is test_vec[test_vec > 3] and y is trim = 0.5.

The above examples are simply, and there is no great advantage to the pipe operator.

However, when you have chained operations, the pipe operator can make the code more readable. For a realistic example, see R4DS on the pipe, where they show a dramatic gain in readability from using the pipe. To run their example, you will need to have the flights dataset from the nycflights13 package and the tidyverse package attached.

To ensure this, first run the chunk below before running their example:

pkg_vec <- c("tidyverse", "nycflights13", "tibble")
for (x in pkg_vec) {
  if (!requireNamespace(x, quietly = TRUE)) {
    install.packages(x)
  }
}
library(nycflights13)
library(tidyverse)
library(tibble)
data(flights, package = "nycflights13")

Their example involves many functions that we’ll discuss in the rest of this chapter.

8.4 Working with rows and columns

The dplyr package provides a suite of functions for manipulating dataframes, including selecting rows and columns, creating new columns, and summarizing data.

There is little point in re-writing excellent content, so I refer you to the R4DS chapter on dplyr for a comprehensive introduction to the package.

8.4.1 Summary

Here is a concise summary of the content on the dplyr section of R4DS. For more examples, refer the excellent examples within each function’s help file (e.g. ?filter).

To run the examples below, you will need to attach the nycflights13 package and the tidyverse package (or just the dplyr package).

if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)
data(flights, package = "nycflights13")
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

8.4.1.1 Rows

The rows functions select rows (filter), order rows (arrange), and select distinct rows (distinct).

filter(): Keeps rows that meet specified conditions.

flights |> filter(month == 1)

# A tibble: 27,004 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

arrange(): Reorders rows based on column values.

flights |> arrange(dep_delay)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     3     2     1431           1455       -24     1601           1631
10  2013     5     5      934            958       -24     1225           1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

distinct(): Returns unique rows or unique combinations of specified columns.

flights |> distinct(origin, dest)

# A tibble: 224 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 214 more rows

8.4.1.2 Columns

The columns functions create new columns (mutate), select columns (select), rename columns (rename), and relocate columns (relocate).

select(): Chooses and reorders specific columns.

flights |> select(year, month, day, dep_delay)

# A tibble: 336,776 × 4
    year month   day dep_delay
   <int> <int> <int>     <dbl>
 1  2013     1     1         2
 2  2013     1     1         4
 3  2013     1     1         2
 4  2013     1     1        -1
 5  2013     1     1        -6
 6  2013     1     1        -4
 7  2013     1     1        -5
 8  2013     1     1        -3
 9  2013     1     1        -3
10  2013     1     1        -2
# ℹ 336,766 more rows

Selecting a range:

flights |> select(year:day)

# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

Selecting all columns except some:

flights |> select(!year:arr_delay)

# A tibble: 336,776 × 10
   carrier flight tailnum origin dest  air_time distance  hour minute
   <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
 1 UA        1545 N14228  EWR    IAH        227     1400     5     15
 2 UA        1714 N24211  LGA    IAH        227     1416     5     29
 3 AA        1141 N619AA  JFK    MIA        160     1089     5     40
 4 B6         725 N804JB  JFK    BQN        183     1576     5     45
 5 DL         461 N668DN  LGA    ATL        116      762     6      0
 6 UA        1696 N39463  EWR    ORD        150      719     5     58
 7 B6         507 N516JB  EWR    FLL        158     1065     6      0
 8 EV        5708 N829AS  LGA    IAD         53      229     6      0
 9 B6          79 N593JB  JFK    MCO        140      944     6      0
10 AA         301 N3ALAA  LGA    ORD        138      733     6      0
# ℹ 336,766 more rows
# ℹ 1 more variable: time_hour <dttm>

mutate(): Creates new columns derived from existing ones.

flights |>
  mutate(speed = distance / air_time * 60) |>
  select(distance, air_time, speed)

# A tibble: 336,776 × 3
   distance air_time speed
      <dbl>    <dbl> <dbl>
 1     1400      227  370.
 2     1416      227  374.
 3     1089      160  408.
 4     1576      183  517.
 5      762      116  394.
 6      719      150  288.
 7     1065      158  404.
 8      229       53  259.
 9      944      140  405.
10      733      138  319.
# ℹ 336,766 more rows

rename(): Changes column names without removing other columns.

flights |> rename(YEAR = year)

# A tibble: 336,776 × 19
    YEAR month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

relocate(): Moves columns to a new position within the data frame.

flights |> relocate(carrier, flight)

# A tibble: 336,776 × 19
   carrier flight  year month   day dep_time sched_dep_time dep_delay arr_time
   <chr>    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1 UA        1545  2013     1     1      517            515         2      830
 2 UA        1714  2013     1     1      533            529         4      850
 3 AA        1141  2013     1     1      542            540         2      923
 4 B6         725  2013     1     1      544            545        -1     1004
 5 DL         461  2013     1     1      554            600        -6      812
 6 UA        1696  2013     1     1      554            558        -4      740
 7 B6         507  2013     1     1      555            600        -5      913
 8 EV        5708  2013     1     1      557            600        -3      709
 9 B6          79  2013     1     1      557            600        -3      838
10 AA         301  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 10 more variables: sched_arr_time <int>, arr_delay <dbl>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Moving the columns before a specified column:

flights |> relocate(carrier, flight, .before = year)

# A tibble: 336,776 × 19
   carrier flight  year month   day dep_time sched_dep_time dep_delay arr_time
   <chr>    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1 UA        1545  2013     1     1      517            515         2      830
 2 UA        1714  2013     1     1      533            529         4      850
 3 AA        1141  2013     1     1      542            540         2      923
 4 B6         725  2013     1     1      544            545        -1     1004
 5 DL         461  2013     1     1      554            600        -6      812
 6 UA        1696  2013     1     1      554            558        -4      740
 7 B6         507  2013     1     1      555            600        -5      913
 8 EV        5708  2013     1     1      557            600        -3      709
 9 B6          79  2013     1     1      557            600        -3      838
10 AA         301  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 10 more variables: sched_arr_time <int>, arr_delay <dbl>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

8.4.1.3 Groups & Summaries

The groups and summaries functions group data (group_by) and summarise data within groups (summarise).

group_by(): Splits the data into groups based on one or more columns. The grouping is not visible, and does not create multiple dataframes. See summarise below for how to use the groups.

flights |> group_by(month)

# A tibble: 336,776 × 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

summarise(): Computes summary statistics for each group.

flights |> group_by(month) |> summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 12 × 2
   month avg_delay
   <int>     <dbl>
 1     1     10.0 
 2     2     10.8 
 3     3     13.2 
 4     4     13.9 
 5     5     13.0 
 6     6     20.8 
 7     7     21.7 
 8     8     12.6 
 9     9      6.72
10    10      6.24
11    11      5.44
12    12     16.6

8.5 Changing the shape of data

A key concept in data analysis is that of “tidy data”. A dataset is tidy when:

Each variable is in its own column.
Each observation is in its own row.
Each cell contains a single value.

Consistent data structure simplifies analysis, leverages vectorized operations, and makes it easier to use tidyverse functions.

The tidyr package provides convenient tools for tidying data, such as pivot_longer() and pivot_wider().

As before, I refer you to the R4DS chapter on tidyr for a comprehensive introduction to the package.

8.5.1 Summary

Here is a concise summary of the key content on the tidyr section of R4DS. For more examples, refer the excellent examples within each function’s help file (e.g. ?pivot_longer).

To run the examples below, you will need to attach the tidyr package and load the billboard and cms_patient_experience datasets.

if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
library(tidyr)
data(billboard, package = "tidyr")
data(cms_patient_experience, package = "tidyr")

pivot_longer():

Converts data from wide to long format by gathering multiple columns into key-value pairs (results in fewer columns, more rows).

Here is the billboard data before the transformation:

billboard

# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

Here is the billboard data after the transformation:

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), # columns to pivot (display along rows)
    names_to = "week",  # new column for the column names
    values_to = "rank", # new column for the values
    values_drop_na = TRUE # drop rows with NA values
  )

# A tibble: 5,307 × 5
   artist  track                   date.entered week   rank
   <chr>   <chr>                   <date>       <chr> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
# ℹ 5,297 more rows

pivot_wider():

Transforms long data to wide format by spreading key-value pairs across columns (results in more columns, fewer rows).

Here is the cms_patient_experience data before the transformation:

cms_patient_experience

# A tibble: 500 × 5
   org_pac_id org_nm                           measure_cd measure_title prf_rate
   <chr>      <chr>                            <chr>      <chr>            <dbl>
 1 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       63
 2 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       87
 3 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       86
 4 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       57
 5 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       85
 6 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       24
 7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       59
 8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       85
 9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       63
# ℹ 490 more rows

Here is the cms_patient_experience data after the transformation:

cms_patient_experience |> 
  pivot_wider(
    id_cols = c("org_pac_id", "org_nm"), # columns to keep as identifiers
    names_from = measure_cd, # column to spread (unique entries become columns)
    values_from = prf_rate # column to use for values (values become cell contents)
  )

# A tibble: 95 × 8
   org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
   <chr>      <chr>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 0446157747 USC C…          63          87          86          57          85
 2 0446162697 ASSOC…          59          85          83          63          88
 3 0547164295 BEAVE…          49          NA          75          44          73
 4 0749333730 CAPE …          67          84          85          65          82
 5 0840104360 ALLIA…          66          87          87          64          87
 6 0840109864 REX H…          73          87          84          67          91
 7 0840513552 SCL H…          58          83          76          58          78
 8 0941545784 GRITM…          46          86          81          54          NA
 9 1052612785 COMMU…          65          84          80          58          87
10 1254237779 OUR L…          61          NA          NA          65          NA
# ℹ 85 more rows
# ℹ 1 more variable: CAHPS_GRP_12 <dbl>

Columns not specified in id_cols, names_from, or values_from are dropped (such as measure_nm in this example).

8.6 Joining dataframes

The dplyr package also provides functions for joining dataframes, such as left_join() and inner_join().

The full-length notes are in the R4DS chapter on joins.

8.6.1 Key Joining Concepts in dplyr

Data analysis typically involves combining multiple data frames. Joins let you connect tables using shared keys, which can be primary keys (unique identifiers in one table) and foreign keys (variables that reference primary keys in another).

8.6.2 Summary

Again, here is a concise summary of the R4DS content on the joins section. More examples are available in the help files for each function (e.g. ?left_join).

To run the examples below, attach the dplyr package and create and load the following datasets:

library(dplyr)
data("flights", package = "nycflights13")
flights2 <- flights |> 
  select(year, time_hour, origin, dest, tailnum, carrier)
data("airlines", package = "nycflights13")
data("planes", package = "nycflights13")
df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

Here is what these datasets look like:

flights2

# A tibble: 336,776 × 6
    year time_hour           origin dest  tailnum carrier
   <int> <dttm>              <chr>  <chr> <chr>   <chr>  
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6     
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV     
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6     
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA     
# ℹ 336,766 more rows

airlines

# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.

planes

# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

df1

# A tibble: 3 × 2
    key val_x
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     2 x3

df2

# A tibble: 3 × 2
    key val_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     2 y3

8.6.2.1 Mutating Joins

Mutating joins add new columns from one data frame to another based on matching key values. They share a common interface:

left_join(): Keeps all rows from the left table and adds matching columns from the right table.

# Add full airline names to flights2 data
flights2 |> left_join(airlines)

Joining with `by = join_by(carrier)`

# A tibble: 336,776 × 7
    year time_hour           origin dest  tailnum carrier name                  
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways       
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.  
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc. 
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue Airways       
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJet Airlines I…
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue Airways       
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American Airlines Inc.
# ℹ 336,766 more rows

inner_join(): Keeps only rows with matching keys in both tables.

# Only keep rows where both x and y have a matching key
df1 |> inner_join(df2)

Joining with `by = join_by(key)`

Warning in inner_join(df1, df2): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

# A tibble: 5 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3

In the example above, dplyr emits a warning as two rows in df1 (those with key value 2) have two matches in df2 (those with key value 2). Often, this is an error (your data isn’t what you expect it to be), but that is the example they showed.

Specifying Keys:

By default, join functions match on columns with the same name. Use join_by() to specify different keys:

flights2 |> left_join(planes, join_by(tailnum))

# A tibble: 336,776 × 14
   year.x time_hour           origin dest  tailnum carrier year.y type          
    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing mu…
 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing mu…
 7   2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6        2000 Fixed wing mu…
 8   2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV        1998 Fixed wing mu…
 9   2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6        2004 Fixed wing mu…
10   2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA          NA <NA>          
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

8.6.2.2 Filtering Joins

Filtering joins select rows from one table based solely on whether they have a match in another table.

semi_join(): Keeps rows in x that have at least one match in y (does not add columns from y).

# Keep only origin airports that appear in flights2
airports |> semi_join(flights2, join_by(faa == origin))

# A tibble: 3 × 8
  faa   name                  lat   lon   alt    tz dst   tzone           
  <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York

anti_join(): Keeps rows in x that have no match in y.

# Find tail numbers in flights2 that are missing from planes
flights2 |> anti_join(planes, join_by(tailnum)) |> distinct(tailnum)

# A tibble: 722 × 1
   tailnum
   <chr>  
 1 N3ALAA 
 2 N3DUAA 
 3 N542MQ 
 4 N730MQ 
 5 N9EAMQ 
 6 N532UA 
 7 N3EMAA 
 8 N518MQ 
 9 N3BAAA 
10 N3CYAA 
# ℹ 712 more rows

These joining functions provide a powerful and flexible way to integrate data from different sources, ensuring that your analyses are both comprehensive and accurate.