While base R is powerful, its syntax can be verbose and inconsistent for everyday data manipulation. The tidyverse offers a suite of packages that work seamlessly together, providing a coherent and intuitive framework for your workflow.
Instead of installing individual packages like dplyr or tidyr separately, the tidyverse metapackage installs the core packages and recommended dependencies all at once:
if (!requireNamespace("tidyverse", quietly =TRUE)) {install.packages("tidyverse")}library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Key Advantages:
Readable Syntax: Tidyverse functions replace cumbersome expressions (e.g., test_df[test_df[["column"]] == "value", ]) with cleaner, more intuitive code (see the comparison after this list).
Pipe-Friendly: Designed with the data object as the first argument, these functions work seamlessly with the pipe operator for streamlined chaining.
Consistent Interfaces: Uniform parameter names and positions across functions reduce confusion and help prevent errors.
Predictable Behavior: Standardized return types and design make outcomes more reliable and debugging easier.
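As a small illustration of the first two points, compare base R subsetting with the dplyr equivalent; test_df below is a made-up dataframe used purely for demonstration.

# A made-up dataframe for illustration
test_df <- data.frame(
  column = c("value", "other", "value"),
  score = c(1, 2, 3)
)

# Base R subsetting
test_df[test_df[["column"]] == "value", ]

# Tidyverse equivalent: pipe the dataframe into dplyr::filter()
test_df |> dplyr::filter(column == "value")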
At the heart of many analyses is the dataframe, and the tidyverse is built to simplify working with and transforming dataframes effectively.
8.2 Displaying Dataframes in R
Working with large dataframes can be challenging if you inadvertently print all rows to the console. Consider the following example:
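Suppose we have a dataframe with 10,000 rows: 500 participant IDs, each measured at multiple timepoints, together with a disease status column. The exact construction below is illustrative; any sufficiently large dataframe will do.

set.seed(1)
# An illustrative large dataframe: 500 participants, each with 20 timepoints
example_df <- data.frame(
  id = rep(paste0("id_", seq_len(500)), each = 20),
  timepoint = rep(1:20, times = 500),
  disease_status = sample(
    c("healthy", "diseased", "Healthy", NA),
    size = 500 * 20,
    replace = TRUE
  )
)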
Printing the entire dataframe (e.g., simply typing example_df) would display all rows, which is both inconvenient and time-consuming. Instead, we typically use the head() function to preview just the first few rows:
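For example (head() shows the first six rows by default):

head(example_df)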
However, constantly having to call head() is not ideal. A better solution is to convert the dataframe into a tibble.
8.2.1 Enhancing Dataframe Display with tibble
The tibble package (a core part of the tidyverse) offers a more concise and informative display. When you print a tibble, it only shows the first 10 rows and includes data type information for each column. This is particularly useful because it lets you quickly verify, for example, whether timepoint is numeric or character—information that can be obscured in the default dataframe printout.
8.2.1.1 Installation and Conversion
First, ensure that the tibble package is installed and loaded:
if (!requireNamespace("tibble", quietly =TRUE)) {install.packages("tibble")}library(tibble)
Convert your dataframe to a tibble using as_tibble():
# Check the class before conversion
class(example_df)
[1] "data.frame"
# Convert to tibble
example_df <- as_tibble(example_df)

# Check the class after conversion
class(example_df)
[1] "tbl_df" "tbl" "data.frame"
Now, simply typing example_df will display a neat summary of your data:
This tidy display helps you catch issues like typos or unexpected data types early on. I generally load both tibble and ggplot2 at the top of my scripts, and you can also create new dataframes directly as tibbles:
new_tbl <- tibble(x = 1:5, y = rnorm(5))
new_tbl
# A tibble: 5 × 2
x y
<int> <dbl>
1 1 1.17
2 2 1.43
3 3 -0.406
4 4 1.37
5 5 0.844
Note that for a tibble object (i.e. a dataframe that has class tbl_df) to display as a tibble, the tibble package needs to be attached. So always add library(tibble) to the top of your script.
A final, very nice feature of tibbles is that selecting one column using df[, "column"] will return a tibble, rather than a vector. This is typically what we would expect:
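A quick illustration (df and tbl below are small throwaway objects):

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

# A base R dataframe drops to a vector when a single column is selected
class(df[, "x"])
# [1] "integer"

# A tibble stays a tibble
class(tbl[, "x"])
# [1] "tbl_df"     "tbl"        "data.frame"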
Whilst tibbles provide a more informative display, they still only show the first few rows of each column.
Another useful tool is the view_cols function from the UtilsDataRSV package. This function displays unique entries for each column—always showing any missing values (NAs)—so you can quickly identify anomalies such as typos or unexpected values.
To install the UtilsDataRSV package, use the following code:
if (!requireNamespace("remotes", quietly =TRUE)) {install.packages("remotes")}remotes::install_github("SATVILab/UtilsDataRSV")
Once installed, you can apply view_cols to your tibble to inspect the unique values in each column:
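Assuming the dataframe is the only required argument, the call is simply:

UtilsDataRSV::view_cols(example_df)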
Warning: Not all unique entries displayed for these non-numeric cols: id
view_cols is particularly helpful when you cannot easily inspect all the rows or columns of a dataframe. For example, with 10,000 rows, it’s impractical to scroll through the entire dataset to:
Identify Typos: In our example, the first ten rows may not reveal that disease_status contains three unique entries (e.g., "healthy", "diseased", and the typo "Healthy").
Detect Missing Data: It’s easy to overlook missing values (NAs).
Verify Expected Values: For instance, the id column should have 500 unique entries, and view_cols can help you confirm this.
For a possibly more polished alternative, consider exploring the skimr package, which (I think) offers similar functionality.
You can also display the dataframe on its side, using dplyr::glimpse():
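glimpse() prints one row per column, showing each column’s type and its first few values (again using example_df):

dplyr::glimpse(example_df)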
8.3 The pipe operator
The pipe operator |> passes the object on its left into the function on its right as its first argument: x |> f(y) is equivalent to f(x, y). In other words, the object to the left of the pipe (x) becomes the first argument to the function on the right (f), and y becomes its second argument.
As a silly example, this:
test_vec <- 1:5
mean(test_vec[test_vec > 3], trim = 0.5)
[1] 4.5
is equivalent to this:
test_vec[test_vec > 3] |> mean(trim = 0.5)
[1] 4.5
In terms of f(x,y), f is mean, x is test_vec[test_vec > 3] and y is trim = 0.5.
The above examples are simple, and there is no great advantage to using the pipe operator here.
However, when you have chained operations, the pipe operator can make the code more readable. For a realistic example, see R4DS on the pipe, where they show a dramatic gain in readability from using the pipe. To run their example, you will need to have the flights dataset from the nycflights13 package and the tidyverse package attached.
To ensure this, first run the chunk below before running their example:
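A chunk along these lines (mirroring the installation pattern used earlier in this chapter) is sufficient:

if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}
library(tidyverse)
data(flights, package = "nycflights13")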
Their example involves many functions that we’ll discuss in the rest of this chapter.
8.4 Working with rows and columns
The dplyr package provides a suite of functions for manipulating dataframes, including selecting rows and columns, creating new columns, and summarizing data.
There is little point in re-writing excellent content, so I refer you to the R4DS chapter on dplyr for a comprehensive introduction to the package.
8.4.1 Summary
Here is a concise summary of the dplyr section of R4DS. For more examples, refer to the excellent examples within each function’s help file (e.g. ?filter).
To run the examples below, you will need to attach the nycflights13 package and the tidyverse package (or just the dplyr package).
if (!requireNamespace("dplyr", quietly =TRUE)) {install.packages("dplyr")}library(dplyr)data(flights, package ="nycflights13")flights
The grouping and summarising functions split data into groups (group_by()) and summarise data within those groups (summarise()).
group_by(): Splits the data into groups based on one or more columns. The grouping is not visible and does not create multiple dataframes; see summarise() below for how the groups are used.
summarise(): Collapses each group into a single row, computing summary statistics per group (see the sketch after this list).
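For example, a minimal sketch using the flights data loaded above, counting flights and computing the mean arrival delay per destination:

flights |>
  group_by(dest) |>
  summarise(
    n = n(),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )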
8.5 Tidy data
A key concept in data analysis is that of “tidy data”. A dataset is tidy when:
Each variable is in its own column.
Each observation is in its own row.
Each cell contains a single value.
A consistent data structure simplifies analysis, lets you leverage vectorized operations, and makes it easier to use tidyverse functions.
The tidyr package provides convenient tools for tidying data, such as pivot_longer() and pivot_wider().
As before, I refer you to the R4DS chapter on tidyr for a comprehensive introduction to the package.
8.5.1 Summary
Here is a concise summary of the key content in the tidyr section of R4DS. For more examples, refer to the excellent examples within each function’s help file (e.g. ?pivot_longer).
To run the examples below, you will need to attach the tidyr package and load the billboard and cms_patient_experience datasets.
if (!requireNamespace("tidyr", quietly =TRUE)) {install.packages("tidyr")}library(tidyr)data(billboard, package ="tidyr")data(cms_patient_experience, package ="tidyr")
pivot_longer():
Converts data from wide to long format by gathering multiple columns into key-value pairs (results in fewer columns, more rows).
Here is the billboard data before the transformation:
Here is the billboard data after the transformation:
billboard |>
  pivot_longer(
    cols = starts_with("wk"),  # columns to pivot (display along rows)
    names_to = "week",         # new column for the column names
    values_to = "rank",        # new column for the values
    values_drop_na = TRUE      # drop rows with NA values
  )
pivot_wider():
Transforms long data to wide format by spreading key-value pairs across columns (results in more columns, fewer rows).
Here is the cms_patient_experience data before the transformation:
cms_patient_experience
# A tibble: 500 × 5
org_pac_id org_nm measure_cd measure_title prf_rate
<chr> <chr> <chr> <chr> <dbl>
1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 63
2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 87
3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 86
4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 57
5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 85
6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 24
7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 59
8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 85
9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 63
# ℹ 490 more rows
Here is the cms_patient_experience data after the transformation:
cms_patient_experience |>
  pivot_wider(
    id_cols = c("org_pac_id", "org_nm"),  # columns to keep as identifiers
    names_from = measure_cd,              # column to spread (unique entries become columns)
    values_from = prf_rate                # column to use for values (values become cell contents)
  )
8.6 Joins
Data analysis typically involves combining multiple data frames. Joins let you connect tables using shared keys, which can be primary keys (unique identifiers in one table) or foreign keys (variables that reference primary keys in another table).
8.6.2 Summary
Again, here is a concise summary of the R4DS section on joins. More examples are available in the help files for each function (e.g. ?left_join).
To run the examples below, attach the dplyr package and create and load the following datasets:
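A setup chunk along the following lines works; flights2 is the cut-down version of flights used in R4DS, while airlines, planes, and airports come directly from the nycflights13 package.

library(dplyr)
library(nycflights13)

flights2 <- flights |>
  select(year, time_hour, origin, dest, tailnum, carrier)

flights2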
# A tibble: 336,776 × 6
year time_hour origin dest tailnum carrier
<int> <dttm> <chr> <chr> <chr> <chr>
1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA
2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA
3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA
4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6
5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL
6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6
8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV
9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6
10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA
# ℹ 336,766 more rows
airlines
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
planes
# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
8.6.2.1 Mutating Joins
Mutating joins add new columns from one data frame to another based on matching key values. They share a common interface:
left_join(): Keeps all rows from the left table and adds matching columns from the right table.
# Add full airline names to the flights2 data
flights2 |> left_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 7
year time_hour origin dest tailnum carrier name
<int> <dttm> <chr> <chr> <chr> <chr> <chr>
1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 JetBlue Airways
8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV ExpressJet Airlines I…
9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 JetBlue Airways
10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
inner_join(): Keeps only rows with matching keys in both tables.
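Here, df1 and df2 are two small tibbles in which the key value 2 appears twice in each; this construction matches the R4DS example that produces the warning shown below.

df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))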
# Only keep rows where both x and y have a matching key
df1 |> inner_join(df2)
Joining with `by = join_by(key)`
Warning in inner_join(df1, df2): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
In the example above, dplyr emits a warning because two rows in df1 (those with key value 2) each have two matches in df2 (the rows with key value 2). Often this indicates an error (your data isn’t what you expect it to be), but it is the example R4DS shows.
Specifying Keys:
By default, join functions match on columns with the same name. Use join_by() to specify different keys:
flights2 |> left_join(planes, join_by(tailnum))
# A tibble: 336,776 × 14
year.x time_hour origin dest tailnum carrier year.y type
<int> <dttm> <chr> <chr> <chr> <chr> <int> <chr>
1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing mu…
2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing mu…
3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing mu…
4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing mu…
5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing mu…
6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing mu…
7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 2000 Fixed wing mu…
8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV 1998 Fixed wing mu…
9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 2004 Fixed wing mu…
10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA NA <NA>
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
# seats <int>, speed <int>, engine <chr>
8.6.2.2 Filtering Joins
Filtering joins select rows from one table based solely on whether they have a match in another table.
semi_join(): Keeps rows in x that have at least one match in y (does not add columns from y).
# Keep only origin airports that appear in flights2
airports |> semi_join(flights2, join_by(faa == origin))
# A tibble: 3 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York
anti_join(): Keeps rows in x that have no match in y.
# Find tail numbers in flights2 that are missing from planes
flights2 |>
  anti_join(planes, join_by(tailnum)) |>
  distinct(tailnum)
These joining functions provide a powerful and flexible way to integrate data from different sources, ensuring that your analyses are both comprehensive and accurate.