Reshaping Data with tidyr

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-05-06

  • Use keyboard arrow keys to
    • advance ( → ) and
    • go back ( ← )
  • Type “s” to see speaker notes
  • Type “?” to see other keyboard shortcuts

Join the CHOP R User Group

CHOPR hex sticker logo

  • Friendly help troubleshooting your R code
  • Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

  • Set up a meeting to get live help with your R code from our most experienced useRs
  • Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

Coming Soon - More R102 Sessions!

This is the third talk in a new series called R102: MasteRing the Fundamentals


Next up: Data Types and Visualizations, June 3rd 12:00pm ET


Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

Update your R!

Recent discovery of vulnerabilities in R.

Make sure to update to the most recent version (4.4.0).

Details at: https://nvd.nist.gov/vuln/detail/CVE-2024-2732

Shameless Plug

Have a funded research project and need some additional help from an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!

R 102:
Reshaping Data with tidyr

Today’s plan!

What we’re covering: . . .

  • What does it mean to have tidy data?
  • How to reshape data and how to extract multiple observations from a single cell

What we’re not covering:

  • Manipulating data (e.g. imputing missing data, combining variables), creating new variables, joining multiple datasets

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

What is tidy data?

  • Variables = columns
  • Observations = rows
  • Values = cells; One value per cell!

Consistent formatting helps:

  • ensure functions operate consistently on our data (think TIDYverse)
  • ensure WE know what to expect in our data

Getting data into tidy format

  • What to do when multiple variables’ values are in a single column?
  • What to do when repeated measures (multiple observations of the same variable) are in the same row?
  • What to do when a single cell contains more than one observation?

Using pivot wider to spread data across more columns

Pivot wider

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

The problem: The count column has data for both the cases and population variables!

Pivot wider

Enter the pivot_wider function!

library(tidyr)

df <- pivot_wider(table2, 
                  values_from = count,
                  names_from = type)

pivot_wider arguments:

  • Dataframe with the original, unpivoted data
  • values_from: the name of the column with the data values
  • names_from: the name of the column that identifies which variable is represented in the data column

Pivot wider

head(table2)
# A tibble: 6 × 4
  country      year type           count
  <chr>       <dbl> <chr>          <dbl>
1 Afghanistan  1999 cases            745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000 cases           2666
4 Afghanistan  2000 population  20595360
5 Brazil       1999 cases          37737
6 Brazil       1999 population 172006362
pivot_wider(table2, values_from = count, names_from = type)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Compatible with piping!

table2 |>
  pivot_wider(values_from = count, names_from = type)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

You can use the “pipe” operator for this if you’re used to piping. The pipe (|>) sends the value on the left of the pipe to serve as the first argument in the function on the right of the pipe.

Coding Challenge 1

Your turn!

Look in the reshaping_data_exercises.rmd file to find your first coding challenge.

02:00

Learn more

pivot_wider has lots of other functionality that we won’t cover today. This includes options for renaming things, pivoting multiple columns simultaneously, and filling in missing values with default values. To learn more:

Using pivot longer to gather observations of the same variable into rows

Pivot longer

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

The problem: The same variable (number of cases) is in multiple columns! Not just one observation per row!

Pivot longer

Enter the pivot_longer function!

df_cases <- pivot_longer(table4a, 
                  cols = `1999`:`2000`)

pivot_longer arguments:

  • Dataframe with the original, unpivoted data
  • cols: The set of column names with the data that you want to lengthen.

Pivot longer

head(table4a)
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
pivot_longer(table4a, cols = `1999`:`2000`)
# A tibble: 6 × 3
  country     name   value
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Pivot longer

Some QOL arguments availabile in pivot_longer!

head(table4a)
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
pivot_longer(table4a, cols = `1999`:`2000`, 
             names_to = "year", 
             values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Coding Challenge 2

Your turn!

Look in the reshaping_data_exercises.rmd file to find your next coding challenge.

03:00

Learn more

pivot_longer has lots of other functionality that we won’t cover today. For example, there are arguments that help clean up names beyond what we’ve already seen, options to manipulate the data as you pivot, and options for what to do with missing values. To learn how to use these:

Splitting multiple values contained in the same cell

Separating multiple variables in a single cell

Sometimes multiple variables end up combined in your dataframe.

head(table3)
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

The problem: The rate column includes both cases and population variables!

Separating multiple variables in a single cell

Often idiosyncratic to your specific dataset, but a family of useful functions from tidyr:

  • separate_longer_delim
  • separate_longer_position
  • separate_wider_delim
  • separate_wider_position

Separating multiple variables in a single cell

head(table3)
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Which should use to get tidy data? Vote in the comments!

  • separate_longer_delim
  • separate_longer_position
  • separate_wider_delim
  • separate_wider_position

Separating multiple variables in a single cell

head(table3)
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
separate_wider_delim(table3,
 cols = rate,
 delim = "/",
 names = c("cases", "population"))
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Learn more

There’s another member of this family of functions for more complex separation needs: separate_wider_regex

What have we learned today?

  • There are various ways to store data; the tidy format is quite useful!
  • pivot_wider can tidy datasets where multiple variables are presented in the same column.
  • pivot_longer can tidy datasets where multiple observations of the same variable are in the same row.
  • The separate_ family of functions can tidy datasets where one cell has more than one piece of information.

Capstone project!

In the exercises R markdown file, look in the reshaping_data_exercises.Rmd file at the capstone project to try to apply these tools to a more complex dataset, while having to write your code from scratch!