Reshaping Data with tidyr

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-05-06

Use keyboard arrow keys to
- advance ( → ) and
- go back ( ← )
Type “s” to see speaker notes
Type “?” to see other keyboard shortcuts

Join the CHOP R User Group

CHOPR hex sticker logo

Friendly help troubleshooting your R code
Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

Set up a meeting to get live help with your R code from our most experienced useRs
Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

We have regular office hours appointments available to get R help. We’re offering two kinds of appointments: one-on-one, if you want individual help, or open appointments that are more of a community conversation about whatever R questions people bring up that day. The open appointments can be particularly valuable to attend if you want to hear other people’s questions about R but maybe don’t have a project of your own that you’re troubleshooting at the moment. We set up a calendar online where you can book an appointment. [click] And if you like this idea and you’re an experienced R user yourself, consider joining us to offer appointments! It’s just whatever times work for you, so we can set up a schedule around whatever other work you’ve got going on. Reach out to me via email or slack for details.

Coming Soon - More R102 Sessions!

This is the third talk in a new series called R102: MasteRing the Fundamentals

Next up: Data Types and Visualizations, June 3rd 12:00pm ET

Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

Update your R!

Recent discovery of vulnerabilities in R.

Make sure to update to the most recent version (4.4.0).

Details at: https://nvd.nist.gov/vuln/detail/CVE-2024-2732

Shameless Plug

Have a funded research project and need some additional help from an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!

R 102:
Reshaping Data with tidyr

Today’s talk is a quick primer on some tools for data manipulation. If you’ve had a little exposure to R before, such as through an Intro to R for Clinical Data workshop, this is hopefully the right level for you now. If you’re completely brand new to R, first of all: Welcome! You may find it tricky to actively follow along with the code today since I’m going to skim over some of the initial steps, but go ahead and give it a try, or just listen and watch if that feels more like the right speed.

Our topic is how to turn your messy dataset or datasets into a well-structured, analysis-ready format. Data come to use in all manner of disarray, often due to reasons beyond our control – a collaborator loves hand-curated Excel spreadsheets, or REDCap spit out data with a zillion columns, or we have data from five different clinical sources that we need to somehow link. This talk will start to explore efficient ways to wrestle data into a consistent, easy-to-use format called tidy data. These are big tasks, and we only scratch the surface of how to accomplish them.

Please feel free to put any questions in the chat - we have a few additional team members who will try to answer your questions during the talk. If there is extra time, we can also stop and talk about things as a group.

Today’s plan!

What we’re covering: . . .

What does it mean to have tidy data?

How to reshape data and how to extract multiple observations from a single cell

What we’re not covering:

Manipulating data (e.g. imputing missing data, combining variables), creating new variables, joining multiple datasets

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table1

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

How should we structure our data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham

The same data can be presented in various formats.

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4b

# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

What is tidy data?

Variables = columns
Observations = rows
Values = cells; One value per cell!

Consistent formatting helps:

ensure functions operate consistently on our data (think TIDYverse)
ensure WE know what to expect in our data

Tidy data is a very specific, consistent way to format data that ensures we know exactly what to expect in our data. This is important and useful for multiple reasons. First, if data are consistently formatted the same way, then we can be confident in how functions will interact with these data. In the tidyverse (and increasingly other packages in R), the expectation is that data are in tidy format. This helps us get consistent output from the same functions!

Second, consistent data formatting helps us know what to expect from our data! If we always have one observation per row and one row per observation, then we can immediately know how to parse our data. If we know that each variable has its own column, then we know that we don’t have to try to unpack complex columns to understand what the data are.

Getting data into tidy format

What to do when multiple variables’ values are in a single column?
What to do when repeated measures (multiple observations of the same variable) are in the same row?
What to do when a single cell contains more than one observation?

Using pivot wider to spread data across more columns

Pivot wider

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

The problem: The count column has data for both the cases and population variables!

The first reshaping challenge we’ll address is when variables are specified in rows, rather than in columns. Recall that tidy data means that each variable has its own column, and rows are reserved for unique observations of these variables. But data don’t always come in this way.

Recall the structure of table2 from earlier (this table is pre-loaded as part of the tidyverse package if you want to play with it). In this table, the “type” column is used to identify which variable a row is talking about, and then the “count” column contains the value for that variable for that observation. This isn’t what we want for tidy data; in particular, we want the values for population and for cases for a given country in a given year to be on the same row – this is one “observation,” so they shouldn’t be separated.

A great hint that your data may have this issue is if a column with data includes different types of data or things on really different scales (e.g. here, the population rows are several orders of magnitude larger than the cases rows). In this case, we want to spread these variables out to be in their own columns rather than sharing a column.

Pivot wider

Enter the pivot_wider function!

library(tidyr)

df <- pivot_wider(table2, 
                  values_from = count,
                  names_from = type)

pivot_wider arguments:

Dataframe with the original, unpivoted data
values_from: the name of the column with the data values
names_from: the name of the column that identifies which variable is represented in the data column

Pivot wider

head(table2)

# A tibble: 6 × 4
  country      year type           count
  <chr>       <dbl> <chr>          <dbl>
1 Afghanistan  1999 cases            745
2 Afghanistan  1999 population  19987071
3 Afghanistan  2000 cases           2666
4 Afghanistan  2000 population  20595360
5 Brazil       1999 cases          37737
6 Brazil       1999 population 172006362

pivot_wider(table2, values_from = count, names_from = type)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Compatible with piping!

table2 |>
  pivot_wider(values_from = count, names_from = type)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

You can use the “pipe” operator for this if you’re used to piping. The pipe (|>) sends the value on the left of the pipe to serve as the first argument in the function on the right of the pipe.

Coding Challenge 1

Your turn!

Look in the reshaping_data_exercises.rmd file to find your first coding challenge.

02:00

Learn more

pivot_wider has lots of other functionality that we won’t cover today. This includes options for renaming things, pivoting multiple columns simultaneously, and filling in missing values with default values. To learn more:

Using pivot longer to gather observations of the same variable into rows

Pivot longer

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

The problem: The same variable (number of cases) is in multiple columns! Not just one observation per row!

Pivot longer

Enter the pivot_longer function!

df_cases <- pivot_longer(table4a, 
                  cols = `1999`:`2000`)

pivot_longer arguments:

Dataframe with the original, unpivoted data
cols: The set of column names with the data that you want to lengthen.

Pivot longer

head(table4a)

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

pivot_longer(table4a, cols = `1999`:`2000`)

# A tibble: 6 × 3
  country     name   value
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Pivot longer

Some QOL arguments availabile in pivot_longer!

head(table4a)

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

pivot_longer(table4a, cols = `1999`:`2000`, 
             names_to = "year", 
             values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Coding Challenge 2

Your turn!

Look in the reshaping_data_exercises.rmd file to find your next coding challenge.

03:00

Learn more

pivot_longer has lots of other functionality that we won’t cover today. For example, there are arguments that help clean up names beyond what we’ve already seen, options to manipulate the data as you pivot, and options for what to do with missing values. To learn how to use these:

Splitting multiple values contained in the same cell

Separating multiple variables in a single cell

Sometimes multiple variables end up combined in your dataframe.

head(table3)

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

The problem: The rate column includes both cases and population variables!

Sometimes our data come in in a way that we have multiple different variables in the same cell. This could be something like table 3 here, where rather than separating cases and population, they’re presented in a single cell as a ratio. This may be useful for some analyses, but not for others – what if we want to investigate population change irrespective of cases? What if we want to split data on the basis of population (small, medium, large)? For these (and other) uses, it’s helpful to have each variable on its own. We can always compute a ratio later!

One place this pops up a lot is things like a list of languages that a patient speaks. Rather than have this as a series of separate entries, it might be entered as a list with commas separating values. These lists can be a huge pain to work with! In situations like this, it’s often helpful to split the offending column of combined variables into separate columns. Let’s look into some options for doing so!

Separating multiple variables in a single cell

Often idiosyncratic to your specific dataset, but a family of useful functions from tidyr:

separate_longer_delim
separate_longer_position
separate_wider_delim
separate_wider_position

Separating multiple variables in a single cell

head(table3)

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Which should use to get tidy data? Vote in the comments!

separate_longer_delim
separate_longer_position
separate_wider_delim
separate_wider_position

Separating multiple variables in a single cell

head(table3)

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

separate_wider_delim(table3,
 cols = rate,
 delim = "/",
 names = c("cases", "population"))

# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Learn more

There’s another member of this family of functions for more complex separation needs: separate_wider_regex

What have we learned today?

There are various ways to store data; the tidy format is quite useful!

pivot_wider can tidy datasets where multiple variables are presented in the same column.

pivot_longer can tidy datasets where multiple observations of the same variable are in the same row.

The separate_ family of functions can tidy datasets where one cell has more than one piece of information.

Capstone project!

In the exercises R markdown file, look in the reshaping_data_exercises.Rmd file at the capstone project to try to apply these tools to a more complex dataset, while having to write your code from scratch!

Reshaping Data with tidyr

Join the CHOP R User Group

Come to R Office Hours!

Coming Soon - More R102 Sessions!

Update your R!

Shameless Plug

R 102: Reshaping Data with tidyr

Today’s plan!

How should we structure our data?

How should we structure our data?

How should we structure our data?

How should we structure our data?

What is tidy data?

Getting data into tidy format

Using pivot wider to spread data across more columns

Pivot wider

Pivot wider

Pivot wider

Compatible with piping!

Coding Challenge 1

Learn more

Using pivot longer to gather observations of the same variable into rows

Pivot longer

Pivot longer

Pivot longer

Pivot longer

Coding Challenge 2

Learn more

Splitting multiple values contained in the same cell

Separating multiple variables in a single cell

Separating multiple variables in a single cell

Separating multiple variables in a single cell

Separating multiple variables in a single cell

Learn more

What have we learned today?

Capstone project!

R 102:
Reshaping Data with tidyr