Keith Baxelbaum, Rose Hartman, and
Alexis Zavez
Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI
2024-05-06
Link to join: https://bit.ly/chopRusers
Link to calendar: https://bit.ly/chopROfficeHours
This is the third talk in a new series called R102: MasteRing the Fundamentals
Next up: Data Types and Visualizations, June 3rd 12:00pm ET
Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/
Recent discovery of vulnerabilities in R.
Make sure to update to the most recent version (4.4.0).
Details at: https://nvd.nist.gov/vuln/detail/CVE-2024-2732
Have a funded research project and need some additional help from an experienced biostatistician or data scientist to analyze your data?
The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!
What we’re covering: . . .
What we’re not covering:
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham
The same data can be presented in various formats.
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham
The same data can be presented in various formats.
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham
The same data can be presented in various formats.
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham
The same data can be presented in various formats.
Consistent formatting helps:
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
The problem: The count column has data for both the cases and population variables!
Enter the pivot_wider
function!
pivot_wider
arguments:
values_from
: the name of the column with the data valuesnames_from
: the name of the column that identifies which variable is represented in the data column# A tibble: 6 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
You can use the “pipe” operator for this if you’re used to piping. The pipe (|>
) sends the value on the left of the pipe to serve as the first argument in the function on the right of the pipe.
Your turn!
Look in the reshaping_data_exercises.rmd
file to find your first coding challenge.
02:00
pivot_wider has lots of other functionality that we won’t cover today. This includes options for renaming things, pivoting multiple columns simultaneously, and filling in missing values with default values. To learn more:
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
The problem: The same variable (number of cases) is in multiple columns! Not just one observation per row!
Enter the pivot_longer
function!
pivot_longer
arguments:
cols
: The set of column names with the data that you want to lengthen.# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
# A tibble: 6 × 3
country name value
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Some QOL arguments availabile in pivot_longer!
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Your turn!
Look in the reshaping_data_exercises.rmd
file to find your next coding challenge.
03:00
pivot_longer has lots of other functionality that we won’t cover today. For example, there are arguments that help clean up names beyond what we’ve already seen, options to manipulate the data as you pivot, and options for what to do with missing values. To learn how to use these:
Sometimes multiple variables end up combined in your dataframe.
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
The problem: The rate column includes both cases and population variables!
Often idiosyncratic to your specific dataset, but a family of useful functions from tidyr:
separate_longer_delim
separate_longer_position
separate_wider_delim
separate_wider_position
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Which should use to get tidy data? Vote in the comments!
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
There’s another member of this family of functions for more complex separation needs: separate_wider_regex
pivot_wider
can tidy datasets where multiple variables are presented in the same column.pivot_longer
can tidy datasets where multiple observations of the same variable are in the same row.separate_
family of functions can tidy datasets where one cell has more than one piece of information.In the exercises R markdown file, look in the reshaping_data_exercises.Rmd
file at the capstone project to try to apply these tools to a more complex dataset, while having to write your code from scratch!