# Create an example called x that contains either "female" or "male"
x <- c("female", "male", "male", "female")
print(x)
[1] "female" "male" "male" "female"
str(x)
chr [1:4] "female" "male" "male" "female"
Keith Baxelbaum, Rose Hartman, and
Alexis Zavez (presenter)
Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI
2024-06-03
Link to join: https://bit.ly/chopRusers
Link to calendar: https://bit.ly/chopROfficeHours
This is the fourth talk in a new series called R102: MasteRing the Fundamentals
Previous Talks:
Missing Values in R (March 2024)
Summary Statistics in R (April 2024)
Reshaping Data with tidyr (May 2024)
To watch previous talks or review slides:
https://arcus.github.io/r102/
There are several R functions that can return a variable’s type
Today we’ll focus on using str(), which displays the internal structure of an R object
Other options include class() and typeof() - read more about those here!
Character: each value is a string (e.g., “female”)
Factor: each value is a string, but the possible values are stored as levels within R
Numeric: each value is a real number
Logical: each value is either TRUE or FALSE
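A minimal sketch of the four types listed above, as seen through str(). The vectors here are made-up illustrations (only the "female"/"male" values come from the slides), and the class()/typeof() calls are included just to show how they differ from str() for a factor.

```r
# Toy vectors, one per data type (values are illustrative only)
chr_vec <- c("female", "male", "male", "female")  # character
fct_vec <- factor(chr_vec)                        # factor with levels female, male
num_vec <- c(0.8, 4, 7.5)                         # numeric
lgl_vec <- c(TRUE, FALSE, TRUE)                   # logical

# str() reports the type alongside a preview of the values
str(chr_vec)
str(fct_vec)
str(num_vec)
str(lgl_vec)

# class() and typeof() answer slightly different questions:
# a factor has class "factor" but is stored as integer codes
class(fct_vec)
typeof(fct_vec)
```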
Character and factor data types look similar:
[1] "female" "male" "male" "female"
[1] female male male female
Levels: female male
summary() function!)
For more on the summary() function, check out materials from our earlier talk on Summary Statistics in R, available here
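Presumably the point about summary() is that it treats the two types differently even though they print alike. A small sketch, using the same example values as above (the object names x_chr and x_fct are just for illustration):

```r
x_chr <- c("female", "male", "male", "female")  # character vector
x_fct <- as.factor(x_chr)                       # same values stored as a factor

# For a character vector, summary() only describes the vector itself
# (its length, class, and mode)
summary(x_chr)

# For a factor, summary() counts how many values fall in each level
summary(x_fct)
```

Here the factor summary gives a count of 2 for each of the two levels, which is usually the more useful view for categorical data.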
Let’s see what happens when we try to store different data types in the same vector:
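A short sketch of what to expect, using made-up values: a vector can only hold one type, so c() quietly converts everything to the most flexible type present.

```r
# Mixing a number, a string, and a logical: everything becomes character
mixed_chr <- c(1, "female", TRUE)
str(mixed_chr)   # chr [1:3] "1" "female" "TRUE"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed_num <- c(1, TRUE, FALSE)
str(mixed_num)   # num [1:3] 1 1 0
```

R gives no warning when this happens, so it is worth checking with str() after combining values.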
Yes, using R’s functions like as.factor(), as.character(), and as.numeric():
[1] "female" "male" "male" "female"
chr [1:4] "female" "male" "male" "female"
[1] female male male female
Levels: female male
Factor w/ 2 levels "female","male": 1 2 2 1
[1] "female" "male" "male" "female"
chr [1:4] "female" "male" "male" "female"
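Output like the above is consistent with a round trip from character to factor and back. The sketch below is one way to produce it; the object names and the years_fct example are assumptions added for illustration, not taken from the slides.

```r
x <- c("female", "male", "male", "female")

x_fct <- as.factor(x)          # character -> factor
str(x_fct)

x_chr <- as.character(x_fct)   # factor -> character
str(x_chr)

# One pitfall worth knowing: as.numeric() on a factor returns the
# internal level codes, not the numbers shown in the labels
years_fct <- factor(c("2020", "2021", "2020"))
as.numeric(years_fct)                 # 1 2 1
as.numeric(as.character(years_fct))   # 2020 2021 2020
```

Converting through as.character() first is the usual way to recover the numbers a factor's labels represent.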
There are two general options for creating figures in R:
Option 1: Base R functions like plot()
Option 2: Using the tidyverse and functions like ggplot()
Base R figures are still a good option!
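For comparison with the tidyverse route, here is a minimal base R sketch of a colored scatterplot. The toy data frame is hypothetical; its column names are borrowed from the covid_testing dataset used in this talk, but the values are made up.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  pan_day        = c(4, 7, 7, 8, 8, 9),
  ct_result      = c(45, 45, 38.2, 45, 30.1, 45),
  drive_thru_ind = c(0, 1, 1, 1, 0, 0)
)

# Base R has no aesthetic mapping, so build the color vector by hand:
# indicator 0 -> blue, indicator 1 -> red
point_colors <- ifelse(toy$drive_thru_ind == 1, "red", "blue")

plot(
  x = toy$pan_day, y = toy$ct_result,
  col = point_colors, pch = 19,          # pch = 19 draws solid circles
  main = "CT Results by Pandemic Day",
  xlab = "Pandemic Day", ylab = "CT Result"
)

# The legend is also added manually
legend("bottomleft", legend = c("0", "1"), col = c("blue", "red"),
       pch = 19, title = "Drive Thru Indicator")
```

The trade-off is that base R needs no extra packages, while colors and legends have to be assembled by hand.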
Option 1: Work in the cloud: https://posit.cloud/content/7522885
Option 2: Work on your computer: https://github.com/arcus/r102
Note: ggplot2 is actually part of the tidyverse core set of packages
There’s a lot of helpful information (including examples and tutorials) on the package websites for each of the packages we’ll be using:
Only if needed:
Each R session:
In the console or in the data_types_and_viz_exercises.rmd file, run the following command:
# A tibble: 6 × 17
subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1412 jhezane westerling female 4 covid inpatient wa…
2 533 penny targaryen female 7 covid clinical lab
3 9134 grunt rivers male 7 covid clinical lab
4 8518 melisandre swyft female 8 covid clinical lab
5 8967 rolley karstark male 8 covid emergency de…
6 11048 megga karstark female 8 covid oncology day…
# ℹ 10 more variables: result <chr>, demo_group <chr>, age <dbl>,
# drive_thru_ind <dbl>, ct_result <dbl>, orderset <dbl>, payor_group <chr>,
# patient_class <chr>, col_rec_tat <dbl>, rec_ver_tat <dbl>
To learn more about this dataset:
From the help documentation:
This data set is from Amrom E. Obstfeld, who de-identified data on COVID-19 testing during 2020 at CHOP (Children’s Hospital of Pennsylvania). This data set contains data concerning testing for SARS-CoV2 via PCR as well as associated metadata. These data have been anonymized, time-shifted, and permuted.
To learn more about the medicaldata R package these data are published in, see the medicaldata package website, and note that the maintainers are always looking for more data contributions!
Your turn!
Look in the data_types_and_viz_exercises.rmd file to find your first coding challenge.
str()
We can also use str() on the entire covid_testing dataset all at once:
spc_tbl_ [15,524 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ subject_id : num [1:15524] 1412 533 9134 8518 8967 ...
$ fake_first_name: chr [1:15524] "jhezane" "penny" "grunt" "melisandre" ...
$ fake_last_name : chr [1:15524] "westerling" "targaryen" "rivers" "swyft" ...
$ gender : chr [1:15524] "female" "female" "male" "female" ...
$ pan_day : num [1:15524] 4 7 7 8 8 8 9 9 9 9 ...
$ test_id : chr [1:15524] "covid" "covid" "covid" "covid" ...
$ clinic_name : chr [1:15524] "inpatient ward a" "clinical lab" "clinical lab" "clinical lab" ...
$ result : chr [1:15524] "negative" "negative" "negative" "negative" ...
$ demo_group : chr [1:15524] "patient" "patient" "patient" "patient" ...
$ age : num [1:15524] 0 0 0.8 0.8 0.8 0.8 0.8 0 0 0.9 ...
$ drive_thru_ind : num [1:15524] 0 1 1 1 0 0 1 0 1 1 ...
$ ct_result : num [1:15524] 45 45 45 45 45 45 45 45 45 45 ...
$ orderset : num [1:15524] 0 0 1 1 1 0 1 1 1 1 ...
$ payor_group : chr [1:15524] "government" "commercial" NA NA ...
$ patient_class : chr [1:15524] "inpatient" "not applicable" NA NA ...
$ col_rec_tat : num [1:15524] 1.4 2.3 7.3 5.8 1.2 1.4 2.6 0.7 1 7.1 ...
$ rec_ver_tat : num [1:15524] 5.2 5.8 4.7 5 6.4 7 4.2 6.3 5.6 7 ...
- attr(*, "spec")=
.. cols(
.. subject_id = col_double(),
.. fake_first_name = col_character(),
.. fake_last_name = col_character(),
.. gender = col_character(),
.. pan_day = col_double(),
.. test_id = col_character(),
.. clinic_name = col_character(),
.. result = col_character(),
.. demo_group = col_character(),
.. age = col_double(),
.. drive_thru_ind = col_double(),
.. ct_result = col_double(),
.. orderset = col_double(),
.. payor_group = col_character(),
.. patient_class = col_character(),
.. col_rec_tat = col_double(),
.. rec_ver_tat = col_double()
.. )
- attr(*, "problems")=<externalptr>
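The str() output above shows drive_thru_ind stored as num even though it only takes the values 0 and 1. A sketch of converting such an indicator column to a factor, using a hypothetical toy data frame; the "no"/"yes" labels are an assumption about what 0 and 1 mean, not something stated in the dataset documentation quoted here.

```r
# Hypothetical toy data with the same column names as covid_testing
toy <- data.frame(
  pan_day        = c(4, 7, 7, 8),
  drive_thru_ind = c(0, 1, 1, 0)
)

# Store the indicator as a factor with readable labels
# (assumed meaning: 0 = not drive-thru, 1 = drive-thru)
toy$drive_thru <- factor(toy$drive_thru_ind,
                         levels = c(0, 1),
                         labels = c("no", "yes"))

str(toy)  # drive_thru now shows as a Factor w/ 2 levels
```

Converting inside the data frame is an alternative to wrapping the column in as.factor() each time it is used in a plot.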
Click here for the full list of geom_ options
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         y = ct_result,  # y-axis variable
         color = as.factor(drive_thru_ind)  # color variable
       )
) +
  geom_point() +  # adds the points to the scatterplot
  scale_color_manual(values = c("0" = 'blue',  # sets colors for each level
                                "1" = 'red'))
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         y = ct_result,  # y-axis variable
         color = as.factor(drive_thru_ind)
       )
) +
  geom_point() +
  scale_color_manual(values = c("0" = 'blue',  # sets colors for each level
                                "1" = 'red')) +
  ggtitle("CT Results by Pandemic Day") +  # adds title to plot
  xlab("Pandemic Day") +  # changes x-axis label
  ylab("CT Result") +  # changes y-axis label
  labs(color = "Drive Thru Indicator")  # changes legend label
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         y = ct_result,  # y-axis variable
         color = as.factor(drive_thru_ind)
       )
) +
  geom_point() +
  scale_color_manual(values = c("0" = 'blue',  # sets colors for each level
                                "1" = 'red')) +
  ggtitle("CT Results by Pandemic Day") +  # adds title to plot
  xlab("Pandemic Day") +  # changes x-axis label
  ylab("CT Result") +  # changes y-axis label
  labs(color = "Drive Thru Indicator") +  # changes legend label
  theme_bw() +  # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5),  # centers plot title
        legend.position = "bottom")  # moves legend to bottom
Suppose we are interested in separating the points in the previous scatterplot based on patient gender
ggplot2’s facet_wrap() function provides an easy way to do this:
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         y = ct_result,  # y-axis variable
         color = as.factor(drive_thru_ind)
       )
) +
  geom_point() +
  scale_color_manual(values = c("0" = 'blue',  # sets colors for each level
                                "1" = 'red')) +
  ggtitle("CT Results by Pandemic Day") +  # adds title to plot
  xlab("Pandemic Day") +  # changes x-axis label
  ylab("CT Result") +  # changes y-axis label
  labs(color = "Drive Thru Indicator") +  # changes legend label
  theme_bw() +  # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5),  # centers plot title
        legend.position = "bottom") +  # moves legend to bottom
  facet_wrap(~gender)  # one panel per gender
Your turn!
Look in the data_types_and_viz_exercises.rmd file to find your second coding challenge.
scale_color_manual(): used to color lines and points
scale_fill_manual(): used to color fillable objects (e.g. histograms)
Note: We will also need to modify the code for the legend title from labs(color = "Label Text") to labs(fill = "Label Text")
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         fill = gender)  # note: using "fill" not "color"
) +
  geom_histogram(  # adds the histograms to the graph
    bins = 10,  # number of bins
    color = "black"  # color of bin outline
  )
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         fill = gender)  # note: using "fill" not "color"
) +
  geom_histogram(  # adds the histograms to the graph
    bins = 10,  # number of bins
    color = "black"  # color of bin outline
  ) +
  scale_fill_manual(values = c("female" = 'lightgrey',  # sets colors for each level
                               "male" = 'darkgrey'))
ggplot(data = covid_testing,  # dataset to use for plot
       mapping = aes(  # list of aesthetic mappings to use for plot
         x = pan_day,  # x-axis variable
         fill = gender)  # note: using "fill" not "color"
) +
  geom_histogram(  # adds the histograms to the graph
    bins = 10,  # number of bins
    color = "black"  # color of bin outline
  ) +
  scale_fill_manual(values = c("female" = 'lightgrey',  # sets colors for each level
                               "male" = 'darkgrey')) +
  ggtitle("Number of Patients by Pandemic Day") +  # adds title to plot
  xlab("Pandemic Day") +  # changes x-axis label
  ylab("Number of Patients") +  # changes y-axis label
  labs(fill = "Patient Gender") +  # changes legend label (note this says "fill")
  theme_bw() +  # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5),  # centers plot title
        legend.position = "bottom")  # moves legend to bottom
The ggsave() function is a convenient option for exporting a ggplot object:
There are lots of options regarding plot specifications like dimensions and resolution - learn more here!
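A minimal sketch of exporting a plot with ggsave(), assuming ggplot2 is installed. The toy data, the object name p, the file name, and the chosen dimensions are all illustrative assumptions; the file is written to a temporary directory so nothing in your project is overwritten.

```r
library(ggplot2)

# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  pan_day   = c(4, 7, 7, 8, 8, 9),
  ct_result = c(45, 45, 38.2, 45, 30.1, 45)
)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(data = toy, mapping = aes(x = pan_day, y = ct_result)) +
  geom_point() +
  theme_bw()

# Example output location; swap in your own path as needed
out_file <- file.path(tempdir(), "ct_by_pan_day.png")

ggsave(
  filename = out_file,  # file type is inferred from the extension
  plot     = p,         # defaults to the last plot displayed if omitted
  width    = 6,         # plot width
  height   = 4,         # plot height
  units    = "in",      # units for width and height
  dpi      = 300        # resolution
)

file.exists(out_file)
```

Passing the plot explicitly, rather than relying on the last plot displayed, tends to make scripts easier to rerun.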
Try to recreate this figure:
Don’t struggle in silence!
Solutions are available in data_types_and_viz_solutions.Rmd, but you’ll learn a lot more if you try it yourself first
Have funding for your research project and interested in working with an experienced biostatistician or data scientist to analyze your data?
The Data Science and Biostatistics Unit (DSBU) is DBHI’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!