Data Types and Visualizations
in R

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez (presenter)

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-06-03


Join the CHOP R User Group


  • Friendly help troubleshooting your R code
  • Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

  • Set up a meeting to get live help with your R code from our most experienced useRs
  • Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

Recap: Previous R102 Sessions

This is the fourth talk in a new series called R102: MasteRing the Fundamentals


Previous Talks:

Missing Values in R (March 2024)

Summary Statistics in R (April 2024)

Reshaping Data with tidyr (May 2024)


To watch previous talks or review slides:
https://arcus.github.io/r102/

R 102:
Data Types and Visualizations

What we’re covering today

  • How to identify data types in R
  • A review of the data types that are available in R
  • Converting variables from one data type to another
  • Creating flexible visualizations of data in R

Identifying Data Types in R

There are several R functions that can return a variable’s type


Today we’ll focus on using str(), which displays the internal structure of an R object


Other options include class() and typeof() - read more about those here!
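As a rough sketch of how these functions differ, the snippet below applies class() and typeof() to the same kind of values stored three ways. The object names (sex_chr, sex_fct, score) are made up for illustration and are not part of the workshop materials.

```r
# Hypothetical example vectors (not from the workshop data)
sex_chr <- c("female", "male", "male", "female")
sex_fct <- as.factor(sex_chr)
score   <- c(-5.2, 0, 1.2)

class(sex_chr)   # "character"
typeof(sex_chr)  # "character"

class(sex_fct)   # "factor"
typeof(sex_fct)  # "integer": a factor is stored as integer codes

class(score)     # "numeric"
typeof(score)    # "double"
```

class() reports how R will treat the object, while typeof() reports how it is stored internally; str() shows a compact summary that combines the type with a preview of the values.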

What are the most common data types?


Character: each value is a string (e.g., “female”)

# Create an example called x that contains either "female" or "male"
x <- c("female", "male", "male", "female")

print(x)
[1] "female" "male"   "male"   "female"
str(x)
 chr [1:4] "female" "male" "male" "female"

What are the most common data types?


Factor: each value is a string, but the possible values are stored as levels within R

# Convert the character example to a factor 
x.factor <- as.factor(x)
print(x.factor)
[1] female male   male   female
Levels: female male
str(x.factor)
 Factor w/ 2 levels "female","male": 1 2 2 1

What are the most common data types?


Numeric: each value is a real number

x <- c(-5.2,0,1.2,2.82,7.676)
print(x)
[1] -5.200  0.000  1.200  2.820  7.676
str(x)
 num [1:5] -5.2 0 1.2 2.82 7.68

Other data types

Logical: each value is either TRUE or FALSE

x <- c(TRUE, FALSE, T, F)
print(x)
[1]  TRUE FALSE  TRUE FALSE
str(x)
 logi [1:4] TRUE FALSE TRUE FALSE

Complex: each value is a complex number

x <- c((1 + 2i), (2 + 3i))
print(x)
[1] 1+2i 2+3i
str(x)
 cplx [1:2] 1+2i 2+3i

Character vs Factor Data Types

Character and factor data types look similar:

x <- c("female", "male", "male", "female")

print(x)
[1] "female" "male"   "male"   "female"
x.factor <- as.factor(x)

print(x.factor)
[1] female male   male   female
Levels: female male
  • Factors are stored as integer codes plus a table of levels, which can save memory and computation time
  • More computational options are available for factors than for characters (e.g., the summary() function!)

For more on the summary() function, check out materials from our earlier talk on Summary Statistics in R available here

  • Variables like name, study id, etc. can be stored as character vectors
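A minimal sketch of the difference, reusing the x and x.factor objects from this slide. The comments describe what these calls typically return in a standard R session.

```r
x <- c("female", "male", "male", "female")
x.factor <- as.factor(x)

# For a character vector, summary() gives only a generic description
summary(x)

# For a factor, summary() counts how many values fall in each level
summary(x.factor)     # female: 2, male: 2

# The two pieces a factor is built from
levels(x.factor)      # "female" "male"
as.integer(x.factor)  # 1 2 2 1 (position of each value in the levels table)
```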

Can we mix data types?

Let’s see what happens when we try to store different data types in the same vector:

x <- c(TRUE, 3.0, "male")
print(x)
[1] "TRUE" "3"    "male"
str(x)
 chr [1:3] "TRUE" "3" "male"
  • R converts (coerces) all values in the vector to a single data type; here, every value became a character string
  • We generally want to avoid mixing data types within one variable, since it rarely makes sense to store multiple types together
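To illustrate the conversion, here is a small sketch showing that R picks the most flexible type present: logical values become numbers when mixed with numbers, and everything becomes character when a string is included. The object names are hypothetical.

```r
# Logical + numeric: TRUE/FALSE are converted to 1/0
mixed_num <- c(TRUE, FALSE, 3)
str(mixed_num)   # num [1:3] 1 0 3

# Anything + character: every value is converted to a string
mixed_chr <- c(TRUE, 3.5, "male")
str(mixed_chr)   # chr [1:3] "TRUE" "3.5" "male"

# A list is one way to keep each value's original type
mixed_list <- list(TRUE, 3.0, "male")
str(mixed_list)
```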

Can we change data types?

Yes, using R’s functions like as.factor(), as.character(), and as.numeric():

x <- c("female", "male", "male", "female")
print(x)
[1] "female" "male"   "male"   "female"
str(x)
 chr [1:4] "female" "male" "male" "female"
x <- as.factor(x) # convert from character to factor 
print(x)
[1] female male   male   female
Levels: female male
str(x)
 Factor w/ 2 levels "female","male": 1 2 2 1
x <- as.character(x) # convert from factor back to character
print(x)
[1] "female" "male"   "male"   "female"
str(x)
 chr [1:4] "female" "male" "male" "female"

Can we change data types?

x <- c("3.7", "4.2", "5.0")
print(x)
[1] "3.7" "4.2" "5.0"
str(x)
 chr [1:3] "3.7" "4.2" "5.0"
x <- as.numeric(x) # convert from character to numeric 
print(x)
[1] 3.7 4.2 5.0
str(x)
 num [1:3] 3.7 4.2 5
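Two conversion pitfalls are worth a quick sketch. The values below are made up for illustration: as.numeric() applied directly to a factor returns the internal level codes rather than the original numbers, and strings that do not look like numbers become NA.

```r
# Hypothetical values, for illustration only
dose <- factor(c("10", "20", "20", "5"))

as.numeric(dose)                # 1 2 2 3: the level codes, not the doses
as.numeric(as.character(dose))  # 10 20 20 5: convert to character first

# Non-numeric strings are converted to NA (R also prints a warning)
mixed <- c("3.7", "4.2", "unknown")
as.numeric(mixed)               # 3.7 4.2 NA
```

Checking str() before and after a conversion is an easy way to catch both problems.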

How can we create visualizations in R?

There are two general options for creating figures in R:


Option 1: Base R functions like plot()

Examples of figures created in base R


Option 2: Using the tidyverse and functions like ggplot()

Examples of figures created with ggplot

Some nice features of ggplot

  • All figures are built using a series of layers
  • Able to save plots (or partial plots) as objects
# save basis of plot as an object called "p"
p <- ggplot(data = example_data, aes(x = x_variable, y = y_variable))

# add a geometric object (points) and display the plot
p + geom_point()

# add a different geometric object (line) and display the plot
p + geom_line()
  • Quickly create separate plots for each value of a factor variable using facet_grid or facet_wrap

Base R figures are still a good option!
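For comparison, a minimal base R sketch is shown below. It uses a small toy dataset invented for this example (day and value are not workshop variables) and needs no extra packages.

```r
# Toy data made up for illustration
day   <- c(1, 2, 3, 4, 5, 6)
value <- c(12, 15, 11, 18, 21, 19)

# Scatterplot with a title, axis labels, and red filled points
plot(x = day, y = value,
     main = "Toy scatterplot", xlab = "Day", ylab = "Value",
     col = "red", pch = 19)

# Histogram of the same values; hist() also returns the bin counts
h <- hist(value, main = "Toy histogram", xlab = "Value")
```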

Data Types and Visualizations in R

Option 1: Work in the cloud: https://posit.cloud/content/7522885

Option 2: Work on your computer: https://github.com/arcus/r102

The packages we’ll be using today

tidyverse, medicaldata, ggplot2

Note: ggplot2 is actually part of the tidyverse core set of packages

Learn more

There’s a lot of helpful information (including examples and tutorials) on the package websites for each of the packages we’ll be using:

Load packages

Only if needed:

install.packages(c("tidyverse", "medicaldata", "ggplot2"))


Each R session:

library(tidyverse)
library(medicaldata) 
library(ggplot2)

The data

In the console or in the data_types_and_viz_exercises.rmd file, run the following command:

head(covid_testing)
# A tibble: 6 × 17
  subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name  
       <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>        
1       1412 jhezane         westerling     female       4 covid   inpatient wa…
2        533 penny           targaryen      female       7 covid   clinical lab 
3       9134 grunt           rivers         male         7 covid   clinical lab 
4       8518 melisandre      swyft          female       8 covid   clinical lab 
5       8967 rolley          karstark       male         8 covid   emergency de…
6      11048 megga           karstark       female       8 covid   oncology day…
# ℹ 10 more variables: result <chr>, demo_group <chr>, age <dbl>,
#   drive_thru_ind <dbl>, ct_result <dbl>, orderset <dbl>, payor_group <chr>,
#   patient_class <chr>, col_rec_tat <dbl>, rec_ver_tat <dbl>

About these data

To learn more about this dataset:

?covid_testing

From the help documentation:

This data set is from Amrom E. Obstfeld, who de-identified data on COVID-19 testing during 2020 at CHOP (Children’s Hospital of Pennsylvania). This data set contains data concerning testing for SARS-CoV2 via PCR as well as associated metadata. These data have been anonymized, time-shifted, and permuted.

Learn more

  • To learn more about the covid data and the study behind it, check out this link.
  • To learn more about the medicaldata R package these data are published in, see the medicaldata package website – and note that the maintainers are always looking for more data contributions!

Coding Challenge 1:

Your turn!

Look in the data_types_and_viz_exercises.rmd file to find your first coding challenge.


Quick aside: using str()

We can also use str() on the entire covid_testing dataset all at once:

str(covid_testing)
spc_tbl_ [15,524 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id     : num [1:15524] 1412 533 9134 8518 8967 ...
 $ fake_first_name: chr [1:15524] "jhezane" "penny" "grunt" "melisandre" ...
 $ fake_last_name : chr [1:15524] "westerling" "targaryen" "rivers" "swyft" ...
 $ gender         : chr [1:15524] "female" "female" "male" "female" ...
 $ pan_day        : num [1:15524] 4 7 7 8 8 8 9 9 9 9 ...
 $ test_id        : chr [1:15524] "covid" "covid" "covid" "covid" ...
 $ clinic_name    : chr [1:15524] "inpatient ward a" "clinical lab" "clinical lab" "clinical lab" ...
 $ result         : chr [1:15524] "negative" "negative" "negative" "negative" ...
 $ demo_group     : chr [1:15524] "patient" "patient" "patient" "patient" ...
 $ age            : num [1:15524] 0 0 0.8 0.8 0.8 0.8 0.8 0 0 0.9 ...
 $ drive_thru_ind : num [1:15524] 0 1 1 1 0 0 1 0 1 1 ...
 $ ct_result      : num [1:15524] 45 45 45 45 45 45 45 45 45 45 ...
 $ orderset       : num [1:15524] 0 0 1 1 1 0 1 1 1 1 ...
 $ payor_group    : chr [1:15524] "government" "commercial" NA NA ...
 $ patient_class  : chr [1:15524] "inpatient" "not applicable" NA NA ...
 $ col_rec_tat    : num [1:15524] 1.4 2.3 7.3 5.8 1.2 1.4 2.6 0.7 1 7.1 ...
 $ rec_ver_tat    : num [1:15524] 5.2 5.8 4.7 5 6.4 7 4.2 6.3 5.6 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   fake_first_name = col_character(),
  ..   fake_last_name = col_character(),
  ..   gender = col_character(),
  ..   pan_day = col_double(),
  ..   test_id = col_character(),
  ..   clinic_name = col_character(),
  ..   result = col_character(),
  ..   demo_group = col_character(),
  ..   age = col_double(),
  ..   drive_thru_ind = col_double(),
  ..   ct_result = col_double(),
  ..   orderset = col_double(),
  ..   payor_group = col_character(),
  ..   patient_class = col_character(),
  ..   col_rec_tat = col_double(),
  ..   rec_ver_tat = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 


ggplot2 for Visualizations

Example 1: Scatterplot (single color)

baseplot <- ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result # y-axis variable
                             ))
baseplot

Example 1: Scatterplot (single color)

baseplot + 
  geom_point() # adds the points to the scatterplot

Click here for the full list of geom_ options

Changing the color of the points

baseplot + 
  geom_point(  # adds the points to the scatterplot
    color = "red" # sets the color of the points to "red"
    )

Example 2: Scatterplot (color by variable)

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = drive_thru_ind # color variable
                             )
       ) + 
  geom_point() # adds the points to the scatterplot

table(covid_testing$drive_thru_ind)

   0    1 
7537 7987 
str(covid_testing$drive_thru_ind)
 num [1:15524] 0 1 1 1 0 0 1 0 1 1 ...

Converting drive_thru_ind to a factor

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind) # color variable
                             )
       ) + 
  geom_point() # adds the points to the scatterplot
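An alternative sketch: rather than calling as.factor() inside aes(), the conversion can be done once up front with factor(), which also allows readable labels. The names covid_plot_data and drive_thru are hypothetical, and the "No"/"Yes" labels assume that 1 indicates a drive-thru collection (check ?covid_testing to confirm).

```r
library(ggplot2)
library(medicaldata)

# Work on a copy so the original dataset is left unchanged
covid_plot_data <- covid_testing

# Assumption: 0 = not drive-thru, 1 = drive-thru
covid_plot_data$drive_thru <- factor(covid_plot_data$drive_thru_ind,
                                     levels = c(0, 1),
                                     labels = c("No", "Yes"))

str(covid_plot_data$drive_thru)  # now a factor with 2 levels

# A factor mapped to color gives a discrete legend instead of a gradient
ggplot(data = covid_plot_data,
       mapping = aes(x = pan_day, y = ct_result, color = drive_thru)) +
  geom_point()
```

With labels set this way, a later scale_color_manual() call would need to use "No" and "Yes" as the names in values instead of "0" and "1".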

Changing the color of the points

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind) # color variable
                             )
       ) + 
  geom_point() + # adds the points to the scatterplot
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red'))

Adding plot titles and labels

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") # changes legend label

Changing plot placement and theme

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") + # changes legend label
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") # moves legend to bottom 

Using facet_wrap

Suppose we are interested in separating the points in the previous scatterplot based on patient gender

ggplot’s facet_wrap() function provides an easy way to do this:

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") + # changes legend label
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") +  # moves legend to bottom 
  facet_wrap(~gender)

Learn more

  • In addition to facet_wrap(), there is a similar option called facet_grid()

  • To learn more about faceting, including the differences between these two functions, check out this link

  • To learn more about the different ggplot themes that are available, check out this link
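As a brief sketch of facet_grid(), the code below arranges panels in a grid defined by two variables: one for rows and one for columns. The choice of gender and drive_thru_ind is only for illustration, and p_grid is a hypothetical object name.

```r
library(ggplot2)
library(medicaldata)

p_grid <- ggplot(data = covid_testing,
                 mapping = aes(x = pan_day, y = ct_result)) +
  geom_point() +
  facet_grid(rows = vars(gender),          # one row of panels per gender
             cols = vars(drive_thru_ind))  # one column per drive-thru value

p_grid
```

facet_wrap() lays out panels for a single variable and wraps them across rows, while facet_grid() crosses two variables to form a full grid.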

Coding Challenge 2:

Your turn!

Look in the data_types_and_viz_exercises.rmd file to find your second coding challenge.


Aside: “color” vs. “fill”


scale_color_manual(): used to color lines and points

scale_fill_manual(): used to color fillable objects (e.g., histogram bars)


Note: We will also need to modify the code for the legend title from labs(color = "Label Text") to labs(fill = "Label Text")

Example 3: Histogram (color by variable)

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    ) 

Changing the color of the bars

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    ) +
  scale_fill_manual(values = c("female" = 'lightgrey', # sets colors for each level
                                "male" = 'darkgrey'))

Modifying plot titles, labels, and theme

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    ) +
  scale_fill_manual(values = c("female" = 'lightgrey', # sets colors for each level
                                "male" = 'darkgrey')) + 
  ggtitle("Number of Patients by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("Number of Patients") + # changes y-axis label 
  labs(fill = "Patient Gender") + # changes legend label (note this says "fill")
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") # moves legend to bottom 

Exporting a Visualization

The ggsave() function is a convenient option for exporting a ggplot object:

ggsave(filename = "Figure_Name.png", 
       plot = last_plot(),
       path = NULL) # Will default to your working directory 


There are lots of options regarding plot specifications like dimensions and resolution - learn more here!
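A minimal sketch of setting dimensions and resolution with ggsave(), using its width, height, units, and dpi arguments. The toy data frame, the plot object p, and the temporary output location are assumptions made so the example runs on its own.

```r
library(ggplot2)

# Toy data and plot, made up for illustration
toy <- data.frame(day = 1:5, value = c(2, 4, 3, 6, 5))
p <- ggplot(data = toy, mapping = aes(x = day, y = value)) +
  geom_point()

# Save to a temporary folder here; in practice, use your own file path
out_file <- file.path(tempdir(), "Figure_Name.png")

ggsave(filename = out_file,
       plot = p,       # save a specific plot object rather than last_plot()
       width = 6,      # figure width
       height = 4,     # figure height
       units = "in",   # units for width and height
       dpi = 300)      # resolution in dots per inch
```

Passing the plot object explicitly avoids accidentally saving a different figure if another plot was displayed more recently.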

Coding Challenge 3: Homework!

Try to recreate this figure:

[Figure: example figure to recreate]

Coding Challenge 3: Homework!

Don’t struggle in silence!

  • Ask questions and share tips on the CHOPR Slack
  • Come to R Office Hours to show off your progress and get help
  • There’s a solution available in data_types_and_viz_solutions.Rmd, but you’ll learn a lot more if you try it yourself first

What we covered

  • A review of the data types that are available in R
  • Converting variables from one data type to another
  • Creating flexible scatterplots and histograms of data using ggplot
  • Exporting ggplot figures

Shameless Plug

Have funding for your research project and interested in working with an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHI’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!

Thank you!