Data Types and Visualizations
in R

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez (presenter)

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-06-03

Join the CHOP R User Group

CHOPR hex sticker logo

Friendly help troubleshooting your R code
Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

Set up a meeting to get live help with your R code from our most experienced useRs
Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

Recap: Previous R102 Sessions

This is the fourth talk in a new series called R102: MasteRing the Fundamentals

Previous Talks:

Missing Values in R (March 2024)

Summary Statistics in R (April 2024)

Reshaping Data with tidyr (May 2024)

To watch previous talks or review slides:
https://arcus.github.io/r102/

R 102:
Data Types and Visualizations

What we’re covering today

How to identify data types in R

A review of the data types that are available in R

Converting variables from one data type to another

Creating flexible visualizations of data in R

Identifying Data Types in R

There are several R functions that can return a variable’s type

Today we’ll focus on using str(), which displays the internal structure of an R object

Other options include class(), which returns an object's class, and typeof(), which returns its internal storage type
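
As a quick illustration of how these three functions differ, here is a minimal sketch. The vector names ages and sexes are made up for this example and are not part of the talk's dataset.

```r
# Hypothetical example vectors (not from the talk's dataset)
ages  <- c(1.5, 2, 3.25)
sexes <- c("female", "male")

str(ages)      # compact summary: type, length, and the first few values
class(ages)    # "numeric"
typeof(ages)   # "double" (how R stores the values internally)

class(sexes)   # "character"
typeof(sexes)  # "character"
```

For everyday work, str() is usually enough; class() and typeof() are handy when you need the type as a value you can test in code.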

What are the most common data types?

Character: each value is a string (e.g., “female”)

# Create an example called x that contains either "female" or "male"
x <- c("female", "male", "male", "female")

print(x)

[1] "female" "male"   "male"   "female"

str(x)

 chr [1:4] "female" "male" "male" "female"

What are the most common data types?

Factor: each value is a string, but the possible values are stored as levels within R

# Convert the character example to a factor 
x.factor <- as.factor(x)
print(x.factor)

[1] female male   male   female
Levels: female male

str(x.factor)

 Factor w/ 2 levels "female","male": 1 2 2 1

What are the most common data types?

Numeric: each value is a real number

x <- c(-5.2,0,1.2,2.82,7.676)
print(x)

[1] -5.200  0.000  1.200  2.820  7.676

str(x)

 num [1:5] -5.2 0 1.2 2.82 7.68

Other data types

Logical: each value is either TRUE or FALSE

x <- c(TRUE, FALSE, T, F)
print(x)

[1]  TRUE FALSE  TRUE FALSE

str(x)

 logi [1:4] TRUE FALSE TRUE FALSE

Complex: each value is a complex number

x <- c((1 + 2i), (2 + 3i))
print(x)

[1] 1+2i 2+3i

str(x)

 cplx [1:2] 1+2i 2+3i

Character vs Factor Data Types

Character and factor data types look similar:

x <- c("female", "male", "male", "female")

print(x)

[1] "female" "male"   "male"   "female"

x.factor <- as.factor(x)

print(x.factor)

[1] female male   male   female
Levels: female male

Factors are stored as numbers and a table of levels, which can save memory and computation time

More computational options are available for factors than for characters (e.g., the summary() function!)

For more on the summary() function, check out the materials from our earlier talk on Summary Statistics in R, available at https://arcus.github.io/r102/

Variables like name, study id, etc. can be stored as character vectors
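
To make the "numbers plus a table of levels" idea concrete, the sketch below reuses the x and x.factor objects from this slide and compares what summary() returns for each.

```r
x <- c("female", "male", "male", "female")
x.factor <- as.factor(x)

levels(x.factor)      # the table of levels: "female" "male"
as.integer(x.factor)  # the stored integer codes: 1 2 2 1

summary(x)         # character vector: only length, class, and mode
summary(x.factor)  # factor: a count per level (female 2, male 2)
```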

Can we mix data types?

Let’s see what happens when we try to store different data types in the same vector:

x <- c(TRUE, 3.0, "male")
print(x)

[1] "TRUE" "3"    "male"

str(x)

 chr [1:3] "TRUE" "3" "male"

We generally want to avoid mixing data types within one variable

R will convert (coerce) all values in the vector to a single data type; in this example, everything becomes a character string

Often, it doesn’t make much sense to have multiple data types stored within the same variable
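
The conversion follows a predictable order: logical values become numbers when mixed with numbers, and everything becomes a character string when mixed with text. A minimal sketch, using made-up object names (a, b, mixed), also shows that a list is the structure to reach for if you really do need to keep mixed types together:

```r
# Logical + numeric: TRUE/FALSE become 1/0
a <- c(TRUE, FALSE, 3.5)
str(a)   # num [1:3] 1 0 3.5

# Numeric + character: numbers become strings
b <- c(3.5, "male")
str(b)   # chr [1:2] "3.5" "male"

# A list keeps each element's original type
mixed <- list(TRUE, 3.0, "male")
sapply(mixed, class)   # "logical" "numeric" "character"
```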

Can we change data types?

Yes, using R’s functions like as.factor(), as.character(), and as.numeric():

x <- c("female", "male", "male", "female")
print(x)

[1] "female" "male"   "male"   "female"

str(x)

 chr [1:4] "female" "male" "male" "female"

x <- as.factor(x) # convert from character to factor 
print(x)

[1] female male   male   female
Levels: female male

str(x)

 Factor w/ 2 levels "female","male": 1 2 2 1

x <- as.character(x) # convert from factor back to character
print(x)

[1] "female" "male"   "male"   "female"

str(x)

 chr [1:4] "female" "male" "male" "female"

Can we change data types?

x <- c("3.7", "4.2", "5.0")
print(x)

[1] "3.7" "4.2" "5.0"

str(x)

 chr [1:3] "3.7" "4.2" "5.0"

x <- as.numeric(x) # convert from character to numeric 
print(x)

[1] 3.7 4.2 5.0

str(x)

 num [1:3] 3.7 4.2 5
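
Two conversion pitfalls are worth knowing about. The sketch below uses made-up vectors (ages, doses) to show them: text that cannot be read as a number turns into NA, and calling as.numeric() directly on a factor returns the internal level codes rather than the labels.

```r
# Text that does not look like a number becomes NA (R also gives a warning,
# which is suppressed here to keep the output clean)
ages <- c("3.7", "4.2", "unknown")
ages_num <- suppressWarnings(as.numeric(ages))
print(ages_num)   # 3.7 4.2 NA

# as.numeric() on a factor returns the level codes, not the labels
doses <- as.factor(c("10", "20", "10"))
as.numeric(doses)                 # 1 2 1
as.numeric(as.character(doses))   # 10 20 10
```

Converting a factor to character first, and then to numeric, is the usual way to recover the original numbers.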

How can we create visualizations in R?

There are two general options for creating figures in R:

Option 1: Base R functions like plot()

Examples of figures created in base R

Option 2: Using the tidyverse and functions like ggplot()

Examples of figures created with ggplot

Some nice features of ggplot

All figures are built using a series of layers

Able to save plots (or partial plots) as objects

# save basis of plot as an object called "p"
p <- ggplot(data = example_data, aes(x = x_variable, y = y_variable))

# add a geometric object (points) and display the plot
p + geom_point()

# add a different geometric object (line) and display the plot
p + geom_line()

Quickly create separate plots for each value of a factor variable using facet_grid or facet_wrap
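
Here is a small runnable sketch of these features working together. The data frame example_data and its columns (x_variable, y_variable, group) are invented for illustration only.

```r
library(ggplot2)

# Hypothetical data for illustration only
example_data <- data.frame(
  x_variable = c(1, 2, 3, 1, 2, 3),
  y_variable = c(2, 4, 5, 1, 3, 6),
  group      = factor(c("a", "a", "a", "b", "b", "b"))
)

# Save the basis of the plot as an object, then add a layer to it
p <- ggplot(data = example_data, aes(x = x_variable, y = y_variable))
p_points <- p + geom_point()

# Create one panel per level of the factor variable
p_faceted <- p_points + facet_wrap(~group)
p_faceted
```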

Base R figures are still a good option!
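
For comparison, a minimal base R sketch of a scatterplot and a histogram is shown below. The vectors day and value are made up for this example.

```r
# Hypothetical data for illustration only
day   <- c(1, 2, 3, 4, 5, 6)
value <- c(2.1, 3.4, 2.9, 4.8, 5.2, 6.0)

# Scatterplot with a title, axis labels, and red filled points
plot(x = day, y = value,
     main = "Value by Day", xlab = "Day", ylab = "Value",
     col = "red", pch = 19)

# Histogram of the same values
hist(value, main = "Distribution of Value", xlab = "Value")
```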

Data Types and Visualizations in R

Option 1: Work in the cloud: https://posit.cloud/content/7522885

Option 2: Work on your computer: https://github.com/arcus/r102

Time to start coding! By far the best way to learn R is to practice, so work through this code yourself as you follow along.

The first link will take you to Posit Cloud, which gives you a way to work with the code right in your browser without having to install anything on your machine. You will need to create a free account if you don’t already have one. I’ll click that link now so you can see what it looks like. It will take a few minutes to load.

You can also get all of the code for this talk directly from our GitHub and download it to work on your own machine. If you want to go this route, go to our GitHub repo and then find this green “Code” button. If you click that you’ll see you have several options, one of which is downloading a zip file – click that and it will download all the files you need for this talk. Once it’s done downloading, double click it to unzip the file. If you’re comfortable using git, you can also clone the repo, or fork it if you’d like a personal copy. And if you don’t know what cloning and forking are, no worries! Just use the zip file.

The packages we’ll be using today

tidyverse hex sticker logo. medicaldata hex sticker logo. ggplot2 hex sticker logo.

Note: ggplot2 is actually part of the tidyverse core set of packages

Learn more

There’s a lot of helpful information (including examples and tutorials) on the package websites for each of the packages we’ll be using:

Load packages

Only if needed:

install.packages(c("tidyverse", "medicaldata", "ggplot2"))

Each R session:

library(tidyverse)
library(medicaldata) 
library(ggplot2)

The data

In the console or in the data_types_and_viz_exercises.rmd file, run the following command:

head(covid_testing)

# A tibble: 6 × 17
  subject_id fake_first_name fake_last_name gender pan_day test_id clinic_name  
       <dbl> <chr>           <chr>          <chr>    <dbl> <chr>   <chr>        
1       1412 jhezane         westerling     female       4 covid   inpatient wa…
2        533 penny           targaryen      female       7 covid   clinical lab 
3       9134 grunt           rivers         male         7 covid   clinical lab 
4       8518 melisandre      swyft          female       8 covid   clinical lab 
5       8967 rolley          karstark       male         8 covid   emergency de…
6      11048 megga           karstark       female       8 covid   oncology day…
# ℹ 10 more variables: result <chr>, demo_group <chr>, age <dbl>,
#   drive_thru_ind <dbl>, ct_result <dbl>, orderset <dbl>, payor_group <chr>,
#   patient_class <chr>, col_rec_tat <dbl>, rec_ver_tat <dbl>

About these data

To learn more about this dataset:

?covid_testing

From the help documentation:

This data set is from Amrom E. Obstfeld, who de-identified data on COVID-19 testing during 2020 at CHOP (Children’s Hospital of Pennsylvania). This data set contains data concerning testing for SARS-CoV2 via PCR as well as associated metadata. These data have been anonymized, time-shifted, and permuted.

Learn more

To learn more about the covid data and the study behind it, check out this link.
To learn more about the medicaldata R package these data are published in, see the medicaldata package website – and note that the maintainers are always looking for more data contributions!

Coding Challenge 1:

Your turn!

Look in the data_types_and_viz_exercises.rmd file to find your first coding challenge.

Quick aside: using str()

We can also use str() on the entire covid_testing dataset all at once:

str(covid_testing)

spc_tbl_ [15,524 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id     : num [1:15524] 1412 533 9134 8518 8967 ...
 $ fake_first_name: chr [1:15524] "jhezane" "penny" "grunt" "melisandre" ...
 $ fake_last_name : chr [1:15524] "westerling" "targaryen" "rivers" "swyft" ...
 $ gender         : chr [1:15524] "female" "female" "male" "female" ...
 $ pan_day        : num [1:15524] 4 7 7 8 8 8 9 9 9 9 ...
 $ test_id        : chr [1:15524] "covid" "covid" "covid" "covid" ...
 $ clinic_name    : chr [1:15524] "inpatient ward a" "clinical lab" "clinical lab" "clinical lab" ...
 $ result         : chr [1:15524] "negative" "negative" "negative" "negative" ...
 $ demo_group     : chr [1:15524] "patient" "patient" "patient" "patient" ...
 $ age            : num [1:15524] 0 0 0.8 0.8 0.8 0.8 0.8 0 0 0.9 ...
 $ drive_thru_ind : num [1:15524] 0 1 1 1 0 0 1 0 1 1 ...
 $ ct_result      : num [1:15524] 45 45 45 45 45 45 45 45 45 45 ...
 $ orderset       : num [1:15524] 0 0 1 1 1 0 1 1 1 1 ...
 $ payor_group    : chr [1:15524] "government" "commercial" NA NA ...
 $ patient_class  : chr [1:15524] "inpatient" "not applicable" NA NA ...
 $ col_rec_tat    : num [1:15524] 1.4 2.3 7.3 5.8 1.2 1.4 2.6 0.7 1 7.1 ...
 $ rec_ver_tat    : num [1:15524] 5.2 5.8 4.7 5 6.4 7 4.2 6.3 5.6 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   fake_first_name = col_character(),
  ..   fake_last_name = col_character(),
  ..   gender = col_character(),
  ..   pan_day = col_double(),
  ..   test_id = col_character(),
  ..   clinic_name = col_character(),
  ..   result = col_character(),
  ..   demo_group = col_character(),
  ..   age = col_double(),
  ..   drive_thru_ind = col_double(),
  ..   ct_result = col_double(),
  ..   orderset = col_double(),
  ..   payor_group = col_character(),
  ..   patient_class = col_character(),
  ..   col_rec_tat = col_double(),
  ..   rec_ver_tat = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
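
If the full str() output is more detail than you need, one possible shortcut is to apply class() to every column. This is a sketch rather than part of the original exercises; the name col_types is made up here.

```r
library(medicaldata)

# Apply class() to each column; the result is a named character vector
col_types <- sapply(covid_testing, class)
col_types

# Count how many columns there are of each type
table(col_types)

# List just the character columns
names(col_types)[col_types == "character"]
```

Based on the str() output above, this should report 9 character columns and 8 numeric columns.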

ggplot2 for Visualizations

Example 1: Scatterplot (single color)

baseplot <- ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result # y-axis variable
                             ))
baseplot

Example 1: Scatterplot (single color)

baseplot + 
  geom_point() # adds the points to the scatterplot

The ggplot2 reference documentation has the full list of geom_ options

Changing the color of the points

baseplot + 
  geom_point(  # adds the points to the scatterplot
    color = "red" # sets the color of the points to "red"
    )

Example 2: Scatterplot (color by variable)

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = drive_thru_ind # color variable
                             )
       ) + 
  geom_point() # adds the points to the scatterplot

table(covid_testing$drive_thru_ind)


   0    1 
7537 7987

str(covid_testing$drive_thru_ind)

 num [1:15524] 0 1 1 1 0 0 1 0 1 1 ...

Converting drive_thru_ind to a factor

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind) # color variable
                             )
       ) + 
  geom_point() # adds the points to the scatterplot
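
Because drive_thru_ind is numeric, ggplot maps it to a continuous color gradient; wrapping it in as.factor() gives one discrete color per level instead. An alternative is to convert the column once, with readable labels, before plotting. The sketch below assumes that 1 means the specimen was collected at a drive-thru site and 0 means it was not; the names covid_labeled and drive_thru are made up for this example.

```r
library(tidyverse)
library(medicaldata)

# Create a labeled factor version of the indicator
# (assumption: 0 = not drive-thru, 1 = drive-thru)
covid_labeled <- covid_testing %>%
  mutate(drive_thru = factor(drive_thru_ind,
                             levels = c(0, 1),
                             labels = c("No", "Yes")))

str(covid_labeled$drive_thru)

# The legend now shows "No" and "Yes" instead of 0 and 1
ggplot(data = covid_labeled,
       mapping = aes(x = pan_day, y = ct_result, color = drive_thru)) +
  geom_point()
```

With labels like these, any names passed to scale_color_manual() would need to be "No" and "Yes" rather than "0" and "1".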

Changing the color of the points

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind) # color variable
                             )
       ) + 
  geom_point() + # adds the points to the scatterplot
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red'))

Adding plot titles and labels

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") # changes legend label

Changing plot placement and theme

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") + # changes legend label
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") # moves legend to bottom

Using facet_wrap

Suppose we are interested in separating the points in the previous scatterplot based on patient gender

ggplot’s facet_wrap() function provides an easy way to do this:

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             y = ct_result, # y-axis variable
                             color = as.factor(drive_thru_ind)
                             )
       ) + 
  geom_point() + 
  scale_color_manual(values = c("0" = 'blue', # sets colors for each level
                                "1" = 'red')) + 
  ggtitle("CT Results by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("CT Result") + # changes y-axis label 
  labs(color = "Drive Thru Indicator") + # changes legend label
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") +  # moves legend to bottom 
  facet_wrap(~gender)

Learn more

In addition to facet_wrap(), there is a similar option called facet_grid()
To learn more about faceting, including the differences between these two functions, see the help pages ?facet_wrap and ?facet_grid
To learn more about the different ggplot themes that are available, see the help page ?theme_bw
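
As a rough sketch of the difference, facet_grid() lays panels out in rows and columns defined by two variables, whereas facet_wrap() wraps panels for one variable into a grid. The object name p_grid is made up for this example.

```r
library(tidyverse)
library(medicaldata)

# Rows: drive_thru_ind (0 or 1); columns: gender
p_grid <- ggplot(data = covid_testing,
                 mapping = aes(x = pan_day, y = ct_result)) +
  geom_point() +
  facet_grid(drive_thru_ind ~ gender)   # formula is rows ~ columns

p_grid
```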

Coding Challenge 2:

Your turn!

Look in the data_types_and_viz_exercises.rmd file to find your second coding challenge.

Aside: “color” vs. “fill”

scale_color_manual(): used to color lines and points

scale_fill_manual(): used to color fillable objects (e.g. histograms)

Note: We will also need to modify the code for the legend title from labs(color = "Label Text") to labs(fill = "Label Text")

Example 3: Histogram (color by variable)

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    )

Changing the color of the bars

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    ) +
  scale_fill_manual(values = c("female" = 'lightgrey', # sets colors for each level
                                "male" = 'darkgrey'))

Modifying plot titles, labels, and theme

ggplot(data = covid_testing, # dataset to use for plot
                   mapping = aes( # list of aesthetic mappings to use for plot
                             x = pan_day, # x-axis variable
                             fill = gender) # note: using "fill" not "color"
       ) + 
  geom_histogram(# adds the histograms to the graph
    bins = 10, # number of bins 
    color = "black" # color of bin outline
    ) +
  scale_fill_manual(values = c("female" = 'lightgrey', # sets colors for each level
                                "male" = 'darkgrey')) + 
  ggtitle("Number of Patients by Pandemic Day") + # adds title to plot
  xlab("Pandemic Day") + # changes x-axis label
  ylab("Number of Patients") + # changes y-axis label 
  labs(fill = "Patient Gender") + # changes legend label (note this says "fill")
  theme_bw() + # changes plot theme to black and white
  theme(plot.title = element_text(hjust = 0.5), # centers plot title
        legend.position="bottom") # moves legend to bottom

Exporting a Visualization

The ggsave() function is a convenient option for exporting a ggplot object:

ggsave(filename = "Figure_Name.png", 
       plot = last_plot(),
       path = NULL) # Will default to your working directory

ggsave() has additional arguments for plot specifications such as dimensions (width, height, units) and resolution (dpi); see ?ggsave for details
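
A minimal sketch of saving a specific plot object at a chosen size and resolution is shown below. The data frame, the object names (example_data, p, out_dir), and the choice of a temporary folder are all assumptions made for this example; in practice you would point path at your own project folder.

```r
library(ggplot2)

# Hypothetical data and plot for illustration only
example_data <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))
p <- ggplot(example_data, aes(x = x, y = y)) +
  geom_point()

# Save to a temporary folder so the example does not clutter your project
out_dir <- tempdir()
ggsave(filename = "Figure_Name.png",
       plot = p,          # save this plot rather than the last one displayed
       path = out_dir,
       width = 6,         # width and height are in the units given below
       height = 4,
       units = "in",
       dpi = 300)         # resolution in dots per inch

# Confirm the file was written
file.exists(file.path(out_dir, "Figure_Name.png"))
```

Passing the plot object explicitly avoids accidentally saving a different figure if another plot was displayed in between.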

Coding Challenge 3: Homework!

Try to recreate this figure:

Example figure.

Coding Challenge 3: Homework!

Don’t struggle in silence!

Ask questions and share tips on the CHOPR slack
Come to R Office Hours to show off your progress and get help
There’s a solution available in data_types_and_viz_solutions.Rmd, but you’ll learn a lot more if you try it yourself first

What we covered

A review of the data types that are available in R
Converting variables from one data type to another
Creating flexible scatterplots and histograms of data using ggplot
Exporting ggplot figures

Shameless Plug

Have funding for your research project and interested in working with an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!

Thank you!
