Summary Statistics
in R

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez (presenter)

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-04-08

Use keyboard arrow keys to
- advance ( → ) and
- go back ( ← )
Type “s” to see speaker notes
Type “?” to see other keyboard shortcuts

Join the CHOP R User Group

CHOPR hex sticker logo

Friendly help troubleshooting your R code
Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

Set up a meeting to get live help with your R code from our most experienced useRs
Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

We have regular office hours appointments available to get R help. We’re offering two kinds of appointments: one-on-one, if you want individual help, or open appointments that are more of a community conversation about whatever R questions people bring up that day. The open appointments can be particularly valuable to attend if you want to hear other people’s questions about R but maybe don’t have a project of your own that you’re troubleshooting at the moment. We set up a calendar online where you can book an appointment.

And if you like this idea and you’re an experienced R user yourself, consider joining us to offer appointments! It’s just whatever times work for you, so we can set up a schedule around whatever other work you’ve got going on. Reach out to me via email or slack for details.

R/Medicine is Accepting Abstracts!

R-based tools should have a substantial role in the work
The content of the presentation should be of interest to the R/Medicine community

CFP ends Monday, April 22 at 11:59 PM EDT

R/Medicine 2024: June 10-14 (virtual)

More info: https://rconsortium.github.io/RMedicine_website/

Coming Soon - More R102 Sessions!

This is the second talk in a new series called R102: MasteRing the Fundamentals

Next up: Reshaping Data with tidyr, May 6th 12:00pm ET

Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

R 102:
Summary Statistics

Okay! So this talk is a quick dip into some 102-level R. In other words, if you’ve had a little exposure to R before, such as through an Intro to R for Clinical Data workshop, this is hopefully the right level for you now. If you’re completely brand new to R, first of all: Welcome! You may find it tricky to actively follow along with the code today since I’m going to skim over some of the initial steps, but go ahead and give it a try, or just listen and watch if that feels more like the right speed.

Our topic is getting summary statistics in R, the first step to pretty much any analysis. We’ll also look at how to present those summary stats in a lovely, publication-ready Table 1. Please feel free to put any questions in the chat - we have a few additional team members who will try to answer your questions during the talk. If there is extra time, we can also stop and talk about things as a group.

What we’re covering today

How to get summary stats in R for both continuous and categorical variables

How to create a publication-ready Table 1

NOT teaching the full analysis in R pipeline – we’re glossing over importing data and tidying data

NOT teaching formal statistical modeling (e.g. hypothesis testing)

Thinking About Summary Statistics

Summary statistics, also called descriptive statistics, serve a few important functions:

understand your data better during exploratory data analysis
give you an idea of what your study population was like at baseline
show preexisting differences between groups

Which statistic?

Categorical variables:

counts

Continuous variables:

central tendency
variability

Central tendency

Measures of central tendency:

mean
median

Variability

Measures of variability:

variance
standard deviation
interquartile range
range

Learn more

We won’t get into the mathematical definitions of each of these statistics here, but if you’d like a refresher, check out these excellent Khan academy videos:

Getting Summary Statistics in R

Two options:

Work in the cloud: https://posit.cloud/content/6126628
Work on your computer: https://github.com/arcus/education_r_environment

Time to start coding! By far the best way to learn R is to practice, so work through this code yourself as you follow along.

The first link will take you to Posit Cloud, which gives you a way to work with the code right in your browser without having to install anything on your machine. You will need to create a free account if you don’t already have one. I’ll click that link now so you can see what it looks like. It will take a few minutes to load.

You can also get all of the code for this talk directly from our GitHub and download it to work on your own machine. If you want to go this route, go to our GitHub repo and then find this green “Code” button. If you click that you’ll see you have several options, one of which is downloading a zip file – click that and it will download all the files you need for this talk. Once it’s done downloading, double click it to unzip the file. If you’re comfortable using git, you can also clone the repo, or fork it if you’d like a personal copy. And if you don’t know what cloning and forking are, no worries! Just use the zip file.

The packages we’ll be using today

tidyverse hex sticker logo. medicaldata hex sticker logo. gtsummary hex sticker logo.

Learn more

There’s a lot of helpful information (including examples and tutorials) on the package websites for each of the packages we’ll be using:

Load packages

Only if needed:

install.packages(c("tidyverse", "medicaldata", "gtsummary"))

Each R session:

library(tidyverse)
library(medicaldata) 
library(gtsummary)

The data

In the console or in the exercises rmd file, run the following command:

head(cytomegalovirus)

  ID age sex race                    diagnosis diagnosis.type
1  1  61   1    0       acute myeloid leukemia              1
2  2  62   1    1         non-Hodgkin lymphoma              0
3  3  63   0    1         non-Hodgkin lymphoma              0
4  4  33   0    1             Hodgkin lymphoma              0
5  5  54   0    1 acute lymphoblastic leukemia              0
6  6  55   1    1                myelofibrosis              1
  time.to.transplant prior.radiation prior.chemo prior.transplant recipient.cmv
1               5.16               0           2                0             1
2              79.05               1           3                0             0
3              35.58               0           4                0             1
4              33.02               1           4                0             1
5              11.40               0           5                0             1
6               2.43               0           0                0             1
  donor.cmv donor.sex TNC.dose CD34.dose CD3.dose CD8.dose TBI.dose C1/C2 aKIRs
1         0         0    18.31      2.29     3.21     0.95      200     0     1
2         0         1     4.26      2.04       NA       NA      200     1     5
3         1         0     8.09      6.97     2.19     0.59      200     0     3
4         0         1    21.02      6.09     4.87     2.32      200     0     2
5         1         0    14.70      2.36     6.55     2.40      400     0     6
6         1         1     4.29      6.91     2.53     0.86      200     0     2
  cmv time.to.cmv agvhd time.to.agvhd cgvhd time.to.cgvhd
1   1        3.91     1          3.55     0          6.28
2   0       65.12     0         65.12     0         65.12
3   0        3.75     0          3.75     0          3.75
4   0       48.49     1         28.55     1         10.45
5   0        4.37     1          2.79     0          4.37
6   1        4.53     1          3.88     0          6.87

This command is also written out for you in the summary_stats_exercises.rmd file, in the next code chunk.

Let’s take a look at the data. [click]

You should see the first six rows of the cytomegalovirus data frame, which look like this.

Note that this is one of the example datasets that comes built-in when you install the medicaldata package, so it’s already available to you without you having to read it in or download anything.

For those of you that have worked in R before, you know importing data is a whole thing, so we’re definitely skipping over a potentially tricky bit by using built-in data, but we only have so much time today and I wanted to be able to put as much time as possible towards actually working with the summary statistics. So we’re just merrily skipping past all the importing and tidying that would normally happen.

About these data

To learn more about this dataset:

?cytomegalovirus

From the help documentation:

This data set contains 64 consecutive patients who underwent T-cell replete, matched sibling donor reduced-intensity conditioning allogeneic hematopoietic stem cell transplant. The primary risk factor of interest was the number of activating killer immunoglobulin-like receptors (aKIRs: 1-4 vs. 5-6).

Learn more

To learn more about the cytomegalovirus data and the study behind it, read Sobecks et al. 2011.
To learn more about the medicaldata R package these data are published in, see the medicaldata package website – and note that the maintainers are always looking for more data contributions!

Using `summary()`

First try running summary() on just the age column:

summary(cytomegalovirus$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  29.00   46.00   55.00   52.44   61.00   67.00

Using `summary()`

Now try running summary on the whole dataframe all at once:

summary(cytomegalovirus)

       ID             age             sex              race       
 Min.   : 1.00   Min.   :29.00   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:16.75   1st Qu.:46.00   1st Qu.:0.0000   1st Qu.:1.0000  
 Median :32.50   Median :55.00   Median :1.0000   Median :1.0000  
 Mean   :32.50   Mean   :52.44   Mean   :0.5312   Mean   :0.9062  
 3rd Qu.:48.25   3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :64.00   Max.   :67.00   Max.   :1.0000   Max.   :1.0000  
                                                                  
  diagnosis         diagnosis.type   time.to.transplant prior.radiation 
 Length:64          Min.   :0.0000   Min.   :  1.84     Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.:  8.15     1st Qu.:0.0000  
 Mode  :character   Median :1.0000   Median : 13.70     Median :0.0000  
                    Mean   :0.5172   Mean   : 30.85     Mean   :0.1719  
                    3rd Qu.:1.0000   3rd Qu.: 33.28     3rd Qu.:0.0000  
                    Max.   :1.0000   Max.   :173.83     Max.   :1.0000  
                    NA's   :6        NA's   :1                          
  prior.chemo    prior.transplant recipient.cmv     donor.cmv     
 Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
 Median :2.000   Median :0.0000   Median :1.000   Median :1.0000  
 Mean   :2.188   Mean   :0.1875   Mean   :0.625   Mean   :0.5938  
 3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
 Max.   :8.000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
                                                                  
   donor.sex      TNC.dose       CD34.dose         CD3.dose       CD8.dose    
 Min.   :0.0   Min.   : 2.06   Min.   : 2.040   Min.   :1.08   Min.   :0.160  
 1st Qu.:0.0   1st Qu.: 6.89   1st Qu.: 4.165   1st Qu.:3.08   1st Qu.:0.805  
 Median :0.5   Median :10.48   Median : 5.840   Median :3.98   Median :1.200  
 Mean   :0.5   Mean   :10.44   Mean   : 5.546   Mean   :4.30   Mean   :1.323  
 3rd Qu.:1.0   3rd Qu.:13.42   3rd Qu.: 6.990   3rd Qu.:5.48   3rd Qu.:1.663  
 Max.   :1.0   Max.   :21.02   Max.   :12.510   Max.   :8.18   Max.   :3.190  
                                                NA's   :3      NA's   :12     
    TBI.dose         C1/C2            aKIRs            cmv        
 Min.   :200.0   Min.   :0.0000   Min.   :1.000   Min.   :0.0000  
 1st Qu.:200.0   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :200.0   Median :1.0000   Median :2.000   Median :0.0000  
 Mean   :287.5   Mean   :0.5156   Mean   :2.844   Mean   :0.4062  
 3rd Qu.:400.0   3rd Qu.:1.0000   3rd Qu.:4.250   3rd Qu.:1.0000  
 Max.   :400.0   Max.   :1.0000   Max.   :6.000   Max.   :1.0000  
                                                                  
  time.to.cmv         agvhd        time.to.agvhd        cgvhd       
 Min.   : 0.430   Min.   :0.0000   Min.   : 0.660   Min.   :0.0000  
 1st Qu.: 2.712   1st Qu.:0.0000   1st Qu.: 2.623   1st Qu.:0.0000  
 Median : 5.030   Median :0.0000   Median : 5.800   Median :0.0000  
 Mean   :13.206   Mean   :0.4062   Mean   :15.605   Mean   :0.4375  
 3rd Qu.:16.012   3rd Qu.:1.0000   3rd Qu.:21.410   3rd Qu.:1.0000  
 Max.   :84.470   Max.   :1.0000   Max.   :85.190   Max.   :1.0000  
                                                                    
 time.to.cgvhd   
 Min.   : 0.820  
 1st Qu.: 3.848  
 Median : 6.735  
 Mean   :10.304  
 3rd Qu.:11.307  
 Max.   :65.120

Now let’s try running it on the whole data frame at once.

[click] You should see a summary of each of the 26 variables in the data, with a set of summary statistics printed under each. This is a great way to get a lot of information quickly. The summary function is clever enough to print different summary statsitics for different variable types — it won’t try to calculate the mean or median of a categorical variable (called a factor in R), for example, since that wouldn’t be meaningful.

The summary function checks what type of variable (numeric, categorical/factor, etc.) is in each column and returns sensible summary statistics for each. For a factor, you get the count for each level, and also a count of how many observations are missing. For numeric variables, you get the range (minimum and maximum), the 1st quartile, median, mean, 3rd quartile, and the count of missing observations.

In this dataframe, most of the data are numeric. There is one column, diagnosis, that shows up as character. If you look at the help documentation for the data set (?cytomegalovirus), you’ll see that diagnosis should actually be a factor with 13 levels. If it were a factor, we could get more useful summary statistics for it, too.

Convert to factor

Convert diagnosis column from character to factor:

cytomegalovirus$diagnosis <- as.factor(cytomegalovirus$diagnosis)

Re-run the summary for that column:

summary(cytomegalovirus$diagnosis)

acute lymphoblastic leukemia       acute myeloid leukemia 
                           1                           12 
             aplastic anemia chronic lymphocytic leukemia 
                           1                            5 
    chronic myeloid leukemia            congenital anemia 
                           4                            1 
            Hodgkin lymphoma            multiple myelomas 
                           3                            7 
    myelodysplastic syndrome                myelofibrosis 
                           9                            4 
 myeloproliferative disorder         non-Hodgkin lymphoma 
                           1                           12 
        renal cell carcinoma 
                           4

Convert to factor

cytomegalovirus <- cytomegalovirus |> 
  mutate(sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
         race = factor(race, levels = c(0, 1), labels = c("African-American", "White")),
         diagnosis.type = factor(diagnosis.type, levels = c(0, 1), labels = c("Lymphoid", "Myeloid")),
         prior.radiation = factor(prior.radiation, levels = c(0, 1), labels = c("No", "Yes")),
         prior.transplant = factor(prior.transplant, levels = c(0, 1), labels = c("No", "Yes")),
         donor.cmv = factor(donor.cmv, levels = c(0, 1), labels = c("Negative", "Positive")),
         recipient.cmv = factor(recipient.cmv, levels = c(0, 1), labels = c("Negative", "Positive")),
         donor.sex = factor(donor.sex, levels = c(0, 1), labels = c("Male", "Female")),
         `C1/C2` = factor(`C1/C2`, levels = c(0, 1), labels = c("Heterozygous", "Homozygous")),
         cmv = factor(cmv, levels = c(0, 1), labels = c("No", "Yes")),
         agvhd = factor(agvhd, levels = c(0, 1), labels = c("No", "Yes")),
         cgvhd = factor(cgvhd, levels = c(0, 1), labels = c("No", "Yes"))
         )

We could do it with individual commands using as.factor like we did for diagnosis, but instead we’ll take a look at another approach. Also note that this code is ready for you in the summary_stats_exercises.Rmd file — no need to try to type this all out now.

Let’s break that down. We’re using mutate to make changes to the columns in the dataframe. You can use mutate to create new columns, but in this case we’re just overwriting existing columns. Within the mutate command, we have a different argument for each column we want to change. They all take the form column_name = factor(column_name, levels = c(), labels = c()) where the levels are the existing values in the column already (e.g. 0 and 1) and labels are what we want them to show up as in the factor (e.g. “Female” and “Male”). Note that the order of levels and labels has to line up, so the first level will get the first label, and the second level will get the second label, etc. To learn more about the factor() command, check out its help documentation with ?factor.

Learn more

A few things from the above code that you might want to look into further:

Pipes! See 2nd edition R4DS section on pipes, which uses the new |> instead of %>%.
If you’re curious, a comparison of the new and old pipes.
More about mutate and data transformation in general in R4DS section on mutate

Troubleshooting

If you make a mistake modifying the data, how can you undo it?

If this were a dataset we read in from an external file (like a .csv), you could just read it in again to get a fresh copy. But how do you get a fresh copy of a built-in dataset?

To reset the data to its original state, run rm(cytomegalovirus) in the console. This will delete your current version of the data from R’s environment, and you’ll just be left with the original clean copy from the medicaldata package.

Summary again

Let’s run summary on the whole dataframe again to see the results now that we’ve cleaned up the column types.

summary(cytomegalovirus)

       ID             age            sex                   race   
 Min.   : 1.00   Min.   :29.00   Female:30   African-American: 6  
 1st Qu.:16.75   1st Qu.:46.00   Male  :34   White           :58  
 Median :32.50   Median :55.00                                    
 Mean   :32.50   Mean   :52.44                                    
 3rd Qu.:48.25   3rd Qu.:61.00                                    
 Max.   :64.00   Max.   :67.00                                    
                                                                  
                        diagnosis   diagnosis.type time.to.transplant
 acute myeloid leukemia      :12   Lymphoid:28     Min.   :  1.84    
 non-Hodgkin lymphoma        :12   Myeloid :30     1st Qu.:  8.15    
 myelodysplastic syndrome    : 9   NA's    : 6     Median : 13.70    
 multiple myelomas           : 7                   Mean   : 30.85    
 chronic lymphocytic leukemia: 5                   3rd Qu.: 33.28    
 chronic myeloid leukemia    : 4                   Max.   :173.83    
 (Other)                     :15                   NA's   :1         
 prior.radiation  prior.chemo    prior.transplant  recipient.cmv    donor.cmv 
 No :53          Min.   :0.000   No :52           Negative:24    Negative:26  
 Yes:11          1st Qu.:1.000   Yes:12           Positive:40    Positive:38  
                 Median :2.000                                                
                 Mean   :2.188                                                
                 3rd Qu.:3.000                                                
                 Max.   :8.000                                                
                                                                              
  donor.sex     TNC.dose       CD34.dose         CD3.dose       CD8.dose    
 Male  :32   Min.   : 2.06   Min.   : 2.040   Min.   :1.08   Min.   :0.160  
 Female:32   1st Qu.: 6.89   1st Qu.: 4.165   1st Qu.:3.08   1st Qu.:0.805  
             Median :10.48   Median : 5.840   Median :3.98   Median :1.200  
             Mean   :10.44   Mean   : 5.546   Mean   :4.30   Mean   :1.323  
             3rd Qu.:13.42   3rd Qu.: 6.990   3rd Qu.:5.48   3rd Qu.:1.663  
             Max.   :21.02   Max.   :12.510   Max.   :8.18   Max.   :3.190  
                                              NA's   :3      NA's   :12     
    TBI.dose              C1/C2        aKIRs        cmv      time.to.cmv    
 Min.   :200.0   Heterozygous:31   Min.   :1.000   No :38   Min.   : 0.430  
 1st Qu.:200.0   Homozygous  :33   1st Qu.:1.000   Yes:26   1st Qu.: 2.712  
 Median :200.0                     Median :2.000            Median : 5.030  
 Mean   :287.5                     Mean   :2.844            Mean   :13.206  
 3rd Qu.:400.0                     3rd Qu.:4.250            3rd Qu.:16.012  
 Max.   :400.0                     Max.   :6.000            Max.   :84.470  
                                                                            
 agvhd    time.to.agvhd    cgvhd    time.to.cgvhd   
 No :38   Min.   : 0.660   No :36   Min.   : 0.820  
 Yes:26   1st Qu.: 2.623   Yes:28   1st Qu.: 3.848  
          Median : 5.800            Median : 6.735  
          Mean   :15.605            Mean   :10.304  
          3rd Qu.:21.410            3rd Qu.:11.307  
          Max.   :85.190            Max.   :65.120

Individual summary statistics

What is the shortest time from diagnosis to transplant in the data?

min(cytomegalovirus$time.to.transplant)

[1] NA

Checking the help documentation

?min

From the help documentation for min:

Usage

max(..., na.rm = FALSE)

min(..., na.rm = FALSE)

Working around missing values

min(cytomegalovirus$time.to.transplant, na.rm = TRUE)

[1] 1.84

More individual statistics

What is the longest time from diagnosis to transplant?

max(cytomegalovirus$time.to.transplant, na.rm = TRUE)

[1] 173.83

What is the mean time to transplant?

mean(cytomegalovirus$time.to.transplant, na.rm = TRUE)

[1] 30.84556

More individual statistics:

median(cytomegalovirus$time.to.transplant, na.rm = TRUE)

[1] 13.7

var(cytomegalovirus$time.to.transplant, na.rm = TRUE) # variance

[1] 1638.105

sd(cytomegalovirus$time.to.transplant, na.rm = TRUE) # standard deviation

[1] 40.47351

Coding Challenge 1

Your turn!

Look in the summary_stats_exercises.rmd file to find your first coding challenge.

02:00

Counts

For categorical variables, summary statistics are just counts.

We’ll look at two ways to get the counts for a single categorical variable in R:

summary(cytomegalovirus$diagnosis.type)

Lymphoid  Myeloid     NA's 
      28       30        6

table(cytomegalovirus$diagnosis.type)


Lymphoid  Myeloid 
      28       30

table(cytomegalovirus$diagnosis.type, useNA = "always")


Lymphoid  Myeloid     <NA> 
      28       30        6

More counts

To get counts of two categorical variables together (e.g. what is the diagnosis type breakdown by recipient CMV status?), use cross-tabulation, also called “crosstabs” or “two-way frequency tables”.

xtabs(~cmv + diagnosis.type, 
      data = cytomegalovirus)

     diagnosis.type
cmv   Lymphoid Myeloid
  No        15      19
  Yes       13      11

Crosstabs

To get the results of a chi-squared test on the crosstabs, you can apply the summary function to an xtabs object.

Option 1:

# you can nest the summary and xtabs functions
summary(xtabs(~cmv + diagnosis.type, 
              data = cytomegalovirus))

Option 2:

# or you can save the xtabs object and then run summary() on it
cmv_and_diagnosistype <- xtabs(~cmv + diagnosis.type, 
                               data = cytomegalovirus)
summary(cmv_and_diagnosistype)

Call: xtabs(formula = ~cmv + diagnosis.type, data = cytomegalovirus)
Number of cases in table: 58 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 0.569, df = 1, p-value = 0.4507

Learn more

For beautifully printed crosstabs, see the tbl_cross function in the gtsummary package.

Coding Challenge 2

Your turn!

Look in the summary_stats_exercises.rmd file to find your next coding challenge.

02:00

Presenting Summary Statistics
in R

Table 1 from the published paper on the cytomegalovirus data. Columns are "1 to 4 aKIRs", "5 to 6 aKIRs" and "P value". There are rows for each of the following variables: age, sex, race, diagnosis category, months from diagnosis to treatment, prior radiation therapy, number of prior chemo regimens, prior transplants, recipient CMV seropositive, donor CMV seropositive, and recipient or donor CMV seropositive.

Table 1 summary statistics on the cytomegalovirus data

Creating Table 1 with `gtsummary`

We’ll recreate an approximation of the cytomegalovirus Table 1 using gtsummary:

Top half of Table 1 showing the column headers "1 to 4 aKIRs", "5 to 6 aKIRs" and "P value".

Recoding variables for the table

cytomegalovirus <- cytomegalovirus |> 
  mutate(aKIRs_groups = if_else(aKIRs > 4, 
                                "5 to 6 aKIRs", 
                                "1 to 4 aKIRs"),
         aKIRs_groups = factor(aKIRs_groups, 
                               levels = c("1 to 4 aKIRs", "5 to 6 aKIRs")),
         prior.chemo_groups = if_else(prior.chemo >= 4, 
                                      "≥ 4", 
                                      as.character(prior.chemo)),
         prior.chemo_groups = factor(prior.chemo_groups, 
                                     levels = c("0", "1", "2", "3", "≥ 4")))

The if_else function is what’s called a ternary operator, and it has three parts:

a conditional test
a value to use if the test returns TRUE
a value to use if the test returns FALSE.

Let’s break that down.

[click] We’re using the if_else function to recode the numeric variable aKIRs as one of two strings, “5 to 6 aKIRs” if it’s greater than 4 and “1 to 4 aKIRs” otherwise. [click] That will result in a character vector called aKIRs_groups, so then with the next line we turn it into a factor.

[click] Then we apply the same approach to the prior.chemo variable, first we make everything greater than or equal to 4 into “≥ 4”, and set all other values to as.character() on whatever their original values were — this changes the numbers into string versions, so 0 becomes “0”, 1 becomes “1”, etc. That creates a new variable called prior.chemo_groups, but it’s a character vector, so we turn it into a factor.

[click, click] The if_else function is what’s called a ternary operator.

In the aKIRs example, we first tested each value to see if it was > 4. If yes, we assigned it “5 to 6 aKIRs”, and if no we assigned it to “1 to 4 aKIRs”. Note that if_else preserves missing values, so anything that was NA in aKIRs will still be NA in aKIRs_groups.

Learn more

For more on using mutate and if_else, see see the R Basics: Data Transformation sections on mutate and logical operators to see more examples of the kinds of conditional tests you can put in.

Troubleshooting

Note that there are two different functions that look similar and do almost (but not quite!) exactly the same thing: if_else and ifelse.

if_else is part of the dplyr package, and ifelse is part of base R. This is a common point of confusion. There are a few differences in how the two functions work, but one of the big ones is that ifelse does NOT preserve missingness. It’s a subtle difference, but watch out!

Our first attempt at Table 1

Now we can use the new aKIRs_groups and prior.chemo_groups variables in our Table 1.

cytomegalovirus |>  
  select(aKIRs_groups, age, sex, race, 
         diagnosis.type, time.to.transplant, 
         prior.chemo_groups, prior.transplant, 
         cmv, donor.cmv) |> 
  tbl_summary(by = aKIRs_groups)

Our first attempt at Table 1

Characteristic	1 to 4 aKIRs, N = 48¹	5 to 6 aKIRs, N = 16¹
age	57 (45, 61)	54 (52, 60)
sex
Female	23 (48%)	7 (44%)
Male	25 (52%)	9 (56%)
race
African-American	5 (10%)	1 (6.3%)
White	43 (90%)	15 (94%)
diagnosis.type
Lymphoid	22 (51%)	6 (40%)
Myeloid	21 (49%)	9 (60%)
Unknown	5	1
time.to.transplant	16 (9, 35)	11 (5, 16)
Unknown	1	0
prior.chemo_groups
0	7 (15%)	3 (19%)
1	10 (21%)	5 (31%)
2	13 (27%)	4 (25%)
3	7 (15%)	2 (13%)
≥ 4	11 (23%)	2 (13%)
prior.transplant	10 (21%)	2 (13%)
cmv	23 (48%)	3 (19%)
donor.cmv
Negative	20 (42%)	6 (38%)
Positive	28 (58%)	10 (63%)
¹ Median (IQR); n (%)

Customizing a `gtsummary` table

This is a decent table! Notice there are a few things about this table that we might still want to change, for example:

row labels are taken from the variable names in the dataframe, so they’re lowercase and don’t use whitespace as we might like (e.g. “diagnosis.type” would be better as “Diagnosis Type”)
There’s no information about significance testing, like there is in the published table

Save a little typing…

table_data <- cytomegalovirus |>  
  select(aKIRs_groups, age, sex, race, 
         diagnosis.type, time.to.transplant, 
         prior.radiation, prior.chemo_groups, 
         prior.transplant, recipient.cmv, donor.cmv)

Add p-values

table_data |> 
  tbl_summary(by = aKIRs_groups) |> 
  add_p()

Characteristic	1 to 4 aKIRs, N = 48¹	5 to 6 aKIRs, N = 16¹	p-value²
age	57 (45, 61)	54 (52, 60)	0.7
sex			0.8
Female	23 (48%)	7 (44%)
Male	25 (52%)	9 (56%)
race			>0.9
African-American	5 (10%)	1 (6.3%)
White	43 (90%)	15 (94%)
diagnosis.type			0.5
Lymphoid	22 (51%)	6 (40%)
Myeloid	21 (49%)	9 (60%)
Unknown	5	1
time.to.transplant	16 (9, 35)	11 (5, 16)	0.2
Unknown	1	0
prior.radiation	9 (19%)	2 (13%)	0.7
prior.chemo_groups			0.9
0	7 (15%)	3 (19%)
1	10 (21%)	5 (31%)
2	13 (27%)	4 (25%)
3	7 (15%)	2 (13%)
≥ 4	11 (23%)	2 (13%)
prior.transplant	10 (21%)	2 (13%)	0.7
recipient.cmv			0.2
Negative	16 (33%)	8 (50%)
Positive	32 (67%)	8 (50%)
donor.cmv			0.8
Negative	20 (42%)	6 (38%)
Positive	28 (58%)	10 (63%)
¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test

Customize p-values

The help documentation (?add_p.tbl_summary) explains a little more thoroughly:

Tests default to “kruskal.test” for continuous variables (“wilcox.test” when “by” variable has two levels), “chisq.test.no.correct” for categorical variables with all expected cell counts >=5, and “fisher.test” for categorical variables with any expected cell count <5.

Customize p-values

Make the p values round to 2 digits instead of 1, and use t.test instead of wilcoxon rank sum test:

table_data |> 
  tbl_summary(by = aKIRs_groups) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2))

Customize p-values

Characteristic	1 to 4 aKIRs, N = 48¹	5 to 6 aKIRs, N = 16¹	p-value²
age	57 (45, 61)	54 (52, 60)	0.43
sex			0.77
Female	23 (48%)	7 (44%)
Male	25 (52%)	9 (56%)
race			0.62
African-American	5 (10%)	1 (6.3%)
White	43 (90%)	15 (94%)
diagnosis.type			0.46
Lymphoid	22 (51%)	6 (40%)
Myeloid	21 (49%)	9 (60%)
Unknown	5	1
time.to.transplant	16 (9, 35)	11 (5, 16)	0.86
Unknown	1	0
prior.radiation	9 (19%)	2 (13%)	0.57
prior.chemo_groups			0.85
0	7 (15%)	3 (19%)
1	10 (21%)	5 (31%)
2	13 (27%)	4 (25%)
3	7 (15%)	2 (13%)
≥ 4	11 (23%)	2 (13%)
prior.transplant	10 (21%)	2 (13%)	0.46
recipient.cmv			0.23
Negative	16 (33%)	8 (50%)
Positive	32 (67%)	8 (50%)
donor.cmv			0.77
Negative	20 (42%)	6 (38%)
Positive	28 (58%)	10 (63%)
¹ Median (IQR); n (%)
² Two Sample t-test; Pearson’s Chi-squared test

A little encouragement…

If you’re feeling like these arguments look confusing (why does it start with list()? What does the ~ mean??), you’re not alone!

As you get further away from the default settings in gtsummary (and other packages), your code is likely to become less intuitive and harder to read. That’s okay!

What was that error?

When you ran the above code, you may have noticed that R printed out several warnings with text like Warning for variable 'race': simpleWarning in stats::chisq.test(x = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, : Chi-squared approximation may be incorrect.

Clean up labels

One glaring issue that remains is the row labels. We don’t want labels like “prior.chemo_groups” in our final table!

We can clean those up with the label argument in the tbl_summary command:

table_data |> 
  tbl_summary(by = aKIRs_groups, 
              label = list(age ~ "Age at transplant (yrs)",
                           sex ~ "Sex",
                           race ~ "Race",
                           diagnosis.type ~ "Diagnostic category",
                           time.to.transplant ~ "Months from diagnosis to transplant",
                           prior.radiation ~ "Prior radiation therapy",
                           prior.chemo_groups ~ "Number of prior chemotherapy regimens",
                           prior.transplant ~ "Prior transplants",
                           recipient.cmv ~ "Recipient CMV seropositive",
                           donor.cmv ~ "Donor CMV seropositive")) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2))

Characteristic	1 to 4 aKIRs, N = 48¹	5 to 6 aKIRs, N = 16¹	p-value²
Age at transplant (yrs)	57 (45, 61)	54 (52, 60)	0.43
Sex			0.77
Female	23 (48%)	7 (44%)
Male	25 (52%)	9 (56%)
Race			0.62
African-American	5 (10%)	1 (6.3%)
White	43 (90%)	15 (94%)
Diagnostic category			0.46
Lymphoid	22 (51%)	6 (40%)
Myeloid	21 (49%)	9 (60%)
Unknown	5	1
Months from diagnosis to transplant	16 (9, 35)	11 (5, 16)	0.86
Unknown	1	0
Prior radiation therapy	9 (19%)	2 (13%)	0.57
Number of prior chemotherapy regimens			0.85
0	7 (15%)	3 (19%)
1	10 (21%)	5 (31%)
2	13 (27%)	4 (25%)
3	7 (15%)	2 (13%)
≥ 4	11 (23%)	2 (13%)
Prior transplants	10 (21%)	2 (13%)	0.46
Recipient CMV seropositive			0.23
Negative	16 (33%)	8 (50%)
Positive	32 (67%)	8 (50%)
Donor CMV seropositive			0.77
Negative	20 (42%)	6 (38%)
Positive	28 (58%)	10 (63%)
¹ Median (IQR); n (%)
² Two Sample t-test; Pearson’s Chi-squared test

Exporting a table

So how do you get the table from R to Word? You can save the table as a Word doc, which you can then open and copy into your article.

To save a table as a Word file, add the following two lines to the end of your table commands (you can change “my_table.docx” to be whatever you want the file name to be):

  as_flex_table() |> 
  flextable::save_as_docx(path = "my_table.docx")

Troubleshooting

Note that this requires a function from the flextable package. If you’ve never used flextable before, you may need to run install.packages("flextable") before this code will work.

Exporting a table

For example, to save the table above, you could do the following:

table_data |> 
  tbl_summary(by = aKIRs_groups, 
              label = list(age ~ "Age at transplant (yrs)",
                           sex ~ "Sex",
                           race ~ "Race",
                           diagnosis.type ~ "Diagnostic category",
                           time.to.transplant ~ "Months from diagnosis to transplant",
                           prior.radiation ~ "Prior radiation therapy",
                           prior.chemo_groups ~ "Number of prior chemotherapy regimens",
                           prior.transplant ~ "Prior transplants",
                           recipient.cmv ~ "Recipient CMV seropositive",
                           donor.cmv ~ "Donor CMV seropositive")) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2)) |> 
  as_flex_table() |> 
  flextable::save_as_docx(path = "my_table.docx")

Learn more

For a review of how to specify a path, see the explanation of absolute and relative paths in our Directories and File Paths module.

Coding Challenge 3: Homework!

For your final coding challenge, return to RStudio and follow the instructions there to create a Table 1 for a brand new data set, the laryngoscope data.

Here’s the published table you’ll be trying to recreate:

Table 1 from the published article on the laryngoscope data.

For inspiration (and lots of examples of things to try), check out the tbl_summary help documentation.

Coding Challenge 3: Homework!

Don’t struggle in silence!

Ask questions and share tips on the CHOPR slack
Come to R Office Hours to show off your progress and get help
There’s a solution available in summary_stats_solutions.Rmd, but you’ll learn a lot more if you try it yourself first

What we covered

getting example data from the medicaldata package
summary()
min(), max(), mean(), median(), var(), sd(), quantile()
table()
xtabs()
building tables with gtsummary package, specifically using tbl_summary()
a bunch of extras throughout on data transformations and data cleaning

Additional resources

For general background on statistics:

For more on the gtsummary package, see their website, especially the FAQ and gallery of example tables

Shameless Plug #1

Want to go through this material again? It’s posted as an interactive tutorial online as part of DART (Data and Analytics for Research Training)!

You can also explore the rest of the DART modules with the new module discovery app prototype

Shameless Plug #2

Have funding for your research project and interested in working with an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!

Summary Statistics in R

Join the CHOP R User Group

Come to R Office Hours!

R/Medicine is Accepting Abstracts!

Coming Soon - More R102 Sessions!

R 102: Summary Statistics

What we’re covering today

Thinking About Summary Statistics

Which statistic?

Central tendency

Variability

Learn more

Getting Summary Statistics in R

The packages we’ll be using today

Learn more

Load packages

The data

About these data

Learn more

Using summary()

Using summary()

Convert to factor

Convert to factor

Learn more

Troubleshooting

Summary again

Individual summary statistics

Checking the help documentation

Working around missing values

More individual statistics

Coding Challenge 1

Counts

More counts

Crosstabs

Learn more

Coding Challenge 2

Presenting Summary Statisticsin R

Creating Table 1 with gtsummary

Recoding variables for the table

Learn more

Troubleshooting

Our first attempt at Table 1

Our first attempt at Table 1

Customizing a gtsummary table

Save a little typing…

Add p-values

Customize p-values

Customize p-values

Customize p-values

A little encouragement…

What was that error?

Clean up labels

Exporting a table

Troubleshooting

Exporting a table

Learn more

Coding Challenge 3: Homework!

Coding Challenge 3: Homework!

What we covered

Additional resources

Shameless Plug #1

Shameless Plug #2

Summary Statistics
in R

R 102:
Summary Statistics

Using `summary()`

Using `summary()`

Presenting Summary Statistics
in R

Creating Table 1 with `gtsummary`

Customizing a `gtsummary` table