Summary Statistics
in R

Keith Baxelbaum, Rose Hartman, and
Alexis Zavez (presenter)

Data Science and Biostatistics Unit (DSBU) and
Arcus Education, DBHI

2024-04-08

  • Use keyboard arrow keys to
    • advance ( → ) and
    • go back ( ← )
  • Type “s” to see speaker notes
  • Type “?” to see other keyboard shortcuts

Join the CHOP R User Group

CHOPR hex sticker logo

  • Friendly help troubleshooting your R code
  • Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

  • Set up a meeting to get live help with your R code from our most experienced useRs
  • Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

R/Medicine is Accepting Abstracts!

  • R-based tools should have a substantial role in the work

  • The content of the presentation should be of interest to the R/Medicine community

CFP ends Monday, April 22 at 11:59 PM EDT

R/Medicine 2024: June 10-14 (virtual)

More info: https://rconsortium.github.io/RMedicine_website/

Coming Soon - More R102 Sessions!

This is the second talk in a new series called R102: MasteRing the Fundamentals


Next up: Reshaping Data with tidyr, May 6th 12:00pm ET


Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

R 102:
Summary Statistics

What we’re covering today

  • How to get summary stats in R for both continuous and categorical variables
  • How to create a publication-ready Table 1
  • NOT teaching the full analysis in R pipeline – we’re glossing over importing data and tidying data
  • NOT teaching formal statistical modeling (e.g. hypothesis testing)

Thinking About Summary Statistics

Summary statistics, also called descriptive statistics, serve a few important functions:

  • understand your data better during exploratory data analysis
  • give you an idea of what your study population was like at baseline
  • show preexisting differences between groups

Which statistic?

Categorical variables:

  • counts

Continuous variables:

  • central tendency
  • variability

Central tendency

Measures of central tendency:

  • mean
  • median

Variability

Measures of variability:

  • variance
  • standard deviation
  • interquartile range
  • range

Learn more

We won’t get into the mathematical definitions of each of these statistics here, but if you’d like a refresher, check out these excellent Khan academy videos:

Getting Summary Statistics in R

Two options:

  1. Work in the cloud: https://posit.cloud/content/6126628

  2. Work on your computer: https://github.com/arcus/education_r_environment

The packages we’ll be using today

tidyverse hex sticker logo. medicaldata hex sticker logo. gtsummary hex sticker logo.

Learn more

There’s a lot of helpful information (including examples and tutorials) on the package websites for each of the packages we’ll be using:

Load packages

Only if needed:

install.packages(c("tidyverse", "medicaldata", "gtsummary"))


Each R session:

library(tidyverse)
library(medicaldata) 
library(gtsummary)

The data

In the console or in the exercises rmd file, run the following command:

head(cytomegalovirus)
  ID age sex race                    diagnosis diagnosis.type
1  1  61   1    0       acute myeloid leukemia              1
2  2  62   1    1         non-Hodgkin lymphoma              0
3  3  63   0    1         non-Hodgkin lymphoma              0
4  4  33   0    1             Hodgkin lymphoma              0
5  5  54   0    1 acute lymphoblastic leukemia              0
6  6  55   1    1                myelofibrosis              1
  time.to.transplant prior.radiation prior.chemo prior.transplant recipient.cmv
1               5.16               0           2                0             1
2              79.05               1           3                0             0
3              35.58               0           4                0             1
4              33.02               1           4                0             1
5              11.40               0           5                0             1
6               2.43               0           0                0             1
  donor.cmv donor.sex TNC.dose CD34.dose CD3.dose CD8.dose TBI.dose C1/C2 aKIRs
1         0         0    18.31      2.29     3.21     0.95      200     0     1
2         0         1     4.26      2.04       NA       NA      200     1     5
3         1         0     8.09      6.97     2.19     0.59      200     0     3
4         0         1    21.02      6.09     4.87     2.32      200     0     2
5         1         0    14.70      2.36     6.55     2.40      400     0     6
6         1         1     4.29      6.91     2.53     0.86      200     0     2
  cmv time.to.cmv agvhd time.to.agvhd cgvhd time.to.cgvhd
1   1        3.91     1          3.55     0          6.28
2   0       65.12     0         65.12     0         65.12
3   0        3.75     0          3.75     0          3.75
4   0       48.49     1         28.55     1         10.45
5   0        4.37     1          2.79     0          4.37
6   1        4.53     1          3.88     0          6.87

About these data

To learn more about this dataset:

?cytomegalovirus

From the help documentation:

This data set contains 64 consecutive patients who underwent T-cell replete, matched sibling donor reduced-intensity conditioning allogeneic hematopoietic stem cell transplant. The primary risk factor of interest was the number of activating killer immunoglobulin-like receptors (aKIRs: 1-4 vs. 5-6).

Learn more

  • To learn more about the cytomegalovirus data and the study behind it, read Sobecks et al. 2011.
  • To learn more about the medicaldata R package these data are published in, see the medicaldata package website – and note that the maintainers are always looking for more data contributions!

Using summary()

First try running summary() on just the age column:

summary(cytomegalovirus$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  29.00   46.00   55.00   52.44   61.00   67.00 

Using summary()

Now try running summary on the whole dataframe all at once:

summary(cytomegalovirus)
       ID             age             sex              race       
 Min.   : 1.00   Min.   :29.00   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:16.75   1st Qu.:46.00   1st Qu.:0.0000   1st Qu.:1.0000  
 Median :32.50   Median :55.00   Median :1.0000   Median :1.0000  
 Mean   :32.50   Mean   :52.44   Mean   :0.5312   Mean   :0.9062  
 3rd Qu.:48.25   3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :64.00   Max.   :67.00   Max.   :1.0000   Max.   :1.0000  
                                                                  
  diagnosis         diagnosis.type   time.to.transplant prior.radiation 
 Length:64          Min.   :0.0000   Min.   :  1.84     Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.:  8.15     1st Qu.:0.0000  
 Mode  :character   Median :1.0000   Median : 13.70     Median :0.0000  
                    Mean   :0.5172   Mean   : 30.85     Mean   :0.1719  
                    3rd Qu.:1.0000   3rd Qu.: 33.28     3rd Qu.:0.0000  
                    Max.   :1.0000   Max.   :173.83     Max.   :1.0000  
                    NA's   :6        NA's   :1                          
  prior.chemo    prior.transplant recipient.cmv     donor.cmv     
 Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
 Median :2.000   Median :0.0000   Median :1.000   Median :1.0000  
 Mean   :2.188   Mean   :0.1875   Mean   :0.625   Mean   :0.5938  
 3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
 Max.   :8.000   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
                                                                  
   donor.sex      TNC.dose       CD34.dose         CD3.dose       CD8.dose    
 Min.   :0.0   Min.   : 2.06   Min.   : 2.040   Min.   :1.08   Min.   :0.160  
 1st Qu.:0.0   1st Qu.: 6.89   1st Qu.: 4.165   1st Qu.:3.08   1st Qu.:0.805  
 Median :0.5   Median :10.48   Median : 5.840   Median :3.98   Median :1.200  
 Mean   :0.5   Mean   :10.44   Mean   : 5.546   Mean   :4.30   Mean   :1.323  
 3rd Qu.:1.0   3rd Qu.:13.42   3rd Qu.: 6.990   3rd Qu.:5.48   3rd Qu.:1.663  
 Max.   :1.0   Max.   :21.02   Max.   :12.510   Max.   :8.18   Max.   :3.190  
                                                NA's   :3      NA's   :12     
    TBI.dose         C1/C2            aKIRs            cmv        
 Min.   :200.0   Min.   :0.0000   Min.   :1.000   Min.   :0.0000  
 1st Qu.:200.0   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :200.0   Median :1.0000   Median :2.000   Median :0.0000  
 Mean   :287.5   Mean   :0.5156   Mean   :2.844   Mean   :0.4062  
 3rd Qu.:400.0   3rd Qu.:1.0000   3rd Qu.:4.250   3rd Qu.:1.0000  
 Max.   :400.0   Max.   :1.0000   Max.   :6.000   Max.   :1.0000  
                                                                  
  time.to.cmv         agvhd        time.to.agvhd        cgvhd       
 Min.   : 0.430   Min.   :0.0000   Min.   : 0.660   Min.   :0.0000  
 1st Qu.: 2.712   1st Qu.:0.0000   1st Qu.: 2.623   1st Qu.:0.0000  
 Median : 5.030   Median :0.0000   Median : 5.800   Median :0.0000  
 Mean   :13.206   Mean   :0.4062   Mean   :15.605   Mean   :0.4375  
 3rd Qu.:16.012   3rd Qu.:1.0000   3rd Qu.:21.410   3rd Qu.:1.0000  
 Max.   :84.470   Max.   :1.0000   Max.   :85.190   Max.   :1.0000  
                                                                    
 time.to.cgvhd   
 Min.   : 0.820  
 1st Qu.: 3.848  
 Median : 6.735  
 Mean   :10.304  
 3rd Qu.:11.307  
 Max.   :65.120  
                 

Convert to factor

Convert diagnosis column from character to factor:

cytomegalovirus$diagnosis <- as.factor(cytomegalovirus$diagnosis)

Re-run the summary for that column:

summary(cytomegalovirus$diagnosis)
acute lymphoblastic leukemia       acute myeloid leukemia 
                           1                           12 
             aplastic anemia chronic lymphocytic leukemia 
                           1                            5 
    chronic myeloid leukemia            congenital anemia 
                           4                            1 
            Hodgkin lymphoma            multiple myelomas 
                           3                            7 
    myelodysplastic syndrome                myelofibrosis 
                           9                            4 
 myeloproliferative disorder         non-Hodgkin lymphoma 
                           1                           12 
        renal cell carcinoma 
                           4 

Convert to factor

cytomegalovirus <- cytomegalovirus |> 
  mutate(sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
         race = factor(race, levels = c(0, 1), labels = c("African-American", "White")),
         diagnosis.type = factor(diagnosis.type, levels = c(0, 1), labels = c("Lymphoid", "Myeloid")),
         prior.radiation = factor(prior.radiation, levels = c(0, 1), labels = c("No", "Yes")),
         prior.transplant = factor(prior.transplant, levels = c(0, 1), labels = c("No", "Yes")),
         donor.cmv = factor(donor.cmv, levels = c(0, 1), labels = c("Negative", "Positive")),
         recipient.cmv = factor(recipient.cmv, levels = c(0, 1), labels = c("Negative", "Positive")),
         donor.sex = factor(donor.sex, levels = c(0, 1), labels = c("Male", "Female")),
         `C1/C2` = factor(`C1/C2`, levels = c(0, 1), labels = c("Heterozygous", "Homozygous")),
         cmv = factor(cmv, levels = c(0, 1), labels = c("No", "Yes")),
         agvhd = factor(agvhd, levels = c(0, 1), labels = c("No", "Yes")),
         cgvhd = factor(cgvhd, levels = c(0, 1), labels = c("No", "Yes"))
         )

Learn more

A few things from the above code that you might want to look into further:

Troubleshooting

If you make a mistake modifying the data, how can you undo it?

If this were a dataset we read in from an external file (like a .csv), you could just read it in again to get a fresh copy. But how do you get a fresh copy of a built-in dataset?

To reset the data to its original state, run rm(cytomegalovirus) in the console. This will delete your current version of the data from R’s environment, and you’ll just be left with the original clean copy from the medicaldata package.

Summary again

Let’s run summary on the whole dataframe again to see the results now that we’ve cleaned up the column types.

summary(cytomegalovirus)
       ID             age            sex                   race   
 Min.   : 1.00   Min.   :29.00   Female:30   African-American: 6  
 1st Qu.:16.75   1st Qu.:46.00   Male  :34   White           :58  
 Median :32.50   Median :55.00                                    
 Mean   :32.50   Mean   :52.44                                    
 3rd Qu.:48.25   3rd Qu.:61.00                                    
 Max.   :64.00   Max.   :67.00                                    
                                                                  
                        diagnosis   diagnosis.type time.to.transplant
 acute myeloid leukemia      :12   Lymphoid:28     Min.   :  1.84    
 non-Hodgkin lymphoma        :12   Myeloid :30     1st Qu.:  8.15    
 myelodysplastic syndrome    : 9   NA's    : 6     Median : 13.70    
 multiple myelomas           : 7                   Mean   : 30.85    
 chronic lymphocytic leukemia: 5                   3rd Qu.: 33.28    
 chronic myeloid leukemia    : 4                   Max.   :173.83    
 (Other)                     :15                   NA's   :1         
 prior.radiation  prior.chemo    prior.transplant  recipient.cmv    donor.cmv 
 No :53          Min.   :0.000   No :52           Negative:24    Negative:26  
 Yes:11          1st Qu.:1.000   Yes:12           Positive:40    Positive:38  
                 Median :2.000                                                
                 Mean   :2.188                                                
                 3rd Qu.:3.000                                                
                 Max.   :8.000                                                
                                                                              
  donor.sex     TNC.dose       CD34.dose         CD3.dose       CD8.dose    
 Male  :32   Min.   : 2.06   Min.   : 2.040   Min.   :1.08   Min.   :0.160  
 Female:32   1st Qu.: 6.89   1st Qu.: 4.165   1st Qu.:3.08   1st Qu.:0.805  
             Median :10.48   Median : 5.840   Median :3.98   Median :1.200  
             Mean   :10.44   Mean   : 5.546   Mean   :4.30   Mean   :1.323  
             3rd Qu.:13.42   3rd Qu.: 6.990   3rd Qu.:5.48   3rd Qu.:1.663  
             Max.   :21.02   Max.   :12.510   Max.   :8.18   Max.   :3.190  
                                              NA's   :3      NA's   :12     
    TBI.dose              C1/C2        aKIRs        cmv      time.to.cmv    
 Min.   :200.0   Heterozygous:31   Min.   :1.000   No :38   Min.   : 0.430  
 1st Qu.:200.0   Homozygous  :33   1st Qu.:1.000   Yes:26   1st Qu.: 2.712  
 Median :200.0                     Median :2.000            Median : 5.030  
 Mean   :287.5                     Mean   :2.844            Mean   :13.206  
 3rd Qu.:400.0                     3rd Qu.:4.250            3rd Qu.:16.012  
 Max.   :400.0                     Max.   :6.000            Max.   :84.470  
                                                                            
 agvhd    time.to.agvhd    cgvhd    time.to.cgvhd   
 No :38   Min.   : 0.660   No :36   Min.   : 0.820  
 Yes:26   1st Qu.: 2.623   Yes:28   1st Qu.: 3.848  
          Median : 5.800            Median : 6.735  
          Mean   :15.605            Mean   :10.304  
          3rd Qu.:21.410            3rd Qu.:11.307  
          Max.   :85.190            Max.   :65.120  
                                                    

Individual summary statistics

What is the shortest time from diagnosis to transplant in the data?

min(cytomegalovirus$time.to.transplant)
[1] NA

Checking the help documentation

?min

From the help documentation for min:

Usage

max(..., na.rm = FALSE)

min(..., na.rm = FALSE)

Working around missing values

min(cytomegalovirus$time.to.transplant, na.rm = TRUE)
[1] 1.84

More individual statistics

What is the longest time from diagnosis to transplant?

max(cytomegalovirus$time.to.transplant, na.rm = TRUE)
[1] 173.83

What is the mean time to transplant?

mean(cytomegalovirus$time.to.transplant, na.rm = TRUE)
[1] 30.84556

More individual statistics:

median(cytomegalovirus$time.to.transplant, na.rm = TRUE)
[1] 13.7
var(cytomegalovirus$time.to.transplant, na.rm = TRUE) # variance
[1] 1638.105
sd(cytomegalovirus$time.to.transplant, na.rm = TRUE) # standard deviation
[1] 40.47351

Coding Challenge 1

Your turn!

Look in the summary_stats_exercises.rmd file to find your first coding challenge.

02:00

Counts

For categorical variables, summary statistics are just counts.

We’ll look at two ways to get the counts for a single categorical variable in R:

summary(cytomegalovirus$diagnosis.type)
Lymphoid  Myeloid     NA's 
      28       30        6 
table(cytomegalovirus$diagnosis.type)

Lymphoid  Myeloid 
      28       30 
table(cytomegalovirus$diagnosis.type, useNA = "always")

Lymphoid  Myeloid     <NA> 
      28       30        6 

More counts

To get counts of two categorical variables together (e.g. what is the diagnosis type breakdown by recipient CMV status?), use cross-tabulation, also called “crosstabs” or “two-way frequency tables”.

xtabs(~cmv + diagnosis.type, 
      data = cytomegalovirus)
     diagnosis.type
cmv   Lymphoid Myeloid
  No        15      19
  Yes       13      11

Crosstabs

To get the results of a chi-squared test on the crosstabs, you can apply the summary function to an xtabs object.

Option 1:

# you can nest the summary and xtabs functions
summary(xtabs(~cmv + diagnosis.type, 
              data = cytomegalovirus))

Option 2:

# or you can save the xtabs object and then run summary() on it
cmv_and_diagnosistype <- xtabs(~cmv + diagnosis.type, 
                               data = cytomegalovirus)
summary(cmv_and_diagnosistype)
Call: xtabs(formula = ~cmv + diagnosis.type, data = cytomegalovirus)
Number of cases in table: 58 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 0.569, df = 1, p-value = 0.4507

Learn more

For beautifully printed crosstabs, see the tbl_cross function in the gtsummary package.

Coding Challenge 2

Your turn!

Look in the summary_stats_exercises.rmd file to find your next coding challenge.

02:00

Presenting Summary Statistics
in R

Table 1 from the published paper on the cytomegalovirus data. Columns are "1 to 4 aKIRs", "5 to 6 aKIRs" and "P value". There are rows for each of the following variables: age, sex, race, diagnosis category, months from diagnosis to treatment, prior radiation therapy, number of prior chemo regimens, prior transplants, recipient CMV seropositive, donor CMV seropositive, and recipient or donor CMV seropositive.

Table 1 summary statistics on the cytomegalovirus data

Creating Table 1 with gtsummary

We’ll recreate an approximation of the cytomegalovirus Table 1 using gtsummary:

Top half of Table 1 showing the column headers "1 to 4 aKIRs", "5 to 6 aKIRs" and "P value".

Recoding variables for the table

cytomegalovirus <- cytomegalovirus |> 
  mutate(aKIRs_groups = if_else(aKIRs > 4, 
                                "5 to 6 aKIRs", 
                                "1 to 4 aKIRs"),
         aKIRs_groups = factor(aKIRs_groups, 
                               levels = c("1 to 4 aKIRs", "5 to 6 aKIRs")),
         prior.chemo_groups = if_else(prior.chemo >= 4, 
                                      "≥ 4", 
                                      as.character(prior.chemo)),
         prior.chemo_groups = factor(prior.chemo_groups, 
                                     levels = c("0", "1", "2", "3", "≥ 4")))

The if_else function is what’s called a ternary operator, and it has three parts:

  • a conditional test
  • a value to use if the test returns TRUE
  • a value to use if the test returns FALSE.

Learn more

For more on using mutate and if_else, see see the R Basics: Data Transformation sections on mutate and logical operators to see more examples of the kinds of conditional tests you can put in.

Troubleshooting

Note that there are two different functions that look similar and do almost (but not quite!) exactly the same thing: if_else and ifelse.

if_else is part of the dplyr package, and ifelse is part of base R. This is a common point of confusion. There are a few differences in how the two functions work, but one of the big ones is that ifelse does NOT preserve missingness. It’s a subtle difference, but watch out!

Our first attempt at Table 1

Now we can use the new aKIRs_groups and prior.chemo_groups variables in our Table 1.

cytomegalovirus |>  
  select(aKIRs_groups, age, sex, race, 
         diagnosis.type, time.to.transplant, 
         prior.chemo_groups, prior.transplant, 
         cmv, donor.cmv) |> 
  tbl_summary(by = aKIRs_groups) 

Our first attempt at Table 1

Characteristic 1 to 4 aKIRs, N = 481 5 to 6 aKIRs, N = 161
age 57 (45, 61) 54 (52, 60)
sex

    Female 23 (48%) 7 (44%)
    Male 25 (52%) 9 (56%)
race

    African-American 5 (10%) 1 (6.3%)
    White 43 (90%) 15 (94%)
diagnosis.type

    Lymphoid 22 (51%) 6 (40%)
    Myeloid 21 (49%) 9 (60%)
    Unknown 5 1
time.to.transplant 16 (9, 35) 11 (5, 16)
    Unknown 1 0
prior.chemo_groups

    0 7 (15%) 3 (19%)
    1 10 (21%) 5 (31%)
    2 13 (27%) 4 (25%)
    3 7 (15%) 2 (13%)
    ≥ 4 11 (23%) 2 (13%)
prior.transplant 10 (21%) 2 (13%)
cmv 23 (48%) 3 (19%)
donor.cmv

    Negative 20 (42%) 6 (38%)
    Positive 28 (58%) 10 (63%)
1 Median (IQR); n (%)

Customizing a gtsummary table

This is a decent table! Notice there are a few things about this table that we might still want to change, for example:

  • row labels are taken from the variable names in the dataframe, so they’re lowercase and don’t use whitespace as we might like (e.g. “diagnosis.type” would be better as “Diagnosis Type”)
  • There’s no information about significance testing, like there is in the published table

Save a little typing…

table_data <- cytomegalovirus |>  
  select(aKIRs_groups, age, sex, race, 
         diagnosis.type, time.to.transplant, 
         prior.radiation, prior.chemo_groups, 
         prior.transplant, recipient.cmv, donor.cmv)

Add p-values

table_data |> 
  tbl_summary(by = aKIRs_groups) |> 
  add_p()
Characteristic 1 to 4 aKIRs, N = 481 5 to 6 aKIRs, N = 161 p-value2
age 57 (45, 61) 54 (52, 60) 0.7
sex

0.8
    Female 23 (48%) 7 (44%)
    Male 25 (52%) 9 (56%)
race

>0.9
    African-American 5 (10%) 1 (6.3%)
    White 43 (90%) 15 (94%)
diagnosis.type

0.5
    Lymphoid 22 (51%) 6 (40%)
    Myeloid 21 (49%) 9 (60%)
    Unknown 5 1
time.to.transplant 16 (9, 35) 11 (5, 16) 0.2
    Unknown 1 0
prior.radiation 9 (19%) 2 (13%) 0.7
prior.chemo_groups

0.9
    0 7 (15%) 3 (19%)
    1 10 (21%) 5 (31%)
    2 13 (27%) 4 (25%)
    3 7 (15%) 2 (13%)
    ≥ 4 11 (23%) 2 (13%)
prior.transplant 10 (21%) 2 (13%) 0.7
recipient.cmv

0.2
    Negative 16 (33%) 8 (50%)
    Positive 32 (67%) 8 (50%)
donor.cmv

0.8
    Negative 20 (42%) 6 (38%)
    Positive 28 (58%) 10 (63%)
1 Median (IQR); n (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test; Fisher’s exact test

Customize p-values

The help documentation (?add_p.tbl_summary) explains a little more thoroughly:

Tests default to “kruskal.test” for continuous variables (“wilcox.test” when “by” variable has two levels), “chisq.test.no.correct” for categorical variables with all expected cell counts >=5, and “fisher.test” for categorical variables with any expected cell count <5.

Customize p-values

Make the p values round to 2 digits instead of 1, and use t.test instead of wilcoxon rank sum test:

table_data |> 
  tbl_summary(by = aKIRs_groups) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2))

Customize p-values

Characteristic 1 to 4 aKIRs, N = 481 5 to 6 aKIRs, N = 161 p-value2
age 57 (45, 61) 54 (52, 60) 0.43
sex

0.77
    Female 23 (48%) 7 (44%)
    Male 25 (52%) 9 (56%)
race

0.62
    African-American 5 (10%) 1 (6.3%)
    White 43 (90%) 15 (94%)
diagnosis.type

0.46
    Lymphoid 22 (51%) 6 (40%)
    Myeloid 21 (49%) 9 (60%)
    Unknown 5 1
time.to.transplant 16 (9, 35) 11 (5, 16) 0.86
    Unknown 1 0
prior.radiation 9 (19%) 2 (13%) 0.57
prior.chemo_groups

0.85
    0 7 (15%) 3 (19%)
    1 10 (21%) 5 (31%)
    2 13 (27%) 4 (25%)
    3 7 (15%) 2 (13%)
    ≥ 4 11 (23%) 2 (13%)
prior.transplant 10 (21%) 2 (13%) 0.46
recipient.cmv

0.23
    Negative 16 (33%) 8 (50%)
    Positive 32 (67%) 8 (50%)
donor.cmv

0.77
    Negative 20 (42%) 6 (38%)
    Positive 28 (58%) 10 (63%)
1 Median (IQR); n (%)
2 Two Sample t-test; Pearson’s Chi-squared test

A little encouragement…

If you’re feeling like these arguments look confusing (why does it start with list()? What does the ~ mean??), you’re not alone!

As you get further away from the default settings in gtsummary (and other packages), your code is likely to become less intuitive and harder to read. That’s okay!

What was that error?

When you ran the above code, you may have noticed that R printed out several warnings with text like Warning for variable 'race': simpleWarning in stats::chisq.test(x = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, : Chi-squared approximation may be incorrect.

Clean up labels

One glaring issue that remains is the row labels. We don’t want labels like “prior.chemo_groups” in our final table!

We can clean those up with the label argument in the tbl_summary command:

table_data |> 
  tbl_summary(by = aKIRs_groups, 
              label = list(age ~ "Age at transplant (yrs)",
                           sex ~ "Sex",
                           race ~ "Race",
                           diagnosis.type ~ "Diagnostic category",
                           time.to.transplant ~ "Months from diagnosis to transplant",
                           prior.radiation ~ "Prior radiation therapy",
                           prior.chemo_groups ~ "Number of prior chemotherapy regimens",
                           prior.transplant ~ "Prior transplants",
                           recipient.cmv ~ "Recipient CMV seropositive",
                           donor.cmv ~ "Donor CMV seropositive")) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2))
Characteristic 1 to 4 aKIRs, N = 481 5 to 6 aKIRs, N = 161 p-value2
Age at transplant (yrs) 57 (45, 61) 54 (52, 60) 0.43
Sex

0.77
    Female 23 (48%) 7 (44%)
    Male 25 (52%) 9 (56%)
Race

0.62
    African-American 5 (10%) 1 (6.3%)
    White 43 (90%) 15 (94%)
Diagnostic category

0.46
    Lymphoid 22 (51%) 6 (40%)
    Myeloid 21 (49%) 9 (60%)
    Unknown 5 1
Months from diagnosis to transplant 16 (9, 35) 11 (5, 16) 0.86
    Unknown 1 0
Prior radiation therapy 9 (19%) 2 (13%) 0.57
Number of prior chemotherapy regimens

0.85
    0 7 (15%) 3 (19%)
    1 10 (21%) 5 (31%)
    2 13 (27%) 4 (25%)
    3 7 (15%) 2 (13%)
    ≥ 4 11 (23%) 2 (13%)
Prior transplants 10 (21%) 2 (13%) 0.46
Recipient CMV seropositive

0.23
    Negative 16 (33%) 8 (50%)
    Positive 32 (67%) 8 (50%)
Donor CMV seropositive

0.77
    Negative 20 (42%) 6 (38%)
    Positive 28 (58%) 10 (63%)
1 Median (IQR); n (%)
2 Two Sample t-test; Pearson’s Chi-squared test

Exporting a table

So how do you get the table from R to Word? You can save the table as a Word doc, which you can then open and copy into your article.

To save a table as a Word file, add the following two lines to the end of your table commands (you can change “my_table.docx” to be whatever you want the file name to be):

  as_flex_table() |> 
  flextable::save_as_docx(path = "my_table.docx")

Troubleshooting

Note that this requires a function from the flextable package. If you’ve never used flextable before, you may need to run install.packages("flextable") before this code will work.

Exporting a table

For example, to save the table above, you could do the following:

table_data |> 
  tbl_summary(by = aKIRs_groups, 
              label = list(age ~ "Age at transplant (yrs)",
                           sex ~ "Sex",
                           race ~ "Race",
                           diagnosis.type ~ "Diagnostic category",
                           time.to.transplant ~ "Months from diagnosis to transplant",
                           prior.radiation ~ "Prior radiation therapy",
                           prior.chemo_groups ~ "Number of prior chemotherapy regimens",
                           prior.transplant ~ "Prior transplants",
                           recipient.cmv ~ "Recipient CMV seropositive",
                           donor.cmv ~ "Donor CMV seropositive")) |> 
  add_p(test = list(all_continuous() ~ "t.test",
                    all_categorical() ~ "chisq.test.no.correct"),
        test.args = all_continuous() ~ list(var.equal = TRUE),
        pvalue_fun = function(x) style_pvalue(x, digits = 2)) |> 
  as_flex_table() |> 
  flextable::save_as_docx(path = "my_table.docx")

Learn more

For a review of how to specify a path, see the explanation of absolute and relative paths in our Directories and File Paths module.

Coding Challenge 3: Homework!

For your final coding challenge, return to RStudio and follow the instructions there to create a Table 1 for a brand new data set, the laryngoscope data.

Here’s the published table you’ll be trying to recreate:

Table 1 from the published article on the laryngoscope data.

For inspiration (and lots of examples of things to try), check out the tbl_summary help documentation.

Coding Challenge 3: Homework!

Don’t struggle in silence!

  • Ask questions and share tips on the CHOPR slack
  • Come to R Office Hours to show off your progress and get help
  • There’s a solution available in summary_stats_solutions.Rmd, but you’ll learn a lot more if you try it yourself first

What we covered

  • getting example data from the medicaldata package
  • summary()
  • min(), max(), mean(), median(), var(), sd(), quantile()
  • table()
  • xtabs()
  • building tables with gtsummary package, specifically using tbl_summary()
  • a bunch of extras throughout on data transformations and data cleaning

Additional resources

For general background on statistics:

For more on the gtsummary package, see their website, especially the FAQ and gallery of example tables

Shameless Plug #1

Want to go through this material again? It’s posted as an interactive tutorial online as part of DART (Data and Analytics for Research Training)!

You can also explore the rest of the DART modules with the new module discovery app prototype

Shameless Plug #2

Have funding for your research project and interested in working with an experienced biostatistician or data scientist to analyze your data?

The Data Science and Biostatistics Unit (DSBU) is DBHi’s and CHOP Research Institute’s centralized service unit for biostatistics and data science analysis support. Reach out to Alexis Zavez (zaveza@chop.edu) or Keith Baxelbaum (baxelbaumk@chop.edu) for more info!