Missing Values in R

Rose Hartman

Arcus Education, DBHi

2024-03-04

Use keyboard arrow keys to
- advance ( → ) and
- go back ( ← )
Type “s” to see speaker notes
Type “?” to see other keyboard shortcuts

Join the CHOP R User Group

Friendly help troubleshooting your R code
Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

Set up a meeting to get live help with your R code from our most experienced useRs
Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

Coming soon!

This is the first talk in a new series called R102: MasteRing the Fundamentals

Next up: Summary Statistics in R, April 8th 12:00pm ET

Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

Missing Values in R

Okay! So this talk is a quick dip into some 102-level R. In other words, if you’ve had a little exposure to R before, such as through an Intro to R for Clinical Data workshop, this is hopefully the right level for you now. If you’re completely brand new to R, first of all: Welcome! You may find it tricky to actively follow along with the code today since I’m going to skim over some of the initial steps, but go ahead and give it a try, or just listen and watch if that feels more like the right speed.

Our topic is missing values in R, one of the major stumbling blocks folks typically encounter when they start trying to use R to analyze data “in the wild”. Real data are messy, and missingness is one of the main kinds of mess you’ll have to deal with!

If you have questions during the talk, feel free to put them in the chat. We have several very friendly R experts on the call today helping out, so they may be able to chime in and answer your questions; otherwise, I can.

What we’re covering today

How to check the number and location of missing values in a dataframe
How to mark values as missing
How to use common arguments like na.rm and na.action to control how functions handle missingness
How to remove cases with missing values from a dataframe
NOT teaching statistical remedies for missingness, like imputation (but ask me about that later if you’re curious!)

Why check for missingness?

Checking for missing data can help you know whether the data were read in correctly
Missingness will also impact your effective sample size
If your analysis will involve fixing missingness statistically, the first step is always describing the missingness

Okay, so why would you want to check for missingness? [CLICK] Checking missing values can help you check whether the data were read in correctly. In many cases, you know beforehand whether there should be any missing values on particular variables in your data because you know about how the data were collected, etc. After you import the data into R, checking for missingness can give you a sense for whether the data were imported correctly. For example, if you were expecting some missingness on a given variable and you see none, it can give you a hint that the missing values in the raw data aren’t being correctly interpreted by R.

[CLICK] Missingness will also impact your effective sample size. When you have missing values, the sample size that you can actually use for your analysis is often reduced. For example, if you collect 100 samples, but 20 of them are missing at least some of the measurements, you might only have 80 complete samples you can analyze.
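
To make that arithmetic concrete, here is a minimal sketch using a made-up data frame (the `samples` data and its column names are hypothetical, invented only for illustration). Base R's `complete.cases()` flags the rows that have no missing values in any column, so summing it gives the number of fully observed rows.

```r
# Hypothetical measurements: 5 samples, 2 of which are missing something
samples <- data.frame(
  id     = 1:5,
  weight = c(12.1, NA, 9.8, 11.4, 10.0),
  height = c(50.2, 48.9, NA, 51.0, 49.5)
)

n_collected <- nrow(samples)                 # all rows, complete or not
n_complete  <- sum(complete.cases(samples))  # rows with no NA in any column

n_collected  # 5
n_complete   # 3
```

Whether the smaller number is the one that matters depends on the analysis: a function that only uses `weight` would lose just one row here, not two.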

[CLICK] And lastly, if your analysis will involve fixing missingness statistically, the first step is always describing the missingness. If you have substantial missingness (a good rule of thumb is to treat 5% or more of your data being missing as substantial), then you may need special statistical techniques to be able to analyze the data without introducing bias. As I said, such techniques are outside the scope of this talk, unfortunately, but there are lots of excellent papers on this, as well as tutorials with instructions.

Learn more

For an excellent introduction to different types of missing data and how to handle them statistically, read Rubin’s classic paper Inference and Missing Data.

What does “missing” look like in R?

NA

Rarely:

NA_integer_
NA_real_
NA_complex_
NA_character_
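
If you are curious how these differ, the short sketch below uses base R's `typeof()` to show the type behind each kind of `NA`. The vector `x` is a made-up example; in everyday code you can usually just write `NA` and R will convert it to fit the vector it ends up in.

```r
# Plain NA is a logical value; the typed versions match other vector types
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_complex_)    # "complex"
typeof(NA_character_)  # "character"

# R converts a plain NA to match the rest of the vector
x <- c(1.5, NA, 3)     # hypothetical values
typeof(x)              # "double"
is.na(x)               # FALSE TRUE FALSE
```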

For example

Here’s an example of what some data with missing values might look like when printed in R:

sensor_id	PM2.5	PM10	O3	NO2
0001	10	25	0.0	67
0002	13	21	NA	71
0003	9	NA	NA	64

Learn more

What about NULL and NaN?

If you’d like to learn more, check out this blog post explaining the difference between NA and NULL and the missing values chapter of R for Data Science (2e).

If you’re just beginning in R, you can safely ignore the differences between NA, NaN, and NULL for now.

How to check for missing values

Open the data in the Data Viewer and scan visually for NAs (to see examples of how to use View(), see this tutorial on RStudio’s data viewer)
summary()
Many more options! For some handy visualizations of missingness, see the visdat package, and the missmap function from the Amelia package.

[CLICK] One way to see missing data in R is to take a look at the full dataset, either by printing it in the console or using View() to see it opened like a spreadsheet. Then you can just scan visually for NA cells, as in the example table in the previous section. If your dataset is anything larger than a handful of rows and columns, though, you’ll want a way to summarize that information without having to count everything by hand yourself.

[CLICK] There are many functions available to check for missing values in R, but one especially handy one is the summary() function. It gives you some basic summary information about each variable in your data (minimum and maximum values, etc.), and will also tell you how many missing values you have for each variable. Because it gives you summary statistics and missingness information at the same time, many people like to use it as a quick way to check their data as a start to their exploratory data analysis.

[CLICK] There are also many more options for exploring your missing data! I like the visdat and Amelia packages in particular, but there are more. I think summary() is a great place to start, though, so that’s what we’ll focus on today.

Working with missing values in R

Two options:

Work in the cloud: https://posit.cloud/content/7522885
Work on your computer: https://github.com/arcus/r102

Okay, enough preamble, it’s time to start coding! By far the best way to learn R is to practice, so work through this code yourself as you follow along.

This link will take you to Posit Cloud, which gives you a way to work with the code right in your browser without having to install anything on your machine. You will need to create a free account if you don’t already have one. I’ll click that link now, and log in with a different account than the one I used to create it so you can see what it looks like. It will take a few minutes to load.

You can also get all of the code for this talk directly from our GitHub and download it to work on your own machine. If you want to go this route, go to our GitHub repo and then find the green “Code” button. If you click it, you’ll see you have several options, one of which is downloading a zip file. Click that and it will download all the files you need for this talk. Once it’s done downloading, double click it to unzip the file. If you’re comfortable using git, you can also clone the repo, or fork it if you’d like a personal copy. And if you don’t know what cloning and forking are, no worries! Just use the zip file.

Learn more

Learn more about the tidyverse packages on the tidyverse website!

Load packages

Only if needed:

install.packages("tidyverse")

Each R session:

library(tidyverse)

Open the file missing_values_exercises.rmd in the exercises folder.

The data

In the console or in the exercises rmd file, run the following command:

head(msleep)

                        name      genus  vore        order conservation
1                    Cheetah   Acinonyx carni    Carnivora           lc
2                 Owl monkey      Aotus  omni     Primates         <NA>
3            Mountain beaver Aplodontia herbi     Rodentia           nt
4 Greater short-tailed shrew    Blarina  omni Soricomorpha           lc
5                        Cow        Bos herbi Artiodactyla domesticated
6           Three-toed sloth   Bradypus herbi       Pilosa         <NA>
  sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1        12.1        NA          NA  11.9      NA  50.000
2        17.0       1.8          NA   7.0 0.01550   0.480
3        14.4       2.4          NA   9.6      NA   1.350
4        14.9       2.3   0.1333333   9.1 0.00029   0.019
5         4.0       0.7   0.6666667  20.0 0.42300 600.000
6        14.4       2.2   0.7666667   9.6      NA   3.850

This command is also written out for you in the missing_values_exercises.rmd file, in the next code chunk.

Let’s take a look at the data. [CLICK] You should see the first six rows of the msleep data frame, which look like this. Note that this is one of the example datasets that comes built in when you install the tidyverse (it ships with the ggplot2 package), so it’s already available to you without you having to read it in or download anything.

For those of you that have worked in R before, you know importing data is a whole thing, so we’re definitely skipping over a potentially tricky bit by using built-in data, but we only have so much time today and I wanted to be able to put as much time as possible towards actually working with the missing values. So we’re just merrily skipping past all the importing and tidying that would normally happen.

About these data

To learn more about this dataset:

?msleep

How to check for missing values

summary(msleep)

     name              genus               vore              order          
 Length:83          Length:83          Length:83          Length:83         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 conservation        sleep_total      sleep_rem      sleep_cycle    
 Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
 Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
 Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
                    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
                    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
                    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
                                    NA's   :22      NA's   :51      
     awake          brainwt            bodywt        
 Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
 1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
 Median :13.90   Median :0.01240   Median :   1.670  
 Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
 3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
 Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
                 NA's   :27

So let’s run the summary() function now, to see how many missing values we have for each of the variables in this data frame.

[CLICK] The first thing to notice here is that R is actually giving us different information for each variable depending on whether it’s a character or numeric variable. The first few variables are character variables, and it doesn’t try to print things like minimum and maximum for characters because it doesn’t make sense to find the minimum of text. But look at the later variables, starting with sleep_total.

Notice the bottom row of the summary statistics for sleep_rem, sleep_cycle and brainwt. It gives the count of NA values for each variable. The other numeric variables (sleep_total, awake, and bodywt) don’t show anything for the NA count, which means they have no missing values.

The summary command is most useful for numeric and factor variables. It doesn’t show us anything useful for the character variables. We can get better output for those by converting them to factors, though.
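
If all you want is the count of missing values per column, character columns included, one common alternative to `summary()` is to combine base R's `is.na()` and `colSums()`. This is only a sketch of one option, not the approach used in the rest of the talk. It reads the data as `ggplot2::msleep` so that it always uses the untouched built-in copy.

```r
# is.na() marks every cell TRUE or FALSE; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(ggplot2::msleep))
na_counts

# Show only the columns that have at least one missing value
na_counts[na_counts > 0]
```

The counts for sleep_rem, sleep_cycle, and brainwt should match the NA's row in the `summary()` output above.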

Come back to these slides later if you like :)

Remember, if you’re viewing these slides online, you can hit s on your keyboard to show the speaker notes.

Convert to factor

We could convert those variables to factors with a mutate command that names each of the columns we want to convert, like this:

msleep <- msleep |>
  mutate(name = as.factor(name),
         genus = as.factor(genus),
         vore = as.factor(vore),
         order = as.factor(order),
         conservation = as.factor(conservation))

Convert to factor

But in cases like this where we want to convert several variables all in the same way, we can do it faster with the across command:

msleep <- msleep |>
  mutate(across(
    where(is.character), 
    as.factor))

Let’s break that down. We’re using the across command to do the same thing across several columns.

[CLICK] The first argument of across needs to tell it which columns to use. Here, we’re telling it which columns to use with where(is.character) — that will check each column against the test is.character and return TRUE if the column is of type character and FALSE if it’s anything else. where(is.character) will therefore give us a list of all of the columns that are character columns. [CLICK] The second argument of across is what you want done to those columns. Here, we’re saying we want it to apply the function as.factor. [CLICK] So taken together, this mutate command will pick all the columns from the data that are of type character and convert them to factor. Handy!

Note that either way of converting these columns (either individually, or all at once using the across function) will result in the same clean dataframe.
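
If you would like to convince yourself of that, the sketch below runs both versions on a fresh copy of the data and compares the results. The object names `fresh`, `one_by_one`, and `all_at_once` are made up for this illustration.

```r
library(dplyr)

fresh <- ggplot2::msleep  # untouched copy of the built-in data

# Option 1: one as.factor() call per column
one_by_one <- fresh |>
  mutate(name = as.factor(name),
         genus = as.factor(genus),
         vore = as.factor(vore),
         order = as.factor(order),
         conservation = as.factor(conservation))

# Option 2: across() applied to every character column
all_at_once <- fresh |>
  mutate(across(where(is.character), as.factor))

# Compare the two results, then check the class of each column
all.equal(one_by_one, all_at_once)
sapply(all_at_once, class)
```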

Learn more

A few things from the above code that you might want to look into further:

Pipes! See the 2nd edition R4DS section on pipes
If you’re curious, a comparison of the new (|>) and old (%>%) pipes
More about mutate and data transformation in general in R4DS section on mutate
More about across in the tidyverse “colwise” vignette

Summary again

Now let’s try summary again to see if we get more informative results for those first few columns:

summary(msleep)

                        name             genus         vore   
 African elephant         : 1   Panthera    : 3   carni  :19  
 African giant pouched rat: 1   Spermophilus: 3   herbi  :32  
 African striped mouse    : 1   Equus       : 2   insecti: 5  
 Arctic fox               : 1   Vulpes      : 2   omni   :20  
 Arctic ground squirrel   : 1   Acinonyx    : 1   NA's   : 7  
 Asian elephant           : 1   Aotus       : 1               
 (Other)                  :77   (Other)     :71               
          order          conservation  sleep_total      sleep_rem    
 Rodentia    :22   cd          : 2    Min.   : 1.90   Min.   :0.100  
 Carnivora   :12   domesticated:10    1st Qu.: 7.85   1st Qu.:0.900  
 Primates    :12   en          : 4    Median :10.10   Median :1.500  
 Artiodactyla: 6   lc          :27    Mean   :10.43   Mean   :1.875  
 Soricomorpha: 5   nt          : 4    3rd Qu.:13.75   3rd Qu.:2.400  
 Cetacea     : 3   vu          : 7    Max.   :19.90   Max.   :6.600  
 (Other)     :23   NA's        :29                    NA's   :22     
  sleep_cycle         awake          brainwt            bodywt        
 Min.   :0.1167   Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
 1st Qu.:0.1833   1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
 Median :0.3333   Median :13.90   Median :0.01240   Median :   1.670  
 Mean   :0.4396   Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
 3rd Qu.:0.5792   3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
 Max.   :1.5000   Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
 NA's   :51                       NA's   :27

Troubleshooting

If you make a mistake modifying the data, how can you undo it?

If this were a dataset we read in from an external file (like a .csv), you could just read it in again to get a fresh copy. But how do you get a fresh copy of a built-in dataset?

To reset the data to its original state, run rm(msleep) in the console. This will delete your current version of the data from R’s environment, and you’ll just be left with the original clean copy from the ggplot2 package.
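
Here is a small sketch of why that works, assuming ggplot2 is attached (which `library(tidyverse)` does for you). The "mistake" here is made up: we overwrite msleep with just its first three rows, then remove our copy.

```r
library(ggplot2)

msleep <- head(msleep, 3)  # simulate an accidental change: keep only 3 rows
nrow(msleep)               # 3, because R finds your modified copy first

rm(msleep)                 # delete only the copy in your environment
nrow(msleep)               # 83, the original from ggplot2 is visible again
```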

Filtering

As a reminder, filter() selects just the rows from a dataframe that return TRUE for the logical test you put in.

A symbolic dataframe of four rows with a header, with two rows selected (highlighted in a different color), is transformed into a new dataframe with just the two selected rows and a header.

For example:

filter(msleep, vore == "carni")

Filtering

                         name         genus  vore           order conservation
1                     Cheetah      Acinonyx carni       Carnivora           lc
2           Northern fur seal   Callorhinus carni       Carnivora           vu
3                         Dog         Canis carni       Carnivora domesticated
4        Long-nosed armadillo       Dasypus carni       Cingulata           lc
5                Domestic cat         Felis carni       Carnivora domesticated
6                 Pilot whale Globicephalus carni         Cetacea           cd
7                   Gray seal  Haliochoerus carni       Carnivora           lc
8        Thick-tailed opposum    Lutreolina carni Didelphimorphia           lc
9                  Slow loris     Nyctibeus carni        Primates         <NA>
10 Northern grasshopper mouse     Onychomys carni        Rodentia           lc
11                      Tiger      Panthera carni       Carnivora           en
12                     Jaguar      Panthera carni       Carnivora           nt
13                       Lion      Panthera carni       Carnivora           vu
14               Caspian seal         Phoca carni       Carnivora           vu
15            Common porpoise      Phocoena carni         Cetacea           vu
16       Bottle-nosed dolphin      Tursiops carni         Cetacea         <NA>
17                      Genet       Genetta carni       Carnivora         <NA>
18                 Arctic fox        Vulpes carni       Carnivora         <NA>
19                    Red fox        Vulpes carni       Carnivora         <NA>
   sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1         12.1        NA          NA 11.90      NA  50.000
2          8.7       1.4   0.3833333 15.30      NA  20.490
3         10.1       2.9   0.3333333 13.90  0.0700  14.000
4         17.4       3.1   0.3833333  6.60  0.0108   3.500
5         12.5       3.2   0.4166667 11.50  0.0256   3.300
6          2.7       0.1          NA 21.35      NA 800.000
7          6.2       1.5          NA 17.80  0.3250  85.000
8         19.4       6.6          NA  4.60      NA   0.370
9         11.0        NA          NA 13.00  0.0125   1.400
10        14.5        NA          NA  9.50      NA   0.028
11        15.8        NA          NA  8.20      NA 162.564
12        10.4        NA          NA 13.60  0.1570 100.000
13        13.5        NA          NA 10.50      NA 161.499
14         3.5       0.4          NA 20.50      NA  86.000
15         5.6        NA          NA 18.45      NA  53.180
16         5.2        NA          NA 18.80      NA 173.330
17         6.3       1.3          NA 17.70  0.0175   2.000
18        12.5        NA          NA 11.50  0.0445   3.380
19         9.8       2.4   0.3500000 14.20  0.0504   4.230

Troubleshooting

Remember that the double equals sign is a comparison: in the above code it’s asking whether vore is equal to “carni”. A single equals sign is a “setter” that tries to assign a value instead, which is not what you want inside filter().

Filtering

These are some of the logical tests you might use:

logical condition	means	example
`x < y`	less than	`sleep_total < 10`
`x > y`	greater than	`sleep_total > 4`
`x == y`	equal to	`vore == "carni"`
`x != y`	not equal to	`vore != "carni"`

Learn more

For more details about how filter() works, see the DART tutorial on data transformation, including The filter() function.

Filtering with `NA`s

Let’s use filter to take a look at just the rows that have missing values for the brainwt variable.

filter(msleep, brainwt == NA)

 [1] name         genus        vore         order        conservation
 [6] sleep_total  sleep_rem    sleep_cycle  awake        brainwt     
[11] bodywt      
<0 rows> (or 0-length row.names)
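
The empty result comes from how R handles comparisons with `NA`: because `NA` stands for an unknown value, asking whether something equals it gives back `NA` rather than `TRUE` or `FALSE`, and `filter()` only keeps rows where the test is `TRUE`. A minimal sketch with a made-up vector called `brain`:

```r
NA == NA   # NA, not TRUE
5 == NA    # NA

brain <- c(0.0155, NA, 0.423)  # hypothetical values
brain == NA      # NA NA NA: never TRUE, so filter() would keep nothing
is.na(brain)     # FALSE TRUE FALSE: this is the test we actually want
```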

Filtering with `NA`s

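That returned zero rows, even though we know brainwt has missing values. Comparing anything to NA with == gives NA rather than TRUE or FALSE, so no row ever passes the test. To test for missingness, use is.na() instead. So, let’s try again to filter the data to just show rows where we have missing values for brainwt: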

filter(msleep, is.na(brainwt))

                             name         genus  vore           order
1                         Cheetah      Acinonyx carni       Carnivora
2                 Mountain beaver    Aplodontia herbi        Rodentia
3                Three-toed sloth      Bradypus herbi          Pilosa
4               Northern fur seal   Callorhinus carni       Carnivora
5                    Vesper mouse       Calomys  <NA>        Rodentia
6                          Grivet Cercopithecus  omni        Primates
7       Western american chipmunk      Eutamias herbi        Rodentia
8                         Giraffe       Giraffa herbi    Artiodactyla
9                     Pilot whale Globicephalus carni         Cetacea
10                 Mongoose lemur         Lemur herbi        Primates
11           Thick-tailed opposum    Lutreolina carni Didelphimorphia
12               Mongolian gerbil      Meriones herbi        Rodentia
13                          Vole       Microtus herbi        Rodentia
14           Round-tailed muskrat      Neofiber herbi        Rodentia
15                           Degu       Octodon herbi        Rodentia
16     Northern grasshopper mouse     Onychomys carni        Rodentia
17                          Tiger      Panthera carni       Carnivora
18                           Lion      Panthera carni       Carnivora
19                          Potto  Perodicticus  omni        Primates
20                     Deer mouse    Peromyscus  <NA>        Rodentia
21                   Caspian seal         Phoca carni       Carnivora
22                Common porpoise      Phocoena carni         Cetacea
23                        Potoroo      Potorous herbi   Diprotodontia
24          African striped mouse     Rhabdomys  omni        Rodentia
25 Golden-mantled ground squirrel  Spermophilus herbi        Rodentia
26      Eastern american chipmunk        Tamias herbi        Rodentia
27           Bottle-nosed dolphin      Tursiops carni         Cetacea
   conservation sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1            lc        12.1        NA          NA 11.90      NA  50.000
2            nt        14.4       2.4          NA  9.60      NA   1.350
3          <NA>        14.4       2.2   0.7666667  9.60      NA   3.850
4            vu         8.7       1.4   0.3833333 15.30      NA  20.490
5          <NA>         7.0        NA          NA 17.00      NA   0.045
6            lc        10.0       0.7          NA 14.00      NA   4.750
7          <NA>        14.9        NA          NA  9.10      NA   0.071
8            cd         1.9       0.4          NA 22.10      NA 899.995
9            cd         2.7       0.1          NA 21.35      NA 800.000
10           vu         9.5       0.9          NA 14.50      NA   1.670
11           lc        19.4       6.6          NA  4.60      NA   0.370
12           lc        14.2       1.9          NA  9.80      NA   0.053
13         <NA>        12.8        NA          NA 11.20      NA   0.035
14           nt        14.6        NA          NA  9.40      NA   0.266
15           lc         7.7       0.9          NA 16.30      NA   0.210
16           lc        14.5        NA          NA  9.50      NA   0.028
17           en        15.8        NA          NA  8.20      NA 162.564
18           vu        13.5        NA          NA 10.50      NA 161.499
19           lc        11.0        NA          NA 13.00      NA   1.100
20         <NA>        11.5        NA          NA 12.50      NA   0.021
21           vu         3.5       0.4          NA 20.50      NA  86.000
22           vu         5.6        NA          NA 18.45      NA  53.180
23         <NA>        11.1       1.5          NA 12.90      NA   1.100
24         <NA>         8.7        NA          NA 15.30      NA   0.044
25           lc        15.9       3.0          NA  8.10      NA   0.205
26         <NA>        15.8        NA          NA  8.20      NA   0.112
27         <NA>         5.2        NA          NA 18.80      NA 173.330
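
When you only need the number of missing values rather than the rows themselves, you can apply `is.na()` to the column directly. This is a small sketch using base R: `TRUE` counts as 1, so `sum()` gives the number missing and `mean()` gives the proportion missing.

```r
brainwt <- ggplot2::msleep$brainwt

sum(is.na(brainwt))   # number of missing values: 27, matching the 27 rows above
mean(is.na(brainwt))  # proportion missing: 27 / 83, about 0.33
```

That proportion is well above the 5% rule of thumb mentioned earlier for substantial missingness.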

Coding Challenge 1

Your turn!

Look in the exercises/missing_values_exercises.rmd file to find your first coding challenge.

Assigning values to missing

You may need to assign values to NA as part of data cleaning.

When you’re working with data, as part of your data exploration, you may notice values for some variables that you know are impossible or extremely unlikely.

For example, imagine you have a dataset called df with a variable called rating. You know it’s on a scale from 1-5, so respondents weren’t able to rate it as anything outside of those options. And yet, when you read in the data you see there are a handful of -99 scores for this variable. Depending on the data collection mechanism, these may be the result of typos during data entry or a systematic way of marking invalid or missing responses.

Either way, you don’t want to treat those -99 scores as actual responses to the question (imagine how they would distort the estimate of the mean for that variable!); instead, you can mark them missing so they’ll be excluded from analysis.

Assigning values to missing

Here’s some code to assign the -99 values in our pretend dataset to missing:

df <- mutate(df, 
             rating = ifelse(
               rating == -99, 
               NA, 
               rating
               ))

Let’s break that down.

[CLICK] We’re using a mutate command to change (“mutate”) a variable (a column of data), and we’re setting what the new value of this column should be by using a special conditional function, [CLICK] the ifelse function. The ifelse function works like what many programming languages call a ternary operator, and it has three parts:

[CLICK] a conditional test [CLICK] a value to use if the test returns TRUE [CLICK] a value to use if the test returns FALSE

[CLICK] Our ifelse statement begins with a conditional test, in this case rating == -99. For each value in rating, it will run the test and return either TRUE or FALSE. If it returns TRUE, then it assigns the next argument, in this case NA, which will mark that value missing. If it returns FALSE, then it assigns the last argument, in this case rating, which will leave the value untouched. So for any rating values that equal -99, we’re asking it to replace them with NA, otherwise leave them as they were.
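
To see this in action, here is a runnable sketch with a made-up version of the pretend dataset (the `id` column and the specific ratings are invented for illustration). It also shows how badly the -99 codes distort the mean before they are marked missing. The `na.rm` argument used in the last line is covered later in this talk.

```r
library(dplyr)

# Hypothetical survey data: ratings are on a 1-5 scale, -99 marks invalid responses
df <- data.frame(id = 1:5, rating = c(4, -99, 2, 5, -99))

mean(df$rating)  # -37.4: the -99 codes drag the mean far below the scale

df <- mutate(df,
             rating = ifelse(
               rating == -99,  # test each value
               NA,             # if TRUE, mark it missing
               rating          # if FALSE, keep the original value
             ))

df$rating                      # 4 NA 2 5 NA
mean(df$rating, na.rm = TRUE)  # about 3.67, the mean of 4, 2, and 5
```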

Coding Challenge 2

Your turn!

Go back to the exercises/missing_values_exercises.rmd file to find your second coding challenge.

Learn more

For more on mutate and ifelse, see the R Basics: Data Transformation sections on mutate and logical operators, and the ifelse section in the free online book Advanced R.

Troubleshooting

Are ifelse and if_else the same thing?

Almost, but not quite!

The function ifelse with no underscore is part of base R, and if_else is part of the dplyr package. They are very similar and do almost exactly the same thing, but the dplyr developers bothered to write a new if_else function when the base ifelse was already available because they wanted to make it stricter.

In general, stricter is better with functions. A function built with rigid requirements can be irritating because it throws errors more often, but it’s actually much more dangerous to have loose functions quietly doing unexpected things than to have strict ones throwing errors.

So what’s happening in this particular case is that if_else has a requirement ifelse doesn’t, which is that whatever you put in for the “if TRUE” part of the operator has to be of the same type as whatever you put in for the “if FALSE” part. That’s a reasonable requirement, because the output of either of these if else functions is a single vector that will have “if TRUE” values wherever the test returns TRUE and “if FALSE” values wherever the test returns FALSE.

This runs fine:

msleep <- mutate(msleep, 
                       sleep_total = ifelse(sleep_total > 18, 
                                            NA, 
                                            sleep_total))

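But in versions of dplyr before 1.1.0, this generates an error (more recent versions convert the plain NA to the right type for you):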

msleep <- mutate(msleep, 
                       sleep_total = if_else(sleep_total > 18, 
                                             NA, 
                                             sleep_total))

Remember when I told you there were secretly different kinds of NA under the hood, for the different data types in R, but that it would almost never come up and you wouldn’t have to worry about that? So this is unfortunately one of the rare times when it does come up. Sorry!

Just plain NA like we have here is assumed to be the NA for logical vectors unless you specify otherwise. So when if_else checks the “if TRUE” and “if FALSE” arguments to see if they’re the same data type, it thinks your NA is for a logical vector and then it sees that the values from sleep_total are numeric, and that’s why it refuses to run. It thinks you’re trying to mix apples and oranges. If you explicitly tell it you want numeric NAs, then it will run just fine.

This works :)

msleep <- mutate(msleep, 
                       sleep_total = if_else(sleep_total > 18, 
                                             NA_real_, 
                                             sleep_total))

A note about removing outliers

These are all perfectly reasonable situations in which to mark values as missing:

Extremely unlikely outliers
Values that are logically impossible
Values that you know are meant to mark missing or invalid responses

Learn more

Be very cautious of removing outliers that might be valid data, though!

To learn more, read The Extent and Consequences of P-Hacking in Science.

Note, however, that you should be cautious when excluding observations in general. Dropping outliers can become problematic when you might be excluding valid data (i.e. not typos or impossible values). This is an especially important problem when you decide whether or not to drop outliers after checking the results of your analysis both ways, because that can lead to unintentional bias in your results. The process of running your statistical tests, identifying outliers, then running the tests again with outliers removed is a very common but very problematic procedure. As a matter of fact, I was taught to do exactly that when I was in grad school, and some of you may have been taught that, too. But it biases your results and contributes to the replicability crisis, so it’s no longer considered acceptable practice.

Working around missing values

What happens when you try to do things (statistical tests, visualizations, etc.) with missing values?

Two basic options for how functions handle missingness

Run the function just on whatever data aren’t missing (“listwise deletion” or “complete case analysis”)
Throw an error

More complex functions mean more options for missingness

We’ll talk about missing values in the following example functions:

mean, which calculates the mean or average of a set of numbers
prcomp, which runs a Principal Components Analysis (PCA)
geom_point, from the ggplot2 data visualization package, which is used to create scatterplots
cor, which is used to generate a correlation matrix

A little encouragement…

Are these functions new to you? No problem!

You don’t need to understand the underlying statistics for any of these functions to work through the R code and learn from the missingness examples.

`na.rm`

Many functions in R have an argument na.rm with options TRUE or FALSE.

For example, check out the help documentation for mean:

?mean

Under Arguments, you’ll see the description for na.rm is

a logical value indicating whether NA values should be stripped before the computation proceeds.

`na.rm`

Let’s try that with the brainwt variable:

mean(msleep$brainwt)

[1] NA

mean(msleep$brainwt, na.rm = TRUE)

[1] 0.2815814

We’ll try running mean on the brainwt variable from our msleep data. Remember that brainwt is one of the ones that does have some missing values. And we won’t specify anything for the na.rm argument initially, which means it will just use its default setting, which we know from the help documentation is na.rm = FALSE.

[CLICK] There are some missing values in brainwt, so right now R is trying to take the mean of several numbers and a handful of NAs — there’s no way to get an average for something that’s not there, so the result is NA.

[CLICK] Try modifying the code to change the behavior to na.rm = TRUE

[CLICK] Now we get a value! That’s the average of the available data, ignoring the missing ones.

You’ll see na.rm as an argument for many functions in R, usually with the default set to FALSE, as it is for mean.
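
The same pattern can be tried on a small made-up vector (the values below are arbitrary, not from msleep). This sketch shows na.rm on a few other base R summary functions, and checks that na.rm = TRUE gives the same answer as dropping the missing values by hand.

```r
# A small made-up vector with two missing values
x <- c(4, NA, 10, 6, NA)

sum(is.na(x))            # how many values are missing

mean(x)                  # default na.rm = FALSE, so the result is NA
mean(x, na.rm = TRUE)    # mean of the three available values

# Several other summary functions take the same argument
sum(x, na.rm = TRUE)
median(x, na.rm = TRUE)
max(x, na.rm = TRUE)

# na.rm = TRUE matches removing the NAs yourself with is.na()
mean(x[!is.na(x)])
```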

`na.action`

For more complex statistical functions, there’s often an na.action argument instead of na.rm.

Let’s take a look at the help documentation for prcomp:

?prcomp

Under Arguments, you’ll see the description for na.action is

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.

We’ll start with the function prcomp, which is used to run Principal Components Analysis (PCA). For the purposes of this example, it’s perfectly fine if you haven’t run a PCA before, or even if you’ve never heard of it — we’ll just be using it as an example to explore how R handles missing values in statistical analyses.

[CLICK] Like many R functions, prcomp has an argument called na.action which controls what happens when the function encounters missing values. The possible options are usually na.fail, na.omit, and na.exclude.

[CLICK] We can see from the help documentation that, unless you’ve changed some of your settings in R, the default option for na.action will be na.omit.

The na.action argument is computationally a bit more complicated than na.rm, which can only be TRUE or FALSE, but they actually work kind of similarly.

The na.fail option for na.action is a lot like using na.rm = FALSE. When you have na.action = na.fail, you’ll get an error if you try to run the function on data with missingness.

The na.omit and na.exclude options both work a lot like na.rm = TRUE; they will skip over missing values and run the function just on whatever data are available. The difference between na.omit and na.exclude is pretty subtle, and in my experience it only really matters if you’re going to be using the resulting model objects in kind of advanced ways.

One big difference is that while na.rm = FALSE is usually the default, for na.action it’s usually na.omit, which means the function will remove missing values.
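
If you're curious what the subtle difference between na.omit and na.exclude looks like, here is a small sketch. It uses lm() as a stand-in because it also takes an na.action argument, and the toy dataframe is made up for illustration. The two options fit the same model. They differ only in whether things like residuals are padded back out to line up with the original rows.

```r
# Made-up data: y is missing for rows 2 and 5
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, NA, 6.2, 7.9, NA, 12.1)
)

fit_omit    <- lm(y ~ x, data = toy, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = toy, na.action = na.exclude)

# Both fits use the same complete rows, so the coefficients match
all.equal(coef(fit_omit), coef(fit_exclude))

# na.omit: one residual per row that was actually used
length(residuals(fit_omit))

# na.exclude: one residual per original row, with NA where a row was dropped
length(residuals(fit_exclude))
residuals(fit_exclude)
```

The padded version is handy if you want to attach residuals or fitted values back onto the original dataframe as a new column.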

`na.action`

If we run prcomp without specifying anything for na.action, it will use this default behavior:

prcomp(~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt + bodywt, 
       data = msleep)

Standard deviations (1, .., p=6):
[1] 1.444588e+02 4.701466e+00 8.006693e-01 3.689651e-01 1.163014e-01
[6] 2.739849e-15

Rotation (n x k) = (6 x 6):
                      PC1         PC2           PC3           PC4           PC5
sleep_total -0.0173107805  0.70127490  8.667993e-02  0.0199139434 -0.0019684171
sleep_rem   -0.0030493314  0.12265440 -9.923548e-01 -0.0037779333 -0.0128203467
sleep_cycle  0.0010997075 -0.02487631  1.324014e-03  0.8138323340 -0.5805645002
awake        0.0173107805 -0.70127490 -8.667993e-02 -0.0199139434  0.0019684171
brainwt      0.0009228891 -0.01241062 -1.426396e-02  0.5804041369  0.8141085321
bodywt       0.9996946105  0.02469960 -1.332017e-05 -0.0007529244 -0.0002201911
                      PC6
sleep_total -7.071068e-01
sleep_rem   -1.826319e-16
sleep_cycle -8.221972e-16
awake       -7.071068e-01
brainwt      1.482059e-17
bodywt       2.461221e-18
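
Notice that the printed output doesn't say how many rows were dropped. One way to check, sketched here on a made-up dataframe rather than msleep, is to compare the number of rows in the data to the number of rows of component scores stored in the result's x element. The na.action is written out explicitly here, on the assumption that your session uses the factory-fresh default anyway.

```r
# Made-up data: row 3 is missing a, row 4 is missing b
toy <- data.frame(
  a = c(1.0, 2.0, NA, 4.0, 5.0, 6.0),
  b = c(2.0, 1.5, 3.0, NA, 4.5, 5.0),
  c = c(0.5, 0.7, 0.2, 0.9, 0.4, 0.8)
)

pca <- prcomp(~ a + b + c, data = toy, na.action = na.omit)

nrow(toy)                 # rows in the original data
sum(complete.cases(toy))  # rows with no missing values
nrow(pca$x)               # rows of scores, i.e. rows the PCA actually used
```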

`na.action`

Let’s run prcomp() with na.fail for the na.action instead:

prcomp(~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt + bodywt, 
       data = msleep, 
       na.action = na.fail)

Error in na.fail.default(list(sleep_total = c(12.1, 17, 14.4, 14.9, 4,  : 
  missing values in object

Missing values warnings

Some R functions will show a warning by default when they remove cases with missing values. For example, let’s use ggplot to create a scatterplot of bodywt and brainwt:

ggplot(msleep, aes(x=bodywt, y=brainwt)) +
  geom_point()

Warning message: Removed 27 rows containing missing values (geom_point).

Scatterplot with bodywt on the x-axis and brainwt on the y-axis.
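
If you'd like to know ahead of time how many points will be dropped, you can count the rows that are missing either plotting variable. This sketch assumes you are using the unmodified msleep data that ships with ggplot2. It also uses the na.rm argument of geom_point(), which removes the same rows but without the warning.

```r
library(ggplot2)  # provides both msleep and geom_point()

# Rows missing either variable can't be drawn as points
n_dropped <- sum(is.na(msleep$bodywt) | is.na(msleep$brainwt))
n_dropped  # should agree with the count in the warning message

# na.rm = TRUE drops those rows silently instead of warning
p <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point(na.rm = TRUE)
```

Silencing the warning is a reasonable choice once you've checked the count yourself and it's what you expect.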

Other arguments for handling missingness

Let’s take a look at the help documentation for cor:

?cor

na.rm
logical. Should missing values be removed?

use
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.

In the interest of time I’m going to skip over this last example here as it’s much less broadly applicable than the others we’ve talked about.

Some functions have more complicated options for how to handle missingness. For example, if you want to get correlations (or covariances) for several variables in your data, you have several options.

There are two arguments related to missing values: na.rm and use. You can scroll down to the Details section to read more about how they work.

Note: The differences between the options are subtle, so don’t stress if it feels like you don’t understand what they all mean. In most cases, people want either use = "everything" or use = "pairwise.complete.obs" when they generate a correlation or covariance matrix, so those are the two most important options to focus on. Importantly, the default value for use is “everything”.

Other arguments for handling missingness

Let’s look at a correlation matrix using the msleep data:

msleep |> 
  select(sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt) |> 
  cor()

            sleep_total sleep_rem sleep_cycle      awake brainwt     bodywt
sleep_total   1.0000000        NA          NA -0.9999986      NA -0.3120106
sleep_rem            NA         1          NA         NA      NA         NA
sleep_cycle          NA        NA           1         NA      NA         NA
awake        -0.9999986        NA          NA  1.0000000      NA  0.3119801
brainwt              NA        NA          NA         NA       1         NA
bodywt       -0.3120106        NA          NA  0.3119801      NA  1.0000000

Other arguments for handling missingness

Let’s try again, but this time change the behavior to calculate correlations for all pairwise complete observations:

msleep |> 
  select(sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt) |> 
  cor(use = "pairwise.complete.obs")

            sleep_total  sleep_rem sleep_cycle      awake    brainwt     bodywt
sleep_total   1.0000000  0.7517550  -0.4737127 -0.9999986 -0.3604874 -0.3120106
sleep_rem     0.7517550  1.0000000  -0.3381235 -0.7517713 -0.2213348 -0.3276507
sleep_cycle  -0.4737127 -0.3381235   1.0000000  0.4737127  0.8516203  0.4178029
awake        -0.9999986 -0.7517713   0.4737127  1.0000000  0.3604874  0.3119801
brainwt      -0.3604874 -0.2213348   0.8516203  0.3604874  1.0000000  0.9337822
bodywt       -0.3120106 -0.3276507   0.4178029  0.3119801  0.9337822  1.0000000
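
One thing to keep in mind with pairwise complete observations is that each correlation in the matrix can be based on a different set of rows. Here is a small sketch on made-up data that counts how many rows sit behind each pair. The crossprod() trick for counting is just one convenient way to do it, not something cor() reports for you.

```r
# Made-up data: b is missing in row 3, c is missing in row 5
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2.1, 3.9, NA, 8.2, 9.8, 12.5),
  c = c(5.5, 6.1, 3.9, 3.2, NA, 1.0)
)

# Default (use = "everything"): pairs involving b or c come back NA
cor(toy)

# Pairwise: each correlation uses every row where both columns are present
cor(toy, use = "pairwise.complete.obs")

# Number of rows behind each pairwise correlation
observed <- (!is.na(as.matrix(toy))) + 0  # 1 = present, 0 = missing
pairwise_n <- crossprod(observed)
pairwise_n

# By contrast, use = "complete.obs" keeps only rows complete on every column
sum(complete.cases(toy))
```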

Check your understanding!

True or False: The output from R functions usually tells you how missing data were handled.

True!

False!

Check your understanding!

If you want R to skip over missing values and give you the results based only on the available data, which argument might you use?

na.rm = TRUE

na.rm = FALSE

Filtering out missing values

Sometimes you want to create a new version of your data that excludes cases with missing values. This can be especially important if you want to make sure you’re using a consistent dataset across several related analyses.

For example, let’s say you ran a study on the relationship between cortisol levels and time spent on a challenging task. In addition to measuring cortisol and how long participants spent on the task, you also collected a set of demographic variables and background information like age, education level, etc.

Because the main focus of your analysis is on cortisol, you want to exclude any participants who have missing cortisol values (perhaps because of problems with the saliva sample they provided), even from parts of the analysis that don’t use the cortisol data directly. For example, you may begin your analysis by describing your participants with summary statistics on their demographic information (e.g. what is the median age in this sample?). If some of those participants have missing cortisol values, you want them excluded from the demographic summaries as well. The most straightforward way to do this is to save a new version of your data that only includes the observations you want in your final analysis, and to use that for all of your calculations.
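
As a rough sketch of that workflow, here is what it could look like with a tiny made-up version of the cortisol study. The dataframe, column names, and values are all hypothetical.

```r
library(dplyr)

# Hypothetical study data: two participants have no usable cortisol value
study <- data.frame(
  id           = 1:6,
  age          = c(25, 31, 47, 52, 38, 29),
  cortisol     = c(12.4, NA, 9.8, 15.1, NA, 11.0),
  task_minutes = c(14, 22, 9, 30, 17, 12)
)

# Save one analysis dataset and use it for every calculation
study_analysis <- filter(study, !is.na(cortisol))

nrow(study_analysis)        # participants in the final analysis

median(study$age)           # median age of everyone enrolled
median(study_analysis$age)  # median age of the sample actually analyzed
```

The two medians differ, which is exactly why it matters to describe the same set of participants that the main analysis uses.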

Filtering out missing values

There are a few different options for removing rows with missing values in R:

na.omit()
filter() with is.na()

Filtering out missing values

If you want a version of your data with no missing values in it, you can use na.omit to remove any rows with missing values.

msleep_nomissing <- na.omit(msleep)

How many rows were in the original data?

nrow(msleep)

[1] 83

How many complete rows are in the data?

nrow(msleep_nomissing)

[1] 20
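
Because na.omit drops a row if it is missing on any column, it can remove a lot more data than you expect. Before using it, it can help to check which columns are responsible for the missingness and how many rows would survive. Here is a sketch on a small made-up dataframe.

```r
# Made-up data with missing values in two different columns
toy <- data.frame(
  id    = 1:5,
  score = c(10, NA, 7, 8, NA),
  group = c("a", "b", NA, "a", "b"),
  age   = c(30, 41, 25, 36, 52)
)

# Missing values per column: which variables are responsible?
colSums(is.na(toy))

# How many rows na.omit() would keep, and how many it would drop
sum(complete.cases(toy))
sum(!complete.cases(toy))

toy_nomissing <- na.omit(toy)
nrow(toy_nomissing)
```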

Filtering out missing values

If we want to create a version of the data that doesn’t have any missing values for brainwt (but allows missing values for other variables), we can do that with filter:

msleep_nomissing_brainwt <- filter(msleep, !is.na(brainwt))
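
The same idea extends to more than one variable: give filter a condition for each column that must be present. This sketch uses a made-up dataframe that borrows a few msleep column names. Rows with a missing vore are kept because we didn't ask filter to check that column.

```r
library(dplyr)

# Made-up data with missing values scattered across three columns
toy <- data.frame(
  id      = 1:5,
  brainwt = c(0.5, NA, 1.2, 0.8, NA),
  bodywt  = c(10, 12, NA, 9, 20),
  vore    = c("herbi", NA, "carni", NA, "omni")
)

# Keep rows where both brainwt and bodywt are present
toy_both <- filter(toy, !is.na(brainwt), !is.na(bodywt))

nrow(toy_both)             # rows with both weights available
sum(is.na(toy_both$vore))  # missing values in other columns are still allowed
```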

Check your understanding!

Consider the following code. Do you think it will run without an error?

model_data <- msleep |> 
  select(sleep_total, brainwt) |> 
  na.omit()

lm(sleep_total ~ brainwt, 
   data = model_data,
   na.action = na.fail)


Call:
lm(formula = sleep_total ~ brainwt, data = model_data, na.action = na.fail)

Coefficients:
(Intercept)      brainwt  
     10.631       -1.633

Take your time with this one. I’ll give you a couple minutes to think it through, then we’ll discuss. And here’s a tip: You can try running this code yourself if you want to test it!

This was a tricky one! Let’s step through it together.

There are two main pieces to the code here. The first piece saves a dataframe called model_data, and then the second piece runs the lm() function. If you’ve never used lm() before, I hope you didn’t let that trip you up! Even if you have no idea what the lm() function is doing, you can notice it has an na.action argument, just like we looked at for prcomp().

So if we look at the bit with lm(), it looks like lm() is doing something with the sleep_total and brainwt variables, and it says data = model_data, so that suggests it’s working with that dataframe. Then we see it has na.action = na.fail. That means the function should throw an error and not run if there are missing values present. When we tried that with prcomp() we did get an error right away, because there were missing values in the data. Are there missing values in the data here?

To answer that, let’s look back at that first bit of code. It starts with the msleep dataframe and puts it through a couple steps with pipes. The first step is the select() function. We haven’t talked about that much today; select() takes a dataframe and pulls out just the columns you list, dropping any others. So here it will take the msleep dataframe and then keep just the sleep_total and brainwt columns. Then there’s another pipe sending that two-column dataframe to the na.omit() function. As we talked about, na.omit() removes any rows that have missing data on any variables. So in this case, what we’ll be left with at the end is a dataframe that has just the columns sleep_total and brainwt, and only the observations that had no missing values on either of those. So there will be no missing values at all in that model_data dataframe. If there were missing values, we would expect the lm() function to throw an error since it has na.action = na.fail, but here we know there will be no missing values at all so this should run with no errors.

[CLICK] And indeed it does!
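
If you want to convince yourself of the logic without msleep, here is the same pattern on a small made-up dataframe. It uses anyNA() to check for missing values and tryCatch() to capture the error instead of stopping the script. The placeholder string "error" is just an illustrative choice.

```r
# Made-up data: y is missing in rows 2 and 5, z is missing in row 1
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, NA, 6.2, 7.9, NA, 12.1),
  z = c(NA, "b", "c", "d", "e", "f")
)

# Keep only the modeling columns first, then drop incomplete rows.
# Row 1 survives because its only missing value was in z.
model_toy <- na.omit(toy[, c("x", "y")])

anyNA(toy)        # the raw data contain missing values
anyNA(model_toy)  # the cleaned data do not
nrow(model_toy)

# With na.fail, fitting on the raw data stops with an error...
raw_attempt <- tryCatch(
  lm(y ~ x, data = toy, na.action = na.fail),
  error = function(e) "error"
)
raw_attempt

# ...while the cleaned data fit normally
clean_fit <- lm(y ~ x, data = model_toy, na.action = na.fail)
```

Used this way, na.action = na.fail works as a safety check: the model only runs if your cleaning step really did remove every missing value it needed to.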

What we covered

Using summary() to quickly check a dataframe for missing values
Filtering using is.na() to test for missingness
Assigning values to NA using ifelse or if_else
The na.rm argument
The na.action argument
Using na.omit() to remove all rows with missingness
A bunch of extras throughout on data transformations and data cleaning

Additional resources

There are many excellent tutorials online about missing data in R. Many of them cover a lot of the same information presented here, but you may find a different perspective helpful to consolidate your learning. Here are a couple good ones:

For more powerful analysis of missing data in R, check out the Amelia package and the mice package.

Practice opportunity

Want to go through this material again? It’s posted as an interactive tutorial online as part of DART (Data and Analytics for Research Training)!

Thank you! Questions?

filter(topics, is.na(understanding))
