How to check the number and location of missing values in a dataframe
How to mark values as missing
How to use common arguments like na.rm and na.action to control how functions handle missingness
How to remove cases with missing values from a dataframe
NOT teaching statistical remedies for missingness, like imputation (but ask me about that later if you’re curious!)
Why check for missingness?
Checking for missing data can help you know whether the data were read in correctly
Missingness will also impact your effective sample size
If your analysis will involve fixing missingness statistically, the first step is always describing the missingness
Learn more
For an excellent introduction to different types of missing data and how to handle them statistically, read Rubin’s classic paper Inference and Missing Data.
What does “missing” look like in R?
NA
Rarely:
NA_integer_
NA_real_
NA_complex_
NA_character_
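As a minimal sketch of what those rarer variants are: each typed NA is still a missing value, but it carries a data type. The snippet uses only base R; the names `na_types` and `all_missing` are made up for illustration.

```r
# Each typed NA is missing, but belongs to a specific data type.
na_types <- c(
  plain     = typeof(NA),
  integer   = typeof(NA_integer_),
  real      = typeof(NA_real_),
  complex   = typeof(NA_complex_),
  character = typeof(NA_character_)
)
print(na_types)

# is.na() treats every one of them as missing.
na_values <- list(NA, NA_integer_, NA_real_, NA_complex_, NA_character_)
all_missing <- all(vapply(na_values, is.na, logical(1)))
print(all_missing)
```

Plain NA is the logical kind, which is why it is the one you will see almost everywhere.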
For example
Here’s an example of what some data with missing values might look like when printed in R:
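The printed example from this slide is not reproduced here, so the following is a small stand-in using a made-up data frame (`toy` and its values are hypothetical, not part of the lesson data).

```r
# A made-up data frame with one missing value in a character column
# and one missing value in a numeric column.
toy <- data.frame(
  animal = c("cat", NA, "owl"),
  weight = c(4.2, 0.5, NA)
)
print(toy)
```

When a data frame is printed, missing text values show up as <NA> and missing numbers as NA.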
Open the file missing_values_exercises.rmd in the exercises folder.
The data
In the console or in the exercises rmd file, run the following command:
head(msleep)
name genus vore order conservation
1 Cheetah Acinonyx carni Carnivora lc
2 Owl monkey Aotus omni Primates <NA>
3 Mountain beaver Aplodontia herbi Rodentia nt
4 Greater short-tailed shrew Blarina omni Soricomorpha lc
5 Cow Bos herbi Artiodactyla domesticated
6 Three-toed sloth Bradypus herbi Pilosa <NA>
sleep_total sleep_rem sleep_cycle awake brainwt bodywt
1 12.1 NA NA 11.9 NA 50.000
2 17.0 1.8 NA 7.0 0.01550 0.480
3 14.4 2.4 NA 9.6 NA 1.350
4 14.9 2.3 0.1333333 9.1 0.00029 0.019
5 4.0 0.7 0.6666667 20.0 0.42300 600.000
6 14.4 2.2 0.7666667 9.6 NA 3.850
About these data
To learn more about this dataset:
?msleep
How to check for missing values
summary(msleep)
name genus vore order
Length:83 Length:83 Length:83 Length:83
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
conservation sleep_total sleep_rem sleep_cycle
Length:83 Min. : 1.90 Min. :0.100 Min. :0.1167
Class :character 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
Mode :character Median :10.10 Median :1.500 Median :0.3333
Mean :10.43 Mean :1.875 Mean :0.4396
3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
Max. :19.90 Max. :6.600 Max. :1.5000
NA's :22 NA's :51
awake brainwt bodywt
Min. : 4.10 Min. :0.00014 Min. : 0.005
1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
Median :13.90 Median :0.01240 Median : 1.670
Mean :13.57 Mean :0.28158 Mean : 166.136
3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
Max. :22.10 Max. :5.71200 Max. :6654.000
NA's :27
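Notice that summary() reports NA counts only for the numeric columns here. As a rough alternative for getting the count for every column at once, one option is to combine is.na() with colSums(). This is a sketch, not necessarily the approach used in the lesson; `na_counts` is an illustrative name.

```r
library(ggplot2)  # provides the msleep data

# is.na() returns TRUE/FALSE for every cell; colSums() adds up the TRUEs per column.
na_counts <- colSums(is.na(msleep))
print(na_counts)

# Keep only the columns that have at least one missing value.
print(na_counts[na_counts > 0])
```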
Come back to these slides later if you like :)
Remember, if you’re viewing these slides online, you can hit s on your keyboard to show the speaker notes.
Convert to factor
We could convert those variables to factors with a mutate command for each of the columns we want to convert, like this:
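The conversion code itself is not shown here. A minimal sketch of what it could look like, assuming dplyr is installed and that the five character columns (name, genus, vore, order, conservation) are the ones being converted:

```r
library(ggplot2)  # provides the msleep data
library(dplyr)

# One mutate() call per column, each replacing a character column with a factor.
msleep <- msleep %>%
  mutate(name = as.factor(name)) %>%
  mutate(genus = as.factor(genus)) %>%
  mutate(vore = as.factor(vore)) %>%
  mutate(order = as.factor(order)) %>%
  mutate(conservation = as.factor(conservation))
```

Because the result is assigned back to msleep, this creates a modified copy in your own environment and leaves the copy inside the ggplot2 package untouched.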
Now let’s try summary again to see if we get more informative results for those first few columns:
summary(msleep)
name genus vore
African elephant : 1 Panthera : 3 carni :19
African giant pouched rat: 1 Spermophilus: 3 herbi :32
African striped mouse : 1 Equus : 2 insecti: 5
Arctic fox : 1 Vulpes : 2 omni :20
Arctic ground squirrel : 1 Acinonyx : 1 NA's : 7
Asian elephant : 1 Aotus : 1
(Other) :77 (Other) :71
order conservation sleep_total sleep_rem
Rodentia :22 cd : 2 Min. : 1.90 Min. :0.100
Carnivora :12 domesticated:10 1st Qu.: 7.85 1st Qu.:0.900
Primates :12 en : 4 Median :10.10 Median :1.500
Artiodactyla: 6 lc :27 Mean :10.43 Mean :1.875
Soricomorpha: 5 nt : 4 3rd Qu.:13.75 3rd Qu.:2.400
Cetacea : 3 vu : 7 Max. :19.90 Max. :6.600
(Other) :23 NA's :29 NA's :22
sleep_cycle awake brainwt bodywt
Min. :0.1167 Min. : 4.10 Min. :0.00014 Min. : 0.005
1st Qu.:0.1833 1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
Median :0.3333 Median :13.90 Median :0.01240 Median : 1.670
Mean :0.4396 Mean :13.57 Mean :0.28158 Mean : 166.136
3rd Qu.:0.5792 3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
Max. :1.5000 Max. :22.10 Max. :5.71200 Max. :6654.000
NA's :51 NA's :27
Troubleshooting
If you make a mistake modifying the data, how can you undo it?
If this were a dataset we read in from an external file (like a .csv), you could just read it in again to get a fresh copy. But how do you get a fresh copy of a built-in dataset?
To reset the data to its original state, run rm(msleep) in the console. This will delete your current version of the data from R’s environment, and you’ll just be left with the original clean copy from the ggplot2 package.
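A quick way to convince yourself this works, sketched with a deliberately bad modification (the head() call below is only a stand-in for whatever mistake you might make):

```r
library(ggplot2)  # provides the msleep data

# Overwrite msleep with a truncated copy in the global environment.
msleep <- head(msleep, 3)
nrow(msleep)  # 3

# Remove the modified copy; the name now resolves to the package's copy again.
rm(msleep)
nrow(msleep)  # 83
```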
Filtering
As a reminder, filter() selects just the rows from a dataframe that return TRUE for the logical test you put in.
For example:
filter(msleep, vore == "carni")
name genus vore order conservation
1 Cheetah Acinonyx carni Carnivora lc
2 Northern fur seal Callorhinus carni Carnivora vu
3 Dog Canis carni Carnivora domesticated
4 Long-nosed armadillo Dasypus carni Cingulata lc
5 Domestic cat Felis carni Carnivora domesticated
6 Pilot whale Globicephalus carni Cetacea cd
7 Gray seal Haliochoerus carni Carnivora lc
8 Thick-tailed opposum Lutreolina carni Didelphimorphia lc
9 Slow loris Nyctibeus carni Primates <NA>
10 Northern grasshopper mouse Onychomys carni Rodentia lc
11 Tiger Panthera carni Carnivora en
12 Jaguar Panthera carni Carnivora nt
13 Lion Panthera carni Carnivora vu
14 Caspian seal Phoca carni Carnivora vu
15 Common porpoise Phocoena carni Cetacea vu
16 Bottle-nosed dolphin Tursiops carni Cetacea <NA>
17 Genet Genetta carni Carnivora <NA>
18 Arctic fox Vulpes carni Carnivora <NA>
19 Red fox Vulpes carni Carnivora <NA>
sleep_total sleep_rem sleep_cycle awake brainwt bodywt
1 12.1 NA NA 11.90 NA 50.000
2 8.7 1.4 0.3833333 15.30 NA 20.490
3 10.1 2.9 0.3333333 13.90 0.0700 14.000
4 17.4 3.1 0.3833333 6.60 0.0108 3.500
5 12.5 3.2 0.4166667 11.50 0.0256 3.300
6 2.7 0.1 NA 21.35 NA 800.000
7 6.2 1.5 NA 17.80 0.3250 85.000
8 19.4 6.6 NA 4.60 NA 0.370
9 11.0 NA NA 13.00 0.0125 1.400
10 14.5 NA NA 9.50 NA 0.028
11 15.8 NA NA 8.20 NA 162.564
12 10.4 NA NA 13.60 0.1570 100.000
13 13.5 NA NA 10.50 NA 161.499
14 3.5 0.4 NA 20.50 NA 86.000
15 5.6 NA NA 18.45 NA 53.180
16 5.2 NA NA 18.80 NA 173.330
17 6.3 1.3 NA 17.70 0.0175 2.000
18 12.5 NA NA 11.50 0.0445 3.380
19 9.8 2.4 0.3500000 14.20 0.0504 4.230
Troubleshooting
Remember that the double equals sign is a comparison: in the code above, it asks whether vore is equal to “carni”. A single equals sign is a “setter”, and it would try to make vore equal to “carni”.
Filtering
These are some of the logical tests you might use:
logical condition    means           example
x < y                less than       sleep_total < 10
x > y                greater than    sleep_total > 4
x == y               equal to        vore == "carni"
x != y               not equal to    vore != "carni"
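To make those tests concrete, here is a small sketch applying two of them to msleep (the object names are illustrative). It also previews something relevant to missingness: filter() keeps only rows where the test is TRUE, so rows where vore is NA are dropped by both the == test and the != test.

```r
library(ggplot2)  # provides the msleep data
library(dplyr)

carnivores     <- filter(msleep, vore == "carni")
not_carnivores <- filter(msleep, vore != "carni")

nrow(carnivores)      # 19
nrow(not_carnivores)  # 57

# The two groups do not add up to all 83 rows: the 7 rows with missing vore
# fail both tests, because comparing NA to anything gives NA, not TRUE.
nrow(msleep) - nrow(carnivores) - nrow(not_carnivores)  # 7
```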
Learn more
For more details about how filter() works, see the DART tutorial on data transformation, including The filter() function.
Filtering with NAs
Let’s use filter to take a look at just the rows that have missing values for the brainwt variable.
filter(msleep, brainwt == NA)
[1] name genus vore order conservation
[6] sleep_total sleep_rem sleep_cycle awake brainwt
[11] bodywt
<0 rows> (or 0-length row.names)
Filtering with NAs
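One way to see why the == NA filter returned zero rows: any comparison with NA gives NA rather than TRUE or FALSE, and filter() keeps only rows where the test is TRUE. The sketch below uses a made-up vector `x` in base R.

```r
# Comparing anything to NA gives NA, even NA itself.
na_vs_na <- NA == NA
print(na_vs_na)

x <- c(0.5, NA, 1.2)  # made-up values

# The comparison is NA for every element, so no row would ever be TRUE.
compared <- x == NA
print(compared)

# is.na() is the test designed for this job.
tested <- is.na(x)
print(tested)
```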
So, let’s try again to filter the data to just show rows where we have missing values for brainwt:
filter(msleep, is.na(brainwt))
name genus vore order
1 Cheetah Acinonyx carni Carnivora
2 Mountain beaver Aplodontia herbi Rodentia
3 Three-toed sloth Bradypus herbi Pilosa
4 Northern fur seal Callorhinus carni Carnivora
5 Vesper mouse Calomys <NA> Rodentia
6 Grivet Cercopithecus omni Primates
7 Western american chipmunk Eutamias herbi Rodentia
8 Giraffe Giraffa herbi Artiodactyla
9 Pilot whale Globicephalus carni Cetacea
10 Mongoose lemur Lemur herbi Primates
11 Thick-tailed opposum Lutreolina carni Didelphimorphia
12 Mongolian gerbil Meriones herbi Rodentia
13 Vole Microtus herbi Rodentia
14 Round-tailed muskrat Neofiber herbi Rodentia
15 Degu Octodon herbi Rodentia
16 Northern grasshopper mouse Onychomys carni Rodentia
17 Tiger Panthera carni Carnivora
18 Lion Panthera carni Carnivora
19 Potto Perodicticus omni Primates
20 Deer mouse Peromyscus <NA> Rodentia
21 Caspian seal Phoca carni Carnivora
22 Common porpoise Phocoena carni Cetacea
23 Potoroo Potorous herbi Diprotodontia
24 African striped mouse Rhabdomys omni Rodentia
25 Golden-mantled ground squirrel Spermophilus herbi Rodentia
26 Eastern american chipmunk Tamias herbi Rodentia
27 Bottle-nosed dolphin Tursiops carni Cetacea
conservation sleep_total sleep_rem sleep_cycle awake brainwt bodywt
1 lc 12.1 NA NA 11.90 NA 50.000
2 nt 14.4 2.4 NA 9.60 NA 1.350
3 <NA> 14.4 2.2 0.7666667 9.60 NA 3.850
4 vu 8.7 1.4 0.3833333 15.30 NA 20.490
5 <NA> 7.0 NA NA 17.00 NA 0.045
6 lc 10.0 0.7 NA 14.00 NA 4.750
7 <NA> 14.9 NA NA 9.10 NA 0.071
8 cd 1.9 0.4 NA 22.10 NA 899.995
9 cd 2.7 0.1 NA 21.35 NA 800.000
10 vu 9.5 0.9 NA 14.50 NA 1.670
11 lc 19.4 6.6 NA 4.60 NA 0.370
12 lc 14.2 1.9 NA 9.80 NA 0.053
13 <NA> 12.8 NA NA 11.20 NA 0.035
14 nt 14.6 NA NA 9.40 NA 0.266
15 lc 7.7 0.9 NA 16.30 NA 0.210
16 lc 14.5 NA NA 9.50 NA 0.028
17 en 15.8 NA NA 8.20 NA 162.564
18 vu 13.5 NA NA 10.50 NA 161.499
19 lc 11.0 NA NA 13.00 NA 1.100
20 <NA> 11.5 NA NA 12.50 NA 0.021
21 vu 3.5 0.4 NA 20.50 NA 86.000
22 vu 5.6 NA NA 18.45 NA 53.180
23 <NA> 11.1 1.5 NA 12.90 NA 1.100
24 <NA> 8.7 NA NA 15.30 NA 0.044
25 lc 15.9 3.0 NA 8.10 NA 0.205
26 <NA> 15.8 NA NA 8.20 NA 0.112
27 <NA> 5.2 NA NA 18.80 NA 173.330
Coding Challenge 1
Your turn!
Look in the exercises/missing_values_exercises.rmd file to find your first coding challenge.
Assigning values to missing
You may need to assign values to NA as part of data cleaning.
Here’s some code to assign the -99 values in our pretend dataset to missing:
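The pretend dataset and the original code are not shown here, so this is a sketch under assumptions: `sleep_coded` is a made-up three-row data frame in which -99 stands for a missing sleep_total, and the recoding uses base ifelse() inside dplyr's mutate().

```r
library(dplyr)

# A made-up stand-in for the pretend dataset: -99 marks a missing sleep_total.
sleep_coded <- data.frame(
  name        = c("Cheetah", "Owl monkey", "Cow"),
  sleep_total = c(12.1, -99, 4.0)
)

# Replace -99 with NA and leave every other value unchanged.
sleep_clean <- mutate(
  sleep_coded,
  sleep_total = ifelse(sleep_total == -99, NA, sleep_total)
)
print(sleep_clean)
```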
The function ifelse (with no underscore) is part of base R, and if_else is part of the dplyr package. They are very similar and do almost exactly the same thing, but the dplyr developers bothered to write a new if_else function when the base ifelse was already available because they wanted to make it stricter.
In general, stricter is better with functions. A function built with rigid requirements can be irritating because it throws errors more often, but it’s actually much more dangerous to have loose functions quietly doing unexpected things than to have strict ones throwing errors.
So what’s happening in this particular case is that if_else has a requirement ifelse doesn’t, which is that whatever you put in for the “if TRUE” part of the function has to be of the same type as whatever you put in for the “if FALSE” part. That’s a reasonable requirement, because the output of either of these if else functions is a single vector that will have “if TRUE” values wherever the test returns TRUE and “if FALSE” values wherever the test returns FALSE. This runs fine:
Remember when I told you there were secretly different kinds of NA under the hood, for the different data types in R, but that it would almost never come up and you wouldn’t have to worry about that? So this is unfortunately one of the rare times when it does come up. Sorry!
Just plain NA like we have here is assumed to be the NA for logical vectors unless you specify otherwise. So when if_else checks the “if TRUE” and “if FALSE” arguments to see if they’re the same data type, it thinks your NA is for a logical vector, and then it sees that the values from sleep_total are numeric, and that’s why it refuses to run. It thinks you’re trying to mix apples and oranges. If you explicitly tell it you want numeric NAs, then it will run just fine.
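A minimal sketch of that strictness, using a made-up numeric vector. To show the type check in a way that should behave the same across dplyr versions, the failing case mixes text with numbers rather than using a plain NA. In recent dplyr releases (1.1.0 and later, to my knowledge) a plain NA is also accepted by if_else, so the error described above may not appear; the typed NA works in either case.

```r
library(dplyr)

sleep_total <- c(12.1, -99, 4.0)  # made-up values; -99 marks missing

# Base ifelse() quietly coerces mixed types: the result here is character.
loose <- ifelse(sleep_total == -99, "missing", sleep_total)
class(loose)

# dplyr's if_else() refuses to mix character and numeric and raises an error.
strict_result <- tryCatch(
  if_else(sleep_total == -99, "missing", sleep_total),
  error = function(e) "error"
)
print(strict_result)

# Asking for a numeric NA explicitly satisfies the type check.
cleaned <- if_else(sleep_total == -99, NA_real_, sleep_total)
print(cleaned)
```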
You don’t need to understand the underlying statistics for any of these functions to work through the R code and learn from the missingness examples.
na.rm
Many functions in R have an argument na.rm with options TRUE or FALSE.
For example, check out the help documentation for mean:
?mean
Under Arguments, you’ll see the description for na.rm is
a logical value indicating whether NA values should be stripped before the computation proceeds.
na.rm
Let’s try that with the brainwt variable:
mean(msleep$brainwt)
[1] NA
mean(msleep$brainwt, na.rm = TRUE)
[1] 0.2815814
na.action
For more complex statistical functions, there’s often an na.action argument instead of na.rm.
Let’s take a look at the help documentation for prcomp:
?prcomp
Under Arguments, you’ll see the description for na.action is
a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.
na.action
If we run prcomp without specifying anything for na.action, it will use this default behavior:
Error in na.fail.default(list(sleep_total = c(12.1, 17, 14.4, 14.9, 4, :
missing values in object
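The call that produced this error is not shown. One way to reproduce it, as a sketch, is to use prcomp's formula interface and pass na.action = na.fail explicitly; the formula below (all six numeric columns) is an assumption. Note that in a session with factory-fresh options, where the na.action option is na.omit, leaving na.action unspecified would drop incomplete rows instead of raising this error.

```r
library(ggplot2)  # provides the msleep data

# Assumed formula: the six numeric columns of msleep.
pca_formula <- ~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt + bodywt

# na.fail stops with an error as soon as any missing value is found.
fail_message <- tryCatch(
  prcomp(pca_formula, data = msleep, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
print(fail_message)

# na.omit drops incomplete rows before running the analysis.
pca <- prcomp(pca_formula, data = msleep, na.action = na.omit)
nrow(pca$x)  # number of rows actually used
```

Comparing nrow(pca$x) with nrow(msleep) is a quick check on how many cases the analysis silently left out.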
Missing values warnings
Some R functions will show a warning by default when they remove cases with missing values. For example, let’s use ggplot to create a scatterplot of bodywt and brainwt:
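The plotting code is not shown here; a minimal sketch of such a scatterplot follows. The choice of bodywt on the x axis and brainwt on the y axis is an assumption, and the exact wording of the warning depends on your ggplot2 version.

```r
library(ggplot2)  # provides ggplot() and the msleep data

# brainwt has missing values, so those points cannot be drawn.
p <- ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point()

# Drawing the plot triggers a warning about removed rows.
print(p)

# The number of rows the warning refers to: rows missing either variable.
n_dropped <- sum(is.na(msleep$bodywt) | is.na(msleep$brainwt))
n_dropped  # 27
```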
Let’s take a look at the help documentation for cor:
?cor
na.rm
logical. Should missing values be removed?
use
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.
Other arguments for handling missingness
Let’s look at a correlation matrix using the msleep data:
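The code for this step is not shown; a sketch that should produce a matrix like the one printed here, assuming the six numeric columns are selected with dplyr and cor() is left at its default use = "everything" (`msleep_numeric` and `cor_default` are illustrative names):

```r
library(ggplot2)  # provides the msleep data
library(dplyr)

# Keep only the numeric columns; cor() cannot handle text columns.
msleep_numeric <- select(
  msleep,
  sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt
)

# With the default setting, any pair involving a column with NAs comes out as NA.
cor_default <- cor(msleep_numeric)
print(cor_default)
```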
sleep_total sleep_rem sleep_cycle awake brainwt bodywt
sleep_total 1.0000000 NA NA -0.9999986 NA -0.3120106
sleep_rem NA 1 NA NA NA NA
sleep_cycle NA NA 1 NA NA NA
awake -0.9999986 NA NA 1.0000000 NA 0.3119801
brainwt NA NA NA NA 1 NA
bodywt -0.3120106 NA NA 0.3119801 NA 1.0000000
Other arguments for handling missingness
Let’s try again, but this time change the behavior to calculate correlations for all pairwise complete observations:
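Again the original code and output are not shown, so this is a sketch under the same assumptions as before (six numeric columns, illustrative object names), this time setting use = "pairwise.complete.obs":

```r
library(ggplot2)  # provides the msleep data
library(dplyr)

msleep_numeric <- select(
  msleep,
  sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt
)

# Each correlation uses every row that is complete for that particular pair.
cor_pairwise <- cor(msleep_numeric, use = "pairwise.complete.obs")
print(cor_pairwise)
```

One consequence of this design is that different cells of the matrix can be based on different subsets of animals, which is worth keeping in mind when comparing them.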
True or False: The output from R functions usually tells you how missing data were handled.
True!
False!
Check your understanding!
If you want R to skip over missing values and give you the results based only on the available data, which argument might you use?
na.rm = TRUE
na.rm = FALSE
Filtering out missing values
Sometimes you want to create a new version of your data that excludes cases with missing values. This can be especially important if you want to make sure you’re using a consistent dataset across several related analyses.
Filtering out missing values
There are a few different options for removing rows with missing values in R:
na.omit()
filter() with is.na()
Filtering out missing values
If you want a version of your data with no missing values in it, you can use na.omit to remove any rows with missing values.
msleep_nomissing <- na.omit(msleep)
How many rows were in the original data?
nrow(msleep)
[1] 83
How many complete rows are in the data?
nrow(msleep_nomissing)
[1] 20
Filtering out missing values
If we want to create a version of the data that doesn’t have any missing values for brainwt (but allows missing values for other variables), we can do that with filter:
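The code for this step is not shown; a minimal sketch using filter() with a negated is.na() test (the name `msleep_brainwt` is made up for illustration):

```r
library(ggplot2)  # provides the msleep data
library(dplyr)

# Keep rows where brainwt is NOT missing; other columns may still contain NAs.
msleep_brainwt <- filter(msleep, !is.na(brainwt))

nrow(msleep_brainwt)  # 56, i.e. 83 rows minus the 27 with missing brainwt
```

Compare this with the 20 rows left by na.omit(), which drops a row if any column is missing.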
Using summary() to quickly check a dataframe for missing values
Filtering using is.na() to test for missingness
Assigning values to NA using ifelse or if_else
The na.rm argument
The na.action argument
Using na.omit() to remove all rows with missingness
a bunch of extras throughout on data transformations and data cleaning
Additional resources
There are many excellent tutorials online about missing data in R. Many of them cover a lot of the same information presented here, but you may find a different perspective helpful to consolidate your learning. Here are a couple good ones: