Missing Values in R

Rose Hartman

Arcus Education, DBHi

2024-03-04

Join the CHOP R User Group

  • Friendly help troubleshooting your R code
  • Announcements for upcoming talks, workshops, and conferences

Link to join: https://bit.ly/chopRusers

Come to R Office Hours!

  • Set up a meeting to get live help with your R code from our most experienced useRs
  • Office hours appointments can be one-on-one or open to the community

Link to calendar: https://bit.ly/chopROfficeHours

Coming soon!

This is the first talk in a new series called R102: MasteRing the Fundamentals


Next up: Summary Statistics in R, April 8th 12:00pm ET


Learn more about this new series, including dates and titles for each session:
https://arcus.github.io/r102/

Missing Values in R

What we’re covering today

  • How to check the number and location of missing values in a dataframe
  • How to mark values as missing
  • How to use common arguments like na.rm and na.action to control how functions handle missingness
  • How to remove cases with missing values from a dataframe
  • NOT teaching statistical remedies for missingness, like imputation (but ask me about that later if you’re curious!)

Why check for missingness?

  • Checking for missing data can help you know whether the data were read in correctly
  • Missingness will also impact your effective sample size
  • If your analysis will involve fixing missingness statistically, the first step is always describing the missingness

Learn more

For an excellent introduction to different types of missing data and how to handle them statistically, read Rubin’s classic paper Inference and Missing Data.

What does “missing” look like in R?

NA

Rarely:

  • NA_integer_
  • NA_real_
  • NA_complex_
  • NA_character_

For example

Here’s an example of what some data with missing values might look like when printed in R:

sensor_id PM2.5 PM10 O3 NO2
0001 10 25 0.0 67
0002 13 21 NA 71
0003 9 NA NA 64

Learn more

What about NULL and NaN?

If you’d like to learn more, check out this blog post explaining the difference between NA and NULL and the missing values chapter of R for Data Science (2e).


If you’re just beginning in R, you can safely ignore the differences between NA, NaN, and NULL for now.

How to check for missing values

Working with missing values in R

Two options:

  1. Work in the cloud: https://posit.cloud/content/7522885
  2. Work on your computer: https://github.com/arcus/r102

Learn more

Learn more about the tidyverse packages on the tidyverse website!

Load packages

Only if needed:

install.packages("tidyverse")


Each R session:

library(tidyverse)


Open the file missing_values_exercises.rmd in the exercises folder.

The data

In the console or in the exercises rmd file, run the following command:

head(msleep) 
                        name      genus  vore        order conservation
1                    Cheetah   Acinonyx carni    Carnivora           lc
2                 Owl monkey      Aotus  omni     Primates         <NA>
3            Mountain beaver Aplodontia herbi     Rodentia           nt
4 Greater short-tailed shrew    Blarina  omni Soricomorpha           lc
5                        Cow        Bos herbi Artiodactyla domesticated
6           Three-toed sloth   Bradypus herbi       Pilosa         <NA>
  sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1        12.1        NA          NA  11.9      NA  50.000
2        17.0       1.8          NA   7.0 0.01550   0.480
3        14.4       2.4          NA   9.6      NA   1.350
4        14.9       2.3   0.1333333   9.1 0.00029   0.019
5         4.0       0.7   0.6666667  20.0 0.42300 600.000
6        14.4       2.2   0.7666667   9.6      NA   3.850

About these data

To learn more about this dataset:

?msleep

How to check for missing values

summary(msleep)
     name              genus               vore              order          
 Length:83          Length:83          Length:83          Length:83         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 conservation        sleep_total      sleep_rem      sleep_cycle    
 Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
 Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
 Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
                    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
                    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
                    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
                                    NA's   :22      NA's   :51      
     awake          brainwt            bodywt        
 Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
 1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
 Median :13.90   Median :0.01240   Median :   1.670  
 Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
 3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
 Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
                 NA's   :27                          

Come back to these slides later if you like :)


Remember, if you’re viewing these slides online, you can hit s on your keyboard to show the speaker notes.

Convert to factor

We could convert the character variables to factors by listing each column we want to convert inside mutate(), like this:

msleep <- msleep |>
  mutate(name = as.factor(name),
         genus = as.factor(genus),
         vore = as.factor(vore),
         order = as.factor(order),
         conservation = as.factor(conservation))

Convert to factor

But in cases like this, where we want to convert several variables all in the same way, we can do it faster with the across() function:

msleep <- msleep |>
  mutate(across(
    where(is.character), 
    as.factor))
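
To see the across() pattern on something small, here is a minimal sketch using a made-up data frame. The animals object and its values are illustrative only (they are not msleep), and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data frame (hypothetical values): two character columns, one numeric
animals <- data.frame(
  name = c("Cheetah", "Cow", "Owl monkey"),
  vore = c("carni", "herbi", NA),
  sleep_total = c(12.1, 4.0, 17.0),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in a single step
animals <- animals |>
  mutate(across(where(is.character), as.factor))

# The missing value stays NA; it does not become a factor level
levels(animals$vore)
summary(animals$vore)
```

The numeric column is left alone because where(is.character) only selects character columns.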

Learn more

A few things from the above code that you might want to look into further:

Summary again

Now let’s try summary again to see if we get more informative results for those first few columns:

summary(msleep)
                        name             genus         vore   
 African elephant         : 1   Panthera    : 3   carni  :19  
 African giant pouched rat: 1   Spermophilus: 3   herbi  :32  
 African striped mouse    : 1   Equus       : 2   insecti: 5  
 Arctic fox               : 1   Vulpes      : 2   omni   :20  
 Arctic ground squirrel   : 1   Acinonyx    : 1   NA's   : 7  
 Asian elephant           : 1   Aotus       : 1               
 (Other)                  :77   (Other)     :71               
          order          conservation  sleep_total      sleep_rem    
 Rodentia    :22   cd          : 2    Min.   : 1.90   Min.   :0.100  
 Carnivora   :12   domesticated:10    1st Qu.: 7.85   1st Qu.:0.900  
 Primates    :12   en          : 4    Median :10.10   Median :1.500  
 Artiodactyla: 6   lc          :27    Mean   :10.43   Mean   :1.875  
 Soricomorpha: 5   nt          : 4    3rd Qu.:13.75   3rd Qu.:2.400  
 Cetacea     : 3   vu          : 7    Max.   :19.90   Max.   :6.600  
 (Other)     :23   NA's        :29                    NA's   :22     
  sleep_cycle         awake          brainwt            bodywt        
 Min.   :0.1167   Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
 1st Qu.:0.1833   1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
 Median :0.3333   Median :13.90   Median :0.01240   Median :   1.670  
 Mean   :0.4396   Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
 3rd Qu.:0.5792   3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
 Max.   :1.5000   Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
 NA's   :51                       NA's   :27                          
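
summary() prints the NA counts column by column, but sometimes it is handy to pull those counts out directly. One possible sketch uses base R's is.na() on a small made-up data frame. The sensors object is hypothetical and just mirrors the air-quality example from earlier in these slides.

```r
# Toy data frame mirroring the air-quality example (hypothetical values)
sensors <- data.frame(
  sensor_id = c("0001", "0002", "0003"),
  PM2.5 = c(10, 13, 9),
  PM10 = c(25, 21, NA),
  O3 = c(0.0, NA, NA),
  NO2 = c(67, 71, 64)
)

# is.na() gives TRUE/FALSE for every cell; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(sensors))
na_per_column

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(sensors))
total_missing

# Row and column position of each missing cell
missing_positions <- which(is.na(sensors), arr.ind = TRUE)
missing_positions
```

The same calls should work on msleep too, for example colSums(is.na(msleep)).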

Troubleshooting

If you make a mistake modifying the data, how can you undo it?

If this were a dataset we read in from an external file (like a .csv), you could just read it in again to get a fresh copy. But how do you get a fresh copy of a built-in dataset?

To reset the data to its original state, run rm(msleep) in the console. This will delete your current version of the data from R’s environment, and you’ll just be left with the original clean copy from the ggplot2 package.
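
Here is a small sketch of that reset trick. To keep it runnable without ggplot2, it uses mtcars (a dataset that ships with base R) as a stand-in for msleep; the idea should carry over unchanged.

```r
# "Mistake": overwrite a column, which creates our own modified copy of mtcars
mtcars <- transform(mtcars, mpg = NA)
all(is.na(mtcars$mpg))   # TRUE: R finds our modified copy first

# Delete our copy from the environment
rm(mtcars)

# R now falls back to the original data from the datasets package
anyNA(mtcars$mpg)        # FALSE: the clean copy is back
```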

Filtering

As a reminder, filter() selects just the rows from a dataframe that return TRUE for the logical test you put in.

A symbolic dataframe of four rows with a header, with two rows selected (highlighted in a different color), is transformed into a new dataframe with just the two selected rows and a header.


For example:

filter(msleep, vore == "carni")

Filtering

                         name         genus  vore           order conservation
1                     Cheetah      Acinonyx carni       Carnivora           lc
2           Northern fur seal   Callorhinus carni       Carnivora           vu
3                         Dog         Canis carni       Carnivora domesticated
4        Long-nosed armadillo       Dasypus carni       Cingulata           lc
5                Domestic cat         Felis carni       Carnivora domesticated
6                 Pilot whale Globicephalus carni         Cetacea           cd
7                   Gray seal  Haliochoerus carni       Carnivora           lc
8        Thick-tailed opposum    Lutreolina carni Didelphimorphia           lc
9                  Slow loris     Nyctibeus carni        Primates         <NA>
10 Northern grasshopper mouse     Onychomys carni        Rodentia           lc
11                      Tiger      Panthera carni       Carnivora           en
12                     Jaguar      Panthera carni       Carnivora           nt
13                       Lion      Panthera carni       Carnivora           vu
14               Caspian seal         Phoca carni       Carnivora           vu
15            Common porpoise      Phocoena carni         Cetacea           vu
16       Bottle-nosed dolphin      Tursiops carni         Cetacea         <NA>
17                      Genet       Genetta carni       Carnivora         <NA>
18                 Arctic fox        Vulpes carni       Carnivora         <NA>
19                    Red fox        Vulpes carni       Carnivora         <NA>
   sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1         12.1        NA          NA 11.90      NA  50.000
2          8.7       1.4   0.3833333 15.30      NA  20.490
3         10.1       2.9   0.3333333 13.90  0.0700  14.000
4         17.4       3.1   0.3833333  6.60  0.0108   3.500
5         12.5       3.2   0.4166667 11.50  0.0256   3.300
6          2.7       0.1          NA 21.35      NA 800.000
7          6.2       1.5          NA 17.80  0.3250  85.000
8         19.4       6.6          NA  4.60      NA   0.370
9         11.0        NA          NA 13.00  0.0125   1.400
10        14.5        NA          NA  9.50      NA   0.028
11        15.8        NA          NA  8.20      NA 162.564
12        10.4        NA          NA 13.60  0.1570 100.000
13        13.5        NA          NA 10.50      NA 161.499
14         3.5       0.4          NA 20.50      NA  86.000
15         5.6        NA          NA 18.45      NA  53.180
16         5.2        NA          NA 18.80      NA 173.330
17         6.3       1.3          NA 17.70  0.0175   2.000
18        12.5        NA          NA 11.50  0.0445   3.380
19         9.8       2.4   0.3500000 14.20  0.0504   4.230

Troubleshooting

Remember that the double equals sign (==) is a comparison: in the above code it’s asking whether vore is equal to “carni”. A single equals sign (=) is a “setter”, and it would try to make vore equal to “carni” instead of testing it.

Filtering

These are some of the logical tests you might use:

logical condition   means          example
x < y               less than      sleep_total < 10
x > y               greater than   sleep_total > 4
x == y              equal to       vore == "carni"
x != y              not equal to   vore != "carni"

Learn more

For more details about how filter() works, see the DART tutorial on data transformation, including The filter() function.

Filtering with NAs

Let’s use filter to take a look at just the rows that have missing values for the brainwt variable.

filter(msleep, brainwt == NA)
 [1] name         genus        vore         order        conservation
 [6] sleep_total  sleep_rem    sleep_cycle  awake        brainwt     
[11] bodywt      
<0 rows> (or 0-length row.names)
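
Why zero rows? Comparing anything to NA gives NA rather than TRUE or FALSE, and filter() only keeps rows where the test is TRUE. A minimal base R sketch of the difference, using a short made-up vector x:

```r
# Comparing to NA never gives TRUE, because the answer is unknown
5 == NA    # NA
NA == NA   # NA

# A short made-up vector with two missing values
x <- c(0.07, NA, 0.0108, NA)

compared <- x == NA      # NA NA NA NA: nothing here is TRUE
flagged <- is.na(x)      # FALSE TRUE FALSE TRUE

compared
flagged
sum(flagged)             # 2 missing values
which(flagged)           # positions 2 and 4
```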

Filtering with NAs

So, let’s try again to filter the data to just show rows where we have missing values for brainwt:

filter(msleep, is.na(brainwt))
                             name         genus  vore           order
1                         Cheetah      Acinonyx carni       Carnivora
2                 Mountain beaver    Aplodontia herbi        Rodentia
3                Three-toed sloth      Bradypus herbi          Pilosa
4               Northern fur seal   Callorhinus carni       Carnivora
5                    Vesper mouse       Calomys  <NA>        Rodentia
6                          Grivet Cercopithecus  omni        Primates
7       Western american chipmunk      Eutamias herbi        Rodentia
8                         Giraffe       Giraffa herbi    Artiodactyla
9                     Pilot whale Globicephalus carni         Cetacea
10                 Mongoose lemur         Lemur herbi        Primates
11           Thick-tailed opposum    Lutreolina carni Didelphimorphia
12               Mongolian gerbil      Meriones herbi        Rodentia
13                          Vole       Microtus herbi        Rodentia
14           Round-tailed muskrat      Neofiber herbi        Rodentia
15                           Degu       Octodon herbi        Rodentia
16     Northern grasshopper mouse     Onychomys carni        Rodentia
17                          Tiger      Panthera carni       Carnivora
18                           Lion      Panthera carni       Carnivora
19                          Potto  Perodicticus  omni        Primates
20                     Deer mouse    Peromyscus  <NA>        Rodentia
21                   Caspian seal         Phoca carni       Carnivora
22                Common porpoise      Phocoena carni         Cetacea
23                        Potoroo      Potorous herbi   Diprotodontia
24          African striped mouse     Rhabdomys  omni        Rodentia
25 Golden-mantled ground squirrel  Spermophilus herbi        Rodentia
26      Eastern american chipmunk        Tamias herbi        Rodentia
27           Bottle-nosed dolphin      Tursiops carni         Cetacea
   conservation sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
1            lc        12.1        NA          NA 11.90      NA  50.000
2            nt        14.4       2.4          NA  9.60      NA   1.350
3          <NA>        14.4       2.2   0.7666667  9.60      NA   3.850
4            vu         8.7       1.4   0.3833333 15.30      NA  20.490
5          <NA>         7.0        NA          NA 17.00      NA   0.045
6            lc        10.0       0.7          NA 14.00      NA   4.750
7          <NA>        14.9        NA          NA  9.10      NA   0.071
8            cd         1.9       0.4          NA 22.10      NA 899.995
9            cd         2.7       0.1          NA 21.35      NA 800.000
10           vu         9.5       0.9          NA 14.50      NA   1.670
11           lc        19.4       6.6          NA  4.60      NA   0.370
12           lc        14.2       1.9          NA  9.80      NA   0.053
13         <NA>        12.8        NA          NA 11.20      NA   0.035
14           nt        14.6        NA          NA  9.40      NA   0.266
15           lc         7.7       0.9          NA 16.30      NA   0.210
16           lc        14.5        NA          NA  9.50      NA   0.028
17           en        15.8        NA          NA  8.20      NA 162.564
18           vu        13.5        NA          NA 10.50      NA 161.499
19           lc        11.0        NA          NA 13.00      NA   1.100
20         <NA>        11.5        NA          NA 12.50      NA   0.021
21           vu         3.5       0.4          NA 20.50      NA  86.000
22           vu         5.6        NA          NA 18.45      NA  53.180
23         <NA>        11.1       1.5          NA 12.90      NA   1.100
24         <NA>         8.7        NA          NA 15.30      NA   0.044
25           lc        15.9       3.0          NA  8.10      NA   0.205
26         <NA>        15.8        NA          NA  8.20      NA   0.112
27         <NA>         5.2        NA          NA 18.80      NA 173.330

Coding Challenge 1


Your turn!


Look in the exercises/missing_values_exercises.rmd file to find your first coding challenge.

Assigning values to missing

You may need to assign values to NA as part of data cleaning.

Assigning values to missing

Here’s some code to assign the -99 values in our pretend dataset to missing:

df <- mutate(df, 
             rating = ifelse(
               rating == -99, 
               NA, 
               rating
               ))
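
If you want to try this yourself, here is a runnable sketch with a made-up version of the pretend dataset. The id column and the rating values are assumptions for illustration, and the sketch assumes dplyr is installed. It also shows dplyr's na_if() as a shorthand for the same replacement.

```r
library(dplyr)

# Pretend survey data (hypothetical) where -99 was entered for "no response"
df <- data.frame(
  id = 1:5,
  rating = c(4, -99, 5, 3, -99)
)

# Same replacement written with na_if(), saved separately for comparison
df_alt <- mutate(df, rating = na_if(rating, -99))

# Replace the -99 placeholder with NA, leaving real ratings untouched
df <- mutate(df,
             rating = ifelse(
               rating == -99,
               NA,
               rating
               ))

df
sum(is.na(df$rating))   # 2 values are now marked as missing
```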

Coding Challenge 2


Your turn!


Go back to the exercises/missing_values_exercises.rmd file to find your second coding challenge.

Learn more

For more on mutate and ifelse, see the R Basics: Data Transformation sections on mutate and logical operators, and the ifelse section in the free online book Advanced R.

Troubleshooting

Are ifelse and if_else the same thing?

Almost, but not quite!


The function ifelse with no underscore is part of base R, and if_else is part of the dplyr package. They are very similar and do almost exactly the same thing, but the dplyr developers bothered to write a new if_else function when the base ifelse was already available because they wanted to make it stricter.

In general, stricter is better with functions. A function built with rigid requirements can be irritating because it throws errors more often, but it’s actually much more dangerous to have loose functions quietly doing unexpected things than to have strict ones throwing errors.

So what’s happening in this particular case is that if_else has a requirement ifelse doesn’t: whatever you put in for the “if TRUE” part of the function has to be of the same type as whatever you put in for the “if FALSE” part. That’s a reasonable requirement, because the output of either of these if else functions is a single vector that will have “if TRUE” values wherever the test returns TRUE and “if FALSE” values wherever the test returns FALSE.

This runs fine:

msleep <- mutate(msleep, 
                       sleep_total = ifelse(sleep_total > 18, 
                                            NA, 
                                            sleep_total))

But in versions of dplyr older than 1.1.0 this generates an error (since dplyr 1.1.0, if_else accepts a plain NA here):

msleep <- mutate(msleep, 
                       sleep_total = if_else(sleep_total > 18, 
                                             NA, 
                                             sleep_total))

Remember when I told you there were secretly different kinds of NA under the hood, for the different data types in R, but that it would almost never come up and you wouldn’t have to worry about that? So this is unfortunately one of the rare times when it does come up. Sorry!

Just plain NA like we have here is assumed to be the NA for logical vectors unless you specify otherwise. So when if_else checks the “if TRUE” and “if FALSE” arguments to see if they’re the same data type, it thinks your NA is for a logical vector and then it sees that the values from sleep_total are numeric, and that’s why it refuses to run. It thinks you’re trying to mix apples and oranges. If you explicitly tell it you want numeric NAs, then it will run just fine.


This works :)

msleep <- mutate(msleep, 
                       sleep_total = if_else(sleep_total > 18, 
                                             NA_real_, 
                                             sleep_total))
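
If you’re curious what those different kinds of NA look like, this base R sketch checks the type of each one. It also shows why the difference rarely matters: base R quietly converts a plain NA to match whatever it is combined with.

```r
# Each kind of NA has its own underlying type
na_types <- c(
  plain = typeof(NA),                # "logical"
  real = typeof(NA_real_),           # "double"
  integer = typeof(NA_integer_),     # "integer"
  character = typeof(NA_character_)  # "character"
)
na_types

# Plain NA is converted automatically when combined with other values
typeof(c(NA, 1.5))   # "double"
typeof(c(NA, "a"))   # "character"
```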

A note about removing outliers

These are all perfectly reasonable situations in which to mark values as missing:

  • Extremely unlikely outliers
  • Values that are logically impossible
  • Values that you know are meant to mark missing or invalid responses

Learn more

Be very cautious of removing outliers that might be valid data, though!


To learn more, read The Extent and Consequences of P-Hacking in Science.

Working around missing values

What happens when you try to do things (statistical tests, visualizations, etc.) with missing values?

Two basic options for how functions handle missingness

More complex functions mean more options for missingness

We’ll talk about missing values in the following example functions:

A little encouragement…

Are these functions new to you? No problem!

You don’t need to understand the underlying statistics for any of these functions to work through the R code and learn from the missingness examples.

na.rm

Many functions in R have an argument na.rm with options TRUE or FALSE.

For example, check out the help documentation for mean:

?mean


Under Arguments, you’ll see the description for na.rm is

a logical value indicating whether NA values should be stripped before the computation proceeds.

na.rm

Let’s try that with the brainwt variable:

mean(msleep$brainwt)
[1] NA


mean(msleep$brainwt, na.rm = TRUE)
[1] 0.2815814
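
The same argument shows up in many other base R summary functions. Here is a quick sketch on a short made-up vector (x is hypothetical), which also counts how many values were actually used:

```r
# A short made-up vector with one missing value
x <- c(2, 4, NA, 6)

sum(x)                   # NA: one unknown value makes the total unknown
sum(x, na.rm = TRUE)     # 12
mean(x, na.rm = TRUE)    # 4
max(x, na.rm = TRUE)     # 6

# How many values went into those results?
n_used <- sum(!is.na(x))
n_used                   # 3
```

Checking the number of values used is a good habit, since na.rm = TRUE drops the missing ones without telling you.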

na.action

For more complex statistical functions, there’s often an na.action argument instead of na.rm.


Let’s take a look at the help documentation for prcomp:

?prcomp

Under Arguments, you’ll see the description for na.action is

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.

na.action

If we run prcomp without specifying anything for na.action, it will use this default behavior:

prcomp(~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt + bodywt, 
       data = msleep)
Standard deviations (1, .., p=6):
[1] 1.444588e+02 4.701466e+00 8.006693e-01 3.689651e-01 1.163014e-01
[6] 2.739849e-15

Rotation (n x k) = (6 x 6):
                      PC1         PC2           PC3           PC4           PC5
sleep_total -0.0173107805  0.70127490  8.667993e-02  0.0199139434 -0.0019684171
sleep_rem   -0.0030493314  0.12265440 -9.923548e-01 -0.0037779333 -0.0128203467
sleep_cycle  0.0010997075 -0.02487631  1.324014e-03  0.8138323340 -0.5805645002
awake        0.0173107805 -0.70127490 -8.667993e-02 -0.0199139434  0.0019684171
brainwt      0.0009228891 -0.01241062 -1.426396e-02  0.5804041369  0.8141085321
bodywt       0.9996946105  0.02469960 -1.332017e-05 -0.0007529244 -0.0002201911
                      PC6
sleep_total -7.071068e-01
sleep_rem   -1.826319e-16
sleep_cycle -8.221972e-16
awake       -7.071068e-01
brainwt      1.482059e-17
bodywt       2.461221e-18

na.action

Let’s run prcomp() with na.fail for the na.action instead:

prcomp(~ sleep_total + sleep_rem + sleep_cycle + awake + brainwt + bodywt, 
       data = msleep, 
       na.action = na.fail)
Error in na.fail.default(list(sleep_total = c(12.1, 17, 14.4, 14.9, 4,  : 
  missing values in object
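
To compare the two behaviors side by side on something tiny, here is a sketch that uses lm() (which also takes na.action) on a made-up data frame. The toy object and its values are assumptions for illustration only.

```r
# Toy data (hypothetical values) with one missing predictor value
toy <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8)
)

# The session-wide default; "na.omit" in a fresh R session
getOption("na.action")

# na.omit: the incomplete row is dropped quietly before the model is fit
fit_omit <- lm(y ~ x, data = toy, na.action = na.omit)
rows_used <- nobs(fit_omit)   # 4 of the 5 rows

# na.fail: the same call stops with an error instead
fit_fail <- try(lm(y ~ x, data = toy, na.action = na.fail), silent = TRUE)
stopped <- inherits(fit_fail, "try-error")   # TRUE

rows_used
stopped
```

Using na.fail can be a useful safety check when you expect your data to be complete.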

Missing values warnings

Some R functions will show a warning by default when they remove cases with missing values. For example, let’s use ggplot to create a scatterplot of bodywt and brainwt:

ggplot(msleep, aes(x=bodywt, y=brainwt)) +
  geom_point()

Warning message: Removed 27 rows containing missing values (geom_point).

Scatterplot with bodywt on the x-axis and brainwt on the y-axis.

Other arguments for handling missingness

Let’s take a look at the help documentation for cor:

?cor

na.rm
logical. Should missing values be removed?

use
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.

Other arguments for handling missingness

Let’s look at a correlation matrix using the msleep data:

msleep |> 
  select(sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt) |> 
  cor()
            sleep_total sleep_rem sleep_cycle      awake brainwt     bodywt
sleep_total   1.0000000        NA          NA -0.9999986      NA -0.3120106
sleep_rem            NA         1          NA         NA      NA         NA
sleep_cycle          NA        NA           1         NA      NA         NA
awake        -0.9999986        NA          NA  1.0000000      NA  0.3119801
brainwt              NA        NA          NA         NA       1         NA
bodywt       -0.3120106        NA          NA  0.3119801      NA  1.0000000

Other arguments for handling missingness

Let’s try again, but this time change the behavior to calculate correlations for all pairwise complete observations:

msleep |> 
  select(sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt) |> 
  cor(use = "pairwise.complete.obs")
            sleep_total  sleep_rem sleep_cycle      awake    brainwt     bodywt
sleep_total   1.0000000  0.7517550  -0.4737127 -0.9999986 -0.3604874 -0.3120106
sleep_rem     0.7517550  1.0000000  -0.3381235 -0.7517713 -0.2213348 -0.3276507
sleep_cycle  -0.4737127 -0.3381235   1.0000000  0.4737127  0.8516203  0.4178029
awake        -0.9999986 -0.7517713   0.4737127  1.0000000  0.3604874  0.3119801
brainwt      -0.3604874 -0.2213348   0.8516203  0.3604874  1.0000000  0.9337822
bodywt       -0.3120106 -0.3276507   0.4178029  0.3119801  0.9337822  1.0000000
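
One thing to keep in mind with pairwise complete observations is that each correlation can be based on a different set of rows. This sketch contrasts it with “complete.obs” on a small made-up data frame (toy and its values are hypothetical) and counts the rows behind each pair:

```r
# Toy data (hypothetical values) with missing values in different rows
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, NA, 6),
  c = c(1, NA, 2, 5, 4, 6)
)

# Pairwise: each correlation uses every row where that pair is observed
pairwise <- cor(toy, use = "pairwise.complete.obs")

# Complete: every correlation uses only rows with no NA in any column
complete <- cor(toy, use = "complete.obs")

# Number of rows behind each pairwise correlation
observed <- ifelse(is.na(as.matrix(toy)), 0, 1)
pair_n <- t(observed) %*% observed

pairwise
complete
pair_n
```

Here the a-b correlation uses 5 rows under the pairwise option but only the 4 fully complete rows under “complete.obs”.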

Check your understanding!


True or False: The output from R functions usually tells you how missing data were handled.

True!

False!

Check your understanding!


If you want R to skip over missing values and give you the results based only on the available data, which argument might you use?

na.rm = TRUE

na.rm = FALSE

Filtering out missing values

Sometimes you want to create a new version of your data that excludes cases with missing values. This can be especially important if you want to make sure you’re using a consistent dataset across several related analyses.

Filtering out missing values

There are a few different options for removing rows with missing values in R:

  • na.omit()
  • filter() with is.na()

Filtering out missing values

If you want a version of your data with no missing values in it, you can use na.omit to remove any rows with missing values.

msleep_nomissing <- na.omit(msleep)


How many rows were in the original data?

nrow(msleep) 
[1] 83

How many complete rows are in the data?

nrow(msleep_nomissing)
[1] 20

Filtering out missing values

If we want to create a version of the data that doesn’t have any missing values for brainwt (but allows missing values for other variables), we can do that with filter:

msleep_nomissing_brainwt <- filter(msleep, !is.na(brainwt))
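
Before dropping rows, it can help to count how many you would lose. Here is a sketch using base R's complete.cases() on a made-up data frame (sensors is hypothetical, mirroring the air-quality example from earlier). The last step is a base R equivalent of the filter() call above.

```r
# Toy data frame (hypothetical values) with missing values in two columns
sensors <- data.frame(
  sensor_id = c("0001", "0002", "0003"),
  PM10 = c(25, 21, NA),
  O3 = c(0.0, NA, NA)
)

# TRUE for rows with no missing values in any column
complete_rows <- complete.cases(sensors)
sum(complete_rows)    # 1 complete row out of 3

# Drop every row that has any missing value
no_missing <- na.omit(sensors)

# Drop only rows missing PM10; missing O3 values are allowed to stay
no_missing_pm10 <- sensors[!is.na(sensors$PM10), ]

nrow(no_missing)        # 1
nrow(no_missing_pm10)   # 2
```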

Check your understanding!


Consider the following code. Do you think it will run without an error?

model_data <- msleep |> 
  select(sleep_total, brainwt) |> 
  na.omit()

lm(sleep_total ~ brainwt, 
   data = model_data,
   na.action = na.fail)

Call:
lm(formula = sleep_total ~ brainwt, data = model_data, na.action = na.fail)

Coefficients:
(Intercept)      brainwt  
     10.631       -1.633  

What we covered

  • Using summary() to quickly check a dataframe for missing values
  • Filtering using is.na() to test for missingness
  • Assigning values to NA using ifelse or if_else
  • The na.rm argument
  • The na.action argument
  • Using na.omit() to remove all rows with missingness
  • A bunch of extras throughout on data transformations and data cleaning

Additional resources

There are many excellent tutorials online about missing data in R. Many of them cover a lot of the same information presented here, but you may find a different perspective helpful to consolidate your learning. Here are a couple good ones:

For more powerful analysis of missing data in R, check out the Amelia package and the mice package.

Practice opportunity

Want to go through this material again? It’s posted as an interactive tutorial online as part of DART (Data and Analytics for Research Training)!

Thank you! Questions?

filter(topics, is.na(understanding))