Chapter 12 Transform data

In the previous section, we let ggplot do all the work for us in calculating averages. This is fine, but it would also be great if we could do these calculations ourselves so that we can have more flexibility in what we want to calculate and visualize. In order to do so, we need to learn a bit more about coding/programming!

Often we need to create new variables or create subsets of our dataframe (selecting particular variables or cases) before we can visualise and analyse. We’ll do this by using the functions provided in the dplyr package, that are much more intuitive and consistent to use than the base R-functions. To be able to make use of the dplyr-package, we have to install it and load it. Instead, we’ll install and load the tidyverse-package, this is a collection of packages (including dplyr) that are extremely useful.

# install.packages("tidyverse")
library(tidyverse)

We’ll focus on six functions:

  1. select() which allows you to select variables (or columns)
  2. filter() which allows you to select cases (or rows)
  3. mutate() which allows you to create new variables
  4. case_when() which allows you to create new variables on the basis of multiple if-then statements
  5. summarise() which allows you to summarise variables
  6. group_by() which allows you to summarise variables across groups

We shall make use of the mpg dataframe again that is loaded with the ggplot2-package:

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

12.1 The pipe: %>%

We’ll now learn a weird bit of coding that makes it much more easy to ‘read’ and understand your code (also for future you!). We’ll make use of the “pipe” (the symbol: %>%). Let’s see it in action:

12.2 select()

Let us see how we can create a new dataset where we only select the variables manufacturer, cty, and class:

mpg %>% select(manufacturer, cty, class)
## # A tibble: 234 × 3
##    manufacturer   cty class  
##    <chr>        <int> <chr>  
##  1 audi            18 compact
##  2 audi            21 compact
##  3 audi            20 compact
##  4 audi            21 compact
##  5 audi            16 compact
##  6 audi            18 compact
##  7 audi            18 compact
##  8 audi            18 compact
##  9 audi            16 compact
## 10 audi            20 compact
## # ℹ 224 more rows

(if we wanted to store the results, as we typically would want to, then we can just do: mpg_select <- mpg %>% select(manufacturer, cty, class)).

Or let’s select all variables except trans and drv:

mpg %>% select(-trans, -drv)
## # A tibble: 234 × 9
##    manufacturer model      displ  year   cyl   cty   hwy fl    class  
##    <chr>        <chr>      <dbl> <int> <int> <int> <int> <chr> <chr>  
##  1 audi         a4           1.8  1999     4    18    29 p     compact
##  2 audi         a4           1.8  1999     4    21    29 p     compact
##  3 audi         a4           2    2008     4    20    31 p     compact
##  4 audi         a4           2    2008     4    21    30 p     compact
##  5 audi         a4           2.8  1999     6    16    26 p     compact
##  6 audi         a4           2.8  1999     6    18    26 p     compact
##  7 audi         a4           3.1  2008     6    18    27 p     compact
##  8 audi         a4 quattro   1.8  1999     4    18    26 p     compact
##  9 audi         a4 quattro   1.8  1999     4    16    25 p     compact
## 10 audi         a4 quattro   2    2008     4    20    28 p     compact
## # ℹ 224 more rows

If we want all variables from cyl to hwy, we can do:

mpg %>% select(cyl:hwy)
## # A tibble: 234 × 5
##      cyl trans      drv     cty   hwy
##    <int> <chr>      <chr> <int> <int>
##  1     4 auto(l5)   f        18    29
##  2     4 manual(m5) f        21    29
##  3     4 manual(m6) f        20    31
##  4     4 auto(av)   f        21    30
##  5     6 auto(l5)   f        16    26
##  6     6 manual(m5) f        18    26
##  7     6 auto(av)   f        18    27
##  8     4 manual(m5) 4        18    26
##  9     4 auto(l5)   4        16    25
## 10     4 manual(m6) 4        20    28
## # ℹ 224 more rows

12.3 filter()

Selecting variables is handy, but not as crucial as filtering particular cases. What if we only want the subset of the data that includes women? What if we want to do an analyses on people with ages between 18 and 30? filter() is your friend.

Let’s say we only want to see the results for the manufacturer “audi”:

mpg %>% filter(manufacturer == "audi")
## # A tibble: 18 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## 11 audi         a4 quattro   2    2008     4 auto… 4        19    27 p     comp…
## 12 audi         a4 quattro   2.8  1999     6 auto… 4        15    25 p     comp…
## 13 audi         a4 quattro   2.8  1999     6 manu… 4        17    25 p     comp…
## 14 audi         a4 quattro   3.1  2008     6 auto… 4        17    25 p     comp…
## 15 audi         a4 quattro   3.1  2008     6 manu… 4        15    25 p     comp…
## 16 audi         a6 quattro   2.8  1999     6 auto… 4        15    24 p     mids…
## 17 audi         a6 quattro   3.1  2008     6 auto… 4        17    25 p     mids…
## 18 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…

Beware that in this case you have to put audi in quotes (““)! filter(manufacturer == audi) does not work! On the other hand, filter("manufacturer" == "audi") DOES work.

cty refers to the fuel efficiency of the different car models (in miles per gallon); let’s only select fuel-efficient cars with more than 20 miles per gallon.

mpg %>% filter(cty > 20)
## # A tibble: 45 × 11
##    manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class
##    <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
##  1 audi         a4       1.8  1999     4 manual(m… f        21    29 p     comp…
##  2 audi         a4       2    2008     4 auto(av)  f        21    30 p     comp…
##  3 chevrolet    malibu   2.4  2008     4 auto(l4)  f        22    30 r     mids…
##  4 honda        civic    1.6  1999     4 manual(m… f        28    33 r     subc…
##  5 honda        civic    1.6  1999     4 auto(l4)  f        24    32 r     subc…
##  6 honda        civic    1.6  1999     4 manual(m… f        25    32 r     subc…
##  7 honda        civic    1.6  1999     4 manual(m… f        23    29 p     subc…
##  8 honda        civic    1.6  1999     4 auto(l4)  f        24    32 r     subc…
##  9 honda        civic    1.8  2008     4 manual(m… f        26    34 r     subc…
## 10 honda        civic    1.8  2008     4 auto(l5)  f        25    36 r     subc…
## # ℹ 35 more rows

12.3.1 R’s operators

R uses the following operators:

  • >; greater than
  • >=; greater than or equal to
  • <; smaller than
  • <=; smaller than or equal to
  • ==; equal to
  • !=; not equal

R also has ‘Boolean’ operators that you can use to combine statements (for further explanation, see http://r4ds.had.co.nz/transform.html#filter-rows-with-filter);

  • &; and
  • |; or
  • !; not

Through these operators, we can combine above filters, if we want to look at fuel-efficient audis:

mpg %>% filter(manufacturer == "audi" & cty > 20)
## # A tibble: 2 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 2 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…

Let’s look at fuel-inefficient audis (remember that the ! means NOT):

mpg %>% filter(manufacturer == "audi" & !(cty > 20))
## # A tibble: 16 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  3 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  4 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  5 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  6 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  7 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
##  8 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
##  9 audi         a4 quattro   2    2008     4 auto… 4        19    27 p     comp…
## 10 audi         a4 quattro   2.8  1999     6 auto… 4        15    25 p     comp…
## 11 audi         a4 quattro   2.8  1999     6 manu… 4        17    25 p     comp…
## 12 audi         a4 quattro   3.1  2008     6 auto… 4        17    25 p     comp…
## 13 audi         a4 quattro   3.1  2008     6 manu… 4        15    25 p     comp…
## 14 audi         a6 quattro   2.8  1999     6 auto… 4        15    24 p     mids…
## 15 audi         a6 quattro   3.1  2008     6 auto… 4        17    25 p     mids…
## 16 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…

Let’s look at three different manufacturers; audi, ford, and honda:

mpg %>% filter(manufacturer == "audi" | manufacturer == "ford" | manufacturer == "honda")
## # A tibble: 52 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 42 more rows

Although this works like a charm, the code is relatively long, and selecting cases on the basis of different categories is something we often do, so luckily for us there is a useful short-hand:

mpg %>% filter(manufacturer %in% c("audi", "ford", "honda"))
## # A tibble: 52 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 42 more rows

12.3.2 Dealing with missing values

Missing values in R are signified by NA (Not Available). is.na() is a function that tells you whether particular values is NA or not. For instance:

temp <- c(1, 2, NA, 4, 5)
is.na(temp)
## [1] FALSE FALSE  TRUE FALSE FALSE

Missing values have an impact on some functions:

mean(temp)
## [1] NA

That was not expected. It’s because R doesn’t know how to calculate an average when it has to take into account an NA-value. But we can tell the mean-function to ignore these NA-values (see ?mean) byt adding na.rm = TRUE as an argument:

mean(temp, na.rm = TRUE)
## [1] 3

That’s more like it!

We can use that same command within filter. The mpg dataset has no missing values, so let’s create our own dataframe.

temp_df <- data.frame(
  VarA = temp,
  varB = c(10, 11, 12, 13, 14)
)
temp_df
##   VarA varB
## 1    1   10
## 2    2   11
## 3   NA   12
## 4    4   13
## 5    5   14

And apply a filter to select cases that do not have NAs.

temp_df %>% filter(!is.na(VarA))
##   VarA varB
## 1    1   10
## 2    2   11
## 3    4   13
## 4    5   14

12.4 mutate()

Often we want to create new colums based on the information in other columns. mutate() is very useful for this purpose. mutate() will always add (a) new variable(s) to your dataframe.

Let’s try and create a mean-score for the two fuel-efficiency variables (cty and hwy):

mpg %>% mutate(mean_fe = (cty + hwy) / 2)
## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: mean_fe <dbl>

We can calculate multiple columns at a time. Let’s make new variables that contain fuel efficiency in kilometers (1 mile is 1.609344 kilometer) per liter (1 gallon is 3.78541178 liter). To do this we need to multiple the cty and hwy variables by 1.609344 and then divide it by 3.78541178. We can also use the variables that we have just defined in creating new variables:

mpg %>% mutate(
  cty_km_l = cty * 1.609344 / 3.78541178,
  hwy_km_l = hwy * 1.609344 / 3.78541178,
  mean_fe_km_l = (cty_km_l + hwy_km_l) / 2,
  mean_fe_rnd = round(mean_fe_km_l, 2)
)
## # A tibble: 234 × 15
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 4 more variables: cty_km_l <dbl>, hwy_km_l <dbl>, mean_fe_km_l <dbl>,
## #   mean_fe_rnd <dbl>

12.4.1 case_when()

You can also easily change or createvariables with case_when() inside mutate().

Let’s change our “drv”-variable that consists of three categories (“f” for frontwheeldrive, “r” for rearwheeldrive, and “4” for fourwheeldrive), into a variable with two categories (“two” for twowheeldrive and “four” for fourwheeldrive):

mpg %>% mutate(drv_bin = case_when(
  drv == "f" ~ "two",
  drv == "r" ~ "two",
  drv == "4" ~ "four"
))
## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: drv_bin <chr>

Or let’s say another mistake was made. For all audis with fourwheeldrive the cty-variable was 3 too high. Let’s make a new variable that corrects this mistake:

mpg %>%
  mutate(cty_rec = case_when(
    manufacturer == "audi" & drv == "4" ~ cty - 3,
    !(manufacturer == "audi" & drv == "4") ~ as.double(cty)
  ))
## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: cty_rec <dbl>

12.5 summarise()

summarise() allows you to summarise the information in the variables. summarise() expects a vector (a variable) and transforms it into one number. It is very different to mutate() in this sense, which always returns a new variable.

Let’s see an example:

mpg %>% summarise(mean_cty = mean(cty))
## # A tibble: 1 × 1
##   mean_cty
##      <dbl>
## 1     16.9

Again, we can calculate multiple summary-variables at the same time:

mpg %>% summarise(
  n_cty = n(),
  mean_cty = mean(cty),
  sd_cty = sd(cty),
  se_cty = sd_cty / sqrt(n_cty)
)
## # A tibble: 1 × 4
##   n_cty mean_cty sd_cty se_cty
##   <int>    <dbl>  <dbl>  <dbl>
## 1   234     16.9   4.26  0.278

12.5.1 group_by()

The summarise() is particularly useful in combination with the group_by() function, that groups the dataframe by the categories in a variable that you have specified.

Let’s try the above code, but now grouping on manufacturer:

mpg %>%
  group_by(manufacturer) %>%
  summarise(
    n_cty = n(),
    mean_cty = mean(cty),
    sd_cty = sd(cty),
    se_cty = sd_cty / sqrt(n_cty)
  )
## # A tibble: 15 × 5
##    manufacturer n_cty mean_cty sd_cty se_cty
##    <chr>        <int>    <dbl>  <dbl>  <dbl>
##  1 audi            18     17.6  1.97   0.465
##  2 chevrolet       19     15    2.92   0.671
##  3 dodge           37     13.1  2.49   0.409
##  4 ford            25     14    1.91   0.383
##  5 honda            9     24.4  1.94   0.648
##  6 hyundai         14     18.6  1.50   0.401
##  7 jeep             8     13.5  2.51   0.886
##  8 land rover       4     11.5  0.577  0.289
##  9 lincoln          3     11.3  0.577  0.333
## 10 mercury          4     13.2  0.5    0.25 
## 11 nissan          13     18.1  3.43   0.950
## 12 pontiac          5     17    1      0.447
## 13 subaru          14     19.3  0.914  0.244
## 14 toyota          34     18.5  4.05   0.694
## 15 volkswagen      27     20.9  4.56   0.877

Woah not bad!

We can also use multiple grouping variables:

mpg %>%
  group_by(drv, year) %>%
  summarise(
    n_cty = n(),
    mean_cty = mean(cty),
    sd_cty = sd(cty),
    se_cty = sd_cty / sqrt(n_cty)
  )
## `summarise()` has grouped output by 'drv'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups:   drv [3]
##   drv    year n_cty mean_cty sd_cty se_cty
##   <chr> <int> <int>    <dbl>  <dbl>  <dbl>
## 1 4      1999    49     14.2   2.56  0.366
## 2 4      2008    54     14.4   3.15  0.429
## 3 f      1999    57     20.0   4.09  0.542
## 4 f      2008    49     20.0   3.04  0.434
## 5 r      1999    11     14     2.76  0.831
## 6 r      2008    14     14.1   1.79  0.479

12.6 Putting it all together

Let’s try and put some of the things we have learned above together in one “pipe”:

mpg %>%
  select(class, year, cty, hwy) %>%
  filter(year == 2008) %>%
  mutate(hwy_km_l = hwy * 1.609344 / 3.78541178) %>%
  group_by(class) %>%
  summarise(
    n_hwy = n(),
    mean_hwy = mean(hwy_km_l),
    sd_hwy = sd(hwy_km_l),
    se_hwy = sd_hwy / sqrt(n_hwy)
  )
## # A tibble: 7 × 5
##   class      n_hwy mean_hwy sd_hwy se_hwy
##   <chr>      <int>    <dbl>  <dbl>  <dbl>
## 1 2seater        3    10.6   0.425  0.245
## 2 compact       22    12.2   1.30   0.277
## 3 midsize       21    11.9   1.07   0.234
## 4 minivan        5     9.44  1.25   0.561
## 5 pickup        17     7.20  1.19   0.288
## 6 subcompact    16    11.5   2.05   0.512
## 7 suv           33     7.92  1.43   0.249

Now we’re getting somewhere!

12.7 Assignments

  1. From the mpg-dataset that is stored within mpg, select only the displ, year, and class variables. Use the pipe (%>%)!

  2. Select all cases for the year 2008. Use the pipe (%>%)!

  3. Create a mean and standard deviation for the displ-variable for the different classes of cars (the class variable). Remember, use the group_by and summarise function. Use the pipe (%>%)!

  4. Put all the above actions into one long “pipe”!

12.8 Further reading

R for Data Science’s chapter on transforming data: http://r4ds.had.co.nz/transform.html. RStudio’s cheatshees are also very helpful! See the “Data Transformation Cheat Sheet” on https://www.rstudio.com/resources/cheatsheets/.