Chapter 12 Transform data
In the previous section, we let ggplot do all the work for us in calculating averages. This is fine, but it would also be great if we could do these calculations ourselves so that we can have more flexibility in what we want to calculate and visualize. In order to do so, we need to learn a bit more about coding/programming!
Often we need to create new variables or create subsets of our dataframe (selecting particular variables or cases) before we can visualise and analyse. We’ll do this by using the functions provided in the dplyr
package, that are much more intuitive and consistent to use than the base R-functions. To be able to make use of the dplyr
-package, we have to install it and load it. Instead, we’ll install and load the tidyverse
-package, this is a collection of packages (including dplyr
) that are extremely useful.
We’ll focus on six functions:
select()
which allows you to select variables (or columns)filter()
which allows you to select cases (or rows)mutate()
which allows you to create new variablescase_when()
which allows you to create new variables on the basis of multiple if-then statementssummarise()
which allows you to summarise variablesgroup_by()
which allows you to summarise variables across groups
We shall make use of the mpg
dataframe again that is loaded with the ggplot2
-package:
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
12.1 The pipe: %>%
We’ll now learn a weird bit of coding that makes it much more easy to ‘read’ and understand your code (also for future you!). We’ll make use of the “pipe” (the symbol: %>%
). Let’s see it in action:
12.2 select()
Let us see how we can create a new dataset where we only select the variables manufacturer, cty, and class:
## # A tibble: 234 × 3
## manufacturer cty class
## <chr> <int> <chr>
## 1 audi 18 compact
## 2 audi 21 compact
## 3 audi 20 compact
## 4 audi 21 compact
## 5 audi 16 compact
## 6 audi 18 compact
## 7 audi 18 compact
## 8 audi 18 compact
## 9 audi 16 compact
## 10 audi 20 compact
## # ℹ 224 more rows
(if we wanted to store the results, as we typically would want to, then we can just do: mpg_select <- mpg %>% select(manufacturer, cty, class)
).
Or let’s select all variables except trans
and drv
:
## # A tibble: 234 × 9
## manufacturer model displ year cyl cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 18 29 p compact
## 2 audi a4 1.8 1999 4 21 29 p compact
## 3 audi a4 2 2008 4 20 31 p compact
## 4 audi a4 2 2008 4 21 30 p compact
## 5 audi a4 2.8 1999 6 16 26 p compact
## 6 audi a4 2.8 1999 6 18 26 p compact
## 7 audi a4 3.1 2008 6 18 27 p compact
## 8 audi a4 quattro 1.8 1999 4 18 26 p compact
## 9 audi a4 quattro 1.8 1999 4 16 25 p compact
## 10 audi a4 quattro 2 2008 4 20 28 p compact
## # ℹ 224 more rows
If we want all variables from cyl to hwy, we can do:
## # A tibble: 234 × 5
## cyl trans drv cty hwy
## <int> <chr> <chr> <int> <int>
## 1 4 auto(l5) f 18 29
## 2 4 manual(m5) f 21 29
## 3 4 manual(m6) f 20 31
## 4 4 auto(av) f 21 30
## 5 6 auto(l5) f 16 26
## 6 6 manual(m5) f 18 26
## 7 6 auto(av) f 18 27
## 8 4 manual(m5) 4 18 26
## 9 4 auto(l5) 4 16 25
## 10 4 manual(m6) 4 20 28
## # ℹ 224 more rows
12.3 filter()
Selecting variables is handy, but not as crucial as filtering particular cases. What if we only want the subset of the data that includes women? What if we want to do an analyses on people with ages between 18 and 30? filter()
is your friend.
Let’s say we only want to see the results for the manufacturer “audi”:
## # A tibble: 18 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## 11 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
## 12 audi a4 quattro 2.8 1999 6 auto… 4 15 25 p comp…
## 13 audi a4 quattro 2.8 1999 6 manu… 4 17 25 p comp…
## 14 audi a4 quattro 3.1 2008 6 auto… 4 17 25 p comp…
## 15 audi a4 quattro 3.1 2008 6 manu… 4 15 25 p comp…
## 16 audi a6 quattro 2.8 1999 6 auto… 4 15 24 p mids…
## 17 audi a6 quattro 3.1 2008 6 auto… 4 17 25 p mids…
## 18 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
Beware that in this case you have to put audi in quotes (““)! filter(manufacturer == audi)
does not work! On the other hand, filter("manufacturer" == "audi")
DOES work.
cty
refers to the fuel efficiency of the different car models (in miles per gallon); let’s only select fuel-efficient cars with more than 20 miles per gallon.
## # A tibble: 45 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 manual(m… f 21 29 p comp…
## 2 audi a4 2 2008 4 auto(av) f 21 30 p comp…
## 3 chevrolet malibu 2.4 2008 4 auto(l4) f 22 30 r mids…
## 4 honda civic 1.6 1999 4 manual(m… f 28 33 r subc…
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subc…
## 6 honda civic 1.6 1999 4 manual(m… f 25 32 r subc…
## 7 honda civic 1.6 1999 4 manual(m… f 23 29 p subc…
## 8 honda civic 1.6 1999 4 auto(l4) f 24 32 r subc…
## 9 honda civic 1.8 2008 4 manual(m… f 26 34 r subc…
## 10 honda civic 1.8 2008 4 auto(l5) f 25 36 r subc…
## # ℹ 35 more rows
12.3.1 R’s operators
R uses the following operators:
>
; greater than>=
; greater than or equal to<
; smaller than<=
; smaller than or equal to==
; equal to!=
; not equal
R also has ‘Boolean’ operators that you can use to combine statements (for further explanation, see http://r4ds.had.co.nz/transform.html#filter-rows-with-filter);
&
; and|
; or!
; not
Through these operators, we can combine above filters, if we want to look at fuel-efficient audis:
## # A tibble: 2 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 2 audi a4 2 2008 4 auto(av) f 21 30 p compa…
Let’s look at fuel-inefficient audis (remember that the !
means NOT):
## # A tibble: 16 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 2 2008 4 manu… f 20 31 p comp…
## 3 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 4 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 5 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 6 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 7 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 8 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## 9 audi a4 quattro 2 2008 4 auto… 4 19 27 p comp…
## 10 audi a4 quattro 2.8 1999 6 auto… 4 15 25 p comp…
## 11 audi a4 quattro 2.8 1999 6 manu… 4 17 25 p comp…
## 12 audi a4 quattro 3.1 2008 6 auto… 4 17 25 p comp…
## 13 audi a4 quattro 3.1 2008 6 manu… 4 15 25 p comp…
## 14 audi a6 quattro 2.8 1999 6 auto… 4 15 24 p mids…
## 15 audi a6 quattro 3.1 2008 6 auto… 4 17 25 p mids…
## 16 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
Let’s look at three different manufacturers; audi, ford, and honda:
## # A tibble: 52 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 42 more rows
Although this works like a charm, the code is relatively long, and selecting cases on the basis of different categories is something we often do, so luckily for us there is a useful short-hand:
## # A tibble: 52 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 42 more rows
12.3.2 Dealing with missing values
Missing values in R are signified by NA
(Not Available). is.na()
is a function that tells you whether particular values is NA or not. For instance:
## [1] FALSE FALSE TRUE FALSE FALSE
Missing values have an impact on some functions:
## [1] NA
That was not expected. It’s because R doesn’t know how to calculate an average when it has to take into account an NA
-value. But we can tell the mean
-function to ignore these NA
-values (see ?mean
) byt adding na.rm = TRUE
as an argument:
## [1] 3
That’s more like it!
We can use that same command within filter
. The mpg
dataset has no missing values, so let’s create our own dataframe.
## VarA varB
## 1 1 10
## 2 2 11
## 3 NA 12
## 4 4 13
## 5 5 14
And apply a filter to select cases that do not have NAs.
## VarA varB
## 1 1 10
## 2 2 11
## 3 4 13
## 4 5 14
12.4 mutate()
Often we want to create new colums based on the information in other columns. mutate()
is very useful for this purpose. mutate()
will always add (a) new variable(s) to your dataframe.
Let’s try and create a mean-score for the two fuel-efficiency variables (cty and hwy):
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: mean_fe <dbl>
We can calculate multiple columns at a time. Let’s make new variables that contain fuel efficiency in kilometers (1 mile is 1.609344 kilometer) per liter (1 gallon is 3.78541178 liter). To do this we need to multiple the cty
and hwy
variables by 1.609344 and then divide it by 3.78541178. We can also use the variables that we have just defined in creating new variables:
mpg %>% mutate(
cty_km_l = cty * 1.609344 / 3.78541178,
hwy_km_l = hwy * 1.609344 / 3.78541178,
mean_fe_km_l = (cty_km_l + hwy_km_l) / 2,
mean_fe_rnd = round(mean_fe_km_l, 2)
)
## # A tibble: 234 × 15
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 4 more variables: cty_km_l <dbl>, hwy_km_l <dbl>, mean_fe_km_l <dbl>,
## # mean_fe_rnd <dbl>
12.4.1 case_when()
You can also easily change or createvariables with case_when()
inside mutate()
.
Let’s change our “drv”-variable that consists of three categories (“f” for frontwheeldrive, “r” for rearwheeldrive, and “4” for fourwheeldrive), into a variable with two categories (“two” for twowheeldrive and “four” for fourwheeldrive):
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: drv_bin <chr>
Or let’s say another mistake was made. For all audis with fourwheeldrive the cty
-variable was 3 too high. Let’s make a new variable that corrects this mistake:
mpg %>%
mutate(cty_rec = case_when(
manufacturer == "audi" & drv == "4" ~ cty - 3,
!(manufacturer == "audi" & drv == "4") ~ as.double(cty)
))
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: cty_rec <dbl>
12.5 summarise()
summarise()
allows you to summarise the information in the variables. summarise()
expects a vector (a variable) and transforms it into one number. It is very different to mutate()
in this sense, which always returns a new variable.
Let’s see an example:
## # A tibble: 1 × 1
## mean_cty
## <dbl>
## 1 16.9
Again, we can calculate multiple summary-variables at the same time:
mpg %>% summarise(
n_cty = n(),
mean_cty = mean(cty),
sd_cty = sd(cty),
se_cty = sd_cty / sqrt(n_cty)
)
## # A tibble: 1 × 4
## n_cty mean_cty sd_cty se_cty
## <int> <dbl> <dbl> <dbl>
## 1 234 16.9 4.26 0.278
12.5.1 group_by()
The summarise()
is particularly useful in combination with the group_by()
function, that groups the dataframe by the categories in a variable that you have specified.
Let’s try the above code, but now grouping on manufacturer:
mpg %>%
group_by(manufacturer) %>%
summarise(
n_cty = n(),
mean_cty = mean(cty),
sd_cty = sd(cty),
se_cty = sd_cty / sqrt(n_cty)
)
## # A tibble: 15 × 5
## manufacturer n_cty mean_cty sd_cty se_cty
## <chr> <int> <dbl> <dbl> <dbl>
## 1 audi 18 17.6 1.97 0.465
## 2 chevrolet 19 15 2.92 0.671
## 3 dodge 37 13.1 2.49 0.409
## 4 ford 25 14 1.91 0.383
## 5 honda 9 24.4 1.94 0.648
## 6 hyundai 14 18.6 1.50 0.401
## 7 jeep 8 13.5 2.51 0.886
## 8 land rover 4 11.5 0.577 0.289
## 9 lincoln 3 11.3 0.577 0.333
## 10 mercury 4 13.2 0.5 0.25
## 11 nissan 13 18.1 3.43 0.950
## 12 pontiac 5 17 1 0.447
## 13 subaru 14 19.3 0.914 0.244
## 14 toyota 34 18.5 4.05 0.694
## 15 volkswagen 27 20.9 4.56 0.877
Woah not bad!
We can also use multiple grouping variables:
mpg %>%
group_by(drv, year) %>%
summarise(
n_cty = n(),
mean_cty = mean(cty),
sd_cty = sd(cty),
se_cty = sd_cty / sqrt(n_cty)
)
## `summarise()` has grouped output by 'drv'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 6
## # Groups: drv [3]
## drv year n_cty mean_cty sd_cty se_cty
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 4 1999 49 14.2 2.56 0.366
## 2 4 2008 54 14.4 3.15 0.429
## 3 f 1999 57 20.0 4.09 0.542
## 4 f 2008 49 20.0 3.04 0.434
## 5 r 1999 11 14 2.76 0.831
## 6 r 2008 14 14.1 1.79 0.479
12.6 Putting it all together
Let’s try and put some of the things we have learned above together in one “pipe”:
mpg %>%
select(class, year, cty, hwy) %>%
filter(year == 2008) %>%
mutate(hwy_km_l = hwy * 1.609344 / 3.78541178) %>%
group_by(class) %>%
summarise(
n_hwy = n(),
mean_hwy = mean(hwy_km_l),
sd_hwy = sd(hwy_km_l),
se_hwy = sd_hwy / sqrt(n_hwy)
)
## # A tibble: 7 × 5
## class n_hwy mean_hwy sd_hwy se_hwy
## <chr> <int> <dbl> <dbl> <dbl>
## 1 2seater 3 10.6 0.425 0.245
## 2 compact 22 12.2 1.30 0.277
## 3 midsize 21 11.9 1.07 0.234
## 4 minivan 5 9.44 1.25 0.561
## 5 pickup 17 7.20 1.19 0.288
## 6 subcompact 16 11.5 2.05 0.512
## 7 suv 33 7.92 1.43 0.249
Now we’re getting somewhere!
12.7 Assignments
From the
mpg
-dataset that is stored withinmpg
, select only thedispl
,year
, andclass
variables. Use the pipe (%>%
)!Select all cases for the
year
2008. Use the pipe (%>%
)!Create a mean and standard deviation for the
displ
-variable for the different classes of cars (theclass
variable). Remember, use thegroup_by
andsummarise
function. Use the pipe (%>%
)!Put all the above actions into one long “pipe”!
12.8 Further reading
R for Data Science’s chapter on transforming data: http://r4ds.had.co.nz/transform.html. RStudio’s cheatshees are also very helpful! See the “Data Transformation Cheat Sheet” on https://www.rstudio.com/resources/cheatsheets/.