Chapter 11 What did you miss today?

We have barely scratched the surface of what R can do, although much of it you may never need. We came quite far today, but here are some things that we did not address and that you will certainly encounter when you start using R:

11.1 Lists

A vector is an ordered sequence of data elements of the same basic type. A list is an ordered sequence of elements of any type, including vectors, data frames, or even other lists. For example:

list(A = c(1,2,3,4), B = "Bla", C = TRUE, D = data.frame(X = c(8,9), Y = c("M", "F")) )

## $A
## [1] 1 2 3 4
## 
## $B
## [1] "Bla"
## 
## $C
## [1] TRUE
## 
## $D
##   X Y
## 1 8 M
## 2 9 F
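Once you have a list, you will usually want to pull individual elements out of it again. The sketch below stores the same list under the name my_list (a name chosen here purely for illustration) and shows the most common ways of extracting elements in base R:

```r
# The same list as above, stored under an illustrative name
my_list <- list(A = c(1, 2, 3, 4), B = "Bla", C = TRUE,
                D = data.frame(X = c(8, 9), Y = c("M", "F")))

my_list$A        # extract element A by name: the numeric vector
my_list[["B"]]   # double brackets do the same: the character string "Bla"
my_list["B"]     # single brackets keep the list wrapper: a list of length 1
my_list$D$Y      # elements can be nested: column Y of the data frame D
```

The difference between single and double brackets is a common source of confusion: single brackets always return a (smaller) list, double brackets return the element itself.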

11.2 Factors

Most of the non-numerical variables that we encountered were character variables. Historically, these types of variables used to be factor variables. Factors are a bit different, because they are variables that have a fixed and known set of possible values. They are vectors of integers with character labels attached to them (e.g., 1 = “male”, 2 = “female”, 3 = “unknown”). The advantage of factors over character variables is that you can determine their ordering. This is particularly handy when visualising data and you want the variable to be ordered in a particular way (as we have done today!) or when you want to change reference categories in analyses.

The forcats-package is your friend. Also, see http://r4ds.had.co.nz/factors.html.
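As a minimal sketch of the idea, the base R code below creates a factor with an explicit level order and then changes the reference category. The variable names sex and sex_ref are made up for this example; forcats offers tidyverse-style helpers for the same tasks.

```r
# Create a factor and fix the order of its levels explicitly
sex <- factor(c("male", "female", "unknown", "female"),
              levels = c("male", "female", "unknown"))

levels(sex)       # the fixed set of possible values, in the order we chose
as.integer(sex)   # the integer codes underneath the labels

# Make "female" the first level, i.e. the reference category in many analyses
sex_ref <- relevel(sex, ref = "female")
levels(sex_ref)
```

Without the levels argument, R would order the levels alphabetically, which is rarely the order you want in a plot or a model.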

11.3 Creating your own functions

A huge part of R is building functions. All the packages that we have used consist of functions to do something. But we can also make functions ourselves, tailored to our needs.

As an example, we often want to standardise our variables before analysis, i.e., subtract the mean from each value and divide by the standard deviation. For any continuous variable, we can do:

library(tidyverse)
df <- mpg # mpg dataset from ggplot2
df %>% mutate(cty_stand = (cty - mean(cty)) / sd(cty),
              hwy_stand = (hwy - mean(hwy)) / sd(hwy),
              displ_stand = (displ - mean(displ)) / sd(displ))

## # A tibble: 234 × 14
##    manufac…¹ model displ  year   cyl trans drv     cty   hwy fl    class cty_s…²
##    <chr>     <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>   <dbl>
##  1 audi      a4      1.8  1999     4 auto… f        18    29 p     comp…   0.268
##  2 audi      a4      1.8  1999     4 manu… f        21    29 p     comp…   0.973
##  3 audi      a4      2    2008     4 manu… f        20    31 p     comp…   0.738
##  4 audi      a4      2    2008     4 auto… f        21    30 p     comp…   0.973
##  5 audi      a4      2.8  1999     6 auto… f        16    26 p     comp…  -0.202
##  6 audi      a4      2.8  1999     6 manu… f        18    26 p     comp…   0.268
##  7 audi      a4      3.1  2008     6 auto… f        18    27 p     comp…   0.268
##  8 audi      a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…   0.268
##  9 audi      a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…  -0.202
## 10 audi      a4 q…   2    2008     4 manu… 4        20    28 p     comp…   0.738
## # … with 224 more rows, 2 more variables: hwy_stand <dbl>, displ_stand <dbl>,
## #   and abbreviated variable names ¹manufacturer, ²cty_stand

We can also create our own function, which will save us some typing:

standardise <- function(variable) {
  (variable - mean(variable)) / sd(variable)
}

We can now use this function:

df %>% mutate(cty_stand = standardise(cty),
              hwy_stand = standardise(hwy),
              displ_stand = standardise(displ))

## # A tibble: 234 × 14
##    manufac…¹ model displ  year   cyl trans drv     cty   hwy fl    class cty_s…²
##    <chr>     <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>   <dbl>
##  1 audi      a4      1.8  1999     4 auto… f        18    29 p     comp…   0.268
##  2 audi      a4      1.8  1999     4 manu… f        21    29 p     comp…   0.973
##  3 audi      a4      2    2008     4 manu… f        20    31 p     comp…   0.738
##  4 audi      a4      2    2008     4 auto… f        21    30 p     comp…   0.973
##  5 audi      a4      2.8  1999     6 auto… f        16    26 p     comp…  -0.202
##  6 audi      a4      2.8  1999     6 manu… f        18    26 p     comp…   0.268
##  7 audi      a4      3.1  2008     6 auto… f        18    27 p     comp…   0.268
##  8 audi      a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…   0.268
##  9 audi      a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…  -0.202
## 10 audi      a4 q…   2    2008     4 manu… 4        20    28 p     comp…   0.738
## # … with 224 more rows, 2 more variables: hwy_stand <dbl>, displ_stand <dbl>,
## #   and abbreviated variable names ¹manufacturer, ²cty_stand

Another advantage is readability: the name of the function tells us at a glance what is going on (the variable is standardised in some way).
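Functions can also take extra arguments with default values. Real data often contain missing values, and mean() and sd() return NA as soon as one value is missing. The sketch below is one possible variant, not part of the course material: the name standardise_na and the na.rm argument are choices made for this illustration.

```r
# Illustrative variant of standardise() with an optional na.rm argument
standardise_na <- function(variable, na.rm = FALSE) {
  # na.rm is passed on to mean() and sd(), which both accept it
  (variable - mean(variable, na.rm = na.rm)) / sd(variable, na.rm = na.rm)
}

x <- c(2, 4, 6, NA)

standardise_na(x)                # all NA, because of the missing value
standardise_na(x, na.rm = TRUE)  # -1 0 1 NA: the missing value stays missing
```

Because na.rm defaults to FALSE, the function behaves exactly like the original unless you explicitly ask it to ignore missing values.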

11.4 Combining files

Combining and merging files can be easily done in R. This chapter is useful: http://r4ds.had.co.nz/relational-data.html
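To give a flavour of what this looks like, here is a small sketch using dplyr (part of the tidyverse). The two tables and their column names are invented for this example. left_join() matches rows on a shared key column, and bind_rows() stacks tables with the same columns on top of each other.

```r
library(dplyr)

# Invented example data: two small tables that share an "id" column
people <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cas"))
scores <- tibble(id = c(1, 2, 4), score = c(7.5, 6.0, 9.0))

# Keep every row of `people`; add `score` where the id matches, NA otherwise
combined <- left_join(people, scores, by = "id")
combined

# Stack two tables with the same columns on top of each other
more_people <- tibble(id = 4, name = "Dee")
all_people <- bind_rows(people, more_people)
all_people
```

Here the person with id 3 has no match in scores, so their score is NA, and the score for id 4 is dropped because that id does not occur in people.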

11.5 Updating packages and R

A great feature of R is that, because so many people monitor it, it is constantly updated and bugs are being fixed. Annoyingly, this means that you have to repeatedly update R and your packages. Updating packages is easy: if you simply run install.packages("mypackage") again (with the name of your package filled in), you will update your existing package (although beware; things may have changed!). You can also do this easily in RStudio (under the “Tools” menu), where you can also choose to update all packages at once.

Unfortunately, updating R is not as easy. This involves downloading the newest version of R from https://www.r-project.org/ and installing it. You don’t have to download the newest version each time one is released, but you will find that some packages and some functions will stop running on older versions of R, necessitating an update.
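Before updating, it helps to know which versions you currently have. The sketch below uses base R functions only. The package stats is used as an example because it ships with every R installation, and the version number in the comparison is arbitrary. The update commands are commented out because they need an internet connection and change your installation.

```r
# Which version of R is running?
r_version <- R.version.string
r_version

# Which version of a given package is installed?
stats_version <- packageVersion("stats")
stats_version

# Versions can be compared directly, e.g. to check a minimum requirement
r_is_recent <- getRversion() >= "3.0.0"
r_is_recent

# To update from the console (requires internet access):
# old.packages()                # list installed packages with newer versions
# update.packages(ask = FALSE)  # update all of them without asking one by one
```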