Chapter 4 Describe data quickly

4.1 Summarising the variables in the dataset

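We already touched upon how to create our own summary measures for particular variables. When we start analysing, however, we often want an overall grasp of the data first. Two base R functions are useful for this: str() shows the structure of the dataset (the type and first few values of each variable), and summary() gives basic descriptive statistics for each variable. Let’s see what they do: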

data_tooth <- ToothGrowth # dataset that is built into R
str(data_tooth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(data_tooth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
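
Both functions also work on a single variable. As a minimal sketch (the object names len_summary and group_means are only illustrative), the following summarises len on its own and then uses tapply() to compute the mean tooth length per supplement type:

```r
data_tooth <- ToothGrowth # dataset that is built into R

# summary() on one numeric column returns the same six statistics as above
len_summary <- summary(data_tooth$len)
print(len_summary)

# Mean of len within each level of supp
group_means <- tapply(data_tooth$len, data_tooth$supp, mean)
print(group_means)
```

Because both supplement groups contain 30 observations, the average of the two group means equals the overall mean reported by summary().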

The skimr package (McNamara et al. 2018) gives a similar overview with more detail. Its skim() function groups the variables by type and adds the number of missing values, the standard deviation, and a small histogram for each numeric variable.

install.packages("skimr") # only needed once, to install the package
library(skimr)
skim(data_tooth)
Table 4.1: Data summary

Name                     data_tooth
Number of rows           60
Number of columns        3
_______________________
Column type frequency:
  factor                 1
  numeric                2
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
supp                   0              1  FALSE           2  OJ: 30, VC: 30

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd   p0    p25    p50    p75  p100  hist
len                    0              1  18.81  7.65  4.2  13.07  19.25  25.27  33.9  ▅▃▅▇▂
dose                   0              1   1.17  0.63  0.5   0.50   1.00   2.00   2.0  ▇▇▁▁▇
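
skim() can also be restricted to particular columns or applied to grouped data. The sketch below assumes that skimr and dplyr are installed (the requireNamespace() guard simply skips the example otherwise); the object names len_skim and grouped_skim are illustrative.

```r
data_tooth <- ToothGrowth # dataset that is built into R

# Only run the example if the required packages are available
if (requireNamespace("skimr", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # Skim a single column by naming it after the data
  len_skim <- skimr::skim(data_tooth, len)
  print(len_skim)

  # Skim the dataset separately for each supplement type
  grouped_skim <- skimr::skim(dplyr::group_by(data_tooth, supp))
  print(grouped_skim)
}
```

In the grouped case, the "Group variables" entry of the data summary should list supp instead of None.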

4.1.1 Summarising categorical variables

To quickly count how often each level of a categorical variable occurs, pass that variable to the table() function. For instance:

table(data_tooth$supp) # table(data_tooth["supp"]) gives the same counts
## 
## OJ VC 
## 30 30
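
Counts can be turned into proportions with prop.table(), and table() accepts a second variable to produce a cross-tabulation. The following is a small sketch with illustrative object names; dose is numeric, but it takes only three distinct values, so it can be tabulated like a categorical variable.

```r
data_tooth <- ToothGrowth # dataset that is built into R

# Share of observations per supplement type
supp_counts <- table(data_tooth$supp)
supp_share <- prop.table(supp_counts)
print(supp_share)

# Cross-tabulation: supplement type (rows) by dose (columns)
supp_by_dose <- table(data_tooth$supp, data_tooth$dose)
print(supp_by_dose)
```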

4.2 Correlation table

For a quick correlation table, you can use the cor() function.

cor(data_tooth)
## Error in cor(data_tooth): 'x' must be numeric

cor() only works with numeric variables, and supp is a factor, so let’s select the numeric columns first.

cor(data_tooth[, c("dose", "len")])
##           dose       len
## dose 1.0000000 0.8026913
## len  0.8026913 1.0000000
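
Instead of typing the column names, the numeric columns can be picked out programmatically. This is one possible sketch, not the only way to do it; the object names are illustrative. It also shows the method argument of cor(), which defaults to Pearson correlations.

```r
data_tooth <- ToothGrowth # dataset that is built into R

# Flag the numeric columns, whatever they are called
is_num <- vapply(data_tooth, is.numeric, logical(1))

# Pearson correlations (the default) between all numeric columns
cor_pearson <- cor(data_tooth[, is_num])
print(round(cor_pearson, 2))

# Rank-based Spearman correlations as an alternative
cor_spearman <- cor(data_tooth[, is_num], method = "spearman")
print(round(cor_spearman, 2))
```

The columns appear in their original order here (len before dose), but the Pearson values are the same as in the table above.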

4.3 Further reading

For more information on the skimr package, see https://github.com/ropenscilabs/skimr.

4.3.1 References

McNamara, Amelia, Eduardo Arino de la Rubia, Hao Zhu, Shannon Ellis, and Michael Quinn. 2018. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.