Chapter 4 Describe data quickly
4.1 Summarising the variables in the dataset
We already touched upon how to create some of our own summary measures for particular variables. Often, when we start analysing, we would like to get an overall grasp of our data. The functions str() and summary() are useful for this. Let’s see what they do:
data_tooth <- ToothGrowth # dataset that is built into R
str(data_tooth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(data_tooth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
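Both functions also work on a single column, which helps when you only care about one variable. The following is a minimal sketch in base R; the object names len_summary and supp_counts are made up for illustration.

```r
data_tooth <- ToothGrowth  # built-in dataset, as above

# summary() on one numeric column returns a named vector of statistics
len_summary <- summary(data_tooth$len)
len_summary

# Individual statistics can be pulled out by name
len_summary[["Median"]]

# For a factor, table() gives the same counts that summary() reports
supp_counts <- table(data_tooth$supp)
supp_counts
```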
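The skimr package (McNamara et al. 2018) gives the same kind of overview with more detail: for each variable it also reports the number of missing values, the completion rate and, for numeric variables, the standard deviation and a small histogram.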
install.packages("skimr") # only needed once
library(skimr)
skim(data_tooth)
Name | data_tooth |
Number of rows | 60 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
supp | 0 | 1 | FALSE | 2 | OJ: 30, VC: 30 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
len | 0 | 1 | 18.81 | 7.65 | 4.2 | 13.07 | 19.25 | 25.27 | 33.9 | ▅▃▅▇▂ |
dose | 0 | 1 | 1.17 | 0.63 | 0.5 | 0.50 | 1.00 | 2.00 | 2.0 | ▇▇▁▁▇ |
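The "Group variables" row in the output above is "None" because the data frame was not grouped. As a hedged sketch, assuming both skimr and dplyr are installed, skim() can also be restricted to selected columns or applied to a grouped data frame. The names len_only and by_supp are illustrative only.

```r
library(skimr)
library(dplyr)  # assumption: dplyr is installed; used for group_by() and %>%

data_tooth <- ToothGrowth

# Skim a single column by naming it after the data frame
len_only <- skim(data_tooth, len)
len_only

# Skim each level of supp separately by grouping first
by_supp <- data_tooth %>%
  group_by(supp) %>%
  skim()
by_supp
```

In the grouped version, the statistics are reported separately for OJ and VC.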
4.2 Correlation table
For a quick correlation table, you can use the cor() function.
cor(data_tooth)
## Error in cor(data_tooth): 'x' must be numeric
The call fails because cor() only accepts numeric variables and supp is a factor, so let’s select the numeric columns first.
cor(data_tooth[, c("dose", "len")])
## dose len
## dose 1.0000000 0.8026913
## len 0.8026913 1.0000000
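Typing the column names by hand becomes tedious with wider datasets. One possible approach, sketched here in base R, is to detect the numeric columns programmatically. The method argument of cor() also allows a rank-based coefficient, which may suit a variable like dose that takes only a few distinct values. The object names below are illustrative.

```r
data_tooth <- ToothGrowth

# TRUE for every numeric column, FALSE otherwise (here: supp is FALSE)
numeric_cols <- vapply(data_tooth, is.numeric, logical(1))

# Pearson correlation (the default) on the numeric columns only
cor_pearson <- cor(data_tooth[, numeric_cols])
round(cor_pearson, 2)

# Spearman rank correlation as an alternative
cor_spearman <- cor(data_tooth[, numeric_cols], method = "spearman")
round(cor_spearman, 2)
```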
4.3 Further reading
For more information on the skimr package, see https://github.com/ropenscilabs/skimr.
4.3.1 References
McNamara, Amelia, Eduardo Arino de la Rubia, Hao Zhu, Shannon Ellis, and Michael Quinn. 2018. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.