Chapter 4 Describe data quickly

4.1 Summarising the variables in the dataset

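We have already touched on how to create our own summary measures for particular variables. When starting an analysis, though, we often want a quick overview of the whole dataset first. Two base R functions are useful here: str() shows the structure of the data (the number of observations, plus the type and first few values of each variable), while summary() reports summary statistics for each variable (quartiles and the mean for numeric variables, counts per level for factors). Let’s see what they do: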

data_tooth <- ToothGrowth # dataset that is built-in to R
str(data_tooth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(data_tooth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
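
To connect these overall summaries back to the custom summary measures mentioned at the start of this section, here is a minimal sketch that uses base R's aggregate() to compute the mean tooth length per supplement type. The name mean_by_supp is only an illustrative choice.

```r
data_tooth <- ToothGrowth # dataset that is built-in to R

# summary() also works on a single variable
summary(data_tooth$len)

# Mean tooth length for each supplement type (OJ and VC)
mean_by_supp <- aggregate(len ~ supp, data = data_tooth, FUN = mean)
mean_by_supp
```

Swapping FUN = mean for another function such as median or sd gives other per-group measures in the same way.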

The skimr package (McNamara et al. 2018) does the same job but gives a richer overview: for each variable it also reports the number of missing values, the standard deviation, and a small inline histogram.

install.packages("skimr") # only needed once

library(skimr)
skim(data_tooth)

Table 4.1: Data summary
Name	data_tooth
Number of rows	60
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
supp	0	1	FALSE	2	OJ: 30, VC: 30

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
len	0	1	18.81	7.65	4.2	13.07	19.25	25.27	33.9	▅▃▅▇▂
dose	0	1	1.17	0.63	0.5	0.50	1.00	2.00	2.0	▇▇▁▁▇
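
skim() can also be restricted to particular variables or applied to grouped data. The sketch below assumes that skimr and dplyr are both installed (dplyr is used only for group_by()); the object names len_only and by_supp are illustrative.

```r
library(skimr)

data_tooth <- ToothGrowth # dataset that is built-in to R

# Summarise only the len variable
len_only <- skim(data_tooth, len)
len_only

# Summarise every remaining variable separately for each supplement type
by_supp <- skim(dplyr::group_by(data_tooth, supp))
by_supp
```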

4.1.1 Summarising categorical variables

If you quickly want more information on a categorical variable, you can use the table() function on that variable to count the observations in each category. For instance:

table(data_tooth$supp) # table(data_tooth["supp"]) gives the same counts

## 
## OJ VC 
## 30 30
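
Counts can be turned into proportions with prop.table(), and passing two variables to table() gives a cross-tabulation. A short sketch (the names supp_counts and supp_by_dose are illustrative):

```r
data_tooth <- ToothGrowth # dataset that is built-in to R

# Share of observations in each supplement category
supp_counts <- table(data_tooth$supp)
prop.table(supp_counts)

# Cross-tabulation: supplement type (rows) by dose (columns)
supp_by_dose <- table(data_tooth$supp, data_tooth$dose)
supp_by_dose
```

In ToothGrowth every combination of supplement type and dose contains the same number of observations, so the cross-tabulation is a quick way to confirm that the design is balanced.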

4.2 Correlation table

For a quick correlation table, you can use the cor() function, which computes Pearson correlations by default.

cor(data_tooth)

## Error in cor(data_tooth): 'x' must be numeric

cor() only works with numeric variables, and supp is a factor, so let’s select the numeric columns first.

cor(data_tooth[, c("dose", "len")])

##           dose       len
## dose 1.0000000 0.8026913
## len  0.8026913 1.0000000
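
Rather than typing the column names by hand, the numeric columns can be picked out programmatically. A minimal sketch in base R, assuming you want every numeric column included (is_num and cor_matrix are illustrative names):

```r
data_tooth <- ToothGrowth # dataset that is built-in to R

# TRUE for each column that holds numeric data
is_num <- sapply(data_tooth, is.numeric)

# Correlation matrix of the numeric columns, rounded to two decimals
cor_matrix <- round(cor(data_tooth[, is_num]), 2)
cor_matrix
```

This approach scales better to datasets with many variables, where listing the numeric columns manually would be tedious and error-prone.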

4.3 Further reading

For more information on the skimr package, see https://github.com/ropenscilabs/skimr.

4.3.1 References

McNamara, Amelia, Eduardo Arino de la Rubia, Hao Zhu, Shannon Ellis, and Michael Quinn. 2018. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.