# Chapter 4 Describe data quickly

## 4.1 Summarising the variables in the dataset

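We have already seen how to compute our own summary measures for particular variables. When we start analysing, though, we often want an overall picture of the data first. Two base R functions are useful here: `str()` shows the structure of the dataset (the type and first few values of each variable), and `summary()` gives descriptive statistics for each variable. Let’s see what they do: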

``````data_tooth <- ToothGrowth # dataset that is built-in to R
str(data_tooth)``````
``````## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...``````
``summary(data_tooth)``
``````##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000``````
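
Both functions also work on a single column, which helps when the full printout is more than you need. The sketch below summarises `len` on its own and then uses base R's `aggregate()` to compute the mean length per supplement type. The object names `len_summary` and `mean_by_supp` are only illustrative.

```r
data_tooth <- ToothGrowth  # built-in dataset, as above

# summary() on one numeric column returns the six-number summary for it
len_summary <- summary(data_tooth$len)
len_summary

# Mean tooth length per supplement type (formula interface of aggregate())
mean_by_supp <- aggregate(len ~ supp, data = data_tooth, FUN = mean)
mean_by_supp
```

Swapping `mean` for another function, such as `median` or `sd`, gives the corresponding per-group statistic.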

The `skimr` package (McNamara et al. 2018) does the same job but goes a bit further: it groups variables by type and also reports missing values, standard deviations, and a small inline histogram for each numeric variable.

``install.packages("skimr")``
``````library(skimr)
skim(data_tooth)``````
| | |
|---|---|
| Name | data_tooth |
| Number of rows | 60 |
| Number of columns | 3 |
| Column type frequency: | |
| factor | 1 |
| numeric | 2 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| supp | 0 | 1 | FALSE | 2 | OJ: 30, VC: 30 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| len | 0 | 1 | 18.81 | 7.65 | 4.2 | 13.07 | 19.25 | 25.27 | 33.9 | ▅▃▅▇▂ |
| dose | 0 | 1 | 1.17 | 0.63 | 0.5 | 0.50 | 1.00 | 2.00 | 2.0 | ▇▇▁▁▇ |

### 4.1.1 Summarising categorical variables

If you quickly want to see how many observations fall into each level of a categorical variable, you can pass that variable to the `table()` function. For instance:

``table(data_tooth$supp) # table(data_tooth["supp"]) gives the same counts``
``````##
## OJ VC
## 30 30``````
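
Counts are often easier to read as proportions, and `table()` also accepts two variables at once to build a cross-tabulation. The following is a minimal sketch using base R only. The object names `supp_counts`, `supp_props` and `supp_by_dose` are illustrative.

```r
data_tooth <- ToothGrowth  # built-in dataset, as above

# Counts per supplement type
supp_counts <- table(data_tooth$supp)

# Convert the counts to proportions (each count divided by the total)
supp_props <- prop.table(supp_counts)
supp_props

# Cross-tabulate supplement type (rows) against dose (columns)
supp_by_dose <- table(data_tooth$supp, data_tooth$dose)
supp_by_dose
```

Because the experiment is balanced, every cell of the cross-tabulation holds the same number of observations.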

## 4.2 Correlation table

For a quick correlation table, you can use the `cor()` function.

``cor(data_tooth)``
``## Error in cor(data_tooth): 'x' must be numeric``

`cor()` only works with numeric variables, and `supp` is a factor, so let’s select the numeric columns first.

``cor(data_tooth[, c("dose", "len")])``
``````##           dose       len
## dose 1.0000000 0.8026913
## len  0.8026913 1.0000000``````
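
With more than a handful of columns, typing the numeric ones by name becomes tedious. One possible approach, sketched below, is to detect them with `is.numeric()` and pass only those to `cor()`. The object names `is_num` and `cor_mat` are illustrative.

```r
data_tooth <- ToothGrowth  # built-in dataset, as above

# Logical vector: TRUE for every numeric column, FALSE otherwise
is_num <- vapply(data_tooth, is.numeric, logical(1))

# Correlation matrix of the numeric columns only
cor_mat <- cor(data_tooth[, is_num])

# Round for a more compact printout
round(cor_mat, 2)
```

The columns keep their order from the data frame, so here `len` comes before `dose` in the resulting matrix.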

For more information on the `skimr` package, see https://github.com/ropenscilabs/skimr.