Chapter 5 Distribution of a single variable
A typical first step when analyzing data is to check the distributions of the variables, because they play a central role in deciding which analysis strategy to follow. We’ll look at some ways to do this, and along the way we’ll change some elements of the graphs that we get.
Two popular ways of showing a distribution are histograms and density plots; both give good ideas about the shape of the distribution.
5.1 Histogram
Making histograms is rather straightforward in ggplot, because there is a separate geom for it, namely geom_histogram. Let’s make a histogram of the mileage per gallon of fuel for the cars in the mpg dataset.
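The call itself is not shown in this text. A minimal sketch of what it might look like, assuming the variable is cty (city miles per gallon), which is the variable used in the code further on in this chapter:

```r
library(ggplot2)

# Histogram of city miles per gallon (cty) for the cars in mpg.
# No binwidth is given, so ggplot falls back on its default of 30 bins.
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram()
p_hist
```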
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We have a histogram!
What does the histogram tell us?
Although the graph is fine, R tells us: “stat_bin() using bins = 30. Pick better value with binwidth”. To understand this a bit better, we must realize that a histogram counts how often the values of the variable fall within particular ranges, the so-called “bins”. For instance, if we chose a binwidth of 5, the first bin could cover the values 0-4, the second bin 5-9, et cetera. Choosing a binwidth that is too large or too small will result in somewhat funky histograms, as we’ll see below.
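To make the idea of bins concrete, we can do the counting by hand. This is only an illustration: the bin edges (5, 10, ..., 35) are an assumption chosen to cover the values of cty, and they are not necessarily the edges that ggplot would pick.

```r
library(ggplot2)  # only needed here for the mpg dataset

# Cut the city mileage into bins of width 5 and count the cars per bin.
# The edges are illustrative; ggplot may place its bins differently.
bin_edges <- seq(5, 35, by = 5)
bin_counts <- table(cut(mpg$cty, breaks = bin_edges))
bin_counts
```

Every car ends up in exactly one bin, and the height of each bar in a histogram is simply such a count.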
Now let’s try and choose a different binwidth; R already gives us a hint as to how: by specifying binwidth:
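The value used for this plot is not shown in the text; the sketch below assumes a binwidth of 1, so that every whole number of miles per gallon gets its own bar:

```r
library(ggplot2)

# Same histogram, but now we choose the width of the bins ourselves
p_bw1 <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1)
p_bw1
```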
This graph resembles the previous one, but isn’t identical!
Let’s try other binwidths:
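For example, a rather large binwidth. The value 5 is an assumption for illustration, not necessarily the value behind the original plot:

```r
library(ggplot2)

# A large binwidth lumps many different values together in a few wide bars
p_wide <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 5)
p_wide
```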
And another one:
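This time a very small binwidth; again, the value 0.1 is only an assumption for illustration:

```r
library(ggplot2)

# A very small binwidth creates many narrow bins, most of which stay empty
p_narrow <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 0.1)
p_narrow
```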
What’s going on here?
Create a histogram of the variable ‘displ’ in the mpg-dataset [‘displ’ refers to the engine displacement, in litres]. What is a sensible binwidth?
Create a histogram of the variable ‘year’ in the mpg-dataset [‘year’ refers to the year that the model came out]. What can you conclude from this graph?
Now, what happens when we specify a y-variable?
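A sketch of such an attempt, assuming hwy (highway miles per gallon) as the y-variable; the text does not say which variable was used. The plot is built inside tryCatch() so that we can look at the problem without stopping the script:

```r
library(ggplot2)

# A histogram with both an x- and a y-variable (hwy is an assumed choice)
p_xy <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_histogram(binwidth = 1)

# Building the plot is where ggplot runs stat_bin(); catch what goes wrong
result <- tryCatch(ggplot_build(p_xy), error = function(e) e)
inherits(result, "error")
```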
R gives an error. Maybe this is not very surprising: what is the y-variable supposed to do? A histogram is a rather specific type of graph, in which the y-axis shows the number (or percentage) of occurrences of the values of the variable chosen as x-variable. So the y is already defined (more or less), and our specification doesn’t work. The function geom_histogram() does some of the computing for us with the function stat_bin(), which apparently cannot be used with a y-variable.
And what happens if we try to make a histogram of a non-continuous variable (in this case the model of the car)?
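A minimal sketch of this attempt. As before, the plot is built inside tryCatch(), which is our own addition to keep the script running:

```r
library(ggplot2)

# model is a character variable, so it is discrete rather than continuous
p_model <- ggplot(mpg, aes(x = model)) +
  geom_histogram()

# The error only appears once the plot is actually built (or printed)
result_model <- tryCatch(ggplot_build(p_model), error = function(e) e)
inherits(result_model, "error")
```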
## Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
Clearly the geom_histogram() function doesn’t like it when the x variable is not continuous, and no plot is produced. The error message gives us a suggestion for how to resolve this, but we won’t go into that. If you want help in R, you can type ?geom_histogram, which will give you some information. I find the information provided by R rather difficult to read, so I’ll typically use Google. The ‘official’ ggplot-pages are incredibly helpful: http://ggplot2.tidyverse.org/reference/geom_histogram.html.
5.1.1 Tweaking the histogram
Now let’s quickly alter some of the features in the graph. We’ll learn more about this in later chapters, but it’s good to get an idea of what can be done.
We can easily change the colour of the bars. Let’s make them orange, because why not:
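A sketch of how this could be done, assuming a binwidth of 1 as in the code further on in this section:

```r
library(ggplot2)

# colour is set outside aes(), so it applies to all bars alike
p_orange <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1, colour = "orange")
p_orange
```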
It worked! Sort of. We’ve coloured the borders of the bars, but the filling within is still dark grey. So let’s change the filling as well (with the rather intuitive fill =), and choose the colour blue:
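Again a sketch under the same assumption (binwidth of 1), keeping the orange borders from before:

```r
library(ggplot2)

# colour controls the border of the bars, fill controls their inside
p_blue <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1, colour = "orange", fill = "blue")
p_blue
```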
What an ugly histogram!
If we want no fill, we have to set the fill to NA (which is R’s value for missing data; NA stands for Not Available):
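In code this could look as follows; the orange border and the binwidth of 1 are assumptions, kept so that the empty bars remain visible:

```r
library(ggplot2)

# fill = NA leaves the inside of the bars empty; only the borders are drawn
p_nofill <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1, colour = "orange", fill = NA)
p_nofill
```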
We can also remove the border in a similar way:
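A sketch, assuming we keep a blue fill so that the bars themselves stay visible:

```r
library(ggplot2)

# colour = NA removes the border; the fill is all that is left of the bars
p_noborder <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1, colour = NA, fill = "blue")
p_noborder
```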
The ggplot-cheatsheet tells us some of the other appearance-features we can use in a histogram: “alpha, color, fill, linetype, size” (“weight” can be used to weigh the cases). Let’s try them all:
ggplot(mpg, aes(x = cty)) +
  geom_histogram(
    binwidth = 1,
    colour = "purple",   # Make the borders purple
    fill = "yellow",     # Make the fill of the bars yellow
    alpha = 0.5,         # Make the fill see through 50%
    linetype = "dashed", # Make the borders dashed
    size = 0.5           # Make the borders thinner (ggplot2 >= 3.4.0 prefers linewidth)
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
5.2 Density plot
Another way of plotting the distribution of a single variable is the density plot. This is similar to a histogram, except that the distribution is drawn with a smoothing function:
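A minimal sketch, again assuming cty as the variable:

```r
library(ggplot2)

# geom_density() draws a smoothed estimate of the distribution of cty
p_dens <- ggplot(mpg, aes(x = cty)) +
  geom_density()
p_dens
```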
Not quite as informative as the histogram, but density plots can be handy when comparing distributions (as we’ll see shortly).
What happens if we add binwidth = 1 to the density plot?
Density plots depend on a smoothing parameter that determines how strongly the curve is smoothed through the points. Compare the following graphs:
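The graphs themselves are not reproduced here. One way to make such a comparison, assuming the parameter meant is the adjust argument of geom_density() (a multiplier on the default bandwidth), with the values 0.5 and 2 chosen purely for illustration:

```r
library(ggplot2)

# adjust < 1 gives a wigglier curve that follows the data more closely
p_wiggly <- ggplot(mpg, aes(x = cty)) +
  geom_density(adjust = 0.5)

# adjust > 1 gives a smoother, flatter curve
p_smooth <- ggplot(mpg, aes(x = cty)) +
  geom_density(adjust = 2)

p_wiggly
p_smooth
```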
5.2.1 Tweaking the density plot
We can tweak some of the features in ways that are very similar to the histogram.
The ggplot-cheatsheet tells us some of the other appearance-features we can use with a density plot: “alpha, color, fill, linetype, size” (“weight” can be used to weigh the cases). Let’s try them all:
ggplot(mpg, aes(x = cty)) +
  geom_density(
    colour = "purple",   # Make the line purple
    fill = "yellow",     # Make the area under the curve yellow
    alpha = 0.5,         # Make the fill see through 50%
    linetype = "dashed", # Make the line dashed
    size = 0.5           # Make the line thinner (ggplot2 >= 3.4.0 prefers linewidth)
  )
Create a new density plot of the variable cty; remove the fill-colour, and choose dotted as linetype; change the size to any value you’d like. [is alpha still sensible?]
For more info, see http://ggplot2.tidyverse.org/reference/geom_density.html. We’ll also see more density plots when we address the (beautiful) violin plots.
5.3 The frequency polygon
Two other ways of visualizing a distribution are the frequency polygon and the dotplot. Both are used less often than histograms and density plots, but they have their uses.
To get a frequency polygon, use:
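A minimal sketch with geom_freqpoly(), assuming cty as the variable and a binwidth of 1:

```r
library(ggplot2)

# geom_freqpoly() bins the data like a histogram, but connects the counts with a line
p_freq <- ggplot(mpg, aes(x = cty)) +
  geom_freqpoly(binwidth = 1)
p_freq
```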
The frequency polygon is very similar to the histogram: the information in the graph is similar, and the calculations are identical and need a binwidth.
When do you think a frequency polygon is preferred over histograms and density plots?
For more information on frequency polygons, visit http://ggplot2.tidyverse.org/reference/geom_histogram.html.
5.4 The dotplot
To get a dotplot, use:
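A minimal sketch with geom_dotplot(), assuming cty as the variable and the binwidth of 0.5 that is also used in the code further on in this section:

```r
library(ggplot2)

# Each car is drawn as one dot; dots in the same bin are stacked upwards
p_dot <- ggplot(mpg, aes(x = cty)) +
  geom_dotplot(binwidth = 0.5)
p_dot
```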
Every dot is a case, and dots within a ‘bin’ are stacked on top of each other. In this form, the dotplot is very similar to a histogram. An alternative way of stacking the dots is:
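A sketch that assumes the alternative meant is stackdir = "center", the option that also appears in the code further on in this section:

```r
library(ggplot2)

# stackdir = "center" stacks the dots symmetrically around a middle line
p_dot_center <- ggplot(mpg, aes(x = cty)) +
  geom_dotplot(binwidth = 0.5, stackdir = "center")
p_dot_center
```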
The y-axis is not very helpful in either case (because it depends on the binning algorithm and on some features that we can pass to the geom_dotplot function), so we can suppress it if we want to:
ggplot(mpg, aes(x = cty)) +
  geom_dotplot(binwidth = 0.5, stackdir = "center") +
  scale_y_continuous(name = NULL, breaks = NULL)
For more information on dotplots, see http://ggplot2.tidyverse.org/reference/geom_dotplot.html
Add a different theme to the plot you made last. See here