Chapter 10 Comparing distributions
Visualizing the distribution of our variables is an important first step in exploring and analyzing data. Often, a next step would be to compare the distributions of two or more groups. First we’ll learn how to do that with histograms and density plots that we have already learned about. We’ll also learn some novel ways that are often a bit more informative.
10.1 Histogram
We can compare two or more distributions by ‘mapping’ a grouping variable to colours. For this, we have to specify the fill or colour within the aes(). Let’s see whether the fuel efficiency depends on whether the car is a front-, rear-, or 4-wheel drive (recorded in the drv variable).
ggplot(mpg, aes(x=cty, colour=drv)) + geom_histogram(binwidth=1)
Hmm, if we try hard, we can see that red (representing 4-wheel drives) sits more to the left of the histogram (less fuel-efficient) and that green (representing front-wheel drives) sits more to the right (more fuel-efficient), but this graph is hardly ideal. It would be clearer if the bars themselves were filled with the group colour, rather than only outlined, so let’s try to improve it:
ggplot(mpg, aes(x=cty, colour=drv, fill=drv)) + geom_histogram(binwidth=1)
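One thing to keep in mind when reading the filled histogram: by default, geom_histogram() stacks the bars of the groups on top of each other, so only the bottom group starts at zero. Two possible alternatives are sketched below, assuming we are happy to either overlay semi-transparent bars or give each group its own panel. The object names p_overlay and p_facets are only illustrative.

```r
library(ggplot2)

# Overlay the three histograms instead of stacking them
# (position = "identity" draws every group from zero; alpha makes overlaps visible)
p_overlay <- ggplot(mpg, aes(x = cty, fill = drv)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5)

# Or draw one panel per drive type, stacked vertically so the x-axes line up
p_facets <- ggplot(mpg, aes(x = cty, fill = drv)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~drv, ncol = 1)

p_overlay
p_facets
```

Which of the two reads better depends on how much the groups overlap; with three groups, the panels are often easier on the eye.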
10.1.1 Frequency polygon?
Is this a case where the frequency polygon (which represents the same information as a histogram) works a bit better?
ggplot(mpg, aes(x=cty, colour=drv, fill=drv)) + geom_freqpoly(binwidth=1)
Note that the frequency polygon has no “fill”, so the code below is more appropriate for the geom_freqpoly() function; it produces an identical graph.
ggplot(mpg, aes(x=cty, colour=drv)) + geom_freqpoly(binwidth=1)
How is the frequency polygon different to the histogram in this case?
10.1.2 Mapping versus setting colour
We have now seen two uses of ‘colour’ within the ggplot code: setting and mapping. It’s important to learn the distinction between the two.
We can ‘set’ the colour of a particular feature in our graph by specifying a particular colour:
ggplot(mpg, aes(x=cty)) + geom_histogram(binwidth=1, fill="orange")
Or we can ‘map’ values of a variable to colours. This needs to be done with the aes():
ggplot(mpg, aes(x=cty, fill=drv)) + geom_histogram(binwidth=1)
The distinction is rather evident.
What happens when we ‘set’ a colour using a variable?
ggplot(mpg, aes(x=cty)) + geom_histogram(binwidth=1, fill=drv)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomBar, : object 'drv' not found
This doesn’t work because, when we set a feature outside aes(), R expects an actual value such as the name of a colour, not a variable from the dataset. R doesn’t even know what ‘drv’ is, because it is not looking inside the dataset.
What happens when we ‘map’ the aesthetics to a colour-value?
ggplot(mpg, aes(x=cty, fill="orange")) + geom_histogram(binwidth=1)
Woah, that is quite unexpected.
What’s going on here?
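One way to investigate is to look at the values ggplot2 actually uses when drawing. The sketch below assumes the layer_data() helper from ggplot2, which returns the computed data for a layer; the object names p_mapped and p_set are only illustrative.

```r
library(ggplot2)

# "orange" inside aes() is treated as data: a variable with a single value
p_mapped <- ggplot(mpg, aes(x = cty, fill = "orange")) +
  geom_histogram(binwidth = 1)

# "orange" outside aes() is used literally as a colour name
p_set <- ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 1, fill = "orange")

# Compare the fill values that end up in the drawn layer
unique(layer_data(p_mapped)$fill)
unique(layer_data(p_set)$fill)
```

Comparing the two results, together with the legend that appears in the mapped version, should give a strong hint about what the colour scale is doing.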
10.2 Density plots
Histograms can be used to compare distributions, but they are not always ideal. Sometimes, density plots are more informative:
ggplot(mpg, aes(x=cty, colour=drv)) + geom_density()
This looks a bit like our frequency polygon.
What’s the difference between the frequency polygon and this density plot?
Let’s try to improve a bit:
ggplot(mpg, aes(x=cty, colour=drv, fill=drv)) + geom_density()
This looks quite pretty! But too bad that the red distribution is almost completely covered by the others! Perhaps we can make the colours a bit see-through:
ggplot(mpg, aes(x=cty, colour=drv, fill=drv)) + geom_density(alpha=0.5)
Not bad! The plot suggests that front-wheel drives tend to be more fuel-efficient in the city than both rear- and 4-wheel drives.
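One caveat when reading density plots: every curve is scaled to an area of one, so differences in group size disappear from view. The sketch below first counts the cars per drive type and then, as a possible like-for-like comparison, draws the frequency polygon on the same density scale using after_stat(density). The object names n_by_drv and p_freq_density are only illustrative.

```r
library(ggplot2)

# Number of cars per drive type; the density plot does not show these
n_by_drv <- table(mpg$drv)
n_by_drv

# Frequency polygon with density (instead of count) on the y-axis
p_freq_density <- ggplot(mpg, aes(x = cty, colour = drv)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 1)

p_freq_density
```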
10.3 Violin plots
A particularly neat way of visualizing a comparison of distributions is to use “violin plots”. These are essentially density plots placed next to one another (rather than overlapping). Note that we are now specifying an x-variable and a y-variable:
ggplot(mpg, aes(x=drv, y=cty)) + geom_violin()
This gives us a rather quick impression of the different distributions.
If we want the different distributions to have different colours (similar to the density plots), then we can similarly ‘map’ the fill to the variable “drv”:
ggplot(mpg, aes(x=drv, y=cty, fill=drv)) + geom_violin()
It looks a bit prettier now. There are distinct views on whether this is an ‘appropriate’ thing to do. On the one hand, some argue that the colour adds zero new information: the type of drive is already shown on the x-axis and need not be encoded a second time by colour. Such people would think the colours only add distraction. On the other hand, there are those who think redundancy is a good thing. The colours might also be more attractive, so that the viewer shows more interest in, and pays more attention to, the graph. I think both sides have a point. Perhaps for scientific publications, stick to the more basic version without the distraction; for presentations, you can consider pretty colours that prevent people from nodding off.
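If we do keep the colours, the legend repeats what the x-axis already tells us. A minimal sketch, assuming we simply want to drop that legend with the show.legend argument (the object name p_violin is only illustrative):

```r
library(ggplot2)

# Coloured violins, but without the redundant legend
p_violin <- ggplot(mpg, aes(x = drv, y = cty, fill = drv)) +
  geom_violin(show.legend = FALSE)

p_violin
```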
10.4 Scatter plots
An underused but particularly useful way of comparing distributions is by use of a scatterplot:
ggplot(mpg, aes(x=drv, y=cty)) + geom_point()
A somewhat problematic feature of this graph is that there are many more datapoints in the dataset than we can see in the graph. This is because many datapoints have identical values and are drawn on top of each other. Two ways of resolving this are to jitter the points (add a bit of random noise, so that the datapoints deviate slightly from their true position) or to make the size of each point depend on how often that value occurs.
10.4.1 Jitter
ggplot(mpg, aes(x=drv, y=cty)) + geom_jitter()
What is a disadvantage of a jitter-plot?
If we use geom_jitter() without any specification, the function will jitter the datapoints in both the horizontal and the vertical dimension. This is not ideal because, in this case, randomly shifting the points in the vertical dimension means that we can no longer read the raw data from the graph. So let’s try to jitter only in the horizontal direction:
ggplot(mpg, aes(x=drv, y=cty)) + geom_jitter(width = 0.4, height = 0)
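Because the noise is random, the jittered plot looks slightly different every time the code is run. If a reproducible figure is wanted, one option is to pass a seed to position_jitter(), as in the sketch below. The seed value 1 is arbitrary and the object name p_jitter is only illustrative.

```r
library(ggplot2)

# Horizontal jitter only, with a fixed seed so the plot is identical on every run
p_jitter <- ggplot(mpg, aes(x = drv, y = cty)) +
  geom_point(position = position_jitter(width = 0.4, height = 0, seed = 1))

p_jitter
```

With height = 0 the y-values stay exactly as they are in the data; only the horizontal position within each drive type is randomized.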
10.4.2 Bubble plots
ggplot(mpg, aes(x=drv, y=cty)) + geom_count()
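In this plot, geom_count() counts how many observations share each combination of drv and cty and maps that count to the size of the point. The sketch below inspects those counts with layer_data() and adds scale_size_area(), which makes the area of a point proportional to the count. The object name p_count is only illustrative.

```r
library(ggplot2)

p_count <- ggplot(mpg, aes(x = drv, y = cty)) +
  geom_count() +
  scale_size_area()  # point area proportional to the number of observations

# One row per drv/cty combination; column n holds the number of cars
head(layer_data(p_count)[, c("x", "y", "n")])

p_count
```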
10.5 Boxplots
A different way of comparing distributions is through boxplots. In contrast to the above graphs, boxplots do not display the actual distributions, but several summary statistics of them (e.g., median, interquartile range, outliers). Still, boxplots are incredibly useful for getting a quick view of the differences between the groups, where the bulk of the data lies, and whether there are outliers:
ggplot(mpg, aes(x=drv, y=cty)) + geom_boxplot()
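To connect the boxes to actual numbers, the quartiles can be computed per group by hand. A minimal sketch in base R, assuming the default quantile() method is close enough to what the plot shows (the object name quartiles_by_drv is only illustrative):

```r
library(ggplot2)  # only needed here for the mpg dataset

# Minimum, lower quartile, median, upper quartile and maximum of cty per drive type
quartiles_by_drv <- sapply(
  split(mpg$cty, mpg$drv),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1)
)

quartiles_by_drv
```

The middle row is the median (the thick line in each box) and the second and fourth rows are the edges of the box. Note that the whiskers in the plot do not necessarily reach the minimum and maximum, because points far from the box are drawn separately as outliers.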
10.6 Comparing distributions 2.0
My strong preference when it comes to graphs is to show the raw data in addition to some sort of summary based on the data. This almost always involves showing the raw datapoints with geom_point() (or geom_jitter() or geom_count() when points are overlapping). Two examples:
ggplot(mpg, aes(x=drv, y=cty)) + geom_violin() + geom_count(alpha=0.5)
ggplot(mpg, aes(x=drv, y=cty)) + geom_boxplot() + geom_jitter(colour="lightblue")
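In the boxplot example, the outliers are drawn twice: once by geom_boxplot() at their true position and once more by geom_jitter() at a shifted position. A possible variant, assuming we prefer every car to appear exactly once, hides the boxplot's own outlier points and jitters only horizontally. The width of 0.2 is an arbitrary choice and the object name p_box is only illustrative.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = drv, y = cty)) +
  geom_boxplot(outlier.shape = NA) +  # do not draw the outliers in the boxplot layer
  geom_jitter(width = 0.2, height = 0, colour = "lightblue")  # raw data, y-values untouched

p_box
```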
Why doesn’t the code below work? Also think about what is visualized in a histogram.
ggplot(mpg, aes(x=cty, fill=cty)) + geom_histogram() + geom_point()