Making beautiful boxplots using plotnine in Python

written in python, plotnine

For the past year and a half, I have been switching gradually from using matplotlib to create graphs in Python to Hassan Kibirige’s wonderful port of R’s ggplot2, plotnine. When I was first starting to use this package, I found it was quite tricky to find clear instructions on how to make customisations or build some of the more specialised charts. As such, Mauricio and I decided to create a series of tutorials on how to build and customise a range of charts in plotnine, going step-by-step from a basic plot to a highly customised graph.

This week I am sharing our tutorial on how to create and customise boxplots in plotnine. If you enjoyed this blog post and found it useful, please consider buying our book! It contains chapters detailing how to build and customise 11 other chart types, including bar charts, line charts, scatterplots, histograms, density plots and regression plots (including all of the regression diagnostic plots available in R). Every purchase really helps us out with maintaining the content.

With that out of the way, let’s get on with creating our boxplot in plotnine! In this tutorial, we will work towards creating the line plot below. We will take you from a basic line plot and explain all the customisations we add to the code step-by-step.

The first step is to import all of the required packages. For this we needpandas and its DataFrame class to read in and manipulate our data, plotnine to get our data and create our graphs, and numpy to do some basic numeric calculations in our graphing.

import numpy as np
import pandas as pd

import plotnine

from plotnine import *
from plotnine import data
from pandas import DataFrame

We then need to load in the data, as below.

diamonds = data.diamonds

Basic ggplot structure

In order to initialise a boxplot we tell ggplot that diamonds is our data, and specify that our x-axis plots the cut variable and our y-axis plots the price variable. You may have noticed that we put our variables inside a method called aes. This is short for aesthetic mappings, and determines how the different variables you want to use will be mapped to parts of the graph. As you can see below, ggplot has mapped cut to the x-axis and price to the y-axis.

You might have also noticed that there is nothing in the plot. In order to render our data, we need to tell ggplot how we want to visually represent it.

p10 = ggplot(diamonds, aes("cut", "price"))
p10

Basic boxplot

We can do this using geoms. In the case of a boxplot, we use the geom_boxplot() geom.

p10 = ggplot(diamonds, aes("cut", "price")) + geom_boxplot()
p10

Customising axis labels

In order to change the axis labels, we have used the xlab and ylab options. In each, we add the desired name as an argument.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot()
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
)
p10

ggplot also allows for the use of multiline names (in both axes and titles). Here, we’ve changed the y-axis label so that it goes over two lines using the \n character to break the line.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot()
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
)
p10

Changing axis ticks

To change the x-axis tick marks, we can use the scale_x_continuous option. Similarly, to change the y-axis we can use the scale_y_continuous option. Here we will change the y-axis to every $2500 rather than the default of $5000. We can change the breaks using the breaks option, which takes a list of values as an argument. You can shortcut having to type in the whole list manually using numpy’s arange function which generates a sequence from your selected start, stop and step values respectively. Note that because of Python’s indexing, you need to set the stop argument to be one number more than your desired maximum.

Similarly, you can use the limits argument to define the minimum and maximum values of your axis. We’ve also included this in our scale_y_continuous option, increasing the maximum value to $20000.

p10 = (
    ggplot(diamonds, aes("cut", "price")) + geom_boxplot()
    + xlab("Diamond cut") + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), limits=[0, 20000])
)
p10

Adding a title

To add a title, we include the option ggtitle and include the name of the graph as a string argument.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot()
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
)
p10

Changing the colour of the boxes

To change the line and fill colours of the box plot, we add a valid colour to the colour and fill arguments in geom_boxplot(). plotnine uses the colour palette utilised by matplotlib, and the full set of named colours recognised by ggplot is here. Let’s try changing our box lines and fills to rebeccapurple and lightskyblue respectively.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="rebeccapurple", fill="lightskyblue")
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
)
p10

If you want to go beyond the options in the list above, you can also specify exact HEX colours by including them as a string preceded by a hash, e.g., "#FFFFFF". Below, we have called two shades of blue for the fill and lines using their HEX codes.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE")
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
)
p10

You can also specify the degree of transparency in the box fill area using the argument alpha in geom_boxplot(). This ranges from 0 to 1.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE", 
                   alpha=0.7)
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
)
p10

Finally, you can change the appearance of the outliers as well, using the arguments outlier.colour and outlier.shape in geom_boxplot to change the colour and shape respectively. The shape arguments for plotnine are the same as those available in matplotlib, and are therefore a little more limited than those in R’s implementation of ggplot2. Nonetheless, there is a good range of options. The allowed arguments are here. Here we will make the outliers small solid circles (using outlier.shape=".") and make them coloured steelblue (using outlier.colour="steelblue").

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(
        colour="#1F3552",
        fill="#4271AE",
        alpha=0.7,
        outlier_shape=".",
        outlier_colour="steelblue",
    )
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
)
p10

Using the white theme

As explained in the previous chapters, we can also change the overall look of the plot using themes. We’ll start using a simple theme customisation by adding theme_bw().

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(
        colour="#1F3552",
        fill="#4271AE",
        alpha=0.7,
        outlier_shape=".",
        outlier_colour="steelblue",
    )
    + xlab("Diamond cut")
    + ylab("Price of diamond\n(USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme_bw()
)
p10

Creating an XKCD style chart

Of course, you may want to create your own themes as well. ggplot allows for a very high degree of customisation, including allowing you to use imported fonts. plotnine already has a theme_xkcd() implementation, but we’ve instead created one from scratch to demonstrate how to use imported fonts and some of the other options in theme to tweak the overall look of the graph.

In order to create this chart, you first need to download the XKCD font, which Randall Munroe has kindly provided here. Once you have it, you can load it into Python using the matplotlib.font_manager class.

import matplotlib.font_manager as fm

fpath = "path/to/file/xkcd-Regular.otf"

As this is an imported font, we can’t change its size directly within the graph. Instead, we need to alter our imported font objects to change the size. As we want a different font size for the title and the body, we will create 2 different font objects, title_text and body_set.

We can then call methods on these objects (the list of available methods is here). For the title, we’ll change the font to size 18 and make it bold using the set_size() and set_weight methods. Similarly, we’ll change the body text to size 12.

# Create font objects
title_text = fm.FontProperties(fname=fpath)
body_text = fm.FontProperties(fname=fpath)

# Alter size and weight of font objects
title_text.set_size(18)
title_text.set_weight("bold")

body_text.set_size(12)

In order to get the plot to look more like the XKCD artstyle, we’ll make a few more changes. We can alter the values of axis_line_x and axis_line_y to change the thickness of the axis lines. We can also get rid of the boxes around the legend by setting the argument of legend_key to element_blank(). We can remove the grid line by changing the value of four parameters: panel_grid_major, panel_grid_minor, panel_border and panel_background. To use the XKCD font that we just imported, we need to change the values of both plot_title and text. Finally, to change the colour of the text to black (from its default grey), we change the values of axis_text_x and axis_text_y.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="black", fill="#56B4E9")
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        axis_line_x=element_line(size=2, colour="black"),
        axis_line_y=element_line(size=2, colour="black"),
        panel_grid_major=element_blank(),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(fontproperties=title_text),
        text=element_text(fontproperties=body_text),
        axis_text_x=element_text(colour="black"),
        axis_text_y=element_text(colour="black"),
    )
)
p10

Using the ‘Five Thirty Eight’ theme

There are a wider range of pre-built themes available as part of the ggplot package (more information on these here). Below we’ve applied theme_538(), which approximates graphs in the nice FiveThirtyEight website. As you can see, we’ve used the commercially available fonts ‘Atlas Grotesk’ and ‘Decima Mono Pro’ in axis_title, plot_title and text. This is just to make the plots exactly like those on the site, and is entirely optional.

agm = "path/to/file/AtlasGrotesk-Medium.otf"
agr = "path/to/file/AtlasGrotesk-Regular.otf"
dp = "path/to/file/DecimaMonoPro.otf"

# Create font objects
title_text = fm.FontProperties(fname=agm)
axis_text = fm.FontProperties(fname=agr)
body_text = fm.FontProperties(fname=dp)

# Alter size and weight of font objects
title_text.set_size(16)
axis_text.set_size(12)
body_text.set_size(10)
p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE")
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme_538()
    + theme(
        axis_title=element_text(fontproperties=axis_text),
        plot_title=element_text(fontproperties=title_text),
        text=element_text(fontproperties=body_text),
    )
)
p10

Creating your own theme

Now that we’ve explored some of the options available in plot customisation, we can now build our own completely customised graph. Changing the size and colour arguments of axis_line allows us to thicken the lines and change their colour to black. Similarly, changing the colour argument passed to panel_grid_major means that all of our major grid lines are now light grey. We removed the minor grid lines and background by changing the arguments of panel_grid_minor, panel_border and panel_background, and finally, we’ve changed the font using the standard font Tahoma.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE")
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=10),
        axis_text_y=element_text(colour="black", size=10),
    )
)
p10

Boxplot extras

An extra feature you can add to boxplots is to overlay all of the points for that group on each boxplot in order to get an idea of the sample size of the group. This can be achieved using by adding the geom_jitter() option. As diamonds is a large dataset, we’ll first take a small sample to illustrate this.

diamonds_sample = diamonds.sample(1000)
p10 = (
    ggplot(diamonds_sample, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE")
    + geom_jitter()
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=10),
        axis_text_y=element_text(colour="black", size=10),
    )
)
p10

We can see that the Fair group has a smaller sample than the other categories, indicating that it may not give as reliable information as the other cut types.

Another thing you can do with your boxplot is add a notch to the box where the median sits to give a clearer visual indication of how the data are distributed within the IQR. You achieve this by adding the argument notch=True to the geom_boxplot() geom.

p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE", 
                   notch=True)
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=10),
        axis_text_y=element_text(colour="black", size=10),
    )
)
p10

Grouping by another variable

You can also easily group boxplots by the levels of another variable. There are two options, in separate (panel) plots, or in the same plot.

We first need to do a little data wrangling. To create our grouping variable, we’ll median-split carat so that this is categorical, and made it into a new labelled factor variable called carat_c.

In order to produce a panel plot by this categorical carat variable, we add the facet_grid(".~carat_c") option to the plot. Note that unlike in R’s ggplot, you need to include the arguments in facet_grid in quote marks.

diamonds["carat_c"] = pd.qcut(
    diamonds["carat"], 2, labels=["Lower carat", "Higher carat"]
)
p10 = (
    ggplot(diamonds, aes("cut", "price"))
    + geom_boxplot(colour="#1F3552", fill="#4271AE")
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=8),
        axis_text_y=element_text(colour="black", size=10),
    )
    + facet_grid(". ~ carat_c")
)
p10

In order to plot the two carat levels in the same plot, we need to add a couple of things. Firstly, in the ggplot function, we add a fill=carat_c argument to aes. Secondly, we change the manual colours using the schemes from ColorBrewer. Here we have used the scale_fill_brewer option with the quantitative scale Accent. More information on using scale_colour_brewer is here.

p10 = (
    ggplot(diamonds, aes("cut", "price", fill="carat_c"))
    + geom_boxplot()
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        legend_direction="horizontal",
        legend_box_spacing=0.4,
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=10),
        axis_text_y=element_text(colour="black", size=10),
    )
    + scale_fill_brewer(type="qual", palette="Accent")
)
p10

Formatting the legend

Finally, we can format the legend. Firstly, we can change the position by adding the legend_position="bottom" argument to the theme option, which moves the legend under the plot. We can change the orientation of the legend to horizontal by then adding legend_direction="horizontal" to theme. We can also centre the legend by adding legend_title_align="center". We can adjust the legend position using legend_box_spacing=0.4. We can get rid of the grey background behind the legend keys using legend_key=element_blank(). Lastly, we can fix the title by adding the name="Diamond carat" argument to the scale_fill_brewer option.

p10 = (
    ggplot(diamonds, aes("cut", "price", fill="carat_c"))
    + geom_boxplot()
    + xlab("Diamond cut")
    + ylab("Price of diamond (USD)")
    + scale_y_continuous(breaks=np.arange(0, 20001, 2500), 
                         limits=[0, 20000])
    + ggtitle("Price of diamonds by cut")
    + theme(
        legend_position="bottom",
        legend_direction="horizontal",
        legend_title_align="center",
        legend_box_spacing=0.4,
        legend_key=element_blank(),
        axis_line=element_line(size=1, colour="black"),
        panel_grid_major=element_line(colour="#d3d3d3"),
        panel_grid_minor=element_blank(),
        panel_border=element_blank(),
        panel_background=element_blank(),
        plot_title=element_text(size=15, family="Tahoma", 
                                face="bold"),
        text=element_text(family="Tahoma", size=11),
        axis_text_x=element_text(colour="black", size=10),
        axis_text_y=element_text(colour="black", size=10),
    )
    + scale_fill_brewer(type="qual", palette="Accent", 
                        name="Diamond carat")
)
p10

And with that, we have recreated the plot at the beginning of this blog post! I hope you found this helped you get your feet wet with plotnine and see the potential you have to create plots as beautiful as those from ggplot.