# Two-group hypothesis testing: permutation tests


In the last blog post I described how you could test whether the difference between two groups was statistically significant using an independent-samples t-test. (I will rely heavily on that blog post in this one, so I encourage you to at least skim it before reading this.) I used the example that your company (a retail website selling children’s toys) had launched two advertising campaigns and wanted to see whether they brought in different amounts of revenue. I cheekily assumed that the amount spent per site visit was approximately normally distributed in the population. However, this is unlikely to be the case: you are much more likely to have a large number of visitors who buy nothing, a smaller number who spend a small to moderate amount, and a minority who spend a lot.

## What if my distributions are not normal?

(Image via Research Wahlberg)

In cases like this, a t-test may not be appropriate, so what can we do? We can instead rely on non-parametric methods. I will talk about one example, permutation tests, in this blog post.

So how do they work? Well, when we collect our data (amount of money spent per visit), we assign each observation to a group depending on which advertising campaign the visit originated from. We then take the difference in the mean amount generated per campaign as our test statistic. The null hypothesis of a permutation test is that the group labels are arbitrary: randomly reassigning (or permuting) the labels and recalculating the mean difference should give values similar to the one we got from our original groups, so a mean difference of that size or bigger could easily arise by chance alone. The alternative hypothesis is that the group labels are not arbitrary, and that a mean difference of that size is unlikely to occur by chance. In a permutation test, we therefore permute the group labels a large number of times and see where our original mean difference ranks among the permuted mean differences. This is a bit confusing, but I’ll talk you through it step-by-step.
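To make the idea of shuffling labels concrete, here is a tiny sketch with six made-up visit amounts. The numbers and the names `toy`, `mean.diff`, `observed`, `shuffled` and `permuted` are purely illustrative and are not part of the data used in the rest of this post.

```r
# Six made-up purchase amounts, three per campaign (illustrative values only)
toy <- data.frame(group  = rep(c("Campaign 1", "Campaign 2"), each = 3),
                  amount = c(0, 120, 45, 0, 10, 30))

# Hypothetical helper: mean of campaign 1 minus mean of campaign 2
# for a given assignment of labels to amounts
mean.diff <- function(amount, group) {
    mean(amount[group == "Campaign 1"]) - mean(amount[group == "Campaign 2"])
}

# Mean difference under the labels we actually observed
observed <- mean.diff(toy$amount, toy$group)

# One permutation: shuffle the labels while the amounts stay where they are
set.seed(1)  # arbitrary seed, only so the shuffle is repeatable
shuffled <- sample(toy$group)
permuted <- mean.diff(toy$amount, shuffled)
```

Each call to `sample()` keeps three labels per campaign but hands them to different visits, so `permuted` is one draw from the distribution we would expect if the labels really were arbitrary.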

## Simulating some data

As with the last post, let’s say we collected a sample of 40 site visits for each campaign. To simulate the samples, I will resort to my much-loved method of creating Franken-distributions - in this case, I am merging elements of exponential and uniform distributions, plus throwing in some zero counts. This will give us some inflation around zero and a tapering off as the amount spent per visit increases, which is a far more realistic representation of the sort of data we’d collect.

    # Empty data frame with 40 visits per campaign
    data <- data.frame(group = rep(c("Campaign 1", "Campaign 2"), c(40, 40)),
                       amount.purchased = numeric(length = 80))

    # Fill each campaign with some zero-spend visits plus exponentially
    # distributed amounts (scaled by 100)
    set.seed(567)
    data$amount.purchased[data$group == "Campaign 1"] <- c(rep.int(0, 7),
                                                           rexp(33, rate = 1) * 100)
    data$amount.purchased[data$group == "Campaign 2"] <- c(rep.int(0, 10),
                                                           rexp(30, rate = 2.5) * 100)


As you can see in the histograms below, the distribution of observations for campaign 1 appears to differ from that for campaign 2, so the group labels are not likely to be arbitrary. The frequency of observations where nothing or very little was spent in a visit is lower in campaign 1, and the maximum amount spent in any visit was higher.
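A rough way to draw such histograms in base R, together with a numeric check of the two observations above, might look like this. The bin width of 25 and the names `breaks`, `zero.visits` and `max.spent` are assumptions made for this sketch; the code behind the post's actual figures lives in the gist mentioned at the end.

```r
# Recreate the simulated data so this sketch runs on its own
data <- data.frame(group = rep(c("Campaign 1", "Campaign 2"), c(40, 40)),
                   amount.purchased = numeric(length = 80))
set.seed(567)
data$amount.purchased[data$group == "Campaign 1"] <- c(rep.int(0, 7),
                                                       rexp(33, rate = 1) * 100)
data$amount.purchased[data$group == "Campaign 2"] <- c(rep.int(0, 10),
                                                       rexp(30, rate = 2.5) * 100)

# Shared bins (width 25, an arbitrary choice) so the two panels are comparable
breaks <- seq(0, ceiling(max(data$amount.purchased) / 25) * 25, by = 25)

# One histogram per campaign, side by side
par(mfrow = c(1, 2))
for (g in c("Campaign 1", "Campaign 2")) {
    hist(data$amount.purchased[data$group == g], breaks = breaks,
         main = g, xlab = "Amount spent per visit")
}
par(mfrow = c(1, 1))

# Number of zero-spend visits and largest single purchase per campaign
zero.visits <- tapply(data$amount.purchased == 0, data$group, sum)
max.spent   <- tapply(data$amount.purchased, data$group, max)
```

Printing `zero.visits` and `max.spent` gives a quick way to confirm what the histograms suggest without relying on the plot alone.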

## Creating the test statistic

The next step is creating the test statistic to assess whether the campaigns differ meaningfully in revenue per visit. This is simpler than in the last post: we can use the raw mean difference rather than standardising it.

    diff.means <- mean(data$amount.purchased[data$group == "Campaign 1"]) -
        mean(data$amount.purchased[data$group == "Campaign 2"])
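The rejection step further down relies on a vector called `perm.means`, which holds the mean difference for each permutation of the group labels. A minimal sketch of how such a vector could be built is shown here. The count of 1,000 permutations comes from the text of this post; the loop itself, the seed used for the shuffling and the names `n.perm` and `shuffled` are assumptions, so the exact numbers it produces may differ from those reported below.

```r
# Recreate the simulated data and the observed statistic
data <- data.frame(group = rep(c("Campaign 1", "Campaign 2"), c(40, 40)),
                   amount.purchased = numeric(length = 80))
set.seed(567)
data$amount.purchased[data$group == "Campaign 1"] <- c(rep.int(0, 7),
                                                       rexp(33, rate = 1) * 100)
data$amount.purchased[data$group == "Campaign 2"] <- c(rep.int(0, 10),
                                                       rexp(30, rate = 2.5) * 100)
diff.means <- mean(data$amount.purchased[data$group == "Campaign 1"]) -
    mean(data$amount.purchased[data$group == "Campaign 2"])

# Number of permutations (1,000, as stated in the post)
n.perm <- 1000
perm.means <- numeric(n.perm)

set.seed(123)  # assumption: arbitrary seed so the shuffles are repeatable
for (i in seq_len(n.perm)) {
    # Shuffle the labels; the amounts keep their original order
    shuffled <- sample(data$group)
    # Mean difference for this relabelling, same direction as diff.means
    perm.means[i] <- mean(data$amount.purchased[shuffled == "Campaign 1"]) -
        mean(data$amount.purchased[shuffled == "Campaign 2"])
}
```

Because `sample()` only reorders the existing labels, every permuted data set still has 40 visits per campaign, and `perm.means` approximates the distribution of mean differences we would see if the labels carried no information.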




## Rejecting or accepting the null hypothesis

To check whether our test statistic is significantly different from 0, we just check how it ranks compared to the permuted mean differences:

    # Number of permuted mean differences larger than the observed one
    sig <- sum(perm.means > diff.means)


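The number of permuted mean differences that exceeded the true mean difference was 0. There were 1,000 permutations, and counting the observed labelling as one more possible arrangement gives a p-value of (0 + 1)/(1,000 + 1) = 1/1001, or p ≈ 0.001. As this is less than 0.05, we reject the null hypothesis and conclude that campaign 1 generates significantly more income than campaign 2 per site visit. (Because we only counted permuted differences larger than the observed one, this is a one-sided test.)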
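The arithmetic can be wrapped in a small function. This is a sketch rather than the code used for this post: `perm.p.value` and its `two.sided` argument are hypothetical, and it counts ties with `>=` instead of `>`, which is a slightly more conservative convention.

```r
# Hypothetical helper: permutation p-value with the "+1" correction, which
# counts the observed labelling as one of the possible arrangements
perm.p.value <- function(perm.stats, observed, two.sided = FALSE) {
    if (two.sided) {
        # Differences at least as large in either direction
        extreme <- sum(abs(perm.stats) >= abs(observed))
    } else {
        # Differences at least as large in the observed direction only
        extreme <- sum(perm.stats >= observed)
    }
    (extreme + 1) / (length(perm.stats) + 1)
}

# Stand-in for the situation above: 1,000 permuted values, none of which
# reaches the observed difference (values are placeholders, not real results)
p <- perm.p.value(perm.stats = rep(0, 1000), observed = 50)
```

With no permuted value reaching the observed one, `p` is 1/1001, matching the calculation in the text. Setting `two.sided = TRUE` would instead ask whether the campaigns differ in either direction.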

## Take away message

This is a brief introduction to permutation tests, a family that includes well-known non-parametric methods such as Fisher’s exact test and the Wilcoxon rank-sum test. These tests are a useful part of your statistical arsenal when your data don’t fit the assumptions of parametric tests (as is often the case). However, they aren’t a magical fix-all for your problems and must be used sensibly! For example, the mean of such skewed data may not be a particularly meaningful summary, in which case a test of mean differences does not make much sense either.
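One appeal of permutation tests is that the statistic can be swapped for something better suited to skewed data. The sketch below is an illustration rather than a recommendation: `perm.test` is a hypothetical helper, and using the difference in medians is just one possible choice.

```r
# Hypothetical helper: one-sided permutation test for the difference in an
# arbitrary summary statistic between exactly two groups
perm.test <- function(values, groups, stat = mean, n.perm = 1000) {
    groups <- as.character(groups)
    lev <- unique(groups)
    stopifnot(length(lev) == 2)
    # Statistic of the first group minus statistic of the second group
    stat.diff <- function(g) stat(values[g == lev[1]]) - stat(values[g == lev[2]])
    observed <- stat.diff(groups)
    # Recompute the difference under n.perm random relabellings
    permuted <- replicate(n.perm, stat.diff(sample(groups)))
    list(observed = observed,
         p.value  = (sum(permuted >= observed) + 1) / (n.perm + 1))
}

# Same simulated data as in the rest of the post
data <- data.frame(group = rep(c("Campaign 1", "Campaign 2"), c(40, 40)),
                   amount.purchased = numeric(length = 80))
set.seed(567)
data$amount.purchased[data$group == "Campaign 1"] <- c(rep.int(0, 7),
                                                       rexp(33, rate = 1) * 100)
data$amount.purchased[data$group == "Campaign 2"] <- c(rep.int(0, 10),
                                                       rexp(30, rate = 2.5) * 100)

# Difference in medians instead of means
median.result <- perm.test(data$amount.purchased, data$group, stat = median)
```

Passing `stat = mean` instead would give a test of the same kind as the one walked through above, though not necessarily the same numbers, since the permutations are drawn afresh.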

As part of writing this post, I borrowed heavily from the code used in Thomas Lumley and Ken Rice’s presentation for the Summer Institute in Statistical Genetics, and used code and explanations from Charlie Geyer’s tutorial from his class at the University of Minnesota, Twin Cities.

Finally, the full code used to create the figures in this post is located in this gist on my GitHub page.