Comparing Distributions: EDF and Wasserstein Distance

ObjectPetitU
Oct 24, 2019

In the following we give an overview of some methods that can be used to assess the similarity or difference between distributions. The article will not provide rigorous mathematical proofs and constructions, but aims to discuss certain aspects so as to develop a greater intuition about them. We look at some well known statistics for testing normality and for comparing distributions, and we explore the Wasserstein distance as a metric for assessing the difference between distributions.

What is a Probability Distribution?

One of the main questions that statistics attempts to tackle is the nature of data and where it comes from. What we aim to do when we develop or explore probability distributions is to take data and fit a distribution which captures how likely events from that data set are, without requiring the whole data set. Essentially, we aim to parameterise the data: we represent the information in the data using a few parameters which describe a probability distribution.

In many scenarios we have data, and we want to assess whether it is similar to some other data, or whether it follows a particular distribution. This could be as simple as data generated at different times, such as energy consumption in winter versus summer, or it could be the sizes of trees planted in different locations.

In the following we present a few techniques.

Techniques for comparison

As the normal distribution is fundamental to statistics as a whole, many tests aim to assess whether data comes from the normal distribution or not. Let’s cover some simple heuristics which are widely used.

Anderson-Darling

Test for normality: the Anderson-Darling test checks whether data comes from a particular distribution. It can be used to test against the normal distribution, as well as others such as the exponential, extreme-value, Weibull, gamma, logistic, Cauchy, and von Mises distributions.

The output is a test statistic which needs to be compared against a table of critical values.
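As a concrete illustration, here is a minimal sketch of the test using SciPy's scipy.stats.anderson; the sample is randomly generated purely for the example.

```python
import numpy as np
from scipy import stats

# Synthetic sample purely for illustration.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)

result = stats.anderson(data, dist="norm")
print("A-D statistic:", result.statistic)

# SciPy returns the critical values for a set of significance levels;
# the statistic is compared against each of them.
for crit, sig in zip(result.critical_values, result.significance_level):
    decision = "reject normality" if result.statistic > crit else "cannot reject"
    print(f"{sig}% level: critical value {crit:.3f} -> {decision}")
```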

Shapiro-Wilk Test

The Shapiro-Wilk test also aims to establish whether a data set comes from a normal distribution or not. The null hypothesis is that the data is normally distributed. A test statistic is generated, and the corresponding p-value is compared against a chosen significance level to accept or reject the null hypothesis.
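A similar minimal sketch with SciPy's scipy.stats.shapiro, again on a synthetic sample:

```python
import numpy as np
from scipy import stats

# Synthetic sample purely for illustration.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)

stat, p_value = stats.shapiro(data)
print("W statistic:", stat, "p-value:", p_value)

# If p_value falls below the chosen significance level (e.g. 0.05),
# we reject the null hypothesis that the data is normally distributed.
```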

Kolmogorov-Smirnov Test

The K-S test is a non-parametric test which can be used to compare two data sets, or to compare one data set against a particular distribution. When testing whether a variable comes from the normal distribution, it generates the empirical CDF of the data, takes the CDF of the normal distribution, and finds a distance measure between the two. When comparing two variables, it generates both empirical CDFs and measures the distance between them.

When comparing two different data sets, it generates the Kolmogorov-Smirnov statistic, which needs to be compared against a table of critical values. Because it is a hypothesis test, a p-value is also produced, which can be tested against varying levels of significance α (0.001, 0.05, etc.).

If the p-value generated is less than α, then it is likely that the two distributions are different.

Comparing the distance between CDFs in the Kolmogorov-Smirnov test
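Both uses of the test are available in SciPy: kstest compares one sample against a reference distribution, while ks_2samp compares two samples. A minimal sketch on synthetic data:

```python
import numpy as np
from scipy import stats

# Synthetic samples purely for illustration.
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.5, scale=1.0, size=500)

# One sample against the standard normal CDF.
stat_one, p_one = stats.kstest(a, "norm")

# Two independent samples against each other.
stat_two, p_two = stats.ks_2samp(a, b)

# A p-value below the chosen significance level suggests the distributions differ.
print("one-sample p-value:", p_one)
print("two-sample p-value:", p_two)
```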

Mann-Whitney U Test

This test looks at whether two independent samples come from populations which have the same distribution.
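A minimal sketch with SciPy's scipy.stats.mannwhitneyu, using two synthetic samples whose locations differ:

```python
import numpy as np
from scipy import stats

# Synthetic samples purely for illustration; their locations differ.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200)
y = rng.normal(loc=1.0, scale=1.0, size=200)

stat, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
# A small p-value suggests the two samples come from populations
# with different distributions.
print("U statistic:", stat, "p-value:", p_value)
```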

Wasserstein Distance

While the above are statistical tests which can be used to assess whether two samples come from the same distribution, or whether data follows the normal distribution, we can also use the Wasserstein distance.

Without going into too much detail, the Wasserstein distance, also known as the 'earth mover's distance', is a distance measure between distributions. It comes from a field known as Optimal Transport Theory. The intuition is as follows:

If we consider each distribution as a pile of dirt, how much effort or 'cost' is required to turn one pile into the other?

What should we expect to see?

The Wasserstein distance is as follows:

Wasserstein distance, where X and Y are probability distributions
Lower bound for the Wasserstein distance when X and Y are normal distributions
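In standard optimal-transport notation, these can be sketched as follows (the exact form in the original figures may differ):

```latex
% 1-Wasserstein distance between distributions X and Y, where Gamma(X, Y)
% is the set of all couplings (joint distributions) with marginals X and Y:
W_1(X, Y) = \inf_{\gamma \in \Gamma(X, Y)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\,|x - y|\,\big]

% For two one-dimensional normal distributions the 2-Wasserstein distance
% has a closed form, which gives a lower bound in terms of the means:
W_2\big(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2)\big)^2
  = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2 \;\ge\; (\mu_1 - \mu_2)^2
```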

Thus we can see from the lower bound that when X and Y are two normal distributions, the Wasserstein distance between them will be at least the difference between the means.

Comparing normal distributions

Here we take distribution A to be a normal distribution with mean = 0 and sd = 1. Distribution B has a mean ranging over [0, 8], with a fixed sd = 1.
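A minimal sketch of how such a comparison might be set up, using scipy.stats.wasserstein_distance and scipy.stats.ks_2samp on samples drawn from the two distributions (the sample size and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=5000)   # distribution A: N(0, 1)

for mean_b in range(0, 9):                      # distribution B: N(mean_b, 1)
    b = rng.normal(loc=mean_b, scale=1.0, size=5000)
    w = stats.wasserstein_distance(a, b)        # estimated from the samples
    _, ks_p = stats.ks_2samp(a, b)
    print(f"mean B = {mean_b}: Wasserstein ~ {w:.2f}, KS p-value = {ks_p:.3g}")
```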

The results seem intuitive. As we increase the difference between the means, the Wasserstein distance increases. It's interesting to see that the relationship is linear, which is exactly what we should be seeing: the Wasserstein distance, when used to compare two normal distributions with the same standard deviation, should give us the value of the difference between the means.

The p-value from the KS test is zero, indicating that the two distributions are different. This makes sense: as we change the mean of the normal distribution, its shape remains the same, but its location changes. The KS statistic sees that shift and correctly identifies the difference.

Comparing Normal with t-Distribution

We compare a normal distribution with mean = 0 and SD = 1 with a t-distribution whose degrees of freedom vary over [1, 9].
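A sketch of the same setup for the normal-versus-t comparison (again, the sample size is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=5000)   # N(0, 1)

for df in range(1, 10):                         # t-distribution, df = 1..9
    b = rng.standard_t(df, size=5000)
    w = stats.wasserstein_distance(a, b)
    _, ks_p = stats.ks_2samp(a, b)
    print(f"df = {df}: Wasserstein ~ {w:.2f}, KS p-value = {ks_p:.3g}")
```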

We can see that the p-value generated by the KS test is below the thresholds of 0.05 and 0.001, and thus it considers the samples to come from different distributions. This does not really tell us a lot. In comparison, the Wasserstein distance shows how the distance decreases as the degrees of freedom increase.

The t-distribution has mean = 0 for degrees of freedom > 1. While in the comparison of Gaussians the Wasserstein distance equals the difference between the means, here we see that this lower bound is not tight: with a normal distribution with mean = 0 and a t-distribution with mean = 0, the values mostly hover around a distance of 1.

The t-distribution has heavy tails: its tails are not bounded and carry much more probability mass than those of the normal distribution. However, as we increase the degrees of freedom, the tails become lighter and the probability mass becomes less spread out, making the distribution closer to a normal distribution. The Wasserstein distance is able to recognise this.

Comparing the shape of Normal distribution and t-distributions with varying degrees of freedom

Comparing t-distribution with t-distribution

Here we compare a t-distribution with 3 degrees of freedom with a t-distribution whose degrees of freedom vary over [1, 9].
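The same sketch applies here, with the reference sample swapped for a t-distribution with 3 degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.standard_t(3, size=5000)                # reference: t-distribution, df = 3

for df in range(1, 10):
    b = rng.standard_t(df, size=5000)
    w = stats.wasserstein_distance(a, b)
    _, ks_p = stats.ks_2samp(a, b)
    print(f"df = {df}: Wasserstein ~ {w:.2f}, KS p-value = {ks_p:.3g}")
```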

The p-values from the KS test are choppy, and only in the first instance do they fall below 0.05. This indicates that initially the samples appear to come from different distributions, but, as the degrees of freedom increase, it highlights the possibility of them coming from the same distribution.

The Wasserstein distance initially shows a large distance between the two, and then decreases as the degrees of freedom increase. As the mean of a t-distribution with degrees of freedom > 1 is equal to zero, we can see that the values for the Wasserstein distance are very close to zero.

Which method shall we use?

The tests described above are what are known as Empirical Distribution Function (EDF) tests. The aim is to generate EDFs from the data and compare them with the CDF of the normal distribution to assess closeness.

The authors in [1] claim that the Shapiro-Wilk test is the most powerful, followed by Anderson-Darling, and then Kolmogorov-Smirnov. However, it should be understood that with small sample sizes, even the Shapiro-Wilk test does not perform well.

The Wasserstein distance is very different from the EDF tests. When comparing Gaussians, it gives us an answer that is the difference between the means. However, we can only interpret it that way if we know they were both Gaussians in the first place; thus the Wasserstein distance is really a way to look at the distance between distributions, not a test for normality.

Bibliography

[1] N. M. Razali and Y. B. Wah, "Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests", Journal of Statistical Modeling and Analytics, 2011.
