Sometimes you need a lot more data than you think

ObjectPetitU
7 min read · Nov 11, 2022

This article brings to light work undertaken by Nassim Taleb. Taleb’s work focuses on fat-tailed distributions: how things are not as normally distributed as commonly assumed, and what the implications of that are. This article covers some of the foundations of his thinking around fat-tailed distributions, how the Central Limit Theorem and the Law of Large Numbers don’t always work in real-life scenarios the way we expect, and how, practically, we can overcome some of these challenges.

Note: Much of the work here follows from his Technical Incerto: Statistical Consequences of Fat Tails.

Fat Tailed Distributions

Fat-tailed (or heavy-tailed) distributions are a group of distributions that assign a higher probability to extreme events than, say, a normal distribution. They are usually identified by a large skew away from the normal and a larger kurtosis.

We can see how the tails extend further to the right in the different distributions

What is evident from the chart above is how much probability exists ‘in the tail’ as we move from the normal, to the lognormal, to a Pareto distribution (alpha = 1.25).

NOTE: There is a separate discussion on whether a lognormal distribution has a fat tail or not, which we don’t dive into here; suffice to say, its tail is ‘fatter’ than a normal distribution’s.

We can give two examples of the differences between ‘regular’ and fat-tailed distributions (see the sketch after this list):

  • Thin-tailed: as the sample gets large, no single observation can really modify the statistical properties.
  • Fat-tailed: extreme events play a disproportionately large role in determining the statistical properties.
  • Thin-tailed: the probability of sampling higher than X twice in a row is greater than the probability of sampling higher than 2X once.
  • Fat-tailed: the probability of sampling higher than 2X once is greater than the probability of sampling higher than X twice in a row.
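
To make those last two bullets concrete, here is a minimal sketch comparing the survival functions of a standard normal and a Pareto. The threshold of 3 and alpha = 1.25 are my illustrative choices, not values from the book:

```python
# Comparing tail behaviour via survival functions: P(X > x).
from scipy import stats

t = 3.0  # threshold; an illustrative choice

norm_sf = stats.norm.sf              # standard normal
pareto_sf = stats.pareto(b=1.25).sf  # Pareto with tail exponent 1.25

# Thin tail: exceeding t twice in a row beats exceeding 2t once.
print(norm_sf(t) ** 2, ">", norm_sf(2 * t))      # ~1.8e-06 > ~9.9e-10

# Fat tail: one jump past 2t is more likely than two jumps past t.
print(pareto_sf(2 * t), ">", pareto_sf(t) ** 2)  # ~0.107 > ~0.064
```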

Are you sure it’s Gaussian?

This is a great example that Taleb uses in his book (pg. 54), and it highlights how to work with systems where extreme events have occurred.

Say we have two possible distributions, either a Gaussian or a power law. A 10 sigma event (this is pretty serious, I mean, this is a really serious event) has a probability of 1 in 1.31×10²³ under a Gaussian distribution. For a power law of the same scale, we take a Student t distribution with tail exponent = 2, where the survival probability is 1 in 203.

What is the probability of the data being Gaussian, conditional on a 10 sigma event, compared to the alternative?

If there is even a tiny probability, say less than 10⁻¹⁰, that the data might not be Gaussian, one can firmly reject Gaussianity in favor of the thick-tailed distribution. The heuristic is to reject Gaussianity in the presence of any event beyond 4 or 5 standard deviations.

Let’s think about this for a second: a 10 sigma event happening is extraordinary, so unless you’re absolutely certain (with probability 1) that the data comes from a Gaussian distribution, you should use the power law distribution.
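
We can run the Bayesian arithmetic behind this argument ourselves. A minimal sketch, using the two likelihoods from the book; the prior values are my own illustrative choices:

```python
# Bayes' rule on the 10 sigma example.
from scipy import stats

p_event_gaussian = stats.norm.sf(10)  # ~7.6e-24, i.e. ~1 in 1.31e23
p_event_powerlaw = 1 / 203            # Student t, tail exponent 2 (from the book)

for p_gaussian in (0.5, 1 - 1e-10):   # priors that the data is Gaussian
    posterior = p_gaussian * p_event_gaussian / (
        p_gaussian * p_event_gaussian + (1 - p_gaussian) * p_event_powerlaw
    )
    print(f"prior {p_gaussian}: posterior P(Gaussian | 10 sigma) = {posterior:.2e}")
# Even with a prior of 1 - 1e-10, the posterior is ~1.5e-11:
# the 10 sigma observation all but rules the Gaussian out.
```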

The above is a great example of the ethos of Taleb’s work: are you sure your data definitely comes from a Gaussian? Those outliers you are throwing away from your data because they can’t be managed: maybe you should think twice about that and re-structure how you see the world. This doesn’t necessarily mean you use a Pareto distribution, but it could mean using a distribution which has ‘fatter’ tails than a normal. (Why? This is for another article, but suffice to say it’s all about risk and pay-off, or as Taleb says, better be convex than right.)

Law of Large Numbers

Given X₁, X₂, …, Xₙ are independent and identically distributed (i.i.d.) random variables with

𝔼(Xᵢ) = 𝜇,

then the sample mean X̄ₙ converges to the population mean 𝜇 as n goes to infinity.
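
A quick sanity check of the theorem with a well-behaved, thin-tailed distribution; the exponential and the sample sizes are my illustrative choices:

```python
# Running mean of i.i.d. draws with a finite mean: it settles as n grows.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # population mean = 1
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])  # drifts toward 1 as n grows
```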

Central Limit Theorem

Given X₁, …, Xₙ are i.i.d. random variables with

𝔼(Xᵢ) = 𝜇,

Var(Xᵢ) = 𝜎² < ∞,

then as n goes to infinity, √n (X̄ₙ − 𝜇) converges in distribution to a normal with mean zero and variance 𝜎².

This holds regardless of which distribution the samples come from, so long as the variance is finite.
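
Again, a minimal sketch: if we average n uniform draws, the skewness and excess kurtosis of the means should head to the Gaussian’s values (both 0) as n grows. The sample sizes and seed are illustrative choices:

```python
# Standardised means of n uniform draws: skewness and excess kurtosis
# of the Gaussian are both 0, and the means get there quickly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (2, 5, 30):
    means = rng.uniform(size=(50_000, n)).mean(axis=1)
    print(n, stats.skew(means), stats.kurtosis(means))
# skewness ~ 0 throughout (uniform is symmetric); excess kurtosis
# shrinks roughly as -1.2 / n, so n = 30 is already close to Gaussian.
```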

What happens when we don’t have an infinite number of samples?

N is not infinite in the real world — Pre Asymptotics

Now we get to the meat of the argument. The Law of Large Numbers and the Central Limit Theorem above are predicated on n, the number of samples, going to infinity; i.e. the theorems hold as we move towards an infinite number of samples.

Do we ever really get to an infinite number of data points in real life? I think the obvious answer is a resounding NO. We never reach infinite data points; while we may be living in a world of ‘Big Data’, most data sets are still not that big. This is what we call pre-asymptotics: what happens to the Law of Large Numbers and the Central Limit Theorem when n does not go to infinity? Well, quite simply, the theorems don’t always hold, particularly when the underlying data comes from fat-tailed distributions.

Law of large numbers — large not infinite

As we can see from the graph below, the stability of the mean isn’t always guaranteed. For the Pareto case, there is a large flux, and even at 2,000 samples there is no guarantee that the mean has stabilised.

Calculating averages as the sample size increases, for various distributions
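
Here is a sketch that reproduces the spirit of the chart above; the parameters and seed are my own, since the article’s exact simulation settings aren’t given:

```python
# Running means from three distributions, in the spirit of the chart.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
samples = {
    "normal": rng.normal(size=n),
    "lognormal": rng.lognormal(mean=0.0, sigma=1.0, size=n),
    # numpy's pareto is the Lomax form; +1 shifts it to the classical Pareto
    "pareto (alpha = 1.25)": rng.pareto(1.25, size=n) + 1,
}
for name, x in samples.items():
    running = np.cumsum(x) / np.arange(1, n + 1)
    print(name, running[[99, 999, 1999]])
# The normal's running mean settles quickly; the Pareto's keeps jumping,
# because a single observation can still move it at n = 2000.
```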

Central Limit Theorem: Pre Asymptotic

Evolution of the distribution of the mean when sampling from a uniform distribution

The above chart clearly shows how, with only a few samples taken from a uniform distribution, the mean moves towards a Gaussian.

Evolution of the distribution of the mean when sampling from a lognormal distribution

With the lognormal distribution having a fatter tail than the normal, we can see that even at 20 summands there is still some skew; we are not seeing a completely symmetric distribution. #Preasymptoticlife

Evolution of the distribution of the mean when sampling from a Pareto distribution

Even after 1,000 summands, sampling from the Pareto distribution doesn’t get the distribution of the mean even close to a normal.
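
We can make the same point numerically by looking at the skewness of the sample means (a Gaussian limit would have skewness zero); the replication counts and seed are my illustrative choices:

```python
# Skewness of the distribution of Pareto sample means: a Gaussian
# limit would give ~0, but it stays large even at 1,000 summands.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
for n in (30, 1_000):
    # 20,000 independent sample means, each over n Pareto draws
    means = (rng.pareto(1.25, size=(20_000, n)) + 1).mean(axis=1)
    print(n, stats.skew(means))
# The skewness is large and unstable from seed to seed: the sample
# mean is nowhere near its Gaussian limit. Pre-asymptotia in action.
```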

The question we then need to ask is: given we might not be sampling from a normal or thin-tailed distribution, and given that we don’t have infinite samples, how much data do we need for the Law of Large Numbers and the Central Limit Theorem to hold?

How much data do we need?

Taleb builds the metric on M(n), the expected mean absolute deviation from the mean of n summands:

M(n) = 𝔼(|Sₙ − 𝔼(Sₙ)|), where Sₙ = X₁ + X₂ + … + Xₙ

The kappa metric is then defined as:

𝜅(n₀, n) = 2 − (log(n) − log(n₀)) / log(M(n) / M(n₀))

The kappa metric allows us to understand the preasymptotic properties of a given distribution, while also allowing us to compare n-summands across different distributions, or different n-summands of the same distribution. As a general rule, a kappa value above 0.15 indicates that the ‘normal approximation’ is highly unreliable.
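
Here is a Monte Carlo sketch of 𝜅₁ = 𝜅(1, 2) straight from the definition above. The estimate is noisy for fat-tailed cases (the book computes kappa analytically), and the replication count is an arbitrary choice:

```python
# Monte Carlo estimate of kappa(1, 2) from the definition above.
import numpy as np

rng = np.random.default_rng(4)

def mad_of_sum(draw, n, reps=200_000):
    """E|S_n - E(S_n)| for the sum S_n of n i.i.d. draws (Monte Carlo)."""
    s = draw((reps, n)).sum(axis=1)
    return np.abs(s - s.mean()).mean()

distributions = {
    "normal": lambda size: rng.normal(size=size),
    "pareto 80/20": lambda size: rng.pareto(1.16, size=size) + 1,
}
for name, draw in distributions.items():
    kappa_1 = 2 - np.log(2) / np.log(mad_of_sum(draw, 2) / mad_of_sum(draw, 1))
    print(name, round(kappa_1, 3))
# Roughly 0 for the normal, around 0.8 for the Pareto, though the
# fat-tailed estimate is noisy; the book derives kappa analytically.
```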

Taleb then approximates n_v, the number of samples required from a given distribution to match the drop in variance you would get from averaging n_g samples of a Gaussian, as

n_v ≈ n_g^(1/(1 − 𝜅₁))

where 𝜅₁ = 𝜅(1, 2) is the baseline kappa value of the distribution being considered.

A Pareto distribution with tail exponent alpha ≈ 1.16 represents the common 80/20 Pareto distribution. If we have 30 samples of a Gaussian, then according to the formula (𝜅₁ = 0.829, which is given in the book and can be calculated using the above), we will require 434,646,305 samples from the Pareto to get the same drop in variance from averaging (i.e. the same tightening of confidence intervals). That’s nearly 14.5 million times more!!!
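
A back-of-the-envelope check of those numbers, using the n_v ≈ n_g^(1/(1 − 𝜅₁)) approximation from above:

```python
# Back-of-the-envelope check of the numbers above.
n_g, kappa_1 = 30, 0.829
n_v = n_g ** (1 / (1 - kappa_1))
print(f"{n_v:.3e}")        # ~4.35e+08 samples needed from the Pareto
print(f"{n_v / n_g:.3e}")  # ~1.45e+07, i.e. ~14.5 million times more
```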

This tells us the rate of convergence, or, in the example above, just how slow convergence under the Central Limit Theorem can be.

Conclusion

Taleb has developed the kappa metric to understand the rate of convergence of summands to the Gaussian, and how much data we would need to get the same drop in variance as a Gaussian. I think what’s more pertinent is the thinking behind the metric: the approach to thinking about data, especially when we do see what we might consider ‘extreme’ events. Data with such events occurs in all aspects of our lives, and for those of us who are practitioners of science, it is incumbent on us to understand why we believe the world is structured the way it is, and to be able to change that structure in light of new events.
