
Unraveling the Law of Large Numbers


The LLN is interesting as much for what it doesn’t say as for what it does

On August 24, 1966, a talented playwright by the name of Tom Stoppard staged a play in Edinburgh, Scotland. The play had a curious title, “Rosencrantz and Guildenstern Are Dead.” Its central characters, Rosencrantz and Guildenstern, are childhood friends of Hamlet (of Shakespearean fame). The play opens with Guildenstern repeatedly tossing coins that keep coming up Heads. Each outcome makes Guildenstern’s money-bag lighter and Rosencrantz’s heavier. As the drumbeat of Heads continues with a pitiless persistence, Guildenstern grows fearful. He worries that he is secretly willing each coin to come up Heads as a self-inflicted punishment for some long-forgotten sin. Or that time stopped after the first flip, and he and Rosencrantz are experiencing the same outcome over and over again.

Stoppard does a superb job of showing how the laws of probability are woven into our view of the world, into our sense of expectation, into the very fabric of human thought. When the 92nd flip also comes up Heads, Guildenstern asks whether he and Rosencrantz are within the grip of an unnatural reality where the laws of probability no longer operate.

Guildenstern’s fears are, of course, unfounded. Granted, the probability of getting 92 Heads in a row is unimaginably small. In fact, it is a decimal point followed by 27 zeroes followed by 2. Guildenstern is more likely to be hit on the head by a meteorite.

Guildenstern only has to come back the next day and flip another sequence of 92 coin tosses, and the result will almost certainly be vastly different. If he were to follow this routine every day, he would find that on most days the number of Heads roughly matches the number of Tails. Guildenstern is experiencing a fascinating behavior of our universe known as the Law of Large Numbers.

The LLN, as it is known, comes in two flavors: the weak and the strong. The weak LLN is the more intuitive of the two and the easier one to relate to. But it’s also easy to misinterpret. I’ll cover the weak version in this article and leave the discussion of the strong version for a later article.

The weak Law of Large Numbers concerns itself with the relationship between the sample mean and the population mean. I’ll explain what it says in plain language:

Suppose you draw a random sample of a certain size, say 100, from the population. By the way, make a mental note of the term sample size. The size of the sample is the ringmaster, the grand pooh-bah of this law. Now calculate the mean of this sample and set it aside. Next, repeat this process many times over. What you’ll get is a set of imperfect means. The means are imperfect because there will always be a ‘gap’, a delta, a deviation between them and the true population mean. Let’s assume you’ll tolerate a certain deviation. If you select a sample mean at random from this set of means, there is some probability that the absolute difference between the sample mean and the population mean exceeds your tolerance.

The weak Law of Large Numbers says that the probability of this deviation’s exceeding your chosen level of tolerance shrinks to zero as the sample size grows to infinity (or to the size of the population).

No matter how tiny your chosen level of tolerance, as you draw sets of samples of ever-increasing size, it becomes increasingly unlikely that the mean of a randomly chosen sample from the set will deviate from the population mean by more than this tolerance.

To see how the weak LLN works, we’ll run it through an example. And for that, allow me, if you will, to take you to the cold, brooding expanse of the Northeastern North Atlantic Ocean.

Every day, the Government of Ireland publishes a dataset of water temperature measurements taken from the surface of the North East North Atlantic. This dataset contains hundreds of thousands of measurements of surface water temperature indexed by latitude and longitude. For example, the data for June 21, 2023 is as follows:

Dataset of water surface temperatures of the North East North Atlantic Ocean (CC BY 4.0)

It’s kind of hard to imagine what eight hundred thousand surface temperature values look like. So let’s create a scatter plot to visualize this data. I’ve shown this plot below. The vacant white areas in the plot represent Ireland and the UK.

A color-coded scatter plot of sea surface temperatures of the Northeastern North Atlantic (Image by Author) (Data source: Dataset)

As a student of statistics, you will almost never have access to the ‘population’. So you’d be correct in severely chiding me if I declare this collection of 800,000 temperature measurements to be the ‘population’. But bear with me for a little while. You’ll soon see why, in our quest to understand the LLN, it helps to treat this data as the ‘population’.

So let’s assume that this data is (ahem…cough) the population. The average surface water temperature across the 810,219 locations in this population of values is 17.25840 degrees Celsius. 17.25840 is simply the average of the 810K temperature measurements. We’ll designate this value as the population mean, μ. Remember this value. You’ll need to refer to it often.

Now suppose this population of 810,219 values is not accessible to you. Instead, all you have access to is a meager little sample of 20 random locations drawn from this population. Here’s one such random sample:

A random sample of size 20 (Image by Author)

The mean temperature of the sample is 16.9452414 degrees C. This is our sample mean X_bar, which is computed as follows:

X_bar = (X1 + X2 + X3 + … + X20) / 20
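If you’d like to follow along in code, here’s a minimal Python sketch of this step. The temperatures array below is only a stand-in for the 810,219 published measurements (synthetic values are generated so the snippet runs on its own); with the real dataset loaded into that array, the rest works unchanged:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for the 810,219 published surface temperature values;
# in practice you would load the real measurements into this array.
temperatures = rng.normal(loc=17.2584, scale=1.5, size=810_219)

mu = temperatures.mean()  # the population mean

# One random sample of 20 locations, and its mean X_bar
sample = rng.choice(temperatures, size=20, replace=False)
x_bar = sample.mean()

print(f"population mean mu = {mu:.5f}, sample mean X_bar = {x_bar:.5f}")
```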

You can just as easily draw a second, a third, indeed any number of such random samples of size 20 from the same population. Here are a few such samples for illustration:

Random samples of size 20, each drawn from the population (Image by Author)

A quick aside on what a random sample really is

Before moving ahead, let’s pause a bit to get some perspective on the concept of a random sample. It will make it easier to grasp how the weak LLN works. And to acquire this perspective, I want to introduce you to the casino slot machine:

A casino slot machine (Image source: Pixabay)

The slot machine shown above contains three slots. Each time you crank down the arm of the machine, the machine fills each slot with a picture chosen at random from an internally maintained population of images, such as a list of fruit pictures. Now imagine a slot machine with 20 slots named X1 through X20. Assume that the machine is designed to select values from a population of 810,219 temperature measurements. When you pull down the arm, each of the 20 slots, X1 through X20, fills with a randomly chosen value from the population of 810,219 values. Therefore, X1 through X20 are random variables that can each hold any value from the population. Taken together they form a random sample. Put another way, each element of a random sample is itself a random variable.

X1 through X20 have a few interesting properties:

  • The value that X1 acquires is independent of the values that X2 through X20 acquire. The same applies to X2, X3, …, X20. Thus X1 through X20 are independent random variables.
  • Because X1, X2, …, X20 can each hold any value from the population, the expected value of each of them is the population mean, μ. Using the notation E() for expectation, we write this result as follows:
    E(X1) = E(X2) = … = E(X20) = μ.
  • X1 through X20 have identical probability distributions.

Thus, X1, X2,…,X20 are independent, identically distributed (i.i.d.) random variables.

…and now we get back to showing how the weak LLN works

Let’s compute the mean (denoted by X_bar) of this 20-element sample and set it aside. Now let’s again crank down the machine’s arm, and out will pop another 20-element random sample. We’ll compute its mean and set it aside too. If we repeat this process one thousand times, we will have computed one thousand sample means.

Here’s a table of 1000 sample means computed this way. We’ll designate them as X_bar_1 to X_bar_1000:

A table of 1000 sample means. Each mean is computed from a random sample of size 20

Now consider the following statement carefully:

Because the sample mean is calculated from a random sample, the sample mean is itself a random variable.

At this point, if you are sagely nodding your head and stroking your chin, it is very much the right thing to do. The realization that the sample mean is a random variable is one of the most penetrating realizations one can have in statistics.

Notice also how each sample mean in the table above lies some distance away from the population mean, μ. Let’s plot a histogram of these sample means to see how they’re distributed around μ:

A histogram of sample means (Image by Author)

Most of the sample means appear to lie near the population mean of 17.25840 degrees Celsius. However, some of them are considerably distant from μ. Suppose your tolerance for this distance is 0.25 degrees Celsius. If you were to plunge your hand into this bucket of 1000 sample means, grab whichever mean falls within your grasp, and pull it out, what would be the probability that the absolute difference between this mean and μ is equal to or greater than 0.25 degrees C? To estimate this probability, you would count the number of sample means that are at least 0.25 degrees away from μ and divide this count by 1000.

In the above table, this count happens to be 422, and so the probability P(|X_bar − μ| ≥ 0.25) works out to be 422/1000 = 0.422.
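Here’s a minimal sketch of that estimation in Python, continuing with the stand-in temperatures array from the earlier snippet. The count you get won’t be exactly 422, since the data and the random draws here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
temperatures = rng.normal(loc=17.2584, scale=1.5, size=810_219)  # stand-in data
mu = temperatures.mean()
tolerance = 0.25

# 1000 sample means, each computed from a random sample of size 20
sample_means = np.array([
    rng.choice(temperatures, size=20, replace=False).mean()
    for _ in range(1000)
])

# Count how many sample means deviate from mu by at least the tolerance
count = int(np.sum(np.abs(sample_means - mu) >= tolerance))
print(f"count = {count}, P(|X_bar - mu| >= {tolerance}) ~= {count / 1000:.3f}")
```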

Let’s park this probability for a minute.

Now repeat all of the above steps, but this time use a sample size of 100 instead of 20. So here’s what you’ll do: draw 1000 random samples, each of size 100; take the mean of each sample; store away all those means; count the ones that are at least 0.25 degrees C away from μ; and divide this count by 1000. If that sounded like the labors of Hercules, you weren’t mistaken. So take a moment to catch your breath. And once you’re all caught up, see below what you got as the fruit of your labors.

The table below contains the means of the 1000 random samples, each of size 100:

A table of 1000 sample means. Each mean is computed from a random sample of size 100

Out of these one thousand means, fifty-six happen to deviate by at least 0.25 degrees C from μ. That puts the probability of running into such a mean at 56/1000 = 0.056. This probability is decidedly smaller than the 0.422 we computed earlier when the sample size was only 20.

If you repeat this sequence of steps multiple times, each time with a different, incrementally larger sample size, you’ll get yourself a table full of probabilities. I’ve done this exercise for you by dialing up the sample size from 10 through 490 in steps of 10. Here’s the result:

A table of probabilities showing P(|X_bar − μ| ≥ 0.25) as the sample size is dialed up from 10 to 490 (Image by Author)

Each row in this table corresponds to 1000 different samples that I drew at random from the population of 810,219 temperature measurements. The sample_size column gives the size of each of those 1000 samples. Once they were drawn, I took the mean of each sample and counted the ones that were at least 0.25 degrees C away from μ. The num_exceeds_tolerance column records this count. The probability column is num_exceeds_tolerance divided by 1000, the number of samples drawn.

Notice how this count attenuates rapidly as the sample size increases, and so does the corresponding probability P(|X_bar − μ| ≥ 0.25). By the time the sample size reaches 320, the probability has decayed to zero. It blips up to 0.001 occasionally, but that’s because I have drawn a finite number of samples. If I drew 10,000 samples each time instead of 1000, not only would the occasional blips flatten out, but the attenuation of the probabilities would also become smoother.
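If you want to reproduce the whole table, a sketch along the following lines will do it (again with the stand-in temperatures array; the column names sample_size and num_exceeds_tolerance simply mirror the table above):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
temperatures = rng.normal(loc=17.2584, scale=1.5, size=810_219)  # stand-in data
mu = temperatures.mean()
tolerance = 0.25
num_samples = 1000

print("sample_size  num_exceeds_tolerance  probability")
for sample_size in range(10, 500, 10):
    # 1000 sample means, each from a random sample of the current size
    means = np.array([
        rng.choice(temperatures, size=sample_size, replace=False).mean()
        for _ in range(num_samples)
    ])
    num_exceeds_tolerance = int(np.sum(np.abs(means - mu) >= tolerance))
    probability = num_exceeds_tolerance / num_samples
    print(f"{sample_size:11d}  {num_exceeds_tolerance:21d}  {probability:.3f}")
```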

The following graph plots P(|X_bar − μ| ≥ 0.25) against the sample size. It puts into sharp relief how the probability plunges to zero as the sample size grows.

P(|X_bar − μ| ≥ 0.25) plotted against sample size (Image by Author)

Instead of 0.25 degrees C, what if you chose a different tolerance, either a lower or a higher value? Will the probability decay regardless of your chosen level of tolerance? The following family of plots illustrates the answer to this question.

The probability P(|X_bar − μ| ≥ ε) decays to zero as the sample size increases, for all values of ε (Image by Author)

No matter how frugal, how tiny, your choice of tolerance (ε), the probability P(|X_bar − μ| ≥ ε) will always converge to zero as the sample size grows. This is the weak Law of Large Numbers in action.

The behavior of the weak LLN can be formally stated as follows:

Suppose X1, X2, …, Xn are i.i.d. random variables that together form a random sample of size n. Suppose X_bar_n is the mean of this sample. Suppose also that E(X1) = E(X2) = … = E(Xn) = μ. Then for any positive real number ε, the probability of X_bar_n being at least ε away from μ tends to zero as the size of the sample tends to infinity. The following exquisite equation captures this behavior:

The weak Law of Large Numbers (Image by Author)
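Rendered here in LaTeX, the statement in the image amounts to the following (with X̄_n denoting the mean of a sample of size n, as in the text):

```latex
\lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| \geq \epsilon \right) = 0
\qquad \text{for every } \epsilon > 0
```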

Over the 310-year history of this law, mathematicians have been able to progressively relax the requirement that X1 through Xn be independent and identically distributed while still preserving the spirit of the law.

The principle of “convergence in probability”, the “plim” notation, and the art of saying really important things in really few words

This particular mode of converging to a value, with probability as the means of transport, is called convergence in probability. In general, it is stated as follows:

Convergence in Probability (Image by Author)
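Written out in LaTeX, the definition shown in the image amounts to the following (X_n is a sequence of random variables and X is its limit in probability):

```latex
\lim_{n \to \infty} P\left( \left| X_n - X \right| \geq \epsilon \right) = 0
\qquad \text{for every } \epsilon > 0
```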

In the above equation, X_n and X are random variables, and ε is a positive real number. The equation says that as n tends to infinity, X_n converges in probability to X.

Throughout the immense expanse of statistics, you’ll keep running into a quietly unassuming notation called plim. It’s pronounced ‘p lim’, or ‘plim’ (like the word ‘plum’ but with an ‘i’), or probability limit. plim is shorthand for saying that a measure, such as the mean, converges in probability to a particular value. Using plim, the weak Law of Large Numbers can be stated pithily as follows:

The weak Law of Large Numbers expressed using very little ink (Image by Author)
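In LaTeX, that pithy statement is presumably the one-liner below:

```latex
\operatorname*{plim}_{n \to \infty} \bar{X}_n = \mu
```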

Or simply as:

(Image by Author)

The brevity of the notation is not in the least surprising. Mathematicians are drawn to brevity like bees to nectar. When it comes to conveying profound truths, mathematics may well be the most ink-efficient field there is. And within this efficiency-obsessed field, plim occupies a podium position. You’ll struggle to unearth a concept as profound as plim expressed in a smaller amount of ink, or electrons.

But struggle no more. If the laconic beauty of plim left you wanting more, here’s another, possibly even more efficient, notation that conveys the same meaning as plim:

The weak Law of Large Numbers expressed using even less ink (Image by Author)
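That notation is, presumably, the convergence-in-probability arrow, which in LaTeX (with amsmath) reads:

```latex
\bar{X}_n \xrightarrow{\ p\ } \mu \quad \text{as } n \to \infty
```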

At the top of this article, I mentioned that the weak Law of Large Numbers is noteworthy as much for what it doesn’t say as for what it does say. Let me explain what I mean by that. The weak LLN is often misinterpreted to mean that as the sample size increases, the sample mean approaches the population mean, or various generalizations of that idea. As we’ll see, such ideas about the weak LLN have no attachment to reality.

In fact, let’s bust a couple of myths about the weak LLN right away.

MYTH #1: As the sample size grows, the sample mean tends to the population mean.

This is quite possibly the most frequent misinterpretation of the weak LLN. However, the weak LLN makes no such assertion. To see why, consider the following situation: you have managed to get your arms around a really large sample. While you gleefully admire your achievement, you should also ask yourself the following questions: Just because your sample is large, must it also be well-balanced? What’s stopping nature from sucker punching you with a giant sample that contains an equally giant amount of bias? The answer is absolutely nothing! In fact, isn’t that what happened to Guildenstern with his sequence of 92 Heads? It was, after all, a completely random sample! If the sample just so happens to carry a large bias, then despite the large sample size, the bias will blast the sample mean away to a point that is far from the true population value. Conversely, a small sample can prove to be exquisitely well-balanced. The point is, as the sample size increases, the sample mean isn’t guaranteed to dutifully advance toward the population mean. Nature doesn’t provide such unnecessary guarantees.

MYTH #2: As the sample size increases, practically everything about the sample (its median, its variance, its standard deviation) converges to the corresponding population values.

This sentence is two myths bundled into one easy-to-carry package. Firstly, the weak LLN postulates a convergence in probability, not a convergence in value. Secondly, the weak LLN applies to the convergence in probability of only the sample mean, not any other statistic. The weak LLN doesn’t address the convergence of other measures such as the median, variance, or standard deviation.

It’s one thing to state the weak LLN, and even to demonstrate how it works using real-world data. But how can you be sure that it always works? Are there circumstances in which it will play spoilsport, situations in which the sample mean simply doesn’t converge in probability to the population value? To know that, you need to prove the weak LLN and, in doing so, precisely define the conditions under which it applies.

It so happens that the weak LLN has a deliciously mouth-watering proof that uses, as one of its ingredients, the endlessly tantalizing Chebyshev’s Inequality. If that whets your appetite, stay tuned for my next article on the proof of the weak Law of Large Numbers.

It would be impolite to take leave of this topic without assuaging our friend Guildenstern’s worries. Let’s develop an appreciation for just how unquestionably unlikely a result he experienced. We’ll simulate the act of tossing 92 unbiased coins using a pseudo-random generator. Heads will be encoded as 1 and Tails as 0. We’ll record the mean value of the 92 outcomes. The mean value is the fraction of times that the coin came up Heads. We’ll repeat this experiment ten thousand times to obtain ten thousand means of 92 coin tosses, and we’ll plot their frequency distribution. After completing this exercise, we’ll get the following kind of histogram plot:
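Here’s a minimal Python sketch of this simulation (matplotlib is assumed only for the plotting step):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# 10,000 experiments, each tossing 92 fair coins (Heads = 1, Tails = 0)
tosses = rng.integers(0, 2, size=(10_000, 92))

# The mean of each row is the fraction of Heads in that experiment
sample_means = tosses.mean(axis=1)

# Frequency distribution of the 10,000 sample means
plt.hist(sample_means, bins=30)
plt.xlabel("Fraction of Heads in 92 tosses")
plt.ylabel("Frequency")
plt.title("Sample means of 10,000 runs of 92 coin tosses")
plt.show()
```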

A histogram of the sample means of 10,000 samples (Image by Author)

We see that the bulk of the sample means are grouped around the population mean of 0.5. Guildenstern’s result, getting 92 Heads in a row, is an exceptionally unlikely outcome. Therefore, the frequency of this outcome will be vanishingly small. But contrary to Guildenstern’s fears, there is nothing unnatural about the outcome, and the laws of probability continue to operate with their usual gusto. Guildenstern’s outcome is simply lurking in the distant regions of the right tail of the plot, waiting with infinite patience to pounce upon some luckless coin-flipper whose only mistake was to be unimaginably unlucky.
