DyingLoveGrape.


The Law of Large Numbers and Games.


Introduction.

Half of this post will be the theoretical basis for what I will discuss, and the other half will be running some trials via Python and plotting some "examples." I am not a probabilist, nor am I a statistician; consequently, there may be errors in this post. The main point I want to make is: the law of large numbers only works when you have a large number of trials. As obvious as this is, I feel that there are a number of places where this idea is misused, especially when expected value or standard deviation is used in an essential way while the number of trials is small.

The Law Of Large Numbers.

Loosely, the law of large numbers states that the averages we observe only match the probabilities we expect as the number of times you do an experiment gets "large" (remember these quotes, they will be important!). Specifically, the (strong) law of large numbers states

Theorem (Strong Law of Large Numbers). Let $\{X_{i}\}_{i\in {\mathbb N}}$ be independent and identically distributed integrable random variables (if this doesn’t mean anything to you, think about die rolls) with $E(X_{i}) = \mu$, and define $\overline{X_{n}} = \dfrac{1}{n}(X_{1}+\cdots + X_{n})$. Then $\mbox{P}(\lim_{n\to\infty}\overline{X_{n}} = \mu) = 1$.

This last line means: with probability 1, the average $\overline{X_{n}} = \frac{1}{n}(X_{1} + \cdots + X_{n})$ will approach the expected value $\mu$ as $n\to \infty$. The idea that $n\to\infty$ is required is, in a nutshell, the main point of this post.

This is, of course, widely applied. When you flip a fair coin, you expect the probability of getting heads to be $\frac{1}{2}$; indeed, this is why coin flipping is so popular: each side has an equal chance of coming up. This idea, combined with human psychology, can get a bit sticky: if heads has come up 20 times in a row, would you expect it to come up again? Would you expect that it was "time for tails to come up" after that long string? This is the Gambler's Fallacy, an interesting topic which we will not get into in this post, but which is important in its own right. The problem is that, with a small number of flips, the observed proportion \[\mbox{P}(X = \mbox{Heads}) = \frac{\mbox{Number of Heads We Get}}{\mbox{Total Coin Flips}}\] need not equal $\frac{1}{2}$ and, indeed, can be quite far away from $\frac{1}{2}$.

Here is an illustration. I've let Mathematica generate 0's and 1's (meaning "tails" and "heads" respectively) at random with equal probability. If we allow for 5 flips in one "game", the probability of heads (total number of heads over total number of flips) is essentially random; I've played ten games and graphed their probabilities below:

As we see, we don’t really get anywhere near one half in some of these games. If we allow for 10 flips for each game, then we get the following:

This is nicer, because it shows that, in each game, the proportion of heads gets, on average, pretty close to 0.5. Only a few games are "far away" from 0.5; indeed, we expect such deviation. Let's look at 100 coin flips for each game:

This shows that most of our games will hover around having an average of 0.5 — what we would expect.
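For readers who want to try this themselves, here is a minimal sketch of the kind of simulation behind these plots. It uses Python's random module rather than the code that actually generated the figures, and the function name is my own:

```python
import random

def heads_proportion(num_flips):
    """Flip a fair coin num_flips times and return the fraction of heads."""
    flips = [random.randint(0, 1) for _ in range(num_flips)]  # 1 = heads, 0 = tails
    return sum(flips) / num_flips

# Ten "games" of 5 flips each: the proportions jump around quite a bit.
print([heads_proportion(5) for _ in range(10)])

# Ten games of 100 flips each: the proportions hover much closer to 0.5.
print([heads_proportion(100) for _ in range(10)])
```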

How Large is Large?

There is a subtle (well, maybe not so subtle) point in the law of large numbers: we need a large number of trials to get close to the true expected value. But...how large is large?

At this point we could use a number of famous inequalities and theorems to see what an optimal number of trials would be (or at least a nice range of numbers), but I think it would be more instructive to use some real generated data to show the reader what "large" looks like in some cases.

Here is what we will do. I will set up an experiment consisting of a sequence of coin flips. Then we will measure how long it takes for the running proportion of heads to get within a certain range of the expected value. That is, we will flip a number of times, count the number of heads, and see how long it takes us before we get within some $\epsilon$ of 0.5 (that is, we will look at a tiny interval around 0.5).

We need to be careful here. For example, if we flip heads the first time and tails the second time, this will give us an average of 0.5; but, of course, this is not what we want: we know that this average will quickly deviate from 0.5 and then slowly come back to the expected value. Here, we flip a coin and then plot the average value (the first time, it's just the value of the coin). Then we flip it again and plot the average value (the first value plus this value, divided by two). We continue to flip and plot the running averages. The point is to see if the averages converge quickly to the expected value of 0.5. Here's a plot of the running averages for the first 20 flips.

Note that this wildly fluctuates in the beginning, going from $0\%$ up to around $50\%$, then doing little jumps and drops between $48\%$ and $60\%$. As we increase the number of flips (using the same data as the previous plot),

We see that this goes down for a while, passes by the $50\%$ mark, and continues onwards. Note that, though we've flipped 50 times, we still have more "tails" than "heads" (which weights the data and pulls the average toward $0\%$), though we'd expect them to be roughly equal "after a while." Here is a chart for the first 1000 flips.

Notice that at the very beginning we have the erratic behavior we've seen before: this is, of course, because the law of large numbers hasn't "kicked in" yet; the average value fluctuates radically with each coin flip. As we go onwards we see that the behavior stops being so erratic (look how, for example, the first part looks "sharp"; this is because it goes up and down quickly) and begins to smooth out. Not only this, but it begins to approach its limiting probability of 0.5. Notice that after 200 flips we begin to get close to the limiting probability: we stay within about $\pm 0.05$ of 0.5 after around 400-or-so flips. This is what we'd expect.
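Here is a rough sketch of the running-average experiment described above (not the original plotting code; printing stands in for the plots, and the function name is mine):

```python
import random

def running_averages(num_flips):
    """Return the running average of heads after each of num_flips fair coin flips."""
    averages, heads_so_far = [], 0
    for n in range(1, num_flips + 1):
        heads_so_far += random.randint(0, 1)  # 1 = heads, 0 = tails
        averages.append(heads_so_far / n)
    return averages

averages = running_averages(1000)
print(averages[:20])   # early values fluctuate wildly
print(averages[-5:])   # late values usually sit near 0.5
```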

A similar thing happens when we roll dice, though the experiment is slightly more interesting. For a 4-sided die (henceforth called a "D4") the expected value is equal to $\frac{(1 + 2 + 3 + 4)}{4} = 2.5$. We do the exact same experiment as before, and record the values we obtain for the first 10 rolls:

We see that, despite eventually "coming back down" at the end, the average of the rolls spends a lot of time above the expected value. Other iterations of this experiment have given other wild behaviors: staying near 1, fluctuating wildly between 2 and 4, staying at 3, and so forth. Let's see what happens as we continue rolling 40 more times:

One notices that we are now "closer" to the expected value: the average is "around" 2.5, within 0.15 on either side. Let's add some more rolls until we get to 200:

Whoops. We're still not quite at 2.5, though we are consistently within 0.15 of it near the end. Maybe if we add some more rolls?

We're getting closer. In the last graph, we were within 0.1 of the expected value. Now we are getting within 0.05 of the expected value towards the 800 mark. In general, it looks like around 100 rolls are "good enough" to lie within 0.5 of the expected value (though I did get a few exceptions when running trials). Finally, look at this last figure:

We see that after a significant number of rolls (around 6500) we are finally consistently within 0.02 or so of our expected value. This is pretty close if you only need an error of 0.02, but 6500 is a pretty large number of trials: if this were an experiment where we needed to hit a small particle with another small particle, for example, it might take significantly more than 6500 tries to get within a proper range, since an error of 0.02 may be extremely large in that setting. Zooming into the end (starting with the 6000th trial at $x = 0$),

we notice a bunch of little hills and valleys, but nothing too extreme. Notice that even after 6000 trials we're "pretty close" but, depending on the precision one needs, we may not be anywhere near the expected value. We'd need more trials for that (and, to test that out, I'd need a faster computer).
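The die version of the experiment only changes the random draw. A minimal sketch, assuming a fair D4 and, again, using my own function names:

```python
import random

def running_die_averages(num_rolls, sides=4):
    """Running average of rolls of a fair die with the given number of sides."""
    averages, total = [], 0
    for n in range(1, num_rolls + 1):
        total += random.randint(1, sides)
        averages.append(total / n)
    return averages

# For a D4 the expected value is 2.5: early averages wander, later ones settle.
print(running_die_averages(10))         # first 10 rolls are often far from 2.5
print(running_die_averages(10000)[-1])  # after many rolls we are usually quite close
```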

Rolling two D4s does a similar thing, though the sum seems to converge to its expected value (of 5) more quickly. For example, here are the first 100 rolls:

Notice, though, that before 40 rolls the average is erratic and strange. Even after 60 rolls, the average is only within around 0.1 of the expected value. For 1000 rolls,

we notice that we get somewhat closer to the expected value, getting within about 0.1 after around 200 rolls. If we take this to the extreme 10,000 rolls, we obtain a much nicer plot (starting with roll 9800 at $x=0$):

For most of the trials I ran, it took around 6000 rolls to get within 0.01 of the expected value. Of course, we'd expect to need a large number of trials like this.
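One way to put a number on "how long it takes" is to record the last roll at which the running average strays outside a given tolerance. Here is a rough sketch for the two-D4 experiment; the cutoff varies a lot from run to run, so treat any single output as anecdotal:

```python
import random

def last_roll_outside(epsilon, num_rolls=20000):
    """Last roll at which the running average of two fair D4s is farther than
    epsilon from the expected value of 5 (returns 0 if it never strays)."""
    total, last_outside = 0, 0
    for n in range(1, num_rolls + 1):
        total += random.randint(1, 4) + random.randint(1, 4)
        if abs(total / n - 5) > epsilon:
            last_outside = n
    return last_outside

print(last_roll_outside(0.1))    # often a few hundred rolls
print(last_roll_outside(0.01))   # often several thousand rolls
```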

Despite the handwaving and "experimental" data above, statistics gives us some ways of telling "how much error" is going to happen on average. Many of these estimates are based on $z$- or $t$-values and the standard deviation and, given a large enough $n$, these work quite well. Even for smaller $n$, we sometimes get a relatively nice estimate using $t$-values. A lot of emphasis is placed on the standard deviation, but how well does standard deviation measure the average deviation of our experiments for not-so-large $n$?

Standard Deviation.

The standard deviation is a measure which essentially tells us, on average, how much our data deviates from the mean. In addition, just as the expected value has the sample average as its counterpart, there are two different "kinds" of standard deviation: the population and the sample standard deviation. Of course, the job of the latter is usually to approximate the former. Here's how to compute these:

Population Standard Deviation: If we have data for a population, we will also have $\mbox{E}(X) = \mu$ the expected value for the population. Because it is a population, this $\mbox{E}(X)$ will be exact: it is not approximating anything. The standard deviation is given by $\sqrt{\mbox{E}(X^{2}) - \mbox{E}(X)^{2}}$, the square root of the variance.

Sample Standard Deviation: For this one we do not know what our expected value of the population is, and, hence, we have only an approximation of that average given by the average $\bar{x}$. We calculate the sample standard deviation in the following way: \[s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}.\]

Note that we’ve used the factor $n - 1$ instead of $n$ because of Bessel’s Correction.
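A small sketch of the two computations (the helper names are mine): the population formula is applied to a full list of equally likely outcomes, and the sample formula uses Bessel's correction.

```python
import math

def population_std(outcomes):
    """Population standard deviation of equally likely outcomes:
    sqrt(E(X^2) - E(X)^2)."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    mean_of_squares = sum(x * x for x in outcomes) / n
    return math.sqrt(mean_of_squares - mean ** 2)

def sample_std(data):
    """Sample standard deviation with Bessel's correction (divide by n - 1)."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
```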

Suppose we have a 6-sided die (henceforth, "D6"). The expected value $\mbox{E}(X) = 3.5$ is easily calculated; the value $\mbox{E}(X^{2}) = 15.1667$ can be calculated in a similar way. These two values give us that the population standard deviation is equal to $\sigma = \sqrt{2.91667} = 1.70783$.
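These D6 values are easy to check directly; a quick worked computation:

```python
faces = [1, 2, 3, 4, 5, 6]
e_x = sum(faces) / 6                  # E(X)   = 3.5
e_x2 = sum(f * f for f in faces) / 6  # E(X^2) = 15.1666...
print((e_x2 - e_x ** 2) ** 0.5)       # sigma  = 1.70782...
```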

In a real test, how long does it take until we get close to this standard deviation? We do some trial rolls and plot the sample standard deviation each time. For 20 rolls,

Notice that our sample standard deviation spends a lot of time meandering back and forth, and gets only to within about 0.2 of the population standard deviation (1.70783). Let's do the same for 100 rolls,

We see that this eventually gets to within about 0.1 of the population standard deviation after 80 or so rolls. After 250 rolls, below we see that the sample standard deviation gets quite close to the population standard deviation (within about 0.01):
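A sketch of this experiment: roll a D6 repeatedly and track the sample standard deviation of the rolls so far. This version leans on the standard library's statistics.stdev (which applies Bessel's correction) rather than the code used for the plots.

```python
import random
import statistics

def running_sample_stds(num_rolls, sides=6):
    """Sample standard deviation of the first n rolls, for n = 2, ..., num_rolls."""
    rolls = [random.randint(1, sides)]
    stds = []
    for _ in range(num_rolls - 1):
        rolls.append(random.randint(1, sides))
        stds.append(statistics.stdev(rolls))  # Bessel-corrected sample std
    return stds

stds = running_sample_stds(250)
print(stds[:10])   # early values meander around the population value 1.70783
print(stds[-1])    # by 250 rolls we are usually quite close
```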

Let's do another example which isn't so dull. This time, we will roll two D6's. One can calculate the expected value and so forth, and from this we find that the standard deviation is $\sigma \approx 2.415$. As usual, we expect the first 10 rolls to not give too nice of a graph:

What about after 100 rolls?

We see that the standard deviation is still not all that close to the population standard deviation even after 100 rolls, though it has stabilized within about 0.15 below it. What about 300 rolls?

Look at this behavior! Interesting. Eventually, the sample standard deviation will become closer and closer to the population standard deviation — though, depending on the precision you need, it might take a while longer.
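The value quoted above can be checked directly, since the variance of a sum of independent dice is the sum of their variances; a quick check:

```python
import math

# Variance of one fair D6: E(X^2) - E(X)^2 = 91/6 - 3.5**2 = 35/12
var_d6 = sum(f * f for f in range(1, 7)) / 6 - 3.5 ** 2
print(math.sqrt(2 * var_d6))  # population sigma for two D6s: about 2.415
```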

I should note that if one does not need much precision for one's analysis, then one can estimate expected values and standard deviations reasonably well by doing experiments similar to the ones we've done above. For example, suppose we want just a "ballpark figure" for the (population) standard deviation of rolling two 20-sided dice:

Notice that the values seem to stabilize near 8.5 or so. The actual standard deviation is 8.15475 which, depending on the analysis one is doing, might be "close enough."
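A ballpark estimate like this takes only a few lines; here is a rough sketch (the sample value will wobble around the true 8.15475 from run to run):

```python
import random

# Sample standard deviation (with Bessel's correction) of simulated two-D20 sums.
rolls = [random.randint(1, 20) + random.randint(1, 20) for _ in range(500)]
mean = sum(rolls) / len(rolls)
s = (sum((r - mean) ** 2 for r in rolls) / (len(rolls) - 1)) ** 0.5
print(s)  # typically lands somewhere around 8, near the true 8.15475
```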

The Empirical Rule and Common Mistakes.

One rule which is cited frequently is the

Empirical Rule: For a normal distribution, approximately $68\%$ of the data lies within one standard deviation of the mean, $95\%$ within two standard deviations, and $99.7\%$ within three standard deviations.

In many cases (especially when $n$ is large enough and the data is approximately normal) this rule works quite well. However, when $n$ is not-so-large (and I know I'm beating this point to death), even if the data is approximately normal, the estimate is often not-so-great. Oftentimes one will see someone claiming that $95\%$ of their data lies within two standard deviations of the mean, which may be quite inaccurate. In the case of small $n$ one can attempt to use the $t$-distribution (with $n-1$ degrees of freedom), but this will give a much wider interval in an attempt to "capture" $95\%$ of the data.
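For a sense of how much wider that interval is, here are the 95% multipliers themselves (a small sketch using SciPy, which the rest of the post does not rely on):

```python
from scipy.stats import norm, t

# The normal multiplier is about 1.96 (the empirical rule's "two standard deviations");
# the t multiplier for small samples is noticeably larger.
print(norm.ppf(0.975))             # about 1.96
for n in (5, 10, 30):
    print(n, t.ppf(0.975, n - 1))  # df = n - 1; about 2.776 for n = 5
```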

The Point?

You're probably sick of reading it by now, but the point is this: when you need to apply the law of large numbers make sure that you have a large enough number of trials. In particular, if you are using expected value and standard deviation to make crucial decisions about something, make sure that the number of trials conducted is large enough for these to be accurate representations of the population.

It's a bit of a strange feeling to go in reverse here (usually one uses a sample's data and attempts to reconstruct the population, but here we are using the population's data and noting that the sample may not always represent it accurately), but it is important to note, especially since I've seen these concepts used time and time again (e.g., in various game-making and game-analyzing forums) without any reference to the law of large numbers and the necessity for a large number of trials. This post, in particular, was primarily motivated by the dozens of individuals I saw on game-making and game-strategy forums making claims which, even if mathematically correct, were not entirely accurate or relevant. At the very least I hope that I've made an impression on the reader that sometimes one cannot just say "$n\geq 30$, so this is large enough."

In The Next Post...

In the next post, we will go over simple games and discuss which of these games can benefit from using expected value, and which games (which I call "Luck Based") are, regardless of strategy, essentially decided by the "luck" of the roller; that is, games where one particularly "good" or "bad" roll gives a player a large advantage or disadvantage from which it is difficult to recover. For example, consider a game where players take turns rolling a die and summing the numbers they have rolled. If they use a D6 and race to 7, then the first player rolling a 1 is at a severe disadvantage from which it is difficult to recover. On the other hand, if they raced to something like 1000, each roll would have less of an effect and, in general, each player's total would eventually approximate the number of turns times the expected value; this type of game is "less" based on luck. We will discuss some of these things in the next post.
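As a teaser, here is a rough sketch of that race game (a simulation of my own, not something from the next post, assuming players simply alternate D6 rolls): fixing the first player's opening roll shows how much a single roll matters at different targets.

```python
import random

def player_one_wins(target, first_roll):
    """Alternate D6 rolls until someone reaches the target total.
    Player one rolled first, and that opening roll is fixed."""
    totals = [first_roll, 0]
    player = 1  # player two rolls next
    while totals[0] < target and totals[1] < target:
        totals[player] += random.randint(1, 6)
        player = 1 - player
    return totals[0] >= target

def win_rate(target, first_roll, trials=5000):
    return sum(player_one_wins(target, first_roll) for _ in range(trials)) / trials

# Opening with a 1 hurts badly in a race to 7 but matters far less in a race to 1000.
print(win_rate(7, 1), win_rate(7, 6))
print(win_rate(1000, 1), win_rate(1000, 6))
```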