
Running Simulations

Obtaining random numbers

The quality of pseudo-random number generators varies widely, and some care is needed when you are preparing to run a simulation. First of all, never use rand() or any of the unofficial hacks from this course when you run simulations in C++. In Python, the functions random(), randint() and randrange() from the standard random module give good results.
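
For example, here is a minimal sketch of the Python functions mentioned above (the ranges are just for illustration):

from random import random, randint, randrange

print(random())        # a random float in [0, 1)
print(randint(1, 6))   # a random integer in [1, 6], both endpoints included
print(randrange(10))   # a random integer in [0, 10), i.e. one of 0, 1, ..., 9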

In C++ you must learn how to use the standard <random> library. Here is some code, adapted from cppreference.com, which will help you get started. It is more or less equivalent to the random functions in Python.


#include <iostream>
#include <random>

int main() {
    std::random_device rd;  // We obtain a "true" random seed
    std::mt19937 gen(rd()); // We seed the Mersenne twister with it

    //
    // Replace the distributions below with whatever distribution you are using
    //

    //
    // generates random floats in the range [0, 1)
    //
    std::uniform_real_distribution<> dis(0, 1);

    //
    // generates random integers in the range [0, 10]
    //
    std::uniform_int_distribution<> dis2(0, 10);

    //
    // There are many more distributions......see the std documentation
    //

    //
    // Use your distributions together with the generator like this:
    //
    std::cout << dis(gen) << " " << dis2(gen) << "\n";
}

Note! Never use the random device itself for generating your random numbers; it should only be used as a seed for another generator!

You could also use actual (instead of pseudo) random numbers from RANDOM.ORG.

Example simulation

Let $S$ be a finite simple sample space and $A$ an event. A simulation to calculate the probability $P(A)$ could look like the following (in C-like pseudocode):

count = 0;
for(int i = 0; i < num_simulations; i++) {
  choose random element p in S;
  if(inA(p)) count++;
}
probA = (float) count / (float) num_simulations;

The two things we need to implement in a concrete example are choosing a random element from $S$ and the predicate inA.

Let's calculate the probability of getting at least two $6$'s when we roll $4$ dice. In Python we could do

from random import randint

numRuns = 1000000
hits = 0
for i in range(numRuns):
    count = 0
    for j in range(4):
        if randint(1, 6) == 6:
            count += 1
    if count > 1:
        hits += 1

print(hits / numRuns)

When I ran the code I got $0.132286$. The correct answer is $0.131944\dots$
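
For reference, the exact value follows from the binomial distribution: the probability of at least two sixes in four rolls is $$1-\left(\frac{5}{6}\right)^4-4\cdot\frac{1}{6}\left(\frac{5}{6}\right)^3=\frac{171}{1296}\approx 0.131944.$$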

Monte Carlo integration

We will look at a method called Monte Carlo integration, through an example where we compute the area of some region $A$ of the plane. The first thing we do is to have a careful look at the problem and find a reasonable upper bound for the area. The simplest approach is to find a bounding box for the region, i.e. a rectangle $B$ which encloses $A$. The area of $B$ is easy to compute by hand, and if the ratio of the areas of $A$ and $B$ is not close to $0$ or $1$ (which would lead to longer simulation times), we can run a simulation to estimate the area of $A$. In C-like pseudocode:

count = 0;
for(int i = 0; i < num_simulations; i++) {
  choose random point p in B;
  if(inA(p)) count++;
}
areaA = areaB * (float) count / (float) num_simulations;

The only thing we need to implement in a specific problem is the predicate inA.
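
As a concrete illustration of this recipe (not worked out in the notes), here is a minimal Python sketch where $A$ is taken to be the unit disc and $B$ the bounding square $[-1,1]\times[-1,1]$, so the true area is $\pi\approx 3.14159$:

from random import uniform

def inA(x, y):
    # A is the unit disc: x^2 + y^2 <= 1
    return x * x + y * y <= 1

numRuns = 1000000
areaB = 4.0          # the bounding box is the square [-1, 1] x [-1, 1]
count = 0
for i in range(numRuns):
    # choose a random point uniformly in B
    x = uniform(-1, 1)
    y = uniform(-1, 1)
    if inA(x, y):
        count += 1

areaA = areaB * count / numRuns
print(areaA)         # should be close to pi

Here the ratio of the areas is $\pi/4\approx 0.785$, comfortably away from $0$ and $1$, so the estimate converges reasonably quickly.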

Accuracy

So now we know how to run simulations. But how many runs should we do? A simple method is to experiment and see how the answer converges as we increase the number of runs, but it is also useful to be able to calculate quick and reasonable bounds without experimenting.

Consider a sample space $S$ with an event $A$. Let $X$ be the random variable which takes the value $1$ on $A$ and $0$ on $\overline{A}$. All our simulation examples so far can be thought of as estimating the mean of $X$. Recall from the previous lecture that $X$ has the Bernoulli distribution, the mean of $X$ is $p=P(A)$ and the variance is $p(1-p)$.

Say we want to do $n$ simulation runs. Let $Y$ be the random variable $$Y=\frac{1}{n}\sum_{i=1}^{n} X_i$$ where $X_i$ is the result of the $i$'th random choice from $S$. The mean of $Y$ is $p$, and the variance is $\frac{1}{n}\mathrm{Var}(X)$, since the $X_i$ are independent. Using Chebyshev's inequality (and remembering that $p(1-p)\leq \frac{1}{4}$) we get that $$P(|Y-p|\geq t) \leq \frac{p(1-p)}{nt^2}\leq \frac{1}{4nt^2}.$$

Let us do some example calculations. Say that we want to calculate $P(A)$ up to $\pm 0.01$, in such a way that there is at most a $5$ per cent probability that the calculation is wrong. How many runs would we need? We want $$P(|Y-p|\geq 0.01)\leq 0.05$$ which we will achieve if $$\frac{1}{4n(0.01)^2}\leq 0.05$$ which we solve to get $$n>50000.$$

Say that we want to calculate $P(A)$ up to $\pm 0.01$, in such a way that there is at most a $0.5$ per cent probability that we are wrong. How many runs would we need? We want $$P(|Y-p|\geq 0.01)\leq 0.005$$ which we will achieve if $$\frac{1}{4n(0.01)^2}\leq 0.005$$ which we solve to get $$n>500000.$$

In general, this says that if you want to decrease the probability of mistake by one decimal digit, you will need to do $10$ times as many runs.

Say that we want to calculate $P(A)$ up to $\pm 0.001$, in such a way that there is at most a $5$ per cent probability that we are wrong. How many runs would we need? We want $$P(|Y-p|\geq 0.001)\leq 0.05$$ which we will achieve if $$\frac{1}{4n(0.001)^2}\leq 0.05$$ or $$n>5000000.$$

In general, this says that if you want to increase precision by one decimal digit, then you will need to do $100$ times as many runs.
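
To avoid redoing this algebra every time, the bound can be rearranged: to guarantee $P(|Y-p|\geq t)\leq \delta$ it suffices to take $$n\geq\frac{1}{4\delta t^2}.$$ Here is a small Python sketch of this rule of thumb (the function name runs_needed is just for illustration):

def runs_needed(t, delta):
    # Number of runs n that guarantees P(|Y - p| >= t) <= delta,
    # using the Chebyshev bound 1 / (4 * n * t^2) <= delta.
    return 1 / (4 * delta * t ** 2)

print(runs_needed(0.01, 0.05))    # approx. 50000
print(runs_needed(0.01, 0.005))   # approx. 500000
print(runs_needed(0.001, 0.05))   # approx. 5000000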

These bounds are conservative, and in practice you should expect faster convergence than this. Regardless, you will always face problems when simulating probabilities that are close to $0$ or $1$.

Rare events

So what can you do if you are simulating to compute a probability which is very close to $0$ or $1$? Let's consider a sketch of a simple example where we use conditional probability to simplify the simulation of a rare event:

What is the probability of drawing all aces if you draw $4$ cards from a deck of $52$?

The probability can be computed by hand as approximately $0.0000037$ (exactly $1/\binom{52}{4}=1/270725$). If you use the trivial approach to simulation you will be running for a long time before converging to this number.

So we use conditional probabilities to divide up the problem. First we simulate the probability that the first card drawn from the deck is an ace. This probability is much larger, approximately $0.0769$, so the simulation converges faster. We then simulate the probability that the second card is an ace given that the first card is an ace, the probability that the third card is an ace given that the first two cards are aces, and the probability that the fourth card is an ace given that the first three cards are aces.

Let $A_i$ be the event that the first $i$ cards are aces. We have found values for

  1. $P(A_1)$
  2. $P(A_2 | A_1)$
  3. $P(A_3 | A_2)$
  4. $P(A_4 | A_3)$
and can compute our required probability as $$P = P(A_1)\cdot P(A_2 | A_1)\cdot P(A_3 | A_2)\cdot P(A_4 | A_3).$$

This example is an illustration of a method we can call splitting. The general setup is a rare event $A$ and $n$ events $A_i$, where $$A_1\supseteq A_{2} \supseteq \cdots \supseteq A_n = A$$ such that $P(A_1)$ and each of the conditional probabilities $P(A_{i+1}|A_i)$ are not small. We then have $$P(A)=P(A_1)\cdot P(A_2|A_1)\cdots P(A_{n}|A_{n-1}).$$
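
As a sketch (not from the notes) of how the aces example could be coded, each conditional probability $P(A_{i+1}\mid A_i)$ is simulated separately: given that $i$ aces have already been drawn, we draw one more card from the remaining $52-i$ cards (of which $4-i$ are aces) and check whether it is an ace. The helper name simulate_conditional is hypothetical.

from random import randrange

def simulate_conditional(aces_already_drawn, numRuns):
    # Given that the first `aces_already_drawn` cards were aces, estimate the
    # probability that the next card drawn is also an ace.  The remaining deck
    # has 52 - aces_already_drawn cards, of which 4 - aces_already_drawn are aces.
    deck_size = 52 - aces_already_drawn
    aces_left = 4 - aces_already_drawn
    hits = 0
    for _ in range(numRuns):
        # draw one card uniformly at random from the remaining deck;
        # cards 0 .. aces_left - 1 represent the aces
        if randrange(deck_size) < aces_left:
            hits += 1
    return hits / numRuns

numRuns = 1000000
p = 1.0
for i in range(4):
    p *= simulate_conditional(i, numRuns)   # P(A_1), P(A_2|A_1), P(A_3|A_2), P(A_4|A_3)

print(p)   # should be close to 1/270725, i.e. about 0.0000037

Each of the four estimated factors lies roughly between $0.02$ and $0.08$, so each one converges quickly; multiplying them together gives an estimate of the rare probability.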
