Wednesday, January 23, 2013

The Birthday Paradox

Imagine that you are sitting in a classroom with 22 other people. How likely is it that two people in this room share a birthday? As it turns out, this probability exceeds 50%; that is, more than half the time, two people in a room of 23 will share a birthday. This result defies human intuition so reliably that it is famously known as the birthday paradox. In today's post, we will derive this probability using the probabilistic techniques established in previous posts (see our post on permutation statistics for an overview of these techniques).

Let's start by defining $B_1,B_2,\dots,B_n$, where $B_i$ is a random variable for the birthday of person $i$. We will assume that the $B_i$ are independent and identically distributed, each taking values in $1,2,\dots,365$ (not a leap year). A collision is a pair $B_i,B_j$ for which $B_i = B_j$ (and $i \neq j$). Let $X$ be a random variable for the number of collisions. To resolve the birthday paradox, we want to know, as a function of $n$, the probability that $X$ is at least 1.
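Before deriving anything, it is worth sanity-checking the claim empirically. Below is a minimal Monte Carlo sketch (the function name and trial count are our own choices) that estimates $P(X \geq 1)$ under the uniform assumption:

```python
import random

def collision_probability(n, m=365, trials=100_000):
    """Estimate P(X >= 1): the chance that n uniform birthdays contain a duplicate."""
    hits = 0
    for _ in range(trials):
        birthdays = [random.randrange(m) for _ in range(n)]
        if len(set(birthdays)) < n:  # a repeated value means at least one collision
            hits += 1
    return hits / trials

print(collision_probability(23))  # typically prints a value near 0.507
```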

Let's first examine the expectation of $X$. We can compute $X$ as a sum of indicator variables for each possible collision, from which it follows that
\[E[X] = \sum_{i < j} P(B_i = B_j), \] by linearity of expectation. If we assume that the birthdays are uniformly distributed, it follows that \[E[X] = \frac{1}{365}{n \choose 2}\] as $P(B_i = B_j) = \frac{1}{365}$ for each $i \neq j$. For what value of $n$ might we expect at least one collision? From the equation for $E[X]$, it follows that $E[X] \geq 1$ when $n \geq 28$. This analysis does not answer the question we posed, but it does lend some intuition as to why the claim is true. The idea is that as $n$ increases, the number of ways a collision might occur grows at a rate of $O(n^2)$. This quadratic growth tells us that we can start to expect a collision when $n$ exceeds (some constant factor times) $\sqrt{m}$, where $m$ is the number of possible outcomes (in this case $m = 365$).
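As a quick check of this threshold, a few lines suffice to compute $E[X]$ and locate the smallest $n$ for which it reaches 1:

```python
from math import comb

def expected_collisions(n, m=365):
    """E[X] = C(n, 2) / m under the uniform assumption."""
    return comb(n, 2) / m

# Smallest n with E[X] >= 1, i.e., n(n-1) >= 2 * 365 = 730:
n = next(n for n in range(2, 100) if expected_collisions(n) >= 1)
print(n, expected_collisions(n))  # 28 1.0356...
```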

Let's take a step back and try to better estimate $P(X \geq 1)$. We will generalize the problem slightly by assuming that there are $m$ possible outcomes and maintain the assumption that each is equally likely. Let $Y_i$ be the event that the $i$-th outcome (the $i$-th person's birthday) does not collide with any prior outcomes. It then follows that the probability of no collisions is
\[ P(X = 0) = P(Y_1 \cap Y_2 \cap Y_3 \cap \dots \cap Y_n), \] which, by the chain rule of probability, expands to
\[P(Y_1)P(Y_2 | Y_1)P(Y_3 | Y_1,Y_2) \dots P(Y_n | Y_1,Y_2,\dots,Y_{n-1}),\] which is simply
\[1\left(1-\frac{1}{m}\right)\left(1-\frac{2}{m}\right)\dots \left(1 - \frac{n-1}{m}\right)\] since the $i$-th outcome need only avoid the $i-1$ distinct outcomes already taken out of the $m$ possibilities.
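This product is exact and easy to evaluate, so we can confirm the headline claim directly. A short sketch (the helper name is ours):

```python
def p_no_collision(n, m=365):
    """Exact P(X = 0): the product (1 - 1/m)(1 - 2/m)...(1 - (n-1)/m)."""
    p = 1.0
    for i in range(n):
        p *= 1 - i / m
    return p

# Smallest n at which a collision is more likely than not:
n = next(n for n in range(1, 100) if p_no_collision(n) <= 0.5)
print(n, 1 - p_no_collision(n))  # 23 0.5072...
```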

Through some simple analysis, one can show the well-known inequality $1 + x \leq e^x$ for all $x$. Substituting this bound into the product above shows that
\[ P(X = 0) \leq e^{-\frac{1}{m}}e^{-\frac{2}{m}}\dots e^{-\frac{n-1}{m}}\] which simplifies to
\[ P(X = 0) \leq e^{-\frac{n(n-1)}{2m}} \] since $1 + 2 + \dots + (n-1) = \frac{n(n-1)}{2}$. To resolve the original question, we want to know how large $n$ must be for $P(X = 0) \leq 1/2$. Requiring the right-hand side to be at most $1/2$ and taking natural logarithms shows that it suffices to have
\[n(n-1) \geq \ln(4)m \] which, for $m = 365$, is satisfied when $n \geq 23$.
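Numerically, $\ln(4) \cdot 365 \approx 506.0$, and $23 \cdot 22 = 506$ just clears it. A quick sketch verifying both the threshold and the bound:

```python
from math import exp, log

def upper_bound(n, m=365):
    """The derived bound: P(X = 0) <= exp(-n(n-1) / (2m))."""
    return exp(-n * (n - 1) / (2 * m))

m = 365
print(log(4) * m)          # ~505.997, and 23 * 22 = 506
print(upper_bound(23, m))  # ~0.49999, at most 1/2 as claimed
```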

Initially, we stated that we only need to assume that the $B_i$ are independent and identically distributed for the birthday paradox to hold. This claim should be fairly intuitive: the less uniformly birthdays are distributed, the more likely a collision becomes. Moreover, it turns out that birthdays tend not to be uniformly distributed in practice.

[Figure: how birthdays are actually distributed across the year. Data source: NYTimes.com, Amitabh Chandra, Harvard University]

We can provide a more general analysis by assuming that the $i$-th birthday occurs with probability $p_i$. A collision-free assignment of birthdays is one in which all $n$ people are assigned distinct birthdays from the set of $m$ possible birthdays (assignments are ordered: it matters which person receives which birthday). Let $\mathcal{S}(m,n)$ be the set of all collision-free assignments. The probability of a particular assignment $S \in \mathcal{S}(m,n)$ is simply the product of the probabilities of the birthdays it uses. Thus, it follows that \[P(X = 0) = \sum_{S \in \mathcal{S}(m,n)} \prod_{i \in S} p_i.\] In a uniform distribution, $p_i = \frac{1}{m}$ and the probability reduces to
\[ \frac{m!}{(m-n)!} \left(\frac{1}{m}\right)^n \]
which agrees with our earlier analysis: distributing one factor of $\frac{1}{m}$ to each term of $\frac{m!}{(m-n)!} = m(m-1)\cdots(m-n+1)$ recovers the product $\left(1-\frac{1}{m}\right)\cdots\left(1-\frac{n-1}{m}\right)$. Hence, to prove that uniformity minimizes the likelihood of collisions, we want to show that \[ \sum_{S \in \mathcal{S}(m,n)} \prod_{i \in S} p_i \leq \frac{m!}{(m-n)!} \left(\frac{1}{m}\right)^n \] for any choice of probabilities $p_1, p_2, \dots, p_m$.
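For small $m$ and $n$, this inequality can be checked by brute force: enumerate every ordered collision-free assignment and sum the products. In the sketch below, the skewed distribution is an arbitrary example of our own, and it does indeed yield a smaller $P(X = 0)$ than the uniform one:

```python
from itertools import permutations
from math import prod

def p_no_collision_general(probs, n):
    """P(X = 0): sum over ordered tuples of n distinct days of the product
    of their probabilities. Brute force, feasible only for small m and n."""
    return sum(prod(probs[d] for d in days)
               for days in permutations(range(len(probs)), n))

m, n = 6, 3
uniform = [1 / m] * m
skewed = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]
print(p_no_collision_general(uniform, n))  # m!/(m-n)! * (1/m)^n = 120/216 ~ 0.556
print(p_no_collision_general(skewed, n))   # ~0.432, strictly smaller
```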

We can assume that $n \geq 2$ since the claim is trivial otherwise. We can then break up the summation using the following definitions:
  1. Let $\mathcal{A}$ be the set of all $S \in \mathcal{S}(m,n)$ where $1,2 \notin S$. 
  2. Let $\mathcal{B}$ be the set of all $S \in \mathcal{S}(m, n - 1)$ where $1,2 \notin S$. 
  3. Let $\mathcal{C}$ be the set of all $S \in \mathcal{S}(m, n - 2)$ where $1,2 \notin S$.
It follows that \[ P(X = 0) = \sum_{S \in \mathcal{A}} \prod_{i \in S} p_i  + n(p_1 + p_2) \sum_{S \in \mathcal{B}} \prod_{i \in S} p_i  + n(n-1)p_1p_2 \sum_{S \in \mathcal{C}} \prod_{i \in S} p_i . \] Observe that if we vary $p_1$ and $p_2$ alone while holding their sum fixed (so that we still have a valid probability distribution), every term in this sum is unchanged except the one containing $p_1p_2$. Subject to a fixed sum, the product $p_1p_2 \leq \left(\frac{p_1+p_2}{2}\right)^2$ is maximized when $p_1 = p_2$. It follows that $P(X = 0)$ can only be maximized when $p_1 = p_2$. Since days 1 and 2 were chosen arbitrarily, all the probabilities must be equal when $P(X = 0)$ is maximized; that is, the uniform distribution maximizes $P(X = 0)$.

This result may seem to be of little practical importance. It turns out, however, that the notion of collisions is of vital importance in computer science. In the study of data structures, hash tables work by storing elements into one of $m$ buckets. The data structure is more efficient when fewer elements are assigned to the same bucket (that is, when there are fewer collisions). How an element is mapped to a bucket is determined by what is called a hash function. By the above analysis, it follows that we prefer hash functions that distribute elements more uniformly. How one constructs such a hash function is a subject for another post...
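To make the connection concrete, here is a small sketch contrasting a deliberately skewed hash (a hypothetical one of our own, keyed on the first character) with Python's built-in hash over $m = 4096$ buckets:

```python
def bad_hash(s, m):
    """A deliberately skewed (hypothetical) hash: first character only."""
    return ord(s[0]) % m

def good_hash(s, m):
    """Python's built-in string hash, which spreads keys far more uniformly."""
    return hash(s) % m

def count_collisions(keys, h, m):
    """Number of colliding pairs: sum of C(k, 2) over the bucket loads k."""
    loads = {}
    for key in keys:
        b = h(key, m)
        loads[b] = loads.get(b, 0) + 1
    return sum(k * (k - 1) // 2 for k in loads.values())

keys = [f"user{i}" for i in range(1000)]  # every key starts with 'u'
m = 4096
print(count_collisions(keys, bad_hash, m))   # 499500: all keys share one bucket
print(count_collisions(keys, good_hash, m))  # near C(1000,2)/4096 ~ 122
```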
