Tuesday, September 10, 2013

The Sieve of Eratosthenes

In our last post, we introduced some basic concepts from number theory and, in particular, the concept of prime numbers. In today's post, we continue this foray by discussing some of the computational aspects of prime numbers.

Eratosthenes of Cyrene is often credited for inventing the discipline of geography. He invented the system of longitude and latitude and is famous for accurately predicting the surface area of the Earth during antiquities (as described in this post).

A portrait of Eratosthenes

Another vestigial mathematical development of Eratosthenes is his famous sieve. The sieve provides an elegant method for identifying prime numbers and serves as one of the earliest examples of number-theoretic algorithms.

We know from the prime number theorem that there is $O(\frac{n}{\log{n}})$ prime numbers from $1,2,\dots,n$, but can we find them? Is there a systematic way to identify all of the prime numbers from $1$ to $n$? The sieve of Eratosthenes solves precisely this task and does so efficiently.

We will describe the algorithm as though we are equipped with a pen and papyrus:
  1. Start by writing down all numbers from $2$ to $n$.
  2. The first number must be prime. Underline it, and cross off all multiples of this number.
  3. Repeat 2 starting at the first unmarked number (i.e. the first number that is neither crossed off nor underlined). Once all numbers are marked, the underlined numbers are exactly the primes from $1$ to $n$.
To illustrate the idea further, consider the following example where $n = 10$.
  1. 2, 3, 4, 5, 6, 7, 8, 9, 10; write down all numbers from 2 to 10.
  2. 2, 3, 4, 5, 6, 7, 8, 9, 10; underline 2, cross off its multiples (4, 6, 8, 10).
  3. 2, 3, 4, 5, 6, 7, 8, 9, 10; underline 3, cross off its multiples (6, 9).
  4. 2, 3, 4, 5, 6, 7, 8, 9, 10; underline 5, cross off its multiples (10).
  5. 23, 4, 5, 6, 7, 8, 9, 10; underline 7, cross off its multiples (n/a).
  6. The underlined numbers 2, 3, 5, and 7 are prime.
One considering the problem of outputting all prime numbers from $1$ to $n$ might have first taken a different approach. Another simple algorithm is the following:

# To output all primes from 1 to n
for i in 2..n do
  if no j in 2..i-1 divides i: output i

Though simple, this approach is far less efficient. For a given number $i$, checking whether it is divisible by any smaller number from $2$ to $i-1$ is expensive. That is, for each $i$ we perform $i-2$ division tests, giving a total of \[1 + 2 + 3 + \dots + (n - 3) + (n - 2) = 1/2(n-1)(n-2)\] division tests.

A shrewd observer might notice that we can modify the second loop to perform division tests for only $j = 2$ to $\lfloor \sqrt{i} \rfloor$. This optimization is justified since if there exists an $a$ and $b$ such that $ab = i$, then the smaller of $a$ and $b$ must be at most $\sqrt{i}$. Leveraging this optimization, the runtime is reduced from $O(n^2)$ to $O(n^\frac{3}{2})$.

How does this compare to the sieve of Eratosthenes? To better analyze the computation costs, we will rewrite the algorithm in pseudocode:

# To output all primes from 1 to n
let L = [2..n]
for i in L do:
  if i is not marked:
    output i
    for j in i,2i,3i,...,(n/i)i:
      mark j

Notice that the second loop is only entered for numbers $i$ that are prime and for a given number $i$, it makes $n/i$ iterations. Thus, the total runtime can be bounded by \[ \sum_{p}\frac{n}{p} \] which simplifies to \[n \sum_{p} \frac{1}{p}\] where both sums are over all primes $p \leq n$.

We can bound the sum $ \sum_{p} \frac{1}{p} $ by the harmonic series $ \sum_{i = 1}^n \frac{1}{i} $ which we know from a previous post is $O(\log{n})$. Thus, we have shown that the runtime of the sieve is $O(n\log{n})$, which is quite an improvement over the other algorithm's $O(n^\frac{3}{2})$. However, this analysis is not tight. It turns out that the sum is exponentially smaller; that is, by Euler's theorem on the divergence of the sum of the reciprocals of the primes, we have that $ \sum_{p} \frac{1}{p}$ is $O(\log{\log{n}})$, and thus, the total runtime is $O(n\log{\log{n}})$.

We leave you with a few exercises:
  1. Can you find a way to improve the sieve of Eratosthenes by avoiding "double marking" of numbers?
  2. We showed that the naive algorithm can be implemented using $O(n^\frac{3}{2})$ division tests, but how expensive are division tests? Come up with a way to test if $i$ divides $n$ in $O(\log{n})$ time.