# RSA and Modular Arithmetic

RSA is probably the most famous public key cryptosystem. (As is traditional in maths, RSA is named after three people who weren't the first to discover it.) RSA is usually explained by means of an example - something like this:

• Alice chooses two prime numbers, for example, $p = 7$ and $q = 11$.
• Alice computes $n$ $= pq$ $= 7 \times 11$ $= 77$, and $t$ $= (p-1)(q-1)$ $= 6 \times 10$ $= 60$.
• Alice finds a number $e = 13$ which is coprime to $t$, and its inverse $d$ $\equiv e^{-1}$ $\equiv 37 \pmod{t}$, e.g. using the Extended Euclidean algorithm.
• Alice publishes $n$ and $e$, but keeps $d$ private.
• Bob encrypts a message $m$ $= 24$ to send to Alice, by computing $c$ $\equiv m^e$ $\equiv 24^{13}$ $\equiv 52 \pmod{n}$. Bob sends the encrypted message $c$ to Alice.
• Alice decrypts $c$ by computing $c^d$ $\equiv 52^{37}$ $\equiv 24$ $\equiv m \pmod{n}$. Therefore Alice has recovered Bob's message.
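The arithmetic in this example can be reproduced in a few lines of Python. This is a sketch of the toy parameters above, not of real RSA; note that `pow` with a negative exponent computes a modular inverse (Python 3.8+):

```python
p, q = 7, 11
n, t = p * q, (p - 1) * (q - 1)   # n = 77, t = 60

e = 13                            # public exponent, coprime to t
d = pow(e, -1, t)                 # private exponent e^-1 mod t, i.e. 37

m = 24                            # Bob's message
c = pow(m, e, n)                  # encryption: 24^13 mod 77
print(c)                          # 52

print(pow(c, d, n))               # decryption: 52^37 mod 77 recovers 24
```

Real RSA uses primes hundreds of digits long, but the algebra is identical.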

But this is not an explanation; it's a demonstration. The student will not learn why RSA works simply by considering examples, and could be forgiven for imagining that RSA is a "black box" which seems to work although nobody is really sure why it should.

One argument could be that since $d$ is "$1/e \pmod{t}$", it makes sense that $c^d \equiv c^{1/e}$ is the $e$th root of $c$, and of course this would be $(m^e)^{1/e} \equiv m$, recovering the original message. This isn't really satisfying by itself, though, and leaves plenty unanswered:

1. Where did $t$ come from? Weren't we supposed to be working $\renewcommand{\mod}{\operatorname{mod} \,} \mod n$?
2. Why is it safe to publish $n$ and $e$?
3. How do we know that $x^{1/e} = \sqrt[e]{x}$ still works in modular arithmetic, or that we'll even get the right $e$th root? There could be more than one, after all, and they'd each encrypt to the same $c$.

Actually, RSA is very well-understood, and the mathematics behind it is far from impenetrable. But to start with, let's consider the simplest possible cryptosystem: the Caesar cipher.

The picture above shows almost everything there is to know about how the Caesar cipher works; in this case, the encryption process is "go forward $3$ steps". We could say the encryption key used is $e = 3$. What it doesn't show is what to do when we get to the end of the alphabet:

There we go - the alphabet cycles around on itself. Another advantage of this picture over the first is that it shows the "go forward $e$ steps" operation as the more fundamental "go forward a step" operation iterated $e$ times. This might seem to be a pointless observation, but it means the algebraic structure of the system (or "where the arrows go") does not depend on $e$. This fact is hidden in the first picture, because its arrows were specific to the case $e = 3$.

There are a few (equivalent) methods to decrypt a message encrypted this way:

A. Go backward $e$ steps.
B. Take another $26 - e$ steps forward.
C. Encrypt it another $25$ times.

Method C is a bit less obvious, but it follows from the fact that taking $e$ steps forward $26$ times is the same as taking $26$ steps forward $e$ times; of course, this is the same as doing nothing $e$ times. It's much more effort, so we'd use A or B in practice, but the fact that it works is important.
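All three methods are easy to check numerically; a quick sketch in Python, with $e = 3$ and the letters numbered $0$ to $25$:

```python
e = 3

def encrypt(x):
    """One Caesar encryption: go forward e steps, wrapping at 26."""
    return (x + e) % 26

m = 7                            # the letter H, say
c = encrypt(m)

back_a = (c - e) % 26            # method A: go backward e steps
back_b = (c + 26 - e) % 26       # method B: another 26 - e steps forward

back_c = c                       # method C: encrypt another 25 times
for _ in range(25):
    back_c = encrypt(back_c)

assert back_a == back_b == back_c == m
```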

The Caesar cipher is the simplest cryptosystem to use, and it's also the simplest to attack. But it's even more unsuitable as a public key cryptosystem: the information necessary to encrypt a message is sufficient to decrypt one. So Alice must keep $e$ secret, and Bob cannot send encrypted messages unless he is in on the secret.

RSA has more in common with the Caesar cipher than one might think. It is based on the same algebraic structure: a directed cycle. In mathematical notation, we could relabel the letters A to Z as the numbers $0$ to $25$, and the "go forward a step" operation could be $x \mapsto x+1 \pmod{26}$ - or "add $1$, but wrap around back to $0$ when you reach $26$".

The modulus $n$ in the RSA algorithm is chosen to be the product of two prime numbers. A prime number $p$ can only be written as $p = ab$ if either $a=1$ or $b=1$ (but not both).* Because multiplication gives us the number of items in a grid, a visual interpretation of this is that if $p$ items are arranged in a grid, they must form either a single row or a single column:

$7$ and $5$ are prime numbers; they can only form "trivial" grids.

$12$ is not a prime number; it can be made into a grid in several ways.

For our purposes, it's important that when there is only one way to form a non-trivial grid out of a number $n$, (i.e. $n=pq$ for two primes $p,q$), then finding this grid (i.e. finding $p$ and $q$) gets more and more difficult as $p$ and $q$ get larger. We have a few methods which are better than trial and error, but for prime numbers with more than a few hundred digits, the problem is believed to be unsolvable in practice - and not for lack of trying.

Prime factors don't appear from nowhere: if $p$ is a factor of $ab$ then $p$ must be a factor of either $a$ or $b$ (or both).* They also don't disappear; if $p$ is a factor of $a$, then $p$ is a factor of $ab$. Working $\mod p$, these give us the familiar rule $ab \equiv 0$ if and only if $a \equiv 0$ or $b \equiv 0 \pmod{p}$ - i.e. we can't multiply non-zero residues and get zero.

To illustrate this, we can partition the residues $\mod p$ into two sets which are closed under multiplication: if you multiply two residues of the same colour, the result will have that colour. Green means "no factor of $p$": the product of two such residues cannot have a factor of $p$, so it must be green too. Red means "has a factor of $p$": the only such residue $\mod p$ is $0$.

E.g. $3$ and $6$ are green, so $3 \times 6 \equiv 4 \pmod{7}$ is also green.
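For a small prime the zero-product rule can be checked exhaustively; a quick sketch for $p = 7$:

```python
p = 7
assert (3 * 6) % p == 4          # two green residues multiply to a green one
for a in range(p):
    for b in range(p):
        # ab ≡ 0 (mod p) exactly when a ≡ 0 or b ≡ 0:
        # non-zero residues can't multiply to zero
        assert ((a * b) % p == 0) == (a == 0 or b == 0)
```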

We are no longer talking about addition, but multiplication; the cyclic structure defined by the "add $1$" operation is gone. It is perhaps surprising, then, that the green set has another cyclic structure. Whatever prime $p$ we choose, there is always a primitive element $\omega$ for which the operation $x \mapsto \omega x \pmod{p}$ generates a cycle visiting all of the green residues:

E.g. $\omega \equiv 3 \pmod{7}$ produces this cyclic structure. $\omega \equiv 5$ also works.

The arrows now show the result of multiplying by $\omega$, rather than adding $1$. Rearranging the residues makes this clearer:

Note that, for example, multiplying by $2$ (forward two steps) is the same as multiplying by $3$ (forward a step) twice, as $3^2 \equiv 2 \pmod{7}$. If our step was $x \mapsto 2x \pmod{7}$ instead, we'd get this picture:

This operation produces two separate cycles instead. Importantly, they have the same length; in the language of group theory, they are the cosets of the subgroup generated by $2$. The cosets are always the same size, because you can shift one along to get any other. So the size of the green set must be a multiple of the length of a cycle; in this case, $6$ is a multiple of $3$. (This is Lagrange's theorem.) This matters because our earlier observation still holds: if we "go forward $e$ steps" $6$ times then we'll get back where we started, whatever $e$ is - even if the cycle length is not $6$. "Go forward $e$ steps $6$ times" is the same as "go forward $e$ steps $3$ times" $2$ times, or "do nothing" twice.
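The cycle structures for different choices of step can be computed directly. The helper below is our own illustration, fine for small primes:

```python
def cycles(step, p):
    """Partition the non-zero residues mod p into the cycles of x -> step * x."""
    remaining = set(range(1, p))
    found = []
    while remaining:
        x = min(remaining)
        cycle = []
        while x in remaining:
            remaining.remove(x)
            cycle.append(x)
            x = (step * x) % p
        found.append(cycle)
    return found

print(cycles(3, 7))   # [[1, 3, 2, 6, 4, 5]]: one cycle, so 3 is primitive
print(cycles(2, 7))   # [[1, 2, 4], [3, 6, 5]]: two cosets, each of length 3
```

As Lagrange's theorem predicts, every cycle length divides $6$, the size of the green set.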

Is multiplication $\mod p$ good enough for a public key cryptosystem? No. If a message is encrypted by "taking $e$ steps forward" (i.e. multiplying by $\omega^e \pmod{p}$), all three decryption methods still work:

A. Going backward $e$ steps: to undo multiplication by $\omega^e$, we can "divide" by $\omega^e$. This means finding a multiplicative inverse $(\omega^e)^{-1} \pmod{p}$, which can be calculated e.g. by the Extended Euclidean algorithm.
B. Taking another $p-1-e$ steps forward: because only one element is red, the length of the green cycle is $p-1$. In total we'll have made $e + (p-1-e) = p-1$ steps around the cycle, so we recover the original message.
C. Encrypting again $p-2$ times: again, this is the less obvious one, but we get $m \omega^e \cdots \omega^e$ $\equiv m \omega^{(p-1)e}$ $\equiv m \times 1^e$ $\equiv m \pmod{p}$, as $\omega^{p-1} \equiv 1 \pmod{p}$. This is Fermat's little theorem, but really it's the same logic as before: taking $e$ steps forward $p-1$ times is the same as taking $p-1$ steps forward $e$ times, which gets us back where we started.

Note that the information necessary to encrypt a message ($p$ and $\omega^e$) is again sufficient to decrypt it.
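The three decryption methods can again be verified numerically; a sketch with $p = 7$, $\omega = 3$ and $e = 2$:

```python
p, w, e = 7, 3, 2
step = pow(w, e, p)        # omega^e = 2: encryption is "go forward e steps"

m = 5
c = (m * step) % p         # encrypted message

# Method A: divide by omega^e, i.e. multiply by its inverse mod p.
assert (c * pow(step, -1, p)) % p == m
# Method B: take another p - 1 - e single steps (multiply by omega each time).
assert (c * pow(w, p - 1 - e, p)) % p == m
# Method C: encrypt another p - 2 times.
x = c
for _ in range(p - 2):
    x = (x * step) % p
assert x == m
```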

For RSA, we work $\mod pq$. In this context we have e.g. $pq \equiv 0$ but $p \not\equiv 0$ and $q \not\equiv 0 \pmod{pq}$, so the familiar rule about $0$ no longer applies; $p$ and $q$ (and their multiples) are zero divisors. Because prime factors can't just appear or disappear under multiplication, if residues $\mod pq$ have (or don't have) factors of $p$ (or $q$) then they can't lose or gain this factor by multiplication.** This means we can partition the residues into four sets closed under multiplication:

• Those with no factor of $p$ or $q$ (green),
• Those with a factor of $p$ but not $q$ (blue),
• Those with a factor of $q$ but not $p$ (yellow),
• Those with factors of both $p$ and $q$ - i.e. only $0$ (red).

In fact, by the Chinese remainder theorem, we can treat residues $\mod pq$ as pairs of independent residues $\mod p$ and $\mod q$; the zero divisors are the $\mod pq$ residues for which one of these is $0$ (blue and yellow). This can be demonstrated (but not explained) by arranging our $pq$ items into a grid, where multiplication can be done per row and per column:

For $p=7$, $q=11$. The residues $\mod 11$ form a cycle generated by $2$.

(As noted earlier, if $p$ and $q$ are large enough, then finding this grid is practically impossible for an attacker who only knows $n$.)

The residues $\mod p$ and $\mod q$ give co-ordinates in the space of residues $\mod pq$. When arranged according to these co-ordinates, the numbers seem random - but the Chinese remainder theorem guarantees us that multiplication acts on the co-ordinates independently.
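This independence is easy to check for $n = 77$; `coords` is our own helper name for the Chinese-remainder co-ordinates:

```python
p, q = 7, 11
n = p * q

def coords(x):
    """Chinese-remainder co-ordinates of a residue mod pq."""
    return (x % p, x % q)

a, b = 24, 65
prod = (a * b) % n         # multiply mod 77...
# ...and the co-ordinates multiply independently, mod 7 and mod 11
assert coords(prod) == ((coords(a)[0] * coords(b)[0]) % p,
                        (coords(a)[1] * coords(b)[1]) % q)
```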

Again, if two residues have the same colour, their product must have that colour. It's clear that the number of green residues is $t$ $= (p-1)(q-1)$, so that answers question 1. Let's restrict ourselves only to these - most of the residues are green***, and we can do as much multiplication as we want without leaving the green set. What does multiplication look like?

This picture shows the cycles generated by $24$ (left) and $65$ (right). $24 \equiv 3 \pmod{7}$ and $2 \pmod{11}$, so $x \mapsto 24x \pmod{77}$ has an effect $\mod 7$ of "take a step right", and the effect $\mod 11$ is "take a step down". The operation $x \mapsto 65x \pmod{77}$ is $2$ steps right and $5$ steps down. Because the structures $\mod 7$ and $\mod 11$ each cycle back around to $1$, taking a step right from the last column "wraps around" to the first column, and similarly once we reach the bottom of the grid, we wrap around to the top.

For clarity, only one cycle is shown in each picture (in dark green); the other residues (in light green) form their own cycles.**** Importantly, Lagrange's theorem still holds: the cycles are all shifted copies of each other, so they're all the same length, and this length must be a factor of the size of the green set, $t$.

This is almost good enough for a public key cryptosystem, because although we're doing multiplication $\mod n$, the cycle length is not $n$ - it's some factor of $t$ $= (p-1)(q-1)$. We can publish $n = pq$ and keep $t$ secret; the attacker can no longer go all the way around the cycle to decrypt a message, because he doesn't know how far that is. Finding $p$ and $q$ is practically impossible, so the attacker can't use them to find $t$.

In fact, an attacker knowing only $n$ cannot practically find $t$ any other way, either: since $n = pq$ and $n-t+1$ $= p+q$, the primes $p$ and $q$ are the solutions of the quadratic equation $x^2 - (n-t+1)x + n$ $= 0$, so anybody who found $t$ could find $p$ and $q$ by the quadratic formula - a feat which we assumed to be practically impossible. This answers question 2.
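In other words, finding $t$ is no easier than factoring $n$. A sketch of the recovery of $p$ and $q$ from $n$ and $t$ (the function name is ours):

```python
from math import isqrt

def primes_from_t(n, t):
    """Recover p and q from n = pq and t = (p-1)(q-1):
    they are the roots of x^2 - (n - t + 1)x + n = 0."""
    s = n - t + 1                  # s = p + q
    r = isqrt(s * s - 4 * n)       # the discriminant is a perfect square
    return (s - r) // 2, (s + r) // 2

print(primes_from_t(77, 60))       # (7, 11)
```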

Still, choosing a step like $x \mapsto 24x$ and iterating it $e$ times is not enough to encrypt a message. Although decryption methods B and C are now unavailable to an attacker (they would need to know the cycle length in order to go all the way round), method A still works - $24^{-1} \pmod{n}$ can be found without knowing $p$ and $q$.

What we need is a "one step forward" operation which can't be undone without knowing some secret. Undoing addition or multiplication is straightforward - students are expected to subtract and divide without the use of calculators, and these are only a little harder to do $\mod n$. Exponentiation, on the other hand, is hard to undo - finding $k$th roots or logarithms (i.e. undoing $x \mapsto x^k$ or $x \mapsto k^x$) for real numbers generally requires a calculator, and it turns out these are even harder to do $\mod n$ when $n$ is very large.

So the encryption process for RSA is modular exponentiation: a message $m$ is encrypted as $c$ $\equiv m^e \pmod{n}$. This means we "go forward $e$ steps", but each step is $x \mapsto mx \pmod{n}$, where $m$ is the message, and rather than starting at $m$ we start at $1$. Taking the earlier cycle generated by $m = 24$, and taking $e = 13$ steps forward:

$c \equiv m^e \equiv 24^{13} \equiv 52 \pmod{77}$

Now that each step multiplies by the message $m$ itself, rather than by some fixed public value, decryption method A is no longer available - we'd need to know $m$ in order to divide by it! So now the cryptosystem is secure - the attacker doesn't know $m$, so he can't find $m^{-1}$ to take steps backward; in fact, knowing $m$ is required even to do $x \mapsto mx$ to take a step forward. So an attacker cannot decrypt $c$ to find $m$.

There is a problem: how does the intended recipient do it? It appears to be impossible - without knowing $m$ already, taking steps in the cycle is impossible, even if we know the secrets $p$, $q$ and $t$.

However, this is not quite true. We don't know how to take one step, but $c \equiv m^e \pmod{n}$ is enough information to take $e$ steps; multiplying by $c$ takes us another $e$ steps forward. We could go $e$ steps backward from $c$ by dividing (i.e. multiplying by $c^{-1} \pmod{n}$), but this would recover the residue $1$, since that's where the cycle started - not $m$.

The solution is method C: take $e$ steps forward (i.e. multiply by $c$) again the correct number of times. What is the correct number? We don't know the cycle length, but we know $t = 60$ is a multiple of it; so $t$ steps forward is the same as doing nothing. We could do it another $t-1$ times as in the Caesar cipher, but this again gets us to the start of the cycle, $1$, rather than $m$ which is one step further.

If we go $e$ steps forward $d$ times, we'll arrive at the residue which is $ed$ steps into the cycle. Therefore, we'd like $ed$ $\equiv 1 \pmod{t}$, as taking $t$ steps forward some number of times does nothing, and then one more step gets us to $m$. Assuming $e = 13$ was a sensible choice (so this equation has a solution), we can find $d \equiv e^{-1} \pmod{t}$ by "division"; the Extended Euclidean algorithm yields $d = 37$. Therefore, if we start at $1$ and go $e$ steps forward (i.e. multiply by $c$) $d$ times, we'll get to $m$: this gives $m \equiv c^d$ $\equiv 52^{37}$ $\equiv 24 \pmod{n}$, recovering the original message.
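Decryption really is just "multiply by $c$, $d$ times"; a sketch with the toy numbers:

```python
n, t = 77, 60
e, c = 13, 52
d = pow(e, -1, t)        # d = 37, so e*d = 481 = 8*60 + 1

x = 1                    # start at the beginning of the cycle
for _ in range(d):
    x = (x * c) % n      # each multiplication by c is "e steps forward"

print(x)                 # 24: ed ≡ 1 (mod t) steps lands one step past the start
```

Of course, in practice we'd compute $c^d \bmod n$ by fast exponentiation rather than by $d$ separate multiplications.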

Actually, there is another way of interpreting this. Multiplication and exponentiation are hard to think about $\mod n$; by taking logarithms, we can turn them into addition and multiplication. In fact, we already took logarithms earlier, when we rearranged the numbers $\mod 7$:

Our "take a step forward" operation $x \mapsto 3x \pmod{7}$ becomes the much simpler $a \mapsto a+1 \pmod{6}$, where $a \equiv \log_3 x$. Similarly $b \equiv \log_2 x$ turns $x \mapsto 2x \pmod{11}$ into $b \mapsto b+1 \pmod{10}$. What about the $(\mod p, \mod q)$ co-ordinates from the Chinese remainder theorem?

$x \pmod{77}$ is represented as a vector $\mathbf{x}$ $= (\log_3 x, \log_2 x) \pmod{\mathbf{t}}$, where $\mathbf{t}$ $= (p-1,q-1)$. To go the other way, a vector $(a,b)$ represents the residue $45^a \times 57^b \pmod{77}$, since $45$ and $57$ represent "one step right" and "one step down" respectively.

In this co-ordinate system, multiplication $\mod pq$ becomes vector addition $\mod \mathbf{t}$, and exponentiation becomes scalar multiplication $\mod \mathbf{t}$. In particular, the encryption process $m \mapsto m^e \pmod{n}$ becomes $\mathbf{m} \mapsto e \mathbf{m} \pmod{\mathbf{t}}$, and the decryption process $c \mapsto c^d \pmod{n}$ becomes $\mathbf{c} \mapsto d \mathbf{c} \pmod{\mathbf{t}}$: from this perspective, encryption is multiplication by $e$, so decryption is simply division by $e$. Multiplicative inverses are always unique; this answers question 3. Now you know!
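This change of co-ordinates can be verified numerically. `dlog` below is a brute-force discrete logarithm - fine for tiny primes, hopeless at RSA sizes, which is the whole point:

```python
def dlog(base, x, p):
    """Brute-force discrete log: the k with base^k ≡ x (mod p)."""
    y, k = 1, 0
    while y != x:
        y, k = (y * base) % p, k + 1
    return k

def vec(x):
    """Log co-ordinates of a green residue mod 77: (log_3 mod 7, log_2 mod 11)."""
    return (dlog(3, x % 7, 7), dlog(2, x % 11, 11))

m, e = 24, 13
c = pow(m, e, 77)
a, b = vec(m)
# encryption m -> m^e mod 77 becomes scalar multiplication by e, mod (6, 10)
assert vec(c) == ((e * a) % 6, (e * b) % 10)
```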

*The first of these definitions of prime numbers is taught in schools, but the second is the proper definition; the first is actually the definition of irreducible. It so happens that prime and irreducible are equivalent in $\mathbb{Z}$; in some contexts they are not.

**This is not the case when $p,q$ are not factors of $n$: clearly, $2 \times 5 \equiv 3 \pmod{7}$ has neither a factor $2$ nor $5$.

***For large $p,q$, the message $m$ is extremely unlikely to fall outside the green set. Finding such an $m$ would allow an attacker to find $p$ and $q$ by computing the highest common factor of $m$ and $n$ (e.g. by Euclid's algorithm); but we assumed this to be practically impossible.

****For $p,q>2$, no residue $m$ can generate a cycle visiting the whole green set. This is because a direct product of cyclic groups is cyclic if and only if their orders are coprime, but $p-1$ and $q-1$ have a common factor $2$. Indeed, by parity considerations, the index of a cyclic subgroup must always be even, so not every factor of $t$ is a possible cycle length. This would justify choosing $d$ $\equiv e^{-1} \pmod{t/2}$ rather than $\mod t$, which would make decryption marginally faster; in our example, we would find $c^7$ instead of $c^{37}$, saving $3$ iterations in an efficient algorithm - but for large primes, on average only $1$ iteration would be saved. Using the Chinese remainder theorem to invert $e$ $\mod p-1$ and $\mod q-1$ is much better.

# Blue Eyes and Common Knowledge

Blue Eyes is a widely-known logic puzzle, popularised by Randall Munroe (among others):

On this island there are $100$ blue-eyed people, $100$ brown-eyed people, and the Guru (she happens to have green eyes). [...] Every night at midnight, a ferry stops at the island. Any islanders who have figured out the color of their own eyes then leave the island, and the rest stay. [...] The Guru is allowed to speak once (let's say at noon), on one day in all their endless years on the island. Standing before the islanders, she says the following:

"I can see someone who has blue eyes."

Who leaves the island, and on what night?

Standard logic puzzle rules apply: the islanders are perfect logicians, can't discuss eye colours, can't discern their own eye colours in reflections, et cetera. The problem is difficult and the solution is extremely counter-intuitive to many. If you have not seen it before, I encourage you to stop reading now, and try it yourself.

The blue-eyed islanders all leave $100$ nights after the announcement. To see this, let's consider some simpler problems.

1. Suppose only one islander has blue eyes. He sees no other islanders with blue eyes, so when the Guru announces that she sees someone who has blue eyes, that islander knows it must be himself. He leaves on that night.
2. Suppose two islanders have blue eyes. Each sees another islander with blue eyes; as far as each of them knows, the Guru is talking about the other; nobody leaves on the first night. But both of them know the solution to the one-blue-eyed-islander problem, and observe on the first night that the blue-eyed islander they see is still there. Each of them concludes that that blue-eyed islander must see somebody else with blue-eyes; they don't see anybody else fitting that description, so it must be themselves. They both leave on the second night.
3. Suppose three islanders have blue eyes. Each sees two others with blue eyes, but when neither of them leaves on the second night, they know that those other islanders must each be able to see two others with blue eyes, one of whom must be themselves. They all leave on the third night.

An inductive argument shows that if the number of blue-eyed islanders is $n$, then they all leave on the $n$th night.
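The induction can be sketched as a simulation. Each islander follows the rule "if nobody has left by night $k$ and I can see exactly $k$ blue-eyed islanders, my own eyes must be blue"; the function below is our own illustration, not part of the puzzle:

```python
def simulate(n_blue, n_brown):
    """Return (night, leavers): the night somebody first leaves, and how many.
    Islander i sees sees[i] blue-eyed islanders; they leave on night k+1
    if they see k and nobody has left after night k. Assumes n_blue >= 1."""
    sees = [n_blue - 1] * n_blue + [n_blue] * n_brown
    night = 0
    leavers = []
    while not leavers:
        night += 1
        leavers = [i for i, k in enumerate(sees) if k == night - 1]
    return night, len(leavers)

print(simulate(1, 100))     # (1, 1): the lone blue-eyed islander leaves at once
print(simulate(100, 100))   # (100, 100): all blue-eyed leave on night 100
```

The brown-eyed islanders, who each see $100$ blue-eyed islanders, would only act on night $101$; seeing the blue-eyed islanders leave on night $100$ tells them their own eyes are not blue, but not their colour, so they stay.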

What makes this so counter-intuitive is that the Guru apparently tells the islanders something they already knew, but after the announcement they make logical deductions they could not have made before. Even after seeing the solution, it is tempting to think that the islanders have by some trick managed to agree (without communicating) on a strategy "if you can only see $d-1$ blue-eyed islanders on the $d$th day, say your eyes are blue and leave the island", and the strategy only works because they're all following it. Perhaps the only thing stopping the islanders from performing this feat sooner is that they need to know which day is the "first" day. The content of the Guru's announcement seems to be irrelevant; it could be "I see somebody with two feet", "I like table tennis", or "$2 + 2 = 5$", because the announcement only serves to synchronise the counting.

This is a popular interpretation, but it is a mistake. For example, there's no particular reason the brown-eyed islanders couldn't also count synchronously even though the announcement didn't contain the word "brown", so shouldn't they leave too? Unfortunately, this does not explain why it is a mistake; in my opinion, the real puzzle here is not "who leaves the island, and on what night?" but Blue Eyes Redux:

A. What do the islanders know after the Guru's announcement that they didn't know before?
B. How do they use this knowledge?
C. Why does it take $100$ days? What do they know on, e.g., the $50$th day, that they didn't know on the $49$th?

To answer this we need to inspect the reasoning used by each islander to eventually conclude that their own eyes are blue. Let's introduce some notation: for an islander $x$ and $k,d \in \mathbb{N}$,

• Let $B(x)$ be the proposition "$x$ has blue eyes".
• Let $E(k)$ be the proposition "There are at least $k$ blue-eyed islanders".
• Let $S(x,k)$ be the proposition "$x$ sees at least $k$ blue-eyed islanders".
• Let $L(x,d)$ be the proposition "$x$ leaves the island on the $d$th night".
• Let $\newcommand\knows[1][]{\mathrel{\smash{\overset{\,#1}{\Rule{0ex}{.5ex}{0ex}\smash{\leftarrow}}}}} x \knows P$ mean "$x$ knows that the proposition $P$ is true", and $x \knows[d] P$ mean "$x$ knows on the $d$th day that $P$ is true". $\newcommand\nknows[1][]{\mathrel{\smash{\overset{\,#1}{\Rule{0ex}{.7ex}{0ex}\smash{\not\leftarrow}}}}} x \nknows[d] P$ means "$x$ doesn't know on the $d$th day that $P$ is true".

For example, $x \knows B(x)$ means "$x$ knows his own eyes are blue", and $x \knows E(k)$ means "$x$ knows there are at least $k$ blue-eyed islanders", which is not quite the same as $S(x,k)$. Of course, the former is a simple inference from the latter. Let's write some rules of inference: $\newcommand\infer{\vDash} P \infer Q$ means "$Q$ can be inferred from $P$".

• Visibility: $B(x), \, x \ne y \infer y \knows B(x)$. If $x$'s eyes are blue then other islanders can see this.
• Counting: $S(x,k) \infer x \knows E(k)$. $x$ can count: if he sees at least $k$ blue-eyed islanders, then he knows there are at least $k$ blue-eyed islanders.
• Exit: $x \knows[d] B(x) \infer L(x,d)$. $x$ leaves the island as soon as he knows his own eyes are blue.
• Non-exit: $x \nknows[d] B(x) \infer \neg L(x,d)$. The inverse of Exit: $x$ does not leave the island if he doesn't know his own eyes are blue.
• Presence: $\neg L(x,d) \infer x \nknows[d] B(x)$. The contrapositive of Exit: if $x$ has not left the island, then he doesn't yet know his eyes are blue.
• Addition: $B(x), \, S(x,k) \infer E(k+1)$. If $x$'s eyes are blue and he sees at least $k$ other blue-eyed islanders, then there are at least $k+1$.
• Subtraction: $S(x,k) \infer x \knows S(y,k-1)$. If $x$ sees at least $k$ blue-eyed islanders, then he knows that another islander $y$ can see at least $k-1$.
• Discovery: $x \knows[d] E(k), \, \neg S(x,k) \infer x \knows[d] B(x)$. If $x$ knows there are at least $k$ blue-eyed islanders, but he cannot see $k$, then $x$ knows his own eyes are blue.
• Non-discovery: $x \knows[d] E(k), \, x \nknows[d] B(x) \infer S(x,k)$. A contrapositive of Discovery: if $x$ knows there are at least $k$ blue-eyed islanders, and does not know his own eyes are blue, then he must be able to see $k$.
• Deduction: $(\{P_i\} \infer Q), \, \forall i \bullet x \knows[d] P_i \infer x \knows[d] Q$. The islanders are perfect logicians: if an argument is valid and $x$ knows its premises to be true, then $x$ knows its conclusion to be true.
• Correctness: $x \knows[d] P \infer P$. The islanders do not make incorrect inferences; if they know something to be true, then it is true.
• We assume that if the true number of blue-eyed islanders is $n$ then $B(x) \infer S(x,n-1), \, \neg S(x,n)$, as blue-eyed islanders can see everyone's eyes except their own.

It goes without saying that $x \knows P \infer x \knows[d] P$, and $x \knows[d] P, \, e > d \infer x \knows[e] P$.

Let's try using this notation to analyse the problem for small numbers of islanders.

1. Suppose only one islander $x$ has blue eyes.
• The Guru announces $E(1)$, and $x$ hears the announcement, so $x \knows E(1)$.
• By assumption, $\neg S(x,1)$.
• By Discovery, we infer $x \knows[1] B(x)$ immediately. By Exit, $x$ leaves on the first night.
2. Suppose two islanders $x, y$ have blue eyes.
• The Guru announces $E(1)$, but we already have $x \knows E(1)$ by Counting. However, $x$ knows $y$ hears the announcement, so $x \knows y \knows E(1)$.
• By assumption, $\neg S(x,2)$, but $x \nknows[1] E(2)$ so we cannot apply Discovery. So by Non-exit $\neg L(x,1)$, and by symmetry $\neg L(y,1)$.
• It is now the second day. $x$ observes that $y$ did not leave, so $x \knows[2] \neg L(y,1)$. By Presence and Deduction, $x \knows[2] y \nknows[1] B(y)$.
• We have $x \knows y \knows E(1)$ and $x \knows[2] y \nknows[1] B(y)$, so by Non-discovery and Deduction, $x \knows[2] S(y,1)$.
• So $x \knows[2] S(y,1)$ and by Visibility $x \knows B(y)$. By Addition and Deduction, $x \knows[2] E(2)$.
• We have $x \knows[2] E(2)$ and $\neg S(x,2)$, so by Discovery $x \knows[2] B(x)$.
• By Exit, $x$ leaves the island on the second night, and by symmetry, so does $y$.

With two blue-eyed islanders we're already able to explicitly write down something $x$ knows after the Guru's announcement that he doesn't know before: $x \knows y \knows E(1)$. Without the announcement, $y \knows E(1)$ but $x \nknows y \knows E(1)$, as $x$ doesn't see any blue-eyed islander which he knows $y$ can see.

The fundamental distinction here is between knowing something, and being able to attribute that knowledge to somebody else. For example, if you happen to know that a particular berry is poisonous, you will not eat it. Someone else might also know the berry is poisonous, but if you're unaware of that then you might worry that they would eat it. "The berry is poisonous" can't be used as a premise when you "put yourself in their shoes", because you don't know that they're aware of it. If, however, it's common knowledge that the berry is poisonous, you know they're aware of it, so you can use it as a premise while you're walking in their shoes.

3. Suppose three islanders $x,y,z$ have blue eyes.
• The Guru announces $E(1)$, but not only did we already have e.g. $x \knows E(2)$ by Counting, we had e.g. $x \knows y \knows E(1)$ by Counting, Subtraction and Deduction. However, we do now have e.g. $x \knows y \knows z \knows E(1)$.

Each islander not only knows that the other islanders were present for the announcement, but they know that every other islander also knows this. "Everyone knows everyone knows everyone knows there's a blue-eyed islander" is the key information required to make the deductions which eventually lead to the islanders leaving. Although the Guru tells the islanders something they clearly already know, $x$ did not previously know that $y$ knew $z$ knew $E(1)$: $x$ can see a blue-eyed islander visible to $y$, but he does not know that $y$ sees a blue-eyed islander visible to $z$.

This is probably where it gets too abstract to follow intuitively. We're used to the idea of "walking in someone else's shoes" - but not imagining what it would be like to be someone else themselves walking in someone else's shoes. That would be like wearing two pairs of other people's shoes one over the other, which is silly. But what allows the islanders to leave, is that the Guru's announcement $E(1)$ is common knowledge, so however many pairs of shoes you're wearing one over the other, you can still use $E(1)$ as a premise. Informally, $P$ is common knowledge if

• Everyone knows $P$, and
• Everyone knows that $P$ is common knowledge.

Of course, this is self-referential, so it doesn't really work as a definition.

To go further, it will be useful to extend our notation. The difference between $\forall x \bullet x \knows[d] P$ and $\forall x,y \bullet x \knows[d] y \knows[d] P$ is clearly important; let's write $1 \knows[d] P$ and $2 \knows[d] P$ for these cases, and more generally $m \knows[d] P$ to mean an $m$-islander-long "chain" of knowledge about knowledge about knowledge, $\forall x_1,\ldots,x_m \bullet x_1 \knows[d] \ldots \knows[d] x_m \knows[d] P$.

• We can see that $m \knows[d] n \knows[d] P$ is equivalent to $(m+n) \knows[d] P$.
• By Correctness, $m \knows[d] P \infer (m-1) \knows[d] P$, and inductively $m \knows[d] P, \, n < m \infer n \knows[d] P$.
• Deduction can be inductively extended to $(\{P_i\} \infer Q), \, \forall i \bullet m \knows[d] P_i \infer m \knows[d] Q$. Let's call this Metadeduction.

In the problem with three blue-eyed islanders, for example, we have $2 \knows E(1)$ before the Guru's announcement and $3 \knows E(1)$ after the announcement. In fact, we have $\forall m \bullet m \knows E(1)$, although we don't use this fact for $m > 3$. This leads to a reasonable definition of common knowledge:

• $P$ is common knowledge if $\forall m \bullet m \knows P$.

Let's write this as $\infty \knows P$. There are a few more rules:

• The Guru's announcement is public: $\infer \infty \knows E(1)$. It's common knowledge that there is a blue-eyed islander.
• Remaining on the island is public: $\neg L(x,d-1) \infer \infty \knows[d] \neg L(x,d-1)$. If an islander doesn't leave the island, this becomes common knowledge the next day.
• Infinite Metadeduction: $(\{P_i\} \infer Q), \, \forall i \bullet \infty \knows[d] P_i \infer \infty \knows[d] Q$. That is, if an argument is valid and its premises are common knowledge, then its conclusion is also common knowledge. This follows directly from Metadeduction.

We can now answer Blue Eyes Redux:

1. Suppose there are $n$ blue-eyed islanders. ($n=100$, for example.)
• The Guru announces $E(1)$. By many applications of Counting, Subtraction and Deduction, we previously had $(n-1) \knows E(1)$. By The Guru's announcement is public, we now have $n \knows E(1)$: this answers part A.
• By Non-exit, $\forall x \bullet \neg L(x,1)$.
• Remaining on the island is public so $n \knows[2] \forall x \bullet \neg L(x,1)$, and by Presence and Metadeduction we have $n \knows[2] \forall x \bullet x \nknows[1] B(x)$. By Non-discovery and Metadeduction $(n-1) \knows[2] \forall x \bullet S(x,1)$. By Visibility, Addition and Metadeduction $(n-1) \knows[2] E(2)$.
• By Non-exit, $\forall x \bullet \neg L(x,2)$.
• $\ldots$
• On the $d$th day, Remaining on the island is public so $(n-d+2) \knows[d] \forall x \bullet \neg L(x,d-1)$, and by Presence and Metadeduction we have $(n-d+2) \knows[d] \forall x \bullet x \nknows[d-1] B(x)$. By Non-discovery and Metadeduction $(n-d+1) \knows[d] \forall x \bullet S(x,d-1)$. By VisibilityAddition and Metadeduction $(n-d+1) \knows[d] E(d)$. This answers part C.
• $\ldots$
• So $1 \knows[n] E(n)$, by assumption $\forall x \mathrel{|} B(x) \bullet \neg S(x,n)$, by Discovery $\forall x \mathrel{|} B(x) \bullet x \knows[n] B(x)$ and by Exit $\forall x \mathrel{|} B(x) \bullet L(x,n)$. This answers part B.

The salient point is that each day, the fact at the end of the chain gets stronger, but the chain gets $1$ shorter.* Without $n \knows E(1)$ on the first day, the chain would be too short on the $n$th day and we'd have $0 \knows[n] E(n)$, which is just $E(n)$. So nobody would ever know $E(n)$, we would never be able to apply Non-discovery that final time, and the islanders would be stuck on the island forever - as they were before the Guru's announcement.

*If we used $\infty \knows E(1)$ instead of $n \knows E(1)$, then by Infinite Metadeduction each day we'd still have $\infty \knows[d] E(d)$. This makes the proof simpler, but hides the reason it doesn't work if $n \nknows E(1)$.
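This day-by-day argument can be checked mechanically with a brute-force possible-worlds model. The sketch below is one way to encode it in Python; the representation of worlds as boolean tuples and the function names are choices of ours, not part of the puzzle's statement. Each day, every world in which somebody *would* have deduced their own eye colour is publicly eliminated — which is exactly the Remaining on the island is public rule.

```python
from itertools import product

def simulate(n_blue, n_other=0):
    """Possible-worlds model of the blue-eyes puzzle.

    A world is a tuple of booleans, one per islander (True = blue eyes).
    An islander sees everyone's eyes but their own, so the worlds they
    consider possible differ from the actual one only in their own entry.
    Returns (day, set of islanders who leave on that day).
    """
    n = n_blue + n_other
    actual = tuple([True] * n_blue + [False] * n_other)

    def knows_own_colour(i, world, worlds):
        # Islander i knows their colour iff it is the same in every
        # world consistent with what they can see.
        consistent = {w[i] for w in worlds
                      if all(w[j] == world[j] for j in range(n) if j != i)}
        return len(consistent) == 1

    # The Guru's announcement E(1) publicly rules out the no-blue-eyes world.
    worlds = {w for w in product([True, False], repeat=n) if any(w)}

    day = 0
    while True:
        day += 1
        leavers = {i for i in range(n)
                   if knows_own_colour(i, actual, worlds)}
        if leavers:
            return day, leavers
        # Nobody left, and remaining is public: eliminate every world in
        # which somebody would have deduced their colour today.
        worlds = {w for w in worlds
                  if not any(knows_own_colour(i, w, worlds)
                             for i in range(n))}
```

With three blue-eyed islanders, `simulate(3)` reports that all three leave together on day 3, matching the chain argument above. The model is exponential in the number of islanders, so it is only a sanity check for small cases, not a substitute for the proof.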

# Matrix Multiplication and Pascal's Triangle

Students who learn matrix multiplication before linear map composition often find the definition arbitrary and unmotivated. (Why not just multiply each component separately, the way addition and scalar multiplication work?) Linear maps are the major reason we use matrices, and matrix multiplication only really makes sense once you see this.

But still, any grid of numbers is by definition a matrix, whether or not those numbers came from the co-ordinates of images of basis vectors under a linear map. Whatever context the data in the grid came from, we can do matrix multiplication, and the matrix may even be interpreted post hoc as a linear map. What's surprising is how often this bonus linear-algebraic structure ends up being useful and elegant.

A few examples:

• A graph or network has two natural representations as a grid of numbers: an adjacency matrix $A$ gives the connections between pairs of nodes, and an incidence matrix $B$ gives the two endpoints of each edge. $(A^n)_{i,j}$ happens to give the number of routes of length $n$ between nodes $i$ and $j$, and $B$ has a post hoc interpretation as the boundary map, which transforms a vector $\mathbf{e}$ of edges into a vector $B\mathbf{e}$ of nodes such that $\mathbf{e}$ is a disjoint union of paths with distinct endpoints $B\mathbf{e}$.
• A Markov chain is a random process with a set of states; each state has a probability distribution determining the next state. If $P_{i,j}$ is the probability that state $i$ will evolve into state $j$, then it turns out that $(P^n)_{i,j}$ is the probability that state $i$ will evolve into state $j$ after $n$ steps.
• A multi-variable function $f(\mathbf{x})$ has a Hessian matrix $$H_{i,j} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$ which appears as a quadratic form $\newcommand\transp{^{\mathsf{T}}} \frac{1}{2}\mathbf{x}\transp H \mathbf{x}$ in $f$'s Taylor series, and its eigenvalues determine the nature of $f$'s stationary points.
• A set of random variables $X_i$ has a covariance matrix $C_{i,j} = \operatorname{Cov}(X_i, X_j)$. Given a linear combination $Y = \sum y_j X_j$, matrix multiplication strikes again: $(C \mathbf{y})_i = \operatorname{Cov}(Y, X_i)$.
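The adjacency-matrix claim in the first example is easy to check by hand on a small graph. A minimal sketch, using a triangle graph chosen arbitrarily for illustration:

```python
# Adjacency matrix of the triangle graph on nodes 0, 1, 2,
# with every pair of nodes connected.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

def matmul(X, Y):
    """Plain matrix multiplication on square lists-of-lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A2 = matmul(A, A)
# Two length-2 walks from 0 back to 0 (via 1 or via 2),
# and one length-2 walk from 0 to 1 (via 2).
assert A2[0][0] == 2 and A2[0][1] == 1
```

The same pattern continues for higher powers: $(A^3)_{0,0} = 2$ counts the two length-3 circuits $0{-}1{-}2{-}0$ and $0{-}2{-}1{-}0$.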

How about Pascal's triangle? $$P = \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots \\ 1 & 1 & 0 & 0 & \cdots \\ 1 & 2 & 1 & 0 & \cdots \\ 1 & 3 & 3 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$ Of course it's a matrix. The numbers are determined by the first row and the recurrence $P_{i+1,j+1}$ $= P_{i,j} + P_{i,j+1}$. We also know this matrix as $\newcommand\ncr[2]{{}^{#1} \mathrm{C}_{#2}} P_{n,r} = \ncr{n}{r}$.

Alternatively we can write it as a symmetric matrix: $$S = \begin{pmatrix} 1 & 1 & 1 & 1 & \cdots \\ 1 & 2 & 3 & 4 & \cdots \\ 1 & 3 & 6 & 10 & \cdots \\ 1 & 4 & 10 & 20 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$ $P$ and $S$ are just grids of numbers; they're defined by a simple combinatorial rule, and there is no linear algebra in sight. Surely matrix multiplication is meaningless here? But in fact, $PP\transp = S$: $$\left(PP\transp\right) {}_{\!i,j} = \sum P_{i,k} \left(P^{\mathsf{T}}\right) {}_{\!k,j} = \sum \ncr{i}{k} \, \ncr{j}{k} = \ncr{i+j}{i} = S_{i,j}$$ as required.
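The identity $PP\transp = S$ is easy to verify numerically on a truncated block. A quick sketch in Python; the truncation size $8$ is an arbitrary choice:

```python
from math import comb  # comb(i, j) is 0 when j > i, which is what we want

N = 8  # truncation size, chosen arbitrarily
P = [[comb(i, j) for j in range(N)] for i in range(N)]      # lower triangular
S = [[comb(i + j, i) for j in range(N)] for i in range(N)]  # symmetric

# (P P^T)[i][j] = sum_k C(i,k) C(j,k), which the identity says equals C(i+j, i)
PPt = [[sum(P[i][k] * P[j][k] for k in range(N)) for j in range(N)]
       for i in range(N)]
assert PPt == S
```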

Let's take a look at that third step more closely - what we're saying is, the number of ways to choose $i$ items from a set of $i+j$ is the same as the number of ways to choose some number $k$ of items from a set of $i$ and another $k$ from a set of $j$, summed over all $k$. There is indeed a natural correspondence between such choices; for example, how many ways are there of choosing $3$ of the symbols A,B,C,X,Y?

• $k = 0$: ABC
• $k = 1$: BCX, BCY, ACX, ACY, ABX, ABY
• $k = 2$: CXY, BXY, AXY

This yields all $10 = \ncr{3+2}{3}$ ways. The correspondence is simple; choose $k$ of the first set A,B,C not to use, and choose $k$ of the second set X,Y to use.

A consequence of this fact is that if we truncate $S$ so that it is a finite square matrix, then its determinant is $1$. This is not difficult to show: $S = PP\transp$ is a product of a lower-triangular and an upper-triangular matrix, each of which has only $1$s on the diagonal and hence determinant $1$.

(This insight is lost on MATLAB:

>> det(pascal(20))
ans = 1007.44031415979
>> det(sym(pascal(30)))
ans = 209

Thanks, IEEE 754.)
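Exact rational arithmetic has no such trouble. A minimal sketch in Python using `fractions.Fraction`; the elimination below deliberately skips pivot exchanges, which is safe here because the Pascal matrix's leading minors are all nonzero (in fact, all equal to $1$):

```python
from fractions import Fraction
from math import comb

def pascal(n):
    """n x n truncation of the symmetric Pascal matrix S."""
    return [[comb(i + j, i) for j in range(n)] for i in range(n)]

def det_exact(matrix):
    """Determinant by Gaussian elimination over the rationals,
    without pivoting (valid when all leading minors are nonzero)."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    d = Fraction(1)
    for i in range(n):
        d *= a[i][i]  # determinant is the product of the pivots
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

assert det_exact(pascal(20)) == 1
assert det_exact(pascal(30)) == 1
```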

Can $P$ be interpreted as a linear map? In fact, it can. $P$ is the matrix of binomial coefficients, so binomial expansions are a good context to consider. If we identify polynomials $f(x) = a_0 + a_1 x + \ldots + a_n x^n$ with row vectors $[f(x)] = (a_0~a_1~\cdots~a_n)$, then $[1]\,P = [1]$, $[x]\,P = [1 + x]$, $[x^2]\,P$ $= [1 + 2x + x^2]$ $= [(1+x)^2]$, and in general $$[x^n]\,P = \sum P_{n,k} \left[x^k\right] = \left[\sum \ncr{n}{k} \, x^k\right] = \left[(1+x)^n\right]$$ Since matrix multiplication is linear, it follows immediately that $[f(x)] \,P = [f(1+x)]$. Therefore the matrix $P$ represents a linear transformation on polynomials given by the substitution $x \mapsto x+1$. Then we see that $P^a$ represents the substitution $x \mapsto x+a$, and so $(P^a)_{n,r}$ is the coefficient of $x^r$ in the binomial expansion $(x+a)^n$.
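The substitution interpretation is easy to test directly. A sketch with an arbitrary example polynomial:

```python
from math import comb

N = 6
P = [[comb(i, j) for j in range(N)] for i in range(N)]

def row_times(v, M):
    """Row vector times matrix."""
    return [sum(v[k] * M[k][j] for k in range(len(v)))
            for j in range(len(M[0]))]

# f(x) = 2 + 3x^2, as the row vector of its coefficients
f = [2, 0, 3, 0, 0, 0]

# [f(x)] P should be [f(1+x)] = [2 + 3(1+x)^2] = [5 + 6x + 3x^2]
assert row_times(f, P) == [5, 6, 3, 0, 0, 0]
```

Applying $P$ twice gives $[f(2+x)]$ $= [2 + 3(2+x)^2]$ $= [14 + 12x + 3x^2]$, consistent with $P^2$ representing $x \mapsto x+2$.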

Somehow, wherever a grid of numbers comes from, linear algebra finds something to say about it.
