The Applications of Probability to Cryptography
The underlying manuscript is held by the National Archives in the UK and can be accessed at www.nationalarchives.gov.uk using reference number HW 25/37. Readers are encouraged to obtain a copy.
The original work was under Crown copyright, which has now expired; the work is now in the public domain.
Two Second World War research papers by Alan Turing were declassified recently. The papers, The Applications of Probability to Cryptography and its shorter companion Paper on Statistics of Repetitions, are available from the National Archives in the UK at www.nationalarchives.gov.uk.
The released papers give the full text, along with figures and tables, and provide a fascinating insight into the preparation of the manuscripts, as well as the style of writing at a time when typographical errors were corrected by hand, and mathematical expressions were handwritten into spaces left in the text.
Working with the papers in their original format provides some challenges, so they have been typeset for easier reading and access. We recommend that the typeset versions are read with a copy of the original manuscript at hand.
This document contains the text and figures for The Applications of Probability to Cryptography; the companion paper is also available in typeset form from arXiv at www.arxiv.org/abs/1505.04715. These notes apply to both documents.
Separately, a journal article by Zabell [Zabell, S. 2012. “Commentary on Alan M. Turing: The Applications of Probability to Cryptography”, Cryptologia, 36:191-214] provides an analysis of the papers and further background information.
It is not our intent to cast Alan Turing’s manuscripts into a journal style article, but more to provide clearer access to his writing and, perhaps, to answer the question “If Turing had had access to typesetting software, what would his papers have looked like?”. Consequently no “house-style” copy-editing has been imposed. Occasional punctuation has been added to improve readability, some obvious errors corrected, and paragraph breaks added to ease the reading of long text blocks - and occasionally to give a better text flow. Turing uses typewriter underlining, single, and double quotes to indicate emphasis or style; these have been implemented using font format changes, and double quotes are used as needed.
The manuscript has many typographical errors, deletions, and substitutions, all of which are indicated by over-typing, crossed out items, and handwritten pencil or ink annotations. These corrections have been implemented in this document to give the text that we presume Turing intended. Additionally, there are some hand written notes in the manuscript, which may or may not be by Turing; these are indicated by the use of footnotes.
British English spelling is used in the manuscript and this is retained, so words such as favour, neighbourhood, cancelling, etc. will be encountered. Turing appears to favour the spellings bigramme, trigramme, tetragramme, etc., although he is not always consistent; throughout this document the favoured rendering is used.
Turing’s wording is unchanged to give the full flavour of his original style. This means that “That is to say that we suppose for instance that ….. ” will be encountered - amongst others!
Both papers end abruptly, no summary or conclusion is offered, perhaps the papers are incomplete or pages are missing. To indicate the end of the manuscript we have marked the end of each paper with a printing sign - an infinity symbol between two horizontal bars.
In the section on a letter subtractor problem, reference is made to other methods to be discussed later in the paper. This does not happen - perhaps another indicator of an incomplete paper or missing pages.
Finally, Turing uses some forward page references that appear in the manuscript as see(p ), obviously intending to return and complete the reference. This also does not happen, so these references remain unresolved.
In short, we strive to represent Turing’s text as he wrote it.
Ciphertext, cleartext, etc.
In an attempt to capture the flavour of the time, ciphertext, cleartext, keys, etc. are displayed in a fixed pitch, bold, non-serif font to represent the typewriter, teletype, and telegraph machines that would have printed the original code, viz. CONDITIONS.
In the manuscript all mathematics is hand written in ink and pencil in spaces left between the typed text. Sometimes adequate space was left, other times not, and the handwriting spills into margins and adjacent lines, adding to the reading challenge. We have cast all mathematics into standard in-line or display formats as appropriate. We have used the mathcal font in places to capture the flavour of Turing’s handwriting, e.g. “the probability p” appears as “the probability ”.
Turing uses no punctuation in his mathematics; this has been added to be consistent with modern practice [see, for instance, Higham, Nicholas J. 1998. “Handbook of Writing for the Mathematical Sciences”, SIAM, Philadelphia]. He also uses letters to reference equations; numbers are used in this document. In many places we have added parentheses to give clarity to an expression, and in some places where Turing is inconsistent in his uses of parentheses for a mathematical phrase (the expression for letter probability in the Vigenère in particular) we have chosen one format and been consistent in its use.
As Turing demonstrates a love of dense mathematics the algebraic multiplication symbol has occasionally been used for readability, so all standard forms of multiplication will be encountered, viz., . Finally, convention suggests that the subject of a formula or expression sits on its own on the left hand side of the equals sign, with the subsidiary variables collected on the right hand side. Turing adheres to this convention as it suits him, his preference is retained.
In short, we strive to retain the elegance of Turing’s mathematics, whilst casting it into a modern format.
Figures and tables
All figures have been included with rearrangement of some items to improve clarity or document flow. Turing uses a variety of papers, styles, inks, pen, and pencil; these have all been represented in standard figure and table format.
Turing provides a rudimentary Contents for The Applications of Probability to Cryptography, this has been reworked with some additions to make it more meaningful. Paper on Statistics of Repetitions, being much shorter, requires no Contents.
The editor can be contacted at: firstname.lastname@example.org.
- 1 Introduction
- 2 Straightforward Cryptographic Problems
Chapter 1 Introduction
The theory of probability may be used in cryptography with most effect when the type of cipher used is already fully understood, and it only remains to find the actual keys. It is of rather less value when one is trying to diagnose the type of cipher, but if definite rival theories about the type of cipher are suggested it may be used to decide between them.
1.2. Meaning of probability and odds
I shall not attempt to give a systematic account of the theory of probability, but it may be worth while to define shortly probability and odds. The probability of an event on certain evidence is the proportion of cases in which that event may be expected to happen given that evidence. For instance if it is known that 20% of men live to the age of 70, then knowing of Hitler only that Hitler is a man we can say that the probability of Hitler living to the age of 70 is 0.2. Suppose that we know that Hitler is now of age 52 the probability will be quite different, say 0.5, because 50% of men of 52 live to 70.
The odds of an event happening is the ratio $p/(1-p)$ where $p$ is the probability of it happening. This terminology is connected with the common phraseology “odds of 5:2 on”, meaning in our terminology that the odds are 5/2.
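As an editorial illustration (not part of the manuscript), the relation between probability and odds can be sketched in a few lines. The function names are our own, and exact fractions are used so that the 5:2 example comes out without rounding.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Return the odds p / (1 - p) of an event whose probability is p (p < 1)."""
    p = Fraction(p)
    return p / (1 - p)

def probability_from_odds(odds):
    """Invert the relation: p = odds / (1 + odds)."""
    odds = Fraction(odds)
    return odds / (1 + odds)

# "Odds of 5:2 on" means odds of 5/2, which is a probability of 5/7.
print(probability_from_odds(Fraction(5, 2)))
# A probability of 0.2 corresponds to odds of 1/4, i.e. 4:1 against.
print(odds_from_probability(Fraction(1, 5)))
```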
1.3. Probabilities based on part of the evidence
When the whole evidence about some event is taken into account it may be extremely difficult to estimate the probability of the event, even very approximately, and it may be better to form an estimate based on a part of the evidence, so that the probability may be more easily calculated. This happens in cryptography in a very obvious way. The whole evidence when we are trying to solve a cipher is the complete traffic, and the events in question are the different possible keys, and functions of the keys. Unless the traffic is very small indeed the theoretical answer to the problem “What are the probabilities of the various keys?” will be of the form “The key has a probability differing almost imperceptibly from 1 (certainty) and the other keys are virtually impossible”. But a direct attempt to determine these probabilities would obviously not be a practical method.
1.4. A priori probabilities
The evidence concerning the possibility of an event occurring usually divides into a part about which statistics are available, or some mathematical method can be applied, and a less definite part about which one can only use one’s judgement. Suppose for example that a new kind of traffic has turned up and that only three messages are available. Each message has the letter V in the 17th place and G in the 18th place. We want to know the probability that it is a general rule that we should find V and G in these places. We first have to decide how probable it is that a cipher would have such a rule, and as regards this one can probably only guess, and my guess would be about $\frac{1}{5{,}000{,}000}$. This judgement is not entirely a guess; some rather insecure mathematical reasoning has gone into it, something like this:-
The chance of there being a rule that two consecutive letters somewhere after the 10th should have certain fixed values seems to be about $\frac{1}{500}$ (this is a complete guess). The chance of the letters being the 17th and 18th is about $\frac{1}{15}$ (another guess, but not quite as much in the air). The probability of a letter being V or G is $\frac{1}{676}$ (hardly a guess at all, but expressing a judgement that there is no special virtue in the bigramme VG). Hence the chance is $\frac{1}{500 \times 15 \times 676}$ or about $\frac{1}{5{,}000{,}000}$. This is however all so vague, that it is more usual to make the judgement “$\frac{1}{5{,}000{,}000}$” without explanation.
The question as to what is the chance of having a rule of this kind might of course be resolved by statistics of some kind, but there is no point in having this very accurate, and of course the experience of the cryptographer itself forms a kind of statistics.
The remainder of the problem is then solved quite mathematically. Let us consider a large number of ciphers chosen at random, $N$ of them say. Of these $\frac{N}{5{,}000{,}000}$ of them will have the rule in question, and the remainder not. Now if we had three messages of each of the ciphers before us, we should find that for each of the ciphers with the rule, three messages have VG in the required place, but of the remaining only a proportion $\frac{1}{676^3}$ will have them. Rejecting the ciphers which have not the required characteristics we are left with $\frac{N}{5{,}000{,}000}$ cases where the rule holds, and $\frac{N}{676^3}\left(1 - \frac{1}{5{,}000{,}000}\right)$ cases where it does not. This selection of ciphers is a random selection of ones which have all the known characteristics of the one in question, and therefore the odds in favour of the rule holding are:
$$\frac{N}{5{,}000{,}000} : \frac{N}{676^3}\left(1 - \frac{1}{5{,}000{,}000}\right), \quad \text{i.e. about } 60:1 \text{ on.}$$
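The arithmetic of this argument can be checked with a short editorial sketch (not part of the manuscript). The three guessed chances are taken here as 1/500, 1/15 and 1/676, which is an assumption on our part about the values intended; the number of ciphers N is arbitrary and cancels.

```python
from fractions import Fraction

# Assumed guesses (see lead-in): a rule exists, it sits at places 17 and 18,
# and the fixed pair happens to be the particular bigramme VG.
prior = Fraction(1, 500) * Fraction(1, 15) * Fraction(1, 676)
p_vg_by_chance = Fraction(1, 26 * 26)   # VG in a given place if there is no rule
messages = 3

N = 10**12                               # hypothetical number of ciphers; it cancels
cases_with_rule = N * prior              # every such cipher shows VG three times
cases_without_rule = N * (1 - prior) * p_vg_by_chance ** messages

odds_on = cases_with_rule / cases_without_rule
print(float(1 / prior))    # about five million
print(float(odds_on))      # roughly 60, i.e. about 60:1 on
```

Changing `messages` shows how quickly the evidence accumulates: each further message with VG in place multiplies the odds by 676.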
It should be noticed that the whole argument is to some extent fallacious, as it is assumed that there are only two possibilities, viz. that either VG must always occur in that position, or else that the letters in the 17th and 18th positions are wholly random. There are however many other possibilities worth consideration, e.g.
On the day in question we have VG in the position in question.
Or on another day we have some other fixed pair of letters.
Or in the positions 17, 18 we have to have one of the four combinations VG, RH, OM, IL and by chance VG has been chosen for all the three messages we have had.
Or the cipher is a simple substitution and VG is the substitute of some common bigramme, say TH.
The possibilities are of course endless, and it is therefore always necessary to bear in mind the possibility of there being other theories not yet suggested.
The a priori probability sometimes has to be estimated as above by some sort of guesswork, but often the situation is more satisfactory. Suppose for example that we know that a certain cipher is a simple substitution, the keys having no specially noticeable properties. Suppose also that we have 50 letters of such a message including five occurrences of P. We want to know how probable it is that P is the substitute of E. As before we have to answer two questions.
How likely is it that P would be the substitute of E neglecting the evidence of the five Es occurring in the message?
How likely are we to get 5 Ps?
If P is not the substitute of E
If P is the substitute of E.
I will not attempt to answer the second question for the present. The answer to the first is simply that the probability of a letter being the substitute of E is independent of what the letter is, and is therefore always $\frac{1}{26}$, in particular it is $\frac{1}{26}$ for the letter P. The only guesswork here is the judgement that the keys are chosen at random.
1.5. The Factor Principle
Nearly all applications of probability to cryptography depend on the factor principle (or Bayes’ Theorem). This principle may first be illustrated by a simple example. Suppose that one man in five dies of heart failure, and that of the men who die of heart failure two in three die in their beds, but of the men who die from other causes only one in four dies in their beds. (My facts are no doubt hopelessly inaccurate). Now suppose we know that a certain man died in his bed. What is the probability that he died of heart failure? Of all men, numbering $N$ say, we find that
$N \times \frac{1}{5} \times \frac{2}{3}$ die in their beds of heart failure,
$N \times \frac{4}{5} \times \frac{1}{4}$ die in their beds from other causes.
Now as our man died in his bed we do not need to consider the cases of men who did not die in their beds, and the cases that remain consist of
$N \times \frac{1}{5} \times \frac{2}{3}$ cases of heart failure and
$N \times \frac{4}{5} \times \frac{1}{4}$ from other causes
and therefore the odds are $\frac{1}{5} \times \frac{2}{3} : \frac{4}{5} \times \frac{1}{4}$, i.e. 2:3, in favour of heart failure. If this had been done algebraically the result would have been
$$\text{a posteriori odds of the theory} = \text{a priori odds of the theory} \times \frac{\text{probability of the data if the theory is true}}{\text{probability of the data if the theory is false}}.$$
In this the theory is that the man died of heart failure, and the data is that he died in his bed.
The general formula above will be described as the factor principle, the ratio
$$\frac{\text{probability of the data if the theory is true}}{\text{probability of the data if the theory is false}}$$
is called the factor for the theory on account of the data.
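A minimal editorial sketch of the factor principle, using the heart failure figures of the example (the function name is ours, not the manuscript's):

```python
from fractions import Fraction

def posterior_odds(prior_odds, p_data_if_true, p_data_if_false):
    """Factor principle: a posteriori odds = a priori odds x factor."""
    factor = Fraction(p_data_if_true) / Fraction(p_data_if_false)
    return Fraction(prior_odds) * factor

# One man in five dies of heart failure, so the a priori odds are (1/5)/(4/5).
prior_odds = Fraction(1, 5) / Fraction(4, 5)

# Data: he died in his bed (2/3 with heart failure, 1/4 otherwise).
odds = posterior_odds(prior_odds, Fraction(2, 3), Fraction(1, 4))
print(odds)                # 2/3, i.e. odds of 2:3 in favour of heart failure
print(odds / (1 + odds))   # 2/5, the corresponding probability
```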
Usually when we are estimating the probability of a theory there will be several independent pieces of evidence e.g. following our last example, where we want to know whether a certain man died of heart failure or not, we may know
He died in his bed
His father died of heart failure
His bedroom was on the ground floor
and also have statistics telling us
2/3 of men who die of heart failure die in their beds
2/5 of men who die of heart failure have fathers who died of heart failure
1/2 of men who die of heart failure have their bedrooms on the ground floor
1/4 of men who die from other causes die in their beds
1/6 of men who die from other causes have fathers who died of heart failure
1/20 of men who die from other causes have their bedrooms on the ground floor
Let us suppose that the three pieces of evidence are independent of one another if we know that he died of heart failure, and also if we know that he did not die of heart failure. That is to say that we suppose for instance that knowing that he slept on the ground floor does not make it any more likely that he died in his bed if we knew all along that he died of heart failure. When we make these assumptions the probability of a man who died of heart failure satisfying all three conditions is obtained simply by multiplication, and is $\frac{2}{3} \times \frac{2}{5} \times \frac{1}{2}$ and likewise for those who died from other causes the probability is $\frac{1}{4} \times \frac{1}{6} \times \frac{1}{20}$, and the factor in favour of the heart failure theory is
$$\frac{\frac{2}{3} \times \frac{2}{5} \times \frac{1}{2}}{\frac{1}{4} \times \frac{1}{6} \times \frac{1}{20}}.$$
We may regard this as the product of three factors $\frac{2/3}{1/4}$ and $\frac{2/5}{1/6}$ and $\frac{1/2}{1/20}$ arising from the three independent pieces of evidence. Products like this arise very frequently, and sometimes one will get products involving thousands of factors, and large groups of these factors may be equal. We naturally therefore work in terms of the logarithms of the factors. The logarithm of the factor, taken to the base $10^{1/10}$, is called decibanage in favour of the theory. A deciban is a unit of evidence; a piece of evidence is worth a deciban if it increases the odds of the theory in the ratio $10^{1/10}:1$. The deciban is used as a more convenient unit than the ban. The terminology was introduced in honour of the famous town of Banbury.
Using this terminology we might say that the fact that our man died in bed scores 4.3 decibans in favour of the heart failure theory. We score a further 3.8 decibans for his father dying of heart failure, and 10 for his having his bedroom on the ground floor, totalling 18.1 decibans. We then bring in the a priori odds 1/4 or $10^{-6/10}$ and the result is that the odds are $10^{12.1/10}$, or as we may say “12.1 deciban up on evens”. This means about 16:1 on.
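The deciban arithmetic of this example can be reproduced with the following editorial sketch (names are ours). It computes each factor from the statistics listed above and converts it to decibans as ten times the logarithm to base 10.

```python
import math
from fractions import Fraction

def decibans(factor):
    """Score of a factor in decibans: 10 * log10(factor)."""
    return 10 * math.log10(float(factor))

# Factor for each piece of evidence: P(evidence | heart failure) / P(evidence | other causes).
evidence = {
    "died in his bed": Fraction(2, 3) / Fraction(1, 4),
    "father died of heart failure": Fraction(2, 5) / Fraction(1, 6),
    "bedroom on the ground floor": Fraction(1, 2) / Fraction(1, 20),
}

scores = {name: decibans(f) for name, f in evidence.items()}
total = sum(scores.values())
prior = decibans(Fraction(1, 4))      # a priori odds of 1/4, about -6 decibans
final = total + prior
final_odds = 10 ** (final / 10)

for name, s in scores.items():
    print(f"{name}: {s:.1f} decibans")
print(f"total {total:.1f}, with prior {final:.1f}, odds {final_odds:.1f}:1 on")
```

Carried through without intermediate rounding the final score is a little over 12.0 decibans rather than 12.1; the small difference comes from rounding 18.1 and 6 separately, and the odds are 16:1 on either way.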
Chapter 2 Straightforward Cryptographic Problems
The factor principle can be applied to the solutions of a Vigenère problem with great effect. I will assume here that the period of the cipher has already been determined. Probability theory may be applied to this part of the problem also, but that is not so elementary. Suppose our cipher, written out in its correct period is [Footnote: Turing’s statement of the ciphertext is slightly different to what he decodes. The N M at the end of the first line are reversed to read DKQHSHZMNP in Fig 5, which gives the correct cleartext.]
D K Q H S H Z N M P
R C V X U H T E A Q
X H P U E P P S B K
T W U J A G D Y O J
T H W C Y D Z H G A
P Z K O X O E Y A E
B O K B U B P I K R
W W A C E J P H L P
T U Z Y F H L R Y C
Figure 1. Vigenère problem.
(It is only by chance that it makes a rectangular array.)
Let us try to find the key for the first column, and for the moment let us only take into account the evidence afforded by the first letter D. Let us first consider the key B. The factor principle tells us
$$\frac{\text{a posteriori odds of key B}}{\text{a priori odds of key B}} = \frac{\text{probability of getting D in the cipher if the key is B}}{\text{probability of getting D in the cipher if the key is not B}}.$$
Now the a priori odds in favour of key B may be taken as 1/25. The probability of getting D in the cipher with the key B is just the probability of getting C in the clear which (using the count on 1000 letters in Fig 2) is 0.021. If however the key is not B we can have any letter other than C in the clear, and the probability is (1 - 0.021)/25. Using the evidence of the D then the odds in favour of the key B are
$$\frac{1}{25} \times \frac{25 \times 0.021}{1 - 0.021}.$$
We may then consider the effect of the next letter in the column R which gives a further factor of (25 x 0.064)/(1 - 0.064). We are here assuming that the evidence of the R is independent of the evidence of the D. This is not quite correct, but is a useful approximation; a more accurate method of calculation will be given later. Let us write $P_\alpha$ for the frequency of the letter $\alpha$ in plain language. Then our final estimate for the odds in favour of key B is
$$\frac{1}{25} \prod_{i=1}^{9} \frac{25\,P_{\alpha_i - 1}}{1 - P_{\alpha_i - 1}},$$
where $\alpha_1, \alpha_2, \ldots, \alpha_9$ is the series of letters in the 1st column, and we use the letters and numbers interchangeably, A meaning 1, B meaning 2, …, Z meaning 26 or 0. More generally for key $\beta$ the odds are
$$\frac{1}{25} \prod_{i=1}^{9} \frac{25\,P_{\alpha_i - \beta + 1}}{1 - P_{\alpha_i - \beta + 1}}.$$
The value of this can be calculated by having a table of the decibanage corresponding to the factors $\frac{25\,P_\alpha}{1 - P_\alpha}$. One then decodes the column with the various possible keys, looks up the decibanage, and adds them up.
The most convenient form for doing this is a table of values of $20 \log_{10} \frac{25\,P_\alpha}{1 - P_\alpha}$, taken to the nearest integer, or as we may say, the values of the score in half decibans. One may also have columns showing multiples of these, and the table made of double height (Fig 2.3) [Footnote: Turing provides a table of double height for Fig 3 to allow the “gadget” of Figure 4 to be used with any letter of the alphabet as a decode key - hence the double alphabet. Figure 4 can be prepared as a transparency, with the original markings cleared, and markings for the new decode letter added. Fig 3 and Fig 4 are correctly proportioned in this document for this to work.]. For the first column with key B the decoded column is CQWS••OAV [Footnote: S•• means SSS, for a total of three letters S, as noted in the following arithmetic. The linear decode for the example is CQWSSOAVS.], and we score -5 for C, -26 for Q, -5 for W, 17 for the three letters S, 5 for O, 7 for A and -10 for V, totalling -17. These calculations can be done very quickly by the use of the transparent gadget Fig 2.4, in which squares are ringed in pencil to show the number of letters occurring in the column.
Figure 2. Count on 1000 letters.
The value for X has been taken more or less at random as a compromise between real language & telegraphese. Also I added to each entry (see p ) [Footnote: Forward reference left unresolved in the manuscript.].
The gadget may be placed over Fig 2.3 in various positions corresponding to the various keys. The score is obtained by adding up the numbers showing through the various squares. In Fig 2.5 the alphabet has been written in a vertical below the cipher text of Fig 1, each letter representing a possible key. The score for each key has been written opposite the key, and under the relevant column. An X denotes a bad score, not worth adding up. Usually these will be -15 or worse. It will be seen that for the first column P, having a score of 43 is extremely likely to be right, especially as there is no other score better than 8. If we neglect this latter fact the odds for the key are $\frac{1}{25} \times 10^{43/20}$, i.e. about 5:1 on. The effect of decoding this column with key P has been shown underneath.
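The scoring procedure can be imitated in software. The sketch below is an editorial illustration, not Turing's table: the letter frequencies are an assumption (approximate modern English values standing in for the count of Fig 2), so the individual scores will differ from those quoted in the text. Letters are indexed from A = 0, so that key A leaves a letter unchanged and key B turns D into C, as in the example.

```python
import math

# ASSUMPTION: approximate modern English frequencies (per cent) for A..Z,
# used only because the count of Fig 2 is not reproduced here.
ENGLISH_PERCENT = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.8, 4.0, 2.4,
                   6.7, 7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8, 1.0, 2.4, 0.15, 2.0, 0.07]
TOTAL = sum(ENGLISH_PERCENT)
FREQ = {chr(65 + i): v / TOTAL for i, v in enumerate(ENGLISH_PERCENT)}

def decode(column, key):
    """Decode a column of cipher letters with a single Vigenère key letter."""
    shift = ord(key) - 65
    return "".join(chr((ord(c) - 65 - shift) % 26 + 65) for c in column)

def half_deciban_score(letter):
    """Score 20*log10(25P/(1-P)) of one decoded letter, to the nearest integer."""
    p = FREQ[letter]
    return round(20 * math.log10(25 * p / (1 - p)))

def score_key(column, key):
    """Add up the half-deciban scores of the letters decoded with this key."""
    return sum(half_deciban_score(c) for c in decode(column, key))

column = "DRXTTPBWT"   # first column of Figure 1
scores = {chr(65 + k): score_key(column, chr(65 + k)) for k in range(26)}

# Show the three best-scoring keys together with their decodes.
for key, s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(key, s, decode(column, key))
```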
For the second column the best key is O, but is by no means so certain as the first column. The decode for this column is also shown, and provides very satisfactory combinations with the first column, confirming both keys. (This confirmation could also be based on probability theory, given a table of bigramme frequencies). In the third column I and C are best although D would be very possible, and in the fourth column Q and U are best.
Writing down the possible decodes we see that the first line must read OWING and this makes the other lines read CONDI, ITHAS, EIMPO, ETOIM, ALCUL, MACHI, HISIS, EGRET. By filling in the word CONDITIONS the whole can now be decoded. [Footnote: Solution: Keylength - 10, Key - POIUMOLQNY, Cleartext - OWINGTOWAR CONDITIONS ITHASBECOM EIMPOSSIBL ETOIMPORTC ALCULATING MACHINESXT HISISVERYR EGRETTTABLE]
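For readers who wish to check the decode, the following editorial sketch applies the key given in the solution footnote to the cipher text of Figure 1. The first row is taken in the corrected form DKQHSHZMNP noted in the earlier footnote; letters are indexed from A = 0 so that key A leaves a letter unchanged.

```python
# Rows of Figure 1, with the first row in the corrected form DKQHSHZMNP.
CIPHER_ROWS = [
    "DKQHSHZMNP", "RCVXUHTEAQ", "XHPUEPPSBK", "TWUJAGDYOJ", "THWCYDZHGA",
    "PZKOXOEYAE", "BOKBUBPIKR", "WWACEJPHLP", "TUZYFHLRYC",
]
KEY = "POIUMOLQNY"   # key quoted in the solution footnote

def vigenere_decode(row, key):
    """Subtract the key letter from each cipher letter, modulo 26 (A = 0)."""
    return "".join(
        chr((ord(c) - ord(k)) % 26 + 65) for c, k in zip(row, key)
    )

for row in CIPHER_ROWS:
    print(row, "->", vigenere_decode(row, KEY))
```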
A more accurate argument would run as follows. For the first column, instead of setting up as rival theories the two possibilities that B is the key and that B is not we can set up 26 rival theories that the key is A or B or … Z, and we may apply the factor principle in the form:-
$$\frac{\text{a posteriori probability of key A}}{\text{a priori probability of key A} \times \text{probability of the data if the key is A}} = \frac{\text{a posteriori probability of key B}}{\text{a priori probability of key B} \times \text{probability of the data if the key is B}} = \cdots$$
The argument to justify this form of factor principle is really the same as for the original form. Let $p_\beta$ be the a priori probability of key $\beta$. Then out of $N$ cases we have $N p_\beta$ cases of key $\beta$. Let $q_\beta$ be the probability of getting the column C with key $\beta$, then when we have rejected the cases where we get columns other than C we find that there are $N p_\beta q_\beta$ cases of key $\beta$ i.e. the a posteriori probability of key $\beta$ is $K p_\beta q_\beta$, where $K$ is independent of $\beta$.
We have therefore to calculate the probability of getting the column C with key $\beta$ and this is simply $\prod_{i=1}^{9} P_{\alpha_i - \beta + 1}$, i.e. the product of the frequencies of the decode letters which we get if the key is $\beta$.
Since the a priori probabilities of the keys are all equal we may say that the a posteriori probabilities are in the ratio $\prod_{i=1}^{9} P_{\alpha_i - \beta + 1}$ i.e. in the ratio $\prod_{i=1}^{9} 26\,P_{\alpha_i - \beta + 1}$ which is more convenient for calculation. The final value for the probability is then
$$\frac{\prod_{i=1}^{9} 26\,P_{\alpha_i - \beta + 1}}{\sum_{\beta} \prod_{i=1}^{9} 26\,P_{\alpha_i - \beta + 1}}.$$
The calculation of the product $\prod_{i=1}^{9} 26\,P_{\alpha_i - \beta + 1}$ may be done by the method recommended before for $\prod_{i=1}^{9} \frac{25\,P_{\alpha_i - \beta + 1}}{1 - P_{\alpha_i - \beta + 1}}$.
The table in Fig 3 was in fact made up for $26\,P_\alpha$. The differences between the two tables would of course be rather slight. The new result is more accurate than the old because of the independence assumption in the original result.
If we only want to know the ratios of the probabilities of the various keys there is no need to calculate the denominator $\sum_{\beta} \prod_{i=1}^{9} 26\,P_{\alpha_i - \beta + 1}$. This denominator has however another importance: it gives us some evidence about other assumptions, such as that the cipher is Vigenère, and that the period is 10. This aspect will be dealt with later (p. ) [Footnote: Forward reference left unresolved in the manuscript.].
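The 26-theory form of the calculation can be sketched as follows. This is an editorial illustration only: the frequencies are an assumption (approximate modern English values, not the count of Fig 2), and letters are indexed from A = 0 so that key A leaves a letter unchanged.

```python
# ASSUMPTION: approximate modern English frequencies (per cent) for A..Z;
# they stand in for the count of Fig 2, which is not reproduced here.
PERCENT = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.8, 4.0, 2.4,
           6.7, 7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8, 1.0, 2.4, 0.15, 2.0, 0.07]
TOTAL = sum(PERCENT)
P = [v / TOTAL for v in PERCENT]

def decode_letter(cipher_letter, key_index):
    """Index of the clear letter; key index 0 (key A) leaves the letter unchanged."""
    return (ord(cipher_letter) - 65 - key_index) % 26

def key_posteriors(column):
    """Return the a posteriori probability of each key A..Z and the denominator."""
    weights = []
    for key_index in range(26):
        w = 1.0
        for c in column:
            w *= 26 * P[decode_letter(c, key_index)]   # product of 26 * P over the decode
        weights.append(w)
    denominator = sum(weights)                          # the sum over all 26 keys
    return [w / denominator for w in weights], denominator

posteriors, denominator = key_posteriors("DRXTTPBWT")
best = max(range(26), key=lambda k: posteriors[k])
print(chr(65 + best), round(posteriors[best], 3), denominator)
```

The equal a priori probabilities of the keys cancel in the normalisation, which is why they do not appear in the code.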
2.2. A letter subtractor problem
A substitution with the period $91 \times 95 \times 99$ is obtained by superimposing three substitutions of periods 91, 95, and 99, each substitution being a Vigenère composed of slides of 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9 [Footnote: Equivalent to keys A to J.]. The three substitutions are known in detail, but we do not know for any given message at what point in the complete substitution to begin. For many messages however we can provide a more or less probable crib. How can we test the probability of a crib before attempting to solve it? It may be assumed that approximately equal numbers of slides 0, 1, …, 9 occur in each substitution.
The principle of the calculation is that owing to the way in which the substitution is built up, not all slides are equally frequent, e.g. a slide of 25 can only be the sum of slides of 9, 8, and 8, or 9, 9, and 7 whilst a slide of 15 can be any of the following
A crib will therefore, other things being equal, be more likely if it requires a slide of 15 than if it requires a slide of 25. The problem is to make the best use of this principle, by determining the probability of the crib with reasonable accuracy, but without spending long over it.
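The contrast between a slide of 25 and a slide of 15 can be seen by listing the ways of making each total from three slides of 0 to 9. The following is an editorial sketch with function names of our own choosing.

```python
from itertools import combinations_with_replacement, product

def unordered_triples(total):
    """Distinct sets of three slides (each 0..9), largest first, adding up to total."""
    digits = range(9, -1, -1)
    return [t for t in combinations_with_replacement(digits, 3) if sum(t) == total]

def ordered_count(total):
    """Number of ordered combinations (first, second, third slide) giving total."""
    return sum(1 for t in product(range(10), repeat=3) if sum(t) == total)

print(unordered_triples(25))        # only 9+9+7 and 9+8+8
print(len(unordered_triples(15)))   # many more ways of making 15
print(ordered_count(25), ordered_count(15))
```

Since the three component slides are chosen independently, it is the ordered count that is proportional to the chance of the slide.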
We have to find the probability of getting a given slide. To do this we can apply several methods.
We can produce a long stretch of key by addition and take a count of the resulting slides. This is obviously a very general method, and requires no special mathematical technique. It may be rather laborious, but by interpreting a small count with common sense one can probably get quite good results.
There are 1000 possible combinations of slides all equally likely viz. 000, 001, …, 999. We can add up the digits in these and take the remainder on division by 26, and then count the number of combinations giving each of the possible remainders.
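We can make use of a trick which might appear to be rather special, but is really applicable to a multitude of problems. Consider the expression
$$f(x) = (1 + x + x^2 + \cdots + x^9)^3.$$
For each possible way of expressing a number $n$ as the sum of three numbers 0, …, 9, say $n = m_1 + m_2 + m_3$, there is a term $x^n$ in $f(x)$, $x^{m_1}$ coming out of the first factor, $x^{m_2}$ out of the second, and $x^{m_3}$ out of the third. Hence the number of ways of expressing $n$ in the form $m_1 + m_2 + m_3$ is the coefficient of $x^n$ in $(1 + x + \cdots + x^9)^3$ i.e. in
$$\left(\frac{1 - x^{10}}{1 - x}\right)^3 = (1 - x^{10})^3 (1 - x)^{-3}.$$
Expanding $(1 - x)^{-3}$ by the binomial theorem
$$(1 - x)^{-3} = 1 + 3x + 6x^2 + 10x^3 + 15x^4 + 21x^5 + \cdots$$
Now multiply by $(1 - x^{10})^3 = 1 - 3x^{10} + 3x^{20} - x^{30}$ and we get
$$f(x) = 1 + 3x + 6x^2 + 10x^3 + 15x^4 + 21x^5 + 28x^6 + 36x^7 + 45x^8 + 55x^9 + 63x^{10} + 69x^{11} + 73x^{12} + 75x^{13} + 75x^{14} + 73x^{15} + 69x^{16} + 63x^{17} + 55x^{18} + 45x^{19} + 36x^{20} + 28x^{21} + 21x^{22} + 15x^{23} + 10x^{24} + 6x^{25} + 3x^{26} + x^{27}.$$
This means to say that the chances of getting totals 0, 1, 2, … are in the ratio 1, 3, 6, 10, … The chances of getting remainders of 0, 1, 2, … on division by 26 are in the ratio 4, 4, 6, 10, 15, … To get true probabilities these must be divided by their total which is conveniently 1000.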
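The second and third methods can be compared directly in software. The editorial sketch below counts the 1000 combinations by brute force, obtains the same numbers as coefficients of the cube of 1 + x + … + x^9, and then reduces the totals modulo 26; all function names are our own.

```python
from itertools import product

def counts_by_enumeration():
    """Count the combinations 000..999 by the sum of their three digits (0..27)."""
    counts = [0] * 28
    for digits in product(range(10), repeat=3):
        counts[sum(digits)] += 1
    return counts

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def counts_by_polynomial():
    """Coefficients of (1 + x + ... + x^9)^3."""
    base = [1] * 10
    return poly_mul(poly_mul(base, base), base)

def fold_mod_26(counts):
    """Add together totals that leave the same remainder on division by 26."""
    folded = [0] * 26
    for total, c in enumerate(counts):
        folded[total % 26] += c
    return folded

counts = counts_by_polynomial()
slides = fold_mod_26(counts)
print(counts[:5], slides[:5], sum(slides))
```

Only the remainders 0 and 1 are affected by the folding, since the largest possible total is 27.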
There are two other methods, both connected with the last method but not relying so much on the special features of the problem. They will be discussed later. [Footnote: No such discussion appears in the manuscript.]
Suppose then that the probabilities have been calculated by one method or the other (as in fact we have done under (c)). We can then estimate the values of cribs. Let us suppose that a possible crib for a message beginning MVHWUSXOWBVMMK was AMBASSADOR so that the slides were 12, 9, 6, 22, 2, 0, 23, 11, 14. The slide of 12 gives us some slight evidence in favour of the crib being right for slides of 12 occur with frequency 0.073 with right cribs, whilst with wrong cribs they occur with frequency only 1/26. The factor in favour of the crib is therefore $26 \times 0.073$ or about 1.9. A similar calculation may be made for each of the slides, but of course the work may be greatly speeded up by having the values of the factors $\frac{26\,p_n}{1000}$ in half decibans tabulated: here $p_n$ is the coefficient of $x^n$ in the above polynomial $f(x) = (1 + x + \cdots + x^9)^3$. The table is given below (Fig 6)
Figure 6. Scores in half decibans of the various slides.
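A table of this kind can be recomputed from the slide counts. The sketch below is an editorial reconstruction under the assumption that each score is 20 log10(26 × count / 1000) rounded to the nearest integer; it is not a transcription of Fig 6, and individual entries may differ from Turing's table. The helper that derives slides from a cipher text and a crib indexes letters from A = 0.

```python
import math
from itertools import product

def slide_counts():
    """Number of the 1000 digit combinations giving each slide 0..25 (sum mod 26)."""
    counts = [0] * 26
    for digits in product(range(10), repeat=3):
        counts[sum(digits) % 26] += 1
    return counts

COUNTS = slide_counts()

def half_decibans(slide):
    """Assumed score: factor 26 * P(slide) expressed in half decibans, rounded."""
    return round(20 * math.log10(26 * COUNTS[slide] / 1000))

def slides(cipher, crib):
    """Slide required at each position for the crib to give the cipher text."""
    return [(ord(c) - ord(p)) % 26 for c, p in zip(cipher, crib)]

table = {s: half_decibans(s) for s in range(26)}
print(table)
print(slides("MVHWUSXOWBVMMK", "AMBASSADOR"))
```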
Evaluating this crib by means of this table we score
i.e. the crib is worse by a factor of than it was before e.g. if the a priori odds of the crib were 2:1 against it becomes 98:1 against. This crib was in fact made up at random i.e. the letters of the cipher text were chosen at random.
Now let us take one made up correctly, i.e. really enciphered by the method in question, but with a randomly chosen key.
This scores 15 so that if it were originally 2:1 against, it now becomes nearly 3:1 on.
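The conversion between a score in half decibans and a change of odds can be sketched as follows (an editorial illustration; the function name is ours). Odds are expressed as a single number, so 2:1 against is 0.5.

```python
import math

def update_odds(prior_odds, score_half_decibans):
    """Multiply the odds by the factor 10**(score/20) that the score represents."""
    return prior_odds * 10 ** (score_half_decibans / 20)

# A crib at 2:1 against (odds 0.5) that scores 15 half decibans.
print(update_odds(0.5, 15))     # a little under 3, i.e. nearly 3:1 on

# Going from 2:1 against to 98:1 against is a factor of 1/49,
# which corresponds to a score of about -33.8 half decibans.
print(20 * math.log10(1 / 49))
```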
Having decided on a crib the natural way to test it is to have a catalogue of the positions in which a given series of slides is obtained if the 91 period component is omitted. We make 91 different hypotheses as to this third component, draw an inference as to what is the part of the slide arising from the components of periods 95 and 99 combined. This we look up in the catalogue. This process is fairly lengthy, and as the scoring of the crib takes only a minute it is certainly worth doing.
2.3. Theory of repeats
Suppose we have a cipher in which there are several very long series of substitutions which can be used for enciphering a message, but that one may sometimes get two messages enciphered with the same series of substitutions (or possibly, the series of substitutions for one message being those for another with some at the beginning omitted). In such a case let us say that the messages fit, or that they fit at such and such a distance, the distance being the number of substitutions which have to be omitted from the one series to obtain the other series. One will frequently want to know whether two messages fit or not, and we may find some evidence about this by examining the repeats between them.
By the repeats between them I mean this. One writes out the cipher texts of the two messages with the letters which are thought to have been enciphered with the same substitution under one another. One then writes under these messages a series of letters O and X, an O being written where the cipher texts differ and an X where they agree. The series of letters O and X will begin where the second message begins and end where the first to end ends. This series of letters O and X may be called the repetition figure. It may be completed by adding at the ends an indication of how many letters there are which do not overlap, and which message they belong to.
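The construction of a repetition figure can be sketched in code. This is an editorial illustration: the two strings below are hypothetical, made up for the purpose, and `offset` is our name for the distance at which the second message is assumed to start relative to the first.

```python
def repetition_figure(first, second, offset=0):
    """Compare two cipher texts letter by letter over their overlap.

    `second` is assumed to begin `offset` letters after `first` begins.
    An X is written where the texts agree and an O where they differ.
    """
    overlap_of_first = first[offset:]
    # zip stops at the shorter text, i.e. where the first message to end ends.
    return "".join(
        "X" if a == b else "O" for a, b in zip(overlap_of_first, second)
    )

# Hypothetical cipher texts, for illustration only.
first = "ABCDEFGH"
second = "XCDQFGHZZ"
figure = repetition_figure(first, second, offset=1)
print(figure)                               # OXXOXXX
print(figure.count("X"), "repeats in an overlap of", len(figure))
```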
As an example:
On the whole one expects that a fit is more likely to be right the more letters X there are in the repetition figure, and that long series of letters X are especially desirable. This is because it would not be very unusual for two fairly common words to lie directly under one another when the clear texts are written out, thus
If the corresponding cipher texts really fit, i.e. if the letters in the same column are enciphered with the same substitution, then the condition for an X in the repetition figure of the cipher texts is that there be an X in the repetition figure of the corresponding clear text. Now series of several consecutive letters X can occur quite easily as above by two identical words coming under one another, or by such combinations as
if the messages really fit, but if not they can only occur by complete coincidence. One therefore tends to believe that there is a fit when one gets such series of letters X. As regards single cases of X the value of them is not so clear, but one can see that if $P_\alpha$ is the frequency of the letter $\alpha$ in plain language then the frequency of letters X as a whole in comparison of plain language with plain language is $\sum_\alpha P_\alpha^2$, whilst for wrong fits of cipher text it is 1/26 which is necessarily less. Given a sufficiently long repetition figure one should therefore be able to tell whether it is a fit or not simply by counting the letters X and O.
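The comparison between the sum of squared frequencies and 1/26 can be checked numerically. In this editorial sketch the frequency table is an assumption (approximate modern English values, not Turing's count), so the exact figure is illustrative only.

```python
# ASSUMPTION: approximate modern English frequencies (per cent) for A..Z.
PERCENT = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.8, 4.0, 2.4,
           6.7, 7.5, 1.9, 0.1, 6.0, 6.3, 9.1, 2.8, 1.0, 2.4, 0.15, 2.0, 0.07]

def repeat_rate(frequencies):
    """Chance that two letters drawn independently from the distribution agree."""
    total = sum(frequencies)
    return sum((f / total) ** 2 for f in frequencies)

plain_rate = repeat_rate(PERCENT)      # right fit: plain language against plain language
random_rate = repeat_rate([1] * 26)    # wrong fit: all letters equally frequent

print(round(plain_rate, 4), round(random_rate, 4))
```

The sum of squares is smallest when all 26 frequencies are equal, which is why 1/26 is necessarily the lesser value.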
So much is well known. The real point of this section is to show these ideas can be developed into an accurate method of estimating the probabilities of fits.
2.3.1. Simple form of theory
The complete theory takes account of the various possible lengths of repeat. As this theory is somewhat complicated it will be as well to give first two simplified forms of the theory. In both cases the simplification arises by neglecting a part of the evidence. In the first simplified form of theory we neglect all evidence except the number of letters X and the number of letters O. In the other simplified form the evidence is the number of series of (say) four consecutive letters X in a repetition figure.
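When our evidence is just the number of times X occurs in the repetition figure ($n$, let us say) and the length of the repetition figure ($N$, say), then the factor in favour of the fit is
$$\frac{\text{probability of } n \text{ letters X in a repetition figure of length } N \text{ if the fit is right}}{\text{probability of } n \text{ letters X in a repetition figure of length } N \text{ if the fit is wrong}}$$
As an approximation we may assume that the numerator of this expression has the same value as if the right repetition figures were produced letter by letter by independent random choices, with a certain fixed probability of getting an X at each stage. This probability will have to be $\beta$, the frequency of letters X in comparisons of plain language with plain language. The numerator is then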
which we may write as . Now let us denote by the th symbol of the given repetition pattern and put and . Then , the probability of getting the repetition pattern is which simplifies to . We may do a similar calculation for the denominator, but here we must take since all letters occur equally frequently in the cipher. The denominator is then
In dividing to find the factor for the fit cancels out, leaving
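In other words we score a factor of $26\beta$ for an X and a factor of $\frac{26(1-\beta)}{25}$ for an O, where $\beta$ is the probability of an X at each position of a right fit. More convenient is to regard it as $10\log_{10}\frac{25\beta}{1-\beta}$ decibans for an X and $10\log_{10}\frac{26(1-\beta)}{25}$ decibans per unit length of repetition figure (per unit overlap).
An alternative argument, leading to the same result, runs as follows. Having decided to neglect all evidence except the overlap and the number of repeats we pretend that nothing else matters, i.e. that the form of the figure is irrelevant. In this case we can regard each letter of the repetition figure as independent evidence about the fit. If we get an X the factor for the fit is
$$\frac{\text{probability of an X if the fit is right}}{\text{probability of an X if the fit is wrong}} = \frac{\beta}{1/26}$$
i.e. $26\beta$. Similarly the factor for an O is $\frac{1-\beta}{25/26} = \frac{26(1-\beta)}{25}$.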
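Written out in full, the cancellation runs as follows. This is a sketch in assumed notation: $C$ stands for the number of repetition figures of length $N$ containing $n$ letters X, and $\beta$ for the probability of an X at each position of a right fit. Neither symbol is guaranteed to match the manuscript.

```latex
\begin{aligned}
\text{numerator}   &\approx C\,\beta^{n}(1-\beta)^{N-n},\\
\text{denominator} &= C\left(\tfrac{1}{26}\right)^{n}\left(\tfrac{25}{26}\right)^{N-n},\\
\text{factor}      &= (26\beta)^{n}\left(\frac{26(1-\beta)}{25}\right)^{N-n},\\
10\log_{10}(\text{factor}) &= n\cdot 10\log_{10}\frac{25\beta}{1-\beta}
                              + N\cdot 10\log_{10}\frac{26(1-\beta)}{25}.
\end{aligned}
```

The last line is the same score regrouped: each X contributes the difference between the X and O scores, and every position of the overlap contributes the O score.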
In either form of argument it is unnecessary to calculate the number of repetition figures of the given length containing the given number of letters X. In this particular case there is no particular difficulty about it: it is the binomial coefficient. In some similar problems this cancelling out is a great boon, as we might not be able to find any simple form for the factor which cancels. The cancelling out is a normal feature of this kind of problem, and it seems quite natural that it should happen when we think of the second form of argument in which we think of the evidence as consisting of a number of independent parts.
The device of assuming, as we have done here, that the evidence which is not available is irrelevant can often be used and usually leads to good results. It is of course not supposed that the evidence really is irrelevant, but only that the error resulting from the assumption when used in this kind of way is likely to be small.
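The simple form of the theory amounts to a two-term score, which might be computed as in the sketch below. The function names are hypothetical, `beta` is an assumed symbol for the probability of an X in a right fit, and the value 0.07 used in the example is a made-up figure, not a statistic taken from the paper.

```python
import math


def simple_fit_score(figure, beta):
    """Score in decibans for a repetition figure made of the letters X and O.

    Uses only the number of letters X and the length of the figure:
    each X scores 10*log10(25*beta/(1-beta)) and each unit of overlap
    scores 10*log10(26*(1-beta)/25).
    """
    n = figure.count("X")   # number of repeats
    overlap = len(figure)   # length of the repetition figure
    per_x = 10 * math.log10(25 * beta / (1 - beta))
    per_unit = 10 * math.log10(26 * (1 - beta) / 25)
    return n * per_x + overlap * per_unit


def factor_from_decibans(decibans):
    """Convert a score in decibans back into a factor."""
    return 10 ** (decibans / 10)


# Hypothetical example: two repeats in an overlap of five.
example_score = simple_fit_score("XOOXO", beta=0.07)
example_factor = factor_from_decibans(example_score)
```

Working in decibans means that the scores of separate comparisons can simply be added.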
2.3.2. Second simplified form of theory
In the second simplified form of theory we take as our evidence that a particular part of the repetition figure is OXXXXO (say, or alternatively OXXXXXO say). The factor is then
$$\frac{\text{frequency of OXXXXO in right repetition figures}}{\text{frequency of OXXXXO in wrong repetition figures}}$$
The denominator is
$$\left(\frac{25}{26}\right)^{2}\left(\frac{1}{26}\right)^{4}$$
and the numerator may be estimated by taking a sample of language hexagrams and counting the number of pairs that have the repetition figure OXXXXO. The expectation of the number of such pairs is the sum for all pairs of the probabilities of those pairs having the desired repetition figure i.e. is the number of such pairs (viz $\tfrac{1}{2}N(N-1)$ where $N$ is the size of the sample) multiplied by the frequency of OXXXXO repetition figures. This frequency may therefore be obtained by division if we equate the expected number of these repetition figures to the actual number.
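The estimate described above can be sketched as follows. The function names and the three-hexagram sample are hypothetical, and the sample is assumed to consist of equal-length strings of letters; a real estimate would need a far larger sample.

```python
from itertools import combinations


def repetition_figure(text_a, text_b):
    """Return X where the letters in a column agree and O where they differ."""
    return "".join("X" if a == b else "O" for a, b in zip(text_a, text_b))


def pattern_factor(hexagrams, pattern="OXXXXO"):
    """Estimate the factor for one repetition pattern from a sample.

    Numerator: proportion of unordered pairs in the sample whose repetition
    figure equals the pattern. Denominator: probability of the pattern for
    a wrong fit, where every letter is equally likely.
    """
    pairs = list(combinations(hexagrams, 2))  # N(N-1)/2 pairs
    hits = sum(1 for a, b in pairs if repetition_figure(a, b) == pattern)
    right_frequency = hits / len(pairs)

    wrong_frequency = 1.0
    for symbol in pattern:
        wrong_frequency *= 1 / 26 if symbol == "X" else 25 / 26

    return right_frequency / wrong_frequency


# Made-up sample: only the first two hexagrams give OXXXXO.
sample = ["ABCDEF", "XBCDEY", "QRSTUV"]
estimated_factor = pattern_factor(sample)
```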
2.3.3. General form of theory
It is not of course possible to have statistics of every conceivable repetition figure. We must make some assumptions to reduce the variety that need to be considered. The following assumption is theoretically very convenient, and also appears to be a very good approximation.
The probabilities of repeats at two points known to be separated by a point where there is known to be no repeat are independent.
We may also assume that the probability of a repeat is independent of anything but the repetition figure in this neighbourhood. (We may however as a refinement produce different positions in a message). We can therefore think of repetition figures as being produced by selecting the symbols of the figure consecutively, the probability of getting an X at each stage being determined by the repetition figure from the point in question back as far as the last O. Sometimes this will take us back as far as the beginning of the message, and will include the number telling us how many more letters there are which do not repeat at all. We need in practice only distinguish two cases, where this number is 0 and when it is more. We may also neglect the question as to which message occurs first. We therefore have to distinguish the following cases
|OX||some X||none X|
|OXX||some XX||none XX|
|OXXX||some XXX||none XXX|
The entries opposite the repetition figures are the notations we are adopting for the probability of getting another X following such a figure. Strictly speaking we should also bring in a notation for the probability of the message coming to an end after any given repetition figure. As the repeats at the end of a comparison do not appear to behave very differently from those in the main part of the message I shall neglect this complication by assuming that the probability of getting an O added to the probability of getting an X is 1, and that afterwards one cuts off the end of the series arbitrarily.
Let us calculate the factor for the repeat figure (Footnote 9: In the manuscript, Turing squeezes the figure into three lines by spilling into the margins and use of pen and ink. The typeset equivalent is unreadable, so the figure has been split into left and right components. Reassemble as: none X X X X O | O | O | X O | X X X O | O | X X | some)
Underneath each symbol has been written the probability that one would get that symbol, knowing the ones which precede, both for the case of a right and of a wrong repetition figure. The factor for the fit is the product of the first row divided by the product of the second. It is convenient to split this up as indicated by the vertical lines into the product of
and this product may be put into the form of the product of
- which we call the factor for an initial tetragramme repeat level,
- the factor for a single repeat,
- the factor for a trigramme,
- the correction for a final bigramme,
- the factor for an overlap of 16,
- the factor for a trigramme.
We shall neglect the correction for a final bigramme (or whatever it may be). It is in any case rather small, and vanishes if the repetition figure ends with O; also with our conventions the whole question of the ends of repetition figures has been left rather in doubt.
Now let us put (Footnote 10: The manuscript has a pencilled note beside indicating it is to be read as . We presume that this also means that should be , and should be . However, these are not indicated and no changes are made in the subsequent text. We leave the text unchanged.)
The values of the can be obtained as follows. We take a number of plain language messages and leave out two or three words at the beginning. Then combine the messages to form one long message; this message may be made to eat its own tail i.e. it may be written round a circle. If the message were compared with itself in every possible position, except level, we should expect to get repetition figures which when divided up as shown by vertical lines after each O, containing parts which consist of r symbols O, or as we may say actual r-gramme repeats, where h is the probability of an O .
The values of can be calculated given the apparent number of r-gramme repeats for each . This apparent number of r-gramme repeats is the number of series of r consecutive symbols X in the repetition figures regardless of what precedes or follows the series.
By considering the ways in which an actual repeat can give rise to the apparent repeat of various lengths we see that
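The distinction between actual and apparent repeats can be illustrated with a short sketch. It rests on one counting fact: a maximal run of s letters X contains s - r + 1 series of r consecutive letters X. The function names are hypothetical, and the figure is treated as a straight line rather than written round a circle.

```python
import re
from collections import Counter


def actual_repeats(figure):
    """Count maximal runs of X by length (the actual r-gramme repeats)."""
    return Counter(len(run) for run in re.findall(r"X+", figure))


def apparent_repeats(figure, r):
    """Count series of r consecutive X, whatever precedes or follows them."""
    target = "X" * r
    return sum(1 for i in range(len(figure) - r + 1)
               if figure[i:i + r] == target)


def apparent_from_actual(actual, r):
    """Recover the apparent count from the actual counts.

    Each actual repeat of length s >= r gives s - r + 1 apparent repeats.
    """
    return sum((s - r + 1) * count for s, count in actual.items() if s >= r)


# Made-up figure with runs of length 3, 1 and 2.
figure = "OXXXOXOXX"
actual = actual_repeats(figure)
```

Since the relation is triangular, the actual counts can be recovered from the apparent ones by working downwards from the longest repeat observed.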
The calculation of may perhaps best be done by comparing the beginners of a number of messages with the long circular message, and the values of by comparing the beginners among themselves. A similar technique of actual and apparent numbers of repeats can be used. I shall not go into this in detail. The formulae required may now be assembled.
may be calculated as follows. From the identity
2.4. Transposition ciphers
2.4.1. A probability problem
In making calculations about substitution ciphers we have often found it useful to treat the plain language as if it were produced by independent choices for the letters, using certain fixed frequencies with which the letters are chosen. Our method for Vigenère and one of the simplified forms of repeat theory could be based on this sort of assumption. With a transposition cipher however such an assumption would be useless or worse than useless, for it would result in the conclusion that all transpositions were equally likely. We have therefore to take a slightly less crude assumption, and the one which suggests itself is that the letters forming the plain language are chosen consecutively, the probability of getting a particular letter depending only on what the letter is and what the preceding letter was. It is easily verified that if $p_{\alpha\beta}$ is the proportion of bigrammes $\alpha\beta$ in plain language and $p_\alpha$ the frequency of the letter $\alpha$ then the probability of a letter $\beta$ following an $\alpha$ is $p_{\alpha\beta}/p_\alpha$. The probability of a piece of plain language of length $r$ letters saying $\alpha_1\alpha_2\ldots\alpha_r$ is then
$$p_{\alpha_1}\,\frac{p_{\alpha_1\alpha_2}}{p_{\alpha_1}}\,\frac{p_{\alpha_2\alpha_3}}{p_{\alpha_2}}\cdots\frac{p_{\alpha_{r-1}\alpha_r}}{p_{\alpha_{r-1}}}$$
which may also be written as
$$p_{\alpha_1}p_{\alpha_2}\cdots p_{\alpha_r}\prod_{i=1}^{r-1}\frac{p_{\alpha_i\alpha_{i+1}}}{p_{\alpha_i}\,p_{\alpha_{i+1}}}$$
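The two ways of writing this probability can be checked against one another numerically. The sketch below uses hypothetical function names and a made-up two-letter language; `letter_freq` and `bigram_freq` are assumed to be dictionaries of letter and bigramme frequencies.

```python
import math


def chain_probability(text, letter_freq, bigram_freq):
    """Frequency of the first letter times the probability of each
    later letter given the one before it."""
    probability = letter_freq[text[0]]
    for first, second in zip(text, text[1:]):
        probability *= bigram_freq[first + second] / letter_freq[first]
    return probability


def ratio_probability(text, letter_freq, bigram_freq):
    """Product of the letter frequencies times one ratio
    (bigramme frequency over the two letter frequencies) per bigramme."""
    probability = math.prod(letter_freq[letter] for letter in text)
    for first, second in zip(text, text[1:]):
        probability *= bigram_freq[first + second] / (
            letter_freq[first] * letter_freq[second])
    return probability


# Made-up frequencies for a language with two letters only.
letters = {"A": 0.6, "B": 0.4}
bigrams = {"AA": 0.3, "AB": 0.3, "BA": 0.3, "BB": 0.1}
first_form = chain_probability("ABBA", letters, bigrams)
second_form = ratio_probability("ABBA", letters, bigrams)
```

The second form is the useful one for transpositions: rearranging the letters leaves the product of letter frequencies unchanged, so only the bigramme ratios distinguish one arrangement from another.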
We may also calculate the probability of a given piece of plain language having certain given letters in given places, the remainder of the message being unspecified. The probability is given by
and if the data is that the known letters are
it is approximately (Footnote 11: The manuscript has as the first term , a pencilled annotation indicates that the is to be read as . This substitution has been made in the text.)
A more or less rigorous deduction of this approximation from the assumptions above is given at the end of the section. For the present let us see how it can be applied. If we have two theories about the transposition of which the one requires the above pattern of letters, and the other brings the same letters in to positions in which no two of them are consecutive, then the factor in favour of the first as compared with the second is
We can apply this straightforwardly to the case of a simple transposition by columns. The following text is known to be a simple transposition of a certain type of German text with a key length of not more than 15. (Footnote 12: As for the Vigenère problem above, Turing’s statement of the ciphertext is slightly different from that which he scores for decryption. The second line in the ciphertext below begins NLTS, however, this changes to NITS in the scoring example in Figure 7 below. See also the notes accompanying the cleartext.)
S A T P T W S F A S T A U T E E A I E U F H W T J T D D G C
N L T S E F C U I E B O E Y Q H G T J T E E F I E O R T A R
U R N L N N N N A I E O T U S H L E S B F B R N D X G N J H
U A N W R
To solve this transposition, we may try comparing the first six letters of the message, S A T P T W, which we know form part of one column, with each other series of six letters in the message, for we know that one such comparison will give entirely bigrammes occurring in the decode. We may try first
S F
A A
T S
P T
T A
W U
The factor for a transposition which brings these letters together, as compared with one which leaves them apart is
$$\frac{p_{SF}}{p_S\,p_F}\cdot\frac{p_{AA}}{p_A\,p_A}\cdot\frac{p_{TS}}{p_T\,p_S}\cdot\frac{p_{PT}}{p_P\,p_T}\cdot\frac{p_{TA}}{p_T\,p_A}\cdot\frac{p_{WU}}{p_W\,p_U}$$
where each term is the frequency of a bigramme divided by the frequencies of its two letters. By using a table of values of
$$20\log_{10}\frac{p_{\alpha\beta}}{p_\alpha\,p_\beta}$$
made up for the type of traffic in question, and given to the nearest integer (table of values of $p_{\alpha\beta}/(p_\alpha p_\beta)$ expressed in half-decibans) we get the product by addition. Such a table is shown in Fig 6. The scores for these particular columns are SF -7, AA -7, TS -2, PT -10, TA -3, WU -13, totalling -36. If we consider this combination as a priori about 100:1 against (there are 95 letters in the message) it is a posteriori about 3000:1 against.
Similar scoring may be done for every possible comparison of S A T P T W with six consecutive letters of the message. The comparison may be made both with S A T P T W as earlier and as later column; one may also use the last six letters of the message H U A N W R.
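The scoring of one column against every series of consecutive letters can be sketched as below. The function names and the toy table are hypothetical: a real table would be built, like Fig 6, from the bigramme statistics of the traffic in question. For simplicity the scan also scores positions that overlap the column itself, which would be discarded in practice.

```python
import math


def half_deciban_score(p_bigram, p_first, p_second):
    """Table entry: 20*log10 of the bigramme frequency over the product
    of the two letter frequencies, rounded to the nearest integer."""
    return round(20 * math.log10(p_bigram / (p_first * p_second)))


def column_pair_score(earlier, later, table, missing=0):
    """Add the table entries for the bigrammes formed by two columns."""
    return sum(table.get(a + b, missing) for a, b in zip(earlier, later))


def scan_later_columns(message, earlier, table, missing=0):
    """Score `earlier` against every run of the same length in the message.

    Returns a dictionary from the starting position of the later column
    (counting from 0) to the total score in half-decibans.
    """
    width = len(earlier)
    return {start: column_pair_score(earlier,
                                     message[start:start + width],
                                     table, missing)
            for start in range(len(message) - width + 1)}


# Toy table with two made-up entries; all other bigrammes score 0.
toy_table = {"AC": 4, "BD": -3}
scores = scan_later_columns("ABCD", "AB", toy_table)
```

Scoring the same column as the later one only requires swapping the first two arguments of `column_pair_score`.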
The results of doing this are shown in Fig 7. The message has been written out vertically. The first column of figures after the message gives the score for S A T P T W as earlier column, entered against the first letter of the later column, e.g. the -36 as calculated above gets entered against the F of F A S T A U. The second column after the message consists of the scores for H U A N W R as first column [and the column before the message gives scores for H U A N W R as second column]. (Footnote 13: The passage from "and the column" to the end of the sentence has the note in pencil: I doubt it - S.W.) One of these columns has been worked out in detail but in the other two crosses have been put in where the scores are very bad.
The scores which eventually turned out to be right are ringed. The fourth comparison, which did not have to be done, scored very badly viz. -27. Amongst the good scores which were wrong there was one score of 37. It was not difficult to see that this one was wrong as most of the score came from W O which requires Z to precede it, and there was no Z in the message. Apart from this fact the comparison was about evens, although if we take into account the fact that there was no better score it would be better. (Footnote 14: Using Turing’s scoring recommendations and a key length of 12 with sequence 5, 11, 8, 7, 3, 10, 6, 12, 9, 4, 1, 2, a cleartext emerges: BNTO SJJ ALBA RFJ STATT IN OST B HEUTE DEN ETA RUFS PEDUNYAR NACHT FGFNQUUDNUL WICH AHTR X WIESEN WI GEN GRESFOITE TE. With Turing’s original statement of the ciphertext, as noted above, GRESFOITE becomes GRESFOLTE. Turing scores for the I and not for L, although it makes no difference in the decision to align bigrams.) [We have already had a case of this kind of thing in connection with Vigenère; if the various positions are a priori equally likely and the factors are then the value for the probability of the th alternative is better than ].
2.4.2. The Probability Formula
We can put
This is not true, but it is true that except for very special values for ,
and the convergence is rather rapid.
To prove this I shall assume that the eigenvalues of are all different in modulus. In this case we can find a matrix with unit determinant, such that is in the diagonal form
since we have
That is, for each provides a solution of
with . Conversely if we have any solution of (2.3) then for some and all , for as is non singular we can find numbers such that
and then substituting in (2.3) we get
Which, since is nonsingular implies .
As the series are all different there is only one value of for which and so for all . Now putting for all we see that one member of the series is 1, for (2.3) is certainly satisfied.
I shall prove that the remaining eigenvalues satisfy . We first prove that if then . This follows by multiplying (2.3) on each side by and summing. Since
Next we show that each for which is real and positive. Let satisfy (2.3) with ; then the eigenvalue for is and so
If has been chosen so small that for all then the L.H.S. is positive for the coefficients in the matrix are positive, whereas the R.H.S. is negative for suitably chosen , unless . If now we may take it that is real for each . As it must satisfy it is negative for some , but then
and if is chosen so that for all the L.H.S is positive whereas the R.H.S is negative for sufficiently large .
All the eigenvalues therefore satisfy as the eigenvalues are all different in modulus this means that except for one value of . Then as tends to a matrix which has only one element different from 0, and that a 1 on the diagonal, say in position .
Calling this matrix the series of matrices tends to the matrix . This matrix is the one and only one which satisfies and is therefore the one whose coefficient is .
2.4.3. Another probability problem
There is another probability problem that arises in connection with simple transpositions. With a message of length , and a key length of what is the probability that the th letter will be at the bottom of a column? Let be the length of the short columns i.e. , and let . Then if the th letter is at the bottom of the th column we must have
and there will be short and long columns among these first columns. There are (Footnote 15: Turing is using Binomial Coefficient notation in this section.)
ways in which the short and long columns can be arranged consistently with this, and altogether ways in which the columns can be arranged, so that the probability of the letter being at the bottom of a column is
There will normally be very few terms in the sum.
Let us take the case of the message of length 133 and consider the 45th letter, assuming the key length is between 10 and 20 (inclusive).
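The counting argument of this section can be sketched as follows for the case just stated. The function and variable names are hypothetical, and the sketch assumes that every arrangement of the long and short columns is equally likely. It counts, for each number of columns preceding the cut, the arrangements in which those columns hold exactly the required number of letters.

```python
from fractions import Fraction
from math import comb


def bottom_of_column_probability(message_length, key_length, position):
    """Probability that the letter at `position` (counting from 1) is the
    last letter of a column in a simple columnar transposition."""
    short = message_length // key_length              # length of short columns
    long_columns = message_length - short * key_length
    short_columns = key_length - long_columns
    total = comb(key_length, long_columns)            # all arrangements

    ways = 0
    for j in range(1, key_length + 1):                # columns up to the cut
        long_used = position - j * short              # long columns among them
        short_used = j - long_used
        if 0 <= long_used <= long_columns and 0 <= short_used <= short_columns:
            ways += comb(j, long_used) * comb(key_length - j,
                                              long_columns - long_used)
    return Fraction(ways, total)


# The case in the text: message of length 133, 45th letter, keys 10 to 20.
probabilities = {key: bottom_of_column_probability(133, key, 45)
                 for key in range(10, 21)}
```

For most key lengths at most one value of `j` passes the test, which agrees with the remark that the sum normally has very few terms.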