Keyword Recovery with the χ2 Method

If the estimated keyword length is correct, each coset constructed is encrypted with the same letter (see Keyword Length Estimation with Index of Coincidence). The following is an encryption with keyword BOY.

MICHIGAN TECHNOLOGICAL UNIVERSITY
BOYBOYBO YBOYBOYBOYBOY BOYBOYBOYB
NWAIWEBB RFQFOCJPUGDOJ VBGWSPTWRZ

Since the keyword length is 3, we have the following cosets.

MICHIGAN TECHNOLOGICAL UNIVERSITY
N I B F O P D V W T Z
W W B Q C U O B S W
A E R F J G J G P R

If A is considered to be the 0-th letter, the following table has the positions of all 26 English letters.

0 1 2 3 4 5 6 7 8 9 10 11 12
A B C D E F G H I J K L M
13 14 15 16 17 18 19 20 21 22 23 24 25
N O P Q R S T U V W X Y Z

Since the first letter in the keyword is B, the plaintext letters corresponding to B are shifted to the right one position so that M becomes N and A becomes B. Since the second letter in the keyword is O, the plaintext letters corresponding to O are shifted to the right 14 positions so that I becomes W, O becomes B, and so on. Similarly, the third coset is obtained by shifting the plaintext letters corresponding to Y to the right 24 positions.

In the case of only knowing the three cosets, we need to shift each of them to the left some positions to get the plaintext back. More precisely, each coset is shifted to the left 1 position, 2 positions, ..., and 25 positions. Note that we do not need to shift 0 position because it is the coset itself. Since each shift produces a possible decryption of the coset, there are 26 different possibilities. If we have k cosets, the total number of shift combinations is 26k, which can be very large even if k is small. For example, if the possible keyword length is 8, there are 268 = 208,827,064,576 possible shift combinations (or possible keywords). With such a large number of combinations, it is very difficult to verify which shift of a coset can yield the correct keyword letter. Consequently, we need a better method rather than the use of brute force.

There is a simple method based on the frequency of letters in English. Since each coset is encrypted by the same letter, its frequency does not look like a typical English text. Shifting a coset changes its frequency. Of the 26 possible shifts, one can yield the original plaintext whose frequency should be very similar to the frequency of English. Therefore, we may compare the frequency of each shift against the frequency of English, and the shift that produces a frequency closest to the English frequency is likely to be the correct shift. But, what is the meaning of "closest"? Fortunately, in statistics there are methods to measure goodness-of-fit, one of which is the χ2 method. Given a set of observed values f1, f2, ..., fn and a set of corresponding known/expected values F1, F2, ..., Fn, the χ2 is computed as follows:

In our case, the F's are the values in the English frequency table, which are known, and the f's are the frequency obtained from a shift. The shift of a coset that produces the smallest χ2 value is the one whose frequency is the closet to that of the English language. However, the shift corresponding to the smallest χ2 may not always be the correct choice. In general, we have to examine several shifts that correspond to some small χ2 values.

Consider the second coset WWBQCUOBSW discussed earlier. The count, frequency fi and frequency Fi are shown below. The computed χ2 is 17.0130.

Letter A B C D E F G H I J K L M
Count 0 2 1 0 0 0 0 0 0 0 0 0 0
fi 0 0.2 0 0 0 0 0 0 0 0 0 0 0
Fi 0.082 0.014 0.028 0.038 0.131 0.029 0.020 0.053 0.064 0.001 0.004 0.034 0.025
Letter N O P Q R S T U V W X Y Z
Count 0 1 0 1 0 1 0 1 0 3 0 0 0
fi 0 0.1 0 0.1 0 0.1 0 0.1 0 0.3 0 0 0
Fi 0.071 0.080 0.020 0.001 0.068 0.061 0.105 0.025 0.009 0.015 0.002 0.020 0.001
χ2 17.0130

If the coset WWBQCUOBSW is shifted to the left by one position, we have VVAPBTNARV and the following table. The computed χ2 is 10.8557.

Letter A B C D E F G H I J K L M
Count 2 1 0 0 0 0 0 0 0 0 0 0 0
fi 0.2 0.1 0 0 0 0 0 0 0 0 0 0 0
Fi 0.082 0.014 0.028 0.038 0.131 0.029 0.020 0.053 0.064 0.001 0.004 0.034 0.025
Letter N O P Q R S T U V W X Y Z
Count 1 0 1 0 1 0 1 0 3 0 0 0 0
fi 0.1 0 0.1 0 0.1 0 0.1 0 0.3 0 0 0 0
Fi 0.071 0.080 0.020 0.001 0.068 0.061 0.105 0.025 0.009 0.015 0.002 0.020 0.001
χ2 10.8557

The following table shows the 26 χ2 values of each coset with the smallest one in boldface. The smallest χ2 of coset 1 is 1.9532 which corresponds to the letter B (i.e., shifting to the left one position). The smallest χ2 of coset 2 is 2.1695 which corresponds to the letter O (i.e., shifting to the left 14 positions). The smallest χ2 of coset 3 is 2.3933 which corresponds to the letter Y (i.e., shifting to the left 24 positions). In other words, the first, second and third cosets are encrypted by B, O and Y, respectively. Therefore, we are lucky and find the correct keyword BOY.

Shift Corresponding Letter Coset 1 χ2 Coset 2 χ2 Coset 3 χ2
0 A 12.6808 17.0130 33.4114
1 B 1.9532 10.8557 47.2982
2 C 16.6228 61.7972 3.3558
3 D 10.2763 15.4671 9.8983
4 E 24.9700 35.7427 4.4140
5 F 16.1760 17.4307 19.7483
6 G 29.5341 82.8543 22.8300
7 H 2.5481 14.5767 66.6135
8 I 6.3800 4.3482 41.4530
9 J 20.3966 9.5387 26.0354
10 K 8.4236 6.4101 62.5271
11 L 14.2454 43.6233 8.4614
12 M 9.8439 31.8807 26.1736
13 N 15.2270 70.7267 3.6981
14 O 11.8107 2.1695 14.6269
15 P 19.8472 16.2274 11.9170
16 Q 22.7962 6.1626 51.0042
17 R 8.1086 31.2416 10.6299
18 S 19.9131 34.4761 56.6600
19 T 4.6458 29.8607 36.4451
20 U 18.6617 4.8624 36.3898
21 V 3.3357 27.0150 14.0996
22 W 21.9697 3.7015 21.4566
23 X 18.5799 118.3588 33.7453
24 Y & 19.8023 14.2303 2.3933
25 Z 19.8783 55.4882 19.8128

Suppose MICHIGAN TECHNOLOGICAL UNIVERSITY is again encrypted to the following with an unknown keyword of length 4:

YITZU GRFFE TZZOC GSITS XUEAH EIKUT P

The χ2 values of all shifts for each coset are shown in the table below. To save space, this table only shows the four smallest χ2 values of each coset.

Shift Letter Coset 1 χ2 Coset 2 χ2 Coset 3 χ2 Coset 4 χ2
0 A
2.2259 2.2292
1 B
2.9978

2 C
5.6245 3.6112
5 F 4.1750


6 G
5.2742

12 M 3.9131

3.4856
14 O


4.3258
15 P

1.9889
17 R

5.9857
18 S 4.6828

2.0008
20 U 2.3036


25 Z


3.6293

The smallest χ2 values suggest the keyword to be UAPS and the decrypted result is EIEHAGCNLEEHFONOYIEADUPINETSATA. This is certainly not correct. If we align the plaintext, ciphertext and decrypted text together as follows, we should be able to see the problem.

MICHIGAN TECHNOLOGICAL UNIVERSITY
YITZUGRF FETZZOCGSITSX UEAHEIKUTP
EIEHAGCN LEEHFONOYIEAD UPINETSATA

It is obvious that the second shift corresponds to A and the fourth shift corresponds to S are correct, because the letters in corresponding positions of the plaintext and ciphertext are identical. However, the first shift U and the third shift P are not. Therefore, the unknown keyword looks like the following:

F A A S
M C
S P
U R

The following has all 16 possible combinations:

FAAS MAAS SAAS UAAS
FACS MACS SACS UACS
FAPS MAPS SAPS UAPS
FARS MARS SARS UARS

The following shows the decryption with each of the 16 possible keywords:

FAAS TITHPGRN AETHUOCONITAS UEICEISPTP FACS TIRHPGPN AERHUOAONIRAS UCICEGSPTN
FAPS TIEHPGCN AEEHUONONIEAS UPICETSPTA FARS TICHPGAN AECHUOLONICAS UNICERSPTY
MAAS MITHIGRN TETHNOCOGITAL UEIVEISITP MACS MIRHIGPN TERHNOAOGIRAL UCIVEGSITN
MAPS MIEHIGCN TEEHNONOGIEAL UPIVETSITA MARS MICHIGAN TECHNOLOGICAL UNIVERSITY
SAAS GITHCGRN NETHHOCOAITAF UEIPEISCTP SACS GIRHCGPN NERHHOAOAIRAF UCIPEGSCTN
SAPS GIEHCGCN NEEHHONOAIEAF UPIPETSCTA SARS GICHCGAN NECHHOLOAICAF UNIPERSCTY
UAAS EITHAGRN LETHFOCOYITAD UEINEISATP UACS EIRHAGPN LERHFOAOYIRAD UCINEGSATN
UAPS EIEHAGCN LEEHFONOYIEAD UPINETSATA UARS EICHAGAN LECHFOLOYICAD UNINERSATY

Therefore, the correct keyword is MARS. Compared with 264 = 456,976, 16 is significantly smaller and can use brute force to recover the keyword and the plaintext. Note that the successful rate is much higher if the ciphertext is long and the keyword is relatively short.

Example 1

The following ciphertext was discussed in Example 2 of Index of Coincidence.

VVQGY TVVVK ALURW FHQAC MMVLE HUCAT WFHHI PLXHV UWSCI GINCM
UHNHQ RMSUI MHWZO DXTNA EKVVQ GYTVV QPHXI NWCAB ASYYM TKSZR
CXWRP RFWYH XYGFI PSBWK QAMZY BXJQQ ABJEM TCHQS NAEKV VQGYT
VVPCA QPBSL URQUC VMVPQ UTMML VHWDH NFIKJ CPXMY EIOCD TXBJW
KQGAN

In Example 3 of Keyword Length Estimation with Index of Coincidence we found the likely keyword length to be 8. The ciphertext is divided into eight cosets and the smallest χ2 of each coset and its corresponding shift letters are shown below.

Letter χ2
C 1.051619
O 1.147637
M 1.019016
P 0.702297
U 0.777985
T 0.684600
E 1.042904
R 0.645601

The possible keyword is COMPUTER with which we can decrypt the ciphertext correctly. ♦

Example 2

Example 4 discussed on the Keyword Length Estimation with Index of Coincidence page determined the keyword length of the following ciphertext to be 6.

TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA
MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF
LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV
SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH
FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX
SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV
JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG
QXZVK UZPHF ZCYOS LJNQE MVK

The ciphertext is divided into 6 cosets. The smallest χ2 value of each coset is shown below.

Letter χ2
S 0.598145
U 0.388866
M 0.450186
M 0.311680
E 0.249809
R 0.679312

Thus, the keyword is very likely to be SUMMER. The decrypted text, with spaces and punctuation added, is as follows:

BE KIND AND COURTEOUS TO THIS GENTLEMAN.
HOP IN HIS WALKS AND GAMBOL IN HIS EYES.
FEED HIM WITH APRICOCKS AND DEWBERRIES,
WITH PURPLE GRAPES, GREEN FIGS AND MULBERRIES.
THE HONEY BAGS STEAL FROM THE HUMBLEBEES,
AND FOR NIGHT TAPERS CROP THEIR WAXEN THIGHS
AND LIGHT THEM AT THE FIERY GLOWWORMS' EYES
TO HAVE MY LOVE TO BED AND TO ARISE.
AND PLUCK THE WINGS FROM PAINTED BUTTERFLIES
TO FAN THE MOONBEAMS FROM HIS SLEEPING EYES.
NOD TO HIM, ELVES AND DO HIM COURTESIES.

This is what Titania said in Act 3, Scene 1 of William Shakespeare's A Midsummer Night's Dream. Of course, this is a correct decryption. ♦

Example 3

The following ciphertext was discussed in Example 5 on the Keyword Length Estimation with Index of Coincidence page. A possible keyword length is 7.

WQXYM REOBP VWHTH QYEQV EDEXR BGSIZ SILGR TAJFZ OAMAV VXGRF
QGKCP IOZIJ BCBLU WYRWS TUGVQ PSUDI UWOES FMTBT ANCYZ TKTYB
VFDKD ERSIB JECAQ DWPDE RIEKG PRAQF BGTHQ KVVGR AXAVT HARQE
ELUEC GVVBJ EBXIJ AKNGE SWTKB EDXPB QOUDW VTXES MRUWW RPAWK
MTITK HFWTD AURRV FESFE STKSH FLZAE ONEXZ BWTIA RWWTT HQYEQ
VEDEX RBGSO REDMT ICM

This ciphertext is divided into 7 cosets. The smallest χ2 value of each cost is shown below:

Letter χ2
A 0.268109
M 0.975344
E 0.314517
R 0.278291
I 1.367654
C 0.591278
A 0.650083

Therefore, AMERICA is a possible keyword and the decrypted text, with spaces and punctuation added, is shown below:

WE THE PEOPLE OF THE UNITED STATES, IN ORDER TO FORM A MORE PERFECT UNION,
ESTABLISH JUSTICE, INSURE DOMESTIC TRANQUILITY, PROVIDE FOR THE COMMON
DEFENCE, PROMOTE THE GENERAL WELFARE, AND SECURE THE BLESSINGS OF LIBERTY
TO OURSELVES AND OUR POSTERITY, DO ORDAIN AND ESTABLISH THIS CONSTITUTION
FOR THE UNITED STATES OF AMERICA.

You certainly know what this quote is. ♦