The index of coincidence can be used to estimate the length of the unknown keyword.
To this end, let us guess a length l and divide the ciphertext
C = C1C2C3...CN
into l strings as follows:
If the length of the keyword is 3, this ciphertext is divided into three cosets as follows:
The 28 letters are divided into 3 cosets with the second and third having 4 letters
and the first having 5 letters.
♦
If the length of the keyword is guessed right,
each coset would preserve the English IC to some degree.
Therefore, the IC's of the cosets S1 to Sl
should be closer to ICEnglish = 0.068,
and the average of these IC's would still be high and close to ICEnglish = 0.068.
Otherwise, each coset would look more or less random,
and the IC's of these cosets would be closer to
ICRandom = 0.038$.
Hence, the average of the IC's would be low.
Based on this observation, we may divide the ciphertext into 1 coset (the ciphertext itself),
2 cosets, 3 cosets, 4 cosets, etc
and compute the IC of each coset and the average.
The length that yields the highest average IC and close to ICEnglish
is likely to be the correct length of the keyword.
In this way, we have l strings S1, S2, ...,
Sl, each of which is a substring of the ciphertext.
These substrings are usually referred to as cosets.
Moreover, if the length of the keyword l is correct,
each coset is the encrypted result by the same letter of the keyword.
Note that the "shorter" cosets have ⌊N/l⌋ letters
and the "full" cosets have ⌊N/l⌋+ letters.
Example 1
Suppose we have the following ciphertext of 28 characters:
RSTCS JLSLR SLFEL GWLFI ISIKR MGL
RCLRF GFSRL
SSSSE WIIM
TJLLL LIKG
Example 2
Let us continue with the previous example.
The following table shows the IC's and their averages of lengths 2, 3 and 4.
Length | Indices of Coincidence | Average |
2 | 0.0769, 0.0659 | 0.0714 |
3 | 0.1111, 0.1944, 0.1667 | 0.1574 |
4 | 0.0476, 0.0476, 0.0476, 0.0476 | 0.0476 |
Since length 3 yields the largest IC average, the ciphertext is very likely encrypted by a keyword of length 3. ♦
Example 3
Consider the ciphertext in Example 2 discussed on the
Index of Coincidence page.
VVQGY TVVVK ALURW FHQAC MMVLE HUCAT WFHHI PLXHV UWSCI GINCM
UHNHQ RMSUI MHWZO DXTNA EKVVQ GYTVV QPHXI NWCAB ASYYM TKSZR
CXWRP RFWYH XYGFI PSBWK QAMZY BXJQQ ABJEM TCHQS NAEKV VQGYT
VVPCA QPBSL URQUC VMVPQ UTMML VHWDH NFIKJ CPXMY EIOCD TXBJW
KQGAN
The following shows a summary of possible keyword lengths from 1 to 10.
Length | Indices of Coincidence | Average |
1 | 0.041989 | 0.041989 |
2 | 0.046830, 0.044846 | 0.045838 |
3 | 0.040068, 0.045654, 0.046532 | 0.044085 |
4 | 0.050528, 0.047059, 0.057255, 0.042353 | 0.049299 |
5 | 0.045122, 0.039024, 0.040244, 0.035366, 0.039024 | 0.039756 |
6 | 0.052101, 0.062389, 0.058824, 0.039216, 0.051693 0.052058 |
0.048128 |
7 | 0.039080, 0.041379, 0.041872, 0.032020, 0.068966 0.041872, 0.029557 |
0.042106 |
8 | 0.058462, 0.055385, 0.086154, 0.040000, 0.101538 0.063333, 0.096667, 0.083333 |
0.073109 |
9 | 0.031621, 0.043478, 0.075099, 0.047431, 0.043478 0.043478, 0.023715, 0.043290, 0.043290 |
0.043876 |
10 | 0.066667, 0.019048, 0.028571, 0.033333, 0.042857 0.063158, 0.031579, 0.047368, 0.031579, 0.026316 |
0.039048 |
The largest IC average 0.073109 corresponds to keyword length 8, and, hence, l = 8 is the most likely length of the keyword. Compared with the example on the Kasiski's Method using Kasiski's method, keyword length 8 is very reasonable. ♦
Example 4
Let us look at a longer ciphertext as follows:
TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA
MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF
LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV
SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH
FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX
SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV
JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG
QXZVK UZPHF ZCYOS LJNQE MVK
The following shows a summary of possible keyword lengths from 1 to 10.
Length | 1 | 2 | 3 | 4 | 5 |
Average | 0.041180 | 0.044827 | 0.046035 | 0.045062 | 0.038385 |
Length | 6 | 7 | 8 | 9 | 10 |
Average | 0.062620 | 0.040136 | 0.044498 | 0.047109 | 0.041173 |
The largest IC average 0.062620 corresponds to keyword length 6, and, hence, l = 6 is the most likely length of the keyword. ♦
Example 5
Here is one more example as follows:
WQXYM REOBP VWHTH QYEQV EDEXR BGSIZ SILGR TAJFZ OAMAV VXGRF
QGKCP IOZIJ BCBLU WYRWS TUGVQ PSUDI UWOES FMTBT ANCYZ TKTYB
VFDKD ERSIB JECAQ DWPDE RIEKG PRAQF BGTHQ KVVGR AXAVT HARQE
ELUEC GVVBJ EBXIJ AKNGE SWTKB EDXPB QOUDW VTXES MRUWW RPAWK
MTITK HFWTD AURRV FESFE STKSH FLZAE ONEXZ BWTIA RWWTT HQYEQ
VEDEX RBGSO REDMT ICM
The following shows a summary of possible keyword lengths from 1 to 10.
Length | 1 | 2 | 3 | 4 | 5 |
Average | 0.043351 | 0.041466 | 0.044919 | 0.039914 | 0.039010 |
Length | 6 | 7 | 8 | 9 | 10 |
Average | 0.041352 | 0.071317 | 0.037294 | 0.043934 | 0.040091 |
The largest IC average 0.071317 corresponds to keyword length 7,
and
It is important to note that the maximum length is limited to 10 in all examples on this page. In practice, the keyword length may be longer than 10. If the correct keyword of an unknown keyword is k, then the method discussed on this page may report the possible keyword length being a multiple of k (i.e., k, 2k, 3k, etc). More precisely, if the true keyword is BOY, then this method may report the length being 3 for BOY, 6 for BOYBOY, 9 for BOYBOYBOY, etc. However, this is usually not a problem because one can use k, 2k, 3k, etc to recover the keyword. In general, the shortest length works well. |