Keyword Length Estimation with Index of Coincidence

The index of coincidence can be used to estimate the length of the unknown keyword. To this end, let us guess a length l and divide the ciphertext C = C₁C₂C₃...C_N into l strings as follows:

The first string S₁ starts with C₁ and includes every l-th letter: S₁ = C₁C_1+lC_1+2lC_1+3l...

The second string S₂ starts with C₂ and includes every l-th letters: S₂ = C₂C_2+lC_2+2lC_2+3l...

In general, the i-th string S_i starts with C_i and includes every l-th letter: S_i = C_iC_i+lC_i+2lC_i+3l...

The l-th string S_l starts with C_l and includes every l-th letter: S_l = C_lC_2lC_3lC_4l...

In this way, we have l strings S₁, S₂, ..., S_l, each of which is a substring of the ciphertext. These substrings are usually referred to as cosets. Moreover, if the length of the keyword l is correct, each coset is the encrypted result by the same letter of the keyword. Note that the "shorter" cosets have ⌊N/l⌋ letters and the "full" cosets have ⌊N/l⌋+ letters.

Example 1

Suppose we have the following ciphertext of 28 characters:

RSTCS JLSLR SLFEL GWLFI ISIKR MGL

If the length of the keyword is 3, this ciphertext is divided into three cosets as follows:

RCLRF GFSRL
SSSSE WIIM
TJLLL LIKG

The 28 letters are divided into 3 cosets with the second and third having 4 letters and the first having 5 letters. ♦

If the length of the keyword is guessed right, each coset would preserve the English IC to some degree. Therefore, the IC's of the cosets S₁ to S_l should be closer to IC_English = 0.068, and the average of these IC's would still be high and close to IC_English = 0.068. Otherwise, each coset would look more or less random, and the IC's of these cosets would be closer to IC_Random = 0.038$. Hence, the average of the IC's would be low. Based on this observation, we may divide the ciphertext into 1 coset (the ciphertext itself), 2 cosets, 3 cosets, 4 cosets, etc and compute the IC of each coset and the average. The length that yields the highest average IC and close to IC_English is likely to be the correct length of the keyword.

Example 2

Let us continue with the previous example. The following table shows the IC's and their averages of lengths 2, 3 and 4.

Length	Indices of Coincidence	Average
2	0.0769, 0.0659	0.0714
3	0.1111, 0.1944, 0.1667	0.1574
4	0.0476, 0.0476, 0.0476, 0.0476	0.0476

Since length 3 yields the largest IC average, the ciphertext is very likely encrypted by a keyword of length 3. ♦

Example 3

Consider the ciphertext in Example 2 discussed on the Index of Coincidence page.

VVQGY TVVVK ALURW FHQAC MMVLE HUCAT WFHHI PLXHV UWSCI GINCM
UHNHQ RMSUI MHWZO DXTNA EKVVQ GYTVV QPHXI NWCAB ASYYM TKSZR
CXWRP RFWYH XYGFI PSBWK QAMZY BXJQQ ABJEM TCHQS NAEKV VQGYT
VVPCA QPBSL URQUC VMVPQ UTMML VHWDH NFIKJ CPXMY EIOCD TXBJW
KQGAN

The following shows a summary of possible keyword lengths from 1 to 10.

Length	Indices of Coincidence	Average
1	0.041989	0.041989
2	0.046830, 0.044846	0.045838
3	0.040068, 0.045654, 0.046532	0.044085
4	0.050528, 0.047059, 0.057255, 0.042353	0.049299
5	0.045122, 0.039024, 0.040244, 0.035366, 0.039024	0.039756
6	0.052101, 0.062389, 0.058824, 0.039216, 0.051693 0.052058	0.048128
7	0.039080, 0.041379, 0.041872, 0.032020, 0.068966 0.041872, 0.029557	0.042106
8	0.058462, 0.055385, 0.086154, 0.040000, 0.101538 0.063333, 0.096667, 0.083333	0.073109
9	0.031621, 0.043478, 0.075099, 0.047431, 0.043478 0.043478, 0.023715, 0.043290, 0.043290	0.043876
10	0.066667, 0.019048, 0.028571, 0.033333, 0.042857 0.063158, 0.031579, 0.047368, 0.031579, 0.026316	0.039048

The largest IC average 0.073109 corresponds to keyword length 8, and, hence, l = 8 is the most likely length of the keyword. Compared with the example on the Kasiski's Method using Kasiski's method, keyword length 8 is very reasonable. ♦

Example 4

Let us look at a longer ciphertext as follows:

TYWUR USHPO SLJNQ AYJLI FTMJY YZFPV EUZTS GAHTU WNSFW EEEVA
MYFFD CZTMJ WSQEJ VWXTU QNANT MTIAW AOOJS HPPIN TYDDM VKQUF
LGMLB XIXJU BQWXJ YQZJZ YMMZH DMFNQ VIAYE FLVZI ZQCSS AEEXV
SFRDS DLBQT YDTFQ NIVKU ZPJFJ HUSLK LUBQV JULAB XYWCD IEOWH
FTMXZ MMZHC AATFX YWGMF XYWZU QVPYF AIAFJ GEQCV KNATE MWGKX
SMWNA NIUSH PFSRJ CEQEE VJXGG BLBQI MEYMR DSDHU UZXVV VGFXV
JZXUI JLIRM RKZYY ASETY MYWWJ IYTMJ KFQQT ZFAQK IJFIP FSYAG
QXZVK UZPHF ZCYOS LJNQE MVK

The following shows a summary of possible keyword lengths from 1 to 10.

Length	1	2	3	4	5
Average	0.041180	0.044827	0.046035	0.045062	0.038385
Length	6	7	8	9	10
Average	0.062620	0.040136	0.044498	0.047109	0.041173

The largest IC average 0.062620 corresponds to keyword length 6, and, hence, l = 6 is the most likely length of the keyword. ♦

Example 5

Here is one more example as follows:

WQXYM REOBP VWHTH QYEQV EDEXR BGSIZ SILGR TAJFZ OAMAV VXGRF
QGKCP IOZIJ BCBLU WYRWS TUGVQ PSUDI UWOES FMTBT ANCYZ TKTYB
VFDKD ERSIB JECAQ DWPDE RIEKG PRAQF BGTHQ KVVGR AXAVT HARQE
ELUEC GVVBJ EBXIJ AKNGE SWTKB EDXPB QOUDW VTXES MRUWW RPAWK
MTITK HFWTD AURRV FESFE STKSH FLZAE ONEXZ BWTIA RWWTT HQYEQ
VEDEX RBGSO REDMT ICM

The following shows a summary of possible keyword lengths from 1 to 10.

Length	1	2	3	4	5
Average	0.043351	0.041466	0.044919	0.039914	0.039010
Length	6	7	8	9	10
Average	0.041352	0.071317	0.037294	0.043934	0.040091

The largest IC average 0.071317 corresponds to keyword length 7, and l = 6 is the most likely length of the keyword. ♦

It is important to note that the maximum length is limited to 10 in all examples on this page. In practice, the keyword length may be longer than 10. If the correct keyword of an unknown keyword is k, then the method discussed on this page may report the possible keyword length being a multiple of k (i.e., k, 2k, 3k, etc). More precisely, if the true keyword is BOY, then this method may report the length being 3 for BOY, 6 for BOYBOY, 9 for BOYBOYBOY, etc. However, this is usually not a problem because one can use k, 2k, 3k, etc to recover the keyword. In general, the shortest length works well.