CSc 8370 fall '04
Practicing Statistics without a Licsence.
or
Statistical Measures of Languages
We need to be able to identify when a string is in a (human) language. It is also important to be able to recognize when two strings have been encrypted by the same algorithm in the same key. We've seen that doing this by hand is rather tiresome. Therefore we need to examine computable measures that we can use. We assume that if two strings are characterized by identical distributions, then they come from the same source. The measures listed below describe several ways of characterizing the distributions.
Approximate Measure of Probability.
Given a sample of size N, the probability that an event (eg. the occurrence of a given symbol) occurs is approximated by :
![]()
where ∂ is 1 when the two symbols are the same and 0 otherwise
Incidence of Coincidence or

where ∂ is 1 when t equals t' and 0 otherwise.
k ≤1 and
is only equal to 1 when the two texts are identical. If the sources
are characterized by a probability distribution p, then the expected
value for k is the sum of p2. If the source is random
(or if the sources are random with respect to each other) then the
expected value is just
.
Measures of smoothness for the distribution (,)
, measure the smoothness of the distribution. Uniformly distributed symbols are smooth, and meaningfully distributed symbols are not (to first approximation). For both of these functions one calculates the empirical distribution mi which is just the number of times the symbol is symbol i divided by the total number of symbols in the string (i.e. the size of the string).


so we just drop the constant.

(some authors normalize in this formula, in which case it is m/M where M is the size of the string).

(again this is for normalized m, for un-normalized the mi M would be just mi)
These
formulae appear to be "magic", but are actually based in
approximations to the entropy of a distribution (information
theory). The information contained in a distribution can be defined
as -log(p). This is "simply" a useful way of describing
how many "bits" are needed to represent the distribution
at a given fidelity. It is also a very useful way to handle joint
distributions because the logarithms add when the probabilities
multiply. The expected value of the information (-p ln p) is
identified as the entropy and this is maximized when the
distribution is mostly flat. Chernoff noted that
and
therefore x ln(x) < x(x-1).
Measures of errors in the distribution (2)
An empirical distribution m can always be compared to a theoretical distribution p by a 2 test.

(unscaled version would replace p by p*M).
This will be minimal at a correct decryption.
How much text is enough? (unicity distance).
The unicity distance is the size of the message which has a unique key. It can be estimated by using the entropy of the distribution p. It depends both on the underlying clear language and the cipher system. For single substitution in English it is about 25 characters.
where
Z is the number of possible cipher combinations. 3.5 is a constant
chosen to make U= 25 when Z = 26!. With Ceaser ciphers Z = 26 (for
N=26 letters), for monoalphabetic Z = 26!, for dialphabetic Z =
(262!). With block keyed ciphers like DES where there
are 2n keys then Z is 2n.
Problem Set, In class and out.
Write programs to derive monomer probabilities and pair probabilities. It would also be good to write a program to calculate Phi. The file 3boat10.txt is a relevant source of plaintext.
(the file classcipher) is this a mono alphabetic substitution? Prove it.
vvgfxxdfaaavaxdfdvdgvvdgfxdvdxvxgaavfgdfvaagfvfggdgavfdfafgxdvafgfgaxdagfggxddvxfxvvaxavvvggxdgvggvxvxvgdgvxdxgdvxgxdvvvxgdaaagavxddaxgdafdvvdaxgxdggavvfxgxdaddadxggvfavxfaxvgxgaavxvgvgffxdgagdgaxxggaaadxvvdvdgvgdadxvdfddafgdaadvvvdvgdfdxvxdvfavxgadvdfxxgxaavvaffgxxfdvggfvgdvgvvavvgadvavvgavfavvvgagddvxvagxdvdffgaavxvxgxgxddxdgfagggaxavdxaadgxdddgxxffgdggaadavvffxgfvaaavavaxgvavxxaxgfgddxavxgdavvggxdddaavfddddvagagggvgggaxdfffdgadagvvadaffggfagxavfgxdfvxfvadaggfdvgxaggfaxdavxvvaaaaaafdvvxxadxfdvxdvxgfagagafvgdvdgddvvagggdddvdfgagaaadadgadffvaaafxdvgadgdggffffadgfdggvvdadagadxgafaagfadgaavgvfaffvavagfvvfdvaxgvdaggaagfxaxdgxvxfxgaaffgfxgxgxvgxvvvgfvggfgdgavavaadxvvxaxxvfvdvaxdvdgaaadgxgvadavdaafdvaxdaaggxafdvxxgavvggvfxfavxxvvafvgvvgfdvfddgfvaaggfgvvvvfdvfvddvxgdgggvvggggdfxfagfffvgvdxvvfgddxaaadvavgaaaafdvgfdvaxdfgvaxgxafavvadvggdfgxvfgavvagggavxxfgdgfvvavvdvfddavaaxxdaxxfvxagfdvvvvgxgvgfaaxvfgfxgddvdgavxfadavvgvvafaxgvgxvvdfvgvadavfdaaggxavdffvadvvgddxgvfvafddaxdvadvgdxdvdxdddavvdadvaadvxvgdvavvvvagdagfvxdvfgaxaxgffffv
It could be a variant of adfgvx. Is it permuted or not? Prove it.
Side information tells you that if it is a transposition then it is an incomplete columnar transposition. Use the Phi test to find which period (i.e. what size column or row) was used.
4) It may have come from 3boat10.txt. How would you check this without solving the cipher? Write a program and try it. (There are several fast algorithm that will identify too many places and a slower algorithm that will only find correct answers).