
Sounds Same Warmup

Posted: Tue Feb 17, 2009 2:01 pm
by efe
This is how I solved it:

I made the assumption that, for each plaintext letter, every homophone is used with the same probability.
It follows that if the ciphertext letters X and Y stand for the same plaintext letter, then the frequencies of the bigrams (groups of 2 letters) AX and AY will be similar for every other ciphertext letter A.

For example, if we assume that '6' and 'A' are homophones for the same letter, then '60' and 'A0' should occur with the same (or almost the same) frequency. So I calculate the absolute difference between these two frequencies, do the same for all analogous bigrams, and sum up the differences. If the sum for the pair '6'/'A' is very low compared to other pairs, then '6' and 'A' really are substitutes for the same letter.

The following python code implements this idea:


c = "7A62E92C...F215F8"  # the ciphertext

# build a dictionary of bigram frequencies in the ciphertext
# (note: str.count counts non-overlapping matches, which is close enough here)
d = dict((bigram, c.count(bigram)) for bigram in ["%02X" % i for i in range(0x100)])

ct = "0123456789ABCDEF"
for x in ct:
    print(x, [sum(abs(d[x + a] - d[y + a]) + abs(d[a + x] - d[a + y]) for a in ct) for y in ct])
(The original code was a lot bigger; it's so cool how you can shrink down Python programs :-) )

The output is:

0 [0, 186, 126, 48, 118, 128, 180, 125, 117, 124, 194, 182, 184, 194, 180, 132]
1 [186, 0, 214, 178, 212, 206, 68, 195, 211, 102, 46, 54, 54, 50, 52, 104]
2 [126, 214, 0, 114, 62, 44, 192, 47, 49, 184, 212, 210, 208, 208, 194, 186]
3 [48, 178, 114, 0, 100, 108, 164, 113, 95, 116, 180, 178, 178, 184, 166, 120]
4 [118, 212, 62, 100, 0, 50, 186, 59, 51, 184, 214, 210, 204, 204, 188, 188]
5 [128, 206, 44, 108, 50, 0, 184, 45, 51, 180, 208, 206, 200, 204, 182, 184]
6 [180, 68, 192, 164, 186, 184, 0, 169, 183, 102, 56, 48, 68, 62, 62, 104]
7 [125, 195, 47, 113, 59, 45, 169, 0, 58, 171, 193, 191, 185, 189, 175, 173]
8 [117, 211, 49, 95, 51, 51, 183, 58, 0, 179, 213, 209, 205, 209, 193, 183]
9 [124, 102, 184, 116, 184, 180, 102, 171, 179, 0, 104, 106, 114, 114, 98, 26]
A [194, 46, 212, 180, 214, 208, 56, 193, 213, 104, 0, 52, 46, 52, 44, 100]
B [182, 54, 210, 178, 210, 206, 48, 191, 209, 106, 52, 0, 60, 60, 68, 108]
C [184, 54, 208, 178, 204, 200, 68, 185, 205, 114, 46, 60, 0, 54, 52, 118]
D [194, 50, 208, 184, 204, 204, 62, 189, 209, 114, 52, 60, 54, 0, 64, 118]
E [180, 52, 194, 166, 188, 182, 62, 175, 193, 98, 44, 68, 52, 64, 0, 98]
F [132, 104, 186, 120, 188, 184, 104, 173, 183, 26, 100, 108, 118, 118, 98, 0]

As you can see, the lists are very similar within each homophonic group.
Knowing the order of the letter frequencies in the text, I could easily determine which group belongs to which letter.
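To make that last step concrete, here is a sketch of how the groups can be read off the distance table above with a simple connected-components pass. The cutoff of 80 is my own assumption, chosen by eye from the gap between in-group and cross-group values:

```python
# Read the homophone groups off the distance table with a simple
# connected-components pass. THRESHOLD = 80 is an assumed cutoff,
# chosen by eye from the gap between in-group and cross-group values.
ct = "0123456789ABCDEF"
D = [  # the 16x16 distance matrix printed above
    [0, 186, 126, 48, 118, 128, 180, 125, 117, 124, 194, 182, 184, 194, 180, 132],
    [186, 0, 214, 178, 212, 206, 68, 195, 211, 102, 46, 54, 54, 50, 52, 104],
    [126, 214, 0, 114, 62, 44, 192, 47, 49, 184, 212, 210, 208, 208, 194, 186],
    [48, 178, 114, 0, 100, 108, 164, 113, 95, 116, 180, 178, 178, 184, 166, 120],
    [118, 212, 62, 100, 0, 50, 186, 59, 51, 184, 214, 210, 204, 204, 188, 188],
    [128, 206, 44, 108, 50, 0, 184, 45, 51, 180, 208, 206, 200, 204, 182, 184],
    [180, 68, 192, 164, 186, 184, 0, 169, 183, 102, 56, 48, 68, 62, 62, 104],
    [125, 195, 47, 113, 59, 45, 169, 0, 58, 171, 193, 191, 185, 189, 175, 173],
    [117, 211, 49, 95, 51, 51, 183, 58, 0, 179, 213, 209, 205, 209, 193, 183],
    [124, 102, 184, 116, 184, 180, 102, 171, 179, 0, 104, 106, 114, 114, 98, 26],
    [194, 46, 212, 180, 214, 208, 56, 193, 213, 104, 0, 52, 46, 52, 44, 100],
    [182, 54, 210, 178, 210, 206, 48, 191, 209, 106, 52, 0, 60, 60, 68, 108],
    [184, 54, 208, 178, 204, 200, 68, 185, 205, 114, 46, 60, 0, 54, 52, 118],
    [194, 50, 208, 184, 204, 204, 62, 189, 209, 114, 52, 60, 54, 0, 64, 118],
    [180, 52, 194, 166, 188, 182, 62, 175, 193, 98, 44, 68, 52, 64, 0, 98],
    [132, 104, 186, 120, 188, 184, 104, 173, 183, 26, 100, 108, 118, 118, 98, 0],
]

THRESHOLD = 80

groups = []
seen = set()
for i in range(16):
    if i in seen:
        continue
    # flood-fill all symbols reachable through sub-threshold distances
    stack, comp = [i], set()
    while stack:
        j = stack.pop()
        if j in comp:
            continue
        comp.add(j)
        stack.extend(k for k in range(16) if k != j and D[j][k] < THRESHOLD)
    seen |= comp
    groups.append("".join(sorted(ct[j] for j in comp)))

print(groups)  # -> ['03', '16ABCDE', '24578', '9F']
```

With the matrix above this yields the four groups {0,3}, {1,6,A,B,C,D,E}, {2,4,5,7,8} and {9,F}, which can then be matched to plaintext letters by frequency rank.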

Posted: Tue Feb 17, 2009 4:32 pm
by gfoot
Hmm, that worked really well for you.

Originally the relative frequencies of the homophones of each plaintext letter were completely even, but that caused a problem: the raw frequencies of the individual ciphertext characters made it really obvious which ones were matching homophones - you could just read it off the frequency table. I think the Ys had 68 or 69 occurrences, the Xs about 60, the Ws about 45, and the Zs about 63. It was too obvious which was which, even from single-letter analysis.

So I remapped some of them, changing occurrences of one homophone into another for the same letter, to scramble the individual frequencies a bit. In the final ciphertext the homophones for one letter do not actually have the same basic frequency (especially for something like W).
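For illustration, the remapping step could look something like this sketch. The homophone table, the fraction, and the function name are made-up placeholders, not the actual values or code used for the challenge:

```python
import random

# Sketch of the scrambling step: for each ciphertext character, with some
# probability, swap it for a sibling homophone of the same plaintext letter.
# HOMOPHONES and fraction are hypothetical placeholders.
HOMOPHONES = {"e": ["A", "6", "C"], "t": ["2", "5"]}  # made-up mapping

def scramble(ciphertext, groups, fraction=0.3, seed=1):
    rng = random.Random(seed)  # fixed seed for reproducibility
    out = []
    for ch in ciphertext:
        group = next((g for g in groups.values() if ch in g), None)
        if group and len(group) > 1 and rng.random() < fraction:
            # replace this occurrence with a different homophone
            # of the same plaintext letter
            out.append(rng.choice([h for h in group if h != ch]))
        else:
            out.append(ch)
    return "".join(out)

print(scramble("A62C25A6", HOMOPHONES))
```

The result decrypts identically but no longer has evenly distributed homophone counts, so single-letter frequency analysis alone cannot pair up the homophones.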

Bearing that in mind, I would have thought your reliance on the unnormalized frequencies matching would not work as well as it did!

As I said in the challenge text, it might be worth running your solver on half or a quarter of the input data, to see how well it does. It's trickier on less data, but my test solver managed quite well with some modifications.

My test solver is far from perfect - I know two obvious ways to extend it, but I don't know the maths for one and the other is not a well-defined problem so needs some experimentation. Hopefully I'll find time to try these out sometime, then I think I could analyze smaller plaintext sizes - my current solver works fine on the first 250 characters, but I lose faith in it for smaller sample sizes.

Posted: Wed Apr 13, 2016 12:14 am
by Hippo
Yep, with that assumption it is doable, but it would be easy to encode in a way that breaks this assumption.

... I used almost the same method on the warmup, then I tried to extend it to Sounds Same.

... I improved my warmup method using a technique from "sociomapping" ... to place similar elements near one another ... it clearly produced 4 groups of letters even when applied to only 1/3 of the warmup.
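The "place similar elements near one another" step can be sketched as a greedy seriation; this is only a toy illustration with a hypothetical 4-element distance table, not the actual sociomapping algorithm:

```python
# Toy illustration of the ordering idea: greedily place each element next
# to its nearest remaining neighbour, so that similar elements end up
# adjacent. The distance table below is hypothetical, not warmup data.
D = {
    ("a", "b"): 5, ("a", "c"): 90, ("a", "d"): 85,
    ("b", "c"): 88, ("b", "d"): 92, ("c", "d"): 7,
}

def dist(x, y):
    # symmetric lookup; distance of an element to itself is 0
    return D.get((x, y), D.get((y, x), 0))

def seriate(items):
    order = [items[0]]
    rest = set(items[1:])
    while rest:
        # append the remaining element closest to the last placed one
        nxt = min(rest, key=lambda y: dist(order[-1], y))
        order.append(nxt)
        rest.remove(nxt)
    return order

print(seriate(["a", "c", "b", "d"]))  # -> ['a', 'b', 'c', 'd']
```

On a real distance matrix like the one earlier in the thread, such an ordering makes the homophone groups show up as contiguous runs.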

I gave up on using it for Sounds Same after the first try gave wrong results.