Distribution of letters in parts of English words




Does it fit ETAOINSHRDLU? That's the order I remember for letter frequency in English.
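For anyone who wants to check this against their own text or word list, here is a rough Python sketch. The function names `letter_order` and `rank_differences` are made up for illustration, and it only counts plain A-Z, so treat it as a starting point rather than how the charts were actually built.

```python
from collections import Counter

def letter_order(text):
    """Return the letters of `text` sorted from most to least frequent."""
    # Only plain A-Z is counted; accented letters and punctuation are ignored.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count, breaking ties alphabetically so the result is stable.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))

def rank_differences(observed, reference="ETAOINSHRDLU"):
    """Map each reference letter to (reference rank, observed rank or None)."""
    return {
        ch: (i, observed.index(ch) if ch in observed else None)
        for i, ch in enumerate(reference)
    }
```

Feed `letter_order` a big chunk of text, then pass the result to `rank_differences` to see which letters sit higher or lower than the remembered order.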


The most surprising thing to me is how skewed towards the end of words e is.


Interesting, but given English's convoluted history and the extreme mismatch between how it is spoken and how it is spelled, I doubt it says much linguistically. A phonetic breakdown along these lines could be meaningfully compared with other languages.


I work anagrams a lot, and found the graphs were mostly what I expected. One surprise was "z", which I quickly figured out was due to "z doubling" toward the end of some common words: dizzy, fizzy, and jazzy are three examples.

It would be a useful tool for someone learning to work anagrams or break substitution codes, because both depend on a good feel for where letters tend to sit in a word. This is basically a cheat sheet.


It doesn't perfectly match ETAOINSHRDLU (which I also remember, most likely from reading Hofstadter). The highest frequency letters here (the ones with charts in the darkest red) are E, O and T; next are A, H, I, N and S; the third group, D, F and R. Compared to ETAOINSHRDLU, A is underrepresented here (or O overrepresented), and F overrepresented. (L and U both appear in the fourth-highest frequency group in the chart.)

I'm a bit surprised (and delighted) that none of the 26 letters have a relatively smooth, flat, balanced graph with roughly equal frequencies for all positions. L probably comes the closest at a quick visual examination, but the bump toward the end of the word is still more than twice as tall as the lowest point. That said, it does seem that the rarest letters (such as J, Q, X and Z) often have very sharp, unbalanced graphs.
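To put a number on "flat" instead of eyeballing it, something like the sketch below might work. It assumes each word is split into a fixed number of relative-position bins (5 here, picked arbitrarily) and that only a-z is counted; I don't know what binning the original charts use, and `position_histograms` and `flatness` are just names I chose.

```python
from collections import defaultdict

def position_histograms(words, bins=5):
    """For each letter, count occurrences per relative-position bin."""
    hist = defaultdict(lambda: [0] * bins)
    for word in words:
        letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
        n = len(letters)
        for i, ch in enumerate(letters):
            # Map index i in a word of length n to a bin in 0..bins-1.
            hist[ch][i * bins // n] += 1
    return dict(hist)

def flatness(counts):
    """Ratio of tallest to shortest bin: 1.0 is perfectly flat, inf if a bin is empty."""
    low = min(counts)
    return float("inf") if low == 0 else max(counts) / low
```

With this measure, a letter whose tallest bump is more than twice its lowest point would score above 2.0, and a rare letter that never shows up in some part of the word would score infinity.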


I think "L" makes sense because it's a common letter and it's often used the same way "z" is: as a doubled letter toward the end of a word. Each of those uses counts twice. On just this page including comments, I found: will, hopefully, all, especially, linguistically, meaningfully, basically, all, still, tall.

(I'm tired, and may have missed a few)
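If you'd rather not count by hand, a few lines of Python can tally the doubled letters. This is only a sketch: `doubled_letters` is a name I made up, and the word list is just the words quoted in this thread, not a real corpus.

```python
from collections import Counter

def doubled_letters(words):
    """Count adjacent identical letter pairs per letter, e.g. 'zz' in 'dizzy'."""
    pairs = Counter()
    for word in words:
        w = word.lower()
        # Compare each character with the one right after it.
        for a, b in zip(w, w[1:]):
            if a == b and a.isalpha():
                pairs[a] += 1
    return pairs

# Words quoted in the comments above (duplicates kept on purpose).
words = ["will", "hopefully", "all", "especially", "linguistically",
         "meaningfully", "basically", "all", "still", "tall",
         "dizzy", "fizzy", "jazzy"]
print(doubled_letters(words))
```

On that list it finds ten "ll" pairs and three "zz" pairs, which at least matches the hand count.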


Oh Z, you wildcard. Surprised me as well.


ETAOIN SHRDLU comes from the keyboard of the Linotype machine, whose layout was probably designed from a pretty small data set. It's surprisingly close to what you see in actual English, but not exact.


