I was curious about how frequently ordered character pairs occur in different languages. So I wrote a PHP script that fetches texts online or from disk and parses them. I chose one classic novel as the source for each language. The choices are certainly not representative of the language, but I think they still provide some insight. From each source I used only the text actually belonging to the novel.
Basically, I provide an alphabet, and every sequence of three or more characters consisting exclusively of characters from that alphabet is considered a word.
The word ‘ralf’ would then lead to a tick at r-a, a-l and l-f, with the first character listed in the first column and the second character listed in the first row.
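The counting rule described above can be sketched in a few lines. This is not the original PHP script but a minimal Python illustration of the same idea. The function name `count_pairs` and its parameters are made up for this example, and lowercasing the text before matching is an assumption (the post does not say how upper case was handled).

```python
import re
from collections import Counter


def count_pairs(text, alphabet, min_length=3):
    """Count ordered character pairs inside words built only from `alphabet`.

    Hypothetical helper: a "word" is a maximal run of alphabet characters
    that is at least `min_length` characters long.
    """
    # Build a character class from the alphabet; re.escape guards against
    # characters that are special inside a regular expression.
    pattern = "[" + re.escape(alphabet) + "]{" + str(min_length) + ",}"
    pairs = Counter()
    # Assumption: the text is lowercased so that 'R' and 'r' count as one letter.
    for word in re.findall(pattern, text.lower()):
        # Pair every character with its successor: r-a, a-l, l-f for 'ralf'.
        for first, second in zip(word, word[1:]):
            pairs[first + second] += 1
    return pairs


counts = count_pairs("Ralf", "abcdefghijklmnopqrstuvwxyz")
# counts now holds one tick each for 'ra', 'al' and 'lf'.
```

Sequences shorter than three characters, such as "an" or "ox", produce no ticks at all under this rule, so very short words never reach the table.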
German / ETA Hoffmann / Lebens-Ansichten des Katers Murr
514’668 character combinations
English / Charles Dickens / Great Expectations
519’987 character combinations
Russian / Fyodor Dostoyevsky / Crime and Punishment
624’622 character combinations
Comparison of character frequencies
Apparently the German novel has the lowest relative variety of character combinations in use.
Conditional formatting in Excel
I applied four different formatting rules. Most of them are rather arbitrary and only help to recognize more quickly how the frequencies are distributed. A white font color means that the figure belongs to the top 10, and the green color (all figures from 0 to 9) is supposed to give an idea of how many character combinations are not really used in the sample.
Relative totals in last column and row
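This part of the map shows how often a character occurs in the first position (last column) or the second position (last row) of a pair. Any asymmetry between the two can only come from the beginnings and ends of words: a character inside a word is counted once in each position, while the first character of a word only ever counts as a first character and the last one only as a second character. A strong asymmetry is observable, for example, for the letter “e” in the English novel: 15.3% of combinations end with “e” but only 10.8% start with “e”.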
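To make the last column and row concrete, here is a small Python sketch (again not the original PHP, and `position_shares` is a hypothetical name) that turns a table of pair counts into the two relative totals. It assumes the counts come as a mapping from two-character strings to integers.

```python
from collections import Counter


def position_shares(pair_counts):
    """Return (first_share, second_share): the fraction of all pairs in which
    each character sits in the first or in the second position."""
    total = sum(pair_counts.values())
    if total == 0:
        return {}, {}
    first = Counter()
    second = Counter()
    for pair, n in pair_counts.items():
        first[pair[0]] += n   # contributes to the last column
        second[pair[1]] += n  # contributes to the last row
    first_share = {c: n / total for c, n in first.items()}
    second_share = {c: n / total for c, n in second.items()}
    return first_share, second_share


# The single word 'ralf' gives the pairs r-a, a-l and l-f.
first_share, second_share = position_shares({"ra": 1, "al": 1, "lf": 1})
```

In this tiny example "a" and "l" get the same share in both positions, while "r" appears only as a first character and "f" only as a second one. That is the word-boundary effect in miniature.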
Conclusions
Apart from the obvious observations mentioned, none yet. I might come back to this perspective on language to play around with it more. You’re welcome to share any insights or ideas, of course. But still, the pictures do look nice, don’t they?
Is it possible to recognize which language a text is written in if you have enough information about letter pairs?