# Frequency of character combinations for three languages

I was curious about how frequently ordered character pairs occur in different languages. So I wrote a PHP script that fetches texts online or from disk and parses them. I chose one classic novel as the source for each language. The choices are certainly not representative of the languages, but I think they still provide some insight. From each source I only used the text actually belonging to the novel.

Basically, I provide an alphabet, and every sequence of three or more characters that consists exclusively of characters from that alphabet is considered a word.

The word ‘ralf’ would then lead to a tick at r-a, a-l and l-f, with the first character listed in the first column and the second character listed in the first row.
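The counting rule described above can be sketched in a few lines. This is a minimal Python illustration, not the PHP script used for the figures in this post. The function name `count_pairs`, the lowercasing of the text, and the reading of "sequence" as a maximal run of alphabet characters are all assumptions made for the example.

```python
import re
from collections import Counter


def count_pairs(text, alphabet, min_length=3):
    """Count ordered adjacent character pairs inside words (hypothetical helper)."""
    # Assumption: a word is a maximal run of alphabet characters
    # that is at least min_length characters long.
    word_pattern = re.compile("[" + re.escape(alphabet) + "]{" + str(min_length) + ",}")
    pairs = Counter()
    # Assumption: upper and lower case are treated as the same character.
    for word in word_pattern.findall(text.lower()):
        # Pair every character with its direct successor: r-a, a-l, l-f.
        for first, second in zip(word, word[1:]):
            pairs[first, second] += 1
    return pairs


alphabet = "abcdefghijklmnopqrstuvwxyz"
pairs = count_pairs("Ralf", alphabet)
print(sorted(pairs.items()))
# [(('a', 'l'), 1), (('l', 'f'), 1), (('r', 'a'), 1)]
```

Each key is a (first character, second character) tuple, which corresponds to the row and column of one cell in the heatmaps.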

# German / ETA Hoffmann / Lebens-Ansichten des Katers Murr

514’668 character combinations

# English / Charles Dickens / Great Expectations

519’987 character combinations

# Russian / Fyodor Dostoyevsky / Crime and Punishment

624’622 character combinations

# Comparison of character frequencies

Apparently the German novel has the least relative variety of used character combinations.

# Conditional formatting in Excel

I applied four different formatting rules. Most of them are rather arbitrary and only help to recognize more quickly how the frequencies are distributed. A white font color means that the figure belongs to the top 10, and the green color (all figures from 0 to 9) is supposed to give an idea of how many character combinations are not really used in the sample.
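For readers without Excel, the two rules named above (top 10 and figures from 0 to 9) could be approximated as follows. This is only a sketch: the function name `classify_cells` and the labels are made up for the example, ties at the top-10 boundary are broken by input order, which may differ from what Excel does, and the other two formatting rules are not covered because they are not specified here.

```python
def classify_cells(counts, top_n=10, rare_max=9):
    """Label each cell 'top', 'rare' or 'normal' (hypothetical helper)."""
    # The top_n cells with the highest counts; ties are broken by input order.
    top_cells = set(sorted(counts, key=counts.get, reverse=True)[:top_n])
    labels = {}
    for cell, n in counts.items():
        if cell in top_cells:
            labels[cell] = "top"      # white font in the heatmap
        elif n <= rare_max:
            labels[cell] = "rare"     # green: hardly used in the sample
        else:
            labels[cell] = "normal"
    return labels


# Made-up toy counts, with top_n lowered to 2 to keep the example small.
toy_counts = {("t", "h"): 500, ("h", "e"): 450, ("a", "n"): 120, ("x", "j"): 3, ("q", "z"): 0}
labels = classify_cells(toy_counts, top_n=2)
print(labels)
```

Pairs that never occur have to be present in `counts` with a value of 0, otherwise they cannot be labeled as rare.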

# Relative totals in last column and row

This part of the map shows how often a character appears in the first position (last column) or the second position (last row) of a pair. Of course, any asymmetry here can only arise from the beginnings and ends of words: inside a word, every character that ends one pair also starts the next one. A strong asymmetry is observable, for example, for the letter “e” in the English novel: 15.3% of combinations end with “e” but only 10.8% start with “e”.
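The relative totals can be derived directly from the pair counts. The following Python sketch assumes the counts are stored as a mapping from (first, second) tuples to integers; the function name `position_shares` and the toy data are made up for illustration and are not taken from the novels.

```python
from collections import Counter


def position_shares(pairs):
    """Return the share of each character in first and in second position."""
    total = sum(pairs.values())
    first = Counter()
    second = Counter()
    for (a, b), n in pairs.items():
        first[a] += n    # contributes to the last column (row totals)
        second[b] += n   # contributes to the last row (column totals)
    first_share = {c: n / total for c, n in first.items()}
    second_share = {c: n / total for c, n in second.items()}
    return first_share, second_share


# Toy counts for the single word "here": h-e, e-r, r-e.
toy_pairs = Counter({("h", "e"): 1, ("e", "r"): 1, ("r", "e"): 1})
first_share, second_share = position_shares(toy_pairs)
print(first_share.get("e", 0), second_share.get("e", 0))
```

In the toy word, “e” ends two of the three pairs but starts only one, because the word ends with “e” and does not start with it. This is the same kind of asymmetry as the one described for the English novel.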

# Conclusions

Apart from the obvious observations mentioned above, none yet. I might come back to this perspective on language to play around with it more. You’re welcome to share any insights or ideas, of course. But still, the pictures do look nice, don’t they?

The heatmaps and charts were made using Excel 2010. I will soon publish articles about how to achieve this in Excel.

## One thought on “Frequency of character combinations for three languages”

1. Is it possible to recognize which language is used in text if you have enough information about letter-pairs?