Frequency of character combinations for three languages

I was curious about how frequently ordered character pairs occur in different languages, so I wrote a PHP script that fetches texts online or from disk and parses them. I chose one classic novel as the source for each language. The choices are certainly not representative of the languages, but I think they still provide some insight. From each source I used only the text that actually belongs to the novel.

Basically, I provide an alphabet, and every sequence of three or more characters that consists exclusively of characters from that alphabet is considered a word.

The word ‘ralf’ would then lead to a tick at r-a, a-l and l-f, with the first character of each pair listed in the first column and the second character listed in the first row.
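The counting rule above can be sketched in a few lines. This is a minimal Python sketch, not the original PHP script; the function name `count_pairs`, the `min_len` parameter and the lowercasing step are my assumptions, as is reading "sequence" as a maximal run of alphabet characters.

```python
import re
from collections import Counter

def count_pairs(text, alphabet, min_len=3):
    """Count ordered adjacent character pairs inside words.

    Assumption: a word is a maximal run of at least min_len characters
    that all belong to the alphabet, and the text is lowercased first.
    """
    pattern = re.compile("[" + re.escape(alphabet) + "]{" + str(min_len) + ",}")
    counts = Counter()
    for word in pattern.findall(text.lower()):
        # Pair every character with its right-hand neighbour.
        for first, second in zip(word, word[1:]):
            counts[(first, second)] += 1
    return counts

latin = "abcdefghijklmnopqrstuvwxyz"
pairs = count_pairs("Ralf", latin)
# pairs now holds one tick each for ('r', 'a'), ('a', 'l') and ('l', 'f')
```

Runs shorter than three characters contribute nothing, so a text like "an ox" produces no ticks at all.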

German / ETA Hoffmann / Lebens-Ansichten des Katers Murr

514’668 character combinations

heatmap for frequencies of character combinations in a German novel - "Kater Murr" (ETA Hoffmann)

English / Charles Dickens / Great Expectations

519’987 character combinations

heatmap for frequencies of character combinations in an English novel - "Great Expectations" (Charles Dickens)

Russian / Fyodor Dostoyevsky / Crime and Punishment

624’622 character combinations

heatmap for frequencies of character combinations in a Russian novel - "Crime and Punishment" (Fyodor Dostoyevsky)

Comparison of character frequencies

variety in usage of character combinations for three novels in English, German and Russian

Apparently the German novel uses the smallest relative variety of character combinations.

Conditional formatting in Excel

I applied four different formatting rules. Most of them are rather arbitrary and only help to see more quickly how the frequencies are distributed. A white font color means that the figure belongs to the top 10, and the green color (all figures from 0 to 9) is supposed to give an idea of how many character combinations are not really used in the sample.
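The two rules described here (top 10 and counts from 0 to 9) can also be reproduced outside Excel. The following is a rough Python sketch under my own assumptions: `classify_cells`, `top_n` and `rare_below` are hypothetical names, the pair counts are expected as a dictionary keyed by (first, second) tuples, and Excel's own top-10 rule may treat ties differently.

```python
from collections import Counter

def classify_cells(counts, alphabet, top_n=10, rare_below=10):
    """Return the top_n most frequent pairs and the rarely used pairs."""
    # Fill in every cell of the grid, including pairs that never occurred.
    grid = {(a, b): counts.get((a, b), 0) for a in alphabet for b in alphabet}
    # Highest count first; ties are ordered by the pair itself.
    ranked = sorted(grid.items(), key=lambda item: (-item[1], item[0]))
    top = [pair for pair, _ in ranked[:top_n]]
    # "Green" cells: pairs seen fewer than rare_below times (0 to 9 by default).
    rare = [pair for pair, n in grid.items() if n < rare_below]
    return top, rare

# Tiny made-up example with a two-letter alphabet.
sample = Counter({("a", "b"): 50, ("b", "a"): 12, ("a", "a"): 3})
top, rare = classify_cells(sample, "ab", top_n=2)
```

Dividing `len(rare)` by the number of cells in the grid gives a rough share of combinations that are hardly used in a sample.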

Relative totals in last column and row

This part of the map shows how often a character appears in the first position (last column) or the second position (last row) of a combination. A character inside a word is counted once in each position, so any asymmetry can only come from the beginning or the end of a word. A strong asymmetry is observable, for example, for the letter “e” in the English novel: 15.3% of combinations end with “e” but only 10.8% start with “e”.
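The relative totals in the last column and row are just sums over the pair counts. Here is a small Python sketch of that calculation; `position_shares` is a hypothetical name, and the counts are assumed to be a dictionary keyed by (first, second) tuples.

```python
from collections import Counter

def position_shares(counts):
    """Share of all pairs in which each character is first or second."""
    total = sum(counts.values())
    if total == 0:
        return {}, {}
    first = Counter()
    second = Counter()
    for (a, b), n in counts.items():
        first[a] += n   # contributes to the total in the last column
        second[b] += n  # contributes to the total in the last row
    first_share = {c: n / total for c, n in first.items()}
    second_share = {c: n / total for c, n in second.items()}
    return first_share, second_share

# Pair counts for the two words "the" and "tree", written out by hand.
counts = Counter({("t", "h"): 1, ("h", "e"): 1, ("t", "r"): 1,
                  ("r", "e"): 1, ("e", "e"): 1})
first_share, second_share = position_shares(counts)
# "e" is second in 3 of 5 pairs but first in only 1 of 5,
# because both words end with "e" and neither starts with it.
```

For any character, the first-position total minus the second-position total equals the number of words starting with it minus the number of words ending with it.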

Conclusions

Apart from the obvious observations mentioned: none yet. I might come back to this perspective on language to play around with it more. You’re welcome to share any insights or ideas, of course. But still, the pictures do look nice, don’t they?

The heatmaps and charts were made with Excel 2010. I will soon publish articles about how this can be achieved.

One thought on “Frequency of character combinations for three languages”

  1. Is it possible to recognize which language is used in text if you have enough information about letter-pairs?
