Counting the frequency of ordered character pairs

I wrote this PHP script to count the character combinations for a previous post. The script covers more functionality than I actually used for the article. At first I fetched websites, using a service from www.alchemyapi.com that extracts the actual text content from a page. But after realizing that it is a bit tricky to choose a halfway representative set of websites for a language, I took a shortcut and simply evaluated, for each language, one big text file containing a classic novel.

Another artefact in the script stems from my futile attempt to serve a CSV/TSV file (comma/tab-separated values) via HTTP and have Excel map it onto a table automatically. I just couldn’t make Excel handle the UTF-8 characters properly. The furthest I got was to have them displayed correctly (after prepending a BOM to the file’s content), but then the tabs/commas were no longer interpreted as column separators. So eventually I just went with PHPExcel, which works (almost) like a charm.
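For illustration, here is a minimal sketch of what "prepending a BOM" to tab-separated content can look like. It is not the code from my script: the helper name to_tsv_with_bom() and the sample rows are made up for this example, and it only builds the response body.

```php
<?php
// Hypothetical helper (not part of the original script): turns rows of
// cells into a tab-separated string with a UTF-8 byte order mark in front.
function to_tsv_with_bom(array $rows): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes of the UTF-8 BOM

    $lines = [];
    foreach ($rows as $row) {
        // Tabs and line breaks inside a cell would break the layout,
        // so they are replaced by a single space.
        $cells = array_map(function ($cell) {
            return preg_replace('/[\t\r\n]+/', ' ', (string) $cell);
        }, $row);
        $lines[] = implode("\t", $cells);
    }

    return $bom . implode("\r\n", $lines) . "\r\n";
}

// Sample data, invented for this example.
$body = to_tsv_with_bom([
    ['pair', 'count'],
    ['üb', 2],
]);
```

To serve the result one would send a header such as header('Content-Type: text/tab-separated-values; charset=utf-8') before echoing $body. As described above, this was enough for me to get the characters displayed correctly, but it did not make Excel split the columns.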

I added a few comments to the script, but I think it is fairly self-explanatory on a close look anyway. The most interesting aspects are probably the usage of PHPExcel and the proper handling of UTF-8 characters in PHP.
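To give an idea of the UTF-8 part, the following is a simplified sketch of counting ordered character pairs. It is not the actual script: the function name count_char_pairs() is invented for this example, and the choices to count only pairs of letters and to leave upper and lower case untouched are assumptions made to keep it short. The important detail is the /u modifier, which makes PCRE treat the string as UTF-8 instead of as single bytes.

```php
<?php
// Hypothetical function (not taken from the original script): counts how
// often each ordered pair of adjacent letters occurs in a UTF-8 string.
function count_char_pairs(string $text): array
{
    // Split into characters rather than bytes. With the /u modifier the
    // empty pattern matches between code points, so "ü" stays one element.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    $counts = [];
    for ($i = 0, $last = count($chars) - 1; $i < $last; $i++) {
        $a = $chars[$i];
        $b = $chars[$i + 1];

        // Assumption for this sketch: only pairs of two letters count,
        // so spaces, digits and punctuation break a pair.
        if (preg_match('/^\p{L}$/u', $a) && preg_match('/^\p{L}$/u', $b)) {
            $pair = $a . $b;
            $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        }
    }

    arsort($counts); // most frequent pairs first, keys are preserved
    return $counts;
}

$pairs = count_char_pairs('überüb');
```

Using strlen() or $text[$i] instead would work on bytes and cut multi-byte characters such as "ü" in half, which is exactly the kind of problem the UTF-8 handling has to avoid.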

If you are not sure what UTF-8, Unicode, ASCII, ISO-… are, how they relate and how they differ, I recommend this epic article on the topic by Stack Overflow co-founder Joel Spolsky.

I am going to cover how to read and write Excel files with PHP using PHPExcel in a separate article on joyofdata.de. In case you actually execute this script and wonder why you can’t use conditional formatting in the produced spreadsheet: this seems to be a bug in PHPExcel. The workaround is to select the whole sheet (in the generated Excel file) and change the background color to “nothing”.