I wrote this PHP script to count the character combinations for a previous post. This script is covering more functionality than I acutally used for the article. First I was fetching web-sites – using a service from www.alchemyapi.com which extracts the actual text information from a site. But after figuring that it was a bit tricky to choose a halfway representative set of web sites for a language, I took a short cut and just evaluated a big text file containing a classic novel for each language.
Another artefact in the script is stemming from my futile attempt to serve a CSV/TSV file (comma / tab separated values) via HTTP and have it automatically mapped onto a table by Excel. I just couldn’t make Excel handle the UTF-8 characters propperly. The furthest I got was to have them displayed correctly (after prepending a BOM to the file’s content) but then the tabs/commas weren’t evaluated into column separators any more. So I just went with PHPExcel eventually which works (almost) like a charm.
I added a few comments to the script but I think that it is quite evident after taking a close look at it anyway. The most interesting aspects of this script are probably the usage of PHPExcel for one and the propper handling of UTF-8 characters in PHP for another.
If you are not sure about what UTF-8, Unicode, ASCII, ISO-… are, how they relate and how they differ I recommend this epic article on that topic by stackoverflow founder Joel Spolsky.
How to read and write Excel files with PHP using PHPExcel I am going to cover in a separate article on joyofdata.de. In case you actually execute this script and you are wondering why you can’t use conditional formatting in the produced spreadsheet – this seems to be a bug in PHPExcel. The trick is to select a whole sheet (in the generated Excel file) and change the background color to “nothing”.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 |
<?php include './phpexcel/Classes/PHPExcel.php'; $resultFile = "character_sequences.xlsx"; $alphabet = array( "de" => 'aäbcdefghijklmnoöpqrsßtuüvwxyz', "en" => 'abcdefghijklmnopqrstuvwxyz', "ee" => 'abdefghijklmnoprsšzžtuvõäöü', "ru" => '?????????????????????????????????' ); $sources = array( "de" => array( "etah_murr.txt", ), "en" => array( "dickens_expectations.txt", ), "ru" => array( "dosto_cnp.txt", ), "ee" => array( "./ee.txt", ), ); // loop through the provided alphabets and sources foreach($alphabet as $lang => $chars){ addArraysAsTabToXlsx($resultFile,$lang, countCharacterSequences($chars,$sources[$lang]) ); } function countCharacterSequences($alphabet, $sources) { // set up the chross table for all character combinations $targetArray = array_map(function($i) use ($alphabet) { return mb_substr($alphabet,$i,1,'UTF-8'); }, range(0, mb_strlen($alphabet,'UTF-8')-1)); $targetArray = array_fill_keys($targetArray, array_fill_keys($targetArray,0)); $data = array(); // collects the sources and maps all 3-char-sequences into an array foreach($sources as $s){ if(substr($s, -4, 4) === ".txt"){ $text = file_get_contents($s); } else { $params = array( "url" => $s, "apikey" => "[...]", "useMetadata" => 0, "extractLinks" => 0, "outputMode" => "json" ); $url = "http://access.alchemyapi.com/calls/url/URLGetText?" . http_build_query($params); $json = json_decode(file_get_contents($url), true); $text = $json['text']; } preg_match_all("/[$alphabet]{3,}/u",$text,$matches); $data = array_merge($data,$matches[0]); } // walks through all 2-char-couples and increments the respective target array element foreach ($data as $d) { $d = mb_strtolower($d, 'UTF-8'); for($i = 1; $i < mb_strlen($d,'UTF-8'); $i++) { if(array_key_exists(mb_substr($d, $i-1, 1, 'UTF-8'),$targetArray) && array_key_exists(mb_substr($d, $i, 1, 'UTF-8'),$targetArray) ) { $targetArray[mb_substr($d, $i-1, 1, 'UTF-8')][mb_substr($d, $i, 1, 'UTF-8')]++; } } } $s = 0; $merged = array(); foreach ($targetArray as $a) { $s += array_sum($a); echo "<br>" . implode("<br>", $a); } echo "<br><br>" . $s . "<br><br>"; return $targetArray; } // outputs the frequency map as a csv function outputArrayAsCsv($filename, array $targetArray) { header('Content-Type: text/csv; charset=utf-8'); header("Content-Disposition: attachment; filename={$filename}.csv"); echo chr(239).chr(187).chr(191).implode(",",array_merge(array("-"),array_keys($targetArray))) . "\r\n"; foreach($targetArray as $c => $v){ echo implode(",", array_merge(array($c),$v)) . "\r\n"; } } // creates an xlsx with a separate sheet for each language function addArraysAsTabToXlsx($filename, $tabName, array $targetArray) { $excelWorkbook = null; // if file exists - load it - if not - create it if(file_exists($filename)) { $reader = PHPExcel_IOFactory::createReader("Excel2007"); $excelWorkbook = $reader->load($filename); } else { $excelWorkbook = new PHPExcel(); } // delete a sheet with the provided name if it already exists $sheet = $excelWorkbook->getSheetByName($tabName); if ($sheet !== null) { $excelWorkbook->removeSheetByIndex($excelWorkbook->getIndex($sheet)); } // new sheet $sheet = new PHPExcel_Worksheet($excelWorkbook, $tabName); $sheet = $excelWorkbook->addSheet($sheet); $columns = array_keys($targetArray); $rows = array_keys($targetArray[$columns[0]]); for($i = 0; $i < count($columns); $i++){ $sheet->setCellValueByColumnAndRow($i+1,1,$columns[$i]); } for($i = 0; $i < count($rows); $i++){ $sheet->setCellValueByColumnAndRow(0,$i+2,$rows[$i]); } for($i = 0; $i < count($columns); $i++){ for($j = 0; $j < count($rows); $j++) { $sheet->setCellValueByColumnAndRow($i+1, $j+2, $targetArray[$rows[$j]][$columns[$i]]); } } $excelWorkbook->setActiveSheetIndex($excelWorkbook->getIndex($sheet)); $xlsx = new PHPExcel_Writer_Excel2007($excelWorkbook); $xlsx->save($filename); } |