Relation of Word Order and Compression Ratio and Degree of Structure

I have a habit of compulsively wondering, approximately every 34.765th day, how zip compression (bzip2 in this case) might be used to measure the information contained in data. This time the question that popped up in my head was whether, and if so how, permuting a text’s words would affect its compression ratio. The answer is: it does, and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …
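To make the experiment concrete, here is a minimal sketch in Python of how such a comparison could be set up. It is not the code behind the post: the helper names `compression_ratio` and `shuffled_ratios` and the toy sample text are assumptions for illustration only, and a real run would use the full novel instead.

```python
import bz2
import random


def compression_ratio(text):
    """Compressed size divided by original size, both measured in bytes."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


def shuffled_ratios(text, runs=20, seed=0):
    """Compression ratios of `runs` random permutations of the text's words."""
    rng = random.Random(seed)
    words = text.split()
    ratios = []
    for _ in range(runs):
        permuted = words[:]
        rng.shuffle(permuted)
        ratios.append(compression_ratio(" ".join(permuted)))
    return ratios


# Toy stand-in for a real text: highly structured because it repeats itself.
sample = ("the quick brown fox jumps over the lazy dog and then "
          "the lazy dog chases the quick brown fox around the yard ") * 200

original = compression_ratio(sample)
shuffled = shuffled_ratios(sample)
print(original, min(shuffled), max(shuffled))
```

Plotting the values in `shuffled` as a histogram next to `original` would give a picture of the distribution of compression ratios for one text.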

The tf-idf-Statistic For Keyword Extraction

The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents in the corpus. In terms of tf-idf, a word is important for a specific document if it shows up relatively often within that document and rarely in the other documents of the corpus. I used tf-idf to extract keywords from the protocols of sessions of the German Bundestag and am quite happy with the results. I was dealing with (so far) 18 documents, together containing more than one million words, which had to be aggregated for the term frequency, then outer joined and then fed to the formula, so at first I was a bit worried about how R would perform. To my surprise, the whole processing, from reading the files from disk to the final table of tf-idf values, took about 8 seconds. That’s not bad at all.
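The idea can be sketched in a few lines. The following is a Python illustration rather than the R code used for the Bundestag protocols, and it assumes one common variant of the formula (relative term frequency times the natural logarithm of the number of documents divided by the document frequency). The function name `tf_idf` and the three tiny documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns a dict: document name -> {term: tf-idf score}.
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            term: (k / total) * math.log(n_docs / df[term])
            for term, k in term_counts.items()
        }
    return scores


docs = {
    "session_1": "budget debate budget vote tax".split(),
    "session_2": "energy debate energy vote grid".split(),
    "session_3": "health debate vote health care".split(),
}
scores = tf_idf(docs)
top = {name: max(s, key=s.get) for name, s in scores.items()}
print(top)
```

Words such as "debate" and "vote" occur in every document, so their inverse document frequency is zero and they drop out as keywords.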

An intuitive interpretation of the beta distribution

First of all, this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credits go to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation that demonstrates the interpretation stochastically.
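The excerpt does not show the simulation itself, so the following Python sketch is only one plausible way to set it up, under stated assumptions: the unknown "true" probability is drawn uniformly from [0, 1], a fixed number of trials is simulated with it, and only those probabilities that reproduce the observed hit count are kept. Under these assumptions the kept values should follow a Beta(hits + 1, misses + 1) distribution. The function name `posterior_samples` and the numbers 3 and 10 are illustrative.

```python
import random


def posterior_samples(hits, trials, draws=100_000, seed=1):
    """Keep every randomly drawn probability that reproduces the observation."""
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        p = rng.random()  # candidate "true" probability, uniform prior
        observed = sum(rng.random() < p for _ in range(trials))
        if observed == hits:
            kept.append(p)
    return kept


hits, trials = 3, 10
kept = posterior_samples(hits, trials)

empirical_mean = sum(kept) / len(kept)
# Mean of Beta(a, b) is a / (a + b) with a = hits + 1, b = trials - hits + 1.
beta_mean = (hits + 1) / (trials + 2)
print(len(kept), empirical_mean, beta_mean)
```

A histogram of `kept` laid over the Beta(4, 8) density would be the visual version of the same check.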

Visualization of voting behaviour in the 17th German Bundestag


Click to get to the interactive 3D scatter plot with labels (PCA plot)

(Attention: The calculations and analysis are not biased by my political views – but the interpretation of the results might be and their verbal formulation certainly is … ;)

About a week ago I came across an article titled “How divided is the Senate?” by Vik Paruchuri, in which he uses a method called principal component analysis (PCA) to visualize how close the votes cast by senators of the 113th Congress of the USA are to each other. I immediately fell in love with the idea behind this article as well as the method applied, which was a great opportunity to revise some statistics and algebra basics. And because (pretending) transparency is a major foundation of a modern democracy, full word-by-word protocols of every meeting of the Bundestag are published as PDFs and text files on its website. So I downloaded all the protocols for the 17th Bundestag, extracted the votes and loaded them into a data frame. That was quite a drag: judging from typos (Sevim Dadelen, Sevim Dagelen, Sevim Dagdelen, …), different name versions (Erwin Josef Rüddel, Erwin Rüddel) and line breaks within longer names like Dr. Karl-Theodor Freiherr von und zu Guttenberg (his title is gone, so the name has become a tad handier by now), those text files were manually sanitized PDF conversions of live transcripts. I’ll spare you the details, but getting the data right took quite some effort.
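For readers who have not met PCA before, here is a small self-contained Python sketch of the core idea. It is not the analysis behind the plot: it computes only the first principal component, uses plain power iteration on the covariance matrix instead of a library PCA routine, and runs on four invented members with five invented votes (1 = yes, -1 = no, 0 = abstention). All names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the vote columns (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 + 0.1 * j for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project every member onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


votes = {
    "member_a": [1, 1, -1, -1, 1],
    "member_b": [1, 1, -1, 0, 1],
    "member_c": [-1, -1, 1, 1, -1],
    "member_d": [-1, 0, 1, 1, -1],
}
direction, scores = first_principal_component(list(votes.values()))
by_member = dict(zip(votes, scores))
print(by_member)
```

Members who vote alike end up with scores of the same sign on the first component, which is what makes the two invented blocs fall apart visually in a scatter plot.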

Customized Returning Visitors with Google Analytics

Google Analytics offers a KPI for “returning visitors” but what if you would like to be more specific about the meaning of “returning”? Actually this figure is customizable with basic API requests and a very simple idea – at least for two consecutive time spans.

The idea

Let’s assume we want to know how many visitors from calendar week 2013-1 (Dec 31 2012 until Jan 6 2013) returned to the website in calendar week 2013-2 (Jan 7 2013 until Jan 13 2013). I’ll refer to calendar week 2013-1 as T1, to 2013-2 as T2 and to both combined as T1+T2. The function v maps a time span onto its number of unique visitors, so v(T1) = 5 means that Analytics counted 5 unique visitors in calendar week 2013-1. A visitor who came in both weeks is counted once in v(T1), once in v(T2), but only once in v(T1+T2). The number of visitors in T2 who also visited in T1 is therefore:

“Number of visitors from T1 who came back in T2” = v(T1) + v(T2) – v(T1+T2)
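The formula is plain inclusion-exclusion, which a tiny Python check makes visible. The visitor names below are invented, and the function `returning_visitors` is a hypothetical helper; in practice the three inputs would be the unique-visitor counts reported by Analytics for T1, T2 and the combined span.

```python
def returning_visitors(v_t1, v_t2, v_combined):
    """Visitors counted in both spans: v(T1) + v(T2) - v(T1+T2)."""
    return v_t1 + v_t2 - v_combined


# Invented visitor IDs, only used to verify the formula against a set intersection.
t1 = {"anna", "ben", "cem", "dora", "eli"}   # v(T1) = 5
t2 = {"ben", "dora", "fay"}                  # v(T2) = 3

returned = returning_visitors(len(t1), len(t2), len(t1 | t2))
print(returned)
```

The result equals the size of the intersection of both sets, which is exactly what "returning" means here.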

“Statistics by Use” in Jerusalem

My girlfriend and I just got back from an awesome and very sunny two-week journey to Israel. We spent most of the time in Haifa, where we stayed with our friend Shai, but of course we also jaunted (first time ever I use this verb) to Eilat, Tel Aviv and Jerusalem. In Jerusalem the major highlight is the Old City, a (or the) center of the Jewish, Christian and Muslim religions. It’s not that large but it is packed with historical places. After entering the area we checked out a map hanging next to the gate, and of course the first thing I did was to pinpoint the place where we were (labelled “you are here”). Anni pointed out to me that I was obviously not the first person to do that, because the color had already been rubbed off. This phenomenon struck me as quite interesting, so I wanted to share it here. I still have no good idea what to call it, or maybe there is a name for it already? You’re welcome to help me out.

(Image: Jerusalem's Old City)

Life and Death and NUTS


The usual administrative units are too heterogeneous for regional statistics. To make regions comparable, territorial units of similar population size are required. For the European Union and further states associated with it in one way or another, the NUTS (Nomenclature des unités territoriales statistiques) classification was developed in 1980 and is updated triennially.

There are four NUTS levels: 0, 1, 2 and 3. Every region is assigned a code consisting of two to five characters. The first two characters denote the state (the usual ISO 3166 two-letter code, Greece being an exception as it is referred to as EL instead of GR). For NUTS 1, 2 and 3, the characters following it form a hierarchical system. So, for example, DE21H (Munich) belongs to DE21 (Oberbayern), which belongs to DE2 (Bayern / Bavaria), which belongs to DE (Germany).
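Because the hierarchy is encoded in the prefix, the parent regions of any code can be derived by cutting off characters from the right. A minimal Python sketch follows; the function names `nuts_level` and `nuts_ancestors` are made up for this illustration.

```python
def nuts_level(code):
    """NUTS level of a code: two characters are level 0, five are level 3."""
    if not 2 <= len(code) <= 5:
        raise ValueError("a NUTS code has two to five characters")
    return len(code) - 2


def nuts_ancestors(code):
    """Chain from the given region up to the state (NUTS 0)."""
    nuts_level(code)  # validates the length
    return [code[:n] for n in range(len(code), 1, -1)]


chain = nuts_ancestors("DE21H")
print(chain)              # ['DE21H', 'DE21', 'DE2', 'DE']
print(nuts_level("DE21")) # 2
```

The same prefix trick is handy for aggregating NUTS 3 figures up to a coarser level.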

Eurostat Basics in Action (Unknown Causes of Death)

Eurostat is the institution within the European Union that organizes statistics from the 27 EU member states (e.g. from the German Federal Office of Statistics, which also maintains web access to its data). Its website offers a wealth of statistics, reports, documents and visualization tools. It is pretty huge and I still get lost on it easily or discover new things. So this article doesn’t even try to show you around. I’ll just exemplify one aspect of the site here, the statistics database, in the context of a concrete question. In case you like population statistics thrown on maps you might be interested in the following articles which use data from Eurostat:

The question we’ll investigate

How often did people, split into those younger than 65 and those aged 65 or older, die in the recent past from a cause categorized as “Ill-defined and unknown causes of mortality”? We will be looking at the national level (NUTS 0).

Regional ratio of young women to men in EU

I was curious how the ratio of young women to young men is distributed geographically in Europe. Eurostat offers absolute numbers for all NUTS 2 regions in Europe. The most recent available figures refer to January 2012; in a few cases, such as Turkey, I fell back to January 2011 due to missing values.

The figures are drawn from table “demo_r_d2jan” on Eurostat.

Regional ratio of women to men in EU down to NUTS 2 (Jan 2012).
