Wow, what a headline … okay, I admit it’s quite sensationally phrased, given that it anticipates just one possible interpretation of increasingly more births in summer and autumn than in spring … but I guess I just get more proactive at marketing with every post I publish!
I have a habit of compulsively wondering, approximately every 34.765th day, how zip compression (bzip2 in this case) might be used to measure the information contained in data. This time the question that popped into my head was whether, and if so how, permuting a text’s words would affect its compression ratio. The answer is: it does, and it does so in a very weird and fascinating way.
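To make the question concrete, here is a minimal Python sketch of the kind of measurement described above. It is not the code behind the post: the function names, the toy sample text and the definition of the ratio as compressed bytes divided by original bytes are all assumptions made for illustration. Only Python's standard `bz2` and `random` modules are used.

```python
import bz2
import random


def compression_ratio(text):
    """Compressed size divided by original size, both measured in bytes."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


def shuffled_ratios(text, n_shuffles=20, seed=0):
    """Compression ratios of n_shuffles random permutations of the text's words."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    words = text.split()
    ratios = []
    for _ in range(n_shuffles):
        permuted = words[:]
        rng.shuffle(permuted)
        ratios.append(compression_ratio(" ".join(permuted)))
    return ratios


# Hypothetical toy text; a real experiment would read a whole book from disk.
sample = ("the quick brown fox jumps over the lazy dog and "
          "the lazy dog sleeps while the quick brown fox runs ") * 200

original = compression_ratio(sample)
ratios = shuffled_ratios(sample)
print(f"original: {original:.4f}")
print(f"shuffled: min {min(ratios):.4f}, max {max(ratios):.4f}")
```

Shuffling keeps the word frequencies intact and only destroys the word order, so any difference between the original ratio and the shuffled ratios can be attributed to structure in the ordering.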
Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …
The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents in the corpus. In terms of tf-idf, a word is important for a specific document if it shows up relatively often within that document and rarely in the other documents of the corpus. I used tf-idf to extract keywords from transcripts of sessions of the German Bundestag and am quite happy with the results. I was dealing with (so far) 18 documents, together containing more than one million words, which would have to be aggregated for the term frequency, then outer joined and then fed to the formula, so I was at first a bit worried about how R would perform. To my surprise, the whole processing, from reading the files from disk to the final table of tf-idf values, took about 8 seconds. That’s not bad at all.
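The computation described above can be sketched in a few lines. This is an illustrative Python version, not the R code the post refers to. It assumes one common variant of the formula (relative term frequency times the natural log of the number of documents divided by the number of documents containing the word), and the three tiny "sessions" are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Map each document name to a dict of word -> tf-idf score.

    docs: dict of document name -> list of words.
    Assumed variant: tf = count / words in document,
                     idf = ln(number of documents / documents containing the word).
    """
    n_docs = len(docs)
    counts = {name: Counter(words) for name, words in docs.items()}

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    scores = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in word_counts.items()
        }
    return scores


# Hypothetical toy corpus standing in for the session transcripts.
docs = {
    "session_1": "the budget debate the budget vote".split(),
    "session_2": "the climate debate the climate law".split(),
    "session_3": "the pension reform the vote".split(),
}

scores = tf_idf(docs)
for name, word_scores in scores.items():
    keyword = max(word_scores, key=word_scores.get)
    print(name, keyword, round(word_scores[keyword], 3))
```

A word that occurs in every document, such as "the" here, gets an idf of zero and therefore drops out as a keyword regardless of how often it appears.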
First of all, this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credit goes to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation which stochastically proves the interpretation.
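One way such a simulation could look is sketched below in Python. It is not the simulation from the article, and it rests on stated assumptions: the unknown "true" probability is drawn uniformly from [0, 1], a fixed number of trials is simulated with it, and only those probabilities that reproduce the observed number of successes are kept. Under these assumptions the kept values should follow a Beta(successes + 1, failures + 1) distribution. The numbers (7 successes in 10 trials) and the function name are hypothetical.

```python
import random
from statistics import mean


def probabilities_matching(successes, trials, n_draws=100_000, seed=1):
    """Draw a 'true' probability uniformly, simulate the trials, and keep
    the probability only if it produced exactly the observed successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    kept = []
    for _ in range(n_draws):
        p = rng.random()
        hits = sum(rng.random() < p for _ in range(trials))
        if hits == successes:
            kept.append(p)
    return kept


successes, trials = 7, 10  # hypothetical observation: 7 hits in 10 attempts
kept = probabilities_matching(successes, trials)

# Mean of Beta(a, b) is a / (a + b), with a = successes + 1, b = failures + 1.
a, b = successes + 1, trials - successes + 1
print(f"kept {len(kept)} of 100000 draws")
print(f"simulated mean: {mean(kept):.3f}, Beta({a}, {b}) mean: {a / (a + b):.3f}")
```

The measured rate of 0.7 is thus treated as one outcome of chance, and the kept values show which underlying probabilities could plausibly have produced it.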