As a Data Scientist it is my Obligation to support #nobagida, #nopegida and any other #no[a-z]{2}gida today :)

Political Opinion on a Scale from 0 to 2?


Just came back with my girlfriend from the demonstration at Sendlinger Tor. I noticed quite a few Palestinian flags being waved around – fair enough – but I thought to myself that I would actually like to see one or two Israeli flags as well. Later we crossed the street to have a look at the pegida guys, and there I noticed no fewer than two Israeli flags. That was kind of weird … but of course a lot of their presentation revolves around emphasizing how not-Nazi they are – which is slightly odd given the occasional neo-Nazi hanging around with them. Also, given their focus on how bad Muslims are, it might seem plausible to those poorly educated people to show off how pro-Semitic they are because Jews supposedly share some of their views.

Continue reading

Humor is a powerful, alternative Method for processing Data and reporting Results.


“Je n’ai pas peur des représailles. Je n’ai pas de gosses, pas de femme, pas de voiture, pas de crédit. Ça fait sûrement un peu pompeux, mais je préfère mourir debout que vivre à genoux.”

(“I am not afraid of reprisals. I have no children, no wife, no car, no debt. It might sound a bit pompous, but I’d rather die on my feet than live on my knees.”)

Charb – Interview 2012

Titanic challenge on Kaggle with decision trees (party) and SVMs (kernlab)

The Titanic challenge on Kaggle is about inferring from a number of personal details whether a passenger survived the disaster or not. I gave two algorithms a try: decision trees using the R package party and SVMs using the R package kernlab. I chose party for the decision trees over the more prominent rpart because the authors of party make a very good point about why their approach is likely to outperform it and other approaches in terms of generalization.

Continue reading

The tf-idf-Statistic For Keyword Extraction

The tf-idf statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents in the corpus. In terms of tf-idf, a word is important for a specific document if it shows up relatively often within that document and rarely in the other documents of the corpus. I used tf-idf to extract keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words which would have to be aggregated for the term frequency, then outer joined and then fed to the formula, I was at first a bit worried about how R would perform. To my surprise the whole processing, from reading the files from disk to the final table of tf-idf values, took about 8 seconds. That’s not bad at all.
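
To make the idea concrete, here is a minimal sketch in Python (not the R code used for the Bundestag protocols). It assumes one common variant of the formula, tf = term count / document length and idf = log(N / number of documents containing the term); the post does not say which variant was used. The function names and the three toy "sessions" are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf scores.

    docs: dict mapping a document name to its list of tokens.
    Returns a dict: document name -> {term: tf-idf score}.
    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


def top_keywords(scores, name, k=3):
    """Return the k highest-scoring terms of one document."""
    ranked = sorted(scores[name].items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, purely for illustration.
docs = {
    "session_1": "the budget debate the budget vote".split(),
    "session_2": "the energy debate the energy law".split(),
    "session_3": "the health reform the health vote".split(),
}
scores = tf_idf(docs)
print(top_keywords(scores, "session_1", k=1))
```

A word such as "the", which occurs in every document, gets an idf of log(1) = 0 and therefore drops out as a keyword, while a word concentrated in one document rises to the top.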

Continue reading

Tool for Visualization of Connections between Agents and Entities in Context of Redtubegate

Early in December 2013 a law firm began sending out approximately 10 to 40 thousand cease-and-desist letters on behalf of the rights holder of a bunch of porn flicks, for streaming those films on redtube. So far, so good. A lot of people didn’t like receiving bills ranging from 250 to more than a thousand euros for streaming erotica just before Christmas, especially when they were pretty sure that they hadn’t even done so. Given the magnitude of this case, a lot of these people turned sour and started to dig a bit deeper. What was brought to light is a shady network of companies with links where there should be none, and a bunch of business partners who also turned out to have more in common than could be seen at first glance.

Continue reading

Social Network Analysis by Lada Adamic on coursera

Did you know that top researchers and universities from all over the world offer their knowledge in structured and partly certified online courses? Well, now you know! Those courses are referred to as MOOCs, which stands for “Massive Open Online Courses”, and they are for me one of THE digital discoveries of the year 2013. The three biggest platforms are currently edX, coursera and Udacity. I am following courses on all three and sometimes I really can’t believe how awesome this opportunity is.

Continue reading

Visualization of voting behaviour in the 17th German Bundestag


Click to get to the interactive 3D scatter plot with labels (PCA plot)

(Attention: The calculations and analysis are not biased by my political views – but the interpretation of the results might be and their verbal formulation certainly is … ;)

About a week ago I came across an article titled “How divided is the Senate?” by Vik Paruchuri, in which he uses a method called principal component analysis (PCA) to visualize how close the votes cast by senators of the 113th Congress of the USA are to each other. I immediately fell in love with the idea behind this article as well as the method applied – which was a great opportunity to revise some statistics and algebra basics. And because (pretending) transparency is a major foundation of a modern democracy, full, detailed, word-by-word protocols of every meeting of the Bundestag are published as PDFs and text files on their website. So I downloaded all those protocols for the 17th Bundestag, extracted the votes and loaded them into a data frame. That was quite a drag because, judging from typos (Sevim Dadelen, Sevim Dagelen, Sevim Dagdelen, …), different name versions (Erwin Josef Rüddel, Erwin Rüddel) and line breaks within the longer names like Dr. Karl-Theodor Freiherr von und zu Guttenberg (his title is gone, so the name has become a tad handier by now), those text files were manually sanitized PDF conversions of live transcripts. I’ll spare you the details – but getting the data finally right took quite some effort.

Continue reading

Customized Returning Visitors with Google Analytics

Google Analytics offers a KPI for “returning visitors” but what if you would like to be more specific about the meaning of “returning”? Actually this figure is customizable with basic API requests and a very simple idea – at least for two consecutive time spans.

The idea


Let’s assume we want to know how many visitors from calendar week 2013-1 (Dec 31 2012 until Jan 6 2013) returned to the website in calendar week 2013-2 (Jan 7 2013 until Jan 13 2013). I’ll refer to calendar week 2013-1 as T1, to 2013-2 as T2 and to both combined as T1+T2. The function v maps a time span onto its number of unique visitors – so v(T1) = 5 means that Analytics counted 5 unique visitors in calendar week 2013-1. A visitor who showed up in both weeks is counted once in v(T1) and once in v(T2) but only once in v(T1+T2), so the number of visitors in T2 who also visited in T1 is:

“Number of visitors from T1 who came back in T2” = v(T1) + v(T2) – v(T1+T2)
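
The formula can be checked with a small sketch in Python. It does not call the Analytics API; the visitor IDs below are invented, and in practice only the three counts v(T1), v(T2) and v(T1+T2) would come back from three separate requests. The function name is hypothetical.

```python
def returning_visitors(v_t1, v_t2, v_combined):
    """Visitors seen in both time spans, by inclusion-exclusion.

    v_t1, v_t2: unique visitors counted in each span separately.
    v_combined: unique visitors counted over both spans together.
    """
    return v_t1 + v_t2 - v_combined


# Invented visitor IDs, only to verify the arithmetic against a direct count.
week_1 = {"a", "b", "c", "d", "e"}   # v(T1) = 5
week_2 = {"d", "e", "f"}             # v(T2) = 3

v_t1 = len(week_1)
v_t2 = len(week_2)
v_combined = len(week_1 | week_2)    # v(T1+T2) = 6, each visitor counted once

returned = returning_visitors(v_t1, v_t2, v_combined)
print(returned)  # 5 + 3 - 6 = 2, matching the overlap {"d", "e"}
```

The result only holds if unique visitors are counted consistently across all three requests, which is an assumption about how the visitor metric behaves over longer time spans.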

Continue reading

“Statistics by Use” in Jerusalem

My girlfriend and I just arrived back from an awesome and very sunny two-week journey to Israel. We spent most of the time in Haifa, where we stayed with our friend Shai, but of course we also jaunted (first time ever I use this verb) to Eilat, Tel Aviv and Jerusalem. In Jerusalem the major highlight is the old city – a or the center for the Jewish, Christian and Muslim religions. It’s not that large but packed with historical places – so after entering the area we checked out a map hanging next to the gate, and of course the first thing I did was to pinpoint the place where we were (labelled “you are here”). Anni pointed out to me that obviously I was not the first person to do that, because the color was rubbed off already. This phenomenon struck me as quite interesting, so I wanted to share it here. Actually I still have no good idea what to call this – or maybe there is a name for it already? You’re welcome to help me out.

Continue reading