Illustrated Guide to ROC and AUC

(In a past job interview I failed at explaining how to calculate and interpret ROC curves – so here goes my attempt to fill this knowledge gap.) Think of a regression model mapping a number of features onto a real number (potentially a probability). This real number can then be mapped onto one of two classes, depending on whether it is greater or lower than some chosen threshold. Let’s take as an example a logistic regression on survival data from the sinking of the Titanic to introduce the relevant concepts, which lead naturally to the ROC (Receiver Operating Characteristic) curve and its AUC or AUROC (Area Under the ROC Curve).
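To make the threshold idea concrete, here is a minimal sketch in plain Python. The scores and labels are made up for illustration (they are not the Titanic data), and the helper names `roc_points` and `auc` are my own. Each distinct score is used once as the threshold, which yields one point of the ROC curve per threshold, and the area is then approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep thresholds from the highest score down, so the curve runs
    # from (0, 0) towards (1, 1).
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # Everything scoring at least t is predicted to be in the positive class.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities and true outcomes (1 = survived).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.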

Continue reading

Introduction to OpenCPU for R on EC2 with Python

OpenCPU is (simply put) a server implementing a RESTful web API for remotely executing R functions and retrieving results. In this tutorial I am going to showcase how OpenCPU can be installed on an EC2 instance running Ubuntu 14.04. Python and its requests package come into play for the purpose of conveniently handling HTTP communication. Thanks to the effort Jeroen Ooms put into developing OpenCPU and composing its documentation, the whole process is comparatively easy and painless.
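As a rough sketch of what such a remote call looks like: OpenCPU exposes R functions under `/ocpu/library/<package>/R/<function>`, and appending `/json` asks for the result as JSON. The host name below is a placeholder, the helper names are my own, and I use the standard library's `urllib` here only to keep the sketch free of dependencies; the same POST can be sent with requests.

```python
import json
import urllib.request


def build_ocpu_request(base_url, package, function, args):
    """Prepare a POST request that calls an R function on an OpenCPU server."""
    url = f"{base_url.rstrip('/')}/ocpu/library/{package}/R/{function}/json"
    body = json.dumps(args).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_ocpu(base_url, package, function, args, timeout=10):
    """Send the request and decode the JSON response (needs a reachable server)."""
    req = build_ocpu_request(base_url, package, function, args)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder host: replace with the public DNS name of your own EC2 instance.
req = build_ocpu_request(
    "http://ec2-host.example.com", "stats", "rnorm", {"n": 3, "mean": 10}
)
print(req.get_method(), req.full_url)
```

Only `build_ocpu_request` runs without a server; `call_ocpu` will fail until the placeholder host points at a real OpenCPU installation.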

Continue reading

As a Data Scientist it is my Obligation to support #nobagida, #nopegida and any other #no[a-z]{2}gida today :)

Political Opinion on a Scale from 0 to 2π

Just came back with my girlfriend from the demonstration at Sendlinger Tor. Noticed quite a few Palestinian flags being waved around – fair enough – but I thought to myself that I would actually like to see one or two Israeli flags as well. Later we went over the street to have a look at the pegida guys when I noticed no fewer than two Israeli flags there. That was kind of weird … but of course for pegida a lot of their presentation revolves around emphasizing how not-Nazi they are – which is slightly odd given the occasional neo-Nazi hanging around with them. Also, given their focus on how bad Muslims are, to those poorly educated people it might seem plausible to show off how pro-Semitic they are because Jews supposedly share some of their views.

Continue reading

Hierarchical Clustering with R (feat. D3.js and Shiny)

Agglomerative hierarchical clustering is a simple, intuitive and well-understood method for clustering data points. I used it with good results in a project to estimate the true geographical position of objects based on measured estimates. With this tutorial I would like to describe the basics of this method, how to implement it in R with hclust and some ideas on how to decide where to cut the tree. This was also a great opportunity for composing another Shiny/D3.js app (GitHub for the code, shinyapps.io for the app) – something I have wanted to do for a while now. At the end of the text I write a bit about what I learned in that regard.
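For readers who want to see the agglomerative idea before opening the tutorial, here is a small illustrative sketch in Python rather than R. It is not the hclust implementation: it assumes single linkage (distance between the closest members of two clusters), uses made-up points, and expresses "cutting the tree" as a hypothetical `cut_distance` above which merging stops.

```python
import math


def single_linkage(points, cut_distance):
    """Merge the two closest clusters repeatedly until none are within cut_distance."""
    # Start with every point in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cut_distance:
            break  # equivalent to cutting the dendrogram at this height
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Made-up 2D measurements: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
groups = single_linkage(points, cut_distance=2.0)
print(groups)
```

This brute-force version recomputes all pairwise cluster distances on every merge, which is fine for a handful of points but far slower than a real implementation.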

Continue reading

MongoDB – State of the R

Naturally there are two reasons why you might need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes along with a bag of goodies making it a tempting choice for a number of use cases:

  • flexible schema-less data structures
  • spatial and textual indexing
  • spatial queries
  • persistence of data
  • easily accessible from other languages and systems

If you would like to learn more about MongoDB, I have good news for you – MongoDB Inc. provides a number of very well made online courses catering to various languages. You can find an overview here.

Continue reading

Twitter’s REST API v1.1 with R (for Linux and Windows)

In this tutorial I am going to describe a straightforward way to make use of Twitter’s REST API v1.1. For that purpose I composed a little package (RTwitterAPI), so that requesting data just needs the API URL, the API parameters and a vector containing the OAuth parameters.

Before you can get started you have to log in to your Twitter account on dev.twitter.com, create an application and generate an “Access Token” for it. So let’s jump right in and fetch the IDs of 10 followers of @hrw (Human Rights Watch). The necessary code is located on GitHub as a package named RTwitterAPI, which may be installed using devtools::install_github().

Continue reading

Reasonable Inheritance of Cluster Identities in Repetitive Clustering

… or Inferring Identity from Observations

Let’s assume the following application:

A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to communicate the location of the plant if they encounter it. Because those hikers use GPS technology ranging from cheap smartphones to high-end GPS devices, and because of weather and other environmental circumstances, the measurements are of varying accuracy. The goal of the conservation organisation is to build up a map locating all found plants with an ID assigned to them. Now every time a new location measurement is entered into the system a clustering is applied to identify related measurements – i.e. those belonging to the same plant.

Continue reading

FIR Filter Design and Digital Signal Processing in R

This article serves the purpose of illustrating that signal processing with R is possible – thanks to the signal package – and of keeping a reference of some of the stuff that I learned in my last edX course. I am by no means an expert on signal processing, so I’d prefer to let the pictures and the code speak for themselves. But to give you the idea – I showcase the creation and application of an FIR band-pass filter (Chebyshev Type 1 in this case) and of an FIR filter created using the Parks-McClellan method with the Remez exchange algorithm. The code snippets are taken from a larger R script which you can find on GitHub. I aim to focus on the essential parts. You’re welcome to share your knowledge and corrections by leaving a comment.

Continue reading

Relation of Word Order and Compression Ratio and Degree of Structure

Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure information contained in data – this time the question popped up in my head of whether, and if so how, permuting a text’s words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.
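The basic experiment can be sketched in a few lines of Python with the standard bz2 module. This is only an illustration under my own assumptions: the compression ratio is defined here as compressed size divided by original size, the sample text is a made-up repetitive string rather than a novel, and the function names are hypothetical.

```python
import bz2
import random


def compression_ratio(text):
    """Compressed size divided by raw size (smaller means more compressible)."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


def shuffled_words(text, seed=0):
    """Return the same words in a random order (seeded for reproducibility)."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Made-up, highly structured sample text; swap in any book you like.
sample = (
    "the quick brown fox jumps over the lazy dog "
    "and then the dog chases the fox "
) * 50

original = compression_ratio(sample)
shuffled = compression_ratio(shuffled_words(sample))
print(f"original order: {original:.4f}")
print(f"shuffled order: {shuffled:.4f}")
```

Shuffling keeps the vocabulary and word frequencies intact, so any change in the ratio comes from the word order alone.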

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …

Continue reading