You all know why MapReduce is fancy – so let’s just jump right in. I like researching data and I like to see results fast – does that mean I enjoy the process of setting up a Hadoop cluster? No, I doubt there is any correlation – neither causal nor merely statistical. The good news is there are already quite a lot of cloud computing providers offering Hadoop clusters on demand! For this article I got my hands on Amazon’s Elastic MapReduce (EMR) service (which is an extension of its EC2 service) that sets up the Hadoop cluster for you. Okay – almost at least. For this article we are going to count 2-grams in (dummy text) data using the stringdist library.
For the previous article I needed a quick way to figure out if two sets of points are linearly separable. But for crying out loud I could not find a simple and efficient implementation for this task. Except for the perceptron and SVM – both are sub-optimal when you just want to test for linear separability. The perceptron is guaranteed to finish off with a happy ending – if feasible – but it can take quite a while. And SVMs are designed for soft-margin classification which means that they might settle for a not separating plane and also they maximize the margin around the separating plane – which is wasted computational effort.
I am excited to announce that this is supposed to be my first article published also on r-bloggers.com :)
The processing of data needs to take dimensionality into account as usual metrics change their behaviour in subtle ways, which impacts the efficiency of algorithms and methods that are based on distances / similarities of data points. This has been tagged the “curse of dimensionality“. Just as well, in some cases high dimensionality can aid us when investigating data – “blessing of dimensionality”. But in general it is, as usual, a good thing to know what’s going on and so let’s have a look at what dimensionality does to data.
The Titanic challenge on Kaggle is about inferring from a number of personal details whether a passenger survived the disaster or did not. I gave two algorithms a try, which are decision trees using R package party and SVMs using R package kernlab. I chose to use party for the decision trees over the more prominent rpart because the authors of party make a very good point why their approach is likely to outperform it and other approaches in terms of generalization.
The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for the purpose of extracting keywords from a document by not just considering a single document but all documents from the corpus. In terms of tf-idf a word is important for a specific document if it shows up relatively often within that document and rarely in other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words which would have to be aggregated for the term frequency, then outer joined and then fed to the formula I was first a bit worried about how R would perform. To my surprise the whole processing from reading the files from disk to the final table of tf-idf-values took about 8 seconds. That’s not bad at all.
This article is about the “Digit Recognizer” challenge on Kaggle. You are provided with two data sets. One for training: consisting of 42’000 labeled pixel vectors and one for the final benchmark: consisting of 28’000 vectors while labels are not known. The vectors are of length 784 (28×28 matrix) with values from 0 to 255 (to be interpreted as gray values) and are supposed to be classified as to what number (from 0 to 9) it represents. The classification is realized using SVMs which I implement with kernlab package in R.
Three representation of the data set
The pre- and post-processing in all cases consists of removing the unused (white – pixel value = 0) frame (rows and columns) of every matrix and finally to scale the computed feature vectors (feature-wise) to mean 0 and standard deviation 1. I gave three representations a try:
(This article is referring to an initial proof-of-concept version of r-big-pivot)
I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:
- a maximum number of one point something million rows per data set (which is crucial because the data needs to be formatted long)
- Excel is slow and often instable when you go beyond 100’000 rows
- cumbersome integration into available data infrastructures
- restricted charting and analysis capabilities
First of all this text is not just about an intuitive perspective on the beta distribution but at least as much about the idea of looking behind a measured empirical probability and thinking of it as a product of chance itself. Credits go to David Robinson for approaching the subject from a baseball angle and to John D. Cook for establishing the connection to Bayesian statistics. In this article I want to add a simulation which stochastically prooves the interpretation.
Who would’ve thought that parallel processing in R is THAT simple? Thanks to a collection of packages having a task use all available cores is a cinch. I call an Intel i7-3632QM my own – which means 4 physical cores each providing 2 virtual cores running at something around 3 GHz. So let’s have a look at how this feat is accomplished.
The method of parallelization applied here works by starting a specified number of regular R processes and have those run on the available cores. The launch, stop of the those processes and the assignment of tasks is controlled from the original R process. The assignment of those processes to the cores is controlled by the OS. Because this method can be extended to cores beyond one CPU or computer the packages use the acronym SNOW for “Simple Network Of Workstations”.
For the visualization of votings in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R would offer me for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package answering to the name of “stringdist” maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three , but a variety of configurable algorithms for that purpose. But I have no idea what is for example the effective difference between a Jaccard distance and a cosine distance. So I played around a bit with them and finally came up with the idea of something like a slope graph showing the distances for alternations of one string – in this case “Cosmo Kramer” – just to get started and an idea about what’s going on and how different algorithms are affected by certain alternations.