“Learning From Data” by Yaser Abu-Mostafa (Caltech) on edX.org

To deepen my knowledge of machine learning, I decided last year to attend “Learning From Data” on edX. This online course was designed by Yaser Abu-Mostafa, a renowned expert on the subject and professor of Electrical Engineering and Computer Science at the California Institute of Technology (Caltech). I can say without the slightest hesitation that this course was a wonderful intellectual experience. Prof. Abu-Mostafa designed the course so skilfully that it was as much a joy to attend as it was challenging. That is far from a given, especially since the syllabus took a path through quite theoretical terrain.

Comparison of String Distance Algorithms

For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of the same name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package by the name of “stringdist”, maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the practical difference between a Jaccard distance and a cosine distance is. So I played around with them a bit and finally came up with the idea of a kind of slope graph showing the distances for alterations of one string (in this case “Cosmo Kramer”), just to get started and to get an idea of what is going on and how the different algorithms are affected by certain alterations.
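To make the Jaccard versus cosine question a bit more concrete, here is a minimal sketch in plain Python (not R, and not the stringdist implementation). It assumes both distances are computed on q-grams, with Jaccard comparing the sets of q-grams and cosine comparing the q-gram count vectors. The function names and the default `q=2` are my own choices for illustration.

```python
from collections import Counter
from math import sqrt


def qgrams(s, q=2):
    """Count the overlapping substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def jaccard_distance(a, b, q=2):
    """1 - |A & B| / |A | B| over the sets of q-grams (counts are ignored)."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not set_a and not set_b:
        return 0.0  # assumption: two strings without q-grams count as identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def cosine_distance(a, b, q=2):
    """1 - cosine of the angle between the q-gram count vectors."""
    x, y = qgrams(a, q), qgrams(b, q)
    if not x and not y:
        return 0.0  # same assumption as above
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    if norm == 0:
        return 1.0  # exactly one string has no q-grams
    dot = sum(count * y[gram] for gram, count in x.items())
    return 1.0 - dot / norm


if __name__ == "__main__":
    original = "Cosmo Kramer"
    # Hypothetical alterations, chosen only for illustration.
    for altered in ["Cosmo Kramre", "Kramer Cosmo", "cosmo kramer", "Cosmo K."]:
        print(f"{altered!r:16} jaccard={jaccard_distance(original, altered):.3f} "
              f"cosine={cosine_distance(original, altered):.3f}")
```

Under these assumptions the two measures differ mainly in how they treat repeated q-grams: Jaccard only asks whether a q-gram occurs at all, while cosine weighs how often it occurs.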

Illustration of principal component analysis (PCA)

Why (a) PCA?

3D model of ellipsoid and its three principal components

A principal component analysis is a way to reduce the dimensionality of a data set consisting of numeric vectors. After the reduction it is possible to visualize the data set in three or fewer dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.

Think of a very flat cuboid (e.g. 1 x 1 x 0.05) in three-dimensional space. What you see of this cuboid when looking at it from a specific angle, ignoring perspective cues, is basically a two-dimensional projection of it. From one angle it looks like a rhombus; looking straight at one of its thin edges, all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you get when looking at it perpendicular to its main surface.
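The cuboid intuition can be checked numerically. The following is a minimal sketch in plain Python, not a full PCA: it fills a 1 x 1 x 0.05 cuboid with a regular grid of points and measures how much of the total variance survives an orthographic projection along a given viewing direction. The function names, the grid resolution and the three viewing directions are assumptions made for illustration.

```python
from itertools import product
from math import sqrt


def cuboid_points(size=(1.0, 1.0, 0.05), steps=5):
    """Regular grid of points filling an axis-aligned cuboid centred at the origin."""
    axes = [[s * (i / (steps - 1) - 0.5) for i in range(steps)] for s in size]
    return list(product(*axes))


def variance_along(points, direction):
    """Variance of the points along the given direction (normalised here)."""
    norm = sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    proj = [sum(p[i] * d[i] for i in range(3)) for p in points]
    mean = sum(proj) / len(proj)
    return sum((t - mean) ** 2 for t in proj) / len(proj)


def visible_share(points, view_dir):
    """Share of the total variance kept when projecting along view_dir.

    Looking along view_dir flattens exactly that direction, so the view
    keeps the total variance minus the variance along view_dir.
    """
    total = sum(variance_along(points, e) for e in ((1, 0, 0), (0, 1, 0), (0, 0, 1)))
    return 1.0 - variance_along(points, view_dir) / total


if __name__ == "__main__":
    pts = cuboid_points()
    views = {
        "perpendicular to main surface": (0, 0, 1),
        "oblique": (1, 1, 1),
        "straight at an edge": (1, 0, 0),
    }
    for name, direction in views.items():
        print(f"{name:30} keeps {visible_share(pts, direction):.1%} of the variance")
```

In this sketch the view perpendicular to the main surface keeps almost all of the variance, while the edge-on view discards roughly half of it. Roughly speaking, a PCA finds that best viewing direction automatically: the component with the least variance is the one to look along.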