# “Learning From Data” by Yaser Abu-Mostafa (Caltech) on edX.org

To deepen my knowledge of Machine Learning I decided last year to attend “Learning From Data” on edX. This online course was designed by Yaser Abu-Mostafa, a renowned expert on the subject and professor of Electrical Engineering and Computer Science at the California Institute of Technology (Caltech). I can say without the slightest hesitation that this course was a wonderful intellectual experience. Prof. Abu-Mostafa conceived the course so skilfully that it was as much a joy to attend as it was challenging. And that is far from a given, especially since the syllabus took a path through quite theoretical terrain.

# Comparison of String Distance Algorithms

For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions of the same name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R would offer me for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package answering to the name of “stringdist”, maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the practical difference between a Jaccard distance and a cosine distance is. So I played around a bit with them and finally came up with the idea of something like a slope graph showing the distances for alterations of one string, in this case “Cosmo Kramer”, just to get started and to get an idea of what’s going on and how the different algorithms are affected by certain alterations.
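To make the difference between these measures a bit more tangible, here is a minimal from-scratch sketch in Python. It is not the code of “stringdist”; the function names, the choice of character bigrams (`q=2`) and the test variants of the name are my own assumptions for illustration. Levenshtein counts single-character edits, Jaccard compares the sets of q-grams, and cosine compares the q-gram count vectors.

```python
from collections import Counter
from math import sqrt


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """Count all overlapping substrings of length q (q=2 is an assumption)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def jaccard_distance(a, b, q=2):
    """1 - |shared q-grams| / |all q-grams|, ignoring how often each occurs."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not set_a and not set_b:
        return 0.0
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def cosine_distance(a, b, q=2):
    """1 - cosine of the angle between the two q-gram count vectors."""
    count_a, count_b = qgrams(a, q), qgrams(b, q)
    dot = sum(count_a[g] * count_b[g] for g in count_a)
    norm = (sqrt(sum(v * v for v in count_a.values()))
            * sqrt(sum(v * v for v in count_b.values())))
    if norm == 0:
        return 0.0 if count_a == count_b else 1.0
    return 1 - dot / norm


# Hypothetical alterations of the reference string, chosen for illustration.
reference = "Cosmo Kramer"
for variant in ["Kosmo Kramer", "Cosmo Kramr", "Kramer Cosmo"]:
    print(variant,
          levenshtein(reference, variant),
          round(jaccard_distance(reference, variant), 3),
          round(cosine_distance(reference, variant), 3))
```

Swapping the two words is a useful case to look at: it costs many edits under Levenshtein, while the q-gram based measures barely notice because most bigrams survive the swap.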

# Why (a) PCA?

3D model of ellipsoid and its three principal components

A principal component analysis is a way to reduce the dimensionality of a data set consisting of numeric vectors. Once reduced, the data set can be visualized in three or fewer dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.

Think of a very flat cuboid (e.g. 1 x 1 x 0.05) in three-dimensional space. What you see of this cuboid when looking at it from a specific angle, ignoring perspective cues, is basically a two-dimensional projection of it. From one angle it looks like a rhombus; looking straight at an edge, all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you see when looking at it perpendicular to the main surface.
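The cuboid picture can be checked numerically. The following is only a rough sketch under my own assumptions: the sample size, the 30 degree tilt, the helper names and the shifted power iteration are all chosen for illustration, and it uses nothing but the Python standard library. It samples points from a tilted 1 x 1 x 0.05 slab, estimates the covariance matrix, and looks for the direction of least variance, which should be the direction you look along to see the main surface head-on.

```python
import math
import random


def covariance(points):
    """Sample covariance matrix (3x3) of a list of 3D points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def least_variance_direction(cov, iterations=200):
    """Power iteration on (trace * I - cov), whose dominant eigenvector
    is the eigenvector of cov with the smallest eigenvalue."""
    trace = sum(cov[i][i] for i in range(3))
    shifted = [[(trace if i == j else 0.0) - cov[i][j] for j in range(3)]
               for i in range(3)]
    v = normalize([1.0, 1.0, 1.0])
    for _ in range(iterations):
        v = normalize(matvec(shifted, v))
    return v


# Sample a flat 1 x 1 x 0.05 cuboid, tilted by 30 degrees about the x axis
# (seed, sample size and angle are arbitrary choices for this illustration).
rng = random.Random(0)
angle = math.radians(30)
points = []
for _ in range(2000):
    x, y, z = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), rng.uniform(-0.025, 0.025)
    points.append((x,
                   y * math.cos(angle) - z * math.sin(angle),
                   y * math.sin(angle) + z * math.cos(angle)))

cov = covariance(points)
view = least_variance_direction(cov)

# Normal of the main surface after the tilt, for comparison.
normal = [0.0, -math.sin(angle), math.cos(angle)]
alignment = abs(sum(a * b for a, b in zip(view, normal)))

# Share of the total variance that survives when projecting along `view`.
total = sum(cov[i][i] for i in range(3))
lost = sum(view[i] * matvec(cov, view)[i] for i in range(3))
retained = 1 - lost / total

print("viewing direction:", [round(c, 3) for c in view])
print("alignment with surface normal:", round(alignment, 4))
print("variance retained in the 2D view:", round(retained, 4))
```

The direction of least variance lines up with the normal of the main surface, so dropping it loses almost nothing. In practice one would use a linear algebra library for the eigendecomposition; the power iteration here only keeps the sketch dependency-free.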