Comparison of String Distance Algorithms

For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. Unfortunately these are studded with typos, so I had to deal with several versions of the same name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. For larger projects, though, this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package by the name of “stringdist”, maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the effective difference between a Jaccard distance and a cosine distance is. So I played around with them a bit and finally came up with the idea of a slope graph showing the distances for alterations of one string (in this case “Cosmo Kramer”), just to get started and get an idea of what is going on and how the different algorithms are affected by certain alterations.
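To get a feeling for the difference between an edit distance and the q-gram based distances, here is a minimal sketch in plain Python. It is not the stringdist package and does not reproduce its API. The function names, the choice of q = 2 and the handling of empty strings are my own assumptions, and the definitions used are the usual textbook ones (Jaccard on sets of q-grams, cosine on q-gram count vectors).

```python
from collections import Counter
from math import sqrt


def qgrams(s, q=2):
    """Count all overlapping substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def jaccard_distance(a, b, q=2):
    """1 - |shared q-grams| / |all q-grams|, ignoring how often they occur."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not set_a and not set_b:
        return 0.0  # assumption: two strings without q-grams count as equal
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def cosine_distance(a, b, q=2):
    """1 - cosine of the angle between the two q-gram count vectors."""
    x, y = qgrams(a, q), qgrams(b, q)
    dot = sum(x[g] * y[g] for g in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    if norm == 0:
        return 0.0 if x == y else 1.0  # assumption for strings shorter than q
    return 1 - dot / norm


if __name__ == "__main__":
    original = "Cosmo Kramer"
    for variant in ["Cosmo Kramer", "Cosmo Kramr", "Kramer Cosmo"]:
        print(variant,
              levenshtein(original, variant),
              round(jaccard_distance(original, variant), 3),
              round(cosine_distance(original, variant), 3))
```

Swapping the two words is a good example of where the approaches diverge: the edit distance has to rewrite most of the string, while the q-gram distances stay fairly small because most character pairs survive the swap.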

Illustration of principal component analysis (PCA)

Why (a) PCA?

3D model of ellipsoid and its three principal components

A principal component analysis is a way to reduce the dimensionality of a data set consisting of numeric vectors. This makes it possible to visualize the data set in three or fewer dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.

Think of a very flat cuboid with a square face (e.g. 1 x 1 x 0.05) in a three-dimensional space. What you see of this cuboid when looking at it from a specific angle, ignoring perspective cues, is basically a two-dimensional projection of it. From one angle it looks like a rhombus; looking straight at an edge, all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you see when looking at it perpendicular to the main surface.
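The cuboid argument can be checked with a few lines of plain Python. This is only a sketch under my own assumptions: it is not a full PCA, it simply compares three hand-picked viewing directions and measures how much of the total variance of the points is still visible in each projection. The sample size, the random seed and the function names are made up for illustration. A PCA would find the best viewing direction automatically, namely the direction with the least variance.

```python
import random
from math import sqrt

random.seed(0)

# Points sampled uniformly from a flat 1 x 1 x 0.05 cuboid (sizes from the text).
points = [(random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 0.05))
          for _ in range(2000)]


def variance_along(points, direction):
    """Sample variance of the points' coordinates along the given direction."""
    length = sqrt(sum(c * c for c in direction))
    unit = [c / length for c in direction]
    coords = [sum(p[i] * unit[i] for i in range(3)) for p in points]
    mean = sum(coords) / len(coords)
    return sum((c - mean) ** 2 for c in coords) / (len(coords) - 1)


def retained_fraction(points, view_direction):
    """Share of the total variance that survives when looking along view_direction.

    Looking along a direction flattens the data onto the plane perpendicular
    to it, so exactly the variance along that direction is lost.
    """
    axes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    total = sum(variance_along(points, axis) for axis in axes)
    return 1 - variance_along(points, view_direction) / total


views = {
    "perpendicular to the main surface": (0, 0, 1),
    "straight at an edge": (1, 0, 0),
    "from an oblique angle": (1, 1, 1),
}

if __name__ == "__main__":
    for name, direction in views.items():
        print(f"{name}: {retained_fraction(points, direction):.3f}")
```

Looking along the thin axis keeps almost all of the spread, looking straight at an edge throws away roughly half of it, and the oblique view lands somewhere in between.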

Comparison of word frequency in English literature

The scatterplot shows the word frequencies for two sets of texts. Click on a circle and the words belonging to it are listed on the left-hand side. The app is built on d3.js (my second small project using it) and I am planning to write an introductory article on it soon. Apart from a few issues it is fun to work with d3.

Correlations of quotes for 30 German stocks

For a new data article I thought it would be interesting to see whether a tabular visualization of stock quote correlations might unveil interesting patterns. To investigate the correlations I came up with a little JavaScript web application that lets you zoom into a scatter plot matrix, a density plot matrix and a correlogram (next to each other) to have a closer look at an individual plot for two stocks.

The top row holds the overview maps and the bottom row the respective magnified views. On the right-hand side you will find explanations on how to use this tool and further explanatory links. Placing the red and green marker lines so that they cross at a field while at a high zoom level allows you to quickly locate the plot you are looking for.
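For readers who want to see what sits behind a single cell of such a correlogram, here is a small sketch of a pairwise Pearson correlation in plain Python. The ticker names and the numbers are made-up toy values, not real quotes and not the data used in the application, and the helper name pearson is my own choice.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy series, invented purely for illustration.
quotes = {
    "STOCK_A": [10.0, 10.5, 11.0, 10.8, 11.5],
    "STOCK_B": [20.0, 21.0, 22.0, 21.6, 23.0],  # moves exactly like STOCK_A
    "STOCK_C": [30.0, 29.0, 28.5, 29.2, 27.9],  # tends to move the other way
}

if __name__ == "__main__":
    # One value per pair of stocks, i.e. one cell of the correlogram.
    for a, b in combinations(quotes, 2):
        print(f"{a} vs {b}: {pearson(quotes[a], quotes[b]):+.2f}")
```

With 30 stocks this gives 435 distinct pairs, which is why a zoomable matrix is handy for finding the one plot you are interested in.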
