For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. Unfortunately these are studded with typos, so I had to deal with several spellings of the same name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. For larger projects, though, this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol’ Levenshtein distance and came across a rather new package by the name of “stringdist”, maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the practical difference between a Jaccard distance and a cosine distance is. So I played around with them a bit and finally came up with something like a slope graph showing the distances for alterations of one string – in this case “Cosmo Kramer” – just to get started and to get an idea of what’s going on and how the different algorithms are affected by certain alterations.
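To make the difference between these distances a bit more tangible, here is a minimal sketch in plain Python (not R, and not the stringdist implementation). It assumes that the Jaccard and cosine distances are computed on overlapping character bigrams; the function names, the choice of q = 2 and the list of variants are illustrative assumptions, so the numbers need not match what stringdist reports.

```python
from collections import Counter
from math import sqrt


def qgrams(s, q=2):
    """Count all overlapping substrings of length q (assumed default: bigrams)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def jaccard_distance(a, b, q=2):
    """1 - shared q-grams / all q-grams, ignoring how often each one occurs."""
    x, y = set(qgrams(a, q)), set(qgrams(b, q))
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


def cosine_distance(a, b, q=2):
    """1 - cosine of the angle between the two q-gram count vectors."""
    x, y = qgrams(a, q), qgrams(b, q)
    dot = sum(count * y[gram] for gram, count in x.items())
    norm = sqrt(sum(c * c for c in x.values())) * sqrt(sum(c * c for c in y.values()))
    if norm == 0:
        # At least one string is shorter than q and has no q-grams at all.
        return 0.0 if not x and not y else 1.0
    return 1.0 - dot / norm


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


original = "Cosmo Kramer"
# Hypothetical alterations: identical, one typo, swapped letters, swapped words.
for variant in ["Cosmo Kramer", "Kosmo Kramer", "Cosmo Kramre", "Kramer Cosmo"]:
    print(f"{variant!r:16} lev={levenshtein(original, variant):2d} "
          f"jaccard={jaccard_distance(original, variant):.3f} "
          f"cosine={cosine_distance(original, variant):.3f}")
```

In this sketch, swapping the two words costs many single-character edits, while most bigrams survive the swap, so the two bigram-based distances stay comparatively small. The Jaccard variant only looks at which bigrams occur, whereas the cosine variant also takes their counts into account.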
Tag Archives: scatter plot
Illustration of principal component analysis (PCA)
Why (a) PCA?
A principal component analysis is a way to reduce the dimensionality of a data set consisting of numeric vectors. This makes it possible to visualize the data set in three or fewer dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.
Think of a very flat cuboid (e.g. 1 x 1 x 0.05) in three-dimensional space. What you see of this cuboid when looking at it from a specific angle, ignoring perspective cues, is basically a two-dimensional projection of it. From one angle it looks like a rhombus; looking straight at an edge, all you see is a line-like shape. So the representation offering the most information about this cuboid is obviously the one you see when looking at it perpendicular to its main surface.
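The flat-cuboid picture can be tried out numerically. The following is a minimal sketch in plain Python under stated assumptions: the 2000 random points, the tilt angles and all function names are made up for illustration, and the small Jacobi eigen-solver only stands in for what a library routine (for example `prcomp` in R) would normally do.

```python
import math
import random


def rotate(point, angle_z, angle_x):
    """Rotate a 3-D point about the z axis, then about the x axis."""
    x, y, z = point
    cz, sz = math.cos(angle_z), math.sin(angle_z)
    x, y = cz * x - sz * y, sz * x + cz * y
    cx, sx = math.cos(angle_x), math.sin(angle_x)
    y, z = cx * y - sx * z, sx * y + cx * z
    return [x, y, z]


def column_means(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def covariance(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d, mean = len(points), len(points[0]), column_means(points)
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Eigen-decomposition of a small symmetric matrix via Jacobi rotations.

    Returns (values, vectors) sorted by decreasing eigenvalue; vectors[i] is
    the unit eigenvector that belongs to values[i].
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that turns the entry a[p][q] into zero.
                if a[q][q] == a[p][p]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    return [a[i][i] for i in order], [[v[k][i] for k in range(n)] for i in order]


def project(points, axes):
    """Coordinates of the centered points along the given unit axes."""
    mean = column_means(points)
    return [[sum((p[k] - mean[k]) * ax[k] for k in range(len(p))) for ax in axes]
            for p in points]


random.seed(1)
# Points filling a flat 1 x 1 x 0.05 cuboid, tilted so that no coordinate
# axis lines up with its edges (the angles 0.7 and 0.4 are arbitrary).
points = [rotate([random.uniform(0, 1), random.uniform(0, 1),
                  random.uniform(0, 0.05)], 0.7, 0.4) for _ in range(2000)]

values, vectors = jacobi_eigen(covariance(points))
explained = [val / sum(values) for val in values]
flat = project(points, vectors[:2])  # the view perpendicular to the main surface

print("share of variance per component:", [round(e, 4) for e in explained])
print("direction with the least variance:", [round(x, 3) for x in vectors[2]])
```

The first two components carry almost all of the variance, and the direction with the least variance is the one along the cuboid's thickness. Projecting onto the first two components therefore corresponds to looking at the cuboid perpendicular to its main surface.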
Comparison of word frequency in English literature
The scatter plot shows how frequently words occur in two sets of texts. Click on a circle and the words it represents are listed on the left-hand side. The app is built on d3.js (my second small project using it) and I am planning to write an introductory article on it soon. Apart from a few issues, d3 is fun to work with.
ggplot2 basics in action
ggplot2 is a plotting package for R: very flexible, ably designed by Hadley Wickham following a concept called “grammar of graphics”, and pretty awesome anyway – so let’s jump right in with some simple examples that should help you get going.
R code for article on animated scatter plots
This is the R code I used to create the PNGs, which are afterwards put together into a clip with ffmpeg (check ’em out). Here I only comment on the programmatic aspects. In a separate article I will write about how ggplot2 is used and how ffmpeg turns the PNGs into a clip.
Commented R protocol for the stock quotes article
This R protocol shows the steps I took to create the plots for my article on stock quote correlations. You might also be interested in what the difference is between a long and a wide table and how to set up a data source in Windows 7 for R and MySQL.
Correlations of quotes for 30 German stocks
For a new data article I thought it would be interesting to see whether a tabular visualization of stock quote correlations might reveal interesting patterns. To investigate the correlations I came up with a little JavaScript web application that lets you zoom into a scatter plot matrix, a density plot matrix and a correlogram (next to each other) to have a closer look at an individual plot for two stocks.
The top row holds the maps and the bottom row the respective magnified areas. On the right-hand side you will find explanations of how to use this tool and further explanatory links. Placing the red and green marker lines on a field at a high zoom level lets you quickly locate the plot you are looking for.