For the previous article I needed a quick way to figure out if two sets of points are linearly separable. But for crying out loud I could not find a simple and efficient implementation for this task. Except for the perceptron and SVM – both are sub-optimal when you just want to test for linear separability. The perceptron is guaranteed to finish off with a happy ending – if feasible – but it can take quite a while. And SVMs are designed for soft-margin classification which means that they might settle for a not separating plane and also they maximize the margin around the separating plane – which is wasted computational effort.
The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for the purpose of extracting keywords from a document by not just considering a single document but all documents from the corpus. In terms of tf-idf a word is important for a specific document if it shows up relatively often within that document and rarely in other documents of the corpus. I used tf-idf for extracting keywords from protocols of sessions of the German Bundestag and am quite happy with the results. Given that I was dealing with (so far) 18 documents, together containing more than one million words which would have to be aggregated for the term frequency, then outer joined and then fed to the formula I was first a bit worried about how R would perform. To my surprise the whole processing from reading the files from disk to the final table of tf-idf-values took about 8 seconds. That’s not bad at all.
Why (a) PCA?
A principal component analysis is a way to reduce dimensionality of a data set consisting of numeric vectors to a lower dimensionality. Then it is possible to visualize the data set in three or less dimensions. Have a look at this use case. I’ll try to explain the motivation using a simple example.
Think of a very flat square (e.g. 1x1x.05) in a three dimensional space. What you see of this cuboid when looking at it from a specific angle while assuming perspectivic visual indicators away is basically a 2 dimensional projection of it. From one angle it looks like a rhombus, looking straight from the edge side all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you see when looking at it perpendicular to the main surface. Continue reading