Reasonable Inheritance of Cluster Identities in Repetitive Clustering

… or Inferring Identity from Observations

cluster-identityLet’s assume the following application:

A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to communicate the location of the plant if they encounter it. Due to those hikers using GPS technology ranging from cheap smartphones to highend GPS devices and weather as well as environmental circumstances the measurements are of varying accuracy. The goal of the conservation organisation is to build up a map locating all found plants with an ID assigned to them. Now every time a new location measurement is entered into the system a clustering is applied to identify related measurements – i.e. belonging to the same plant.

Continue reading

Interactive Heatmaps with Google Maps API v3

indiaThanks to the Google Maps API it is pretty easy to code up a small JavaScript to turn a bunch of points into an interactively explorable and lovely looking heatmap. You’re welcome to give it a try on heatmap.joyofdata.de where you can load a CSV to display its contained points. The CSV is supposed to be semicolon delimited and contain at least two columns “lat” and “lon” for the geographical location and an optional third numerical column “weight”. The order does not matter. And of course the parsing is done with Papa Parse – what else!

Continue reading

Parsing a Local CSV File with JavaScript and Papa Parse

In this tutorial I am going to show you how to read a local CSV file using JavaScript and parse it with the Papa Parse library.

For whatever reason JavaScript developed into an awesome tool for breathing life into data. But before you can visualize data, you have to read and parse it. JSON is with JavaScript a cinch anyway – and CSVs are now a cinch, too, thanks to Papa Parse – yes, I kind of don’t like the name as well, but let’s be pragmatic. Thanks to HTML5′s rather new File API it became possible to read locally stored files with JavaScript from a browser. Due to security concerns this file access is bound to files that have been opened manually by the user though. That means it is not (yet) possible to simply read arbitrary files from a folder.

Continue reading

Free and Certified MongoDB Online Courses

MongoDB_University_LogoIn case you are interested in learning about MongoDB or generally curious about non-relational approaches to storage of data then my recommendation for you is to check out the online courses offered by MongoDB Incorporation. I promise you won’t be disappointed. MongoDB Inc’s educational department – MongoDB University – offers five courses for developers and dev ops:

Continue reading

Transforming an XML Document into a CSV using XMLStarlet

In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done - but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.

Continue reading

How to Import a CSV into MongoDB using AWK

In case the desired JSON objects structure is just a set of simple attributes this can be achieved by using mongoimport directly. But in case some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and in the course of that teach you the basics of AWK.

Continue reading

Talking to Twitter’s REST API v1.1 with R

twitterIn this text I am going to describe a very straightforward way of how to make use of Twitter’s REST API v1.1. I put some code together for that purpose, so that requesting data just needs the API URL, the API parameters and a vector containing the OAuth parameters.

Before you can get started you have to login to your Twitter account on dev.twitter.comcreate an application and generate an “Access Token” for it. So let’s jump right in and fetch IDs of 10 followers of @hrw (Human Rights Watch). The necessary code is located on GitHub – download all three files and then you can just edit the example below as suggested.

Continue reading

Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku

data-insightsWhat I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.

In case you are new to this I recommend reading my articles on the following topics involved here:

Continue reading

Using the Dimension Lookup/Update Step in Pentaho Kettle

dim_lookup_update_iconIn a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from the gender (“male”, “female”, “intersex”) over a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slow Changing Dimensions (SCD – “slow” because they are expected to change only infrequently):

Continue reading

A StackOverflow for Business Intelligence – or what BI Can Learn from PHP!

A gamified, high-speed, high-quality Q&A-site for topics revolving around making professionally sense of a company’s data  - a.k.a. “Business Intelligence” – wouldn’t that be awesome? And let’s face it – asking a question on how to configure a step in Pentaho Kettle does not fit any StackExchange site’s realm yet. Usually this type of question is asked on StackOverflow but the feedback-latency is quite high to say the least. Or let’s take a question on how to design a KPI – this one usually ends up on CrossValidated but will often be greeted with disdain given the statistical triviality - plus most people in statistics are not working with BI and won’t be open for the subject’s specific intricacies. And finally you are wondering about how to configure a MySQL RDMS for a data warehouse – where to ask that? On dba.SE … I guess. And suddenly you get weird issues with TomCat which you need for Pentaho BI Server – hmmm, SuperUser? Or ServerFault?

It’s just too distributed!

Continue reading