Agglomerative hierarchical clustering is a simple, intuitive and well-understood method for clustering data points. I used it with good results in a project to estimate the true geographical position of objects based on noisy measurements. In this tutorial I would like to describe the basics of the method, how to implement it in R with hclust, and some ideas on how to decide where to cut the tree. This was also a great opportunity for composing another Shiny/D3.js app (GitHub, shinyapps.io) – something I had wanted to do for a while. At the end of the text I write a bit about what I learned in that regard.
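If you have never used hclust before, the following minimal sketch (on made-up 2-D points) shows the basic workflow – computing a distance matrix, building the tree and cutting it with cutree:

```r
# Minimal sketch on made-up 2-D points: two groups of 10 points each.
set.seed(1)
pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 5), ncol = 2)
)

d  <- dist(pts)                      # pairwise Euclidean distances
hc <- hclust(d, method = "average")  # agglomerative clustering

plot(hc)                        # dendrogram - this is the tree we need to cut
clusters <- cutree(hc, k = 2)   # or cutree(hc, h = ...) to cut at a given height
table(clusters)
```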
There are essentially two reasons why you would want to access MongoDB from R:
- MongoDB is already used for whatever reason and you want to analyze the data stored therein
- You decide you want to store your data in MongoDB instead of using native R structures like data.table or data.frame
In-memory data storage like data.table is very fast, especially for numerical data, provided the data actually fits into your RAM. But even then MongoDB comes along with a bag of goodies that make it a tempting choice for a number of use cases (a short connection sketch in R follows the list below):
- Flexible schema-less data structures
- Spatial and textual indexing
- Spatial queries
- Persistence of data
- Easily accessible from other languages and systems
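To give you an idea of how little ceremony is involved, here is a minimal connection sketch. It uses the mongolite package, which is an assumption of this sketch (the articles themselves may rely on a different driver such as rmongodb), and it assumes a MongoDB instance running on localhost:

```r
# Sketch: store and query an R data frame in MongoDB via mongolite.
library(mongolite)

m <- mongo(collection = "plants", db = "demo", url = "mongodb://localhost")

# persist an R data frame as documents
m$insert(data.frame(id  = 1:3,
                    lon = c(8.5, 8.6, 8.7),
                    lat = c(47.3, 47.4, 47.5)))

# query it back - schema-less, so other documents may carry extra fields
m$find('{"lon": {"$gt": 8.55}}')
```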
In case you would like to learn more about MongoDB, I have good news for you – MongoDB Inc. provides a number of very well-made online courses catering to various languages. You can find an overview here.
In this tutorial I am going to describe a straightforward way to make use of Twitter’s REST API v1.1. For that purpose I composed a little package (RTwitterAPI), so that requesting data only needs the API URL, the API parameters and a vector containing the OAuth parameters.
Before you can get started you have to log in to your Twitter account on dev.twitter.com, create an application and generate an “Access Token” for it. So let’s jump right in and fetch the IDs of 10 followers of @hrw (Human Rights Watch). The necessary code is located on GitHub as a package named RTwitterAPI, which may be installed using devtools::install_github().
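Just to illustrate the shape of such a request, here is a rough sketch. The repository path and the function name twitter_api_call() are hypothetical – the excerpt above only states that a call needs the API URL, the API parameters and a vector of OAuth parameters:

```r
# devtools::install_github("<github-user>/RTwitterAPI")  # repo path assumed
library(RTwitterAPI)

# OAuth credentials from your app on dev.twitter.com (names for illustration)
oauth <- c(
  consumer_key       = "...",
  consumer_secret    = "...",
  oauth_token        = "...",   # the generated "Access Token"
  oauth_token_secret = "..."
)

url    <- "https://api.twitter.com/1.1/followers/ids.json"
params <- c(screen_name = "hrw", count = "10")

followers <- twitter_api_call(url, params, oauth)  # hypothetical function/signature
```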
… or Inferring Identity from Observations
A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to report the location of the plant whenever they encounter it. Because the hikers use GPS technology ranging from cheap smartphones to high-end GPS devices, and because weather and environmental conditions vary, the measurements are of varying accuracy. The goal of the conservation organisation is to build a map locating all found plants, each with an ID assigned to it. Every time a new location measurement is entered into the system, a clustering is applied to identify related measurements – i.e. those belonging to the same plant.
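To make that idea concrete, here is a hedged sketch of how such measurements could be grouped with hclust: observations within roughly 100 meters of each other get the same plant ID. The coordinates, the cutoff and the use of the geosphere package are assumptions of this sketch, not details from the project itself.

```r
library(geosphere)

# five made-up measurements: three near one spot, two near another
obs <- data.frame(
  lon = c(8.5412, 8.5413, 8.5409, 8.6021, 8.6019),
  lat = c(47.3769, 47.3770, 47.3768, 47.3911, 47.3910)
)

d  <- as.dist(distm(cbind(obs$lon, obs$lat)))  # pairwise distances in meters
hc <- hclust(d, method = "single")             # single linkage chains nearby fixes
obs$plant_id <- cutree(hc, h = 100)            # cut the tree at 100 m

obs
```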
In case you are interested in learning about MongoDB or generally curious about non-relational approaches to data storage, my recommendation is to check out the online courses offered by MongoDB Inc. I promise you won’t be disappointed. MongoDB Inc.’s educational department – MongoDB University – offers five courses for developers and dev ops:
- MongoDB for Python Devs (next session Sep 9 2014)
- MongoDB for Java Devs (next session Aug 5 2014)
- MongoDB for node.js Devs (next session Aug 12 2014)
- MongoDB for DBAs (next session Jul 15 2014)
- MongoDB Advanced Deployment and Operations (next session Jul 15 2014)
In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done – but most are more tedious than necessary (like writing a custom-made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be a cinch. Let’s look at a number of typical XML data configurations and turn them into a flat CSV structure.
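As a rough illustration of what “flattening” means, here is the XPath-to-CSV idea sketched in R with the xml2 package rather than XMLStarlet – a deliberate swap of tools, and the sample document and node names are made up:

```r
library(xml2)

doc <- read_xml('
<plants>
  <plant id="1"><lon>8.5412</lon><lat>47.3769</lat></plant>
  <plant id="2"><lon>8.6021</lon><lat>47.3911</lat></plant>
</plants>')

nodes <- xml_find_all(doc, "//plant")   # select the repeating elements via XPath

flat <- data.frame(
  id  = xml_attr(nodes, "id"),
  lon = xml_text(xml_find_first(nodes, "lon")),
  lat = xml_text(xml_find_first(nodes, "lat"))
)

write.csv(flat, "plants.csv", row.names = FALSE)
```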
In case the desired JSON object structure is just a set of simple attributes, this can be achieved by using mongoimport directly. But in case some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and, in the course of that, teach you the basics of AWK.
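As a rough illustration of the target structure – the tutorial itself does the transformation with AWK – the following R/jsonlite sketch turns made-up CSV rows into newline-delimited GeoJSON-style documents that mongoimport can ingest:

```r
library(jsonlite)

csv <- data.frame(name = c("plant A", "plant B"),
                  lon  = c(8.5412, 8.6021),
                  lat  = c(47.3769, 47.3911))

# one document per row, with lon/lat combined into a GeoJSON Point sub-document
docs <- apply(csv, 1, function(row) {
  toJSON(list(
    name = row[["name"]],
    loc  = list(type = "Point",
                coordinates = as.numeric(c(row[["lon"]], row[["lat"]])))
  ), auto_unbox = TRUE)
})

writeLines(docs, "plants.json")  # then: mongoimport --collection plants < plants.json
```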
What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Pentaho Kettle/PDI, and then how to represent that fact table with a Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible alternatives involving d3.js and Roland Bouman‘s xmla4js.
In case you are new to this, I recommend reading my articles on the topics involved:
- Getting Started With Pentaho BI Server 5, Mondrian and Saiku
- Setting Up a Time Dimension Table in MySQL
- Using the Dimension Lookup/Update Step in Pentaho Kettle
- Google Analytics API with Pentaho Kettle