How to Import a CSV into MongoDB using AWK

If the desired JSON objects are just flat sets of simple attributes, mongoimport can load the CSV directly. But if some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and, along the way, teach you the basics of AWK.
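To make the target structure concrete, here is a minimal sketch of the kind of transformation meant above, written in Python rather than AWK. It is not the tutorial's code: the column names `name`, `lon` and `lat`, the sample row and the function name `csv_to_geojson` are all assumptions made up for illustration.

```python
import csv
import io
import json


def csv_to_geojson(csv_text):
    """Turn CSV rows into GeoJSON Feature dicts.

    Assumes (hypothetically) that the CSV has the columns
    name, lon and lat; adjust to the real header.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append({
            "type": "Feature",
            # Two flat CSV fields are combined into one nested array.
            "geometry": {
                "type": "Point",
                "coordinates": [float(row["lon"]), float(row["lat"])],
            },
            "properties": {"name": row["name"]},
        })
    return features


# Hypothetical sample input with a single data row.
sample = "name,lon,lat\nMainz,8.2473,49.9929\n"

# Print one JSON document per line.
for feature in csv_to_geojson(sample):
    print(json.dumps(feature))
```

The point of the sketch is the nesting: longitude and latitude arrive as two separate columns and leave as a single `coordinates` array inside a `geometry` sub-document, which is exactly the kind of structure a flat field-to-attribute mapping cannot express.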


Talking to Twitter’s REST API v1.1 with R

An up-to-date tutorial on using the Twitter API from R on Linux and Windows can be found here.

In this text I am going to describe a very straightforward way to make use of Twitter’s REST API v1.1. I put some code together for that purpose, so that requesting data just needs the API URL, the API parameters and a vector containing the OAuth parameters.

Before you can get started you have to log in to your Twitter account on dev.twitter.com, create an application and generate an “Access Token” for it. So let’s jump right in and fetch the IDs of 10 followers of @hrw (Human Rights Watch). The necessary code is located on GitHub – download all three files and then you can just edit the example below as suggested.


Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku

What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Pentaho Kettle/PDI, and then how to represent that fact table with a Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman’s xmla4js.

In case you are new to this I recommend reading my articles on the following topics involved here:


Using the Dimension Lookup/Update Step in Pentaho Kettle

In a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from gender (“male”, “female”, “intersex”) to a hierarchy representing a location (“Germany”, “RLP”, “Mainz”) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slowly Changing Dimensions (SCD – “slowly” because they are expected to change only infrequently):


A StackOverflow for Business Intelligence – or what BI Can Learn from PHP!

Update 2015-08-25:

The proposal was not successful and has been deleted :(

A gamified, high-speed, high-quality Q&A site for topics revolving around making professional sense of a company’s data – a.k.a. “Business Intelligence” – wouldn’t that be awesome? And let’s face it – a question on how to configure a step in Pentaho Kettle does not fit any StackExchange site’s realm yet. Usually this type of question is asked on StackOverflow, but the feedback latency is quite high, to say the least. Or let’s take a question on how to design a KPI – this one usually ends up on CrossValidated but will often be greeted with disdain given its statistical triviality – plus most people in statistics are not working with BI and won’t be open to the subject’s specific intricacies. And finally you are wondering about how to configure a MySQL RDBMS for a data warehouse – where to ask that? On dba.SE … I guess. And suddenly you get weird issues with Tomcat, which you need for Pentaho BI Server – hmmm, SuperUser? Or ServerFault?

It’s just too distributed!


FIR Filter Design and Digital Signal Processing in R

This article serves the purpose of illustrating that signal processing with R is possible – thanks to the signal package – and of keeping a reference of some of the stuff that I learned at my last edX course. Anyway, I am by no means an expert on signal processing, so I’d prefer to let the pictures and the code speak for themselves. But to give you the idea – I showcase the creation and application of an FIR band-pass filter (Chebyshev Type 1 in this case) and of an FIR filter created using the Parks-McClellan method with the Remez exchange algorithm. The code snippets are taken from a larger R script which you can find on GitHub. I aim to focus on the essential parts. You’re welcome to share your knowledge and corrections by leaving a comment.


“Discrete Time Signals and Systems” at edX by Richard Baraniuk

Attending “Discrete Time Signals and Systems” by Richard Baraniuk from Rice University was an awesome experience on many levels. Right after “Learning from Data”, it is my second favorite MOOC so far. First of all, the subject of extracting a signal from a discrete time series in terms of frequency composition is interesting by itself and provided a smooth opportunity for me to revise some of the math I studied many years ago. But this by itself wouldn’t make a learning experience that superb – what it takes for that is a teacher who knows how to get the knowledge across to the student. And in that regard – apart from the science itself – Richard is a master! It was obvious how much effort Baraniuk and his team put into designing the course. Every detail about the lectures and the exercises seemed superbly well crafted. And this is apparently not by chance – after googling the professor’s name I found that he is actually something like a MOOC evangelist and very passionate about offering such an opportunity to learners around the world.


Getting Started With Pentaho BI Server 5, Mondrian and Saiku

Pentaho’s BI Server or BA platform allows you to access business data in the form of dashboards, reports or OLAP cubes via a convenient web interface. Additionally it provides an interface to administer your BI setup and schedule processes. The aim of this tutorial is to illustrate how to get started with the BI Server. For that purpose I am going to use a small and artificial data set, so I can keep this text lean and don’t have to get deeper into further technologies. I am going to elaborate on Mondrian schemas, data warehouse design, MDX and further related concepts in separate articles. When you are through with this text and still hungry, make sure you check out “Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku” for a more advanced use case.


Relation of Word Order and Compression Ratio and Degree of Structure

Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure the information contained in data, this time I asked myself whether, and if so how, permuting a text’s words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …
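The basic measurement can be sketched in a few lines of Python using the standard library's bz2 module. This is only an illustration under stated assumptions, not the code behind the results in the article: the ratio is taken here as compressed size divided by original size, the function names are made up, and the repetitive sample sentence merely stands in for a real text.

```python
import bz2
import random


def compression_ratio(text):
    """Compressed size divided by original size (assumed definition)."""
    raw = text.encode("utf-8")
    return len(bz2.compress(raw)) / len(raw)


def shuffled_ratio(text, seed=0):
    """Compression ratio after randomly permuting the text's words."""
    words = text.split()
    # A seeded generator keeps the permutation reproducible.
    random.Random(seed).shuffle(words)
    return compression_ratio(" ".join(words))


# Hypothetical stand-in for a real text: one sentence repeated 200 times.
text = " ".join(["the quick brown fox jumps over the lazy dog"] * 200)

print("original:", compression_ratio(text))
print("shuffled:", shuffled_ratio(text))
```

Running `shuffled_ratio` for many different seeds yields a distribution of ratios for one text, which can then be compared against the ratio of the unpermuted original.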
