LOCF and Linear Imputation with PostgreSQL

This tutorial will introduce various tools offered by PostgreSQL, and SQL in general – like custom functions, window functions, aggregate functions, and the WITH clause (or CTE, for Common Table Expression) – for the purpose of implementing a program that imputes numeric observations within a column, applying linear interpolation where possible and LOCF-style forward and backward padding where necessary. I'm going to add and explain those constructs progressively, step by step, so no problem if you are new to the scene. I am very much interested in input regarding potential downsides of the implementation and possible improvements.
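Before diving into the SQL, the imputation logic itself is easy to state in a few lines. Here is a minimal Python sketch of the idea – linear interpolation between known neighbors, LOCF-style padding at the edges – not the PostgreSQL implementation the tutorial builds:

```python
def impute(values):
    """Fill None gaps: linear interpolation between known neighbors,
    LOCF-style forward/backward padding at the edges."""
    vals = list(values)
    known = [i for i, v in enumerate(vals) if v is not None]
    if not known:
        return vals
    # backward-pad the leading gap, forward-pad the trailing gap
    for i in range(known[0]):
        vals[i] = vals[known[0]]
    for i in range(known[-1] + 1, len(vals)):
        vals[i] = vals[known[-1]]
    # linear interpolation between consecutive known observations
    for a, b in zip(known, known[1:]):
        step = (vals[b] - vals[a]) / (b - a)
        for i in range(a + 1, b):
            vals[i] = vals[a] + step * (i - a)
    return vals

print(impute([None, 2.0, None, None, 8.0, None]))
# [2.0, 2.0, 4.0, 6.0, 8.0, 8.0]
```

The SQL version has to express the same neighbor lookups with window functions instead of index arithmetic, which is where most of the interesting machinery comes in.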

Continue reading

MongoDB – State of the R

Naturally, there are two reasons why you might need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want to store your data in MongoDB instead of using native R structures like data.table or data.frame

In-memory data storage like data.table is very fast especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes along with a bag of goodies making it a tempting choice for a number of use cases:

  • Flexible, schema-less data structures
  • Spatial and textual indexing
  • Spatial queries
  • Persistence of data
  • Easy access from other languages and systems

In case you would like to learn more about MongoDB, I have good news for you – MongoDB Inc. provides a number of very well-made online courses catering to various languages. You can find an overview here.

Continue reading

Free and Certified MongoDB Online Courses

In case you are interested in learning about MongoDB, or are generally curious about non-relational approaches to data storage, my recommendation is to check out the online courses offered by MongoDB Inc. I promise you won't be disappointed. MongoDB Inc.'s educational department – MongoDB University – offers five courses for developers and dev ops:

Continue reading

How to Import a CSV into MongoDB using AWK

In case the desired JSON object structure is just a flat set of simple attributes, this can be achieved by using mongoimport directly. But if some of the fields are supposed to be combined into an array or a sub-document, mongoimport won't help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects, and in the course of that teach you the basics of AWK.
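The AWK itself is in the full post; to illustrate the kind of restructuring mongoimport can't do on its own, here is a Python sketch of the same transformation. The column names (name, lon, lat) and the sample rows are made up for illustration:

```python
import csv
import io
import json

# hypothetical input CSV with a name and a coordinate pair per row
raw = """name,lon,lat
Berlin,13.405,52.52
Paris,2.3522,48.8566
"""

features = []
for row in csv.DictReader(io.StringIO(raw)):
    # combine the lon/lat columns into a GeoJSON Point sub-document
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [float(row["lon"]), float(row["lat"])],
        },
        "properties": {"name": row["name"]},
    })

# one JSON document per line, the shape mongoimport consumes
for f in features:
    print(json.dumps(f))
```

The point is the nesting: two flat CSV columns become one coordinates array inside a geometry sub-document, which is exactly the step mongoimport alone can't express.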

Continue reading

MapReduce with R on Hadoop and Amazon EMR

You all know why MapReduce is fancy – so let's just jump right in. I like researching data and I like to see results fast – does that mean I enjoy the process of setting up a Hadoop cluster? No, I doubt there is any correlation – neither causal nor merely statistical. The good news is that quite a lot of cloud computing providers already offer Hadoop clusters on demand! For this article I got my hands on Amazon's Elastic MapReduce (EMR) service (an extension of its EC2 service), which sets up the Hadoop cluster for you. Okay – almost, at least. We are going to count 2-grams in (dummy text) data using the stringdist library.
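The post does this in R on a real cluster; stripped of all the infrastructure, the map/reduce structure of 2-gram counting fits in a few lines. A local Python sketch, purely for illustration:

```python
from collections import Counter

def count_bigrams(text):
    """Count character 2-grams, map/reduce style:
    emit one pair per position (map), then sum per pair (reduce)."""
    pairs = (text[i:i + 2] for i in range(len(text) - 1))  # map step
    return Counter(pairs)                                   # reduce step

counts = count_bigrams("banana")
print(counts["an"], counts["na"], counts["ba"])  # 2 2 1
```

On Hadoop the emit and the sum run on different machines with a shuffle in between, but the per-record logic is exactly this.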

Continue reading

Generating virtual tables for JOIN operations in MySQL

Let me introduce this post with an example. You offer a web service and every request to it is logged in a database table – recording the type of request, the user, a lot of other stuff and, first and foremost, the (date)time it happened. Now you want to know the number of requests within each hour, chronologically ordered, over the last four weeks – to get an idea of when your customers are using the service. If you encounter this situation for the first time, your query might look like this:
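The actual SQL is in the full post. As a rough, language-agnostic illustration of the idea – bucket the timestamps by hour, then walk a generated series of hours (the "virtual table") so that empty hours show up as zero – here is a Python sketch with made-up log data:

```python
from collections import Counter
from datetime import datetime, timedelta

# hypothetical request-log timestamps
log = [
    datetime(2014, 1, 1, 9, 15),
    datetime(2014, 1, 1, 9, 47),
    datetime(2014, 1, 1, 11, 5),
]

# truncate each timestamp to its hour and count per bucket
per_hour = Counter(
    t.replace(minute=0, second=0, microsecond=0) for t in log
)

# the "virtual table" of every hour in the range, so hours
# without any request still appear (with a count of 0)
hour, end = min(per_hour), max(per_hour)
while hour <= end:
    print(hour, per_hour.get(hour, 0))
    hour += timedelta(hours=1)
```

A naive GROUP BY on the log alone would silently skip the 10:00 row here; generating the full series of hours and joining the counts against it is what the MySQL trick in the post achieves.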

Continue reading

Counting weeks (ISO 8601)

Using the ‘week’ as a time unit can be a bit confusing for at least two reasons:

  1. Does the week start on Sunday or Monday?
  2. How do you count a week that spans two years?

Now your take on those two questions, and anybody else's, will probably not just depend on cultural upbringing – but even more on how it has been done so far. Nobody wants to convert old statistics to a new counting, or maintain two parallel statistics for one year with both countings. So, to make a long story short, I'll just stick with the ISO 8601 standard because it seems reasonable to me. Here is what ISO 8601 has to say about the two questions:
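As a quick illustration of both answers, Python's standard library implements ISO 8601 week numbering directly via `date.isocalendar()` – weeks start on Monday, and week 1 is the week containing the year's first Thursday, so a year-straddling week gets a single number:

```python
from datetime import date

# 2012-12-31 is a Monday, yet ISO-wise it already belongs to 2013, week 1
for d in (date(2012, 12, 31), date(2013, 1, 1), date(2013, 1, 6)):
    iso_year, iso_week, iso_weekday = d.isocalendar()
    print(d, "->", iso_year, "W%02d" % iso_week, "weekday", iso_weekday)
```

Note that the ISO year (2013 for 2012-12-31) can differ from the calendar year, which is precisely how the standard resolves the second question.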

Continue reading

Make R(ODBC) talk to MySQL on Windows 7 64bit

When you are dealing with large data sets it is much more efficient to organize them in database tables instead of CSVs or other files. Just yesterday I set up R to fetch data from a MySQL DBMS, loading a table of stock quotes consisting of more than 300'000 rows into a data frame within seconds. That is pretty cool – and if necessary you can join huge tables in no time, benefiting from the indexing infrastructure of the DBMS of your choice.

Continue reading