Guide to EC2 from the Command Line

AWSThis tutorial aims at guiding your first steps at controlling your EC2 instances from the command line. It is by no means even remotely complete but it will give you an impression of the basic structure and concepts, so you can quickly fill in the gaps for your personal use case. The tutorial starts with setting up your account and forges a bridge from requesting a Spot instance, over exchanging files with it, hooking up additional storage, to finally terminating it. I am not though explaining interaction with the AWS web console – we’ll only resort it for some initial configuration. As usual the target audience are Linux users but the AWS CLI tools are pretty much identical for Windows.

Continue reading

A Guide on OCR with tesseract 3.03

Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aimes at structuring what I needed to learn about tesseract to OCR-convert PDFs to text and how to train tesseract for application to new fonts. Let me dampen your expectations – you *will* have to read further texts (esp. the official documentation) to actually perform successful training! This text is describing usage of tesseract 3.03 RC on Ubuntu 14.04. Tesseract is also available for other Linuxes and Windows – the work flow will be mostly the same across OSes – of course some commands I use are though specific to Ubuntu. Also mind that tesseract 3.03 is considerably different to 3.02, which again differs from  3.01 as well – the changes are partially more fundamental than what you might expect from the version numbers.

Continue reading

OAuth 2.0 for Google (Analytics) API with Python Explained

oauth2In this tutorial I am going to explain how OAuth 2.0 works and how to apply it for interacting with Google Analytics API using Python. Google provides for that purpose a Python package – which so far only supports Python 2 though … well.

OAuth2 seems to be quite a mess at first and Google’s documentation on this subject is not that well organized in my opinion. So with this article I do my best to save you the sweat I had to invest. After all it’s not that complicated anyway, as you will probably agree.

Continue reading

Transforming an XML Document into a CSV using XMLStarlet

In this little tutorial I am going to describe a handy tool for transforming an XML document into a more easily processable CSV format. There are many ways of getting this job done – but most are more tedious than necessary (like writing a custom made RegEx parser – yuck!). Using XMLStarlet and XPath expressions this is going to be cinch. Let’s evaluate a number of typical XML data configurations and turn them into a flat CSV structure.

Continue reading

How to Import a CSV into MongoDB using AWK

In case the desired JSON objects structure is just a set of simple attributes this can be achieved by using mongoimport directly. But in case some of the fields are supposed to be combined into an array or a sub-document, mongoimport won’t help you. In this tutorial I will show you how to transform a CSV into a collection of GeoJSON objects and in the course of that teach you the basics of AWK.

Continue reading

Using the Linux Shell for Web Scraping

bbcLet’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID “mostPopular” and you can figure this out using the Developer Tools of your favorite web browser. And now we are going to apply a chain of command line tools – each feeding their output to the next tool (that is called piping btw) and in the end we have a layouted text representation of the box’ content:

Continue reading

Tool for Visualization of Connections between Agents and Entities in Context of Redtubegate

redtubegate-inter-connEarly in December 2013 a lawfirm began to send out approximately 10 to 40 thousand cease-and-desist letters on behalf of the rightholder of a bunch of porn flicks for streaming those films on redtube. So far, so good. Now a lot of people didn’t like to receive bills ranging from 250 to more than a thousand Euro for streaming erotica just before christmas especially when being pretty sure that they didn’t even do so. Now given the magnitude of this case a lot of these people turned sour and started to dig a bit deeper. And what was brought to light is a shady network of companies with links where there should be none and a bunch of business partners who as well turned out to have more in common than what was to be seen at first glance.

Continue reading

Pivoting Data in R Excel-style

(This article is referring to an initial proof-of-concept version of r-big-pivot)

I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:

  • a maximum number of one point something million rows per data set (which is crucial because the data needs to be formatted long)
  • Excel is slow and often instable when you go beyond 100’000 rows
  • cumbersome integration into available data infrastructures
  • restricted charting and analysis capabilities

Continue reading

HD clips with ffmpeg for Youtube and Vimeo

In this article I am going to show you how to convert a sequence of images (PNGs in this case) into a clip that can be uploaded to and watched on Youtube or Vimeo in HD. ffmpeg – the tool we are going to use – has to be dealt with through the command line. The large number of different parameters and settings which often depend on each other or are mutually exclusive are quite intimidating at first sight (at second as well). Actually it took me quite a while to produce a clip that did not weigh more than 100 MB while keeping a good quality and didn’t produce obscure issues. But in the course of this odyssey I learned the basics of digital videos and also, at the end of the day, the settings which led to my desired outcome even started to make (some) sense!

First of all I created a thousand PNGs (named image000.png, image001.png, …, image999.png) using ggplot2 in R. If you don’t know ggplot2 yet, check out my introductory article on it and the code I used for the article (about stock quote scatter plots changing in time).

Okay, here we go (the new lines are added just for clarity):

Continue reading