In my last post I was describing how to calculate “returning visitors” in a customizable way depending on how you want to define “returning”. At work as well as for personal projects I use for ETL processes Pentaho Data Integration (PDI) aka Kettle. PDI provides a step for fetching data from Google Analytics and I am going to describe in this post how to use this feature on the basis of the job I clicked together for the article on “returning visitors”. I will focus on the steps and aspects relevant to the subject.
You can download the zipped Job here. You might have to adjust the paths.
The job
Because the API request for the number of unique visitors has to be invoked for every needed time span – the calender weeks (Monday to Sunday) and the time spans stretching two consecutive calendar weeks. “week list” creates a set of rows for every time span keeping the calendar week number (cw), the startDate and the endDate of the time span.
This set of rows is passed on to the next transformation “google analytics” which is executed for every row individually. How the data is transported from “week list” to “google analytics” I described in my last post on bulk-downloading files with Kettle – there I just used a Job on the receiving end but the principle is exactly the same.
For every API request the resulting figures are stored in a TSV file which can then be fed to the spreadsheet program of your choice for further processing.
Setting up Google Analytics API access
Before you can access Google Analytics with its API you have to activate that service via the Google API Console. There you can set the permissions for Google accounts – mind that for the API access you need an account that does not use Two-Factor-Authentication (a technology I otherwise very much recommend – it won’t keep the NSA away though). Under “API Access” you’ll finally find the API Key we are going to need for accessing the API.
Querying the API
Okay, what do I want to query? I want to request the number of visitors (ga:visitors) which did not access the web site from Munich (ga:city!=Munich) and that for the time span from ${startDate} to ${endDate} which are parameters originating from the respective fields served by the “week list” transformation.
And don’t forget the credentials and the API key in the top section – otherwise you’ll get a Java Exception instead of the numbers.
Best way to get used to the Google API syntax is by playing around with the “Google Analytics Query Explorer” and reading through the official documentation.
Appropriate date formats
Google expects the dates to be provided using the format yyyy-mm-dd, f.x. 2013-06-24. Kettle will naturally output dates though using a different format – looking like this: 2013/06/24 00:00:00.000. To convert the formatting to how Google Analytics expects it I use the “Select/Rename values” step in the “week list” transformation – check it out.
If something remaind mysterious because I skipped it or didn’t explain it well – you are welcome to post a comment with your question.
found it:)
Once again me:)
What is to change in the GA that the output is not a csv/txt or whatever file…
I want to load the results directly in the data warehouse?
I tried it with a table output directly from the get parameters but something is going wrong also the database connection is successful:(
Do I need to store the information before in a data grid or something and then load it via table output??
Found it:)
Thanks!!
Thanks for the fast reply!
True…as a newbie I didn’t check the registered value and the defaults! :)
But how can I change that the default value in dates is this one from the main job??
Best regards,
Holger
Hi Holger,
thanks for the kind words.
I assume you tried to implement an own job. Passing around parameters is tricky. If you stay as close as possible to the setup used here then my best guess is that you didn’t “register” the parameters in dates.ktr via the Transformation properties:
Regards
Raffael
Hello There,
thanks for that really nice Example!
I have problems with passing the parameters initialDate and numofWeeks from the main.kjb to the dates.ktr
No matter what I enter in the main.kjb the value is not passed to the dates.ktr and therefore is not analyzing the right timeframe.
What am I doing wrong??
Thanks for the help!
Holger