**Imputing Observations**

A standard situation for somebody working with data is being faced with missing observations. Depending on the type of observation imputation might be possible. The most simple methods of padding those gaps are to forward or backfill observations up and down an ordered and related sequence of records. This is often referred to as LOCF (last observation carried forward) and FOCB (first observation carried backward). Another often reasonable method is to assume a linear transformation between observations.

(GitHub / SQL Fiddle – *at the beginning of every section you will find a link to the code on GitHub and an SQL Fiddle session*)

We start with a table (

tbl) of four columns

t::float,

a::int,

b::intand

v::float.

aand

bhold the keys for identifiying related records – this is what we will partition over.

trepresents the time and

vthe observations that we want to impute. For example, think of

bas a town in country

awhere temperature

vwas measured on 12pm on date

t.

(GitHub / SQL Fiddle)

An LOCF is at the end of the day just a

COALESCE(b,a)of two observations with

ahaving been observed before

b. If

b is not NULLwe take it, if it is we use the value of the observation before, i.e.

a. If we perform those steps iteratively starting with an initial observation (initial condition) of

NULLthen we will inductively fill all gaps after stumbling upon a first non-NULL observation. The following function implements this mechanism, but first we will apply it in a non-aggregating way. The script will just add another column

v_or_twhich holds

vin case

v IS NOT NULLand

tif

v IS NULL.

CREATE OR REPLACE FUNCTION fun(a FLOAT, b FLOAT) RETURNS FLOAT LANGUAGE SQL AS ' SELECT COALESCE(b, a) '; select a,b,t,v, fun(t,v) as v_or_t from tbl order by a,b,t ;

Here we

CREATE OR REPLACEa

FUNCTIONnamed fun, which takes two

FLOATtyped arguments

aand

band

RETURNS FLOATas well. The function is implemented by means of the

LANGUAGE SQL(instead of PL/pgSQL or some other language). An SQL programmed function will return the result of the last evaluated

SELECTstatement. The code itself has to be delimited by single quotes (

') or a string like f.x. “$$” and follows after the keyword

AS.

(GitHub / SQL Fiddle)

Now instead of applying the function (which I renamed to

locf) horizontally within a record we tell PostgreSQL to apply it iteratively in a specified

ORDERand within specified boundaries (the

PARTITION) vertically across two records:

create or replace function locf_s(a float, b float) returns float language sql as ' select coalesce(b, a) '; drop aggregate if exists locf(float); CREATE AGGREGATE locf(FLOAT) ( SFUNC = locf_s, STYPE = FLOAT ); select a,b,t,v, locf(v) over (PARTITION by a,b ORDER by t) as v_locf from tbl order by a,b,t ;

As you can see the process of creating the LOCF function is now split into two steps. First we define the transformation of two values into one (

locf_s) and then we create the actual aggregating function (

locf) by specifying how to apply the transformation. Here we just define what in this context is referred to as the “state transition function” (

SFUNC) which returns a value of the “state type” (

STYPE)

FLOAT.

In the final query we apply

locf()over the

PARTITIONdefined by

aand

band traverse the records by ordering

tascending.

(GitHub / SQL Fiddle)

Now we go one step further and fill all the missing oberservations by padding a missing oberservation preferredly with the last seen value and if that is not possible we use the first available value. Technically we fill one column

v_locfwith the LOCF values, another column

v_focbwith the FOCB values and then choose for

v_finalthe value from

v_locfif it is not NULL and otherwise the value from

v_focb.

select a,b,t,v, locf(v) over t_asc as v_locf, locf(v) over t_desc as v_focb, COALESCE( locf(v) over t_asc, locf(v) over t_desc ) as v_final from tbl WINDOW t_asc as (partition by a,b order by t), t_desc as (partition by a,b order by t desc) order by a,b,t ;

WINDOWkeyword simply allows us to to name an ordered partition and then use it in the corresponding select query. Second of all, the FOCB mechanism is just LOCF applied to the inversely ordered partition (

t_desc). And finally, it might be worth noting that you can use multiple aggregate functions within another function –

COALESCEin this case.

(GitHub / SQL Fiddle –* if you think SQL Fiddle is pretty cool – why not donate a few bucks to support the project?!*)

This query is composed of two main steps. First we unite all the necessary values spread across multiple rows, so those values are available locally to every row and then we calculate row based the imputed value.

WITH tbl0 as ( SELECT a, b, t, v, locf(v) over t_asc as v_locf, locf(v) over t_desc as v_focb, locf(case when v is not null then t else null end) over t_asc as t_locf, locf(case when v is not null then t else null end) over t_desc as t_focb from tbl window t_asc as (partition by a,b order by t), t_desc as (partition by a,b order by t desc) ), tbl1 as ( SELECT a, b, t, v, v_locf, v_focb, t_locf, t_focb, case when t_focb != t_locf and v_locf is not null and v_focb is not null then v_locf + ( (v_focb - v_locf) * (t - t_locf) / (t_focb - t_locf) ) else coalesce(v_locf, v_focb) end as v_final from tbl0 order by a,b,t ) SELECT * from tbl1 order by a,b,t;

WITHclause is simply a way to chain temporary tables. The result of the first SELECT is named

tbl0and used within the second SELECT whose result set is named

tbl1and referred to by the final (technically not necessary) SELECT query. The statements formulated using a WITH clause are also referred to as CTEs or Common Table Expressions. They can make complex queries more readable and allow the formulation of recursive queries, for example to traverse a tree. CTEs are controversial with respect to their performance (“optimization fence“).

Now with the different parameters assembled we can interpolate with the following simple formula:

Now if we subsitute above’s last SELECT query with the following UPDATE statement then the imputed set of values for

vare directly written to

tbl:

update tbl as tb set v = t1.v_final from t1 where tb.a = t1.a and tb.b = t1.b and tb.t = t1.t

(original article published on www.joyofdata.de)

]]>