Recently we had some clients who had the desire to store timeseries in PostgreSQL. One of the questions which interested them is related to calculating the difference between values in timeseries data. How can you calculate the difference between the current and the previous row?
To answer this question I have decided to share some simple queries outlining what can be done. Note that this is not a complete tutorial about analytics and windowing functions but just a short introduction to what can be done in general.

Loading sample data

Let us load some sample data:

cypex=# CREATE TABLE t_oil 
        (
          region       text, 
          country      text, 
          year         int, 
          production   int, 
          consumption  int
        );

cypex=# COPY t_oil 
        FROM PROGRAM 'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt';
COPY 644

If you are a superuser you can use COPY … FROM PROGRAM directly. Otherwise you have to load the text file in a different way.

lag: How to access a different row in SQL

If you want to calculate the difference between two rows, you can make use of the “lag” function. However, there is a question naturally arising here: If we want to access the previous row. What is the previous row? We will need some kind of order. To do that SQL provides the OVER-clause:

cypex=# SELECT country, year, production, 
               lag(production, 1) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE country = 'USA' 
        LIMIT 10;
 country | year | production |  lag  
---------+------+------------+-------
 USA     | 1965 |       9014 |      
 USA     | 1966 |       9579 |  9014
 USA     | 1967 |      10219 |  9579
 USA     | 1968 |      10600 | 10219
 USA     | 1969 |      10828 | 10600
 USA     | 1970 |      11297 | 10828
 USA     | 1971 |      11156 | 11297
 USA     | 1972 |      11185 | 11156
 USA     | 1973 |      10946 | 11185
 USA     | 1974 |      10461 | 10946
(10 rows)

In my example I have ordered the data by year to make sure that the previous year can indeed be found in the previous row.

Once you find the correct row, the rest is easy:

cypex=# SELECT country, year, 
               production - lag(production, 1) OVER (ORDER BY year) AS diff 
        FROM   t_oil 
        WHERE  country = 'USA' 
        LIMIT 10;
 country | year | diff 
---------+------+------
 USA     | 1965 |     
 USA     | 1966 |  565
 USA     | 1967 |  640
 USA     | 1968 |  381
 USA     | 1969 |  228
 USA     | 1970 |  469
 USA     | 1971 | -141
 USA     | 1972 |   29
 USA     | 1973 | -239
 USA     | 1974 | -485
(10 rows)

What is important to see here is that the first row contains a NULL entry because there is no known difference to the previous row.

first_value function

What many people need is the difference between the current and the first row in the data set. PostgreSQL (or ANSI SQL in general to be more precise) offers the “first_value” function which returns the first row given the order provided by us. Here is how it works:

cypex=# SELECT t_oil.country, 
               t_oil.year, 
               t_oil.production, 
               t_oil.production - first_value(t_oil.production) OVER (ORDER BY t_oil.year) AS diff_first
        FROM   t_oil
        WHERE  t_oil.country = 'USA'::text 
        LIMIT 10;
 country | year | production | diff_first 
---------+------+------------+------------
 USA     | 1965 |       9014 |          0
 USA     | 1966 |       9579 |        565
 USA     | 1967 |      10219 |       1205
 USA     | 1968 |      10600 |       1586
 USA     | 1969 |      10828 |       1814
 USA     | 1970 |      11297 |       2283
 USA     | 1971 |      11156 |       2142
 USA     | 1972 |      11185 |       2171
 USA     | 1973 |      10946 |       1932
 USA     | 1974 |      10461 |       1447
(10 rows)

As you can see, in this case everything is relative to the first row.

To visualize the results I have quickly built a mini CYPEX dashboard:

cypex visualization

The data looks correct so we can move on to the next examples.

Mixing data sets in analytics

But what happens if we start to look at two countries? If we order by year we might hit the wrong row. If we order by both columns we might still hit a row associated to a different country. The solution is the PARTITION BY clause. PostgreSQL will break up the data in various groups and calculate the difference again (for each group). Here is how it works:

cypex=# SELECT country, year, production, 
               lag(production) OVER (PARTITION BY country ORDER BY year) AS diff 
        FROM   t_oil 
        WHERE country IN ('Canada', 'Mexico') 
               AND year < 1970;
 country | year | production | diff 
---------+------+------------+------
 Canada  | 1965 |        920 |     
 Canada  | 1966 |       1012 |  920
 Canada  | 1967 |       1106 | 1012
 Canada  | 1968 |       1194 | 1106
 Canada  | 1969 |       1306 | 1194
 Mexico  | 1965 |        362 |     
 Mexico  | 1966 |        370 |  362
 Mexico  | 1967 |        411 |  370
 Mexico  | 1968 |        439 |  411
 Mexico  | 1969 |        461 |  439
(10 rows)

In this example each group contains a NULL value because there is no “previous” value. This is proof that PostgreSQL handles the groups separately.

Using sliding windows

One more thing many people are interested in is the necessity to calculate moving averages. I decided to include this example in this post about differences because the problem pops up so often that it deserves some more attention. In many cases this type of operation is calculated on the application level which is clearly the wrong place to do because of performance reasons:

cypex=# SELECT country, year, production, 
	       avg(production) OVER (ORDER BY year 
		   ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mov 
        FROM 	t_oil 
        WHERE country IN ('Saudi Arabien') 
              AND year BETWEEN 1975 AND 1990;
    country    | year | production |         mov          
---------------+------+------------+-----------------------
 Saudi Arabien | 1975 |       7216 | 8465.6666666666666667
 Saudi Arabien | 1976 |       8762 | 8487.7500000000000000
 Saudi Arabien | 1977 |       9419 | 8758.4000000000000000
 Saudi Arabien | 1978 |       8554 | 9369.2000000000000000
 Saudi Arabien | 1979 |       9841 | 9668.0000000000000000
 Saudi Arabien | 1980 |      10270 | 9176.4000000000000000
 Saudi Arabien | 1981 |      10256 | 8455.8000000000000000
 Saudi Arabien | 1982 |       6961 | 7394.4000000000000000
 Saudi Arabien | 1983 |       4951 | 6060.6000000000000000
 Saudi Arabien | 1984 |       4534 | 5051.0000000000000000
 Saudi Arabien | 1985 |       3601 | 4578.6000000000000000
 Saudi Arabien | 1986 |       5208 | 4732.4000000000000000
 Saudi Arabien | 1987 |       4599 | 4952.6000000000000000
 Saudi Arabien | 1988 |       5720 | 5653.4000000000000000
 Saudi Arabien | 1989 |       5635 | 5764.7500000000000000
 Saudi Arabien | 1990 |       7105 | 6153.3333333333333333
(16 rows)

In SQL you can use ROWS BETWEEN … PRECEDING AND … FOLLOWING.

This defines the number of rows going into the aggregate function (in our case “avg”). The idea of a moving average is to flatten the curve and create a smoother line. The following picture shows how this works:

Cypex dashboard

As you can see in the CYPEX visualization the moving average is a lot smoother than the rate underlying data. Saudi Arabia is a so called “swing producer”. Depending on the political situation the production rate might vary significantly so using a moving average actually makes a lot of sense.

Composite types and row comparisons

However, there is more: Some of you might know that PostgreSQL supports composite data type. Basically every row can be seen as a single element containing various components. Usually a SELECT clause lists all desired fields, but you can also see a table as a single field as shown in the next example:

cypex=# SELECT t_oil, lag(t_oil) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE  country = 'Mexico'  
               AND year IN (1980, 1981);
                  t_oil                  |                   lag                   
-----------------------------------------+-----------------------------------------
 ("North America",Mexico,1980,2129,1048) | 
 ("North America",Mexico,1981,2553,1172) | ("North America",Mexico,1980,2129,1048)
(2 rows)

In this case all columns of a row are packed into a single field. You can use the “lag” function normally …

The trick now is: You can use “=” to compare two rows directly. Why is that important? Sometimes you want to see if two rows were imported twice or you simply want to know if two consecutive rows are identical. This is how it works:

cypex=# SELECT t_oil = lag(t_oil) OVER (ORDER BY year) 
        FROM   t_oil 
        WHERE  country = 'Mexico' 
               AND year IN (1980, 1981);
 ?column? 
----------
 
 f
(2 rows)

It is possible to compare entire rows to each other. PostgreSQL will inspect one field after the other and only issue true in case all fields are the same. In other words: “lag” can even be abused to detect duplicate rows.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.