Understanding the relationship between data points is crucial. For instance, you might need to identify the most recent orders for each customer or track changes in sensor readings over time. Unlike aggregate functions, which summarise data into a single row, it is window functions that allow you to analyse data while preserving each row’s details. This is the core of the logic, but don’t worry if you struggle to imagine the difference, as we will cover all of it in this article.
PostgreSQL supports SQL window functions, facilitating complex calculations across related rows within a table. These functions are particularly useful for tasks such as ranking entries, calculating running totals, finding moving averages, and comparing individual entries. Mastering window functions can significantly enhance your data analysis capabilities.
You can easily use similar window functions as you would with aggregation. Let’s take a simple example of sensor readings:
CREATE TABLE sensor_readings (
sensor_id bigint,
reading_value decimal,
reading_time timestamp with time zone default current_timestamp
);
INSERT INTO sensor_readings (sensor_id, reading_value, reading_time) VALUES
(1, 32.7, '2024-07-01 11:24:34'),
(1, 33.1, '2024-07-02 11:29:01'),
(1, 33.1, '2024-07-02 12:03:59'),
(1, 33.0, '2024-07-03 10:12:15'),
(2, 32.8, '2024-07-01 13:17:01'),
(2, 35.8, '2024-07-02 09:18:11'),
(3, 29.1, '2024-07-01 13:54:03'),
(3, 30.3, '2024-07-01 14:12:09'),
(3, 31.5, '2024-07-02 16:07:43');
When you start with the aggregation functions, it is straightforward for anybody familiar with SQL:
SELECT
sensor_id,
AVG(reading_value)
FROM sensor_readings
GROUP BY sensor_id;
with the output
sensor_id | avg
-----------+---------------------
3 | 30.3000000000000000
2 | 34.3000000000000000
1 | 32.9750000000000000
While similar, using the window function AVG we can get almost same result:
SELECT
sensor_id,
reading_value,
AVG(reading_value) OVER (PARTITION BY sensor_id) AS avg_reading_value
FROM sensor_readings;
such as
sensor_id | reading_value | avg_reading_value
-----------+---------------+---------------------
1 | 32.7 | 32.9750000000000000
1 | 33.1 | 32.9750000000000000
1 | 33.1 | 32.9750000000000000
1 | 33.0 | 32.9750000000000000
2 | 32.8 | 34.3000000000000000
2 | 35.8 | 34.3000000000000000
3 | 29.1 | 30.3000000000000000
3 | 30.3 | 30.3000000000000000
3 | 31.5 | 30.3000000000000000
This reiterates the fundamental difference between the two sets of functions. As mentioned earlier, while the results for sensor_id
are the same in both cases, the aggregate function summarised it into a single row (grouped by sensor_id
), whereas the window function provides the value for the set of rows in the partition defined by sensor_id
.
After showing the average in a window function, it’s important to note that traditional aggregation functions might not be the most helpful in the context of window function logic. Despite functions like AVG
, COUNT
, and SUM
—which are the most used aggregation functions—being available as window functions, we used the above only to demonstrate the difference.
OVER clause
Before diving into the individual functions, let’s cover the syntax first. From the sample query above, you already get the basic syntax of the window functions, with the OVER
clause being a primary identification of windowing functionality.
window_function ([expression...]) OVER window_definition
In our example, the basic window functions can be similar to the aggregate ones— like AVG
, SUM
and COUNT
- and a number of the functions we will cover shortly.
The window definition part specifies how the window function will see the data it works over. It can include:
- Partitioning to divide the result sets into separate partitions (think groups in aggregate functions).
- Definition of the ordering of the rows within each partition.
- Specification of how to apply framing of the subset of rows for each row’s calculation.
From all the components of the window syntax, only partitioning is mandatory.
If we revisit our first window function example above, you can identify the partitioning part (PARTITION BY sensor_id
). As already mentioned, you can easily compare the partitioning logic to the grouping used in aggregate functions, with the same properties—like partitioning data segments by multiple fields, using functions and other logic.
With partitioning comes hand in hand ordering of the data. When you start thinking of row properties, instead of grouping data into a single row, order starts to make a difference. Let’s take the following example:
window_function ([expression...]) OVER (PARTITION BY sensor_id ORDER BY reading_time)
This will provide different data compared to unsorted ones. If you struggle to find the application for this, think of the moving average or row numbering.
The last component of the window definition is framing, allowing you to further limit the data over which the window function is calculated. There are two framing expressions: ROWS BETWEEN
and RANGE BETWEEN
. Using the framing, we can turn AVG from the above example to provide our first real use case when you might want to use window functions.
SELECT
reading_time,
sensor_id,
reading_value,
AVG(reading_value) OVER (
PARTITION BY sensor_id
ORDER BY reading_time
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS moving_avg_reading_value
FROM sensor_readings;
Implementing the moving average of the individual readings, taking into account a maximum of 3 rows including the current one.
reading_time | sensor_id | reading_value | moving_avg_reading_value
------------------------+-----------+---------------+--------------------------
2024-07-01 11:24:34+02 | 1 | 32.7 | 32.9000000000000000
2024-07-02 11:29:01+02 | 1 | 33.1 | 32.9666666666666667
2024-07-02 12:03:59+02 | 1 | 33.1 | 33.0666666666666667
2024-07-03 10:12:15+02 | 1 | 33.0 | 33.0500000000000000
2024-07-01 13:17:01+02 | 2 | 32.8 | 34.3000000000000000
2024-07-02 09:18:11+02 | 2 | 35.8 | 34.3000000000000000
2024-07-01 13:54:03+02 | 3 | 29.1 | 29.7000000000000000
2024-07-01 14:12:09+02 | 3 | 30.3 | 30.3000000000000000
2024-07-02 16:07:43+02 | 3 | 31.5 | 30.9000000000000000
Exploring Window functions
Now that we have a foundational understanding of window functions and their components, let’s dive into specific window functions. We’ll cover some of the most commonly used functions, such as ROW_NUMBER()
, RANK()
, DENSE_RANK()
, LAG()
, LEAD()
, FIRST_VALUE()
, LAST_VALUE()
and more. For each function, we’ll provide examples to illustrate their practical applications.
ROW_NUMBER
The ROW_NUMBER()
function assigns a unique sequential integer to rows within a partition of a result set, starting with one for the first row in each partition.
SELECT
sensor_id,
reading_time,
reading_value,
ROW_NUMBER() OVER (PARTITION BY sensor_id ORDER BY reading_time) AS row_num
FROM sensor_readings;
This makes it easy to find the first/last entries for a specified window. As an example, you can experiment with getting only the last reading for each hour.
SELECT
sensor_id,
reading_time,
reading_value
FROM (
SELECT
sensor_id,
reading_time,
reading_value,
ROW_NUMBER() OVER (
PARTITION BY sensor_id, date_trunc('hour', reading_time)
ORDER BY reading_time DESC
) AS row_num
FROM sensor_readings
) AS ranked_readings
WHERE row_num = 1;
RANK and DENSE_RANK
The RANK()
and DENSE_RANK()
functions are used to assign a rank to each row within a partition, based on the order of one or more values. These functions are useful when scoring the values and handling ties. The main difference between RANK
and DENSE_RANK
is how they deal with gaps.
Example use to find the highest reading_value per sensor:
SELECT
sensor_id,
reading_time,
reading_value,
RANK() OVER (PARTITION BY sensor_id ORDER BY reading_value DESC) AS rank
FROM sensor_readings;
If you consider the sorting for the sensor_id
1 in our sample seed the difference between RANK
and DENSE_RANK
is easy to demonstrate.
sensor_id | reading_time | reading_value | rank
-----------+------------------------+---------------+------
1 | 2024-07-02 11:29:01+02 | 33.1 | 1
1 | 2024-07-02 12:03:59+02 | 33.1 | 1
1 | 2024-07-03 10:12:15+02 | 33.0 | 3
1 | 2024-07-01 11:24:34+02 | 32.7 | 4
Giving a natural ranking with tie on the reading value 33.1 and leaving a gap on 2nd rank, whereas DENSE_RANK
wouldn't include the gap.
sensor_id | reading_time | reading_value | rank
-----------+------------------------+---------------+------
1 | 2024-07-02 11:29:01+02 | 33.1 | 1
1 | 2024-07-02 12:03:59+02 | 33.1 | 1
1 | 2024-07-03 10:12:15+02 | 33.0 | 2
1 | 2024-07-01 11:24:34+02 | 32.7 | 3
LAG
and LEAD
To evaluate previous or subsequent rows without the need to self-join the dataset, you can utilise the power of the window functions LAG
and LEAD
. These functions are particularly useful for calculating differences between rows, comparing current and previous values, or fetching future values for comparison.
The LAG
function provides access to a value in a previous row within the partition, while LEAD
provides access to a subsequent row.
SELECT
sensor_id,
reading_time,
reading_value,
LAG(reading_value, 1) OVER (PARTITION BY sensor_id ORDER BY reading_time) AS previous_value,
LEAD(reading_value, 1) OVER (PARTITION BY sensor_id ORDER BY reading_time) AS next_value
FROM sensor_readings;
In this query, we are using a value of 1, but you can choose any position necessary. By using LAG()
and LEAD()
, you can perform advanced analyses that require looking backward or forward within your dataset, making it easier to derive meaningful insights and trends.
FIRST_value
and LAST_VALUES
The functions FIRST_value
and LAST_VALUES
are similar, except they (as the name says) give the first/last value of the partition.
The FIRST_VALUE
is useful when comparing the partition values to the first value (for example opening price for a day), and LAST_VALUE
to identity the closing values.
Other Window functions
The complete list of the window functions is available in the documentation.
Re-using the window definition
As you might have noticed, the query we used to demonstrate LAG
and LEAD
was rather verbose. This was due to the repeated definition of the window for both columns. Luckily, the syntax of the window functions allows you to define and re-use the window definition.
SELECT
sensor_id,
reading_time,
reading_value,
LAG(reading_value, 1) OVER readings_window AS previous_value,
LEAD(reading_value, 1) OVER readings_window AS next_value
FROM sensor_readings
WINDOW readings_window AS (PARTITION BY sensor_id ORDER BY reading_time);
This way you can ensure the consistent window definition and consistency in complex queries.