Featured image of post Discover and isolate unchanged rows in timeseries data with SQL

Discover and isolate unchanged rows in timeseries data with SQL

I few months back we finally had our solar power system installed. This is all quite nice from an ecological point of view but one of the best things is that it yields a lot of data. I like data.

So first thing was to set up TimescaleDB, Node-RED and Grafana in Docker and attach it all to SolarEdge’s monitoring portal. That worked great and I’m tracking production, load, import, export and WHAT THE HELL IS THAT FLAT LINE DOING THERE!?

Even 20 seconds of unchanged values is suspect let alone several minutes 🤨

While it turns out that the inverter eventually updates all values, the monitoring portal itself can be somewhat flaky.

This his what the result set looks like:

+---------------------------------+----+----+---+------+------+
|time                             |grid|load|pv |import|export|
+---------------------------------+----+----+---+------+------+
|2022-01-15 08:20:01.698941 +00:00|220 |260 |40 |220   |0     |
|2022-01-15 08:20:11.678517 +00:00|1500|1530|30 |1500  |0     |
|2022-01-15 08:20:21.927625 +00:00|1500|1530|30 |1500  |0     |
|2022-01-15 08:20:31.638271 +00:00|1500|1530|30 |1500  |0     |
+---------------------------------+----+----+---+------+------+
 ... this just goes on and on every 10 seconds ...
+---------------------------------+----+----+---+------+------+
|2022-01-15 09:46:21.969070 +00:00|1500|1530|30 |1500  |0     |
|2022-01-15 09:46:31.960681 +00:00|1500|1530|30 |1500  |0     |
|2022-01-15 09:46:41.981632 +00:00|250 |300 |50 |250   |0     |
|2022-01-15 09:46:51.926296 +00:00|370 |260 |630|0     |370   |
|2022-01-15 09:47:01.943392 +00:00|360 |260 |620|0     |360   |
+---------------------------------+----+----+---+------+------+

This raises a couple of questions:

  • How to detect if rows had repeated values over time?
  • What are the time periods that need to be reconstructed?

Step 1: Detect unchanged rows

For the sake of readability I’ll stick to one column in this example. I’ll go with grid in this one since it’d be quite unusual to have the exact same energy usage over time. The real query is a bit more complex of course but you get the gist.

SELECT s."time",
       s.grid                                         AS current,
       lag(s.grid) OVER w1                            AS previous,
       coalesce(s.grid <> lag(s.grid) OVER w1, FALSE) AS changed
FROM solaredge AS s
    WINDOW w1 AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
ORDER BY "time";

There are a couple of things happening here: First thing to notice is the WINDOW FUNCTION. It gives us an additional way to access our results via a set of functions. (Think of it as aggregate functions without the aggregation …) Our window w1 is defined as ROWS BETWEEN 1 PRECEDING AND CURRENT ROW which will let us access the current row and its predecessor. This can be done using lag(<column>) OVER w1. We’re now able to compare the current value of s.grid with its previous value using s.grid <> lag(s.grid) OVER w1.

Next we wrap this comparison with coalesce(). The first row won’t have a preceding row, so lag() will yield NULL. While it’s usually used to return the first non null value in a list of arguments, it can also be used to substitue a default value for null. We go with false.

Our result now looks like this:

+---------------------------------+-------+--------+-------+
|time                             |current|previous|changed|
+---------------------------------+-------+--------+-------+
|2022-01-15 08:20:11.678517 +00:00|1500   |220     |true   |
|2022-01-15 08:20:21.927625 +00:00|1500   |1500    |false  |
|2022-01-15 08:20:31.638271 +00:00|1500   |1500    |false  |
|2022-01-15 08:20:41.648238 +00:00|1500   |1500    |false  |
|2022-01-15 08:20:51.607577 +00:00|1500   |1500    |false  |
 ...
|2022-01-15 09:46:31.960681 +00:00|1500   |1500    |false  |
|2022-01-15 09:46:41.981632 +00:00|250    |1500    |true   |
+---------------------------------+-------+--------+-------+

Step 2: Get time period for reconstruction

To get the time periods we can simply wrap the previous query as a subquery and select only the rows WHERE changed IS TRUE.

SELECT *
FROM (SELECT s."time",
             grid                                           AS current,
             lag(s.grid) OVER w1                            AS previous,
             coalesce(s.grid <> lag(s.grid) OVER w1, FALSE) AS changed
      FROM solaredge AS s
          WINDOW w1 AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
      ORDER BY "time") s1
WHERE changed IS TRUE;

Giving us …

+---------------------------------+-------+--------+-------+
|time                             |current|previous|changed|
+---------------------------------+-------+--------+-------+
|2022-01-15 08:20:11.678517 +00:00|1500   |220     |true   |
|2022-01-15 09:46:41.981632 +00:00|250    |1500    |true   |
|2022-01-15 09:46:51.926296 +00:00|370    |250     |true   |
|2022-01-15 09:47:01.943392 +00:00|360    |370     |true   |
|2022-01-15 09:47:11.918865 +00:00|380    |360     |true   |
+---------------------------------+-------+--------+-------+

To finally get the start- and end-times of our missing data, we wrap this one once more:

SELECT "time"                        AS start,
       lead("time") OVER w2          AS end,
       lead("time") OVER w2 - "time" AS duration
FROM (SELECT *
      FROM (SELECT s."time",
                   grid                                           AS current,
                   lag(s.grid) OVER w1                            AS previous,
                   coalesce(s.grid <> lag(s.grid) OVER w1, FALSE) AS changed
            FROM solaredge AS s
                WINDOW w1 AS (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
            ORDER BY "time") s1
      WHERE changed IS TRUE) s3
    WINDOW w2 AS (ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING);

As you can see, we apply another window function, but this time ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING, covering the current line and the following one. This one can be accessed using lead() instead of lag(). Last thing is to subtract "time" from lead("time") and we’re done:

+---------------------------------+---------------------------------+----------------------------------------------------+
|start                            |end                              |duration                                            |
+---------------------------------+---------------------------------+----------------------------------------------------+
|2022-01-15 08:20:11.678517 +00:00|2022-01-15 09:46:41.981632 +00:00|0 years 0 mons 0 days 1 hours 26 mins 30.303115 secs|
|2022-01-15 09:46:41.981632 +00:00|2022-01-15 09:46:51.926296 +00:00|0 years 0 mons 0 days 0 hours 0 mins 9.944664 secs  |
|2022-01-15 09:46:51.926296 +00:00|2022-01-15 09:47:01.943392 +00:00|0 years 0 mons 0 days 0 hours 0 mins 10.017096 secs |
|2022-01-15 09:47:01.943392 +00:00|2022-01-15 09:47:11.918865 +00:00|0 years 0 mons 0 days 0 hours 0 mins 9.975473 secs  |
|2022-01-15 09:47:11.918865 +00:00|2022-01-15 09:47:35.931409 +00:00|0 years 0 mons 0 days 0 hours 0 mins 24.012544 secs |
+---------------------------------+---------------------------------+----------------------------------------------------+
Built with Hugo
Theme Stack designed by Jimmy