Building a timeseries data pipeline

I’ve long had a dream of a simple system that stores and organises the vast amount of public time-series data published by StatsNZ and other sources (e.g. this large dump of CSV files). This type of data measures key economic and social indicators at regular intervals (usually monthly or quarterly). It would be very handy to have local, up-to-date copies of this data in an easily usable form for analysis like tracking trends and forecasting. I also wanted to link the database of time-series to some sort of tool for easily making online data visualisations, and have the visualisations stay up to date as new data is released. Recently I finally started to work on this idea. Here is a quick summary of what I built and how it works.

If you want to see an example of something I made with the tools described below, have a look at my NZ inflation explorer, then come back here to read about the details.

Technology

The three key bits of technology that I built on are DuckDB, Observable Framework, and of course R:

  • DuckDB is a very simple and lightweight SQL database that runs in-process as required. It takes a lot of the headaches out of using a database as you don’t need to set up and maintain a fully-fledged server like MySQL or PostgreSQL. For the size of data I’m working with (maybe 100,000 data points), storage isn’t an issue, and DuckDB is very fast. It provides just enough SQL features to get the job done and also offers convenient packages for interfacing with it in R and Python. In R, you can also use it as a data source for dbplyr to reduce the need to write SQL queries (there’s a small sketch of this after this list).
  • Observable Framework is a very new open-source tool from Observable that simplifies the process of making web-based data visualisations using Observable’s other tools like Observable Plot. Observable Framework does two key things. First, it generates static web pages for dashboards and data visualisation reports. These are self-contained and can be easily hosted on your own site. When it builds the static pages for a dashboard or report, it takes care of most of the fiddly bits like making the pages look good and be responsive on different devices. Second, it builds all the data required into the visualisation itself, as one or more static data files. This means that when anyone views the visualisation, it doesn’t need to load data from a server first, which makes the whole thing more responsive and stable. There are other ways to achieve this, but Observable Framework makes it super easy by using ‘data loaders’ that grab the latest version of data each time you re-build the static site.
  • R is the glue that binds everything together. I wrote R code to download, clean, and standardise data from its original sources and push it to DuckDB. I also used R to write data loaders for Observable Framework that get the data required for a visualisation out of DuckDB again. I used R for this because I know it, but Python or other languages would work just as well.
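
To make the DuckDB point above concrete, here is a minimal sketch of querying the database from R via dbplyr. The file name, table, and column names are illustrative rather than my actual schema.

```r
library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

# Open (or create) the database file; DuckDB runs in-process, no server needed
con <- dbConnect(duckdb::duckdb(), dbdir = "timeseries.duckdb")

# Treat a table as a lazy dplyr data source; dbplyr translates the pipeline to SQL
observations <- tbl(con, "observations")

cpi_recent <- observations |>
  filter(series_id == "CPI.all_groups", date >= "2020-01-01") |>
  arrange(date) |>
  collect()  # only now does the query run and pull the results into R

dbDisconnect(con, shutdown = TRUE)
```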

The diagram below summarises how it all fits together. There’s a separate ‘data cleaner’ for each data source because they all have their own unique quirks. In the database, everything is clean, tidy, and standardised across datasets. Each data loader is specific to a data visualisation (dashboard or report); it extracts the required data from the database and does any other transformations or manipulations the visualisation needs.

Design choices

I decided to store metadata and numeric data for each time-series in separate tables in DuckDB (linked by a unique series id). Metadata captures all non-numeric aspects of the data, like text descriptions of the time-series and information about the relationships among time-series for those that are part of hierarchical data frameworks. For now, I’ve handled hierarchies in a simple way by tagging each time-series with the id of its ‘parent’ series in the hierarchy (if any). This allows hierarchies of data series (e.g. the different levels of Consumer Price Index data) to be expressed as a ‘rooted tree’ network graph. This makes it easy to do things that might be useful for analysing the data, like finding all series at a certain level in a hierarchy, or finding all the outermost ‘leaves’ of the tree (i.e. the most narrowly defined or most detailed time-series in the hierarchy). In R, the tidygraph package is great for this sort of thing.
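
As a rough illustration of this design, here is a sketch of the two tables and of using tidygraph to pull out the leaf series. The table and column names are made up for the example; the real schema may differ.

```r
library(DBI)
library(duckdb)
library(dplyr)
library(tidygraph)

con <- dbConnect(duckdb::duckdb(), dbdir = "timeseries.duckdb")

# Metadata: one row per series, with a parent_id link expressing the hierarchy
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS series_metadata (
    series_id   VARCHAR PRIMARY KEY,
    description VARCHAR,
    frequency   VARCHAR,  -- 'monthly', 'quarterly', 'annual'
    parent_id   VARCHAR   -- NULL for the root of a hierarchy
  )")

# Numeric data: one row per series per date
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS observations (
    series_id VARCHAR,
    date      DATE,
    value     DOUBLE
  )")

# Build the rooted tree from the parent links and find the leaf series,
# i.e. the most detailed series in the hierarchy
meta <- dbGetQuery(con, "SELECT series_id, parent_id FROM series_metadata")

nodes <- meta |> transmute(name = series_id)
edges <- meta |>
  filter(!is.na(parent_id)) |>
  transmute(from = match(parent_id, nodes$name),
            to   = match(series_id, nodes$name))

tree <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

leaves <- tree |>
  activate(nodes) |>
  mutate(out_degree = centrality_degree(mode = "out")) |>
  as_tibble() |>
  filter(out_degree == 0) |>   # no children = a leaf of the hierarchy
  pull(name)

dbDisconnect(con, shutdown = TRUE)
```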

The other design decisions I made aim to keep the system as simple as possible. I decided to store only the latest version of each time-series. In practice, time-series data changes over time as new data points get added and sometimes existing data points are revised. This means different versions of a time-series existed at different points in time in the past. It’s possible to keep track of these versions so you can ‘roll back’ a dataset to see exactly what it looked like at any previous point in time. I decided that the added complexity of keeping track of historic versions wasn’t worth it for my use cases. So whenever new data for a time-series is available, all existing data for that series in the database is replaced with the new data. I’m aware this means that errors in a data update could wipe out previously valid data, but given the types of data sources I’m using for now (official statistics), this is unlikely, or would be corrected very quickly if it happened.
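
In code, the ‘latest version only’ rule is just a delete-and-append. Here’s a hedged sketch using the illustrative observations table from earlier, wrapped in a transaction so a failed update doesn’t leave a half-replaced series.

```r
library(DBI)
library(duckdb)

# Replace everything stored for one series with the freshly downloaded data.
# 'new_data' is assumed to be a data frame with series_id, date, value columns.
replace_series <- function(con, id, new_data) {
  dbWithTransaction(con, {
    dbExecute(con, "DELETE FROM observations WHERE series_id = ?",
              params = list(id))
    dbAppendTable(con, "observations", new_data)
  })
}
```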

Another limitation is that my framework only works with regular time-series that are measured on a fixed monthly, quarterly, or annual schedule. It’s not set up to handle data that is more frequent than daily (I’m only storing dates, not timestamps). For public time-series data of the sort published by statistics agencies this isn’t a problem, and building a system that can cope with real-time or ad-hoc frequencies would be a lot more complicated.

Making it go

The ‘data cleaner’ R code is mostly custom code for each source. It downloads a data file and checks whether it’s different from the previous download of the same file (if any). If there are any differences, the whole thing is loaded into the database, updating any data that was already there and appending anything new. Each data source requires its own bespoke cleaning to separate the data and metadata, standardise column names, reformat dates, and fix other oddities. (Fun fact: in some StatsNZ datasets, dates in October are coded as “1” for the month while January dates are coded as “01”. Funner fact: StatsNZ knows about this, but they won’t fix it in case it breaks the code that anyone has written to clean up the errors in their data!)
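
As an illustration (not my actual cleaner code), the ‘has anything changed?’ check can be as simple as comparing a hash of the fresh download against the previous one, and the month quirk needs a fix that keys off the literal string rather than plain zero-padding. Paths and names here are made up for the example.

```r
# Download a source file and report whether it differs from the last download
download_if_changed <- function(url, dest, hash_file) {
  tmp <- tempfile(fileext = ".csv")
  download.file(url, tmp, quiet = TRUE)

  new_hash <- unname(tools::md5sum(tmp))
  old_hash <- if (file.exists(hash_file)) readLines(hash_file)[1] else ""

  if (identical(new_hash, old_hash)) {
    return(FALSE)  # nothing changed, so skip the cleaning and load step
  }
  file.copy(tmp, dest, overwrite = TRUE)
  writeLines(new_hash, hash_file)
  TRUE
}

# The month-coding quirk: "1" means October while "01" means January,
# so naive zero-padding would silently corrupt the dates
fix_month <- function(m) ifelse(m == "1", "10", m)
```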

The ‘data loader’ code is more straightforward. It just grabs whatever data is needed for a visualisation from DuckDB, maybe does some calculations or transformations, and dumps the output in CSV format. A data visualisation in Observable Framework then loads this data and does whatever is needed to show charts, etc., using Observable Plot or other Observable tools.
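
A data loader ends up being a very short script. Here’s a hedged sketch of what one might look like in R (for a loader file named something like cpi.csv.R); Observable Framework runs it at build time and captures whatever it writes to standard output as the static data file. The query and series names are illustrative.

```r
library(DBI)
library(duckdb)

# Read-only connection: the loader only ever queries the database
con <- dbConnect(duckdb::duckdb(), dbdir = "timeseries.duckdb", read_only = TRUE)

dat <- dbGetQuery(con, "
  SELECT m.description, o.date, o.value
  FROM observations o
  JOIN series_metadata m USING (series_id)
  WHERE m.parent_id = 'CPI.all_groups'
  ORDER BY m.description, o.date
")

dbDisconnect(con, shutdown = TRUE)

# Write CSV to stdout so Observable Framework can capture it
write.csv(dat, file = stdout(), row.names = FALSE)
```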

Next steps

I’ve only pulled in a few time-series datasets so far, so the next task is to import a lot more. Since each dataset has its own little quirks, it’s a bit time-consuming to write and test the cleaning code, but that’s a one-off task (until something changes in the source format and breaks it).

For now, the process to update each dataset will be run manually, so I can keep an eye on things and fix any errors. Once it seems to be working smoothly, I’ll wrap the data cleaners in a master script that runs them all on a regular schedule, and produces a summary of what changed so I can keep tabs on things.