Comment on page
Cybersyn provides point-in-time tables of all data historically published in our data products. We refer to these point-in-time tables as “history tables.” This is a premium feature for our public domain datasets, available in Cybersyn Public Domain Pro.
History tables create an auditable record of when and how data changed over time. Whether data changed because the underlying source restated historical values, Cybersyn backfilled older history, or a data vendor corrected a previously-undetected error in their methodology, accessing a full history of what data was available when allows for point-in-time backtesting and creates accountability and transparency that helps clients understand changes in underlying data.
Why are point-in-time history tables useful?
- You want to conduct backtesting and need an accurate representation of what data would have been available to you at a historical point in time
- You want to understand how the underlying data source has revised values over time
- You want to audit what your downstream systems were consuming from the Cybersyn datashare at any point in time
Each month the US Bureau Of Labor Statistics releases an Employment Situation report that includes the change in the number of nonfarm payroll jobs nationally. These reports come out monthly, but the data is noted to be a “preliminary estimate” until finalized numbers are published 2 months later. Consider the real example below using reports released on Aug 4, Sep 1, and Oct 6, 2023:
If someone asked “How many jobs were added in the US in July 2023?” the best known answer would vary depending on when the person asked. If they asked on August 15th the answer would be 187k, but if they asked today, the revised answer would be 236k. This is critical for understanding why historical numbers might change over time and why you need to back-test data (since you would not have known future revisions when the data was first published).
The numerous revisions complicate what should otherwise be a simple univariate timeseries of employment numbers by month. Cybersyn’s history tables solve this.
Cybersyn publishes the most up to date data in snapshot tables which reflect the best known values as of today. We also record and timestamp each value we have historically published in history tables. History tables, which are denoted with the suffix
_history, can be queried to recreate what Cybersyn’s data would have looked like at any given time.
The history tables include all of the same fields as the current snapshot tables in addition to
_effective_end_timestampfields. These timestamps represent the time that Cybersyn first published that particular row-level data and the last time the row was valid in the data, respectively.
Note that the latest, currently valid data has an
_effective_end_timestampvalue of NULL. In addition to updated rows receiving start and end time timestamps, deleted rows also receive the same with the
_effective_end_timestampbeing the run time during which the row or value was deleted. The effective timestamps are recorded in Eastern Time and include the offset from UTC.
Using the above employment data as an example, the rows in the history table would look like this:
From this history table a user can understand when data was updated, how data has changed over time, and what data looked like at a given date. Assuming the above sample data is in a table called
t1_history, the following queries are examples of how a user could use the point-in-time history tables:
Pull all historical revisions to a particular data point
Check the history for a single datapoint. Pull all past values that were previously reported for the particular variable and date combination.
WHERE date = '2023-07-31'
AND variable = 'Change in Employees'
ORDER BY _effective_start_timestamp;
Filter to current data
Note, you could equivalently use the snapshot tables to recover this result. This query is for demonstration purposes.
WHERE _effective_end_timestamp IS NULL;
View values as of a specific date or time
View all data as it would have appeared on 2023-09-15 at 9:30am ET. The logic here filters to records that were both 1) uploaded before that time, and 2) last valid after that time or are still valid.
WHERE _effective_start_timestamp <= '2023-09-15 09:30:00'
AND (_effective_end_timestamp >= '2023-09-15 09:30:00' OR _effective_end_timestamp IS NULL);
Get timestamps for all updates to Cybersyn’s data
SELECT DISTINCT _effective_start_timestamp
ORDER BY _effective_start_timestamp;
Last modified 1mo ago