
Unified Schema

Our goal is to create a single, unified schema across public and proprietary datasets. The Cybersyn schema aims to balance the flexibility needed to accommodate arbitrarily shaped data with consistency in its core tables. Any user who has worked with one Cybersyn Data Product should feel oriented in a new dataset.
Cybersyn Data Products are built around two concepts: entities and timeseries.
  • Entities are concrete things or objects (a geography, a company, a mortgage application).
  • Timeseries are abstract measures (i.e., statistics) related to an entity and a date.
The two core tables in Cybersyn Data Products are an index table, which contains all entities of a certain type, and a timeseries table, which contains the dates and values of all timeseries that refer to that entity type. Additional tables, such as the relationships table and the attributes table, describe the entities and timeseries in the core tables.
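As a rough illustration of how the two core tables fit together, the sketch below builds a tiny index table and timeseries table in an in-memory SQLite database and joins them on the entity identifier. Every table name, column name, and value here (geography_index, geography_timeseries, geo_id, variable, and so on) is a hypothetical stand-in chosen for the example, not the actual Cybersyn schema.

```python
import sqlite3

# All table names, column names, and rows are hypothetical and for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE geography_index (
        geo_id   TEXT PRIMARY KEY,  -- unique entity identifier
        geo_name TEXT,              -- long-lived characteristic, stored wide
        level    TEXT               -- long-lived characteristic, stored wide
    );
    CREATE TABLE geography_timeseries (
        geo_id   TEXT,              -- refers to geography_index.geo_id
        variable TEXT,              -- timeseries identifier
        date     TEXT,              -- ISO-8601 date
        value    REAL
    );
""")

conn.executemany(
    "INSERT INTO geography_index VALUES (?, ?, ?)",
    [("geo/1", "Exampleland", "Country"), ("geo/2", "Sample State", "State")],
)
conn.executemany(
    "INSERT INTO geography_timeseries VALUES (?, ?, ?, ?)",
    [
        ("geo/1", "population", "2021-01-01", 100.0),
        ("geo/1", "population", "2022-01-01", 110.0),
        ("geo/2", "population", "2022-01-01", 40.0),
    ],
)

# One row per entity in the index table; many rows per entity in the timeseries table.
rows = conn.execute("""
    SELECT idx.geo_name, ts.variable, ts.date, ts.value
    FROM geography_timeseries AS ts
    JOIN geography_index AS idx ON idx.geo_id = ts.geo_id
    ORDER BY idx.geo_name, ts.date
""").fetchall()

for row in rows:
    print(row)
```

The join key is the entity identifier, which is why the same identifier can also be used to combine tables from different datasets.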
Cybersyn Terms
Entities (index table): Entities are distinct, independent things about which Cybersyn provides data, such as websites, companies, or geographies. The index table for a given entity type contains all possible instances of that entity type, each with a unique identifier. The unique identifier, _id, should be used to join across datasets. The index table also contains permanent or long-lived characteristics describing an entity.
Table structure: Each row represents a distinct entity. The table is wide: immutable characteristics are expressed in their own fields.
Relationships: Links between two entities. These links can be hierarchical (e.g., a geography contained within another geography) or not (e.g., a geography overlapping with another geography). Relationships can also be temporal, that is, valid only for an interval defined by specific start and end dates.
Table structure: Each row represents a relationship or characteristic of an entity. The table is long: every distinct relationship or characteristic of an entity has its own row, along with its associated metadata (e.g., the start and end date of the relationship). A distinct entity therefore appears multiple times in this table, once for every characteristic and relationship it has.
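To make the long format concrete, here is a minimal sketch of relationship rows and a date filter over them. The field names (entity_id, relationship_type, related_entity_id, value, start_date, end_date), the convention that a missing end_date means "still valid", and the sample rows are all assumptions made for this example rather than the actual column layout.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class RelationshipRow(NamedTuple):
    """One row of a hypothetical long-format relationships table."""
    entity_id: str
    relationship_type: str
    related_entity_id: Optional[str]  # None for a characteristic (no second entity)
    value: Optional[str]              # used by characteristics
    start_date: Optional[date]
    end_date: Optional[date]          # assumption: None means still valid


def valid_on(rows: List[RelationshipRow], entity_id: str, on_date: date) -> List[RelationshipRow]:
    """Return the rows for entity_id whose validity interval contains on_date."""
    return [
        r for r in rows
        if r.entity_id == entity_id
        and (r.start_date is None or r.start_date <= on_date)
        and (r.end_date is None or on_date <= r.end_date)
    ]


# The same entity appears once per relationship or characteristic.
rows = [
    RelationshipRow("geo/2", "contained_in", "geo/1", None, date(2000, 1, 1), None),
    RelationshipRow("geo/2", "overlaps", "geo/3", None, date(2010, 1, 1), date(2015, 12, 31)),
    RelationshipRow("geo/2", "status", None, "active", date(2005, 1, 1), None),
]

current = valid_on(rows, "geo/2", date(2020, 1, 1))
for r in current:
    print(r.relationship_type, r.related_entity_id, r.value)
```

Filtering on the interval rather than on a single date column is what lets one table hold both current and historical relationships.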
Characteristics: Descriptors of an entity that are temporal, meaning they have a start date and an end date. For convenience, characteristics are included in the relationships table, with the difference that a characteristic row does not explicitly reference another entity id. If a characteristic is immutable, it can instead be included in wide form in the index table.
Table structure: See Relationships.
Timeseries: Timeseries are temporal statistics or measures tied to an entity and a timestamp. Timeseries are abstract concepts (i.e., measures) rather than concrete things. Each timeseries is identified by an id that can be used to join to the attributes table, which describes the timeseries in a structured form. A timeseries may have more than one entity_id (e.g., geography + company).
Table structure: Each row represents a distinct timeseries, date, and value. Every timeseries id therefore has multiple rows, one for each value in the timeseries.
Attributes: Attributes are descriptors of a timeseries. The structured, wide fields of the attributes table can be used to filter for the ids of the desired timeseries. An attribute is the equivalent of a characteristic, but for the abstract timeseries rather than the concrete entity.
Table structure: Each row represents a distinct timeseries, along with the attributes that describe that timeseries in a wide format. There is a single row for each distinct timeseries.
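The usual pattern this implies is to filter the attributes table first and then pull the matching rows from the timeseries table. The sketch below shows that pattern with plain Python records; the field names (variable, measure, frequency, unit, geo_id) and all values are hypothetical and only stand in for whatever a given dataset actually exposes.

```python
# Hypothetical attributes table: one wide row per timeseries.
attributes = [
    {"variable": "pop_total_annual", "measure": "Population", "frequency": "Annual", "unit": "Count"},
    {"variable": "pop_total_monthly", "measure": "Population", "frequency": "Monthly", "unit": "Count"},
    {"variable": "unemp_rate_monthly", "measure": "Unemployment rate", "frequency": "Monthly", "unit": "Percent"},
]

# Hypothetical timeseries table: one long row per timeseries id, date, and value.
timeseries = [
    {"geo_id": "geo/1", "variable": "pop_total_annual", "date": "2021-12-31", "value": 100.0},
    {"geo_id": "geo/1", "variable": "pop_total_annual", "date": "2022-12-31", "value": 110.0},
    {"geo_id": "geo/1", "variable": "pop_total_monthly", "date": "2022-12-31", "value": 110.0},
    {"geo_id": "geo/1", "variable": "unemp_rate_monthly", "date": "2022-12-31", "value": 4.2},
]

# Step 1: use the wide attribute fields to find the timeseries ids of interest.
wanted_ids = {
    a["variable"]
    for a in attributes
    if a["measure"] == "Population" and a["frequency"] == "Annual"
}

# Step 2: keep only the timeseries rows that belong to those ids.
selected = [row for row in timeseries if row["variable"] in wanted_ids]

for row in selected:
    print(row["geo_id"], row["variable"], row["date"], row["value"])
```

Because the attributes table has exactly one row per timeseries, filtering it first keeps the selection logic separate from the much larger table of dated values.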