Data Commons

Overview

Data Commons is an open source project that integrates data from various public sources to power contextual Google Search. Data Commons covers a variety of topics including demographic, economic, government spending, and environmental statistics at national, state, county, and municipal levels. Select sources include the World Bank, US Bureau of Labor Statistics, Center for Disease Control, United Nations, and US Drug Enforcement Agency.

Example topics covered:

  • GDP

  • Unemployment

  • Household income

  • Population

  • Annual electricity consumption

Key Attributes

Geographic Coverage

Global

Entity Level

Country, State, County, City, Census Tract, Zip Code, Core Based Statistical Area - Varies by underlying source

Time Granularity

Daily, Weekly, Monthly, Quarterly - Varies by underlying source

Release Frequency

Varies by underlying source

History

Varies by underlying source

As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.

Notes & Methodology

The majority of the data centers around timeseries containing demographic, economic, government spending, and environmental statistics at national, state, county, and municipal levels. This data primarily revolves around geographic entities from the national, state, county, municipal, zip code, and census tract levels.

The timeseries table contains the core data and the GEOGRAPHY_INDEX table contains human readable names for geographies. Variable attributes can be joined to the timeseries data for additional metadata about the variables themselves (measurement category, units, frequency, etc.). In cases where a single measure is reported by more than one source, the variable_name includes both the variable being measured and the source for the data. For example, β€œTotal Population, un.org” and β€œTotal Population, census.gov” both exist for US population estimates.

Tables & Sources

TablesSources

DATACOMMONS_ATTRIBUTES DATACOMMONS_TIMESERIES

GEOGRAPHY_RELATIONSHIPS GEOGRAPHY_CHARACTERISTICS GEOGRAPHY_INDEX GEOGRAPHY_HIERARCHY GEOGRAPHY_OVERLAPS

Cybersyn Products

The Data Commons tables above are available in the following Cybersyn data products.

Examples & Sample Queries

Compare economic statistics across different geographic levels

Show unemployment rates in New York City vs. New York state.

SELECT ts.date,
       geo.geo_name,
       geo.level,
       ts.value
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE geo.geo_name = 'New York'
  AND geo.level IN ('State', 'City')
  AND ts.variable_name ILIKE 'Unemployment Rate%'
  AND date >= '2015-01-01'
ORDER BY date;

Show populations of various geographies

Search populations of the United States, Canada, and Mexico since 2000, including human readable names.

SELECT att.variable_name,
       geo.geo_name,
       geo.geo_id,
       date,
       value
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.datacommons_attributes AS att ON (ts.variable = att.variable)
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE att.variable_group ='Total Population'
  AND geo.geo_id IN ('country/USA', 'country/CAN', 'country/MEX')
  AND date >= '2000-01-01'
ORDER BY date DESC;

Display available measures for cities

Explore all of the variables that are available at the core-based statistical area (CBSA) level. A CBSA is a geographic region in the US that contains a large population - typically cities and their surrounding areas.

SELECT DISTINCT variable_name
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE level = 'CensusCoreBasedStatisticalArea';

Compare median income to median age by zip code

The complexity here comes from using latest available data for each variable. We filter independently for the latest value for each of the comparisons we want to make.

WITH income_age_data AS (
    SELECT
        geo.geo_name,
        ts.geo_id,
        ts.variable_name,
        ts.value
    FROM cybersyn.datacommons_timeseries AS ts
    INNER JOIN cybersyn.geography_index AS geo
        ON (ts.geo_id = geo.geo_id)
    WHERE
        geo.level = 'CensusZipCodeTabulationArea'
        AND ts.variable_name IN ('Median Income for All Households', 'Median Age of Population, census.gov')
    QUALIFY ROW_NUMBER() OVER (PARTITION BY ts.geo_id, ts.variable_name ORDER BY ts.date DESC) = 1
)

SELECT
    geo_name,
    geo_id,
    MAX(CASE WHEN variable_name = 'Median Income for All Households' THEN value END)     AS median_income,
    MAX(CASE WHEN variable_name = 'Median Age of Population, census.gov' THEN value END) AS median_age
FROM income_age_data
GROUP BY
    geo_name,
    geo_id
HAVING median_income IS NOT NULL AND median_age IS NOT NULL
ORDER BY geo_name;

Disclaimers

The data in this product is sourced from Data Commons 2024, CDC Places, electronic dataset. Cybersyn has reformatted the data from Data Commons as licensed here.

Cybersyn is not endorsed by or affiliated with any of these providers. Contact support@cybersyn.com for questions.

Last updated

Copyright Β© 2024 Cybersyn