Data Commons
Overview
Data Commons is an open source project that integrates data from various public sources to power contextual Google Search. Data Commons covers a variety of topics including demographic, economic, government spending, and environmental statistics at national, state, county, and municipal levels. Select sources include the World Bank, US Bureau of Labor Statistics, Center for Disease Control, United Nations, and US Drug Enforcement Agency.
Example topics covered:
- GDP
- Unemployment
- Household income
- Population
- Annual electricity consumption
Key Attributes
Geographic Coverage | Global |
Entity Level | Country, State, County, City, Census Tract, Zip Code, Core Based Statistical Area - Varies by underlying source |
Time Granularity | Daily, Weekly, Monthly, Quarterly - Varies by underlying source |
Release Frequency | Varies by underlying source |
History | Varies by underlying source |
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Notes & Methodology
The majority of the data centers around timeseries containing demographic, economic, government spending, and environmental statistics at national, state, county, and municipal levels. This data primarily revolves around geographic entities from the national, state, county, municipal, zip code, and census tract levels.
The timeseries table contains the core data and the GEOGRAPHY_INDEX
table contains human readable names for geographies. Variable attributes can be joined to the timeseries data for additional metadata about the variables themselves (measurement category, units, frequency, etc.).
In cases where a single measure is reported by more than one source, the variable_name
includes both the variable being measured and the source for the data. For example, “Total Population, un.org” and “Total Population, census.gov” both exist for US population estimates.
Cybersyn Products
The Data Commons tables above are available in the following Cybersyn data products.
Sample Queries
Compare economic statistics across different geographic levels
Show unemployment rates in New York City vs. New York state.
SELECT ts.date,
geo.geo_name,
geo.level,
ts.value
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE geo.geo_name = 'New York'
AND geo.level IN ('State', 'City')
AND ts.variable_name ILIKE 'Unemployment Rate%'
AND date >= '2015-01-01'
ORDER BY date;
Show populations of various geographies
Search populations of the United States, Canada, and Mexico since 2000, including human readable names.
SELECT att.variable_name,
geo.geo_name,
geo.geo_id,
date,
value
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.datacommons_attributes AS att ON (ts.variable = att.variable)
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE att.variable_group ='Total Population'
AND geo.geo_id IN ('country/USA', 'country/CAN', 'country/MEX')
AND date >= '2000-01-01'
ORDER BY date DESC;
Display available measures for cities
Explore all of the variables that are available at the core-based statistical area (CBSA) level. A CBSA is a geographic region in the US that contains a large population - typically cities and their surrounding areas.
SELECT DISTINCT variable_name
FROM cybersyn.datacommons_timeseries AS ts
JOIN cybersyn.geography_index AS geo ON (ts.geo_id = geo.geo_id)
WHERE level = 'CensusCoreBasedStatisticalArea';
Compare median income to median age by zip code
The complexity here comes from using latest available data for each variable. We filter independently for the latest value for each of the comparisons we want to make.
WITH income_age_data AS (
SELECT
geo.geo_name,
ts.geo_id,
ts.variable_name,
ts.value
FROM cybersyn.datacommons_timeseries AS ts
INNER JOIN cybersyn.geography_index AS geo
ON (ts.geo_id = geo.geo_id)
WHERE
geo.level = 'CensusZipCodeTabulationArea'
AND ts.variable_name IN ('Median Income for All Households, USD', 'Median Age of Population, census.gov')
QUALIFY ROW_NUMBER() OVER (PARTITION BY ts.geo_id, ts.variable_name ORDER BY ts.date DESC) = 1
)
SELECT
geo_name,
geo_id,
MAX(CASE WHEN variable_name = 'Median Income for All Households, USD' THEN value END) AS median_income,
MAX(CASE WHEN variable_name = 'Median Age of Population, census.gov' THEN value END) AS median_age
FROM income_age_data
GROUP BY
geo_name,
geo_id
HAVING median_income IS NOT NULL AND median_age IS NOT NULL
ORDER BY geo_name;
Disclaimers
The data in this product is sourced from Data Commons 2024, CDC Places, electronic dataset. Cybersyn has reformatted the data from Data Commons as licensed here.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact snowflake-public-data@snowflake.com for questions.