Skip to main content

GitHub

See on Snowflake

Overview

Open-source developers write code and documentation that they make publicly available. GitHub (GH) is among the most popular website for storing and publishing this data. GH Archive is a project to record public GitHub activity and make it easily accessible.

Topics covered across users and repositories:

  • Stars
  • Pull Requests
  • Issues

Key Attributes

Geographic CoverageGlobal
Entity LevelGitHub Repository, GitHub User
Time GranularityDaily
Update FrequencyDaily at 11pm ET
HistorySince February 12, 2011
info

As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.

Notes

The GitHub Archive is a historical record of public activities on GitHub. All GitHub actions, such as push operations, forks, issues opened, and pull requests (along with the acted upon user, repository, etc.) across all public repositories on GitHub are included in a single schema. The number of GitHub stars is aggregated for active repositories by date in the github_stars table. These aggregations are based on a sum of actions from the github_events table. The github_repos table provides a reference table of unique repositories found in the github_events table by repo_id.

Snowflake Blog: Essential tips for querying GitHub Archive from Snowflake Developer Advocate, Felipe Hoffa.

Incomplete Repo Names: In the GitHub events dataset, a repository name should appear as a combination of the repository organization and name. In 2018, there are repository names that are incomplete. For example, rust-lang/ appeared without its full repository name, rust-lang/rust, for ~3 months (see query below). Additionally, in 2019, some repository names appeared with duplicate organizations (e.g. nolimits4web/nolimits4web/swiper). In both cases, GitHub implemented a go-forward fix.

As a workaround, we added a new field, last_seen, to the github_repos table that includes the last date a repo_name was used. This field allows you to identify the latest repo_name for a particular repo_id. Please note, there are some instances where duplication is correct, depending on the repository owner's naming conventions. Additionally, Cybersyn does not make any modifications to the original GitHub events data. The github_repos table was built to aid in the cleaning process that users may employ.

SELECT
repo_id,
repo_name,
first_seen,
last_seen
FROM cybersyn.github_repos
WHERE repo_id = 724712
ORDER BY last_seen DESC;

Example Use Cases & Queries

Find top starred repos

Pull the repos with the most stars in the past year

WITH latest_repo_name AS (
SELECT repo_name,
repo_id
FROM cybersyn.github_repos
QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT repo.repo_name,
repo.repo_id,
SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 50;

Disclaimers

The data in this dataset is sourced from GitHub Archive.

Cybersyn is not endorsed by or affiliated with any of these providers. Contact snowflake-public-data@snowflake.com for questions.