GitHub
Overview
Open-source developers write code and documentation that they make publicly available. GitHub (GH) is among the most popular website for storing and publishing this data. GH Archive is a project to record public GitHub activity and make it easily accessible.
Topics covered across users and repositories:
- Stars
- Pull Requests
- Issues
Key Attributes
Geographic Coverage | Global |
Entity Level | GitHub Repository, GitHub User |
Time Granularity | Daily |
Update Frequency | Daily at 11pm ET |
History | Since February 12, 2011 |
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Notes
The GitHub Archive is a historical record of public activities on GitHub. All GitHub actions, such as push operations, forks, issues opened, and pull requests (along with the acted upon user, repository, etc.) across all public repositories on GitHub are included in a single schema. The number of GitHub stars is aggregated for active repositories by date in the github_stars
table. These aggregations are based on a sum of actions from the github_events
table. The github_repos
table provides a reference table of unique repositories found in the github_events
table by repo_id
.
Snowflake Blog: Essential tips for querying GitHub Archive from Snowflake Developer Advocate, Felipe Hoffa.
Incomplete Repo Names: In the GitHub events dataset, a repository name should appear as a combination of the repository organization and name. In 2018, there are repository names that are incomplete. For example, rust-lang/ appeared without its full repository name, rust-lang/rust, for ~3 months (see query below). Additionally, in 2019, some repository names appeared with duplicate organizations (e.g. nolimits4web/nolimits4web/swiper). In both cases, GitHub implemented a go-forward fix.
As a workaround, we added a new field, last_seen
, to the github_repos
table that includes the last date a repo_name
was used. This field allows you to identify the latest repo_name
for a particular repo_id
. Please note, there are some instances where duplication is correct, depending on the repository owner's naming conventions. Additionally, Cybersyn does not make any modifications to the original GitHub events data. The github_repos
table was built to aid in the cleaning process that users may employ.
SELECT
repo_id,
repo_name,
first_seen,
last_seen
FROM cybersyn.github_repos
WHERE repo_id = 724712
ORDER BY last_seen DESC;
Example Use Cases & Queries
Find top starred repos
Pull the repos with the most stars in the past year
WITH latest_repo_name AS (
SELECT repo_name,
repo_id
FROM cybersyn.github_repos
QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT repo.repo_name,
repo.repo_id,
SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 50;
Disclaimers
The data in this dataset is sourced from GitHub Archive.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact snowflake-public-data@snowflake.com for questions.