GitHub

GithHub stars, pulls requests, and issues from the GitHub Archive project

Overview

Open-source developers write code and documentation that they make publicly available. GitHub (GH) is among the most popular website for storing and publishing this data. GH Archive is a project to record public GitHub activity and make it easily accessible.

Topics covered across users and repositories:

  • Stars

  • Pull Requests

  • Issues

Key Attributes

As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.

Notes

The GitHub Archive is a historical record of public activities on GitHub. All GitHub actions, such as push operations, forks, issues opened, and pull requests (along with the acted upon user, repository, etc.) across all public repositories on GitHub are included in a single schema. The number of GitHub stars is aggregated for active repositories by date in the github_stars table. These aggregations are based on a sum of actions from the github_events table. The github_repos table provides a reference table of unique repositories found in the github_events table by repo_id.

EAV Model: All Cybersyn products follow the EAV (entity, attributes, value) model with a unified schema. Entities are tangible objects (e.g. geography, company) that Cybersyn provides data on. All timeseries' dates and values that refer to the entity are included in a timeseries table. Descriptors of the timeseries are included in an attributes table. Data is joinable across all Cybersyn products that have a GEO_ID. Refer to Cybersyn Concepts for more details.

Snowflake Blog: Essential tips for querying GitHub Archive from Snowflake Developer Advocate, Felipe Hoffa. Incomplete Repo Names: In the GitHub events dataset, a repository name should appear as a combination of the repository organization and name. In 2018, there are repository names that are incomplete. For example, rust-lang/ appeared without its full repository name, rust-lang/rust, for ~3 months (see query below). Additionally, in 2019, some repository names appeared with duplicate organizations (e.g. nolimits4web/nolimits4web/swiper). In both cases, GitHub implemented a go-forward fix. As a workaround, we added a new field, last_seen, to the github_repos table that includes the last date a repo_name was used. This field allows you to identify the latest repo_name for a particular repo_id. Please note, there are some instances where duplication is correct, depending on the repository owner's naming conventions. Additionally, Cybersyn does not make any modifications to the original GitHub events data. The github_repos table was built to aid in the cleaning process that users may employ.

SELECT
    repo_id,
    repo_name,
    first_seen,
    last_seen
FROM cybersyn.github_repos
WHERE repo_id = 724712
ORDER BY last_seen DESC;

Data Dictionary

Example Use Cases & Queries

Find top starred repos

Pull the repos with the most stars in the past year

WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT repo.repo_name,
       repo.repo_id,
       SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 50;

Releases & Changelog

2/15/24: GitHub Archive - Added language field with the programming language of the repo

Added language the github_events table that provides the programming language (e.g. Python, SQL) of the repository

2/8/24: GitHub Archive - Added last_seen field with repo name last seen date

Added a new field to the github_repos table. last_seen includes the last date a repo_name was used. This field allows you to identify the latest repo_name for a particular repo_id. The field was added as a workaround for a known issue with the source data: In the GitHub events dataset, a repository name should appear as a combination of the repository organization and name. In 2018, there are repository names that are incomplete. For example, rust-lang/ appeared without its full repository name, rust-lang/rust, for ~3 months. Additionally, in 2019, some repository names appeared with duplicate organizations (e.g. nolimits4web/nolimits4web/swiper). In both cases, GitHub implemented a go-forward fix.

Disclaimers

The data in this dataset is sourced from GitHub Archive.

Cybersyn is not endorsed by or affiliated with any of these providers. Contact support@cybersyn.com for questions.

Last updated

Copyright © 2024 Cybersyn