GitHub Events
GithHub stars, pulls requests, and issues from the GitHub Archive project
Overview
Open-source developers write code and documentation that they make publicly available. GitHub (GH) is among the most popular website for storing and publishing this data. GH Archive is a project to record public GitHub activity and make it easily accessible.
Topics covered across users and repositories:
Stars
Pull Requests
Issues
Key Attributes
Geographic Coverage | Global |
Entity Level | GitHub Repository, GitHub User |
Time Granularity | Daily |
Update Frequency | Daily at 11pm ET |
History | Since February 12, 2011 |
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Notes
The GitHub Archive is a historical record of public activities on GitHub. All GitHub actions, such as push operations, forks, issues opened, and pull requests (along with the acted upon user, repository, etc.) across all public repositories on GitHub are included in a single schema. The number of GitHub stars is aggregated for active repositories by date in the github_stars
table. These aggregations are based on a sum of actions from the github_events
table. The github_repos
table provides a reference table of unique repositories found in the github_events
table by repo_id
.
EAV Model: All Cybersyn products follow the EAV (entity, attributes, value) model with a unified schema. Entities are tangible objects (e.g. geography, company) that Cybersyn provides data on. All timeseries' dates and values that refer to the entity are included in a timeseries table. Descriptors of the timeseries are included in an attributes table. Data is joinable across all Cybersyn products that have a GEO_ID
. Refer to Cybersyn Concepts for more details.
Snowflake Blog: Essential tips for querying GitHub Archive from Snowflake Developer Advocate, Felipe Hoffa.
Incomplete Repo Names: In the GitHub events dataset, a repository name should appear as a combination of the repository organization and name. In 2018, there are repository names that are incomplete. For example, rust-lang/ appeared without its full repository name, rust-lang/rust, for ~3 months (see query below). Additionally, in 2019, some repository names appeared with duplicate organizations (e.g. nolimits4web/nolimits4web/swiper). In both cases, GitHub implemented a go-forward fix.
As a workaround, we added a new field, last_seen
, to the github_repos
table that includes the last date a repo_name
was used. This field allows you to identify the latest repo_name
for a particular repo_id
. Please note, there are some instances where duplication is correct, depending on the repository owner's naming conventions. Additionally, Cybersyn does not make any modifications to the original GitHub events data. The github_repos
table was built to aid in the cleaning process that users may employ.
Tables & Source Link
Table Names | Source |
---|---|
|
Example Use Cases & Queries
Find top starred repos
Pull the repos with the most stars in the past year
Releases & Changelog
Disclaimers
The data in this dataset is sourced from GitHub Archive.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact support@cybersyn.com for questions.
Last updated