
GitHub Archive

GitHub events (stars, pull requests, issues) across users and repositories, from the GH Archive project

Overview

Open-source developers write code and documentation that they make publicly available. GitHub is among the most popular websites for storing and publishing this work. GH Archive is a project that records public GitHub activity, archives it, and makes it easily accessible for further analysis. This product covers events such as stars, pull requests, and issues across users and repositories.

Key Attributes

Geographic Coverage: Global
Entity Level: GitHub repository
Time Granularity: Daily
Update Frequency: Daily at 11pm ET
History: Since February 12, 2011

Description

All Cybersyn products follow the EAV (entity, attribute, value) model with a unified schema. Entities are tangible objects (e.g. geography, company). Entities may have characteristics (i.e. descriptors of the entity) in an index table and values (i.e. statistics, measures) in a timeseries table. Refer to Cybersyn Concepts for more details.
All GitHub actions (along with the acted-upon user, repository, etc.) are available in a single schema. The product also includes a few common aggregations built from those actions, which are updated as new data arrives.
This dataset is useful for investing, recruiting, lead-gen, due diligence, and customer support, among other use cases:
  • Investing: Identify trending or inflecting projects and code repositories to find promising new open source projects run by startups.
  • Recruiting: Find developers, companies, or organizations that use or have expertise in particular technologies, filtering for users with public contributions in specific languages or libraries.
  • Lead-gen: Find organizations that work on a particular tech stack or depend on specific open source dependencies. Identify the audience (organization, development activity, location).
  • Due Diligence: Audit any open source project's or developer's history, timeline, and content of contributions (pull requests), comments (issues), and other activity. Audit the use of a project in other projects (dependencies), and validate the popularity and diversity of project contributors and the community activity on project pages.
  • Customer Support: Build alerts on user issues, questions, or use of your open source project or a particular technology across all of GitHub.
  • Training set: GitHub issues and PRs contain free-form natural language text that can be used in the training and development of chatbots.
  • Market Research: Examine major trends in software engineering.

Data Dictionary

Data Sources & Release Frequency

As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Table Names: github_events, github_repos, github_stars
Source: GH Archive
Source Schedule: Daily at 11pm ET

Notes & Methodology

Star aggregation

The number of GitHub stars is aggregated for active repositories by date in the github_stars table. These aggregations are based on a sum of actions from the github_events table.
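To make the aggregation concrete, here is a minimal sketch of how daily star counts could be derived from raw events. It runs against an in-memory SQLite table rather than Snowflake, and the column names (repo_id, event_type, created_at) are hypothetical stand-ins, since this page does not list the real github_events columns. It also assumes that stars appear as WatchEvent rows, which is how GitHub's public event stream labels a star. It is an illustration of the idea, not Cybersyn's actual pipeline.

```python
import sqlite3

# Hypothetical, simplified stand-in for github_events. The real table's
# column names are not documented on this page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE github_events (repo_id INTEGER, event_type TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?, ?)",
    [
        (1, "WatchEvent", "2024-01-01T09:00:00Z"),
        (1, "WatchEvent", "2024-01-01T17:30:00Z"),
        (1, "PushEvent", "2024-01-01T18:00:00Z"),  # not a star, ignored below
        (2, "WatchEvent", "2024-01-01T10:15:00Z"),
        (1, "WatchEvent", "2024-01-02T08:45:00Z"),
    ],
)


def daily_star_counts(connection):
    """Count star events per repository per calendar day."""
    return connection.execute(
        """
        SELECT repo_id,
               substr(created_at, 1, 10) AS star_date,  -- keep YYYY-MM-DD only
               COUNT(*) AS star_count
        FROM github_events
        WHERE event_type = 'WatchEvent'  -- assumption: stars are WatchEvent rows
        GROUP BY repo_id, substr(created_at, 1, 10)
        ORDER BY repo_id, star_date
        """
    ).fetchall()


rows = daily_star_counts(conn)
for repo_id, star_date, star_count in rows:
    print(repo_id, star_date, star_count)
```

Each output row corresponds to one (repository, date) pair, which is the shape the example query further down expects from github_stars.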

Streamlit Demos

Cybersyn builds Streamlit demos to visualize the data available in this product and provide a jumping-off point.

Example Use Cases & Queries

Use Case: Find top starred repos
Pull the repos with the most stars in the past year
-- Keep one row per repo_id: the most recently first-seen repo name
WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
-- Sum daily star counts over the trailing 365 days and return the top 50 repos
SELECT repo.repo_name,
       repo.repo_id,
       SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 50;
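
The query above does two things: it reduces github_repos to one name per repo_id, then sums star counts over the trailing year. The sketch below replays that logic in plain Python on a few made-up rows, to show why the deduplication step matters. The function name top_starred and the sample repositories are hypothetical. The tuple layouts mirror only the columns the query itself references.

```python
from datetime import date, timedelta


def top_starred(repos, stars, today, days=365, limit=50):
    """Rank repositories by stars received in the trailing window.

    repos: iterable of (repo_id, repo_name, first_seen)
    stars: iterable of (repo_id, star_date, count)
    """
    # Keep only the most recently first-seen name per repo_id,
    # mirroring the QUALIFY ROW_NUMBER() ... = 1 step.
    latest = {}
    for repo_id, repo_name, first_seen in repos:
        if repo_id not in latest or first_seen > latest[repo_id][1]:
            latest[repo_id] = (repo_name, first_seen)

    # Sum daily counts on or after the cutoff, mirroring the WHERE and SUM.
    cutoff = today - timedelta(days=days)
    totals = {}
    for repo_id, star_date, count in stars:
        if star_date >= cutoff and repo_id in latest:
            totals[repo_id] = totals.get(repo_id, 0) + count

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(latest[rid][0], rid, total) for rid, total in ranked[:limit]]


# Made-up sample rows: repo 1 was renamed, so it has two name records.
repos = [
    (1, "acme/old-name", date(2022, 1, 1)),
    (1, "acme/new-name", date(2023, 6, 1)),
    (2, "other/tool", date(2021, 3, 5)),
]
stars = [
    (1, date(2024, 1, 1), 5),
    (1, date(2024, 1, 2), 3),
    (2, date(2024, 1, 1), 4),
    (2, date(2022, 1, 1), 100),  # older than the window, excluded
]

result = top_starred(repos, stars, today=date(2024, 3, 1))
print(result)
```

Without the deduplication step, a renamed repository would match the join once per name record, and its stars would be counted more than once.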

Releases & Changelog

There are no updates at this time.

Disclaimers

The data in this dataset is sourced from GH Archive. Links to provider license, terms and disclaimers are provided where appropriate.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact [email protected] for questions.