GitHub Archive
GitHub events (stars, pull requests, issues) across users and repositories; from the GH Archive project
Open-source developers write code and documentation that they make publicly available. GitHub is among the most popular websites for storing and publishing this work. GH Archive is a project to record public GitHub activity, archive it, and make it easily accessible for further analysis. This dataset covers stars, pull requests, and issues across users and repositories.
Geographic Coverage | Global |
Entity Level | GitHub repository |
Time Granularity | Daily |
Update Frequency | Daily at 11pm ET |
History | Since February 12, 2011 |
All Cybersyn products follow the EAV (entity, attribute, value) model with a unified schema. Entities are tangible objects (e.g. geography, company). Entities may have characteristics (i.e. descriptors of the entity) in an index table and values (i.e. statistics, measures) in a timeseries table. Refer to Cybersyn Concepts for more details.
All GitHub actions (along with the acted-upon user, repository, etc.) are available in a single schema. The product also includes a couple of common aggregations that are updated on the same schedule as the underlying event data.
This dataset is useful for recruiting, investing, lead generation, due diligence, and customer support, among other use cases:
- Investing: Identify trending or inflecting projects and code repositories to find promising new open-source projects run by startups.
- Recruiting: Find developers, companies, or organizations that use or have expertise in particular technologies, filtering for users with public contributions in specific languages or libraries.
- Lead-gen: Find organizations that work on a particular tech stack or depend on specific open-source dependencies. Identify the audience (organization, development activity, location).
- Due Diligence: Audit any open-source project's or developer's history, timeline, and content of contributions (pull requests), comments (issues), and other activity. Audit use of a project in other projects (dependencies), and validate the popularity and diversity of project contributors and community activity on project pages.
- Customer Support: Build alerts on user issues, questions, or use of your open-source project or a particular technology across all of GitHub.
- Training set: GitHub issues and PRs contain free-form natural language text that can be used in the training and development of chatbots.
- Market Research: Examine major trends in software engineering.
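As one illustration of the Investing use case, the following Snowflake SQL is a minimal sketch that ranks repositories by how much their star activity accelerated recently. It uses only the tables and columns that appear in this page's example query (`repo_id`, `repo_name`, `first_seen`, `date`, `count`). The 30-day windows and the 100-star minimum are illustrative assumptions, not Cybersyn guidance.

```sql
-- Sketch: compare stars gained in the last 30 days with the 30 days before.
-- Window lengths and the 100-star floor are arbitrary, illustrative choices.
WITH latest_repo_name AS (
    -- Keep the most recently seen name for each repo_id
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
),
windowed AS (
    SELECT s.repo_id,
           SUM(IFF(s.date >= DATEADD('day', -30, CURRENT_DATE), s.count, 0)) AS stars_recent,
           SUM(IFF(s.date <  DATEADD('day', -30, CURRENT_DATE), s.count, 0)) AS stars_prior
    FROM cybersyn.github_stars AS s
    WHERE s.date >= DATEADD('day', -60, CURRENT_DATE)
    GROUP BY s.repo_id
)
SELECT repo.repo_name,
       w.stars_recent,
       w.stars_prior,
       -- NULLIF avoids division by zero; the ratio is NULL when there were no prior stars
       w.stars_recent / NULLIF(w.stars_prior, 0) AS growth_ratio
FROM windowed AS w
JOIN latest_repo_name AS repo
    ON (repo.repo_id = w.repo_id)
WHERE w.stars_recent >= 100
ORDER BY growth_ratio DESC NULLS LAST
LIMIT 50;
```

Repositories with no stars in the earlier window get a NULL ratio and sort last, so brand-new projects need a separate filter on `stars_prior = 0` if they are of interest.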
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Table Names | Source | Source Schedule |
---|---|---|
github_events, github_repos, github_stars | GH Archive | Daily at 11pm ET |
The number of GitHub stars is aggregated for active repositories by date in the github_stars table. These aggregations are based on a sum of actions from the github_events table.
Cybersyn builds Streamlit demos to visualize the data available in this product and provide a jumping-off point.
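To make the relationship between github_events and github_stars concrete, here is a rough sketch of how a daily star count could be derived from raw events. GH Archive records a star as a `WatchEvent`. The column names `type`, `repo_id`, and `created_at_timestamp` are assumptions about the github_events table and are not confirmed by this page, so check the table definition before running it. This is not necessarily how Cybersyn builds the aggregate.

```sql
-- Sketch only: type, repo_id, and created_at_timestamp are ASSUMED column
-- names for cybersyn.github_events; verify them against the actual schema.
SELECT e.repo_id,
       TO_DATE(e.created_at_timestamp) AS star_date,
       COUNT(*)                        AS star_count
FROM cybersyn.github_events AS e
WHERE e.type = 'WatchEvent'  -- GH Archive's event type for starring a repo
  AND e.created_at_timestamp >= DATEADD('day', -7, CURRENT_DATE)  -- limit the scan
GROUP BY e.repo_id, TO_DATE(e.created_at_timestamp)
ORDER BY star_date, e.repo_id;
```

For most analyses the prebuilt github_stars table is the simpler starting point, since it avoids scanning the full event history.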
Use Case: Find top starred repos
Pull the repos with the most stars in the past year
-- Keep the most recently seen name for each repo_id
WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
-- Sum daily star counts over the trailing 365 days and return the top 50 repos
SELECT repo.repo_name,
       repo.repo_id,
       SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 50;
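A natural follow-up is to look at the star history of a single repository. The sketch below reuses the same tables and columns as the query above and adds a running total. The repository name `example-owner/example-repo` is a hypothetical placeholder, and the `owner/repo` name format is an assumption about how `repo_name` is stored.

```sql
-- Sketch: daily and cumulative stars for one repo over the past year.
-- 'example-owner/example-repo' is a placeholder, not a real value from the data.
WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT stars.date,
       SUM(stars.count)                                 AS daily_stars,
       -- Running total of the daily sums, ordered by date
       SUM(SUM(stars.count)) OVER (ORDER BY stars.date) AS cumulative_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE repo.repo_name = 'example-owner/example-repo'
  AND stars.date >= DATEADD('day', -365, CURRENT_DATE)
GROUP BY stars.date
ORDER BY stars.date;
```

The cumulative column starts from zero at the beginning of the 365-day window, so it shows stars gained during the period rather than the repository's all-time total.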
There are no updates at this time.
The data in this dataset is sourced here. Links to provider license, terms and disclaimers are provided where appropriate:
Cybersyn is not endorsed by or affiliated with any of these providers. Contact [email protected] for questions.