
LLM Training Essentials

Text-based reports including SEC filings, US patent grants, US government contracts, and the OpenAlex global research system index

Overview

LLM Training Essentials includes four text-based datasets covering corporate filings with the SEC, US patent grants, US government contracts, and the global research system catalog from OpenAlex. This product was built to provide users with a set of text-rich data for training, fine-tuning, and inference of large language models (LLMs).
Example topics covered:
  • SEC filings including fiscal calendars, press releases, earnings, annual & quarterly reports, major company events, and quarterly fund holdings
  • US patent grant applications, patent type, invention title, and contributor name/location
  • US federal government contract opportunities and awards
  • OpenAlex index of scholarly entities (e.g. works, sources, authors, funders, publishers) and how they are connected to one another
Users can leverage the data to train natural language models, summarize current events based on corporate actions and executive commentary, and track the impact of emerging technology trends.

Data Sources, Attributes, Example Queries

A detailed description of the data is available by source. Source pages include key attributes (e.g. geographic coverage, time granularity, history, entity level), release frequency, notes & methodologies, and sample queries.
All Cybersyn products follow the EAV (entity, attribute, value) model with a unified schema. Entities are tangible objects (e.g. geography, company) that Cybersyn provides data on. The dates and values of every timeseries that refers to an entity are stored in a timeseries table, and the descriptors of each timeseries are stored in an attributes table. Data is joinable across all Cybersyn products that have a GEO_ID. Refer to Cybersyn Concepts for more details.
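The timeseries/attributes split can be illustrated with a minimal sketch. It uses an in-memory SQLite database rather than Snowflake, and every table name, column name, and value in it (example_timeseries, example_attributes, variable, the sample rows) is a hypothetical stand-in, not the actual Cybersyn schema. Consult the source pages for the real tables and sample queries.

```python
import sqlite3

# Hypothetical in-memory tables that mimic the EAV layout described above.
# Names, columns, and values are illustrative assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE example_attributes (
        variable      TEXT PRIMARY KEY,  -- key shared with the timeseries table
        variable_name TEXT,              -- human-readable descriptor
        unit          TEXT
    );
    CREATE TABLE example_timeseries (
        geo_id   TEXT,   -- the entity the observation refers to
        variable TEXT,   -- which attribute was observed
        date     TEXT,   -- ISO date, so string comparison orders correctly
        value    REAL
    );
    INSERT INTO example_attributes VALUES
        ('contracts_awarded', 'Contracts awarded', 'count');
    INSERT INTO example_timeseries VALUES
        ('geoId/06', 'contracts_awarded', '2023-12-31', 120.0),
        ('geoId/06', 'contracts_awarded', '2024-01-31', 135.0),
        ('geoId/48', 'contracts_awarded', '2024-01-31', 98.0);
    """
)


def latest_values(connection, variable):
    """Join timeseries rows to their descriptors and keep each entity's latest date."""
    query = """
        SELECT ts.geo_id, att.variable_name, att.unit, ts.date, ts.value
        FROM example_timeseries AS ts
        JOIN example_attributes AS att ON att.variable = ts.variable
        WHERE ts.variable = ?
          AND ts.date = (
              SELECT MAX(t2.date)
              FROM example_timeseries AS t2
              WHERE t2.geo_id = ts.geo_id AND t2.variable = ts.variable
          )
        ORDER BY ts.geo_id
    """
    return connection.execute(query, (variable,)).fetchall()


rows = latest_values(conn, "contracts_awarded")
for row in rows:
    print(row)
```

Because the descriptors live in one table and the observations in another, the same join pattern should carry over to any timeseries in the model. Only the variable key changes.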
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.

Data Dictionary

Releases & Changelog

1/26/24: Added open catalog of scholarly entities and how they are connected from OpenAlex
Added OpenAlex's catalog of scholarly entities and the connections between them. Entities are defined as scholarly works (e.g. journal articles, books, datasets, theses), authors, sources, affiliated organizations, topics covered, publishers, and funders. This data is derived from a wide range of sources, offering an extensive overview of academic research and its contributors.
The data is available in the following tables:
  • OPENALEX_AUTHORS_INDEX
  • OPENALEX_CONCEPTS_INDEX
  • OPENALEX_FUNDERS_INDEX
  • OPENALEX_INSTITUTIONS_INDEX
  • OPENALEX_PUBLISHERS_INDEX
  • OPENALEX_SOURCES_INDEX
  • OPENALEX_WORKS_INDEX
9/17/23: Added COMPANY_INDEX and COMPANY_CHARACTERISTICS tables; added PermIds
Added the following tables:
  • The COMPANY_INDEX table aggregates commonly used company identifiers (i.e. CIKs, EINs, and LEIs) into a single company_id, which can be used across Cybersyn’s datasets as a unique identifier for corporate entities.
  • The COMPANY_CHARACTERISTICS table includes categorical characteristics of a company (e.g. industry, address, previous names). A characteristic may be temporal, with start and end dates indicating the range for which the data is valid.
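
The two company tables can be pictured with a small, hedged sketch: one lookup that maps any known identifier to a single company_id, and one that returns the characteristic valid on a given date. The row layouts, helper names (resolve_company_id, characteristic_as_of), and all identifier values below are made up for illustration. The actual column names in COMPANY_INDEX and COMPANY_CHARACTERISTICS may differ.

```python
from datetime import date
from typing import Optional

# Hypothetical identifier rows: (company_id, id_type, id_value). All values are made up.
INDEX_ROWS = [
    ("company_A", "CIK", "0000000001"),
    ("company_A", "EIN", "00-0000001"),
    ("company_A", "LEI", "EXAMPLELEI0000000001"),
    ("company_B", "CIK", "0000000002"),
]

# Hypothetical characteristic rows:
# (company_id, characteristic, value, start_date, end_date).
# An end_date of None is assumed here to mean "still valid".
CHARACTERISTIC_ROWS = [
    ("company_A", "name", "Old Name Inc", date(2010, 1, 1), date(2019, 12, 31)),
    ("company_A", "name", "New Name Inc", date(2020, 1, 1), None),
]


def resolve_company_id(id_type: str, id_value: str) -> Optional[str]:
    """Map a CIK, EIN, or LEI to the single company_id it belongs to."""
    for company_id, row_type, row_value in INDEX_ROWS:
        if row_type == id_type and row_value == id_value:
            return company_id
    return None


def characteristic_as_of(company_id: str, characteristic: str, as_of: date) -> Optional[str]:
    """Return the characteristic value whose validity range contains as_of."""
    for cid, name, value, start, end in CHARACTERISTIC_ROWS:
        in_range = start <= as_of and (end is None or as_of <= end)
        if cid == company_id and name == characteristic and in_range:
            return value
    return None


company = resolve_company_id("LEI", "EXAMPLELEI0000000001")
print(company, characteristic_as_of(company, "name", date(2015, 6, 1)))
```

Resolving identifiers to company_id first means that a record keyed by CIK in one dataset and by LEI in another can still be matched to the same corporate entity.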

Disclaimer

The data in this dataset is sourced on the individual source pages. Links to provider terms and disclaimers are provided where appropriate.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact [email protected] for questions.