
LLM Training Essentials

Text-based reports including SEC filings, US patent grants, and US government contracts

Overview

LLM Training Essentials includes three text-based datasets covering corporate filings with the SEC, US patent grants, and US government contracts. This product was built to provide users with a set of text-rich data for training, fine-tuning, and inference of large language models (LLMs). The data is sourced from the SEC, the United States Patent and Trademark Office (USPTO), and the System for Award Management (SAM.gov).
Users can leverage the data to train natural language models, summarize current events based on corporate actions and executive commentary, and track the impact of emerging technology trends.
Topics covered include:
  • High-tech microchip patents and the latest on biotechnology research
  • Defense contractor job descriptions
  • Recent public company filings

Key attributes

Geographic Coverage: United States
Entity Level: Contributor, Patent, Contract, Contract Award, Company
Time Granularity: Daily
Update Frequency: Varies depending on source
History: January 1976 (US Patents), January 2019 (SEC Filings), January 2002 (US Contracts)

Additional details

Visit the following Cybersyn Docs pages for specifics on each dataset included in LLM Training Essentials.

Data Dictionary

Data Sources & Release Frequency

As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Table Names
See Source & Source Schedule
sec_cik_index, sec_fiscal_calendars, sec_report_index, sec_report_attributes, sec_report_text_attributes
company_index, company_characteristics
uspto_patent_index, uspto_contributor_index, uspto_patent_contributor_relationships
government_contract_index, government_contract_award_index
geography_index, geography_relationships
Data Commons is an aggregator of government data sources. Release calendars vary by underlying source. The US Census Bureau publishes datasets about the US people and its economy; release schedules vary by dataset.

Notes & Methodology

Patent identifier

The unique identifiers in the uspto_patent_index table, patent_id and patent_extended_id, are generated from the document number combined with other metadata properties in the data, such as the patent issue date. The patent_id and patent_extended_id are designed to be searchable in USPTO Patent Public Search and Google Patents, respectively.
Because patent document numbers and classifications have changed over time, Cybersyn’s identifier may not always match a record in USPTO Patent Public Search or Google Patents.

Contract and contract award identifiers

In the government_contract_index and government_contract_award_index tables, the contract_award_id and contract_solicitation_id fields can be used to find the award and the original contract, respectively, on the SAM.gov search portal.

Covered quarters

In the sec_report_attributes table, the covered_qtrs field denotes how many quarters the value covers. For example, a year-to-date revenue measure spanning Q1-Q3 has a covered_qtrs of 3, while a row measuring revenue for Q3 alone has a covered_qtrs of 1. A point-in-time measure that is “as of” a specific date (e.g., a balance sheet item) has a covered_qtrs of 0.
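A common follow-up is recovering a single-quarter figure when only cumulative rows are available. The Python sketch below illustrates the arithmetic on hand-made sample rows. The dictionary keys other than covered_qtrs, the sample values, and the helper name quarter_only are assumptions for illustration, not the actual table schema. It also assumes all rows belong to the same company, measure, and fiscal year.

```python
from typing import Dict, List, Optional

# Hypothetical sample rows; only covered_qtrs is taken from the documentation.
rows: List[Dict] = [
    {"period_end_date": "2023-06-30", "covered_qtrs": 2, "value": 210.0},  # Q1-Q2 year-to-date
    {"period_end_date": "2023-09-30", "covered_qtrs": 3, "value": 330.0},  # Q1-Q3 year-to-date
    {"period_end_date": "2023-09-30", "covered_qtrs": 0, "value": 95.0},   # point-in-time value
]


def quarter_only(rows: List[Dict], period_end: str, prior_period_end: str) -> Optional[float]:
    """Return the value for the quarter ending at period_end alone, or None."""
    current = [r for r in rows if r["period_end_date"] == period_end]

    # Prefer a row that already covers exactly one quarter.
    for r in current:
        if r["covered_qtrs"] == 1:
            return r["value"]

    # Otherwise subtract the prior cumulative value (n - 1 quarters)
    # from the current cumulative value (n quarters).
    for r in current:
        n = r["covered_qtrs"]
        if n > 1:
            for p in rows:
                if p["period_end_date"] == prior_period_end and p["covered_qtrs"] == n - 1:
                    return r["value"] - p["value"]

    # Point-in-time rows (covered_qtrs == 0) are never differenced.
    return None


q3_value = quarter_only(rows, "2023-09-30", "2023-06-30")
```

Rows with a covered_qtrs of 0 are skipped on purpose, since subtracting two point-in-time balances does not yield a quarterly flow.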

Text formatting

In the sec_report_text_attributes table, the raw text is pulled from the originally filed reports. The raw text is stripped of HTML formatting and is presented as a single block of text. Note that this text includes both text-based sections of the reports (such as Management’s Discussion and Analysis, Market Risk Factors, and Business Overview) and unformatted text of values from financial statements.
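Because each value arrives as one long block of text, it usually needs to be split before it fits a model's context window. The Python sketch below shows one simple approach, overlapping word windows. The function name chunk_text and the default sizes are illustrative assumptions, and word counts are only a rough stand-in for model tokens.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split a block of text into word windows that overlap by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")

    words = text.split()          # collapses any run of whitespace
    step = max_words - overlap    # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break                 # the last window reached the end of the text
    return chunks


example_chunks = chunk_text("a b c d e f g", max_words=3, overlap=1)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text.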

Geography mapping

Cybersyn mapped portions of companies’ addresses (city, state, country) to a unique geo_id that corresponds to that location. This is a feature of all Cybersyn datasets to allow for easy comparisons across datasets that use geographic identifiers.

Sample Queries

Recent company disclosures and attitudes toward large language models
Find filings that mention ‘large language models’ and include metadata about company industry and location.
SELECT
    companies.cik,
    companies.company_name,
    companies.sic_code_category,
    companies.sic_code_description,
    companies.country,
    txt.period_end_date,
    txt.value
FROM cybersyn.sec_report_text_attributes AS txt
JOIN cybersyn.sec_cik_index AS companies
    ON (companies.cik = txt.cik)
WHERE txt.period_end_date >= '2020-01-01'
    AND txt.value ILIKE '%large language model%';
Text of patents concerning superconductors
Query the USPTO data for patents whose raw text contains ‘superconductor’, and return the corporate assignees and related patent classification categories for each.
WITH assignees AS (
    SELECT patent_id, ARRAY_AGG(contributor_name) AS corporate_assignee
    FROM cybersyn.uspto_contributor_index AS contributors
    JOIN cybersyn.uspto_patent_contributor_relationships AS rship
        ON (contributors.contributor_id = rship.contributor_id)
    WHERE contribution_type IN ('Assignee - United States Company Or Corporation',
                                'Assignee - Foreign Company Or Corporation')
    GROUP BY patent_id
)
SELECT
    patents.patent_id,
    assignees.corporate_assignee,
    patents.invention_title,
    patents.patent_type,
    patents.document_publication_date,
    patents.cpc_section_description,
    patents.cpc_class_description,
    patents.cpc_subclass_description,
    patents.cpc_group_description,
    patents.cpc_subgroup_description,
    patents.patent_raw_text
FROM cybersyn.uspto_patent_index AS patents
JOIN assignees
    ON (assignees.patent_id = patents.patent_id)
WHERE patents.patent_raw_text ILIKE '%superconductor%'
LIMIT 100;
Details about the highest-value contracts awarded by government agencies
Pull descriptions of the largest Missile Defense Agency contracts awarded since 2021.
SELECT
    awards.award_name,
    awards.award_description,
    contracts.department,
    contracts.agency,
    contracts.agency_office,
    contracts.first_posted_date,
    awards.award_date,
    awards.award_amount,
    contracts.naics_description
FROM cybersyn.government_contract_index AS contracts
JOIN cybersyn.government_contract_award_index AS awards
    ON (contracts.contract_solicitation_id = awards.contract_solicitation_id)
WHERE contracts.department = 'Dept Of Defense'
    AND contracts.agency_office = 'Missile Defense Agency (Mda)'
    AND YEAR(awards.award_date) >= 2021
ORDER BY awards.award_amount DESC NULLS LAST
LIMIT 50;

Releases & Changelog

9/17/23: Added COMPANY_INDEX and COMPANY_CHARACTERISTICS tables; added PermIds
Added two new tables:
  • The COMPANY_INDEX table aggregates commonly used company identifiers (e.g., CIKs, EINs, and LEIs) into a single company_id, which can be used across Cybersyn’s datasets as a unique identifier for corporate entities.
  • The COMPANY_CHARACTERISTICS table includes categorical characteristics of a company (e.g., industry, address, previous names). A characteristic may be temporal, with start and end dates indicating the range for which the data is valid.
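
Temporal characteristics are typically read "as of" a given date. The Python sketch below shows that filter on hand-made rows. The keys characteristic, value, start_date, and end_date, the sample values, and the treatment of a missing date as open-ended are all assumptions for illustration, not the documented schema.

```python
from datetime import date
from typing import Dict, List

# Hypothetical sample rows for one company; field names are illustrative.
characteristics: List[Dict] = [
    {"characteristic": "name", "value": "Old Name Inc",
     "start_date": date(2010, 1, 1), "end_date": date(2018, 12, 31)},
    {"characteristic": "name", "value": "New Name Corp",
     "start_date": date(2019, 1, 1), "end_date": None},  # still valid
]


def valid_as_of(rows: List[Dict], as_of: date) -> List[Dict]:
    """Keep rows whose validity range contains as_of (both ends inclusive)."""
    kept = []
    for r in rows:
        start, end = r["start_date"], r["end_date"]
        starts_ok = start is None or start <= as_of   # None = no known start
        ends_ok = end is None or as_of <= end         # None = open-ended
        if starts_ok and ends_ok:
            kept.append(r)
    return kept


names_2020 = [r["value"] for r in valid_as_of(characteristics, date(2020, 6, 30))]
names_2015 = [r["value"] for r in valid_as_of(characteristics, date(2015, 6, 30))]
```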

Disclaimer

The data in this dataset is sourced here. Links to provider terms and disclaimers are provided where appropriate.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact [email protected] for questions.