LLM Training Essentials
Text-based reports including SEC filings, US patent grants, and US government contracts
LLM Training Essentials includes three text-based datasets covering corporate filings with the SEC, US patent grants, and US government contracts. This product was built to provide users with a set of text-rich data for training, fine-tuning, and inference of large language models (LLMs). The data is sourced from the SEC, the United States Patent and Trademark Office (USPTO), and the System for Award Management (SAM.gov).
Users can leverage the data to train natural language models, summarize current events based on corporate actions and executive commentary, and track the impact of emerging technology trends.
Topics covered include:
- High-tech microchip patents and the latest biotechnology research
- Defense contractor job descriptions
- Recent public company filings
Geographic Coverage | United States |
Entity Level | Contributor, Patent, Contract, Contract Award, Company |
Time Granularity | Daily |
Update Frequency | Varies depending on source |
History | January 1976 (US Patents), January 2019 (SEC Filings), January 2002 (US Contracts) |
Visit the following Cybersyn Docs pages for specifics on each dataset included in LLM Training Essentials.
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
Table Names | See Source & Source Schedule |
---|---|
sec_cik_index, sec_fiscal_calendars, sec_report_index, sec_report_attributes, sec_report_text_attributes | |
company_index, company_characteristics | |
uspto_patent_index, uspto_contributor_index, uspto_patent_contributor_relationships | |
government_contract_index, government_contract_award_index | |
geography_index, geography_relationships | Data Commons is an aggregator of government data sources; release calendars vary by underlying source. The US Census Bureau publishes datasets about the US population and its economy; release schedules vary by dataset. |
The unique identifiers of the uspto_patent_index table, patent_id and patent_extended_id, are generated using a combination of the document number and other metadata properties in the data, such as patent issue dates. patent_id and patent_extended_id are designed to be searchable in USPTO Patent Public Search and Google Patent search, respectively. Due to changes in patent document numbers and classifications over time, there may be cases where there is an imperfect match between Cybersyn’s identifier and the lookup capabilities of the USPTO public search and Google Patent search.
In the government_contract tables, the contract_award_id and contract_solicitation_id fields can be used to find the award and the original contract on the SAM.gov search portal, respectively.
In the sec_report_attributes table, covered_qtrs denotes how many quarters the value covers. For example, a year-to-date revenue measure spanning Q1-Q3 has a covered_qtrs of 3, while a row measuring revenue for Q3 alone has a covered_qtrs of 1. A point-in-time measure that is “as of” a specific date (e.g., a balance sheet item) has a covered_qtrs of 0.
In the sec_report_text_attributes table, the raw text is pulled from the originally filed reports. The raw text is stripped of HTML formatting and is presented as a single block of text. Note that this text includes both text-based sections of the reports (such as Management Discussions & Analysis, Market Risk Factors, and Business Overview) and unformatted text of values from financial statements.
Cybersyn mapped portions of companies’ addresses (city, state, country) to a unique geo_id that corresponds to that location. This is a feature of all Cybersyn datasets to allow for easy comparisons across datasets that use geographic identifiers.
Recent company disclosures and attitudes toward large language models
Find filings that mention ‘large language models’ and include metadata about company industry and location.
SELECT
    companies.cik,
    companies.company_name,
    companies.sic_code_category,
    companies.sic_code_description,
    companies.country,
    txt.period_end_date,
    txt.value
FROM cybersyn.sec_report_text_attributes AS txt
JOIN cybersyn.sec_cik_index AS companies ON (companies.cik = txt.cik)
WHERE txt.period_end_date >= '2020-01-01'
    AND txt.value ILIKE '%large language model%';
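Because the value column holds each report's text as a single block, it can help to trim every match down to a short excerpt once the rows from the query above have been fetched. The following is a minimal Python sketch of that post-processing step, not part of Cybersyn's tooling: the function name context_snippets, the window parameter, and the sample text are all hypothetical.

```python
import re


def context_snippets(text, phrase, window=200):
    """Return excerpts of `text` around each case-insensitive match of `phrase`.

    `window` is the number of characters kept on each side of a match.
    """
    snippets = []
    # re.escape keeps the phrase literal; IGNORECASE mirrors the ILIKE filter.
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        snippets.append(text[start:end].strip())
    return snippets


# Hypothetical stand-in for one `value` returned by the query above.
sample_value = (
    "Item 1A. Risk Factors. We are investing in Large Language Models to "
    "improve customer support. Competitors may adopt similar tools faster."
)

for snippet in context_snippets(sample_value, "large language model", window=40):
    print(snippet)
```

A fixed character window is the simplest choice here; splitting on sentence boundaries would give cleaner excerpts but is harder to do reliably on text that has been stripped of its formatting.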
Text of patents concerning superconductors
Query the USPTO data and filter by raw patent text containing ‘superconductor’ as well as who the patent was assigned to and related patent category classifications.
WITH assignees AS (
    SELECT patent_id, ARRAY_AGG(contributor_name) AS corporate_assignee
    FROM cybersyn.uspto_contributor_index AS contributors
    JOIN cybersyn.uspto_patent_contributor_relationships AS rship
        ON (contributors.contributor_id = rship.contributor_id)
    WHERE contribution_type IN ('Assignee - United States Company Or Corporation',
                                'Assignee - Foreign Company Or Corporation')
    GROUP BY patent_id
)
SELECT
    patents.patent_id,
    assignees.corporate_assignee,
    patents.invention_title,
    patents.patent_type,
    patents.document_publication_date,
    patents.cpc_section_description,
    patents.cpc_class_description,
    patents.cpc_subclass_description,
    patents.cpc_group_description,
    patents.cpc_subgroup_description,
    patents.patent_raw_text
FROM cybersyn.uspto_patent_index AS patents
JOIN assignees ON (assignees.patent_id = patents.patent_id)
WHERE patents.patent_raw_text ILIKE '%superconductor%'
LIMIT 100;
Details about the highest-value contracts awarded by government agencies
Pull descriptions of the largest Missile Defense Agency contracts awarded since 2021.
SELECT
    awards.award_name,
    awards.award_description,
    contracts.department,
    contracts.agency,
    contracts.agency_office,
    contracts.first_posted_date,
    awards.award_date,
    awards.award_amount,
    contracts.naics_description
FROM cybersyn.government_contract_index AS contracts
JOIN cybersyn.government_contract_award_index AS awards
    ON (contracts.contract_solicitation_id = awards.contract_solicitation_id)
WHERE contracts.department = 'Dept Of Defense'
    AND contracts.agency_office = 'Missile Defense Agency (Mda)'
    AND YEAR(awards.award_date) >= 2021
ORDER BY awards.award_amount DESC NULLS LAST
LIMIT 50;
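The covered_qtrs convention described earlier for sec_report_attributes also makes it possible to back out a single-quarter figure from cumulative rows. The Python sketch below illustrates the arithmetic only. It assumes that the two rows describe the same company, measure, and fiscal year, and that a row with covered_qtrs of N is a year-to-date total through quarter N. The function name single_quarter_value and the revenue figures are hypothetical.

```python
def single_quarter_value(current, prior=None):
    """Derive a standalone-quarter value from cumulative (year-to-date) rows.

    `current` and `prior` are (covered_qtrs, value) pairs. `prior` must be the
    year-to-date row that ends one quarter before `current` (an assumption).
    """
    covered_qtrs, value = current
    if covered_qtrs == 0:
        # Point-in-time measures (e.g. balance sheet items) are not cumulative.
        raise ValueError("covered_qtrs = 0 marks a point-in-time measure")
    if covered_qtrs == 1:
        # The row already covers exactly one quarter.
        return value
    if prior is None or prior[0] != covered_qtrs - 1:
        raise ValueError("need the year-to-date row covering one fewer quarter")
    # Year-to-date through quarter N minus year-to-date through quarter N-1.
    return value - prior[1]


# Hypothetical figures: Q1-Q2 revenue of 190.0 and Q1-Q3 revenue of 300.0.
ytd_q2 = (2, 190.0)
ytd_q3 = (3, 300.0)
print(single_quarter_value(ytd_q3, ytd_q2))  # revenue for Q3 alone
```

Raising an error for covered_qtrs of 0 is a deliberate choice: subtracting two point-in-time balances gives a change in the balance, which is a different quantity from a quarterly flow.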
Added three new tables:
- The COMPANY_INDEX table aggregates commonly used company identifiers (i.e., CIKs, EINs, and LEIs) into a single company_id, which can be used across Cybersyn’s datasets as a unique identifier for corporate entities.
- The COMPANY_CHARACTERISTICS table includes categorical characteristics of a company (e.g., industry, address, previous names). A characteristic may be temporal, with start and end dates indicating the range for which the data is valid.
The data in this dataset is sourced here. Links to provider terms and disclaimers are provided where appropriate.
Cybersyn is not endorsed by or affiliated with any of these providers. Contact [email protected] for questions.