LLM Training Essentials
Text-based reports including SEC filings, US patent grants, and US government contracts
LLM Training Essentials includes three text-based datasets covering corporate filings with the SEC, US patent grants, and US government contracts. This product was built to provide users with a set of text-rich data for training, fine-tuning, and inference of large language models (LLMs). The data is sourced from the SEC, the United States Patent and Trademark Office (USPTO), and the System for Award Management (SAM.gov).
Users can leverage the data to train natural language models, summarize current events based on corporate actions and executive commentary, and track the impact of emerging technology trends.
Topics covered include:
- High-tech microchip patents and the latest on biotechnology research
- Defense contractor job descriptions
- Recent public company filings
Visit the following Cybersyn Docs pages for specifics on each dataset included in LLM Training Essentials.
As with all Public Domain datasets, Cybersyn aims to release data on Snowflake Marketplace as soon as the underlying source releases new data. We check periodically for changes to the underlying source and, upon detecting a change, propagate the data to Snowflake Marketplace immediately. See our release process for more details.
The unique patent identifiers, including patent_extended_id, are generated using a combination of the document number and other metadata properties in the data, such as patent issue dates. The ID and extended_ID are designed to be searchable in USPTO Patent Public Search and Google Patent search, respectively.
Due to changes in patent document numbers and classifications over time, there may be cases where there is an imperfect match between Cybersyn’s identifier and the lookup capabilities of the USPTO public search and Google Patent search.
In the government_contract tables, the contract_award_id and contract_solicitation_id fields can be used to find the award and the original contract on the SAM.gov search portal, respectively.
covered_qtrs denotes how many periods the value covers. For example, a year-to-date measure for revenue from Q1-Q3 would be 3. The covered quarters for a row measuring only the revenue for Q3 would be 1. A point-in-time measure that is “as of” a specific date (i.e., a balance sheet value) has a 0 for covered_qtrs.
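As a worked illustration of the covered_qtrs convention, the sketch below derives a standalone Q3 figure from two year-to-date values. Apart from covered_qtrs, the field names and the revenue figures are hypothetical and are not taken from the Cybersyn schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """One reported value; field names other than covered_qtrs are hypothetical."""
    period_end: str    # fiscal period end date in ISO format
    covered_qtrs: int  # 0 = point in time, 1 = one quarter, 3 = three quarters to date
    value: float


def standalone_quarter(ytd_current: Measure, ytd_previous: Measure) -> Measure:
    """Derive a single-quarter value from two consecutive year-to-date values."""
    if ytd_previous.covered_qtrs < 1:
        raise ValueError("point-in-time values cannot be subtracted as flows")
    if ytd_current.covered_qtrs - ytd_previous.covered_qtrs != 1:
        raise ValueError("inputs must cover consecutive year-to-date periods")
    return Measure(ytd_current.period_end, 1, ytd_current.value - ytd_previous.value)


# Made-up revenue figures, for illustration only.
ytd_q2 = Measure("2023-06-30", 2, 200.0)
ytd_q3 = Measure("2023-09-30", 3, 330.0)
q3_only = standalone_quarter(ytd_q3, ytd_q2)
print(q3_only.covered_qtrs, q3_only.value)
```

The derived row carries covered_qtrs = 1, matching how a row that measures only Q3 revenue is described above.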
In the sec_report_text_attributes table, the raw text is pulled from the originally filed reports. The raw text is stripped of HTML formatting and is presented as a single block of text. Note that this text includes both text-based sections of the reports (such as Management Discussions & Analysis, Market Risk Factors, and Business Overview) and unformatted text of values from financial statements.
Cybersyn mapped portions of companies’ addresses (city, state, country) to a unique geo_id that corresponds to that location. This is a feature of all Cybersyn datasets to allow for easy comparisons across datasets that use geographic identifiers.
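The idea of a shared geographic key can be sketched as a simple lookup join. In this minimal example only the geo_id name comes from the text above; the rows, the placeholder geo_id values, and the population figures are hypothetical.

```python
# Hypothetical rows; the geo_id values are placeholders, not real Cybersyn IDs.
companies = [
    {"company": "Example Corp", "geo_id": "geo-001"},
    {"company": "Sample Inc", "geo_id": "geo-002"},
]

# A second, unrelated dataset keyed by the same geo_id (made-up figures).
population = {"geo-001": 1_000_000, "geo-002": 250_000}


def attach_by_geo_id(rows, lookup, field):
    """Return copies of rows with lookup[geo_id] stored under `field` (None if absent)."""
    return [{**row, field: lookup.get(row["geo_id"])} for row in rows]


joined = attach_by_geo_id(companies, population, "population")
for row in joined:
    print(row["company"], row["geo_id"], row["population"])
```

Because both datasets use the same identifier, no address parsing or name matching is needed to line them up.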
Recent company disclosures and attitudes toward large language models
Find filings that mention ‘large language models’ and include metadata about company industry and location.
SELECT *  -- narrow to the columns you need
FROM cybersyn.sec_report_text_attributes AS txt
JOIN cybersyn.sec_cik_index AS companies ON (companies.cik = txt.cik)
WHERE txt.period_end_date >= '2020-01-01'
  AND value ILIKE '%large language model%';
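For rows that have already been exported from the table, the same filter can be approximated locally. The sketch below mirrors ILIKE '%large language model%' as a case-insensitive substring test; the sample rows and their field names are hypothetical.

```python
def mentions(text: str, phrase: str = "large language model") -> bool:
    """Case-insensitive substring test, comparable to ILIKE '%phrase%' for a literal phrase."""
    return phrase.casefold() in text.casefold()


# Hypothetical exported rows; only the search phrase comes from the query above.
rows = [
    {"cik": "0000000001", "value": "We are investing in Large Language Models."},
    {"cik": "0000000002", "value": "Revenue increased due to higher volumes."},
]

matches = [row for row in rows if mentions(row["value"])]
print([row["cik"] for row in matches])
```

As in the SQL version, the plural "Large Language Models" matches because the singular phrase is a substring of it.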
Text of patents concerning superconductors
Query the USPTO data for patents whose raw text contains ‘superconductor’, along with who each patent was assigned to and the related patent category classifications.
WITH assignees AS (
    SELECT patent_id, ARRAY_AGG(contributor_name) AS corporate_assignee
    FROM cybersyn.uspto_contributor_index AS contributors
    JOIN cybersyn.uspto_patent_contributor_relationships AS rship
        ON (contributors.contributor_id = rship.contributor_id)
    WHERE contribution_type IN ('Assignee - United States Company Or Corporation',
                                'Assignee - Foreign Company Or Corporation')
    GROUP BY patent_id
)
SELECT *  -- narrow to the columns you need
FROM cybersyn.uspto_patent_index AS patents
JOIN assignees ON (assignees.patent_id = patents.patent_id)
WHERE patent_raw_text ILIKE '%superconductor%';
Details about the highest-value contracts awarded by government agencies
Pull descriptions of the largest Missile Defense Agency contracts awarded since 2021.
SELECT *  -- narrow to the columns you need
FROM cybersyn.government_contract_index AS contracts
JOIN cybersyn.government_contract_award_index AS awards
    ON (contracts.contract_solicitation_id = awards.contract_solicitation_id)
WHERE contracts.department = 'Dept Of Defense'
  AND contracts.agency_office = 'Missile Defense Agency (Mda)'
  AND YEAR(awards.award_date) >= 2021
ORDER BY award_amount DESC NULLS LAST;
Added three new tables:
- The COMPANY_INDEX table aggregates commonly used company identifiers (i.e., CIKs, EINs, and LEIs) into a single company_id, which can be used across Cybersyn’s datasets as a unique identifier for corporate entities.
- The COMPANY_CHARACTERISTICS table includes categorical characteristics of a company (e.g., industry, address, previous names). A characteristic may be temporal, with start and end dates indicating the range for which the data is valid.
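Temporal characteristics are typically read with an "as of" lookup. The sketch below shows one way to do that, assuming inclusive start and end dates and treating a missing end date as "still valid"; both are assumptions, and the rows and field names are hypothetical rather than the table's actual columns.

```python
from datetime import date
from typing import Optional

# Hypothetical rows; field names are illustrative, not the table's actual columns.
rows = [
    {"characteristic": "name", "value": "Old Name Inc",
     "start": date(2010, 1, 1), "end": date(2019, 12, 31)},
    {"characteristic": "name", "value": "New Name Corp",
     "start": date(2020, 1, 1), "end": None},
]


def value_as_of(rows, characteristic: str, as_of: date) -> Optional[str]:
    """Return the first value whose validity range contains `as_of`, else None."""
    for row in rows:
        if row["characteristic"] != characteristic:
            continue
        # Assumption: a missing bound means the range is open on that side.
        starts_ok = row["start"] is None or row["start"] <= as_of
        ends_ok = row["end"] is None or as_of <= row["end"]
        if starts_ok and ends_ok:
            return row["value"]
    return None


print(value_as_of(rows, "name", date(2015, 6, 1)))
print(value_as_of(rows, "name", date(2021, 6, 1)))
```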