LLM Training

Overview

LLM Training includes four text-based datasets: corporate filings with the SEC, US patent grants, US government contracts, and the OpenAlex catalog of the global research system. The product provides text-rich data for training, fine-tuning, and running inference with large language models (LLMs).

Example topics covered:

  • SEC filings including fiscal calendars, press releases, earnings, annual & quarterly reports, major company events, and quarterly fund holdings
  • US patent grants, including patent type, invention title, and contributor name and location
  • US federal government contract opportunities and awards
  • OpenAlex index of scholarly entities (e.g. works, sources, authors, funders, publishers) and how they are connected to one another

Users can leverage the data to train natural language models, summarize current events based on corporate actions and executive commentary, and track the impact of emerging technology trends.

Data Sources

A detailed description of the data is available by source. Source pages include key attributes (e.g. geographic coverage, time granularity, history, entity level), release frequency, notes & methodologies, and sample queries.

Sample Queries

Text of patents concerning superconductors

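Query the USPTO data for patents whose raw text contains ‘superconductor’, returning each patent's corporate assignees and its Cooperative Patent Classification (CPC) category descriptions. Because the query inner-joins on corporate assignees, patents with no US or foreign corporate assignee are excluded from the results.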

-- Collect the corporate assignees (US and foreign) for each patent.
WITH assignees AS (
    SELECT patent_id, ARRAY_AGG(contributor_name) AS corporate_assignee
    FROM cybersyn.uspto_contributor_index AS contributors
    JOIN cybersyn.uspto_patent_contributor_relationships AS rship
        ON (contributors.contributor_id = rship.contributor_id)
    WHERE contribution_type IN ('Assignee - United States Company Or Corporation',
                                'Assignee - Foreign Company Or Corporation')
    GROUP BY patent_id
)
-- Return patent details, CPC classification descriptions, and the raw text.
SELECT
    patents.patent_id,
    assignees.corporate_assignee,
    patents.invention_title,
    patents.patent_type,
    patents.document_publication_date,
    patents.cpc_section_description,
    patents.cpc_class_description,
    patents.cpc_subclass_description,
    patents.cpc_group_description,
    patents.cpc_subgroup_description,
    patents.patent_raw_text
FROM cybersyn.uspto_patent_index AS patents
JOIN assignees ON (assignees.patent_id = patents.patent_id)
-- Case-insensitive substring match on the raw patent text.
WHERE patent_raw_text ILIKE '%superconductor%'
LIMIT 100;
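
Rows returned by the query above can be turned into training records in many ways. The following is a minimal Python sketch of one option, not an official Cybersyn workflow. It assumes each row arrives as a dictionary keyed by column name (in either case), that `corporate_assignee` may arrive as a list or as a JSON-encoded string depending on the client, and that one JSON object per line (JSONL) is the desired output. The function name `rows_to_jsonl`, the output field names, and the sample rows are hypothetical.

```python
import io
import json
from datetime import date


def rows_to_jsonl(rows, out):
    """Write one JSON object per patent row to the file-like object `out`.

    Returns the number of records written. Rows without raw text are skipped.
    """
    written = 0
    for row in rows:
        # Normalize keys so both PATENT_ID and patent_id are accepted.
        rec = {str(key).lower(): value for key, value in row.items()}

        text = (rec.get("patent_raw_text") or "").strip()
        if not text:
            continue  # nothing usable for training in this row

        # Assumption: some clients hand back array columns as JSON strings.
        assignees = rec.get("corporate_assignee") or []
        if isinstance(assignees, str):
            assignees = json.loads(assignees)

        published = rec.get("document_publication_date")
        if isinstance(published, date):
            published = published.isoformat()

        record = {
            "id": rec.get("patent_id"),
            "title": rec.get("invention_title"),
            "assignees": list(assignees),
            "published": published,
            "text": text,
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


# Toy rows standing in for query results; all values are made up.
sample_rows = [
    {
        "PATENT_ID": "X1",
        "INVENTION_TITLE": "Example wire",
        "CORPORATE_ASSIGNEE": '["Example Corp"]',
        "DOCUMENT_PUBLICATION_DATE": date(2020, 1, 7),
        "PATENT_RAW_TEXT": " A superconductor wire. ",
    },
    {
        "PATENT_ID": "X2",
        "INVENTION_TITLE": "No text",
        "CORPORATE_ASSIGNEE": [],
        "DOCUMENT_PUBLICATION_DATE": None,
        "PATENT_RAW_TEXT": None,
    },
]

buffer = io.StringIO()
count = rows_to_jsonl(sample_rows, buffer)
```

In practice `out` would be an open file rather than an in-memory buffer. The classification columns are left out here for brevity and could be added to each record in the same way.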

Disclaimers

See the individual source pages for source-specific disclaimers. Links to provider terms, license, and disclaimers are provided where appropriate.

Cybersyn is not endorsed by or affiliated with any of these providers. Contact support@cybersyn.com for questions.