Web Traffic Foundation

Projected web traffic metrics

Web Traffic Foundation is an experimental product currently in alpha testing. Metrics from the past 7 weeks may change slightly as more sample data is added.

Overview

Web Traffic Foundation provides projected web traffic metrics for the top ~65,000 domains globally.

Metrics include weekly:

  • Sessions

  • Pageviews

  • Users

Key Attributes

Geographic Coverage

Global

Entity Level

Domain (e.g. google.com)

Time Granularity

Weekly - see visual for more details

Update Frequency

Weekly on Thursdays

History

January 1, 2021 - Present

Lag

4-day lag - see visual for more details

Description

All Cybersyn products follow the EAV (entity, attributes, value) model with a unified schema. Entities are tangible objects (e.g. URL domain, company, geography) that Cybersyn provides data on. Entities may have characteristics (i.e. descriptors of the entity) in an index table and values (i.e. statistics) that refer to an entity and a date in a timeseries table. Descriptors of a timeseries are included in an attributes table. Refer to Cybersyn Concepts for more details.

The WEBTRAFFIC_SYNDICATE_TIMESERIES table provides projected values for weekly sessions, pageviews, and users by domain over time.

The WEBTRAFFIC_SYNDICATE_ATTRIBUTES table describes each variable in wide format. Fields include MEASURE, UNIT, FREQUENCY, DEVICE, and MODEL_VERSION.

The DOMAIN_INDEX table includes a repository of over 300M domains cleaned and aggregated in a standardized format. The DOMAIN_ID strips away any subdomain (e.g. www) and protocol (e.g. https) information.

The COMPANY_DOMAIN_RELATIONSHIPS table serves as a mapping between companies and the domains that they own. The table maps DOMAIN_ID to COMPANY_ID and COMPANY_NAME which can be tied back to CIK, LEI, EIN, and company-level PermID information. All Cybersyn datasets that include Company entities use the COMPANY_ID field as the unique ID for the Company, allowing users to join as needed.
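As a rough sketch of how this mapping can be used, the query below rolls weekly pageviews up from domains to companies. It assumes that COMPANY_DOMAIN_RELATIONSHIPS sits in the same cybersyn schema as the timeseries table, that the MEASURE label for pageviews is 'Pageviews' (mirroring the 'Users' label in the example query on this page), and that DOMAIN_ID, DATE, and VALUE are columns of the timeseries table. Check these against the Data Dictionary before relying on it.

```sql
-- Sketch: weekly pageviews per company, summed over the domains each company owns.
-- Assumptions (not confirmed by this page): the schema of
-- company_domain_relationships and the 'Pageviews' measure label.
SELECT
    rel.company_id,
    rel.company_name,
    ts.date,
    SUM(ts.value) AS company_pageviews
FROM cybersyn.webtraffic_syndicate_timeseries AS ts
JOIN cybersyn.webtraffic_syndicate_attributes AS att
    ON (ts.variable = att.variable)
JOIN cybersyn.company_domain_relationships AS rel
    ON (ts.domain_id = rel.domain_id)
WHERE att.measure = 'Pageviews'
AND att.device = 'All Devices'
AND att.frequency = 'Week'
AND geo_name = 'Worldwide'
GROUP BY rel.company_id, rel.company_name, ts.date
ORDER BY rel.company_id, ts.date;
```

Pageviews are used here because they can be added across domains. Unique users cannot be summed the same way, since one person may visit several domains owned by the same company.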

The COMPANY_INDEX table serves as the spine for Cybersyn data that involves company-level identifiers. This table is a list of ~100K public and private companies aggregated from the Securities and Exchange Commission (SEC), Refinitiv, the Global Legal Entity Identifier Foundation (GLEIF), and the IRS. Each of these sources has its own unique identifier for companies (EIN, CIK, LEI, PermID), and Cybersyn maps these IDs together to allow users to join datasets using common unique identifiers. All Cybersyn datasets that include Company entities use the COMPANY_ID field as the unique ID for the Company.

The COMPANY_CHARACTERISTICS table serves as a complement to the COMPANY_INDEX table. This table includes a unique ID for each company (COMPANY_ID) and associated categorical characteristics: address, legal structure, previous names, SEC industry group, EIN, CIK, LEI, PermID, Refinitiv business sector and industry code/description, and SIC code/description. A characteristic may be temporal, with start and end dates indicating the range for which the data is valid.

Data Dictionary

See the Data Dictionary page.

Entity Relationship Diagram

Notes & Methodology

Updates to Data - Experimental Product

Web Traffic Foundation is an experimental product currently in alpha testing. Metrics from the past 7 weeks may change slightly as more sample data is added. Beyond the most recent 7 weeks, the output of each model should not change, though we reserve the right to change data with each release while the product is in alpha testing.

Domains Included

Today, web traffic estimates for ~65,000 domain URLs (e.g. google.com) are included. This number will continue to grow as we improve our data and methodologies.

Subdomain estimates (e.g. maps.google.com) are not currently available but will be added in future releases.

Estimate Accuracy

We use an out-of-sample domain (“ground truth”) set to calculate accuracy metrics:

  • Average Correlation: For larger domains (> 250k users per week), our estimates have an average correlation of ~60% with ground truth. For smaller domains (< 250k users per week), our estimates have an average correlation of ~50% with ground truth.

  • Mean Absolute Percent Error for Nominal User Counts (MAPE): For larger domains (> 250k users per week), our estimates have an average percent error of ~40% compared with ground truth. For smaller domains (< 250k users per week), our estimates have an average percent error of ~70% compared with ground truth.
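For reference, the textbook definition of MAPE is given below. This page does not state how errors are averaged across domains and weeks, so the indexing over $n$ domain-week observations is an assumption, with $y_i$ the ground-truth weekly user count and $\hat{y}_i$ the corresponding estimate.

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|
```

Under this definition, an estimate of 140k users against a ground truth of 100k users contributes an error of 40%, as does an estimate of 60k users.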

Weekly Aggregations

Data is provided at the weekly level. Each date represents the week ending on that date (always a Sunday). For example, 1/14/24 represents data from 1/8/24 - 1/14/24. See Release Timeline for a visual of the release schedule. Monthly and daily aggregations, including monthly active users (MAUs) and daily active users (DAUs), will be added in future releases.
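Because each date marks the end of its week, the start of the week can be derived by stepping back six days. The query below is a minimal sketch of that arithmetic for google.com. It reuses the tables and filters from the example query on this page and assumes a Snowflake-style DATEADD function; other SQL dialects use different date syntax.

```sql
-- Sketch: expose the Monday-Sunday window covered by each week-ending date.
-- Assumption: DATEADD(day, n, <date>) is available (Snowflake-style syntax).
SELECT
    domain_id,
    DATEADD(day, -6, date) AS week_start,  -- Monday that opens the week
    date AS week_end,                      -- Sunday that closes the week
    value AS weekly_users
FROM cybersyn.webtraffic_syndicate_timeseries AS ts
JOIN cybersyn.webtraffic_syndicate_attributes AS att
    ON (ts.variable = att.variable)
WHERE measure = 'Users'
AND device = 'All Devices'
AND frequency = 'Week'
AND geo_name = 'Worldwide'
AND domain_id = 'google.com'
ORDER BY week_end;
```

For the row dated 1/14/24, week_start resolves to 1/8/24, matching the example above.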

Makeup of User Panel

The panel of web traffic users is primarily sourced from desktop users. Domains that skew heavily mobile may be underrepresented in the panel. The user panel does not include users from China. As a result, the largest domains from China are largely excluded.

User-level Metrics

Measures for “Users” are meant to represent the number of unique active users for the given time period set in the FREQUENCY field.

Model Versions

A “model version” is included to create transparency in estimates and changes in methodologies. As models improve, this will give users the ability to evaluate how predictions have changed over time. Additionally, customers can choose to use a previous model version to limit the impact of data changes on outputs that rely on the data.

Numerous model versions will be published in parallel as new methodologies are developed. More recently timestamped model versions will feature incremental improvements over older model versions.
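Since several model versions may be present at once, queries may need to restrict results to one version to avoid mixing estimates. The query below is a sketch of one way to do that. It assumes MODEL_VERSION labels sort in timestamp order, so that MAX returns the most recent version; this page does not specify the label format.

```sql
-- Sketch: weekly user estimates restricted to a single model version.
-- Assumption: MAX(model_version) picks the most recent version because
-- version labels sort in timestamp order. To pin an older version,
-- replace the subquery with that version's literal label.
SELECT
    ts.domain_id,
    ts.date,
    att.model_version,
    ts.value AS weekly_users
FROM cybersyn.webtraffic_syndicate_timeseries AS ts
JOIN cybersyn.webtraffic_syndicate_attributes AS att
    ON (ts.variable = att.variable)
WHERE att.measure = 'Users'
AND att.device = 'All Devices'
AND att.frequency = 'Week'
AND att.model_version = (
    SELECT MAX(model_version)
    FROM cybersyn.webtraffic_syndicate_attributes
)
ORDER BY ts.domain_id, ts.date;
```

Pinning a literal version instead of the subquery keeps downstream outputs stable when a newer model is released.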

Release Timeline

Example Use Cases & Queries

Benchmark company web traffic metrics against industry peers

Compare weekly users of airbnb.com relative to vrbo.com.

-- Weekly worldwide user estimates across all devices, for every domain
WITH select_data AS (
    SELECT domain_id, geo_name, measure, date, value
    FROM cybersyn.webtraffic_syndicate_timeseries AS ts
    JOIN cybersyn.webtraffic_syndicate_attributes AS att
        ON (ts.variable = att.variable)
    WHERE measure = 'Users'
    AND device = 'All Devices'
    AND frequency = 'Week'
    AND geo_name = 'Worldwide'
), airbnb AS (
    SELECT date, value AS airbnb_users
    FROM select_data
    WHERE domain_id = 'airbnb.com'
), vrbo AS (
    SELECT date, value AS vrbo_users
    FROM select_data
    WHERE domain_id = 'vrbo.com'
)
SELECT
    t1.date,
    airbnb_users,
    vrbo_users,
    -- NULLIF returns NULL instead of raising a division-by-zero error
    airbnb_users / NULLIF(vrbo_users, 0) AS airbnb_relative_to_vrbo
FROM airbnb AS t1
-- LEFT JOIN keeps airbnb.com weeks that have no matching vrbo.com row
LEFT JOIN vrbo AS t2
    ON (t1.date = t2.date)
ORDER BY t1.date;

Releases & Changelog

1/26/24 - Added additional AI-focused domains; improved pre-2022 model quality

Added additional domains and limited subdomains to measure traffic for popular AI companies. New domains include chat.openai.com, bard.google.com, claude.ai, and more.

Improved model quality to more accurately predict web traffic in older periods before 2022.

Errata & Future Improvements

We note known issues and planned future improvements. If you would like to submit a bug report or feature request, email us at support@cybersyn.com.

Terms

Customers are subject to the Cybersyn terms of service.

Copyright © 2024 Cybersyn