Data

SportsCapital's Data Operations team trains and evaluates artificial intelligence models. These models are the foundation of how we collect and process information.

Data Collection

SportsCapital maintains a data infrastructure that continuously monitors and ingests content from various news sources. News reports are processed and available in our API less than 8 seconds after publication (as of May 1, 2025).

See coverage by reporter type:

League  Total Reporters  Regional Reporters  National Reporters
MLB     248              207                 41
NBA     178              134                 44
NHL     137              105                 32
Total   563              446                 117

Data Processing

Once the information is collected, we run it through a series of processing steps that generate a snippet, classify it, link entities, and extract key information.

    graph LR
    A[Social Media] --> D[Data Ingest]
    B[Articles] --> D
    C[Videos/Podcasts] --> D
    D --> E[Text Extraction]
    E --> F[Snippet Generation]
    F --> G[Entity Linking]
    F --> H[News Classification]
    F --> I[Relevancy Scoring]
    F --> Q[Snippet Linking]
    G --> J[Enriched Data]
    H --> J
    I --> J
    Q --> J
    J --> K[API]
    J --> L[Platform]
    J --> M[Integrations]
    M --> N[Slack]
    M --> O[Amazon S3]
    M --> P[MS Teams]
    
    classDef sources fill:#f5f5f5,stroke:#333;
    classDef ingest fill:#d4e6f1,stroke:#2874a6;
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef enrich fill:#e8daef,stroke:#6c3483;
    classDef enrichedData fill:#fadbd8,stroke:#943126;
    classDef delivery fill:#fdebd0,stroke:#b9770e;
    classDef integrations fill:#fdebd0,stroke:#b9770e;
    
    class A,B,C sources;
    class D ingest;
    class E,F process;
    class G,H,I,Q enrich;
    class J enrichedData;
    class K,L delivery;
    class M,N,O,P integrations;
    

Snippets

Snippets are short segments of text taken from a larger body of content. Social media posts are cleaned by removing elements like hashtags and tags. Snippets from longer-form content, such as broadcast feeds, articles, podcasts, and videos, are generated using LLMs to retain only key information.
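As an illustration of the social media cleaning step, the sketch below strips hashtags and @-mentions with regular expressions. It assumes that "tags" refers to @-mentions. The function name `clean_post` and the exact rules are hypothetical and are not SportsCapital's actual implementation.

```python
import re

# Hypothetical cleaning rule: drop #hashtags and @mentions.
_TAG_PATTERN = re.compile(r"[#@]\w+")


def clean_post(text: str) -> str:
    """Remove hashtags and @-mentions from a social media post."""
    without_tags = _TAG_PATTERN.sub("", text)
    # Collapse the whitespace gaps left behind by the removed tokens.
    return re.sub(r"\s+", " ", without_tags).strip()


print(clean_post("Player X is out tonight #NHL @TeamAccount"))
```

A production cleaner would likely need more rules (for example, handling links or a leading reporter handle), which this sketch leaves out.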

Entities

Entities are the players, teams, staff, and agents associated with a league. We keep our entity database current with daily roster updates and the latest staff movements.

See the definitions of each entity type below:

  • Players: We track only players on the active daily major league roster and the injured reserve list. As of January 10, 2025, this does not include minor league players or prospects.

  • Teams: We track only major league teams. As of January 10, 2025, this does not include minor league or international clubs.

  • Staff: We track all coaches, executives, scouts, and medical staff listed on the active roster (if the information is available).

  • Agents: We track only licensed agents (if the information is available).

Enrichments

SportsCapital uses proprietary models to extract and generate additional metadata for each snippet, which enables advanced search and alerting. Our primary models are described below.

Entity Linking

Our proprietary model links an entity to a snippet whenever the snippet mentions the entity’s full name, last name, or nickname. The linking process combines fuzzy matching with contextual awareness to identify the entities mentioned in the text.

    graph LR
    A[Snippet Text] --> B[Preprocessing]
    B --> C[Entity Candidates]
    C --> D{Type Analysis}
    D -->|Teams| E[Team Matching]
    D -->|People| F[People Matching]
    E --> G[Entity Prioritization]
    F --> G
    G --> H[Final Entity Matches]
    
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef decision fill:#e8daef,stroke:#6c3483;
    classDef result fill:#fadbd8,stroke:#943126;
    
    class A,B,C process;
    class D decision;
    class E,F,G process;
    class H result;
    

Our entity linking system resolves difficult identifications in three stages:

  1. Preprocessing & Candidate Generation

    • Handle hyphenated names, remove noise, and generate multi-word entity mentions

    • Match against our comprehensive entity database including nicknames and variations

  2. Contextual Analysis

    • Leverage reporter affiliations to prioritize team-relevant entities

    • Apply separate matching strategies for teams vs. individuals

    • Consider historical patterns in entity mentions

  3. Smart Entity Prioritization

    • Resolve ambiguity between entities with similar names

    • Weight prioritization based on team affiliations, entity types, and mention history

    • Apply different threshold values based on entity context

    • Use metadata like player age as tiebreakers when relevant

This approach handles cases such as distinguishing between players with identical surnames, recognizing nickname variations, and identifying entities in ambiguous contexts.
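The sketch below shows one way fuzzy matching and a team-affiliation prior could be combined to resolve a shared surname. It is a simplified illustration only: the `Entity` class, the `link_mention` function, the threshold, and the boost value are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever matching algorithm the proprietary model uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Entity:
    name: str
    team: str
    aliases: tuple[str, ...] = ()


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_mention(
    mention: str,
    entities: list[Entity],
    reporter_team: Optional[str] = None,
    threshold: float = 0.85,   # hypothetical cutoff
    team_boost: float = 0.05,  # hypothetical contextual prior
) -> Optional[Entity]:
    """Return the best-matching entity, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for entity in entities:
        # Compare against the full name, the last name, and any nicknames.
        candidates = (entity.name, entity.name.split()[-1], *entity.aliases)
        score = max(similarity(mention, c) for c in candidates)
        # Favor entities on the team the reporter is affiliated with.
        if reporter_team is not None and entity.team == reporter_team:
            score += team_boost
        if score > best_score:
            best, best_score = entity, score
    return best if best_score >= threshold else None


# Made-up roster with two players who share a surname.
roster = [Entity("Alex Smith", "Team A"), Entity("Jordan Smith", "Team B")]
print(link_mention("Smith", roster, reporter_team="Team B"))
```

Here the bare surname matches both players equally well, so the reporter's affiliation breaks the tie. A mention with no close candidate returns `None` rather than a forced match.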

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our entity linking process. The current metrics below are as of January 10, 2025.

NHL

Metric       Value
Annotations  1064
F1 Score     0.912
Precision    0.908
Recall       0.918

NBA

Metric       Value
Annotations  198
F1 Score     0.928
Precision    0.944
Recall       0.913
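For reference, the three metrics relate to each other as in the sketch below. The counts are invented for illustration, and the source does not state how scores are aggregated (for example, micro versus macro averaging), so this shows only the standard definitions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented counts: 90 correct links, 10 spurious links, 30 missed links.
print(precision_recall_f1(tp=90, fp=10, fn=30))

# NBA entity linking precision and recall as reported above.
print(round(f1_score(0.944, 0.913), 3))
```

Recomputing F1 from rounded precision and recall can differ from a published F1 in the last digit, since the published value may come from unrounded inputs or a different averaging scheme.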

News Classification

Our news classification model categorizes each snippet by news type and provides a confidence score for every possible news type. A higher probability indicates stronger confidence that the snippet relates to that news type.

News Types

News Type    Description
Injury       Reports about an acute player injury event (e.g., a player limping, hard collisions)
Recovery     Reports about a player returning to availability (e.g., cleared to play, returning from an injury)
Diagnosis    Updates on a player's injury status after the incident, but before recovery (e.g., MRI results, estimated return timeline)
Lineups      Updates on players starting, sitting, or being moved around the lineup, including minor leagues
Practice     Updates about practice participation, training sessions, and post-practice interviews
Suspension   Reports on players or staff members getting suspended from games and team activities
Performance  Commentary on player or team performance historically and during live games
Trade        Reports about completed or prospective player trades between teams
Contract     Updates about player or staff contract negotiations, extensions, or terms
Draft        Reports about the draft or draft prospects, including prospective draft pick decisions
Firing       Reports about coaches, executives, or staff being fired or stepping down
Hiring       Reports about new coaches, executives, or staff being hired
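A consumer of these scores will typically pick a cutoff and keep the news types that clear it. The sketch below shows one way to do that. The function name `top_news_types` and the 0.5 default threshold are assumptions for illustration, not recommended values.

```python
def top_news_types(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the news types whose confidence meets the threshold, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Illustrative scores keyed by news type.
scores = {"injury": 0.97223, "performance": 0.03939, "trade": 0.00136}
print(top_news_types(scores))
```

Because each news type has its own score, a snippet may clear the threshold for several types or for none, and callers should handle both cases.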

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our news classification model. The current metrics below are as of April 29, 2025.

NHL

Metric       Value
Annotations  1799
F1 Score     0.849
Precision    0.860
Recall       0.837

NBA

Metric       Value
Annotations  2000
F1 Score     0.873
Precision    0.881
Recall       0.866

Relevancy

Our proprietary relevancy model scores how contextually relevant a snippet is to sports betting. This binary classification model produces a single probability score: a probability closer to 1 indicates stronger confidence that the snippet is relevant for decision-makers. Users can filter snippets by relevancy in our API.

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our relevancy model. The current metrics below are as of April 29, 2025.

NHL

Metric       Value
Annotations  369
F1 Score     0.950
Precision    0.931
Recall       0.970

NBA

Metric       Value
Annotations  364
F1 Score     0.976
Precision    0.963
Recall       0.989

Snippet Linking

Our system identifies when snippets contain the same core information and establishes parent-child relationships between them.

    graph LR
    A[New Snippet Ingested] --> B[Retrieve]
    B --> H[Rank]
    H --> J{Match Probability}
    J -->|Below Threshold| K[Original Snippet<br>parent_id: null]
    J -->|Above Threshold| L[Linked Snippet<br>parent_id: existing_id]
    
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef phase fill:#d4e6f1,stroke:#2874a6;
    classDef decision fill:#fadbd8,stroke:#943126;
    classDef result fill:#fdebd0,stroke:#b9770e;
    
    class A process;
    class B,H phase;
    class J decision;
    class K,L result;
    

Our two-phase approach:

  1. Retrieve: First we efficiently filter the candidate pool:

    • Apply practical filters (time window, entity overlap, news type)

    • Deploy our in-house fine-tuned bi-encoder for semantic similarity search

    • Select top candidates based on relevance scores

  2. Rank: Then we perform precise comparison:

    • Our proprietary fine-tuned cross-encoder evaluates semantic equivalence

    • Probability scoring determines parent-child relationships

    • High-confidence matches establish linking structure
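The control flow of the two phases can be sketched as follows. The scoring functions here are deliberately simple stand-ins: word-count cosine similarity replaces the bi-encoder, and Jaccard overlap replaces the cross-encoder. The function names, `top_k`, and the threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Optional


def tokens(text: str) -> list[str]:
    return text.lower().split()


def cosine_similarity(a: str, b: str) -> float:
    """Toy stand-in for bi-encoder similarity: cosine over word counts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def match_probability(a: str, b: str) -> float:
    """Toy stand-in for cross-encoder scoring: Jaccard overlap of word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def link_snippet(new_text: str, existing: dict[int, str], top_k: int = 3,
                 threshold: float = 0.6) -> Optional[int]:
    """Return a parent_id for the new snippet, or None if it is an original report.

    `existing` maps snippet_id to text and is assumed to be pre-filtered
    by time window, entity overlap, and news type.
    """
    # Phase 1 (Retrieve): keep the top_k most similar candidates.
    retrieved = sorted(
        existing, key=lambda sid: cosine_similarity(new_text, existing[sid]), reverse=True
    )[:top_k]
    # Phase 2 (Rank): score each candidate more precisely and keep the best.
    best_id, best_prob = None, 0.0
    for sid in retrieved:
        prob = match_probability(new_text, existing[sid])
        if prob > best_prob:
            best_id, best_prob = sid, prob
    return best_id if best_prob >= threshold else None


# Made-up snippets for illustration.
existing = {1: "player x out with knee injury", 2: "team y signs new head coach"}
print(link_snippet("player x out with knee injury tonight", existing))
print(link_snippet("team z trades for a goalie", existing))
```

The split keeps the expensive comparison off most of the corpus: the cheap retrieval score narrows the pool, and only the short list is scored pairwise.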

Using in Search

Filter with the original_report parameter:

  • true: Returns only snippets without a parent_id (original reports)

  • false: Returns only snippets with a parent_id (derivative reports)

  • null: No filtering based on parent_id (default)
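The three settings behave like the local filter sketched below. This mirrors the documented semantics on a list of already-fetched snippets and does not call the API. The function name and the sample records are hypothetical.

```python
from typing import Optional


def filter_by_original_report(snippets: list[dict],
                              original_report: Optional[bool] = None) -> list[dict]:
    """Apply the original_report semantics to snippet dicts that carry a parent_id."""
    if original_report is None:
        # null: no filtering based on parent_id.
        return list(snippets)
    if original_report:
        # true: keep only original reports (no parent).
        return [s for s in snippets if s.get("parent_id") is None]
    # false: keep only derivative reports (linked to a parent).
    return [s for s in snippets if s.get("parent_id") is not None]


# Made-up records: snippet 2 is a derivative of snippet 1.
sample = [{"snippet_id": 1, "parent_id": None}, {"snippet_id": 2, "parent_id": 1}]
print([s["snippet_id"] for s in filter_by_original_report(sample, True)])
```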

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our snippet linking process. The current metrics below are as of April 29, 2025.

NHL

Metric       Value
Annotations  125
F1 Score     0.822
Precision    0.769
Recall       0.882

NBA

Metric       Value
Annotations  355
F1 Score     0.940
Precision    0.946
Recall       0.934

Enriched Snippet Example

Original Tweet

@Trevor_Lane: Ja Morant’s head hit Post’s leg so hard that Post’s leg is bleeding

Enriched Output

{
    "snippet": {
        "snippet_id": 493109,
        "league_id": "729c2399-f94a-4db3-9710-e2a5b3c4d0e3",
        "snippet_text": "Ja Morant's head hit Post's leg so hard that Post's leg is bleeding",
        "publish_date": "2025-04-16T04:31:21",
        "url": "https://x.com/Trevor_Lane/status/1912363158825578922",
        "extracted_reporter": "Trevor Lane",
        "extracted_source": "Twitter",
        "extracted_urls": [],
        "source_type": "Social Media",
        "author_name": "Trevor Lane",
        "in_play": true,
        "parent_id": null
    },
    "entities": [
        {
            "matched_entity": "Ja Morant",
            "entity_type": "sc_player",
            "entity_id": "46237ce5-fc56-baea-70d0-54b0c38eff0a",
            "entity_type_id": 1
        },
        {
            "matched_entity": "Quinten Post",
            "entity_type": "sc_player",
            "entity_id": "b1b537d6-0337-9542-b52a-88bc35e5e1e7",
            "entity_type_id": 1
        }
    ],
    "news_types": {
        "contract": 0.00186,
        "draft": 0.00142,
        "firing": 0.00154,
        "hiring": 0.00135,
        "injury": 0.97223,
        "diagnosis": 0.01453,
        "recovery": 0.00227,
        "lineup": 0.00637,
        "other": 0.02345,
        "performance": 0.03939,
        "practice": 0.02688,
        "relevant": 1.0,
        "suspension": 0.00394,
        "trade": 0.00136
    }
}

Key Insights

The enriched data transforms a simple tweet into structured intelligence:

  • Entity Recognition: Identifies both “Ja Morant” and “Quinten Post” (referenced as “Post”)

  • News Classification: Categorized as injury (0.972) with high relevance (1.0)

  • Source Information: Includes reporter name, publication timestamp, and source platform (Twitter)

  • Linking: Marked as an original report (parent_id: null)

This data structure enables filtering by player, news type, relevance, and original reporting, so users receive information targeted to their needs.
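As a sketch of such filtering, the function below checks an enriched record against a player name, a news type, a relevance floor, and the original-report flag. The field names come from the example output above. The function name `matches` and the default thresholds are assumptions, and the record is a trimmed copy of the example.

```python
from typing import Optional


def matches(enriched: dict, player: Optional[str] = None, news_type: Optional[str] = None,
            min_type_score: float = 0.5, min_relevance: float = 0.5,
            original_only: bool = False) -> bool:
    """Return True if the enriched record passes every requested filter."""
    scores = enriched["news_types"]
    if player is not None:
        names = {entity["matched_entity"] for entity in enriched["entities"]}
        if player not in names:
            return False
    if news_type is not None and scores.get(news_type, 0.0) < min_type_score:
        return False
    if scores.get("relevant", 0.0) < min_relevance:
        return False
    if original_only and enriched["snippet"].get("parent_id") is not None:
        return False
    return True


# Trimmed copy of the enriched output shown above.
record = {
    "snippet": {"snippet_id": 493109, "parent_id": None},
    "entities": [{"matched_entity": "Ja Morant"}, {"matched_entity": "Quinten Post"}],
    "news_types": {"injury": 0.97223, "trade": 0.00136, "relevant": 1.0},
}
print(matches(record, player="Ja Morant", news_type="injury", original_only=True))
```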