Data

SportsCapital's Data Operations team trains and evaluates the artificial intelligence models that underpin how we collect and process information.

Data Collection

SportsCapital maintains a data infrastructure that continuously monitors and ingests content from a range of news sources. Once a news report is published, it is processed and available in our API in less than 8 seconds (as of May 1, 2025).

See coverage by reporter type:

League	Total Reporters	Regional Reporters	National Reporters
MLB	248	207	41
NBA	178	134	44
NHL	137	105	32
Total	563	446	117

Data Processing

Once the information is collected, we run it through a series of processing steps that generate a snippet, classify it, link entities, and extract key information.

    graph LR
    A[Social Media] --> D[Data Ingest]
    B[Articles] --> D
    C[Videos/Podcasts] --> D
    D --> E[Text Extraction]
    E --> F[Snippet Generation]
    F --> G[Entity Linking]
    F --> H[News Classification]
    F --> I[Relevancy Scoring]
    F --> Q[Snippet Linking]
    G --> J[Enriched Data]
    H --> J
    I --> J
    Q --> J
    J --> K[API]
    J --> L[Platform]
    J --> M[Integrations]
    M --> N[Slack]
    M --> O[Amazon S3]
    M --> P[MS Teams]
    
    classDef sources fill:#f5f5f5,stroke:#333;
    classDef ingest fill:#d4e6f1,stroke:#2874a6;
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef enrich fill:#e8daef,stroke:#6c3483;
    classDef enrichedData fill:#fadbd8,stroke:#943126;
    classDef delivery fill:#fdebd0,stroke:#b9770e;
    classDef integrations fill:#fdebd0,stroke:#b9770e;
    
    class A,B,C sources;
    class D ingest;
    class E,F process;
    class G,H,I,Q enrich;
    class J enrichedData;
    class K,L delivery;
    class M,N,O,P integrations;

Snippets

Snippets are short extractions or segments of text taken from a larger body of content. Social media posts are cleaned by removing elements like hashtags and tags. Snippets from longer form content, such as broadcast feeds, articles, podcasts, and videos, are generated using LLMs to retain only key information.
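
As a rough illustration of the social media cleaning step, the sketch below strips hashtags and @-tags with a regular expression. The patterns and the function name `clean_social_post` are assumptions made for this example, since the production cleaning rules are not documented here. In particular, it assumes the whole hashtag is dropped rather than only the `#` symbol.

```python
import re

# Hypothetical patterns: the documentation only states that hashtags and tags are removed.
# The lookbehind avoids touching things like email addresses (e.g. "a@b.com").
HASHTAG_OR_TAG = re.compile(r"(?<!\w)[#@]\w+")
EXTRA_WHITESPACE = re.compile(r"\s+")


def clean_social_post(text: str) -> str:
    """Remove hashtags and @-tags, then collapse the leftover whitespace."""
    without_tags = HASHTAG_OR_TAG.sub("", text)
    return EXTRA_WHITESPACE.sub(" ", without_tags).strip()


print(clean_social_post("@NBA Morant is limping after that play #Grizzlies #NBAPlayoffs"))
# Morant is limping after that play
```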

Entities

Entities are the players, teams, staff, and agents associated with a league. We keep our entity database current with daily roster updates and the latest staff movements.

See the definitions of each entity type below:

Players: We only track players on the active daily major league roster and injured reserve list. Minor league players and prospects are not included (as of January 10, 2025).
Teams: We only track major league teams. Minor league and international clubs are not included (as of January 10, 2025).
Staff: We track all coaches, executives, scouts, and medical staff listed on the active roster (if the information is available).
Agents: We only track licensed agents (if the information is available).

Enrichments

SportsCapital uses proprietary models to extract and generate additional metadata for each snippet. This metadata lets users perform advanced searching and alerting. Our primary models are described below.

Entity Linking

Our proprietary model links an entity to a snippet whenever the snippet mentions the entity’s full name, last name, or nickname. The linking process uses fuzzy matching together with contextual signals to identify the entities mentioned in the text.

    graph LR
    A[Snippet Text] --> B[Preprocessing]
    B --> C[Entity Candidates]
    C --> D{Type Analysis}
    D -->|Teams| E[Team Matching]
    D -->|People| F[People Matching]
    E --> G[Entity Prioritization]
    F --> G
    G --> H[Final Entity Matches]
    
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef decision fill:#e8daef,stroke:#6c3483;
    classDef result fill:#fadbd8,stroke:#943126;
    
    class A,B,C process;
    class D decision;
    class E,F,G process;
    class H result;

Our entity linking system combines preprocessing techniques, fuzzy matching, and contextual awareness to resolve complex identification challenges:

Preprocessing & Candidate Generation
- Handle hyphenated names, remove noise, and generate multi-word entity mentions
- Match against our comprehensive entity database including nicknames and variations
Contextual Analysis
- Leverage reporter affiliations to prioritize team-relevant entities
- Apply separate matching strategies for teams vs. individuals
- Consider historical patterns in entity mentions
Smart Entity Prioritization
- Intelligently resolve ambiguity between entities with similar names
- Weight prioritization based on team affiliations, entity types, and mention history
- Apply different threshold values based on entity context
- Use metadata like player age as tiebreakers when relevant

This approach effectively handles complex scenarios like distinguishing between players with identical surnames, recognizing nickname variations, and correctly identifying entities in ambiguous contexts.
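
The sketch below shows one way fuzzy matching and reporter-based prioritization could fit together. It is not the proprietary model: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The entity names, team codes, `threshold`, and `team_bonus` values are all invented for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Tuple


@dataclass(frozen=True)
class Entity:
    name: str
    team: str
    aliases: Tuple[str, ...] = ()


# Hypothetical entities; names and team codes are invented for this example.
ENTITIES = [
    Entity("Alex Carter", "AAA", ("Carter",)),
    Entity("Ben Carter", "BBB", ("Carter",)),
    Entity("Sam Lee-Ortiz", "AAA", ("Lee-Ortiz",)),
]


def normalize(text: str) -> str:
    """Lowercase and treat hyphens as spaces so 'Lee-Ortiz' matches 'lee ortiz'."""
    return text.lower().replace("-", " ").strip()


def similarity(a: str, b: str) -> float:
    """Fuzzy string similarity in [0, 1]; a stand-in for the real matcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_entity(mention: str, reporter_team: Optional[str] = None,
                threshold: float = 0.85, team_bonus: float = 0.1) -> Optional[Entity]:
    """Return the best-scoring entity for a mention, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for entity in ENTITIES:
        labels = (entity.name, *entity.aliases)
        score = max(similarity(mention, label) for label in labels)
        if score < threshold:
            continue
        # Assumed prioritization rule: favor entities on the team the reporter covers.
        if reporter_team is not None and entity.team == reporter_team:
            score += team_bonus
        if score > best_score:
            best, best_score = entity, score
    return best
```

With these toy entities, `link_entity("Carter", reporter_team="BBB")` resolves to Ben Carter, while the same mention from a reporter covering `AAA` resolves to Alex Carter. That is the identical-surname case described above.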

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our entity linking process. The current metrics below are as of January 10, 2025.

NHL

Metric	Value
Annotations	1064
F1 Score	0.912
Precision	0.908
Recall	0.918

NBA

Metric	Value
Annotations	198
F1 Score	0.928
Precision	0.944
Recall	0.913
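
For reference, the sketch below shows how the three metrics relate under their standard definitions. The true positive, false positive, and false negative counts are made-up inputs, and how annotations are scored as matches is not described here. The final line recomputes F1 from the NBA precision and recall reported above.

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute the three metrics from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, harmonic_mean(precision, recall)


# Hypothetical counts, for illustration only.
print(precision_recall_f1(tp=90, fp=10, fn=30))

# Recomputing F1 from the NBA precision and recall reported above.
print(round(harmonic_mean(0.944, 0.913), 3))  # 0.928
```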

News Classification

Our news classification model automatically categorizes snippets by news type and provides a confidence score for each possible type. A higher probability indicates stronger confidence that the snippet belongs to that news type.

News Types

News Type	Description
Injury	Reports about an acute player injury event (e.g., a player limping, hard collisions)
Recovery	Reports about a player returning to availability (e.g., cleared to play, returning from an injury)
Diagnosis	Updates on a player's injury status after the incident, but before recovery (e.g., MRI results, estimated return timeline)
Lineups	Updates on players starting, sitting, or being moved around the lineup, including minor leagues
Practice	Updates about practice participation, training sessions, and post-practice interviews
Suspension	Reports on players or staff members getting suspended from games and team activities
Performance	Commentary on player or team performance historically and during live games
Trade	Reports about completed or prospective player trades between teams
Contract	Updates about player or staff contract negotiations, extensions, or terms
Draft	Reports about the draft or draft prospects, including prospective draft pick decisions
Firing	Reports about coaches, executives, or staff being fired or stepping down
Hiring	Reports about new coaches, executives, or staff being hired
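
A consumer of these scores might keep only the news types that clear a confidence cutoff. The sketch below assumes the scores arrive as a dictionary keyed by lowercase news type, as in the enriched output example on this page. The 0.5 cutoff and the function name are illustrative choices, not documented defaults.

```python
def confident_news_types(scores: dict, threshold: float = 0.5, skip=("relevant",)):
    """Return (news_type, probability) pairs at or above the threshold, highest first.

    The "relevant" key is skipped because it carries the relevancy score, not a news type.
    """
    picked = [
        (name, probability)
        for name, probability in scores.items()
        if name not in skip and probability >= threshold
    ]
    return sorted(picked, key=lambda item: item[1], reverse=True)


# Scores excerpted from the enriched output example on this page.
example_scores = {"injury": 0.97223, "diagnosis": 0.01453, "performance": 0.03939, "relevant": 1.0}
print(confident_news_types(example_scores))  # [('injury', 0.97223)]
```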

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our news classification model. The current metrics below are as of April 29, 2025.

NHL

Metric	Value
Annotations	1799
F1 Score	0.849
Precision	0.860
Recall	0.837

NBA

Metric	Value
Annotations	2000
F1 Score	0.873
Precision	0.881
Recall	0.866

Relevancy

Our proprietary relevancy model is a binary classifier that scores how contextually relevant a snippet is to sports betting. It outputs a single probability; a value closer to 1 indicates stronger confidence that the snippet is relevant for decision-makers. Users can filter snippets by relevancy in our API.

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our relevancy model. The current metrics below are as of April 29, 2025.

NHL

Metric	Value
Annotations	369
F1 Score	0.950
Precision	0.931
Recall	0.970

NBA

Metric	Value
Annotations	364
F1 Score	0.976
Precision	0.963
Recall	0.989

Snippet Linking

Our system identifies when snippets contain the same core information and establishes parent-child relationships between them.

    graph LR
    A[New Snippet Ingested] --> B[Retrieve]
    B --> H[Rank]
    H --> J{Match Probability}
    J -->|Below Threshold| K[Original Snippet<br>parent_id: null]
    J -->|Above Threshold| L[Linked Snippet<br>parent_id: existing_id]
    
    classDef process fill:#d1f0e0,stroke:#196f3d;
    classDef phase fill:#d4e6f1,stroke:#2874a6;
    classDef decision fill:#fadbd8,stroke:#943126;
    classDef result fill:#fdebd0,stroke:#b9770e;
    
    class A process;
    class B,H phase;
    class J decision;
    class K,L result;

Our two-phase approach:

Retrieve: First we efficiently filter the candidate pool:
- Apply practical filters (time window, entity overlap, news type)
- Deploy our in-house fine-tuned bi-encoder for semantic similarity search
- Select top candidates based on relevance scores
Rank: Then we perform precise comparison:
- Our proprietary fine-tuned cross-encoder evaluates semantic equivalence
- Probability scoring determines parent-child relationships
- High-confidence matches establish linking structure
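
The two phases can be sketched as follows. This is a toy version under stated assumptions: token overlap stands in for the bi-encoder, `difflib.SequenceMatcher` stands in for the cross-encoder, and the 24-hour window, `top_k`, and 0.8 threshold are invented values. It also assumes that only original snippets can act as parents. Only the time window and entity overlap filters are shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Snippet:
    snippet_id: int
    text: str
    publish_date: datetime
    entity_ids: frozenset
    parent_id: Optional[int] = None


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a cheap stand-in for bi-encoder similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def retrieve(new: Snippet, pool: List[Snippet],
             window: timedelta = timedelta(hours=24), top_k: int = 5) -> List[Snippet]:
    """Phase 1: apply practical filters, then keep the top_k most similar candidates."""
    candidates = [
        s for s in pool
        if s.parent_id is None  # assumption: only originals can be parents
        and abs(new.publish_date - s.publish_date) <= window
        and new.entity_ids & s.entity_ids
    ]
    candidates.sort(key=lambda s: token_overlap(new.text, s.text), reverse=True)
    return candidates[:top_k]


def rank(new: Snippet, candidates: List[Snippet], threshold: float = 0.8) -> Optional[int]:
    """Phase 2: score each candidate precisely and return the parent id, if any."""
    best_id, best_score = None, 0.0
    for candidate in candidates:
        # Stand-in for the cross-encoder's match probability.
        score = SequenceMatcher(None, new.text.lower(), candidate.text.lower()).ratio()
        if score > best_score:
            best_id, best_score = candidate.snippet_id, score
    return best_id if best_score >= threshold else None


def link(new: Snippet, pool: List[Snippet]) -> Snippet:
    """Set parent_id to the matched original, or leave it as None."""
    new.parent_id = rank(new, retrieve(new, pool))
    return new
```

Keeping the cheap filter in front of the expensive comparison means the precise scorer only runs on a handful of candidates per new snippet.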

Using in Search

Filter with the original_report parameter:

true: Returns only snippets without a parent_id (original reports)
false: Returns only snippets with a parent_id (derivative reports)
null: No filtering based on parent_id (default)
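
The three settings behave like the client-side filter sketched below. This only illustrates the semantics over already-fetched snippet dictionaries. The function name is hypothetical, and the shape of the actual API request is not shown here.

```python
from typing import Iterable, List, Optional


def filter_original_report(snippets: Iterable[dict],
                           original_report: Optional[bool] = None) -> List[dict]:
    """Mimic the original_report filter: True, False, or None (no filtering)."""
    if original_report is None:
        return list(snippets)
    if original_report:
        # Original reports have no parent.
        return [s for s in snippets if s.get("parent_id") is None]
    # Derivative reports point at an existing snippet.
    return [s for s in snippets if s.get("parent_id") is not None]


# Hypothetical snippets; the second is a derivative of the first.
sample = [
    {"snippet_id": 493109, "parent_id": None},
    {"snippet_id": 493110, "parent_id": 493109},
]
print([s["snippet_id"] for s in filter_original_report(sample, True)])   # [493109]
print([s["snippet_id"] for s in filter_original_report(sample, False)])  # [493110]
```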

Evaluation Metrics

We use Recall, Precision, and F1 Scores to evaluate our snippet linking process. The current metrics below are as of April 29, 2025.

NHL

Metric	Value
Annotations	125
F1 Score	0.822
Precision	0.769
Recall	0.882

NBA

Metric	Value
Annotations	355
F1 Score	0.940
Precision	0.946
Recall	0.934

Enriched Snippet Example

Original Tweet

@Trevor_Lane: Ja Morant’s head hit Post’s leg so hard that Post’s leg is bleeding

Enriched Output

{
    "snippet": {
        "snippet_id": 493109,
        "league_id": "729c2399-f94a-4db3-9710-e2a5b3c4d0e3",
        "snippet_text": "Ja Morant's head hit Post's leg so hard that Post's leg is bleeding",
        "publish_date": "2025-04-16T04:31:21",
        "url": "https://x.com/Trevor_Lane/status/1912363158825578922",
        "extracted_reporter": "Trevor Lane",
        "extracted_source": "Twitter",
        "extracted_urls": [],
        "source_type": "Social Media",
        "author_name": "Trevor Lane",
        "in_play": true,
        "parent_id": null
    },
    "entities": [
        {
            "matched_entity": "Ja Morant",
            "entity_type": "sc_player",
            "entity_id": "46237ce5-fc56-baea-70d0-54b0c38eff0a",
            "entity_type_id": 1
        },
        {
            "matched_entity": "Quinten Post",
            "entity_type": "sc_player",
            "entity_id": "b1b537d6-0337-9542-b52a-88bc35e5e1e7",
            "entity_type_id": 1
        }
    ],
    "news_types": {
        "contract": 0.00186,
        "draft": 0.00142,
        "firing": 0.00154,
        "hiring": 0.00135,
        "injury": 0.97223,
        "diagnosis": 0.01453,
        "recovery": 0.00227,
        "lineup": 0.00637,
        "other": 0.02345,
        "performance": 0.03939,
        "practice": 0.02688,
        "relevant": 1.0,
        "suspension": 0.00394,
        "trade": 0.00136
    }
}

Key Insights

The enriched data transforms a simple tweet into structured intelligence:

Entity Recognition: Identifies both “Ja Morant” and “Quinten Post” (referenced as “Post”)
News Classification: Categorized as injury (0.972) with high relevance (1.0)
Source Information: Includes reporter name, publication timestamp, and source platform (Twitter)
Linking: Marked as an original report (parent_id: null)

This data structure enables filtering by player, news type, relevance, and original reporting, so users receive precisely targeted information for their specific needs.
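
As a minimal sketch of that kind of filtering, the function below checks a single enriched record against an alert rule. The record is a trimmed copy of the enriched output above. The function name `matches_alert` and the default cutoffs are assumptions for illustration, not part of the API.

```python
import json

# Trimmed copy of the enriched output above; only the fields used here are kept.
record = json.loads("""
{
    "snippet": {"snippet_id": 493109, "parent_id": null},
    "entities": [
        {"matched_entity": "Ja Morant", "entity_type": "sc_player"},
        {"matched_entity": "Quinten Post", "entity_type": "sc_player"}
    ],
    "news_types": {"injury": 0.97223, "performance": 0.03939, "relevant": 1.0}
}
""")


def matches_alert(record: dict, player: str, news_type: str,
                  min_score: float = 0.9, min_relevance: float = 0.8,
                  original_only: bool = True) -> bool:
    """Return True when the record mentions the player, is confidently of the
    given news type, is relevant enough, and (optionally) is an original report."""
    names = {entity["matched_entity"] for entity in record["entities"]}
    scores = record["news_types"]
    if player not in names:
        return False
    if scores.get(news_type, 0.0) < min_score:
        return False
    if scores.get("relevant", 0.0) < min_relevance:
        return False
    if original_only and record["snippet"]["parent_id"] is not None:
        return False
    return True


print(matches_alert(record, "Ja Morant", "injury"))  # True
print(matches_alert(record, "Ja Morant", "trade"))   # False
```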