Data¶
SportsCapital has a Data Operations team that trains and evaluates artificial intelligence models. These models are the foundation of how we collect and process information.
Data Collection¶
SportsCapital maintains a data infrastructure that continuously monitors and ingests content from various news sources. News reports are published, processed, and available in our API in less than 8 seconds (as of May 1, 2025).
See coverage by reporter type:
| League | Total Reporters | Regional Reporters | National Reporters |
|---|---|---|---|
| MLB | 248 | 207 | 41 |
| NBA | 178 | 134 | 44 |
| NHL | 137 | 105 | 32 |
| Total | 563 | 446 | 117 |
Data Processing¶
Once the information is collected, we run it through a series of processing steps that generate a snippet, classify the news type, link entities, and extract key information.
graph LR
A[Social Media] --> D[Data Ingest]
B[Articles] --> D
C[Videos/Podcasts] --> D
D --> E[Text Extraction]
E --> F[Snippet Generation]
F --> G[Entity Linking]
F --> H[News Classification]
F --> I[Relevancy Scoring]
F --> Q[Snippet Linking]
G --> J[Enriched Data]
H --> J
I --> J
Q --> J
J --> K[API]
J --> L[Platform]
J --> M[Integrations]
M --> N[Slack]
M --> O[Amazon S3]
M --> P[MS Teams]
classDef sources fill:#f5f5f5,stroke:#333;
classDef ingest fill:#d4e6f1,stroke:#2874a6;
classDef process fill:#d1f0e0,stroke:#196f3d;
classDef enrich fill:#e8daef,stroke:#6c3483;
classDef enrichedData fill:#fadbd8,stroke:#943126;
classDef delivery fill:#fdebd0,stroke:#b9770e;
classDef integrations fill:#fdebd0,stroke:#b9770e;
class A,B,C sources;
class D ingest;
class E,F process;
class G,H,I,Q enrich;
class J enrichedData;
class K,L delivery;
class M,N,O,P integrations;
Snippets¶
Snippets are short extractions or segments of text taken from a larger body of content. Social media posts are cleaned by removing elements like hashtags and tags. Snippets from longer form content, such as broadcast feeds, articles, podcasts, and videos, are generated using LLMs to retain only key information.
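The social media cleaning step can be illustrated with a minimal sketch. The regular expressions and the `clean_post` helper below are assumptions made for illustration; the source only states that hashtags and tags are removed, not how.

```python
import re

# Hypothetical patterns: the documentation does not specify the actual rules.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+:?")


def clean_post(text: str) -> str:
    """Strip hashtags and @-mentions, then collapse leftover whitespace."""
    text = HASHTAG.sub("", text)
    text = MENTION.sub("", text)
    return " ".join(text.split())


cleaned = clean_post("@Trevor_Lane: Ja Morant's head hit Post's leg #NBA #Grizzlies")
print(cleaned)
```

Snippets from long-form content are generated with LLMs and are not covered by this sketch.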
Entities¶
Entities are any players, teams, staff, or agents working around the league. We keep our entity database current with daily roster updates and the latest staff movement.
See the definitions of each entity type below:
Players: We only track players on the active daily major league roster and the injured reserve list. Minor league players and prospects are not included as of January 10, 2025.
Teams: We only track major league teams. Minor league and international clubs are not included as of January 10, 2025.
Staff: We track all coaches, executives, scouts, and medical staff listed on the active roster (if the information is available).
Agents: We only track licensed agents (if the information is available).
Enrichments¶
SportsCapital uses proprietary models to extract and generate additional metadata highly relevant to each snippet. This allows users to perform advanced searching and alerting. More information about our primary models can be found in the descriptions below.
Entity Linking¶
A proprietary model links an entity to a snippet whenever the snippet mentions the entity’s full name, last name, or nickname. The linking process combines fuzzy matching with contextual awareness to identify the entities mentioned in text.
graph LR
A[Snippet Text] --> B[Preprocessing]
B --> C[Entity Candidates]
C --> D{Type Analysis}
D -->|Teams| E[Team Matching]
D -->|People| F[People Matching]
E --> G[Entity Prioritization]
F --> G
G --> H[Final Entity Matches]
classDef process fill:#d1f0e0,stroke:#196f3d;
classDef decision fill:#e8daef,stroke:#6c3483;
classDef result fill:#fadbd8,stroke:#943126;
class A,B,C process;
class D decision;
class E,F,G process;
class H result;
Our entity linking system combines preprocessing techniques, fuzzy matching, and contextual awareness to resolve complex identification challenges:
Preprocessing & Candidate Generation
Handle hyphenated names, remove noise, and generate multi-word entity mentions
Match against our comprehensive entity database including nicknames and variations
Contextual Analysis
Leverage reporter affiliations to prioritize team-relevant entities
Apply separate matching strategies for teams vs. individuals
Consider historical patterns in entity mentions
Smart Entity Prioritization
Intelligently resolve ambiguity between entities with similar names
Weight prioritization based on team affiliations, entity types, and mention history
Apply different threshold values based on entity context
Use metadata like player age as tiebreakers when relevant
This approach effectively handles complex scenarios like distinguishing between players with identical surnames, recognizing nickname variations, and correctly identifying entities in ambiguous contexts.
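The general idea of fuzzy matching with a contextual tiebreaker can be sketched as follows. This is not the proprietary model: it uses Python's standard `difflib` as a stand-in matcher, and the entity records, names, team codes, threshold, and bonus value are all fictional assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Fictional entity database; real records and fields are not documented here.
ENTITIES = [
    {"name": "Alex Carter", "aliases": ["Carter"], "team": "AAA"},
    {"name": "Jordan Carter", "aliases": ["Carter", "JC"], "team": "BBB"},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entity(mention, reporter_team=None, threshold=0.85, team_bonus=0.1):
    """Return the best-matching entity name for a mention, or None.

    threshold and team_bonus are assumed values, not documented ones.
    """
    best_name, best_score = None, 0.0
    for entity in ENTITIES:
        candidates = [entity["name"], *entity["aliases"]]
        score = max(similarity(mention, c) for c in candidates)
        if score < threshold:
            continue  # not close enough to any known name or nickname
        # Contextual signal: prefer entities on the reporter's team.
        if reporter_team is not None and entity["team"] == reporter_team:
            score += team_bonus
        if score > best_score:
            best_name, best_score = entity["name"], score
    return best_name


print(link_entity("Carter", reporter_team="BBB"))
```

In this sketch, the shared surname "Carter" matches both entities equally, so the reporter's team affiliation decides the link. Without that context the first entry wins the tie, which is exactly the kind of ambiguity that signals such as mention history would need to resolve.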
Evaluation Metrics
We use Recall, Precision, and F1 Scores to evaluate our entity linking process. The current metrics below are as of January 10, 2025.
NHL
| Metric | Value |
|---|---|
| Annotations | 1064 |
| F1 Score | 0.912 |
| Precision | 0.908 |
| Recall | 0.918 |
NBA
| Metric | Value |
|---|---|
| Annotations | 198 |
| F1 Score | 0.928 |
| Precision | 0.944 |
| Recall | 0.913 |
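For readers less familiar with these metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the NBA precision and recall reported above; the `f1_score` helper is written here for illustration and is not part of any SportsCapital API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NBA entity linking values from the table above.
nba_f1 = f1_score(0.944, 0.913)
print(round(nba_f1, 3))  # prints 0.928
```

Recomputing F1 from rounded precision and recall can differ from a published figure in the last digit, since the published value is presumably derived from unrounded counts.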
News Classification¶
Our news classification model automatically categorizes snippets by news type and provides a confidence score for each possible type. A higher probability indicates stronger confidence that the snippet belongs to that news type.
News Types
| News Type | Description |
|---|---|
| Injury | Reports about an acute player injury event (e.g., a player limping, hard collisions) |
| Recovery | Reports about a player returning to availability (e.g., cleared to play, returning from an injury) |
| Diagnosis | Updates on a player's injury status after the incident, but before recovery (e.g., MRI results, estimated return timeline) |
| Lineups | Updates on players starting, sitting, or being moved around the lineup, including minor leagues |
| Practice | Updates about practice participation, training sessions, and post-practice interviews |
| Suspension | Reports on players or staff members getting suspended from games and team activities |
| Performance | Commentary on player or team performance historically and during live games |
| Trade | Reports about completed or prospective player trades between teams |
| Contract | Updates about player or staff contract negotiations, extensions, or terms |
| Draft | Reports about the draft or draft prospects, including prospective draft pick decisions |
| Firing | Reports about coaches, executives, or staff being fired or stepping down |
| Hiring | Reports about new coaches, executives, or staff being hired |
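Because every snippet carries a score for each news type, a client usually needs a rule for picking a label. The sketch below shows one possible rule: take the highest-scoring type and require a minimum confidence. The 0.5 cutoff and the `top_news_type` helper are assumptions for illustration, and the `relevant` key is skipped because the example response on this page stores the relevancy score alongside the news types.

```python
def top_news_type(scores, min_confidence=0.5, ignore=("relevant",)):
    """Return (label, score) for the highest-scoring news type,
    or None if no label reaches min_confidence (an assumed cutoff)."""
    candidates = {k: v for k, v in scores.items() if k not in ignore}
    if not candidates:
        return None
    label = max(candidates, key=candidates.get)
    score = candidates[label]
    return (label, score) if score >= min_confidence else None


# A subset of the scores from the enriched snippet example on this page.
scores = {"injury": 0.97223, "performance": 0.03939, "practice": 0.02688, "relevant": 1.0}
print(top_news_type(scores))
```

Since the scores are per-type confidences, a stricter or looser cutoff may suit different alerting use cases.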
Evaluation Metrics
We use Recall, Precision, and F1 Scores to evaluate our news classification model. The current metrics below are as of April 29, 2025.
NHL
| Metric | Value |
|---|---|
| Annotations | 1799 |
| F1 Score | 0.849 |
| Precision | 0.860 |
| Recall | 0.837 |
NBA
| Metric | Value |
|---|---|
| Annotations | 2000 |
| F1 Score | 0.873 |
| Precision | 0.881 |
| Recall | 0.866 |
Relevancy¶
Our proprietary relevancy model is a binary classifier that produces a single probability score for how contextually relevant a snippet is to sports betting. A probability closer to 1 indicates stronger confidence that the snippet is relevant for decision-makers. Users can filter snippets by relevancy in our API.
Evaluation Metrics
We use Recall, Precision, and F1 Scores to evaluate our relevancy model. The current metrics below are as of April 29, 2025.
NHL
| Metric | Value |
|---|---|
| Annotations | 369 |
| F1 Score | 0.950 |
| Precision | 0.931 |
| Recall | 0.970 |
NBA
| Metric | Value |
|---|---|
| Annotations | 364 |
| F1 Score | 0.976 |
| Precision | 0.963 |
| Recall | 0.989 |
Snippet Linking¶
Our system identifies when snippets contain the same core information and establishes parent-child relationships between them.
graph LR
A[New Snippet Ingested] --> B[Retrieve]
B --> H[Rank]
H --> J{Match Probability}
J -->|Below Threshold| K[Original Snippet<br>parent_id: null]
J -->|Above Threshold| L[Linked Snippet<br>parent_id: existing_id]
classDef phase fill:#d4e6f1,stroke:#2874a6;
classDef decision fill:#fadbd8,stroke:#943126;
classDef result fill:#fdebd0,stroke:#b9770e;
class A phase;
class B,H phase;
class J decision;
class K,L result;
Our two-phase approach:
Retrieve: First we efficiently filter the candidate pool:
Apply practical filters (time window, entity overlap, news type)
Deploy our in-house fine-tuned bi-encoder for semantic similarity search
Select top candidates based on relevance scores
Rank: Then we perform precise comparison:
Our proprietary fine-tuned cross-encoder evaluates semantic equivalence
Probability scoring determines parent-child relationships
High-confidence matches establish linking structure
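The retrieve-then-rank pattern above can be sketched in a few lines. The scoring functions here are deliberately simple stand-ins, not the fine-tuned bi-encoder and cross-encoder: word-set overlap plays the role of the fast retrieval score, and a character-level `difflib` ratio plays the role of the match probability. The `top_k` and `threshold` values are assumptions, and the practical filters (time window, entity overlap, news type) are omitted.

```python
from difflib import SequenceMatcher


def retrieve_score(a: str, b: str) -> float:
    """Stand-in for bi-encoder similarity: Jaccard overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_score(a: str, b: str) -> float:
    """Stand-in for a cross-encoder match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_parent(new_text, existing, top_k=3, threshold=0.8):
    """existing maps snippet_id -> text. Returns a parent_id or None."""
    # Retrieve: cheaply shortlist the top_k most similar candidates.
    shortlist = sorted(
        existing,
        key=lambda sid: retrieve_score(new_text, existing[sid]),
        reverse=True,
    )[:top_k]
    # Rank: compare each shortlisted candidate more carefully.
    best_id, best_score = None, 0.0
    for sid in shortlist:
        score = rank_score(new_text, existing[sid])
        if score > best_score:
            best_id, best_score = sid, score
    # Below the threshold the snippet stays an original (parent_id: null).
    return best_id if best_score >= threshold else None


# Hypothetical snippet store for illustration.
existing = {
    1: "Player X is out tonight with a knee injury",
    2: "Team Y signs a new head coach",
}
print(find_parent("player x is OUT tonight with a knee injury", existing))
```

The two-stage split keeps the expensive comparison limited to a short candidate list, which is the usual motivation for pairing a fast retriever with a slower, more precise ranker.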
Using in Search
Filter with the original_report parameter:
true: Returns only snippets without a parent_id (original reports)
false: Returns only snippets with a parent_id (derivative reports)
null: No filtering based on parent_id (default)
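The three values of `original_report` can be mirrored on a local list of snippets, which may help when post-processing results client-side. The `filter_by_original_report` helper and the second snippet record are hypothetical; only the parameter semantics come from the description above.

```python
from typing import Optional


def filter_by_original_report(snippets, original_report: Optional[bool] = None):
    """Apply the documented original_report semantics to snippet dicts."""
    if original_report is None:
        return list(snippets)  # null: no filtering on parent_id
    if original_report:
        # true: originals only (no parent_id)
        return [s for s in snippets if s.get("parent_id") is None]
    # false: derivative reports only (parent_id set)
    return [s for s in snippets if s.get("parent_id") is not None]


snippets = [
    {"snippet_id": 493109, "parent_id": None},
    {"snippet_id": 493200, "parent_id": 493109},  # hypothetical derivative report
]
originals = filter_by_original_report(snippets, original_report=True)
print([s["snippet_id"] for s in originals])
```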
Evaluation Metrics
We use Recall, Precision, and F1 Scores to evaluate our snippet linking process. The current metrics below are as of April 29, 2025.
NHL
| Metric | Value |
|---|---|
| Annotations | 125 |
| F1 Score | 0.822 |
| Precision | 0.769 |
| Recall | 0.882 |
NBA
| Metric | Value |
|---|---|
| Annotations | 355 |
| F1 Score | 0.940 |
| Precision | 0.946 |
| Recall | 0.934 |
Enriched Snippet Example¶
Original Tweet¶
@Trevor_Lane: Ja Morant’s head hit Post’s leg so hard that Post’s leg is bleeding
Enriched Output¶
{
"snippet": {
"snippet_id": 493109,
"league_id": "729c2399-f94a-4db3-9710-e2a5b3c4d0e3",
"snippet_text": "Ja Morant's head hit Post's leg so hard that Post's leg is bleeding",
"publish_date": "2025-04-16T04:31:21",
"url": "https://x.com/Trevor_Lane/status/1912363158825578922",
"extracted_reporter": "Trevor Lane",
"extracted_source": "Twitter",
"extracted_urls": [],
"source_type": "Social Media",
"author_name": "Trevor Lane",
"in_play": true,
"parent_id": null
},
"entities": [
{
"matched_entity": "Ja Morant",
"entity_type": "sc_player",
"entity_id": "46237ce5-fc56-baea-70d0-54b0c38eff0a",
"entity_type_id": 1
},
{
"matched_entity": "Quinten Post",
"entity_type": "sc_player",
"entity_id": "b1b537d6-0337-9542-b52a-88bc35e5e1e7",
"entity_type_id": 1
}
],
"news_types": {
"contract": 0.00186,
"draft": 0.00142,
"firing": 0.00154,
"hiring": 0.00135,
"injury": 0.97223,
"diagnosis": 0.01453,
"recovery": 0.00227,
"lineup": 0.00637,
"other": 0.02345,
"performance": 0.03939,
"practice": 0.02688,
"relevant": 1.0,
"suspension": 0.00394,
"trade": 0.00136
}
}
Key Insights¶
The enriched data transforms a simple tweet into structured intelligence:
Entity Recognition: Identifies both “Ja Morant” and “Quinten Post” (referenced as “Post”)
News Classification: Categorized as injury (0.972) with high relevance (1.0)
Source Information: Includes reporter name, publication timestamp, and source platform (Twitter)
Linking: Marked as an original report (parent_id: null)
This data structure enables filtering by player, news type, relevance, and original reporting, so users receive precisely targeted information for their specific needs.
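As a rough illustration of that filtering, the sketch below checks a trimmed copy of the enriched output against a combination of criteria. The `matches` helper, its parameter names, and the default 0.5 score cutoff are assumptions for illustration; they are not API parameters.

```python
def matches(record, player=None, news_type=None, min_score=0.5,
            min_relevance=None, original_only=False):
    """Return True if an enriched record satisfies every given criterion."""
    if player is not None:
        names = {e["matched_entity"] for e in record["entities"]}
        if player not in names:
            return False
    if news_type is not None and record["news_types"].get(news_type, 0.0) < min_score:
        return False
    if min_relevance is not None and record["news_types"].get("relevant", 0.0) < min_relevance:
        return False
    if original_only and record["snippet"].get("parent_id") is not None:
        return False
    return True


# Trimmed copy of the enriched output shown above.
record = {
    "snippet": {"snippet_id": 493109, "parent_id": None},
    "entities": [
        {"matched_entity": "Ja Morant", "entity_type": "sc_player"},
        {"matched_entity": "Quinten Post", "entity_type": "sc_player"},
    ],
    "news_types": {"injury": 0.97223, "trade": 0.00136, "relevant": 1.0},
}

is_match = matches(record, player="Quinten Post", news_type="injury",
                   min_relevance=0.9, original_only=True)
print(is_match)
```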