How Sports Prediction Models Work — The XGBoost Approach
A prediction model is not magic. It takes structured data, finds patterns humans cannot see at scale, and outputs a probability. The gap between that probability and the sportsbook's implied probability is the edge. Here is how we build ours.
Published April 2026 · 15 min read
1. What a Prediction Model Actually Does
A prediction model takes structured data — rolling batting averages, starting pitcher strikeout rates, head-to-head matchup history, park factors — and finds patterns that are too complex or too numerous for a human analyst to track simultaneously. It outputs a number: the probability that a specific outcome occurs.
That probability is only useful if you can compare it against something. The comparison is the sportsbook's implied probability, which you derive directly from the odds. A line of -110 implies a 52.4% win probability. A line of -130 implies 56.5%. If your model says the true probability is 62% and the book implies 52.4%, you have a 9.6% edge — and a bet worth making.
The model does not need to be right every time. It needs to be right more often than the odds require. At -110, you need to win 52.4% of bets to break even. A model that wins 57% of bets at -110 is profitable over a large enough sample. The discipline is in applying the model only where confidence is high enough to generate that edge, and sizing bets proportionally to the edge size.
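In code, the odds conversion and the breakeven check are a few lines. A minimal sketch (the helper below is our own, not a library call):

```python
def implied_probability(american_odds: int) -> float:
    """Convert American odds to the sportsbook's implied win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

print(f"{implied_probability(-110):.1%}")  # 52.4% -- the breakeven rate at -110
print(f"{implied_probability(-130):.1%}")  # 56.5%

model_probability = 0.62
edge = model_probability - implied_probability(-110)
print(f"{edge:.1%}")  # 9.6% -- the gap described above
```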
What the Model Is Not
A prediction model is not a crystal ball. It cannot account for a pitcher who tweaked his shoulder warming up and hid it from the trainer. It does not know that a star hitter is playing through a hamstring issue that never made the injury report. It cannot predict rain delays, blown calls, or the thousand random events that make any individual game unpredictable.
What a model can do is identify structural edges — situations where the available data systematically predicts outcomes better than the market has priced. Those edges are real, they are recurring, and they compound over hundreds of bets.
2. Why XGBoost for Sports Betting
XGBoost — extreme gradient boosting — is a machine learning algorithm built on decision trees. Instead of training one tree and stopping, it trains a sequence of trees where each new tree corrects the errors of the previous ones. The final prediction is a weighted sum of all the trees in the ensemble.
It became the dominant algorithm in data science competitions because it handles the properties of real-world tabular data better than most alternatives. Sports statistics are a tabular data problem: rows are games, columns are features, outcomes are labels. XGBoost was built for exactly this kind of structure.
Key Advantages for Sports Data
Native handling of missing data: early-season games and recent call-ups leave gaps in rolling stats, and XGBoost handles them without manual imputation.
Automatic learning of nonlinear interactions: a high-K/9 pitcher facing a high-strikeout lineup is multiplicatively more likely to rack up strikeouts than either feature alone predicts.
Fast retraining on new data, which makes monthly in-season retrains practical.
Built-in regularization that limits the model's ability to memorize noise in the training sample.
How Boosting Works, Simply
Imagine you have a dataset of 10,000 MLB games. You train a simple decision tree — maybe it just splits on home/away and pitcher ERA. It gets 55% of outcomes right and 45% wrong. Boosting focuses the next tree on the 45% the first tree got wrong. That second tree learns patterns the first missed. Then the third tree focuses on what the second missed. After 100 to 500 trees, the ensemble has learned patterns that no single tree could capture.
The learning rate controls how much each new tree is weighted. A lower learning rate means each tree contributes less individually, requiring more trees to converge but producing a more robust final model. Most of our models use learning rates between 0.01 and 0.05, with 200 to 600 trees depending on the feature set size.
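To make that concrete, here is a minimal sketch using the open-source xgboost package on synthetic stand-in data. The hyperparameters are illustrative values inside the ranges quoted above, not our production settings:

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in data: 10,000 "games" with 20 features and binary outcomes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10_000) > 0).astype(int)

model = xgb.XGBClassifier(
    n_estimators=400,             # 200 to 600 trees depending on feature set size
    learning_rate=0.03,           # each tree contributes a small correction
    max_depth=4,                  # shallow trees; the ensemble supplies the depth
    objective="binary:logistic",  # output a probability, not a hard class
)
model.fit(X[:8_000], y[:8_000])               # train on the first 8,000 games
probs = model.predict_proba(X[8_000:])[:, 1]  # win probability for the rest
```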
3. Feature Engineering — Where the Real Work Is
The algorithm matters less than most people think. Two teams using the same XGBoost implementation will get very different results if they build their features differently. Feature engineering — deciding what data to feed the model and how to represent it — is where prediction models are won and lost.
More features are not better. More features are almost always worse unless each one carries independent predictive signal. Adding correlated features (batting average and on-base percentage are highly correlated) does not add information — it adds noise. Adding features with no causal relationship to the outcome (a team's record in Tuesday home games) actively degrades model performance by giving the model patterns to memorize that will not repeat.
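One way to catch that redundancy before training is a simple correlation screen. A sketch, assuming the features sit in a pandas DataFrame (the 0.9 threshold is an illustrative cutoff, not a production rule):

```python
import pandas as pd

def correlated_pairs(X: pd.DataFrame, threshold: float = 0.9):
    """Flag feature pairs so correlated that one of them is redundant."""
    corr = X.corr().abs()
    return [
        (a, b, round(corr.loc[a, b], 3))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > threshold
    ]

# Expect pairs like ('batting_avg', 'on_base_pct') to surface: keep one, drop the other.
```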
Our Feature Sets by Market
Our feature sets were built iteratively — every feature that did not improve out-of-sample performance was removed. These numbers reflect the final production sets:
MLB Moneyline — 51 features
24 team strength features: rolling offensive and defensive stats (runs scored, runs allowed, wRC+, FIP) over 7, 14, and 30-day windows (a leakage-safe version of this rolling computation is sketched after this list)
24 starting pitcher quality features: ERA, WHIP, K/9, BB/9, and volatility from last 5 starts — computed fresh for each game
3 series context features: home/away, days of rest, and game number in series
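As promised above, here is roughly how rolling features can be computed without look-ahead. The column names are hypothetical, and the sketch uses last-N-games windows for brevity where the production features use day-based windows:

```python
import pandas as pd

def add_rolling_features(games: pd.DataFrame) -> pd.DataFrame:
    """Rolling team stats with no look-ahead: shift(1) drops the current game
    from its own window, so every value is computed only from prior games."""
    games = games.sort_values(["team", "date"])
    for n in (5, 10, 20):
        games[f"runs_scored_last{n}"] = (
            games.groupby("team")["runs_scored"]
            .transform(lambda s, n=n: s.shift(1).rolling(n, min_periods=3).mean())
        )
    return games
```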
MLB Totals — 27 features
All team-level: combined offensive production rates, ballpark run factor, temperature and wind conditions
Note: pitcher-specific features were tested and rejected — they added noise that degraded totals accuracy from 65% to 53%. This is counterintuitive but reproducible.
Pitcher Strikeouts — 46 features
Pitcher rolling stats: K/9, BB/9, swinging strike rate, chase rate from last 5 and 10 starts
Statcast pitch-level data: spin rate, movement profiles, velocity trends
Opposing lineup K-rate, platoon splits, and confirmed lineup quality score
Matchup interaction terms: pitcher K-rate multiplied by opponent K-rate
Batter Props (hits, total bases) — 73 features
Player rolling averages over 7, 14, 30-day and season-long windows
Batter vs. pitcher history (BvP): career at-bats, batting average, slugging
Platoon splits: batter performance vs. same-hand and opposite-hand pitchers
Opposing pitcher quality: ERA, WHIP, and K/9 from last 5 starts
Lineup position and projected plate appearances
The Most Important Lesson: What NOT to Include
The totals model is the clearest example of this principle. Our first version included 43 features, including starting pitcher ERA, WHIP, K/9, and several advanced pitching metrics. Backtest accuracy was 65.2%. We assumed adding more pitcher data would help. We were wrong: the expanded pitcher feature set dropped backtest accuracy to 53%.
When we stripped the model to 27 team-level features and removed all pitcher-specific inputs, accuracy jumped to 68.8%. The pitcher features were adding noise because starting pitchers do not consistently determine total runs scored — bullpen performance, lineup depth, and offensive team quality are more reliable signals.
Finding which features do not help is as important as finding which do. The method is simple: train the full model, then train a version with the candidate feature removed. If out-of-sample accuracy does not drop, the feature is not carrying real signal and should be removed.
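A sketch of that procedure, assuming X_train and X_test come from a proper walk-forward split (Section 4); the feature name pitcher_era is a hypothetical candidate:

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score

def ablation_accuracy(X_train, y_train, X_test, y_test, drop=None):
    """Out-of-sample accuracy with one candidate feature optionally removed."""
    cols = [c for c in X_train.columns if c != drop]
    model = xgb.XGBClassifier(n_estimators=400, learning_rate=0.03, max_depth=4)
    model.fit(X_train[cols], y_train)
    return accuracy_score(y_test, model.predict(X_test[cols]))

# full    = ablation_accuracy(X_train, y_train, X_test, y_test)
# without = ablation_accuracy(X_train, y_train, X_test, y_test, drop="pitcher_era")
# If `without` is not lower than `full`, the feature carries no signal: remove it.
```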
Our pitcher strikeout model uses 46 features and is free to view. Compare today's projections against your book's lines and see the edge calculation for yourself.
4. Walk-Forward Backtesting — How to Know If It Works
Backtesting is how you evaluate a model before trusting it with real money. The principle is simple: train the model on historical data, then test it on future data it has never seen. If it performs well on data it was not trained on, the patterns it learned are real. If it only performs well on data it was trained on, it memorized noise.
The critical detail is that sports data is time-series data. A game on April 15 is influenced by everything that happened before April 15. It is not influenced by what happened after. This means you cannot use a random train/test split — if you randomly assign 20% of games as your test set, those test games will be scattered throughout the full date range, meaning the training set includes games that happened after those test games. The model trains on the future. Every accuracy number produced this way is meaningless.
Walk-Forward Methodology
The correct approach is walk-forward testing. Train on games from Season 1 through Season 3. Test on Season 4. That is your first window. Then expand the training set to include Season 4 and test on Season 5. Continue expanding. Never test on data that occurred before your most recent training cutoff.
This mirrors how the model actually operates in production: it only knows what happened before the game being predicted. If it cannot beat the market under these conditions in backtesting, it will not beat the market in production.
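In code, the expanding-window loop is short. A sketch, assuming a games table with a season column and pre-built, leakage-safe feature columns:

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score

def walk_forward(games, feature_cols, label_col, first_test_season):
    """Expanding-window evaluation: train on all earlier seasons, test on the next."""
    results = {}
    for season in sorted(games["season"].unique()):
        if season < first_test_season:
            continue
        train = games[games["season"] < season]  # strictly before the test season
        test = games[games["season"] == season]  # never seen during training
        model = xgb.XGBClassifier(n_estimators=400, learning_rate=0.03, max_depth=4)
        model.fit(train[feature_cols], train[label_col])
        preds = model.predict(test[feature_cols])
        results[season] = accuracy_score(test[label_col], preds)
    return results  # one out-of-sample accuracy per tested season
```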
Our Actual Backtest Numbers
MLB Moneyline (60%+ confidence filter): 64.2% accuracy over 491 plays
MLB Totals (27-feature model): 68.8% walk-forward accuracy
Pitcher Strikeouts (1K+ edge filter): 76.1% accuracy over 14,165 plays
Two details worth noting. First, the pitcher K accuracy (76.1%) is high because the 1K+ edge threshold filters aggressively — only the most confident calls are included. Second, the moneyline sample (491 plays) is smaller because we apply a 60%+ confidence filter. Both of these are intentional. High-confidence bets from a well-calibrated model outperform applying the model to every game indiscriminately.
What Sample Size Actually Means
A 100-game backtest is statistically meaningless. With 100 games at 55% accuracy, the 95% confidence interval on your true win rate stretches from 45% to 65%. You cannot distinguish skill from luck. You need thousands of predictions to have meaningful confidence that your edge is real rather than a variance artifact.
This is why the pitcher K model's 14,165-play sample matters. At that sample size, a 76.1% accuracy has a standard error of less than 0.4%. The signal is not luck. Conversely, anyone claiming a model works based on 50-200 games is presenting noise as evidence.
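The arithmetic behind both claims is one line of standard-error math:

```python
import math

def win_rate_interval(p: float, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a backtested win rate."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se, se

print(win_rate_interval(0.55, 100))      # ~(0.452, 0.648): skill vs. luck unresolved
print(win_rate_interval(0.761, 14_165))  # se ~ 0.0036: under 0.4 percentage points
```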
5. The Shuffle Test — Catching Leakage
Data leakage is the most dangerous failure mode in model building. It happens when information that would not be available at prediction time accidentally finds its way into the training data. The model appears to work in backtesting because it had access to the future. In live use, that future data is unavailable and the model fails.
The most common leakage source in sports models is using season averages that include the game being predicted. If you compute a batter's full-season batting average and use it as a feature for every game in that season, you are using April, May, and June data to predict a March game. The model sees future performance. Strip those season averages out and replace them with rolling averages computed only on data before game day.
How the Shuffle Test Works
Take your full dataset. Randomly scramble the outcome labels — assign wins and losses randomly without regard for which team actually won. Retrain the model on this scrambled data using the same features and hyperparameters. Evaluate accuracy on the held-out test set.
A model trained on scrambled labels cannot learn anything real — there is no signal to learn. If it still shows 65% accuracy on the test set, the model was not learning patterns at all. It was learning the date structure, the feature distributions, or some other artifact that leaked future information into the training data.
A properly built model should show a sharp drop when labels are shuffled. Our models show gaps of 7 to 10 percentage points between real accuracy and shuffled accuracy. That gap is the signal. Larger gaps indicate more robust underlying patterns. Gaps below 5% suggest the model is primarily learning noise.
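A sketch of the test, assuming the same walk-forward split as before; the features and hyperparameters stay fixed, and only the training labels change:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

def shuffle_gap(X_train, y_train, X_test, y_test, seed=0):
    """Accuracy gap between a model trained on real labels and one trained
    on scrambled labels, both scored on the same held-out test set."""
    params = dict(n_estimators=400, learning_rate=0.03, max_depth=4)
    y_scrambled = np.random.default_rng(seed).permutation(y_train)
    real = xgb.XGBClassifier(**params).fit(X_train, y_train)
    null = xgb.XGBClassifier(**params).fit(X_train, y_scrambled)
    return (accuracy_score(y_test, real.predict(X_test))
            - accuracy_score(y_test, null.predict(X_test)))

# A gap of 7-10 points is healthy; under 5 points, suspect noise or leakage.
```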
Other Leakage Sources to Watch
6. Why Most Prediction Models Fail
Most publicly available sports prediction models do not work. Not because machine learning cannot beat sportsbooks — it can — but because the models are built incorrectly. The failure modes are consistent and well-documented.
Overfitting to Small Samples
A model trained on 100 games can achieve 75% accuracy in backtesting. The same model will achieve 52% in production. With 100 samples, a model has enough degrees of freedom to memorize the specific outcomes of those games without learning any generalizable pattern. The larger the model relative to the training sample, the worse this problem becomes.
The floor for a meaningful backtest is roughly 1,000 games for a binary classifier. For player props with high variance, you need 5,000 or more before accuracy estimates stabilize. Any model claiming 70%+ accuracy on fewer than 500 backtested games should be treated as overfitted until proven otherwise.
Data Leakage — Using Future Information
This was covered in the previous section, but it bears repeating because it is by far the most common failure mode. Using season-to-date averages as model features leaks future game results into every prediction. Using end-of-season statistics is even worse. A model trained this way will backtest beautifully and perform at random in production.
Feature Bloat
There are hundreds of baseball statistics available. Batting average, on-base percentage, slugging, OPS, wRC+, wOBA, BABIP, ISO, contact rate, hard-hit rate, barrel rate — and each of these computed at different time scales. A model that includes all of them will find spurious patterns in the training data and memorize them. Adding more correlated features does not add information. It multiplies the opportunities for the model to find noise.
Our totals model failed when we added pitcher features because those features were correlated with team-level features already in the model, adding complexity without adding signal. The model used those pitcher features to memorize specific game outcomes rather than learn generalizable patterns. Accuracy dropped 12 percentage points. Stripping back to the core 27 features recovered the loss and then some.
The Compression Problem
A model trained to minimize mean squared error on game totals will predict the mean — somewhere between 8 and 10 runs for every game. It gets most predictions close to right, because most games land near the average, but it never predicts the 3-1 pitcher's duel or the 14-11 slugfest. When you bet totals, those extreme outcomes matter. A model that compresses all predictions toward the mean is useless for finding over/under edges on extreme games.
The solution is to frame the problem as a classification task rather than a regression task — predict whether the total will be over or under a specific line, not what the exact total will be. This forces the model to learn the conditions that produce extreme outcomes rather than averaging toward the middle.
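A sketch of the reframing, with hypothetical column names; the label is defined against the posted line rather than the raw run total:

```python
import pandas as pd

def over_under_labels(games: pd.DataFrame) -> pd.DataFrame:
    """Turn totals regression into binary classification against the line."""
    games = games[games["total_runs"] != games["posted_line"]].copy()  # drop pushes
    games["over"] = (games["total_runs"] > games["posted_line"]).astype(int)
    return games

# The classifier now learns P(over) for the specific line it will be bet
# against, instead of regressing every game toward the 8-10 run mean.
```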
Never Retraining
A model trained on 2022-2023 data and never updated will go stale. The 2026 Dodgers are not the 2023 Dodgers. Players age, get traded, change pitch mixes, recover from injuries, and develop new tendencies. A model that does not reflect current conditions will gradually lose its edge as its feature distributions drift from the reality of the current season.
We retrain all models monthly during the season, incorporating the most recent complete month of games. This keeps feature distributions current while preserving the deep historical signal from prior seasons.
7. From Probability to Bet — The Edge Calculation
A model outputs a probability. That probability only translates to a profitable bet if the odds offered exceed the implied probability required to break even. The edge calculation connects these two numbers.
Computing the Edge
Step one: convert the model's probability output to a percentage. The model says 64% win probability for Team A.
Step two: convert the sportsbook's odds to an implied probability. Team A is listed at -110. Implied probability = 110 / (110 + 100) = 52.4%.
Step three: compute the edge. Edge = model probability minus implied probability = 64% minus 52.4% = 11.6%.
A positive edge means you expect to profit on this bet over many repetitions. A negative edge means the opposite. The size of the edge tells you how much of your bankroll to risk.
Kelly Criterion for Bet Sizing
The Kelly criterion converts a probability and a price into an optimal bet size. The formula: bet fraction = (decimal odds × model probability − 1) / (decimal odds − 1). For the 64% example at -110 (1.91 in decimal): Kelly fraction = (1.91 × 0.64 − 1) / (1.91 − 1) ≈ 24.4% of bankroll.
Full Kelly is aggressive — most serious bettors use fractional Kelly, typically half or quarter Kelly, to reduce variance while preserving the mathematical edge. A half-Kelly bet on this example is roughly 12.2% of bankroll. For a full walkthrough of Kelly sizing, see our dedicated guide to Kelly criterion for sports betting.
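Putting the edge and the stake together in one sketch (the helpers are our own; the numbers match the worked example above):

```python
def implied_probability(american: int) -> float:
    return -american / (-american + 100) if american < 0 else 100 / (american + 100)

def decimal_odds(american: int) -> float:
    return 1 + (100 / -american if american < 0 else american / 100)

def kelly_stake(p: float, american: int, fraction: float = 0.5) -> float:
    """Fractional Kelly: f* = (d*p - 1) / (d - 1), scaled down to cut variance."""
    d = decimal_odds(american)
    return max(0.0, fraction * (d * p - 1) / (d - 1))

p = 0.64
print(f"edge:  {p - implied_probability(-110):.1%}")       # ~11.6%
print(f"stake: {kelly_stake(p, -110, fraction=0.5):.1%}")  # ~12.2% at half Kelly
```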
The Breakeven Table
The breakeven win rate at any price is risk / (risk + win) for negative odds and 100 / (odds + 100) for positive odds:
+120 → 45.5% breakeven
+100 → 50.0% breakeven
-110 → 52.4% breakeven
-130 → 56.5% breakeven
-150 → 60.0% breakeven
A model with 60% accuracy is comfortably profitable at -110 but only breaks even at -150. The odds matter as much as the accuracy. A model that picks correctly 58% of the time at +120 is far more profitable than one that picks correctly 60% of the time at -150.
8. What We Run at Prediction Engine
Our production stack covers three sports with machine learning models on every major market. All models are XGBoost classifiers validated with walk-forward backtesting. Models retrain monthly. Predictions are generated three times daily as lineups confirm and odds sharpen.
MLB Coverage
MLB models cover moneyline, totals, pitcher strikeouts, and batter props (hits and total bases), the four markets whose feature sets are detailed in Section 3.
NBA and NHL Coverage
NBA models cover moneyline, spread, totals, and five player prop markets (points, rebounds, assists, points+rebounds+assists, and three-pointers). NHL models cover moneyline, puck line, totals, and six prop markets. All follow the same XGBoost walk-forward methodology as MLB.
Infrastructure
Predictions run on a dedicated EC2 instance. The pipeline pulls live lineup data, runs each model, filters by confidence threshold, computes edges against current market odds, and writes outputs to the app database — three times daily during the season. All predictions are graded against final scores. Every pick has a public track record.
Access to all markets and full model output is available through a subscription at predictionengine.app/pricing. The pitcher strikeout projections are free to view without an account.
9. Frequently Asked Questions
How accurate are sports prediction models?
Accuracy depends on the market and the confidence threshold applied. At high confidence thresholds — where only the model's most certain calls are included — accuracy is meaningfully higher than at lower thresholds. Our pitcher K model hits 76.1% at 1K+ edge over 14,165 plays; our MLB moneyline model hits 64.2% at 60%+ confidence over 491 plays. Raw accuracy across all predictions is lower. The useful question is not “what is the overall win rate?” but “what is the win rate at the confidence levels I actually bet?”
What is XGBoost and why is it used for sports betting?
XGBoost is a gradient-boosted decision tree algorithm. It builds a sequence of decision trees where each new tree corrects the errors of the previous ones. For sports betting, its advantages are: native handling of missing data, automatic learning of nonlinear feature interactions, fast retraining on new data, and built-in regularization that prevents overfitting. It is the dominant algorithm for tabular data problems — and sports statistics are a tabular data problem. A pitcher with a high K/9 facing a high-K lineup is multiplicatively more likely to strike out batters than either feature alone predicts; XGBoost learns this without being told to look for it.
How do you test if a prediction model actually works?
The correct methodology is walk-forward backtesting: train exclusively on past data, test on the subsequent period the model has never seen. Random train/test splits are invalid for time-series sports data. You also need to run a shuffle test: scramble the outcome labels, retrain the model, and measure how much accuracy drops. A real model shows a gap of 7 to 10 percentage points between actual and shuffled accuracy; a model with no gap found noise rather than signal. And the sample size matters — backtest results on fewer than 1,000 games are not statistically meaningful.
Why do most sports betting models fail?
The most common failure modes: overfitting to small samples (100-game backtests are statistically meaningless), data leakage (using season averages that include the game being predicted), feature bloat (more stats added until the model memorizes noise rather than learning patterns), never retraining as rosters and strategies change, and the compression problem (predicting 8-10 for every total instead of learning the conditions that produce extreme outcomes). Most public models suffer from at least two of these simultaneously, producing backtests that look excellent and live performance that is indistinguishable from random.
Can machine learning beat sportsbooks?
Yes, selectively. Sportsbooks are efficient on heavily traded markets — NFL game lines, NBA totals — where sharp action has compressed the margins. They are less efficient on player props, secondary markets, and early-week lines before sharp volume arrives. Machine learning finds edges in less-efficient markets by identifying patterns in large historical datasets that neither human handicappers nor simple statistical models detect. The key is applying predictions only at confidence thresholds where historical accuracy exceeds the breakeven rate implied by the odds, and maintaining the discipline not to bet at lower confidence levels where no edge exists.
Access all markets — MLB, NBA, NHL — with a full track record on every pick
XGBoost models on every major market. Predictions three times daily. Every pick graded publicly. Start free for 5 days, no card required.
Start your free 5-day trial — predictionengine.app/pricing