Why Most MLB Prediction Models Fail — The Compression Problem Explained

Most AI models claim to predict MLB totals. Most of them are useless for betting — not because they are wrong, but because they are never actually wrong about anything. Here is why, and what separates models that make money from models that just look impressive.

Updated April 2026

What Is Model Compression and Why Should You Care

The average MLB game produces 9.07 total runs. The standard deviation across all games is 4.55. That means real games range from complete shutouts (0 runs combined) to all-out slugfests — the MLB record is 49 runs in a single game, and 30-run games are not unheard of in modern baseball.

Now consider what the typical prediction model does. It ingests team batting averages, pitching stats, and recent form — all of which are rolling averages — and produces an output. That output almost always falls between 8 and 10 runs. Every game. Regardless of who is pitching, what ballpark the game is in, or whether both offenses are ice cold.

This is model compression: the model collapses a wide distribution of real outcomes into a narrow band of predictions. It is not technically wrong — the average game really does produce about 9 runs. But it is useless for betting, because the sportsbook has already priced the line at exactly that average.

Think of it this way: a model that predicts “9 runs” for every single game of the season would fall within 0.5 runs of the sportsbook line in roughly half of all games, since most totals are set between 8.5 and 9.5. On paper it would look like a respectable model. But every bet it triggered would be a coin flip taxed by the juice, because it has no ability to distinguish the 6-run pitchers' duel from the 14-run slugfest.

The ability to predict the direction and magnitude of variance — not just the average — is the only thing that creates a betting edge. See how MLB betting analytics are built to capture that variance.

The Base Rate Problem: 51.5% Is Barely Better Than Guessing

Before evaluating any model, you need to know the base rate: how often does the over hit on the most common line? At a game total of 8.5 — the most common MLB total — the over hits roughly 51.5% of the time historically. At 9.5, the over hits about 48%. At 7.5, closer to 55%.

This base rate matters enormously. A model that claims “57% accuracy” on over/under 8.5 is only generating 5.5 percentage points of lift above the base rate. At -110 juice, you need 52.4% to break even. So that model has roughly 4.6 points of true edge — if the accuracy is measured correctly on out-of-sample data.

Most models are not measured correctly. They report accuracy on the same data used to train them. A gradient boosting model with decision trees can perfectly memorize five seasons of game outcomes and report 100% backtested accuracy. Tested on a single new season it never saw, it may drop to 51%. That 49-point gap is what overfitting looks like in practice.

Break-Even Quick Reference

Odds               Win Rate Needed to Break Even
-110 (standard)    52.4%
-115               53.5%
-120               54.5%
-130               56.5%
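Those break-even rates fall straight out of the odds arithmetic. A minimal sketch in Python (the function name is ours, purely illustrative):

```python
def breakeven_win_rate(american_odds: int) -> float:
    """Win rate at which a flat bet at the given American odds breaks even."""
    if american_odds < 0:
        # Favorite pricing: risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog pricing: risk 100 to win the odds amount.
    return 100 / (american_odds + 100)

for odds in (-110, -115, -120, -130):
    print(f"{odds}: {breakeven_win_rate(odds):.1%}")
```

Running it reproduces the table above, and shows why plus-money prices matter: at +100 the bar drops to an even 50%.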

The break-even bar is low in absolute terms, but the base rate of the market is close to that bar by design. Sportsbooks set lines to attract equal action on both sides, which means the line itself is already a reasonable estimate of the true probability. You are not competing against randomness — you are competing against a well-informed market that has already absorbed most of the public information.

This is why raw accuracy metrics are meaningless without context. A model that hits 63% on high-confidence plays is exceptional. Understanding expected value is the framework that turns raw accuracy into a bankroll strategy.

Why Adding More Data Makes Models Worse

This is the counterintuitive finding that took the longest to accept during our model development: adding more features — more data points about each game — made the model perform worse on new games, not better.

We tested configurations ranging from a lean set of features up to sets of 42 and 50 features. The model with fewer, well-chosen features consistently outperformed the larger feature sets on held-out data. The larger sets looked better in backtesting. They were worse in practice.

Why? Three reasons:

  • Correlated features amplify noise. Team batting average, team OPS, team wRC+, and team weighted runs above average all measure the same thing from different angles. Adding all four does not give the model more information — it gives the model four chances to overfit to the noise in each metric.
  • Rolling averages are mean-reverting by construction. A 20-game rolling batting average moves toward the season mean every time a game is added. That is the mathematical definition of mean reversion. Using it as a predictor of variance — which is what betting on totals requires — is self-defeating. The feature smooths out exactly what you need to detect.
  • More parameters, more ways to memorize. A machine learning model with 50 inputs has far more internal complexity than one with 18. That complexity can be used to learn genuine patterns, or it can be used to memorize quirks of the training data. With noisy sports data, memorization wins most of the time — unless the model is aggressively regularized.
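The mean-reversion point is easy to demonstrate. The sketch below simulates independent game totals around the league mean, an assumption real schedules do not satisfy, but the smoothing effect of the rolling window is the same either way:

```python
import random
import statistics

random.seed(7)

# Simulated per-game run totals: independent draws around the league mean.
# Illustrative only; real totals are not independent, but a rolling
# average smooths them the same way.
games = [random.gauss(9.0, 4.5) for _ in range(2000)]

WINDOW = 20
rolling = [
    statistics.fmean(games[i - WINDOW:i])
    for i in range(WINDOW, len(games) + 1)
]

raw_sd = statistics.stdev(games)        # ~4.5: the variance you bet on
rolling_sd = statistics.stdev(rolling)  # ~1.0: the variance the model sees

print(f"raw sd: {raw_sd:.2f}, {WINDOW}-game rolling sd: {rolling_sd:.2f}")
```

The 20-game window cuts the visible spread by a factor of roughly the square root of 20. The variance you need to predict is destroyed before the model ever sees a feature.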

The practical implication is that “our model uses 200+ features” is a marketing claim, not a quality signal. A model with 200 features and weak validation is almost certainly worse than a model with 20 features validated properly on data it never saw during training.

Ballpark effects are a good example of a feature that earns its place: stable, externally validated, not correlated with team strength, and measurable independently. Park factors and how they move totals is a feature worth adding. A team's last three game scores is not.

See It In Action

We rebuilt our models to fix compression. See the results — 68% totals, 83% NBA ML.

Start Free 5-Day Trial

Regression vs Classification: Predicting a Score vs Beating a Line

There are two fundamentally different ways to build a baseball totals model. Most models use regression: they try to predict the exact number of runs that will be scored. The output is something like “projected total: 9.2 runs.”

The better approach for betting is classification: instead of predicting an exact number, the model predicts the probability that the game goes over a specific line. “Probability of over 8.5: 61%.”

Why does this matter? Because the betting question is not “how many runs will be scored?” The betting question is “is this game more likely to go over or under this specific number?” These are related but different questions, and optimizing for one does not automatically optimize for the other.

A regression model predicting 9.2 runs for a game with an 8.5 line might seem like a clear over play. But if the variance around that prediction is high — as it always is with baseball — a game projected at 9.2 could easily land anywhere from 5 to 14 runs. The expected value of betting over 8.5 in that scenario depends on the full shape of the probability distribution, not just the point estimate.
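To see how thin that edge really is, the sketch below converts a point projection into an over probability using a normal approximation with a league-wide standard deviation. The `prob_over` helper is hypothetical, not our production model:

```python
import math

def prob_over(projected_total: float, line: float, sd: float = 4.5) -> float:
    """P(total > line) under a normal approximation to game totals.

    A rough stand-in: real run totals are discrete and right-skewed,
    and a single league-wide sd ignores matchup-specific variance.
    """
    z = (line - projected_total) / sd
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# A 9.2-run projection against an 8.5 line is only a ~56% over,
# barely above the 52.4% break-even rate at -110.
print(f"P(over 8.5 | projected 9.2): {prob_over(9.2, 8.5):.3f}")
```

A projection 0.7 runs above the line translates to only about a 56% over probability once the variance is accounted for, which is exactly the gap between "looks like a clear over" and "barely clears break-even."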

Classification models trained directly on the over/under outcome are inherently calibrated for the betting decision. They learn the probability of exceeding the line rather than trying to predict an exact score and then inferring the bet from that score. In our testing, classification consistently outperformed regression for this reason.

Our pitcher props model uses the same classification-first approach — the model outputs probability, not a raw projection, and plays are only surfaced when that probability clears a meaningful threshold above the break-even rate.

What Separates High-Confidence Plays From Noise

The most important design decision in a data-driven betting model is the confidence threshold: at what predicted probability do you actually flag a play?

A model that outputs probabilities between 50% and 55% for every game is useless. The break-even rate at -110 is 52.4%. A prediction of 53% probability barely clears that bar, and any measurement error in the model wipes out the edge entirely.

The difference between a useful model and a useless one comes down to whether it can identify a subset of games — maybe 10% to 20% of the slate — where its confidence is genuinely high. If those high-confidence plays hit at 62% instead of the 52% base rate, the model has found something real. If they hit at 54%, the model is probably noise with a compelling backstory.

Confidence Tier Framework

Below 55%    No edge at -110. Pass unless getting plus money.
55%–60%      Marginal edge. Small unit size only, requires favorable odds.
60%–65%      Meaningful edge. Standard unit size, backtested over 500+ plays.
Above 65%    Strong edge. Rare. If the model hits this consistently, it is significant.
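Each tier maps directly to an expected value per bet at -110 juice. A sketch, assuming flat one-unit stakes (the `ev_per_unit` helper is illustrative):

```python
def ev_per_unit(p_win: float, american_odds: int = -110) -> float:
    """Expected profit per one-unit stake at the given American odds."""
    payout = 100 / -american_odds if american_odds < 0 else american_odds / 100
    return p_win * payout - (1.0 - p_win)

for p in (0.53, 0.55, 0.60, 0.65):
    print(f"{p:.0%} model at -110: {ev_per_unit(p):+.3f} units per bet")
```

A 53% play earns about a penny per unit risked; a 60% play earns about nine cents. That order-of-magnitude gap is why the tiers above treat the two so differently.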

The dangerous middle ground is 55%–58% confidence on a large volume of plays. It looks like a working model. It generates plenty of action. But the edge is so thin that a short bad run looks indistinguishable from the model breaking down — and there is no way to tell which it is until you have hundreds more plays.

High-confidence plays should be rare. A model that finds 80% of the daily slate “worth betting” is almost certainly compressing — finding something everywhere because it cannot distinguish signal from noise.

How to Evaluate Whether a Model Is Actually Predictive

If you are evaluating a sports betting model — whether your own or someone else's — here is what to look for:

1. Out-of-Sample Accuracy, Not Backtested Accuracy

Backtested accuracy on the training data is meaningless. The model already saw those games. The only number that matters is performance on games the model had no access to during training. For a model built on 2022–2024 data, that means 2025 season performance — ideally, game-by-game, in the order those games were played.

A serious, data-driven modeling process trains on historical data and tests forward on new data the model never saw. If someone cannot tell you which specific games were in their test set, their accuracy claim is not credible.

2. Calibration: Does 60% Actually Mean 60%?

A well-calibrated model is one where plays predicted at 60% confidence actually win 60% of the time. Poorly calibrated models are systematically overconfident or underconfident — a model might output 65% and win only 55%, which completely changes the expected value calculation.

Calibration can be checked by binning predictions into confidence ranges (55%–60%, 60%–65%, etc.) and comparing the predicted win rate to the actual win rate in each bucket. A well-behaved model will show those numbers tracking each other. A model with calibration problems will show a flat actual win rate regardless of predicted confidence.
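That binning check takes only a few lines. A sketch of the bucketing, run here on hypothetical toy data rather than real model output:

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, bin_width=0.05):
    """Bucket plays by predicted probability and compare each bucket's
    mean prediction to its actual win rate.

    Returns rows of (bin_start, mean_predicted, actual_win_rate, n).
    """
    bins = defaultdict(list)
    for p, won in zip(predictions, outcomes):
        bins[int(p / bin_width)].append((p, won))
    rows = []
    for key in sorted(bins):
        group = bins[key]
        mean_pred = sum(p for p, _ in group) / len(group)
        win_rate = sum(won for _, won in group) / len(group)
        rows.append((round(key * bin_width, 2), mean_pred, win_rate, len(group)))
    return rows

# Toy example: two plays predicted in the 55-60% bucket, two in 60-65%.
for row in calibration_table([0.57, 0.58, 0.62, 0.63], [1, 0, 1, 1]):
    print(row)
```

With real data you want hundreds of plays per bucket before reading anything into the gap between the second and third columns.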

3. Sample Size: 50 Plays Proves Nothing

Baseball is noisy. A model that has hit 62% over 50 plays could be a 62% model, a 55% model on a good run, or a 52% model on an extremely good run. You need several hundred plays at a consistent confidence tier before a win rate is statistically meaningful. The confidence interval around 62% accuracy over 50 plays is enormous — roughly 48% to 74%. Over 500 plays, it tightens to 57%–67%.
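Intervals of that size can be reproduced with the standard normal approximation. A sketch (the exact endpoints depend on which interval formula you use, but the sample-size conclusion is the same):

```python
import math

def wald_interval(win_rate: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for an observed win rate."""
    half = z * math.sqrt(win_rate * (1.0 - win_rate) / n)
    return win_rate - half, win_rate + half

lo50, hi50 = wald_interval(0.62, 50)
lo500, hi500 = wald_interval(0.62, 500)
print(f"62% over  50 plays: {lo50:.1%} to {hi50:.1%}")
print(f"62% over 500 plays: {lo500:.1%} to {hi500:.1%}")
```

At 50 plays the interval still contains the 52.4% break-even rate, so the record is consistent with a money-losing model. At 500 plays the entire interval sits above break-even.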

4. Transparency: What Is the Model Actually Using?

A model built on team batting averages and starter ERA is doing the same thing every free sports analytics site does. Edge comes from signals that are not already priced into the line — park factors adjusted for specific matchups, pitcher-batter historical data, situational factors that are stable and predictive but less obvious.

At Prediction Engine, we publish the conceptual basis of every model on our MLB analytics page, and our pitcher props projections are backtested on full seasons of held-out data before any play is surfaced as a high-confidence pick.

5. The Compression Check: Does It Ever Predict Extremes?

A quick sanity check: look at the full distribution of a model's predictions. If the highest prediction is 10.5 runs and the lowest is 8.1, the model is compressed. Real games in the modern era range from 0 to 33 total runs, with a standard deviation of 4.55. A model that is capturing real variance should produce predictions that spread across at least part of that distribution — projecting some games near 6 runs and others near 13.

Compression is not just a technical failure. It is the clearest possible signal that the model's inputs are too mean-reverting to distinguish between different types of games. The inputs are doing all the smoothing before the model even sees them.
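A crude way to run the compression check on any set of predictions (the threshold, constant, and function name here are ours, purely illustrative):

```python
import statistics

LEAGUE_SD = 4.55  # standard deviation of actual MLB game totals

def compression_ratio(predictions):
    """Spread of a model's predictions relative to real-outcome spread.

    Near 0 means the model is compressed. Note that even a perfect
    model would sit below 1.0, since outcomes = signal + noise; this
    is a rough diagnostic, not a target.
    """
    return statistics.stdev(predictions) / LEAGUE_SD

# A week of predictions from a compressed model: everything near 9 runs.
compressed = [8.8, 9.1, 9.3, 8.9, 9.0, 9.2]
print(f"compression ratio: {compression_ratio(compressed):.2f}")
```

A ratio in the low single digits of a percent, as in this toy example, means the model is effectively predicting the league average every night.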

Frequently Asked Questions

What win rate do you need to be profitable betting MLB totals?

At standard -110 juice, you need to win 52.4% of bets to break even. Most models that claim “55% accuracy” are measuring against the wrong baseline. The base rate for over 8.5 already sits near 51.5%, so a 55% model generates only about 3.5 percentage points of lift over the base rate — and barely 2.6 points over break-even, slim enough to disappear in variance. To be meaningfully profitable you need to demonstrate edge at a specific confidence threshold, not just a raw win rate across all plays.

Why do most sports betting models underperform?

Most models suffer from compression — they predict a narrow range of outcomes (8–10 runs per game) instead of distinguishing between low-scoring and high-scoring matchups. They also overfit by adding too many features, which inflates backtested accuracy without improving live performance. The underlying problem is that the features used (season averages, standings, win/loss records) are mean-reverting by construction and wash out matchup-specific signals.

Is 63% accuracy good for MLB totals betting?

Yes — if it is measured correctly. 63% accuracy on plays where the model expresses high confidence (not across all games) is a strong result. At -110 juice, a 63% win rate sits 10.6 percentage points above the 52.4% break-even rate and returns roughly 20 units of profit per 100 one-unit bets. The key caveat is that the 63% must be validated on held-out data the model never trained on, not on the same games used to build it.

How do you know if a prediction model is overfitting?

The clearest sign of overfitting is a large gap between backtested accuracy and live accuracy. If a model shows 68% on historical data but drops to 51% on new games, it memorized patterns specific to the training set that do not generalize. Other warning signs: the model improves every time you add a feature, it performs better the more data it sees, and its confidence is uniformly high across very different matchups.

Why does adding more features sometimes hurt model accuracy?

More features means more parameters for the model to tune, which increases the risk of memorizing noise instead of learning signal. In sports betting, most additional features are highly correlated with each other. Adding correlated features does not add new information — it adds opportunities for the model to overfit. Fewer, higher-quality features almost always outperform a large noisy feature set.

See what a non-compressed model actually looks like

Prediction Engine surfaces only high-confidence plays — no noise, no filler. Backtested on full seasons of held-out data across MLB, NBA, and NHL.

Start your free 5-day trial

predictionengine.app/pricing — no credit card required to start
