With the FIFA World Cup set to kick off on Thursday, June 11, 2026, with the opening match at the Mexico City Stadium, I think it would be fun to build the best ML model we can to predict match outcomes. To do this, I have brought together several datasets (49,000 matches) with data on Elo ratings, match results, and cup locations. From the FIFA World Cup to the Baltic Cup, with matches from 1872 to 2026, we will take a probabilistic approach to the sport.
We will compare the performance of several ML models, including:
- multinomial regression
- multinomial ridge / elastic-net model
- LightGBM
We will also work to understand the strengths and weaknesses of our models to create a well-calibrated model that correctly identifies about 86% of actual home wins. By weighing model performance, calibration, and complexity, we will find the best model for our data.
Soccer by the Numbers
Distribution of total goals per match in the training dataset, showing a strong concentration of matches with low goal totals and a long right tail of increasingly rare high-scoring games. Illustration by Author.
A lot of people say soccer is sleep-inducing. As a soccer fan, I disagree, but to be fair, this is not without reason. The majority of matches end with fewer than 5 goals, and anything above 20 is an anomaly, if not impossible. In contrast, it’s not uncommon for one player to score more than 50 points in an NBA game. But despite the pace, venues from pubs in England to botecos in Rio remain full.
What critics don’t understand is that low scoring can make a game more interesting: it is harder for teams to build a substantial lead, which keeps fans on edge until the end. Unfortunately, this also means matches end in a draw close to 22% of the time, which can be infuriating. Yet the sport remains as popular as ever.
Annual count of international matches in the pre-2018 training dataset, showing the long-term expansion of international football activity from sparse early records to consistently high match volumes after the late twentieth century. Illustration by Author
The fact that so many matches end in a draw actually becomes a modeling problem later, but before we get to that, let’s go over how we put this data together.
Stitching the data together
Oftentimes the best way to improve a model is to simply get more data. We will be working with international_results.csv, international_team_ratings.csv, and international_goalscorers.csv.
We want to match international_results.csv to international_team_ratings.csv so we can use Elo ratings. This could be simple, but as you might’ve guessed, the team names don’t match up perfectly, so we need to turn to text processing unless we want to check 336 teams individually. We also need to be incredibly careful about when each Elo rating was updated. We could take the Elo rating dated on the day of the match, but that would be a source of data leakage, because Elo ratings are updated only after the match and would already reflect its result. Using it as a feature is tempting but problematic.
Instead, we must take the most recent Elo rating from before the match, and as an additional engineered feature we keep track of the time since that latest Elo update, positing that more recent ratings are more informative than older ones. The code for joining these tables and the entire project is available in the Appendix.
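The "latest rating strictly before the match" lookup can be sketched in base R. This is a minimal illustration, not the project's actual join: the column names `team`, `rating_date`, and `rating`, the helper `latest_rating_before()`, and the toy values are all assumptions.

```r
# Toy ratings table (hypothetical values; column names are assumptions).
ratings <- data.frame(
  team        = c("brazil", "brazil", "brazil"),
  rating_date = as.Date(c("2017-06-01", "2017-09-05", "2017-10-10")),
  rating      = c(2050, 2065, 2080)
)

# Return the latest rating dated strictly BEFORE the match, plus its age in days.
# Using `<` rather than `<=` skips a rating published on match day, which may
# already reflect the result of that match.
latest_rating_before <- function(ratings, team_name, match_date) {
  prior <- ratings[ratings$team == team_name & ratings$rating_date < match_date, ]
  if (nrow(prior) == 0) {
    return(data.frame(rating_pre_match = NA_real_, rating_age_days = NA_real_))
  }
  last <- prior[which.max(as.numeric(prior$rating_date)), ]
  data.frame(
    rating_pre_match = last$rating,
    rating_age_days  = as.numeric(difftime(match_date, last$rating_date, units = "days"))
  )
}

# A match on 2017-10-10 must not see the rating published that same day.
res <- latest_rating_before(ratings, "brazil", as.Date("2017-10-10"))
print(res)
```

The age column plays the role of rating_age_days_home and rating_age_days_away: the lookup returns both the rating and how stale it is.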
Top tournaments by match count in the training dataset, highlighting the dominance of friendlies and FIFA World Cup qualification matches relative to all other international competitions. Illustration by Author.
international_results.csv
| Field type | Examples |
| --- | --- |
| Match identity | source_match_id, date, season, competition |
| Teams | home_team, away_team |
| Final result | home_score, away_score, match_result, result_class |
| Context | neutral, tournament, city, country |
international_team_ratings.csv
| Feature | Meaning |
| --- | --- |
| home_rating_pre_match | Home team Elo before kickoff |
| away_rating_pre_match | Away team Elo before kickoff |
| rating_diff | Home Elo minus away Elo |
| rating_age_days_home | How stale the home team rating is |
| rating_age_days_away | How stale the away team rating is |
international_goalscorers.csv
| Feature idea | Meaning |
| --- | --- |
| Unique scorers in recent matches | Whether a team depends on one scorer or many |
| Goals by top scorer | Concentration of scoring |
| Recent scoring form | Attacking output before this match |
Comparison of match-result class distributions across the training and test splits, showing broadly similar outcome shares with home wins as the most frequent result, followed by away wins and draws. Illustration by Author.
Because we are doing a time-series prediction, we need to ensure our split respects the time order. We will evaluate our model on all games from 2018 onward, which is roughly 8,000 matches.
| Effective split | Approximate date logic |
| --- | --- |
| model train | earlier part of pre-2018 data |
| validation | latest ~20% of the pre-2018 training pool |
| test | 2018 onward |
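The split logic in the table can be sketched in base R as follows. The toy `matches` table and its `date` column are assumptions for illustration; only the 2018 cutoff and the "latest ~20%" rule come from the text.

```r
# Toy match table with a hypothetical `date` column (one match every 4 months).
matches <- data.frame(
  date = seq(as.Date("2010-01-01"), as.Date("2020-12-01"), by = "4 months")
)
matches <- matches[order(matches$date), , drop = FALSE]

test_start <- as.Date("2018-01-01")
is_test    <- matches$date >= test_start

# Pre-2018 pool is already in time order; its most recent ~20% becomes validation.
pool_n  <- sum(!is_test)
n_valid <- ceiling(0.20 * pool_n)

matches$split <- "train"
matches$split[!is_test][seq(pool_n - n_valid + 1, pool_n)] <- "validation"
matches$split[is_test] <- "test"

print(table(matches$split))
```

Because rows are assigned by position in the sorted table, every training match predates every validation match, and every validation match predates every test match.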
Engineered Features
Overview of engineered feature distributions used for model training, showing prior match counts, recent draw rates, goal-difference measures, goals-for and goals-against rates, and points-per-match indicators across home and away team histories. Illustration by Author.
We want to move from basic match-level predictors toward richer pre-match features that capture team strength, attacking and defensive quality, home/away effects, matchup balance, goalkeeper strength, and historical performance trends.
1. Draw-modeling features
The most evident failure of our baseline multinomial logistic regression model was its weak performance at classifying draws. The model could calculate the probability of a draw, because we defined the target variable as match_result ∈ {Home win, Draw, Away win}, but Draw was simply never the most likely outcome. We can see this in the missing Draw column of the confusion matrix.
Row-normalized test confusion matrix for the best baseline model, showing that the model predicts only home and away outcomes, with home wins most often classified correctly and draws never predicted as a separate class. Illustration by Author.
This poor draw performance is not specific to one model family. When we isolate high-confidence errors — cases where the model’s predicted class was wrong, and its maximum predicted probability was at least 0.60 — the same pattern appears across models: they are systematically overconfident in home wins. Many matches that actually ended in draws were assigned a confident home-win prediction, suggesting that the models capture team-strength direction better than match-level uncertainty or draw likelihood.
Counts of high-confidence wrong predictions on the test set for Model, comparing three model families and showing that most confident errors occur when actual draws are predicted as home wins. Illustration by Author.
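The high-confidence error filter described above (wrong top class, maximum probability of at least 0.60) can be sketched like this. The probabilities and outcomes are made-up toy values, not results from the project.

```r
# Hypothetical predicted probabilities for five matches (each row sums to 1).
probs <- matrix(
  c(0.70, 0.20, 0.10,
    0.65, 0.22, 0.13,
    0.45, 0.30, 0.25,
    0.15, 0.20, 0.65,
    0.62, 0.25, 0.13),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("home_win", "draw", "away_win"))
)
actual <- c("home_win", "draw", "draw", "away_win", "draw")

# Top class and its probability for every match.
pred_class <- colnames(probs)[max.col(probs, ties.method = "first")]
max_prob   <- apply(probs, 1, max)

# High-confidence error: wrong top class AND top probability >= 0.60.
high_conf_wrong <- pred_class != actual & max_prob >= 0.60

errors <- table(actual = actual[high_conf_wrong],
                predicted = pred_class[high_conf_wrong])
print(errors)
```

In this toy example both confident errors are actual draws predicted as home wins, the same shape of error the figure reports for the real models.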
To address this ‘blindness’ to the draw option, we can engineer features such as abs_rating_diff, home_draw_rate_last_5, and form_draw_rate_mean_last_5, plus binary context features like neutral, flag_is_world_cup, and flag_is_friendly, which indicate whether the match is on neutral ground, at the World Cup, or a friendly.
| Feature group | Meaning | Examples |
| --- | --- | --- |
| Elo closeness | Measures how evenly matched the teams are. Smaller rating gaps are especially relevant for draw probability. | abs_rating_diff |
| Recent draw tendency | Measures how often each team’s prior matches ended in draws. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Combined draw tendency | Captures whether both teams have recently been draw-prone. | form_draw_rate_mean_last_5, form_draw_rate_mean_last_10 |
| Match context | Tournament and venue indicators that may affect draw frequency. | neutral, flag_is_world_cup, flag_is_friendly |
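As a rough sketch, the draw-oriented features could be derived from columns that already exist before kickoff. Two things here are assumptions rather than the project's definitions: that form_draw_rate_mean_last_5 is the simple average of the two teams' draw rates, and that the tournament labels are exactly "FIFA World Cup" and "Friendly". The fixture values are invented.

```r
# Toy fixtures (hypothetical values; the tournament labels are assumptions).
fixtures <- data.frame(
  home_rating_pre_match = c(1850, 1500),
  away_rating_pre_match = c(1790, 1720),
  home_draw_rate_last_5 = c(0.4, 0.0),
  away_draw_rate_last_5 = c(0.2, 0.2),
  tournament            = c("FIFA World Cup", "Friendly"),
  neutral               = c(TRUE, FALSE)
)

# Elo closeness: the sign says who is favored, the absolute value how lopsided.
fixtures$rating_diff     <- fixtures$home_rating_pre_match - fixtures$away_rating_pre_match
fixtures$abs_rating_diff <- abs(fixtures$rating_diff)

# Combined draw tendency (assumed here to be the mean of both teams' rates).
fixtures$form_draw_rate_mean_last_5 <-
  (fixtures$home_draw_rate_last_5 + fixtures$away_draw_rate_last_5) / 2

# Binary context flags.
fixtures$flag_is_world_cup <- as.integer(fixtures$tournament == "FIFA World Cup")
fixtures$flag_is_friendly  <- as.integer(fixtures$tournament == "Friendly")

print(fixtures[, c("abs_rating_diff", "form_draw_rate_mean_last_5",
                   "flag_is_world_cup", "flag_is_friendly")])
```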
Final LightGBM predicted probabilities by outcome class. Illustration by Author.
With these features, our model can now better discriminate between home/away wins and draws, as evidenced by a 3.3% increase in true-positive draw predictions. This is still low, given that ~20% of matches end in draws, so our features help, but not by much. This suggests that it could be worth building a model dedicated to draws, with the target variable match_result ∈ {D, ¬D}, but for now we need to engineer more features.
Here ¬D means “not D”: the target is 1 if the match ends in a draw and 0 if it does not.
Test confusion matrix for the best LightGBM validation model. Illustration by Author.
2. Elo features
The average team has an Elo slightly above 1500, which is near Saudi Arabia, Iceland, and Haiti heading into the 2026 World Cup. When we graph the rating-difference distributions for home wins, draws, and away wins, we can see that as the difference shrinks, draws become increasingly likely. Our distributions are also slightly shifted to the left, indicating a small home advantage, as expected.
Distribution of pre-match home team ratings. Illustration by Author.
Rating-difference distributions by match result. Illustration by Author.
We would be leaving log-loss points on the table if we relied solely on each team’s pre-match Elo. To get the most from the data, we also include the rating difference and the age of each rating:
| Feature | Meaning |
| --- | --- |
| home_rating_pre_match | Home team Elo rating before kickoff. |
| away_rating_pre_match | Away team Elo rating before kickoff. |
| rating_diff | Home team Elo minus away team Elo before kickoff. Positive values favor the home team. |
| rating_age_days_home | Days since the home team’s Elo rating was last updated. |
| rating_age_days_away | Days since the away team’s Elo rating was last updated. |
Multinomial probability curves by rating difference. Illustration by Author.
3. Rolling past-performance features
A critic could argue that using rolling past performance and Elo is not a good idea, since they both model team strength, which would add redundant or highly correlated features to the model.
Rolling past performance does capture team strength, but it is specifically there to help model team momentum. Winning streaks are a very real thing in sports. In fact, the current top choice by supercomputers is Spain. One reason they are predicted first is their historic 31-match unbeaten streak entering the 2026 World Cup.
| Feature group | Meaning | Examples |
| --- | --- | --- |
| Recent points per match | Average points earned over each team’s previous 5 or 10 matches. | home_points_per_match_last_5, away_points_per_match_last_10 |
| Recent goal difference | Average goals scored minus goals conceded over prior matches. | home_goal_diff_per_match_last_5, away_goal_diff_per_match_last_10 |
| Recent draw rate | Share of prior matches that ended in a draw. | home_draw_rate_last_5, away_draw_rate_last_10 |
| Home-away form differences | Difference between the home and away teams on the same rolling metric. | form_points_diff_last_5, form_goal_diff_diff_last_10 |
| Prior match counts | Number of previous matches available before the fixture. | home_prior_matches, away_prior_matches |
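The key property of these rolling features is that each value uses only matches played before the fixture. A minimal base R sketch, assuming one team's results are already sorted by date, might look like this. The helper `lagged_mean()` and the toy results are hypothetical, and averaging over fewer than k matches early in a team's history is an assumption rather than the project's rule.

```r
# Mean of the previous k values, excluding the current one (no leakage).
# Returns NA where no prior match exists; uses fewer than k values early on.
lagged_mean <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (i > 1) {
      window <- x[max(1, i - k):(i - 1)]
      out[i] <- mean(window)
    }
  }
  out
}

# One team's results in date order: 3 points for a win, 1 for a draw, 0 for a loss.
points  <- c(3, 1, 0, 3, 3, 1, 3)
is_draw <- as.numeric(points == 1)

points_per_match_last_5 <- lagged_mean(points, 5)
draw_rate_last_5        <- lagged_mean(is_draw, 5)
prior_matches           <- seq_along(points) - 1

print(data.frame(points, prior_matches, points_per_match_last_5, draw_rate_last_5))
```

The first row is NA because there is no history yet, which is one reason a prior-match count is worth keeping as its own feature.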
4. Attack and defense form features
Our model tries to capture attacking and defending team strength through points, but this is where it falls short of supercomputer approaches. Modern approaches often also incorporate player data, which is invaluable in computing a team’s strengths. Because we are working only with game-level data, our attacking and defensive features are computed from previous match results: recent scoring rates, recent conceding rates, scoring-rate difference, and conceding-rate difference.
| Feature group | Meaning | Examples |
| --- | --- | --- |
| Recent scoring rate | Average goals scored per match over the previous 5 or 10 matches. | home_goals_for_per_match_last_5, away_goals_for_per_match_last_10 |
| Recent conceding rate | Average goals conceded per match over the previous 5 or 10 matches. | home_goals_against_per_match_last_5, away_goals_against_per_match_last_10 |
| Scoring-rate difference | Home team’s recent scoring rate minus away team’s recent scoring rate. | form_goals_for_diff_last_5, form_goals_for_diff_last_10 |
| Conceding-rate difference | Home team’s recent conceding rate minus away team’s recent conceding rate. Lower values favor the home team defensively. | form_goals_against_diff_last_5, form_goals_against_diff_last_10 |
Correlation heatmap of numeric model features. Illustration by Author.
Grid Search
Because large search grids can overfit in cross-validation, and because grid search cost scales multiplicatively with the number of values per parameter, most parameters are searched on a logarithmic scale (1e-5, 1e-4, 1e-3, 1e-2). The exception is parameters like alpha, which must lie between zero and one.
- glmnet_alpha controls the elastic-net blend between ridge and lasso regression, where zero is pure ridge and one is pure lasso.
- multinomial_decay sets the weight decay: larger values penalize large coefficients more. That can reduce overfitting, but excessive decay can lead to underfitting.
Grid search cost ≈ number of configurations tested × time to train one model
| Model family | Grid/configurations shown | What was tuned |
| --- | --- | --- |
| Baselines | majority_baseline, frequency_baseline, rating_diff_multinom | Mostly not tuned; comparison baselines |
| glmnet | alpha = 0, .25, .5, .75, 1 | Elastic-net mixing parameter |
| multinom | decay = 0, 1e-5, 1e-4, 1e-3, 1e-2 | L2 weight decay / coefficient shrinkage |
| LightGBM | less_regular, deeper, more_regular, current_final, l2_regularized, shallower, l1_l2_regularized, compact_robust, faster_small, slower_small | Named bundles of tree-depth, learning-rate, boosting-round, and regularization settings |
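To see how quickly the multiplicative cost grows, here is a small sketch that counts the configurations in a full Cartesian grid. The parameter values and the 20-second fit time are illustrative assumptions, not the values used in the project.

```r
# Hypothetical full grid over four LightGBM parameters (values are illustrative).
full_grid <- expand.grid(
  learning_rate = c(0.01, 0.03, 0.1),
  num_leaves    = c(15, 31, 63),
  lambda_l2     = c(1e-5, 1e-4, 1e-3, 1e-2),
  max_depth     = c(4, 6, 8)
)

# The configuration count is the product of the grid sizes: 3 * 3 * 4 * 3.
n_configs <- nrow(full_grid)

# Total cost = configurations * time per fit (assumed 20 seconds here).
seconds_per_fit <- 20
est_minutes     <- n_configs * seconds_per_fit / 60

cat("configurations:", n_configs, "\n")
cat("estimated minutes:", est_minutes, "\n")
```

Adding one more parameter with three values would triple the count, which is one reason to compare a handful of named bundles instead of a full grid.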
LightGBM was the most complex model family in the comparison. Unlike the baseline models, which used few or no tuning parameters, LightGBM required choices about tree complexity, learning rate, boosting rounds, and regularization. This made it more flexible, but also increased the risk of overfitting if the parameters were not tuned carefully. We also need to take care not to use a model that is more complicated than our data requires, as we could lose out on interpretability.
The GBM parameters were tuned by comparing a compact grid of LightGBM configurations. These configurations varied tree complexity, learning speed, number of boosting rounds, and regularization strength, keeping the best model scored on log-loss. Below is a list of the LightGBM parameters.
| Parameter | Meaning |
| --- | --- |
| learning_rate | How much each new tree is allowed to change the model. Lower values learn more slowly but can generalize better. |
| num_iterations / nrounds | Number of boosting rounds, meaning how many trees are added. More trees can improve performance but can also overfit. |
| num_leaves | Controls how complex each tree can be. More leaves allow more detailed patterns but increase overfitting risk. |
| max_depth | Maximum depth of each tree. Deeper trees capture more complex interactions. Shallower trees are simpler and safer. |
| min_data_in_leaf | Minimum number of observations required in a leaf. Higher values make the model less sensitive to small noisy patterns. |
| lambda_l1 | L1 regularization. Pushes some effects toward zero, making the model simpler. |
| lambda_l2 | L2 regularization. Shrinks large effects and reduces overconfidence. |
| feature_fraction | Fraction of features used for each tree. Using fewer features can reduce overfitting. |
| bagging_fraction | Fraction of rows used for each tree. Using fewer rows can also reduce overfitting. |
| bagging_freq | How often row subsampling is applied. If set to 0, bagging is usually off. |
Validation log loss by Model configurations. Illustration by Author.
Best validation log loss by model family. Illustration by Author.
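Since configurations are ranked by log loss, it helps to see what the metric computes. The sketch below is a generic multiclass log loss in base R, not the project's scoring code; the function name and the class labels are assumptions.

```r
# Multiclass log loss: mean of -log(probability assigned to the true class).
# `probs` has one named column per class; `actual` holds the true class names.
multiclass_log_loss <- function(probs, actual, eps = 1e-15) {
  idx    <- cbind(seq_len(nrow(probs)), match(actual, colnames(probs)))
  p_true <- pmin(pmax(probs[idx], eps), 1 - eps)  # clip to avoid log(0)
  -mean(log(p_true))
}

classes <- c("home_win", "draw", "away_win")
actual  <- c("home_win", "draw", "away_win", "home_win")

# Reference point: predicting 1/3 for every class scores log(3), about 1.0986.
uniform <- matrix(1 / 3, nrow = 4, ncol = 3, dimnames = list(NULL, classes))
print(multiclass_log_loss(uniform, actual))
```

A three-class model that always predicts one-third for each outcome scores log(3), so any useful model should land below that value, and lower is better.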
Final Model
The officially selected model was LightGBM with the safe_plus_form_compact feature set, using 20 pre-match features drawn from Elo ratings, tournament context, and lagged team summaries. It was selected based on the lowest validation-set multiclass log loss, with the test set reserved for final reporting.
The selected LightGBM model achieved a validation log loss of 0.893 and a test log loss of 0.873. Its validation result was the best within the model comparison, but the margin over regression was small: multinomial regression trailed by only about 0.002 log-loss points on validation. On the held-out test set, multinomial regression slightly outperformed LightGBM on both log loss and macro F1.
Incremental log loss across feature tiers. Illustration by Author.
That means the result should be interpreted cautiously. LightGBM is the officially selected predictive model, but the evidence does not show that gradient boosting clearly dominates simpler regression models for the given data. Regression models remain incredibly important because they are easier to interpret and perform nearly as well as, and in some test metrics slightly better than, other methods.
Baseline model metrics across test and validation splits. Illustration by Author.
Feature engineering produced similarly modest gains. Compact lagged features improved validation log loss relative to baseline, but the test improvement was tiny. Goalscorer features did not meaningfully improve log loss in the model comparison.
Classwise LightGBM F1 by feature tier. Illustration by Author.
The clearest limitation was draw prediction. The selected model almost never predicted draw as the top class: on the test set, it correctly predicted only 2 draws out of 1,784 actual draws, for draw recall of 0.11%. This suggests that the model’s probability estimates may still contain useful information, but argmax classification remains strongly biased toward home and away wins, making a separate model for draw modeling a reasonable next step. Elo and compact pre-match form provide a useful signal stack, but the gains over strong baselines are incremental.
The model is much better at predicting home wins than away wins on the test set:
- It correctly identifies about 87% of actual home wins
- It correctly identifies about 63% of actual away wins
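The percentages above are per-class recall: correct predictions for a class divided by all actual matches of that class. A small sketch, using made-up toy labels plus the draw counts quoted earlier (2 of 1,784), shows the calculation. The helper `per_class_recall()` is hypothetical.

```r
# Per-class recall from actual and predicted labels.
per_class_recall <- function(actual, predicted, classes) {
  cm <- table(actual    = factor(actual, levels = classes),
              predicted = factor(predicted, levels = classes))
  diag(cm) / rowSums(cm)  # correct predictions / actual cases, per class
}

classes   <- c("home_win", "draw", "away_win")
# Toy labels (not the article's data): draws are never the predicted class.
actual    <- c("home_win", "home_win", "home_win", "draw", "draw", "away_win", "away_win")
predicted <- c("home_win", "home_win", "away_win", "home_win", "away_win", "away_win", "away_win")

toy_recall <- per_class_recall(actual, predicted, classes)
print(round(toy_recall, 3))

# Draw recall from the counts quoted in the text: 2 correct out of 1,784 draws.
draw_recall <- 2 / 1784
cat(sprintf("draw recall: %.2f%%\n", 100 * draw_recall))
```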
The model is also capable of outputting a probability distribution over Home, Draw, and Away wins, which is often more useful than just a single hard prediction.
Calibration
Final model confidence by prediction correctness. Illustration by Author.
The baseline-plus models are broadly well calibrated on the test set across confidence bins. Predicted confidence tracks observed accuracy: when the models are moderately confident, they are correct at roughly the corresponding rate, and when confidence rises, observed accuracy rises with it. The deviations from the ideal calibration line are modest, suggesting that the models’ probability estimates are generally usable rather than just a rank-ordering of outcomes.
The plot below measures calibration of the top predicted class—the model’s confidence in whichever outcome it chose—not calibration for home wins, draws, and away wins separately. A model can therefore look well calibrated overall while still misestimating one class, especially draws. The aggregate calibration plot supports the claim that the models’ confidence scores are broadly trustworthy, but it does not, by itself, show that the draw probabilities are well calibrated.
Test calibration curves for baseline-plus models. Illustration by Author.
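A top-class calibration table of this kind can be built by binning predictions on their maximum probability and comparing mean confidence with accuracy in each bin. The sketch below uses simulated predictions that are calibrated by construction, not the article's data; the function name and bin width are assumptions.

```r
# Bin predictions by confidence (max predicted probability) and compare the
# mean confidence in each bin with the observed accuracy in that bin.
top_class_calibration <- function(max_prob, correct, breaks = seq(0, 1, by = 0.1)) {
  bin <- cut(max_prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin             = levels(bin),
    n               = as.vector(table(bin)),
    mean_confidence = as.vector(tapply(max_prob, bin, mean)),
    accuracy        = as.vector(tapply(correct, bin, mean))
  )
}

# Simulated predictions: each one is correct with probability equal to its
# stated confidence, so the table should sit close to the ideal diagonal.
set.seed(1)
max_prob <- runif(5000, min = 1 / 3, max = 0.95)
correct  <- rbinom(5000, size = 1, prob = max_prob)

calib <- top_class_calibration(max_prob, correct)
print(calib[calib$n > 0, ])
```

Running the same function separately on each class's predicted probability and a 0/1 indicator for that class would give the class-specific view, which is where draw behavior becomes visible.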
The class-specific calibration plots show where that aggregate picture holds and where it becomes more complicated. Home-win and away-win probabilities follow the ideal calibration line closely across most bins: as the model assigns higher probability to either outcome, the observed frequency rises at roughly the same rate. In practical terms, the model’s home and away probabilities behave like meaningful probabilities, not just scores.
Calibration bins for the best validation model. Illustration by Author.
Draws are different. The model’s draw probabilities are reasonably calibrated within its range, but that range is narrow. It rarely assigns draw probabilities much above the low-to-middle range, even when the match is relatively balanced.
This is the central distinction: the model does not ignore draws; it usually treats them as risk factors rather than likely outcomes. Draw probabilities may still be useful for measuring draw risk, but draws seldom become the model’s top prediction, which helps explain the persistent weakness in draw recall.
Test calibration by class for Model 33. Illustration by Author.
Rating Difference Analysis
The rating-difference analysis shows why draws are structurally difficult for the model. Observed draw rates are highest when the teams are closely matched and decline as the absolute Elo rating gap widens. All three model families learn this broad pattern: their predicted draw probabilities also fall as matches become more lopsided.
The failure is not directional but scalar. In the most evenly matched fixtures, the observed draw rate is roughly one-third, while the models assign draw probabilities closer to one-quarter. They correctly identify balanced matches as more draw-prone, but they do not raise the draw probability enough. As a result, the model can recognize draw risk without often selecting a draw as the most likely outcome. This reconciles the apparent contradiction between reasonable draw calibration and weak draw recall: the probabilities move in the right direction, but usually not far enough to win the argmax decision, which picks the class with the highest predicted probability.
Model 25 draw rates by rating-difference bucket. Illustration by Author.
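The bucketed comparison behind this kind of plot can be sketched as follows. The eight fixtures, the bucket edges, and the predicted draw probabilities are invented for illustration and are not the article's data.

```r
# Toy data (not the article's): rating gap, actual draw flag, predicted P(draw).
abs_rating_diff <- c(10, 40, 80, 120, 180, 260, 320, 450)
is_draw         <- c(1, 0, 1, 0, 0, 1, 0, 0)
p_draw          <- c(0.27, 0.26, 0.26, 0.25, 0.23, 0.20, 0.18, 0.14)

# Assumed bucket edges on the absolute Elo gap.
bucket <- cut(abs_rating_diff,
              breaks = c(0, 100, 200, 400, Inf),
              labels = c("0-100", "100-200", "200-400", "400+"),
              include.lowest = TRUE)

by_bucket <- data.frame(
  bucket              = levels(bucket),
  observed_draw_rate  = as.vector(tapply(is_draw, bucket, mean)),
  mean_predicted_draw = as.vector(tapply(p_draw, bucket, mean))
)

# Positive gap: draws happened more often than the model's average P(draw).
by_bucket$gap <- by_bucket$observed_draw_rate - by_bucket$mean_predicted_draw
print(by_bucket)
```

A positive gap in the closest bucket is the pattern described above: the predicted draw probability rises for balanced matches, but not as far as the observed rate.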
Feature Importance
As you might expect, the most important feature for our model is the rating difference, followed at a distant second by whether the match was on neutral ground. By checking the feature importance, we can see which of our engineered features provided meaningful signal.
Model 33 LightGBM feature importance by gain. Illustration by Author.
Outcome rates by rating-difference bucket. Illustration by Author.
Conclusion
I think this is a good time to discuss dataset size and model choice. Typically, the larger and more complex the dataset, the more reason we have to choose a more complicated model. As we saw in this example, the gains from switching from regression to LightGBM were very small; this is a good sign that attempting a more complex model on this data will not yield better predictions. Football forecasting is less about finding a magic algorithm and more about building leakage-safe features, comparing interpretable baselines, and asking whether the model’s confidence is deserved.
For now, one thing is clear: we’re gonna need more data if we want to get a better prediction, particularly player-level data. Knowing if Neymar is sitting out is very important. The granularity of the data also matters if we want to update our forecast as the game progresses.
Appendix
The code for the whole project can be found on my GitHub
The data source has a Creative Commons CC0-1.0 license
make_team_clean <- function(team_name) normalizes a team name by passing it through these steps:
- stringr::str_squish()
- stringi::stri_trans_general("Latin-ASCII")
- Converts accented Latin characters to plain ASCII characters.
- str_to_lower()
- stringr::str_replace_all("[^a-z0-9]+", "_")
- It replaces anything that is not a lowercase letter or number with an underscore.
- stringr::str_replace_all("^…
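Put together, the steps listed above could look like the sketch below. It is a reconstruction under assumptions, not the project's exact function: the as.character() coercion and the final step that trims leading and trailing underscores are guesses, because the original snippet is truncated at that point.

```r
# Sketch of a team-name normalizer built from the steps listed above.
make_team_clean <- function(team_name) {
  x <- as.character(team_name)                          # assumed coercion step
  x <- stringr::str_squish(x)                           # trim and collapse whitespace
  x <- stringi::stri_trans_general(x, "Latin-ASCII")    # strip accents
  x <- stringr::str_to_lower(x)
  x <- stringr::str_replace_all(x, "[^a-z0-9]+", "_")   # non-alphanumerics -> "_"
  x <- stringr::str_replace_all(x, "^_+|_+$", "")       # assumed: trim edge underscores
  x
}

# Variants of the same name map to one join key.
print(make_team_clean(c("  C\u00f4te d'Ivoire ", "Cote d'Ivoire", "Korea  Republic")))
```

Applying the same function to the team columns of both tables before joining means spelling, accent, and spacing differences no longer block a match, though genuinely different names for one team would still need a manual mapping.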
