people! If you have ever wanted to understand how linear regression works or just refresh the main ideas without jumping between lots of different sources – this article is for you. It is an extra long read that took me more than a year to write. It’s built around five key ideas:
- Visuals first. This is a comic-style article: reading the text helps, but it’s not required. A quick run through the images and animations can still give you a solid understanding of how things work. There are 100+ visuals in total;
- Animations where they might help (33 total). Computer science is best understood in motion, so I use animations to explain key ideas;
- Beginner-friendly. I kept the material as simple as possible to make the article easy for beginners to follow;
- Reproducible. Most visuals were generated in Python, and the code is open source;
- Focus on practice. Each new step solves a problem that surfaced in the previous one, so the whole article stays connected.
One more thing: the post is simplified on purpose, so some wording and examples may be a bit rough or not perfectly precise. Please don’t just take my word for it – think critically and double-check my points. For the most important parts, I provide links to the source code so you can verify everything yourself.
Table of contents
Who is this article for
Skip this paragraph, just scroll through the article for two minutes and look at the visuals. You’ll immediately know if you want to read it properly (the main ideas are shown in the plots and animations). This post is for beginners and for anyone working with data – and also for experienced people who want a quick refresh.
What this post covers
The article is structured in three acts:
- Linear regression: what it is, why we use it, and how to fit a model;
- How to evaluate the model’s performance;
- How to improve the model when the results are not good enough.
At a high level, this article covers:
- data-driven modeling;
- analytical solution for linear regression, and why it is not always practical;
- ways to evaluate model quality, both visually and with metrics;
- Multiple linear regression, where predictions are based on many features;
- the probabilistic side of linear regression, since predictions are not exact and it is important to quantify uncertainty;
- ways to improve model quality, from adding complexity to simplifying the model with regularization.
More specifically, it walks through:
- the least squares method for simple linear regression;
- regression metrics such as R², RMSE, MAE, MAPE, SMAPE, along with the Pearson correlation coefficient and the coefficient of determination, plus visual diagnostics like residual plots;
- maximum likelihood and prediction intervals;
- train/test splits, why they matter and how to do them;
- outlier handling methods, including RANSAC, Mahalanobis distance, Local Outlier Factor (LOF), and Cook’s distance;
- data preprocessing, including normalization, standardization, and categorical encoding;
- the linear algebra behind least squares, and how it extends to multivariate regression;
- numerical optimization methods, including gradient descent;
- L1 and L2 regularization for linear models;
- cross-validation and hyperparameter optimization.
Although this article focuses on linear regression, some parts – especially the section on model evaluation – apply to other regression algorithms as well. The same goes for the feature preprocessing chapters.
Since this is meant as an introductory, ML-related guide to linear regression, I’ll mostly avoid vector notation (where formulas use vectors instead of scalars). In other words, you will hardly see vectors and matrices in the equations, except in a few places where they are truly necessary. Keep in mind that most of the formulas shown here do have a vector form, and modern libraries implement the algorithms in exactly that way. Those implementations are efficient and reliable, so if you decide to code things up, don’t reinvent the wheel – use well-tested libraries or tools with UI when it makes sense.
All animations and images in the article are original and created by the author.
A brief literature review
This topic isn’t new, so there’s plenty of material out there. Below is a short list of direct predecessors, similar in platform (mostly Towards Data Science) and audience, meaning browser-first readers rather than textbook readers. The list is ordered by increasing subjective complexity:
- What is Linear Regression? – A beginner-friendly overview of what linear regression is, what the line represents, how predictions are made, with simple visuals and code;
- A Practical Guide to Linear Regression – Presents linear model fitting as a machine learning pipeline: EDA, feature handling, model fitting, and evaluation on a real Kaggle dataset;
- Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models – An easy-to-follow guide with step-by-step calculations and memorable, nice visuals;
- Predict Housing Price using Linear Regression in Python – implementation-oriented article built around the Boston Housing dataset, with code examples for calculations from scratch;
- Multiple Linear Regression Analysis – An article with more mathematical detail, focused on multicollinearity;
- Mastering Linear Regression: The Definitive Guide For Aspiring Data Scientists – A long, all in one guide, theory plus Python;
- Linear Regression In Depth (Part 1) and Linear Regression In Depth (Part 2) – Deeper theory-plus-implementation articles that focus on simple linear regression and set up the transition to multiple regression.
And of course, do not ignore the classic papers if you want to read more about this topic. I’m not listing them as a separate bibliography in this section, but you’ll find links to them later in the text. Each reference appears right after the fragment it relates to, in square brackets, in the format: [Author(s). Title. Year. Link to the original source]
A good model starts with data
Let’s assume we have tabular data with two columns:
- Number of rooms in the apartment;
- The price of the apartment, $
Figure 1. Visualization of the original dataset on apartment prices (link to the code for generating the image – image by author)
By the time you build a model, there should already be data. Data collection and the initial preparation of the dataset are outside the scope of this article, especially since the process can vary a lot depending on the domain. The main principle to keep in mind is “garbage in, garbage out,” which applies to supervised machine learning in general. A good model starts with a good dataset.
Disclaimer regarding the dataset: The data used in this article is synthetic and was generated by the author. It is distributed under the same license as the source code – BSD 3-Clause.
Why do we need a model?
As the British statistician George Box once said, “All models are wrong, but some are useful.” Models are useful because they help us uncover patterns in data. Once those patterns are expressed as a mathematical relationship (a model), we can use it, for example, to generate predictions (Figure 2).
Figure 2. Turning a data table into a model – and what can be considered a model (image by author)
Modeling relationships in data is not a trivial task. It can be done using mathematical models of many different kinds – from simple ones to modern multi-stage approaches such as neural networks. For now, the key point is that a “model” can mean any kind of mapping from one set of data (feature columns) to a target column. I’ll use this definition throughout the article.
Figure 3. The model can be (almost) anything (link to the code for generating the image – image by author)
In linear regression, we model linear relationships between data variables. In pair (one-feature) regression – when there is one feature and one dependent variable – the equation has the form:
y = b0 + b1 · x, where x is the feature and y is the target variable [James, G., et al. Linear Regression. An Introduction to Statistical Learning, 2021. Free version https://www.statlearning.com/].
So the expression y = 1 + 10 · x is a linear regression model. And y = 15 − 21 · x is one as well – the only difference is the coefficients. Since the coefficients are the key parameters of the equation, they have their own names:
- b0 – the intercept (also called the bias term)
- b1 – the slope coefficient
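To make the equation concrete, here is a minimal sketch of such a model as a plain Python function (the coefficients are the toy examples above, not fitted values):

```python
# A linear regression model is just the equation y = b0 + b1 * x.
def predict(x, b0, b1):
    """Predict the target for a single feature value x."""
    return b0 + b1 * x

# The two example models from the text:
print(predict(2, 1, 10))    # y = 1 + 10 * x   -> 21
print(predict(2, 15, -21))  # y = 15 - 21 * x  -> -27
```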
So, when we build a linear regression model, we make the following assumption:
Assumption 1. The relationship between the features (independent variables) and the response (dependent variable) is linear [Kim, Hae-Young. Statistical notes for clinical researchers: simple linear regression 1 – basic concepts, 2018. https://www.rde.ac/upload/pdf/rde-43-e21.pdf]
An example of a linear model with the intercept and slope coefficients already fitted (we will discuss why they are called that a bit later) is shown in Figure 4.
Figure 4. A linear regression model and its predictions (link to the code for generating the image – image by author)
For the dataset shown in Figure 1, estimating the apartment price in dollars means multiplying the number of rooms by 10 000.
Important note: we are focusing on an approximation – so the model line does not have to pass through every data point, because real-world data almost never falls exactly on a single straight line. There is always some noise, and there are factors the model does not see. It’s enough for the model line to stay as close to the observed data as possible. If you don’t remember the difference between approximation, interpolation, and extrapolation, check the image below.
Side branch 1. Difference between approximation, interpolation and extrapolation
Extra Figure 1. The difference between the terms interpolation, extrapolation, and approximation (image by author)
How to build a simple model
We need to choose the coefficients b0 and b1 in the equation below so that the straight line fits the empirical observations (the real data) as closely as possible: y = b0 + b1 · x, where x is the number of rooms and y is the apartment price in $.
Why this equation, and why two coefficients
Despite its apparent simplicity, the linear regression equation can represent many different linear relationships, as shown in Figure 5. For each dataset, a different line will be optimal.
Figure 5. Examples of equations with different optimal coefficient values (link to the code for generating the image – image by author)
Analytical solution
To find the optimal coefficient values, we will use an analytical solution: plug the empirical data from the previous section into a well-known formula derived long ago (by Carl Gauss and Adrien-Marie Legendre). The analytical solution can be written as four simple steps (Figure 6) [Hastie, T., et al. Linear Methods for Regression (Chapter 3 in The Elements of Statistical Learning: Data Mining, Inference, and Prediction). 2009. https://hastie.su.domains/ElemStatLearn].
Figure 6. Analytical solution for simple linear regression. Step 2 shows a Python-like pseudocode for computing the slope coefficient (link to the code for doing the computations – image by author)
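The four steps can also be sketched in plain Python – a minimal illustration of the standard least squares formulas on made-up data that lies exactly on y = 0 + 10000 · x (the exact step ordering in Figure 6 may differ; for real work, use a well-tested library):

```python
def fit_simple_linear_regression(xs, ys):
    """Least squares estimates for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n                                # step 1: mean of x
    y_mean = sum(ys) / n                                # step 2: mean of y
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))         # step 3: slope
    b0 = y_mean - b1 * x_mean                           # step 4: intercept
    return b0, b1

rooms = [1, 2, 3, 4]
price = [10_000, 20_000, 30_000, 40_000]
print(fit_simple_linear_regression(rooms, price))  # -> (0.0, 10000.0)
```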
Error is also part of the model
Earlier, I noted that linear regression is an approximation algorithm. This means we do not require the line to pass exactly through the observations. In other words, even at this stage we allow the model’s predictions to differ from the observed apartment prices. And it is important to emphasize: this kind of mismatch is completely normal. In the real world, it is very hard to find a process that generates data lying perfectly on a straight line (Figure 7).
Figure 7. Real-world data can rarely be described by a model without any residual error. That’s why the linear regression equation includes an error term (link to the code for generating the image – image by author)
So, the model needs one more component to be realistic: an error term. With real data, error analysis is essential – it helps spot problems and fix them early. Most importantly, it provides a way to quantify how good the model really is.
How to measure model quality
Model quality can be assessed using two main approaches:
- Visual evaluation
- Metric-based evaluation
Before we dive into each one, it is a good moment to define what we mean by “quality” here. In this article, we will consider a model a good one when the error term is as small as possible.
Using the original dataset (see Figure 1), different coefficient values can be plugged into the linear regression equation. Predictions are then generated for the known examples, and the difference between predicted and actual values is compared (Table 1). Among all combinations of the intercept and slope, one pair yields the smallest error.
| Number of rooms | Model (b0 + b1 · rooms) | Prediction | Ground truth (observation) | Error (observation − prediction) |
| --- | --- | --- | --- | --- |
| 2 | 0 + 10000 · 2 | 20 000 | 20 000 | 0 |
| 2 | 0 + 5000 · 2 | 10 000 | 20 000 | 10 000 |
| 2 | 500 + 1000 · 2 | 2 500 | 20 000 | 17 500 |

Table 1. Error comparison for a single observation (with two rooms) under different values of the coefficients b0 and b1
The table example above is easy to follow because it is a small, toy setup. It only shows how different models predict the price of a two-room apartment, and in the original dataset each “number of rooms” value maps to a single price. Once the dataset gets larger, this kind of manual comparison becomes impractical. That’s why model quality is usually assessed with evaluation tools (visuals, metrics and statistical tests) rather than hand-made tables.
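For completeness, the comparison from Table 1 takes only a few lines of Python:

```python
# Errors for a single 2-room observation (ground truth: $20,000)
# under the three coefficient pairs from Table 1.
rooms, ground_truth = 2, 20_000
models = [(0, 10_000), (0, 5_000), (500, 1_000)]  # (b0, b1) pairs

errors = []
for b0, b1 in models:
    prediction = b0 + b1 * rooms
    errors.append(ground_truth - prediction)

print(errors)  # -> [0, 10000, 17500]
```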
To make things a bit more realistic, the dataset will be expanded in three versions: one easy case and two that are harder to fit. The same evaluation will then be applied to these datasets.
Figure 8. Three datasets: examples of expanded samples (A, B, C) with apartment prices for evaluating model performance (link to the code for generating the image – image by author)
Figure 8 is closer to real life: apartments vary, and even if the number of rooms is the same, the price across different properties doesn’t have to be identical.
Visual evaluation
Using the formula from the Analytic Solution section (Figure 6), the data can be plugged in to obtain the following models for each dataset:
- A: y = 0 + 10000 · x, where x is the number of rooms
- B: y = 0 + 10000 · x, where x is the number of rooms
- C: y = 6800 + 6600 · x, where x is the number of rooms
A useful first plot to show here is the scatter plot: the feature values are placed on the x-axis, while the y-axis shows both the predicted values and the actual observations, in different colors. This kind of figure is straightforward to interpret – the closer the model line is to the real data, the better the model. It also makes the relationship between the variables easier to see, since the feature itself is shown on the plot [Piñeiro, G., et al. How to evaluate models: Observed vs. predicted or predicted vs. observed? 2008. https://doi.org/10.1016/j.ecolmodel.2008.05.006].
Figure 9. Visual evaluation of model quality: predicted values shown alongside the observed values in a scatter plot (link to the code for generating the image – image by author)
One downside of this plot is that it becomes hard to introduce additional features once you have more than one or two – for example, when price depends not only on the number of rooms, but also on the distance to the nearest metro station, the floor level, and so on. Another issue is scale: the target range can strongly shape the visual impression. Tiny differences on the chart, barely visible to the eye, may still correspond to errors of several thousand dollars. Price prediction is a great example here, because a misleading visual impression of model errors can translate directly into money.
When the number of features grows, visualizing the model directly (feature vs. target with a fitted line) quickly becomes messy. A cleaner alternative is an observed vs. predicted scatter plot. It’s built like this: the x-axis shows the actual values, and the y-axis shows the predicted values (Figure 10) [Moriasi, D. N., et al. Hydrologic and Water Quality Models: Performance Measures and Evaluation Criteria. 2015. pdf link]. I’ve also seen the axes swapped, with predicted values on the x-axis instead. Either way, the plot serves the same purpose – so feel free to choose whichever convention you prefer.
Figure 10. Visual evaluation of model quality: observed vs. predicted scatter plot (link to the code for generating the image – image by author)
This plot is read as follows: the closer the points are to the diagonal line coming from the bottom-left corner, the better. If the model reproduced the observations perfectly, every point would sit exactly on that line without any deviation (dataset A looks pretty close to this ideal case).
When datasets are large, or the structure is uneven (for example, when there are outliers), Q-Q plots can be helpful. They show the same predicted and observed values on the same axes, but after a special transformation.
Q-Q plot, option 1 – order statistics. Predicted values are sorted in ascending order, and the same is done for the observed values. The two sorted arrays are then plotted against each other, just like in Figure 10.
Q-Q plot, option 2 – two-sample Q-Q plot. Here the plot uses quantiles rather than raw sorted values. The data are grouped into a finite number of levels (I usually use around 100). This plot is useful when the goal is to compare the overall pattern, not individual “prediction vs. observation” pairs. It helps to see the shape of the distributions, where the median sits, and how common very large or very small values are.
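Here is a rough sketch of how the data for a two-sample Q-Q plot can be prepared with the standard library (the arrays below are made up for illustration):

```python
import statistics

observed  = [10_000, 20_000, 20_000, 30_000, 40_000, 45_000]
predicted = [11_000, 19_000, 21_000, 29_000, 41_000, 44_000]

# 99 quantile levels (0.01 ... 0.99): no matter how large the samples are,
# the plot is reduced to 99 scatter points.
q_obs  = statistics.quantiles(observed,  n=100, method="inclusive")
q_pred = statistics.quantiles(predicted, n=100, method="inclusive")

points = list(zip(q_obs, q_pred))  # scatter q_obs against q_pred
print(len(points))  # -> 99
```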
Side branch 2. Reminder about quantiles
According to Wikipedia, a quantile is a value that a given random variable does not exceed with a fixed probability.
Setting the probability wording aside for a moment, a quantile can be thought of as a value that splits a dataset into parts. For example, the 0.25 quantile is the number below which 25% of the sample lies. And the 0.9 quantile is the value below which 90% of the data lies.
For the sample [ 1 , 3 , 5 , 7 , 9 ] the 0.5 quantile (the median) is 5. There are only two values above 5 (7 and 9), and only two below it (1 and 3).
The 0.25 quantile is approximately 3, and the 0.75 quantile is approximately 7. See the explanation in the figure below.
Extra Figure 2. A little about quantiles and percentiles (image by author)
The 25th percentile is also called the first quartile, the 50th percentile is the median or second quartile, and the 75th percentile is the third quartile.
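The worked example above can be checked with the standard library (the "inclusive" method interpolates between sorted data points, matching the values quoted here):

```python
import statistics

sample = [1, 3, 5, 7, 9]
q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
print(q1, median, q3)  # -> 3.0 5.0 7.0
```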
Figure 11. Visual evaluation of model quality: Q-Q plot. The 25th, 50th, and 75th percentiles are highlighted with numbered labels and black outlines (that is, the quantiles at levels 0.25, 0.50, and 0.75) (link to the code for generating the image – image by author)
In the second variant, no matter how large the dataset is, this plot always shows 99 points, so it scales well to large samples. In Figure 11, the real and predicted quantiles for dataset A lie close to the diagonal line, which indicates a good model. For dataset B, the right tail of the distributions (upper-right corner) starts to diverge, meaning the model performs worse on high-priced apartments.
For dataset C:
- Below the 25th percentile, the predicted quantiles lie above the observed ones;
- Within the interquartile range (from the 25th to the 75th percentile), the predicted quantiles lie below the observed ones;
- Above the 75th percentile, the predicted tail again lies above the observed one.
Another widely used diagnostic is the residual plot. The x-axis shows the predicted values, and the y-axis shows the residuals. Residuals are the difference between the observed and predicted values. If you prefer, you can define the error with the opposite sign (predicted minus observed) and plot that instead. It doesn’t change the idea – only the direction of the values on the y-axis.
Figure 12. Visual evaluation of model quality: residual plot (link to the code for generating the image – image by author)
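Computing residuals themselves is trivial (a toy example with made-up numbers):

```python
# Residuals: observed minus predicted.
observed  = [20_000, 30_000, 41_000, 52_000]
predicted = [21_000, 29_000, 40_000, 50_000]

residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)      # -> [-1000, 1000, 1000, 2000]

# For a residual plot, scatter `predicted` on the x-axis
# against `residuals` on the y-axis.
mean_residual = sum(residuals) / len(residuals)
print(mean_residual)  # -> 750.0
```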
A residual plot is one of the most convenient tools for checking the key assumptions behind linear regression (Assumption 1 (linearity) was introduced earlier):
- Assumption 2. Normality of residuals. The residuals (observed minus predicted) should be approximately normally distributed. Intuitively, most residuals should be small and close to zero, while large residuals are rare. Residuals occur roughly equally often in the positive and negative direction.
- Assumption 3. Homoscedasticity (constant variance). The model should have errors of roughly the same magnitude across the full range: cheap apartments, mid-range ones, and expensive ones.
- Assumption 4. Independence. Observations (and their residuals) should be independent of each other – i.e., there should be no autocorrelation.
Figure 12 shows that dataset B violates Assumption 3: as the number of rooms increases, the errors get larger – the residuals fan out from left to right, indicating increasing variance. In other words, the error is not constant and depends on the feature value. This usually means the model is missing some underlying pattern, which makes its predictions less reliable in that region.
For dataset C, the residuals don’t look normal: the model sometimes systematically overestimates and sometimes systematically underestimates, so the residuals drift above and below zero in a structured way rather than hovering around it randomly. On top of that, the residual plot shows visible patterns, which can be a sign that the errors are not independent (to be fair, not always XD but either way it’s a signal that something is off with the model).
A nice companion to Figure 12 is a set of residual distribution plots (Figure 13). These make the shape of the residuals immediately visible: even without formal statistical tests, you can eyeball how symmetric the distribution is (a good sign is symmetry around zero) and how heavy its tails are. Ideally, the distribution should look bell-shaped, most residuals should be small, while large errors should be rare.
Figure 13. Visual evaluation of model quality: residual plot and residuals distribution (link to the code for generating the image – image by author)
Side branch 3. A quick reminder about frequency distributions
If your stats course has faded from memory, or you never took one, this part is worth a closer look. This section introduces the most common ways to visualize samples in mathematical statistics. After it, interpreting the plots used later in the article should be straightforward.
Frequency distribution is an ordered representation showing how many times the values of a random variable fall within certain intervals.
To build one:
- Split the full range of values into k bins (class intervals)
- Count how many observations fall into each bin – this is the absolute frequency
- Divide the absolute frequency by the sample size n to get the relative frequency
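The three steps above can be sketched directly in Python (toy data; the equal-width bin edges are an assumption, with the maximum value folded into the last bin):

```python
sample = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7]
k = 3                                        # number of bins (class intervals)
lo, hi = min(sample), max(sample)
width = (hi - lo) / k

abs_freq = [0] * k
for v in sample:
    idx = min(int((v - lo) / width), k - 1)  # the maximum goes into the last bin
    abs_freq[idx] += 1                       # step 2: absolute frequency

rel_freq = [f / len(sample) for f in abs_freq]  # step 3: relative frequency
print(abs_freq)  # -> [3, 4, 3]
print(rel_freq)  # -> [0.3, 0.4, 0.3]
```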
In the figure below, the same steps are shown for the variable V:
Extra Figure 3. Visualization of frequency distribution V as a histogram: how to calculate (link to the code for generating the image – image by author)
The same kind of visualization can be built for variable U as well, but in this section the focus stays on V for simplicity. Later on, the histogram will be rotated sideways to make it easier to compare the raw data with the vertical layout commonly used for distribution plots.
From the algorithm description and from the figure above, one important drawback becomes clear: the number of bins k (and therefore the bin width) has a major impact on how the distribution looks.
Extra Figure 4. Frequency distribution visualizations using different numbers of bins k: 5, 10, and 20. The vertical axis is intentionally left unlabeled to avoid the temptation to interpret point positions along the y-axis, those values can be arbitrary and do not affect the distribution of V (link to the code for generating the image – image by author)
There are empirical formulas that help choose a reasonable number of bins based on the sample size. Two common examples are Sturges’ rule and the Rice rule (see Extra Figure 5 below) [Sturges. The Choice of a Class Interval. 1926. DOI: 10.1080/01621459.1926.10502161], [Lane, David M., et. al. Histograms. https://onlinestatbook.com/2/graphing_distributions/histograms.html].
Extra Figure 5. Rules for choosing the number of bins in histograms: Sturges’ and Rice formulas (link to the code for generating the image – image by author)
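Both rules are easy to compute; the sketch below uses ceilings for rounding (exact rounding conventions vary between sources):

```python
import math

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up."""
    return 1 + math.ceil(math.log2(n))

def rice_bins(n):
    """Rice rule: k = 2 * n^(1/3), rounded up."""
    return math.ceil(2 * n ** (1 / 3))

for n in (30, 500):  # the two sample sizes used later in this side branch
    print(n, sturges_bins(n), rice_bins(n))
```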
An alternative is to visualize the distribution using kernel density estimation (KDE). KDE is a smoothed version of a histogram: instead of rectangular bars, it uses a continuous curve built by summing many smooth “kernel” functions, usually normal distributions (Extra Figure 6).
Extra Figure 6. Kernel density estimation (KDE) for variable V (link to the code for generating the image – image by author)
I understand that describing KDE as a sum of “tiny normal distributions” isn’t very intuitive. Here’s a better mental picture. Imagine that each data point is filled with a large number of tiny grains of sand. If you let the sand fall under gravity, it forms a little pile directly beneath that point. When several points are close to each other, their sand piles overlap and build a larger mound. Watch the animation below to see how it works:
Extra Animation 1. Kernel density estimation as falling sand forming a mound (link to the code for generating the animation – animation by author)
In a KDE plot, those “sand piles” are typically modeled as small normal (Gaussian) distributions placed around each data point.
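A bare-bones version of this idea in Python – the density at any point is the average of Gaussian “piles” centered at each observation (the data and bandwidth below are made up):

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth=1.0):
    """Kernel density estimate at point x."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [1.0, 1.5, 2.0, 6.0]
# The "mound" over the cluster at 1-2 is taller than the lone pile at 6.
print(kde(1.5, data) > kde(6.0, data))  # -> True
```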
Another widely used way to summarize a distribution is a box plot. A box plot describes the distribution in terms of quartiles. It shows:
- The median (second quartile, Q2);
- The first (Q1) and third (Q3) quartiles (the 25th and 75th percentiles), which form the edges of the “box”;
- The whiskers, which mark the range of the data excluding outliers;
- Individual points, which represent outliers.
Extra Figure 7. Frequency distribution visualization of V variable. Boxplot (link to the code for generating the image – image by author)
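The numbers behind a box plot can be computed like this (the 1.5 · IQR whisker rule below is a common convention, though libraries differ):

```python
import statistics

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

whisker_low  = min(v for v in sample if v >= low_fence)   # whisker ends are the
whisker_high = max(v for v in sample if v <= high_fence)  # extremes inside the fences
outliers = [v for v in sample if v < low_fence or v > high_fence]

print(q1, q2, q3)  # -> 3.25 5.5 7.75
print(whisker_low, whisker_high, outliers)  # -> 1 9 [30]
```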
To sum up this side branch, the next step is to visualize samples of different sizes and shapes using all the methods discussed above. This will be done by drawing samples from two theoretical distributions, with two sample sizes for each: 30 and 500 observations.
Extra Figure 8. Visualizing samples in different ways (histograms, kernel density estimation, and boxplots) for two theoretical distributions: a normal distribution and a bimodal distribution (a mixture of two Gaussians) (link to the code for generating the image – image by author)
A frequency distribution is a key tool for describing and understanding the behavior of a random variable based on a sample. Visual methods like histograms, kernel density curves, and box plots complement each other and help build a clear picture of the distribution: its symmetry, where the mass is concentrated, how spread out it is, and whether it contains outliers.
This way of looking at the data is also useful because it has a natural probabilistic interpretation: the most likely values fall in the region where the probability density is highest, i.e., where the KDE curve reaches its peak.
As noted above, the residual distribution should look roughly normal. That’s why it makes sense to compare two distributions: theoretical normal vs. the residuals we actually observe. Two convenient tools for this are density plots and Q-Q plots with residual quantiles vs. normal quantiles. The parameters of the normal distribution are estimated from the residual sample. Since these plots work best with larger samples, for illustration I’ll artificially increase each residual set to 500 values while preserving the key behavior of the residuals for each dataset (Figure 14).
Figure 14. Q-Q plot comparing a normal distribution with the model residuals (bottom row). For clarity, the residual samples for datasets A, B, and C were artificially expanded (link to the code for generating the image – image by author)
As Figure 14 shows, the residual distributions for datasets A* and B* are reasonably well approximated by a normal distribution. For B*, the tails drift a bit: large errors occur slightly more often than we would like. The bimodal case C* is much more striking: its residual distribution looks nothing like normal.
Heteroscedasticity in B* won’t show up in these plots, because they look at residuals on their own (one dimension) and ignore how the error changes across the range of predictions.
To sum up, a model is rarely perfect – it has errors. Error analysis with plots is a convenient way to diagnose the model:
- For pair regression, it is useful to plot predicted and observed values on the y-axis against the feature on the x-axis. This makes the relationship between the feature and the response easy to see;
- As an addition, plot observed values (x-axis) vs. predicted values (y-axis). The closer the points are to the diagonal line coming from the bottom-left corner, the better. This plot is also handy because it does not depend on how many features the model has;
- If the goal is to compare the full distributions of predictions and observations, rather than individual pairs, a Q-Q plot is a good choice;
- For very large samples, cognitive load can be reduced by grouping values into quantiles on the Q-Q plot, so the plot will have, for example, only 100 scatter points;
- A residual plot helps check whether the key linear regression assumptions hold for the current model (independence, normality of residuals, and homoscedasticity);
- For an easier comparison between the residual distribution and a theoretical normal distribution, use a Q-Q plot.
Metrics
Disclaimer regarding the designations X and Y
In the visualizations in this section, some notation may look a bit unusual compared to related literature. For example, predicted values are labeled X, while the observed response is labeled Y. This is intentional: even though the discussion is tied to model evaluation, I don’t want it to feel like the same ideas only apply to the “prediction vs. observation” pair. In practice, X and Y can be any two arrays – the right choice depends on the task.
There’s also a practical reason for choosing this pair: X and Y are visually distinct. In plots and animations, they are easier to tell apart than pairs like U and V, or the more familiar y and ŷ.
As compelling as visual diagnostics can be, model quality is best assessed together with metrics (numerical measures of performance). A good metric is appealing because it reduces cognitive load: instead of inspecting yet another set of plots, the evaluation collapses to a single number (Figure 15).
Figure 15. Why metrics matter: they let you judge model quality with a single number (sometimes a small set of numbers). The plot shows the Mean Absolute Percentage Error (MAPE) metric (link to the code for generating the image – image by author)
Unlike a residual plot, a metric is also a very convenient format for automated analysis, not just easy to interpret, but easy to plug into code. That makes metrics useful for numerical optimization, which we will get to a bit later.
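As an example of how easily a metric plugs into code, here is the MAPE from Figure 15, together with its symmetric variant SMAPE (definitions vary slightly between sources; this sketch uses the common percentage forms, and the arrays are made up):

```python
observed  = [20_000, 30_000, 40_000]   # actual prices
predicted = [22_000, 27_000, 40_000]   # model predictions

# Mean absolute percentage error, in percent of the observed values.
mape = 100 * sum(abs(o - p) / abs(o)
                 for o, p in zip(observed, predicted)) / len(observed)

# Symmetric MAPE: the error is scaled by the average magnitude of both values.
smape = 100 * sum(2 * abs(o - p) / (abs(o) + abs(p))
                  for o, p in zip(observed, predicted)) / len(observed)

print(round(mape, 2), round(smape, 2))
```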
This “Metrics” section also includes statistical tests: they help assess the significance of individual coefficients and of the model as a whole (we will cover that later as well).
Here is a non-exhaustive list:
- Coefficient of determination R2 – [Kvalseth, Tarald O. Cautionary Note about R². 1985. https://www.tandfonline.com/doi/abs/10.1080/00031305.1985.10479448];
- Bias;
- Mean absolute error – MAE;
- Root mean square error – RMSE;
- Mean absolute percentage error – MAPE;
- Symmetric mean absolute percentage error – SMAPE;
- The F-test for checking whether the model is significant as a whole;
- The t-test for checking the significance of individual features (their coefficients);
- Durbin-Watson test for analyzing residuals.
Figure 16 shows metrics computed by comparing the observed apartment prices with the predicted ones.
Figure 16. Model metrics for datasets A, B, and C. Note that in the three bottom subplots the y-axis is split by color for each bar group. Because of that, bar heights are only meaningful when comparing the same metric across datasets, for example R² for A, B, and C. They are not meant for comparing different metrics within a single dataset, such as MAE versus correlation coefficient (link to the code for generating the image – image by author)
The metrics are grouped for readability. The first group, shown in red, includes the correlation coefficient (between predicted and observed values) and the coefficient of determination, R². Both are dimensionless, and values closer to 1 are better. Note that correlation is not limited to predictions versus the target. It can also be computed between a feature and the target, or pairwise between features when there are many of them.
Animation 1. How to compute the correlation coefficient and the coefficient of determination (R²). Notation: X are the predicted values, Y are the observed values. Please zoom in on the animation to see how the values are inserted into the formulas (link to the code for generating the animation – animation by author)
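The two quantities from Animation 1 take only a few lines of NumPy. The arrays below are made-up predictions and observations, not data from the article:

```python
import numpy as np

# Made-up values: X are predictions, Y are observations
X = np.array([10500.0, 21000.0, 29500.0, 41000.0])
Y = np.array([10000.0, 22000.0, 30000.0, 40000.0])

# Pearson correlation coefficient between the two arrays
r = np.corrcoef(X, Y)[0, 1]

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((Y - X) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

Note that R² is computed from the residuals, so unlike the correlation coefficient it is not symmetric in X and Y.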
The second group, shown in green, includes metrics that measure error in the same units as the response, which here means $. For all three metrics, the interpretation is the same: the closer the value is to zero, the better (Animation 2).
Animation 2. How to compute bias, mean absolute error (MAE), and root mean squared error (RMSE). Notation: X are the predicted values, Y are the observed values (link to the code for generating the animation – animation by author)
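A minimal sketch of the three metrics from Animation 2, again on made-up arrays:

```python
import numpy as np

X = np.array([10500.0, 21000.0, 29500.0, 41000.0])  # predictions (made up)
Y = np.array([10000.0, 22000.0, 30000.0, 40000.0])  # observations (made up)

errors = X - Y
bias = errors.mean()                  # average signed error, in $
mae = np.abs(errors).mean()           # mean absolute error, in $
rmse = np.sqrt((errors ** 2).mean())  # root mean squared error, in $
print(bias, mae, rmse)                # note that RMSE >= MAE always holds
```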
One interesting detail: in Figure 16 the bias is zero in all cases. Literally, this means the model errors are not shifted in either direction on average. A question for you: why is this generally true for a linear regression model fitted to any dataset (try changing the input values and playing with different datasets)?
Animation 2 and Figure 16 also show that as the gap between X and Y grows, RMSE reacts more strongly to large errors than MAE. That happens because RMSE squares the errors.
The third group, shown in blue, includes error metrics measured in percentages. Lower values are better. MAPE is sensitive to errors when the true values are small, because the formula divides the prediction error by the observed value itself. When the actual value is small, even a modest absolute error turns into a large percentage and can strongly affect the final score (Figure 17).
Animation 3. How to compute mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (SMAPE). Notation: X are the predicted values, Y are the observed values (link to the code for generating the animation – animation by author)
Figure 17. How MAPE and SMAPE behave on two datasets where the target values are close to zero. The figure shows how the metrics change after adding 10 units to both the observed and predicted values in the second dataset (image by author)
Figure 17 shows that the difference measured in the original units, the absolute deviation between observed and predicted values, stays the same in both cases: it is 0 for the first pair, 8 for the second, and 47 for the third. For percentage-based metrics, the errors shrink for an obvious reason: the observed values become larger.
The change is larger for MAPE, because it normalizes each error by the observed value itself. SMAPE, in contrast, normalizes by the average magnitude of the observed and predicted values. This difference matters most when the observations are close to zero, and it fades as values move away from zero, which is exactly what the figure shows.
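Both formulas are easy to check by hand. Here is a sketch using the first dataset from Figure 17 (observations [1, 2, 3], predictions [1, 10, 50]):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])    # observations from Figure 17
y_pred = np.array([1.0, 10.0, 50.0])  # predictions from Figure 17

# MAPE normalizes each error by the observed value
mape = np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100

# SMAPE normalizes by the average magnitude of observed and predicted values
smape = np.mean(
    np.abs(y_pred - y_true) / ((np.abs(y_pred) + np.abs(y_true)) / 2)
) * 100
```

With these near-zero observations, MAPE exceeds 600% while SMAPE stays close to 100%, which illustrates just how differently the two normalizations behave.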
Side branch 4. Features of MAPE and SMAPE calculations
The details of metric calculations are important to discuss. Using MAPE and SMAPE (and briefly MAE) as examples, this section shows how differently metrics can behave across datasets. The main takeaway is simple: before starting any machine learning project, think carefully about which metric, or metrics, you should use to measure quality. Not every metric is a good fit for your specific task or data.
Here is a small experiment. Using the data from Figure 17, take the original arrays: observations [1,2,3] and predictions [1,10,50]. Shift both arrays away from zero by adding 10 to every value, and repeat this for 10 iterations. At each step, compute three metrics: MAPE, SMAPE, and MAE. The results are shown in the plot below:
Extra Figure 9. Exploring MAPE’s asymmetry. Values of MAPE and SMAPE (left axis) and MAE (right axis) as the observed and predicted values are shifted farther away from zero. The absolute deviation between observations and predictions stays the same at every shift in this experiment (link to the code for generating the image – image by author)
As the figure above shows, the larger the values in the dataset, the smaller the difference between MAPE and SMAPE, and the smaller the percentage errors overall. MAPE and SMAPE converge because the asymmetry of MAPE, which is most pronounced at small observed values, gradually loses its effect as the values grow. MAE remains unchanged, as expected.
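The shift experiment itself is simple to reproduce. A sketch:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])    # observations from Figure 17
y_pred = np.array([1.0, 10.0, 50.0])  # predictions from Figure 17

for step in range(10):
    shift = 10 * step
    yt, yp = y_true + shift, y_pred + shift
    err = np.abs(yp - yt)  # absolute deviations: unchanged by the shift
    mape = np.mean(err / np.abs(yt)) * 100
    smape = np.mean(err / ((np.abs(yp) + np.abs(yt)) / 2)) * 100
    mae = err.mean()
    print(f"shift={shift:2d}  MAPE={mape:7.2f}%  SMAPE={smape:6.2f}%  MAE={mae:.2f}")
```

Running this prints ten rows in which MAE stays constant while MAPE and SMAPE shrink and converge toward each other.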
Now the reason for the word “asymmetry” becomes clear. The simplest way to show it is with an example. Suppose the model predicts 110 when the true value is 100. In that case, MAPE is 10%. Now swap them: the true value is 110, but the prediction is 100. The absolute error is still 10, yet MAPE becomes 9.1%. MAPE is asymmetric because the same absolute deviation is treated differently depending on whether the prediction is above the true value or below it.
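This swap takes two lines to verify:

```python
# Same absolute error of 10, but MAPE depends on the direction of the error
over = abs(110 - 100) / 100 * 100   # predicted 110, observed 100
under = abs(100 - 110) / 110 * 100  # predicted 100, observed 110
print(over, round(under, 1))        # → 10.0 9.1
```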
Another drawback of MAPE is that it cannot be computed when some target values are zero. A common workaround is to replace zeros with a very small number during evaluation, for example 0.000001. Still, it is clear that this can inflate MAPE.
Other metrics have their own quirks as well. For example, RMSE is more sensitive to large errors than MAE. This section is not meant to cover every such detail. The main point is simple: choose metrics thoughtfully. Use metrics recommended in your domain, and if there are no clear standards, start with the most common ones and experiment.
To summarize, the units of measurement for metrics and the ranges of possible values are compiled in Table 2.
| Metric | Units | Range | Meaning |
|---|---|---|---|
| Pearson correlation coefficient (predictions vs. target) | Dimensionless | from −1 to 1 | The closer to 1, the better the model |
| Coefficient of determination R² | Dimensionless | from −∞ to 1 | The closer to 1, the better the model |
| Bias | The same unit as the target variable | from −∞ to ∞ | The closer to zero, the better the model |
| Mean absolute error (MAE) | The same unit as the target variable | from 0 to ∞ | The closer to zero, the better the model |
| Root mean square error (RMSE) | The same unit as the target variable | from 0 to ∞ | The closer to zero, the better the model |
| Mean absolute percentage error (MAPE) | Percentage (%) | from 0 to ∞ | The closer to zero, the better the model |
| Symmetric mean absolute percentage error (SMAPE) | Percentage (%) | from 0 to 200 | The closer to zero, the better the model |

Table 2. Some regression metrics
As mentioned earlier, this is not a complete list of metrics. Some tasks may require more specialized ones. If needed, quick reference info is always easy to get from your favorite LLM.
Here is a quick checkpoint. Model evaluation started with a table of predicted and observed values (Table 1). Large tables are hard to inspect, so the same information was made easier to digest with plots, moving to visual evaluation (Figures 9-14). The task was then simplified further: instead of relying on expert judgment from plots, metrics were computed (Figures 15-17 and Animations 1-3). There is still a catch. Even after getting one or several numbers, it is still up to us to decide whether the metric value is “good” or not. In Figure 15, a 5% threshold was used for MAPE. That heuristic cannot be applied to every linear regression task. Data varies, business goals are different, etc. For one dataset, a good model might mean an error below 7.5%. For another, the acceptable threshold might be 11.2%.
F test
That is why we now turn to statistics and formal hypothesis testing. A statistical test can, in principle, save us from having to decide exactly where to place the metric threshold (with one important caveat) and give us a binary answer: yes or no.
If you have never come across statistical tests before, it makes sense to start with a simplified definition. A statistical test is a way to check whether what we observe is just random variation or a real pattern. You can think of it as a black box that takes in data and, using a set of formulas, produces an answer: a few intermediate values, such as a test statistic and a p-value, and a final verdict (Figure 18) [Sureiman, Onchiri, et al. F-test of overall significance in regression analysis simplified. 2020. https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108].
Figure 18. Statistical hypothesis test. The figure shows an example of calculating the F test for overall model significance. The input data are highlighted in orange, and the values produced by the test calculations are highlighted in yellow (link to the code for doing the computations – image by author)
As Figure 18 shows, before running a test, we need to choose a threshold value. Yes, this is the right moment to come back to that caveat: here too, we have to deal with a threshold. But in this case it is much easier, because there are widely accepted standard values to choose from. This threshold is called the significance level. A value of 0.05 means that we accept a 5% chance of incorrectly rejecting the null hypothesis. In this case, the null hypothesis could be something like: the model is no better than a naive prediction based on the mean. We can vary this threshold. For example, some scientific fields use 0.01 or even 0.001, which is more strict, while others use 0.10, which is less strict.
If the practical meaning of the significance level is not fully clear at this point, that is completely fine. There is a more detailed explanation at the end of this section. For now, it is enough to fix one key point: the statistical tests discussed below have a parameter, α, which we as researchers or engineers choose based on the task. In our case, it is set to 0.05.
So, a statistical test lets us take the data and a few chosen parameters, then compute test quantities that are used for comparison, for example, whether the test statistic is above or below a threshold. Based on that comparison, we decide whether the model is statistically significant. I would not recommend reinventing the wheel here: it is better to compute these tests with well-tested statistical packages, which is one reason why I am not giving the formulas in this section. As for what exactly to compare, the two common options are the F statistic against the critical F value, or the p-value against the significance level. Personally, mostly out of habit, I lean toward the second option.
We can use the F test to answer the question, “Is the model significant?” Since statistics is a mathematical discipline, let us first describe the two possible interpretations of the fitted model in a formal way. The statistical test will help us decide which of these hypotheses is more plausible.
We can formulate the null hypothesis (H₀) as follows: all coefficients for the independent variables, that is, the features, are equal to zero. The model does not explain the relationship between the features and the target variable any better than simply using the (target) mean value.
The alternative hypothesis (H₁) is then: at least one coefficient is not equal to zero. In that case, the model is significant because it explains some part of the variation in the target variable.
Now let us run the tests on our three datasets, A, B, and C (Figure 19).
Figure 19. F test for overall model significance. The figure shows the test results obtained with the Python package statsmodels at a significance level of 0.05. Here, x is the model feature – the number of rooms (link to the code for generating the image – image by author)
As we can see from Figure 19, in all three cases the p-value is below 0.05, our chosen significance level. We use 0.05 because it is the standard default threshold, and in the case of apartment price prediction, choosing the wrong hypothesis is not as critical as it would be, for example, in a medical setting, so there is no strong reason to make the threshold stricter here. A p-value below 0.05 means we reject the null hypothesis, H₀, for models A, B, and C. After this check, we can say that all three models are statistically significant overall: at least one feature contributes to explaining variation in the target.
However, the example of dataset C shows that confirming the model is significantly better than the average price does not necessarily mean that the model is actually good. The F test only checks for minimal adequacy.
One limitation of this approach to model evaluation is that it is quite narrow in scope. The F test is a parametric test designed specifically for linear models, so unlike metrics such as MAPE or MAE, it cannot be applied to something like a random forest (another machine learning algorithm). Even for linear models, this statistical test also requires standard assumptions to be met (see Assumptions 2-4 above: independence of observations, normality of residuals, and homoscedasticity).
Still, if this topic interests you, there is plenty more to explore on your own. For example, you could look into the t test for individual features, where the hypothesis is tested separately for each model coefficient, or the Durbin-Watson test. Or you can choose any other statistical test to study further. Here we only covered the basic idea. P.S. It is especially worth paying attention to how the test statistics are calculated and to the mathematical intuition behind them.
Side branch 5. If you are not entirely clear about the significance level α, please read this section
Every time I tried to understand what significance level meant, I ran into a brick wall. More complex examples involved calculations that I didn’t understand. Simpler sources conveyed the concept more clearly – “here’s an example where everything is intuitively understandable”:
- H₀ (null hypothesis): The patient does not have cancer;
- Type I error: The test says “cancer is present” when it is not actually there;
- If the significance level α is set at 0.05, then in 5% of cases the test may mistakenly alarm a healthy person by informing them that they have cancer;
- Therefore, in medicine, a low α\alpha (e.g., 0.01) is often chosen to minimize false alarms.
But here we have data and model coefficients – everything is fixed. We apply the F-test and get a p-value < 0.05. We can run this test 100 times, and the result will be the same, because the model is the same and the coefficients are the same. There we go – 100 times we get confirmation that the model is significant. And what is the 5 percent threshold here? Where does this “probability” come from?
Let us break this down together. Start with the phrase, “The model is significant at the 0.05 level”. Despite how it sounds, this phrase is not really about the model itself. It is really a statement about how convincing the observed relationship is in the data we used. In other words, imagine that we repeatedly collect data from the real world, fit a model, then collect a new sample and fit another one, and keep doing this many times. In some of those cases, we will still find a statistically significant relationship even if, in reality, no real relationship exists between the variables. The significance level helps us account for that.
To sum up, with a p-value threshold of 0.05, even if no real relationship exists, the test will still say “there is a relationship” in about 5 out of 100 cases, simply because of random variation in the data.
To make the text a bit less dense, let me illustrate this with an animation. We will generate 100 random points, then repeatedly draw datasets of 30 observations from that pool and fit a linear regression model to each one. We will repeat this sampling process 20 times. With a significance level of 5%, this means we allow for about 1 case out of 20 in which the F test says the model is significant even though, in reality, there is no relationship between the variables.
Extra animation 2. Understanding the meaning of the significance level when testing linear regression models. The population was generated at random. The results are shown for a significance level of 0.05 (link to the code for generating the animation – animation by author)
Indeed, in 1 out of 20 cases where there was actually no relationship between x and y, the test still produced a p-value below 0.05. If we had chosen a stricter significance level, for example 0.01, we would have avoided a Type I error, that is, a case where we reject H₀ (there is no relationship between x and y) and accept the alternative hypothesis even though H₀ is in fact true.
For comparison, we will now generate a population where a clear linear relationship is present and repeat the same experiment: 20 samples and the same 20 attempts to fit a linear regression model.
Extra animation 3. Understanding the meaning of the significance level when testing linear regression models. The population contains a linear relationship. The results are shown for a significance level of 0.05 (link to the code for generating the animation – animation by author)
To wrap up this overview chapter on regression metrics and the F test, here are the main takeaways:
- Visual methods are not the only way to assess prediction error. We can also use metrics. Their main advantage is that they summarize model quality in a single number, which makes it easier to judge whether the model is good enough or not.
- Metrics are also used during model optimization, so it is important to understand their properties. For example:
- The metrics from the “green group” (RMSE, MAE, and bias) are convenient because they are expressed in the original units of the target.
- The root mean squared error (RMSE) reacts more strongly to large errors and outliers than the mean absolute error (MAE).
- The “blue group” (MAPE and SMAPE) is expressed in percent, which often makes these metrics convenient to discuss in a business context. At the same time, when the target values are close to zero, these metrics can become unstable and produce misleading estimates.
- Statistical tests provide an even more compact assessment of model quality, giving an answer in the form of “yes or no”. However, as we saw above, such a test only checks basic adequacy, where the main alternative to the fitted regression model is simply predicting the mean. It does not help in more complex cases, such as dataset C, where the relationship between the feature and the target is captured by the model well enough to rise above statistical noise, but not fully.
Later in the article, we will use different metrics throughout the visualizations, so that you get used to looking beyond just one favorite from the list 🙂
Forecast uncertainty. Prediction interval
An interesting combination of visual assessment and formal metrics is the prediction interval. A prediction interval is a range of values within which a new observation is expected to fall with a given probability. It helps show the uncertainty of the prediction by combining statistical measures with the clarity of a visual representation (Figure 20).
Figure 20. Point estimate and prediction interval (image by author)
The main question here is how to choose these threshold values, Δ. The most natural approach, and the one that is actually used in practice, is to extract information about uncertainty from the cases where the model already made errors during training, namely from the residuals. But to turn a raw set of differences into actual threshold values, we need to go one level deeper and look at linear regression as a probabilistic model.
Recall how point prediction works. We plug the feature values into the model, in the case of simple linear regression, just one feature, and compute the prediction. But a prediction is rarely exact. In most cases, there is a random error.
When we set up a linear regression model, we assume that small errors are more likely than large ones, and that errors in either direction are equally likely. These two assumptions lead to the probabilistic view of linear regression, where the model coefficients and the error distribution are treated as two parts of the same whole (Figure 21) [Fisher, R. A. On the Mathematical Foundations of Theoretical Statistics. 1922. https://doi.org/10.1098/rsta.1922.0009].
Figure 21. Maximum likelihood as a way to estimate the coefficients of a linear regression model, illustrated with a simplified model that includes only the intercept (link to the code for generating the image – image by author)
As Figure 21 shows, the variability of the model errors can be estimated by calculating the standard deviation of the errors, denoted by σ. We could also talk about the error variance here, since it is another suitable measure of variability. The standard deviation σ is simply the square root of the variance. The larger the standard deviation, the greater the uncertainty of the prediction (see Section 2 in Figure 21).
This leads us to the next step in the logic: the more widely the errors are spread, the less certain the model is, and the wider the prediction interval becomes. Overall, the width of the prediction interval depends on three main factors:
- Noise in the data: the more noise there is, the greater the uncertainty;
- Sample size: the more data the model has seen during training, the more reliably its coefficients are estimated, and the narrower the interval becomes;
- Distance from the center of the data: the farther the new feature value is from the mean, the higher the uncertainty.
In simplified form, the procedure for building a prediction interval looks like this:
- We fit the model (using the formula from the previous section, Figure 6)
- We compute the error component, that is, the residuals
- From the residuals, we estimate the typical size of the error, the residual standard deviation s
- Obtain the point prediction
- Next, we scale s using several adjustment factors: how much training data the model was fitted on, how far the feature value is from the center of the data, and the selected confidence level. The confidence level controls how likely the interval is to contain the value of interest. We choose it based on the task, much as we earlier chose the significance level for statistical testing (0.95 is a common default).
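Putting these steps together, here is a minimal sketch for simple linear regression, using the standard textbook interval formula and made-up data:

```python
import numpy as np
from scipy import stats

# Made-up training data: rooms (x) vs price (y)
x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y = 10000 * x + np.array([500, -300, 800, -900, 400, -200, 700, -600, 300, -100.0])

# Fit the model: slope b1 and intercept b0
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual standard error: the "typical size" of the model's error
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

def prediction_interval(x0, confidence=0.95):
    """Prediction interval for a new observation at feature value x0."""
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)
    # Adjustment factors: sample size and distance from the data center
    half = t * s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    y0 = b0 + b1 * x0
    return y0 - half, y0 + half
```

Calling `prediction_interval` for values far from the mean of x, or with a higher confidence level, produces wider intervals, exactly as the three factors above predict.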
As a simple example, we will generate a dataset of 30 observations with a “perfect” linear relationship between the feature and the target, fit a model, and compute the prediction interval. Then we will 1) add noise to the data, 2) increase the sample size, and 3) raise the confidence level from 90% to 95% and then 99%, where the prediction interval reaches its maximum width (see Animation 4).
Animation 4. Prediction interval and how it depends on the properties of the data and the confidence level (link to the code for generating the animation – animation by author)
And consider separately what the prediction interval looks like for datasets A, B, and C (Figure 22).
Figure 22. Prediction intervals at different confidence levels for models fitted to datasets A, B, and C (link to the code for generating the image – image by author)
Figure 22 clearly shows that even though models A and B have the same coefficients, their prediction intervals differ in width, with the interval for dataset B being much wider. In absolute terms, the widest prediction interval, as expected, is produced by the model fitted to dataset C.
Train test split and metrics
All of the quality assessments discussed so far focused on how the model behaves on the same observations it was trained on. In practice, however, we want to know whether the model will also perform well on new data it has not seen before.
That is why, in machine learning, it is standard best practice to split the original dataset into parts. The model is fitted on one part, the training set, and its ability to generalize is evaluated on the other part, the test set (Figure 23).
Figure 23. Splitting a dataset into training and test sets. In many cases, the split should be done at random rather than, for example, by taking the first 70% of the dataset for training and the remaining 30% for testing, because the data may be ordered in the raw dataset (image by author)
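In Python, such a random split is commonly done with scikit-learn. The arrays below are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix and target: 45 rows, as in the article's small world
X = np.arange(45, dtype=float).reshape(-1, 1)
y = 10000 * X.ravel()

# Random 70/30 split; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```

Fixing `random_state` matters for reproducibility, but as the next animation shows, different seeds produce different splits, and therefore different fitted coefficients.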
If we combine these model diagnostic methods into one large visualization, this is what we get:
Figure 24. Model evaluation on the training and test sets, with residual plots and metrics (warning: this figure is information-dense, so it is best read gradually). The prediction intervals are shown at the 95% confidence level and were computed from the training set (link to the code for generating the image – image by author)
Figure 24 shows that the metric values are worse on the test data, which is exactly what we would expect, since the model coefficients were optimized on the training set. A few more observations stand out:
- First, the bias metric has finally become informative: on the test data it is no longer zero, as it was on the training data, and now shifts in both directions, upward for datasets A and B, and downward for dataset C.
- Second, dataset complexity clearly matters here. Dataset A is the easiest case for a linear model, dataset B is more difficult, and dataset C is the most difficult. As we move from training to test data, the changes in the metrics become more noticeable. The residuals also become more spread out in the plots.
In this section, it is important to point out that the way we split the data into training and test sets can affect what our model looks like (Animation 5).
Animation 5. Same data, different coefficients. A visualization of how different train-test splits affect the linear regression coefficients and metrics for dataset B. Split ratio: 60% training, 40% test. Here, x is the model feature, namely the number of rooms (link to the code for generating the animation – animation by author)
The choice of splitting strategy depends on the task and on the nature of the data. In some cases, the subsets should not be formed at random. Here are a few situations where that makes sense:
- Geographic or spatial dependence. When the data have a spatial component, for example temperature measurements, air pollution levels, or crop yields from different fields, nearby observations are often strongly correlated. In such cases, it makes sense to build the test set from geographically separated regions in order to avoid overestimating model performance.
- Scenario-based testing. In some business problems, it is important to evaluate in advance how the model will behave in certain critical or rare situations, for example at high or extreme feature values. Such cases can be intentionally included in the test set, even if they are absent or underrepresented in the training sample.
Imagine that there are only 45 apartments in the world…
To make the rest of the discussion easier to follow, let us introduce one important simplification for this article. Imagine that our hypothetical world, the one in which we build these models, is very small and contains only 45 apartments. In that case, all our earlier attempts to fit models on datasets A, B, and C were really just individual steps toward recovering that original relationship from all the available data.
From this point of view, A, B, and C are not really separate datasets, even though we can imagine them as data collected in three different cities, A, B, and C. Instead, they are parts of a larger population, D. Let us assume that we can combine these samples and work with them as a single whole (Figure 25).
Figure 25. Combining datasets A, B, and C into one larger dataset D. Let us assume this is all the data we have (link to the code for generating the image – image by author)
It is important to keep in mind that everything we do, splitting the data into training and test sets, preprocessing the data, calculating metrics, running statistical tests, and everything else, serves one goal: to make sure the final model describes the full population well. The goal of statistics, and this is true for supervised machine learning as well, is to draw conclusions about the whole population using only a sample.
In other words, if we somehow built a model that predicted the prices of these 45 apartments perfectly, we would have a tool that always gives the correct answer, because in this hypothetical world there are no other data on which the model could fail. Again, everything here depends on that “if.” Now let me return us to reality and try to describe all the data with a single linear regression model (Figure 26).
Figure 26. A model fitted to all available data, the “reference model.” The metric values shown in the figure will be treated as a reference point that we will aim for later in the article (link to the code for generating the image – image by author)
In the real world, collecting data on every apartment is physically impossible, because it would take too much time, money, and effort, so we always work with only a subset. The same applies here: we collected samples and tried to estimate the relationship between the variables in a way that would bring us as close as possible to the relationship in the population, the entire dataset D.
One very important note: Later in the article, we will occasionally take advantage of the rules of our simplified world and peek at how the fitted model behaves on the full population. This will help us understand whether our modifications were successful, when the error metric goes down, or not, when the error metric goes up. At the same time, please keep in mind that this is not something we can do in the real world. In practice, it is impossible to evaluate a model on every single object!
Improving model quality
In the previous section, before we combined our data into one full population, we measured the model’s prediction error and found the results unsatisfying. In other words, we want to improve the model. Broadly speaking, there are three ways to do that: change the data, change the model, or change both. More specifically, the options are:
- Expanding the sample: increasing the number of observations in the dataset
- Reducing the sample: removing outliers and other undesirable rows from the data table
- Making the model more complex: adding new features, either directly observed or newly engineered
- Making the model simpler: reducing the number of features (sometimes this also improves the metrics)
- Tuning the model: searching for the best hyperparameters, meaning parameters that are not learned during training
We will go through these approaches one by one, starting with sample expansion. To illustrate the idea, we will run an experiment.
Expanding the sample
Keep in mind that the values from the full population are not directly available to us, and we can only access them in parts. In this experiment, we will randomly draw samples of 10 and 20 apartments. For each sample size, we will repeat the experiment 30 times. The metrics will be measured on 1) the training set, 2) the test set, and 3) the population, that is, all 45 observations. This should help us see whether larger samples lead to a more reliable model for the full population (Animation 6).
Animation 6. Analyzing the relationship between sample size and the metrics calculated on the full population. The animation shows the first five out of 30 runs for each sampling strategy, with samples of 10 and 20 observations (link to the code for generating the animation – animation by author)
Increasing the sample size is a good idea if only because statistical methods tend to work better with more data. As a result, the metrics become more stable, and statistical tests become more reliable as well (Figure 27).
Figure 27. Results of the sample size experiment: as the number of observations in the sample increases, the training and test metrics get closer to the values the model shows on the full population. Model quality improves as well (link to the code for generating the image – image by author)
If boxplots are more familiar to you, take a look at Boxplot version of Figure 27.
Figure 27 in the form of a boxplot
Extra figure 10. Boxplots. Results of the sample size experiment: as the number of observations in the sample increases, the training and test metrics get closer to the values the model shows on the full population. Model quality improves as well (link to the code for generating the image – image by author)
Even though we worked here with very small samples, partly for visual convenience, Animation 6 and Figure 27 still let us draw a few conclusions that also hold for larger datasets. In particular:
- The average RMSE on the population is lower when the sample size is 20 rather than 10, specifically 4088 versus 4419. This means that a model fitted on more data has a lower error on the population (all available data).
- The metric estimates are more stable for larger samples. With 20 observations, the gap between RMSE on the training set, the test set, and the population is smaller.
As we can see, using larger samples, 20 observations rather than 10, led to better metric values on the population. The same principle applies in practice: after making changes to the data or to the model, always check the metrics. If the change improves the metric, keep it. If it makes the metric worse, roll it back. Rely on an engineering mindset, not on luck. Of course, in the real world we cannot measure metrics on the full population. But metrics on the training and test sets can still help us choose the right direction.
Reducing the sample by filtering outliers
Since this section is about pruning the sample, I will leave out the train-test split so the visualizations stay easier to read. Another reason is that linear models are highly sensitive to filtering when the sample is small, and here we are deliberately using small samples for clarity. So in this section, each model will be fitted on all observations in the sample.
We tried to collect more data for model fitting. But now imagine that we were unlucky: even with a sample of 20 observations, we still failed to obtain a model that looks close to the reference one (Figure 28).
Figure 28. An “unlucky” sample extraction from the population. The reference model is shown as a black line (link to the code for generating the image – image by author)
Besides a sample that does not reflect the underlying relationship well, other factors can make the task even harder. Such distortions are quite common in real data for many reasons: measurement inaccuracies, technical errors during data storage or transfer, and simple human mistakes. In our case, imagine that some of the real estate agents we asked for data made mistakes when entering information manually from paper records: they typed 3 instead of 4, or added or removed zeros (Figure 29).
Figure 29. Some samples contain corrupted data (link to the code for generating the image – image by author)
If we fit a model to this raw data, the result will be far from the reference model, and once again we will be unhappy with the modeling quality.
This time, we will try to solve the problem by removing a few observations that are much less similar to the rest, in other words, outliers. There are many methods for this, but most of them rely on the same basic idea: separating similar observations from unusual ones using some threshold (Figure 30) [Mandic-Rajcevic, et al. Methods for the Identification of Outliers and Their Influence on Exposure Assessment in Agricultural Pesticide Applicators: A Proposed Approach and Validation Using Biological Monitoring. 2019. https://doi.org/10.3390/toxics7030037]:
- Interquartile range (IQR), a nonparametric method
- Three-sigma rule, a parametric method, since it assumes a distribution, most often a normal one
- Z-score, a parametric method
- Modified Z-score (based on the median), a parametric method
Parametric methods rely on an assumption about the shape of the data distribution, most often a normal one. Nonparametric methods do not require such assumptions and work more flexibly, mainly using the ordering of values or quantiles. As a result, parametric methods can be more effective when their assumptions are correct, while nonparametric methods are usually more robust when the distribution is unknown.
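As a quick sketch, the two most common rules from the list above can be implemented in a few lines of numpy. The toy array below is made up for illustration:

```python
import numpy as np

def iqr_mask(y, k=1.5):
    """Nonparametric IQR rule: keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y >= q1 - k * iqr) & (y <= q3 + k * iqr)

def zscore_mask(y, threshold=2.0):
    """Parametric Z-score rule: keep values within `threshold` standard
    deviations of the mean (threshold 3.0 reproduces the three-sigma rule)."""
    z = (y - np.mean(y)) / np.std(y)
    return np.abs(z) <= threshold

y = np.array([10.0, 12, 11, 13, 12, 11, 95])  # 95 looks like an outlier
print(y[iqr_mask(y)])     # the IQR rule drops 95
print(y[zscore_mask(y)])  # so does the Z-score rule at threshold 2.0
```

A curious detail: with only seven points, the single extreme value inflates the standard deviation so much that the three-sigma threshold of 3.0 would actually keep 95 (its Z-score is only about 2.5). This masking effect is one reason the threshold has to be chosen with care.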
Figure 30. Outlier filtering as a way to detect unusual observations. Here we look at how one-dimensional filtering methods work, using only the target values, on synthetic data (link to the code for generating the image – image by author)
In one-dimensional methods (Figure 30), the features are not used. Only one variable is considered, namely the target y. That is why, among other things, these methods clearly do not take the trend in the data into account. Another limitation is that they require a threshold to be chosen, whether it is 1.5 in the interquartile range rule, 3 in the three-sigma rule, or a cutoff value for the Z-score.
Another important note is that three of the four outlier filtering methods shown here rely on an assumption about the shape of the target distribution. If the data are normally distributed, or at least have a single mode and are not strongly asymmetric, then the three-sigma rule, the Z-score method, and the modified Z-score method will usually give reasonable results. But if the distribution has a less usual shape, points flagged as outliers may not actually be outliers. Since in Figure 30 the distribution is fairly close to a normal bell shape, these standard methods are appropriate in this case.
One more interesting detail is that the three-sigma rule is really a special case of the Z-score method with a threshold of 3.0. The only difference is that it is expressed in the original y scale rather than in standardized units, that is, in Z-score space. You can see this in the plot by comparing the 2σ lines from the three-sigma method with the lines from the Z-score method at a threshold of 2.0.
If we apply all of the filtering methods described above to our data, we obtain the following fitted models (Figure 31).
Figure 31. Models fitted to data filtered with one-dimensional methods (link to the code for generating the image – image by author)
Looking at Figure 31, we can see that the worst model in terms of RMSE on the population is the one fitted on the data with outliers still included. The best RMSE is achieved by the model fitted on the data filtered using the Z-score method with a threshold of 1.5.
Figure 31 makes it fairly easy to compare how effective the different outlier filtering methods are. But that impression is misleading, because here we are checking the metrics against the full population D, which is not something we have access to in real model development.
So what should we do instead? Experiment. In some cases, the quickest and most practical option is to clean the test set and then measure the metric on it. In others, outlier removal can be treated as successful if the gap between the training and test errors becomes smaller. There is no single approach that works best in every case.
I suggest moving on to methods that use information from multiple variables. I will mention four of them, and we will look at the last two separately:
Figure 32. Outlier filtering as a way to detect unusual observations. Here we look at how multivariate filtering methods work (link to the code for generating the image – image by author)
Each method shown in Figure 32 deserves a separate discussion, since they are already much more advanced than the one-dimensional approaches. Here, however, I will limit myself to the visualizations and avoid going too deep into the details. We will treat these methods from a practical point of view and look at how their use affects the coefficients and metrics of a linear regression model (Figure 33).
Figure 33. Models fitted to data filtered with multivariate methods (link to the code for generating the image – image by author)
The methods shown in the visualizations above are not limited to linear regression. This kind of filtering can also be useful for other regression algorithms, and not only regression ones. That said, the most interesting methods to study separately are the ones that are specific to linear regression itself: leverage, Cook’s distance, and Random Sample Consensus (RANSAC).
Now let us look at leverage and Cook’s distance. Leverage is a quantity that shows how unusual an observation is along the x-axis, in other words, how far x_i is from the center of the data. If it is far away, the observation has high leverage. A good metaphor here is a seesaw: the farther you sit from the center, the more influence you have on the motion. Cook’s distance measures how much a point can change the model if we remove it. It depends on both leverage and the residual.
Animation 7. How leverage and Cook’s distance work. The formulas are shown for a single point, where p is the number of model parameters. After removing an observation, we measure the error of the new model. If the metric improves, we keep the new model; if not, we consider the alternative (link to the code for generating the animation – animation by author)
In the example above, the calculations are performed iteratively for clarity. In practice, however, libraries such as statsmodels compute Cook’s distance in closed form from the leverage and the residuals, so it can be obtained without actually refitting the model n times.
One important note: a large Cook’s distance does not always mean the data are bad. It may point to an important cluster instead. Blindly removing such observations can hurt the model’s ability to generalize, so validation is always a good idea.
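For the curious, both quantities can be computed in closed form with numpy. This is a sketch for simple linear regression with made-up numbers (statsmodels exposes the same diagnostics through its influence tools):

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Leverage and Cook's distance for simple linear regression,
    computed in closed form (no refitting of n models)."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])   # design matrix [1, x]
    p = X.shape[1]                         # number of model parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)                         # leverage of each observation
    resid = y - H @ y                      # residuals
    s2 = (resid ** 2).sum() / (n - p)      # residual variance estimate
    # Cook's distance: combines the residual and the leverage
    D = (resid ** 2 / (p * s2)) * h / (1 - h) ** 2
    return h, D

x = np.array([1.0, 2, 3, 4, 5, 10])   # 10 is far from the rest: high leverage
y = np.array([2.0, 4, 6, 8, 10, 40])  # and it also breaks the trend
h, D = leverage_and_cooks(x, y)
print(h.round(2), D.round(2))
```

The last point has by far the highest leverage and the highest Cook’s distance, which matches the seesaw intuition above.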
If you are looking for a more automated way to filter out values, that exists too. One good example is the RANSAC algorithm, which is a useful tool for automatic outlier removal (Animation 8). It works in six steps:
- Randomly select a subset of n observations.
- Fit a model to those n observations.
- Remove outliers, that is, exclude observations for which the model error exceeds a chosen threshold.
- Optional step: fit the model again on the remaining inliers and remove outliers one more time.
- Count the number of inliers, denoted by m.
- Repeat the first five steps several times, where we choose the number of iterations ourselves, and then select the model for which the number of inliers m is the largest.
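The six steps above can be sketched as a toy RANSAC for a straight line. The data and thresholds here are made up, and in practice you might reach for an existing implementation such as scikit-learn’s RANSACRegressor:

```python
import numpy as np

def ransac_line(x, y, n_subset=2, threshold=2.0, n_iter=50, seed=0):
    """Toy RANSAC for a line, following the six steps above."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(x), size=n_subset, replace=False)  # 1. random subset
        b1, b0 = np.polyfit(x[idx], y[idx], deg=1)              # 2. fit to subset
        inliers = np.abs(y - (b0 + b1 * x)) < threshold         # 3. error threshold
        if inliers.sum() > best_inliers.sum():                  # 5-6. keep best by m
            best_inliers = inliers
    # 4. (optional step) refit on the inliers of the best iteration
    b1, b0 = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
    return b0, b1, best_inliers

x = np.arange(10, dtype=float)
y = 2 * x + 1.0
y[3], y[7] = 40.0, -20.0            # two deliberately corrupted observations
b0, b1, inliers = ransac_line(x, y)
print(round(b0, 2), round(b1, 2), inliers.sum())
```

Because the clean points lie exactly on the line y = 2x + 1, the algorithm recovers the true coefficients and flags exactly the two corrupted points as outliers.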
Animation 8. How the RANSAC algorithm works (link to the code for generating the animation – animation by author)
The results of applying the RANSAC algorithm and the Cook’s distance method are shown in Figure 34.
Figure 34. Linear regression models fitted to data filtered using the RANSAC and Cook’s distance outlier detection methods. The RMSE of the reference model on the population is 3873 (link to the code for generating the image – image by author)
Based on the results shown in Figure 34, the most promising model in this comparison is the one fitted with RANSAC.
To sum up, we tried to collect more data, and then filtered out what looked unusual. It is worth noting that outliers are not necessarily “bad” or “wrong” values. They are simply observations that differ from the rest, and removing them from the training set is not the same as correcting data errors. Even so, excluding extreme observations can make the model more stable on the larger share of more typical data.
For clarity, in the next part of the article we will continue working with the original unfiltered sample. That way, we will be able to see how the model behaves on outliers under different transformations. Still, we now know what to do when we want to remove them.
Making the model more complex: multiple linear regression
As an alternative to the first two approaches to improving model quality, and also as a complement to them, we can introduce new features into the model.
Figure 35. Multiple linear regression (image by author)
Feature engineering. Generating new features
A good place to start transforming the feature space is with one of the simplest approaches to implement: generating new features from the ones we already have. This makes it possible to avoid changes to the data collection pipelines, which in turn makes the solution faster and often cheaper to implement (in contrast to collecting new features from scratch). One of the most common transformations is the polynomial one, where features are multiplied by each other and raised to a power. Since our current dataset has only one feature, this will look as follows (Figure 36).
Figure 36. Polynomial feature transformation of degree 2 (image by author)
Note that the resulting equation is now a polynomial regression model, which makes it possible to capture nonlinear relationships in the data. The higher the polynomial degree, the more degrees of freedom the model has (Figure 37).
Figure 37. Examples of polynomials fitted to the sample. At this point nonlinear relationships become possible to model (link to the code for generating the image – image by author)
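Under the hood, fitting a degree-2 polynomial is still just linear regression on an expanded feature matrix. A minimal sketch with synthetic data (scikit-learn’s PolynomialFeatures automates the expansion for many features):

```python
import numpy as np

# Degree-2 polynomial regression via an explicit feature transformation:
# the single feature x is expanded into [1, x, x^2], and the model stays
# linear in its coefficients.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=30)

X = np.column_stack([np.ones_like(x), x, x**2])   # expanded feature matrix
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares
print(coef.round(2))  # close to the true [1, 2, -0.5]
```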
There are many different transformations that can be applied to the original data. However, once we use them, the model is no longer linear in the original features (although it remains linear in its coefficients), which is already visible in the shape of the fitted curves in Figure 37. For that reason, I will not go into them in detail in this article. If this sparked your curiosity, you can read more about other feature transformations that can be applied to the original data. A good reference here is Trevor Hastie, Robert Tibshirani, Jerome Friedman – The Elements of Statistical Learning:
- Functional transformations
- Logarithms: log(x + ε)
- Reciprocals: 1/x, 1/(x + ε)
- Roots: √x, x^(1/3)
- Exponentials: exp(x), exp(−x)
- Trigonometric functions: sin(x), cos(x), tan(x), especially when a feature has periodic behavior
- Sigmoid: 1/(1 + exp(−x))
- Binarization and discretization
- Binning: split a feature X into intervals, for example, [x < 10], [10 ≤ x < 20], [x ≥ 20]
- Quantile binning: split the data into groups with equal numbers of observations
- Threshold functions (hello, neural networks)
- Splines
- Wavelet and Fourier transforms
- and many others
Collecting new features
If generating new features does not improve the metric, we can move to a “heavier” approach: collect more data, but this time not new observations, as we did earlier, but new characteristics, that is, new columns.
Suppose we have a chance to collect several additional candidate features. In the case of apartment prices, the following would make sense to consider:
- Apartment area, in square meters
- Distance to the nearest metro station, in meters
- City
- Whether the apartment has air conditioning
The updated dataset would then look as follows:
Figure 38. Dataset D with new features: apartment area, distance to the nearest metro station, city, and whether the apartment has air conditioning (image by author)
A note on visualization
Looking back at Figure 1, and at most of the figures earlier in the article, it is easy to see that a two-dimensional plot is no longer enough to capture all the features. So it is time to switch to new visualizations and look at the data from a different angle (Figure 39 and Animation 9).
Figure 39. Visualizing the relationships between several features and the target. The rows and columns correspond to features. Along the main diagonal, where each feature intersects with itself, the figure shows two-dimensional plots with the feature on the x-axis and the target on the y-axis. The upper triangle, above the main diagonal, contains 3D plots with two features on the x- and y-axes and the target on the z-axis. The lower triangle shows the same three-dimensional relationships in a different form, as contour maps where the axes correspond to features (link to the code for generating the image – image by author)
It is best to review the figure in detail (Figure 40).
Figure 40. Previous visualization (see Figure 39) of multidimensional data with annotations (image by author)
Animation 9. Three-dimensional scatter plots for two feature pairs: number of rooms & distance to the nearest metro station, and apartment area & air conditioning (link to the code for generating the animation – animation by author)
Animation 9 highlights two noticeable patterns in the dataset:
- The closer an apartment is to the metro, the higher its price tends to be. Apartments near metro stations also tend to have a smaller area (Observation 2 in Figure 40)
- Air conditioning is a feature that clearly separates the target, that is, apartment price: apartments with air conditioning tend to be more expensive (Observation 6 in Figure 40).
As the figures and animation show, a good visualization can reveal important patterns in the dataset long before we start fitting a model or looking at residual plots.
Side branch 6. Thinking back to Figure 5, why did the price decrease after all?
Let us go back to one of the first figures (Figure 5 and Figure 7) in the article, the one used to explain the idea of describing data with a straight line. It showed an example with three observations where the price went down even though the number of rooms increased. But everything becomes clear once we visualize the data with an additional feature:
Extra animation 4. Why apartment prices went down even as the number of rooms increased: the price drops not because there are more rooms, but because these apartments are much farther from the metro (link to the code for generating the animation – animation by author)
The reason for the price drop becomes much clearer here: even though the apartments were getting larger, they were also much farther from the metro station. Do not let the simplicity of this example fool you. It illustrates an important idea that is easy to lose sight of when working with truly large and complex data: we cannot see relationships between variables beyond the data we actually analyze. That is why conclusions should always be drawn with care. A new pattern may appear as soon as the dataset gains one more dimension.
As the number of features grows, it becomes harder to build pairwise visualizations like the ones shown in Figures 39 and 40. If your dataset contains many numerical features, a common choice is to use correlation matrices (Figure 41). I am sure you will come across them often if you continue exploring the data science / data analysis domain.
Figure 41. A matrix of numerical features with the corresponding correlation coefficients (link to the code for generating the image – image by author)
The same principle applies here as it did when evaluating model quality: it is cognitively easier for an engineer to interpret numbers, one for each pair, than to inspect a large set of subplots. Figure 41 shows that price is positively correlated with the features number of rooms and area, and negatively correlated with distance to the metro. This makes sense: in general, the closer an apartment is to the metro or the larger it is, the more expensive it tends to be.
It is also worth noting why the correlation coefficient is so often visualized. It is always useful to check whether the dataset contains predictors that are strongly correlated with each other, a phenomenon known as multicollinearity. That is exactly what we see for the pair number of rooms and area, where the correlation coefficient is equal to one. In cases like this, it often makes sense to remove one of the features, because it adds little useful information to the model while still consuming resources, for example during data preparation and model optimization. Multicollinearity can also lead to other unpleasant consequences, but we will talk about them a bit later.
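Correlation matrices like the one in Figure 41 are easy to compute with numpy. The sketch below uses made-up data in which area is an exact multiple of the number of rooms, mimicking the perfect correlation from the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
rooms = rng.integers(1, 6, size=30).astype(float)
area = 20 * rooms                                          # perfectly collinear
metro = 2_000 - 300 * rooms + rng.normal(0, 100, size=30)  # farther = fewer rooms
price = 10_000 + 4_000 * rooms + rng.normal(0, 1_000, size=30)

# Pairwise Pearson correlations between all variables
corr = np.corrcoef([rooms, area, metro, price])
print(corr.round(2))
```

The rooms/area entry is exactly 1.0, a textbook case of multicollinearity and the usual signal to drop one of the two features.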
On the importance of preprocessing (categorical) features
As Figure 39 shows, the table now contains not only clean numerical values such as the number of rooms, but also less tidy distances to the metro, and even non-numeric values such as city names or text answers to questions like whether the apartment has a certain feature (e.g. air conditioning).
And while distance to the metro is not a problem (it is just another numerical feature like the ones we used in the model earlier), city names cannot be fed into the model directly. Just try assigning a coefficient to an expression like this: apartment price = X * New York. You could joke that some “apartments” really might cost, say, two New Yorks, but that will not give you a useful model. That is why categorical features require special methods to convert them into numerical form.
Let us start with the simpler feature, air conditioning, since it takes only two values: yes or no. Features like this are usually encoded, that is, converted from text into numbers, using two values, for example (Figure 42):
Figure 42. Preprocessing binary features and the resulting linear model (link to the code for generating the image – image by author)
Notice that Figure 42 does not show two separate models, each fitted to its own subset, but a single model. Here, the slope coefficient b_1 stays fixed, while the vertical shift of the fitted line differs depending on whether the binary feature is 0 or 1. This happens because when the feature is equal to 0, the corresponding term in the model becomes zero. This works well when the relationship between the features and the target is linear and follows the same direction for all observations. But a binary feature will not help much when the relationship is more complex and changes direction across the data (Figure 43).
Figure 43. When the relationship between the features and the target differs across subsets, a single model with a binary feature cannot adequately model either part of the dataset (image by author)
As Figure 43 shows, in the worst case a model with a binary feature collapses to the same behavior as a model with just one numerical feature. To address this “problem,” we can borrow an idea from the previous section (feature generation) and generate a new interaction feature, or we can fit two separate models for different parts of the dataset (Figure 44).
Figure 44. Ways to improve a model with a binary feature: fitting separate models and generating an interaction feature from the binary one for more accurate modeling (image by author)
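The two fixes can be compared directly. The sketch below uses made-up data in which air conditioning changes the slope of the price/rooms relationship, not just its level; adding the interaction feature rooms × AC lets a single model capture that:

```python
import numpy as np

rng = np.random.default_rng(2)
rooms = rng.integers(1, 6, size=40).astype(float)
ac = rng.integers(0, 2, size=40).astype(float)  # 1 = has air conditioning
# Hypothetical data: AC apartments gain extra price per room (different slope)
price = 8_000 + 2_000 * rooms + 3_000 * rooms * ac + rng.normal(0, 500, size=40)

def fit_rmse(X, y):
    """Least-squares fit, returning the training RMSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((y - X @ coef) ** 2))

X_binary = np.column_stack([np.ones(40), rooms, ac])             # shift only
X_inter = np.column_stack([np.ones(40), rooms, ac, rooms * ac])  # shift + tilt
rmse_binary, rmse_inter = fit_rmse(X_binary, price), fit_rmse(X_inter, price)
print(round(rmse_binary), round(rmse_inter))
```

The interaction model fits far better here, because its feature space actually contains the pattern that generated the data.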
Now that we have dealt with the binary feature, it makes sense to move on to the more complex case where a column contains more than two unique values. There are many ways to encode categorical values, and some of them are shown in Figure 45. I will not go through all of them here, though, because in my own experience one-hot encoding has been enough for practical applications. Just keep in mind that there are different encoding methods.
Figure 45. Methods for encoding categorical variables (link to the code for doing the computations – image by author)
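One-hot encoding itself is only a few lines of code. A minimal sketch with made-up city names (pandas.get_dummies or scikit-learn’s OneHotEncoder do the same with more conveniences):

```python
import numpy as np

cities = ["New York", "Boston", "New York", "Chicago", "Boston"]

# One 0/1 column per unique category, in a fixed (sorted) order
categories = sorted(set(cities))
onehot = np.array([[1.0 if city == cat else 0.0 for cat in categories]
                   for city in cities])
print(categories)  # column order: ['Boston', 'Chicago', 'New York']
print(onehot)
```

When the model already has an intercept, one of the columns is usually dropped: the columns sum to one in every row, which is exactly the kind of perfect multicollinearity discussed earlier.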
Estimating feature importance
Now that we know how to make the model more complex by adding new features, it makes sense to talk about how to combine the independent variables more thoughtfully. Of course, when the feature space grows, whether through feature generation or through collecting new data, practical limits quickly appear, such as “common sense” and model “training time”. But we can also rely on more effective heuristics to decide which features are actually worth keeping in the model. Let us start with the simplest one and take a closer look at the coefficients of a multiple linear regression model (Figure 46).
Figure 46. Coefficient size as an indicator of feature importance (link to the code for generating the image – image by author)
As Figure 46 shows, a small problem appears here: differences in feature scale affect the estimated coefficients. Differences in scale also lead to other unpleasant effects, which become especially noticeable when numerical methods are used to find the optimal coefficients. That is why it is standard practice to bring features to a common scale through normalization.
Normalization and standardization (standard scaling) of features
Normalization is a data transformation that brings the values in the arrays to the same range (Figure 47).
Figure 47. Demonstration of the results of applying data normalization methods to two features: number of rooms and distance to the metro (link to the code for generating the image – image by author)
Once the features are brought to the same scale, the size of the coefficients in a linear regression model becomes a convenient indicator of how strongly the model relies on each variable when making predictions.
The exact formulas used for normalization and standardization are shown in Figure 48.
Figure 48. Scaling methods. Extreme cases with outliers are shown here. In practice, if the training set is representative, such outliers should be much less common (link to the code for doing the computations – image by author)
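The two transformations from Figure 48 can be written as follows, with made-up distances for illustration (scikit-learn’s MinMaxScaler and StandardScaler implement the same logic for full feature matrices):

```python
import numpy as np

def min_max_scale(x):
    """Normalization: map values linearly onto the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Standardization: shift to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

metro = np.array([300.0, 800.0, 1_200.0, 2_500.0, 450.0])
print(min_max_scale(metro).round(2))  # all values now lie in [0, 1]
print(standardize(metro).round(2))    # mean 0, standard deviation 1
```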
From this point on, we will assume that all numerical features have been standardized. For the sake of clearer visualizations, we will apply the same transformation to the target as well, even though that is not compulsory. When needed, we can always convert the target back to its original scale.
Model coefficient and error landscape when the features are standardized
Once the original features have been standardized, meaning the coefficients b_1, b_2, b_3 and so on are now on a comparable scale, which makes them easier to vary, it becomes a good moment to look more closely at how their values affect model error. To measure error, we will use MAE and MAPE for simple linear regression, and RMSE for multiple linear regression.
Animation 10. Relationship between the coefficients b_0 and b_1 of a simple linear regression model and the MAE metric. The feature in the model is the number of rooms. A note on the changing intercept in the original units: we vary the slope while working with standardized data, so the intercept in the original units is recalculated accordingly (link to the code for generating the animation – animation by author)
As Animation 10 shows, there is a particular combination of coefficients at which the model error reaches its minimum. At the same time, changes in the intercept and the slope affect the error to a similar degree: the contour lines of the error surface on the left are almost circular.
For comparison, it is useful to look at how different metric landscapes can be. In the case of mean absolute percentage error, the picture changes noticeably. Because MAPE is sensitive to errors at small target values, here, “cheap apartments”, the minimum stretches into an elongated valley. As a result, many coefficient combinations produce similar MAPE values as long as the model fits the region of small y well, even if it makes noticeable errors for expensive apartments (Animation 11).
Animation 11. Relationship between the coefficients b_0 and b_1 of a simple linear regression model and the MAPE metric. The feature in the model is the number of rooms (link to the code for generating the animation – animation by author)
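A coarse version of these landscapes is easy to reproduce: evaluate the metric on a grid of coefficient pairs and locate the minimum. The sketch below does this for MAE on synthetic standardized data (the true slope of 0.8 is an assumption of the example, not a value from the article):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)                    # standardized feature
y = 0.8 * x + rng.normal(0, 0.2, size=50)  # standardized target

# Evaluate MAE at every (b0, b1) pair on a grid of candidate coefficients
b0_grid = np.linspace(-1, 1, 81)
b1_grid = np.linspace(-1, 2, 121)
mae = np.array([[np.mean(np.abs(y - (b0 + b1 * x))) for b1 in b1_grid]
                for b0 in b0_grid])

# Coordinates of the grid cell with the smallest error
i, j = np.unravel_index(mae.argmin(), mae.shape)
print(b0_grid[i], b1_grid[j])  # the minimum sits near (0, 0.8)
```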
Next, we increase the number of features in the model, so instead of finding the optimal combination of two coefficients, we now need to find the best combination of three (Animations 12 and 13):
Animation 12. Relationship between the coefficients b_0, b_1, b_2 and the RMSE metric. The features in the model are number of rooms (x_1) and distance to the metro (x_2) (link to the code for generating the animation – animation by author)
Animation 13. Relationship between the coefficients b_0, b_1, b_2 and the RMSE metric. The features in the model are number of rooms (x_1) and apartment area (x_2) (link to the code for generating the animation – animation by author)
The animations above show that the features are strongly linearly related. For example, in Animation 12, the b_1 vs b_2 projection, the plane on the left in the lower-left panel, shows a clear linear pattern. This tells us two things. First, there is a strong negative correlation between the features number of rooms and distance to the metro. Second, even though the coefficients “move along the valley” of low RMSE values, the model predictions remain stable, and the error hardly changes. This also suggests that the features carry similar information. The same pattern appears in Animation 13, but there the linear relationship between the features is even stronger, and positive rather than negative.
I hope this short section with visualizations gave you a chance to catch your breath, because the next part will be harder to follow: from here on, linear algebra becomes unavoidable. Still, I promise it will include just as many visualizations and intuitive examples.
Extending the analytical solution to the multivariate case
Earlier in the article, when we explored the error surface, we could visually see where the model error reached its minimum. The model itself has no such visual cue, so it finds the optimum, the best combination of coefficients b_0, b_1, b_2, and so on, using a formula. For simple linear regression, where there is only one feature, we already introduced that equation (Figure 6). But now we have several features, and once they have been preprocessed, it is natural to ask how to find the optimal coefficients for multiple linear regression, in other words, how to extend the solution to higher-dimensional data.
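In matrix form, the analytical solution we are heading toward is the normal equation b = (XᵀX)⁻¹Xᵀy, where X is the design matrix with a column of ones for the intercept. A quick numerical check on synthetic data (the feature names and true coefficients below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)  # e.g. standardized number of rooms
x2 = rng.normal(size=n)  # e.g. standardized distance to the metro
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept column
# Normal equation: solve (X^T X) b = X^T y rather than inverting explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b.round(2))  # close to the true coefficients [1.0, 2.0, -0.5]
```

Solving the linear system instead of computing the inverse is the standard, numerically safer way to apply the formula.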
A quick disclaimer: this section will be very colorful, and that is intentional, because each color carries meaning. So I have two requests. First, please pay close attention to the colors. Second, if you have difficulty distinguishing colors or shades, please send me your suggestions on how these visualizations could be improved, including in a private message if you prefer. I will do my best to keep improving the visuals over time.
Earlier, when we introduced the analytical solution, we wrote the calculations in scalar form. But it is much more efficient to switch to vector notation. To make that step easier, we will visualize the original data not in feature space, but in observation space (Figure 49).
Figure 49. A toy dataset and its representation in observation space (image by author)
Even though this way of looking at the data may seem counterintuitive at first, there is no magic behind it. The data are exactly the same, only the form has changed. Moving on, in school, at least in my case, vectors were introduced as directed line segments. These “directed line segments” can be multiplied by a number and added together. In vector space, the goal of linear regression is to find a transformation of the vector x such that the resulting prediction vector, usually written as ŷ, is as close as possible to the target vector y. To see how this works, we can start by trying the simplest transformations, beginning with multiplication by a number (Figure 50).
Figure 50. Building the simplest linear regression model: slope (b_1) only, scaling the vector x by different numbers (image by author)
Starting from the top-left corner of Figure 50, the model does not transform the feature vector x at all, because the coefficient b_1 is equal to 1. As a result, the predicted values are exactly the same as the feature values, and the vector x fully coincides with the prediction vector.
If the coefficient b1 is greater than 1, multiplying the vector x by this coefficient increases the length of the prediction vector proportionally. The feature vector can also be compressed, when b1 is between 0 and 1, or flipped in the opposite direction, when b1 is less than 0.
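In code, this scaling is nothing more than multiplying an array by a number. A tiny numpy sketch, with made-up numbers for illustration:

```python
import numpy as np

# Feature vector x for two observations, drawn in observation space
x = np.array([2.0, 4.0])

y_hat_stretch = 1.5 * x    # b1 > 1: the prediction vector gets longer
y_hat_compress = 0.5 * x   # 0 < b1 < 1: it gets shorter
y_hat_flip = -1.0 * x      # b1 < 0: it points the opposite way
```

For positive b1 the direction never changes, only the length, which is why a slope-only model can never move the prediction vector off the line spanned by x.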
Figure 51. What to do when multiplying by b1 is not enough (image by author)
Figure 50 gives a clear visual explanation of what it means to multiply a vector by a scalar. But in Figure 51, two more vector operations appear. It makes sense to briefly review them separately before moving on (Figure 52).
Figure 52. A small but important reminder: translation and vector addition (image by author)
After this brief reminder, we can continue. As Figure 51 shows, for two observations we were able to express the target vector as a combination of feature vectors and coefficients. But now it is time to make the task more difficult (Animation 14).
Animation 14. Increasing the sample size to three observations. Try to imagine a straight line on the plot to the left that passes through all three points (link to the code for generating the animation – animation by author)
As the number of observations grows, the dimensionality grows with it, and the plot gains more axes. That quickly becomes hard for us (humans) to picture, so I will not go further into higher dimensions here; there is no real need. The main ideas we are discussing still work there as well. In particular, the task remains the same: we need to find a combination of the vectors v (the all-ones vector) and x (the feature vector from the dataset) such that the resulting prediction vector ŷ is as close as possible to the target vector y. The only things we can vary here are the coefficients multiplying v, namely b0, and x, namely b1. So now we can try different combinations and see what the solution looks like both in feature space and in vector space (Animation 15).
Animation 15. Exploring the coefficients of a simple linear regression model for three observations: a visualization of the target and prediction vectors, where the prediction vector is formed from the feature vectors v and x. Visualization of the subspace Col(X) (link to the code for generating the animation – animation by author)
The area of the graph that contains all possible solutions can be outlined, which gives us a plane. In the animation above, that plane is shown as a parallelogram to make it easier to see. We will call this plane the prediction subspace and denote it as Col(X). As shown in Animation 15, the target vector y does not lie in the solution subspace. This means that no matter which solution, or prediction vector, we find, it will always differ slightly from the target one. Our goal is to find a prediction vector that lies as close as possible to y while still belonging to the subspace Col(X).
In the visualization above, we built this subspace by combining the vectors v and x with different coefficients. The same expression can also be written in a more compact form, using matrix multiplication. To do this, we introduce one more vector, this time built from the coefficients b0 and b1. We will denote it by b. A vector can be transformed by multiplying it by a matrix, which can rotate it, stretch or compress it, and also map it into another subspace. If we take the matrix X built from the column vectors v and x, and multiply it by the vector b made up of the coefficient values, we obtain a vector lying in the subspace Col(X): the prediction ŷ (Figure 53).
Figure 53. Transforming the target vector y into the prediction vector ŷ (image by author)
Note that, in line with our assumptions, the target vector does not lie in the prediction subspace. While a straight line can always be drawn exactly through two points, with three or more points the chance increases that no perfect model with zero error exists. That is why the target vector does not lie on the hyperplane even for the optimal model (see the black vector for model C in Figure 54).
Figure 54. Visualization of two poor models, A and B, and one optimal model, C (link to the code for generating the image – image by author)
A closer look at the figure reveals an important difference between the prediction vectors of models A, B, and C: the vector for model C looks like the shadow of the target vector on the plane. This means that solving a linear regression problem can be interpreted as projecting the vector y onto the subspace Col(X). The best prediction among all possible ones is the vector that ends at the point on the plane closest to the target. From basic geometry, the closest point on a plane is the point where a perpendicular from the target meets the plane. This perpendicular segment is also a vector, called the residual vector e, because it is obtained by subtracting the predictions from the target (recall the residual formula from the chapter on visual model evaluation).
So, we know the target vector y and the feature vector x. Our goal is to find a coefficient vector b such that the resulting prediction vector ŷ is as close as possible to y. We do not know the residual vector e, but we do know that it is orthogonal to the subspace Col(X). This, in turn, means that e is orthogonal to every direction in the plane, and therefore, in particular, perpendicular to every column of X, that is, to the vectors v and x.
Figure 55. Using the orthogonality property to derive the formula. To find the coefficient vector, we need to transpose, multiply, and invert the feature matrix. The Ordinary Least Squares (OLS) method (image by author)
The analytical method we have just gone through is called the least squares method, or Ordinary Least Squares (OLS). It has this name because we chose the coefficients to minimize the sum of squared residuals of the model (Figure 6). In vector space, that sum of squared residuals is exactly the squared Euclidean distance from the target vector to the subspace Col(X) (Figure 55). In other words, least squares means the smallest squared distance.
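The derivation from Figure 55 fits in a few lines of numpy. Here is a sketch on a made-up three-point dataset; in practice `np.linalg.lstsq` is preferred over forming the inverse explicitly:

```python
import numpy as np

# Three observations, one feature: the matrix X holds the column of
# ones (the vector v) and the feature column x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.0])
X = np.column_stack([np.ones_like(x), x])  # columns: v, x

# Normal equations, from the orthogonality of the residual to Col(X):
# X.T @ (y - X @ b) = 0  ->  b = (X.T X)^{-1} X.T y
b = np.linalg.inv(X.T @ X) @ X.T @ y

# The prediction vector is the projection of y onto Col(X),
# so the residual is orthogonal to every column of X
y_hat = X @ b
residual = y - y_hat
print(X.T @ residual)  # ≈ [0, 0]
```

The `print` at the end is a direct check of the orthogonality property we used to derive the formula.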
Now let us recall the goal of this section: we worked through the formulas and visualizations above to extend the analytical solution to the multivariate case. And now it is time to check how the formula works when there are not one but two features! Consider a dataset with three observations, to which we add one more feature (Animation 16).
Animation 16. What happens when the number of features increases: multivariate regression in vector form. The formula stays the same, only one new vector, x2, is added to the matrix X. For visual convenience, the subspace Col(X) is shown as bounded by a polygon (link to the code for generating the animation – animation by author)
There are three important findings to take away from Animation 16:
- First, the model plane passes exactly through all three data points. This means that the second feature added the missing information that the one-feature model lacked. In Figure 50, for example, none of the lines passed through all the points.
- Second, on the right, the number of vectors has not changed, because the dataset still contains three observations.
- Third, the subspace Col(X) is no longer just a “plane” on the graph, it now fills the entire space. For visualization purposes, the values are bounded by a three-dimensional shape, a parallelepiped. Since this subspace fully contains the target vector y, the projection of the target becomes trivial. In the animation, the target vector and the prediction vector coincide. The residual is zero.
When the analytical solution runs into difficulties
Now imagine we are unlucky, and the new feature x2 does not add any new information. Suppose this new feature can be expressed as a linear combination of the other two, the shift term and feature x1. In that case, the Col(X) polygon collapses back into a plane, as shown in Animation 17.
Animation 17. Many coefficient combinations lead to the same prediction: multivariate linear regression with two features, where one can be expressed as a linear combination of the other and the shift term (link to the code for generating the animation – animation by author)
And even though we previously had no trouble finding a projection onto such a subspace, the prediction vector is now built not from two vectors (the shift term and x1) but from three (the shift term, x1, and x2). Because there are now more degrees of freedom, there is more than one solution. On the left side of the graph, this is shown by two separate model surfaces that describe the data equally well from the point of view of the least squares method. On the right, the feature vectors for each model are shown, and in both cases they add up to the same prediction vector.
With this kind of input data, the problem appears when trying to compute the inverse matrix (Figure 56).
Figure 56. The formula for the analytical solution that we used earlier can no longer be applied. Exactly the same problem will also appear in our main apartment price dataset (image by author)
As Figure 56 shows, the matrix is singular, which means the inverse matrix formula cannot be applied and there is no unique solution. It is worth noting that even when there is no exact linear dependence, the problem still remains if the features are highly correlated with one another, for example, floor area and number of rooms. In that case, the matrix becomes ill-conditioned, and the solution becomes numerically unstable. Other issues may also arise, for example with one-hot encoded features, but even this is already enough to start thinking about alternative solution methods.
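This collapse is easy to reproduce. In the sketch below (made-up numbers), x2 is an exact linear combination of the intercept column and x1, so the matrix loses rank:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1 + 1.0    # exact linear combination of the intercept and x1
X = np.column_stack([np.ones_like(x1), x1, x2])

# Three columns, but only two independent directions,
# so X.T @ X is singular and the OLS formula cannot be applied
rank = np.linalg.matrix_rank(X)   # 2, not 3

# lstsq still returns *a* solution (the minimum-norm one), but it is
# just one of infinitely many with identical predictions
y = np.array([3.0, 5.0, 7.0, 9.0])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
```

Trying `np.linalg.inv(X.T @ X)` here would fail or produce numerically meaningless values, which is exactly the problem shown in Figure 56.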
In addition to the issues discussed above, an analytical solution to linear regression is also not applicable in the following cases:
- A non-quadratic or non-smooth loss function is used, such as L1 loss or quantile loss. In that case, the task no longer reduces to the least squares method.
- The dataset is very large, or the computing device has limited memory, so even if a formula exists, calculating it directly is not practical.
Anticipating how the reader may feel after getting through this section, it is worth pausing for a moment and keeping one main idea in mind: sometimes the “formula” either does not work or is not worth using, and in those cases we turn to numerical methods.
Numerical methods
To address the problem with the analytical solution formula described above, numerical methods are used. Before moving on to specific implementations, however, it is useful to state the task clearly: we need to find a combination of coefficients for the features in a linear regression model that makes the error as small as possible. We will measure the error using metrics.
Exhaustive search
The simplest approach is to try all coefficient combinations using some fixed step size. In this case, exhaustive search means checking every pair of coefficients from a predefined discrete grid of values and selecting the pair with the smallest error. The error is usually measured with the MSE metric, which is the same as RMSE but without the square root.
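Here is what such a grid search might look like for simple linear regression, on a made-up toy dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)  # true b0=2.0, b1=1.5

def mse(b0, b1):
    return np.mean((y - (b0 + b1 * x)) ** 2)

# Check every (b0, b1) pair on a fixed grid and keep the best one
b0_grid = np.arange(-5.0, 5.01, 0.1)
b1_grid = np.arange(-5.0, 5.01, 0.1)
best = min(
    ((b0, b1) for b0 in b0_grid for b1 in b1_grid),
    key=lambda pair: mse(*pair),
)
```

The result lands close to the true (2.0, 1.5), but never closer than the grid step allows, which is exactly the drawback discussed next.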
Perhaps because of my love for geography, one analogy has always come to mind: optimization as the search for the location with the lowest elevation (Animation 18). Imagine a landscape in the “real world” on the left. During the search, we can sample individual locations and build a map in the center, in order to solve a practical problem, in our case, to find the coordinates of the point where the error function reaches its minimum.
For simplicity, Animations 18 and 19 show the process of finding coefficients for simple linear regression. However, the numerical optimization methods discussed here also extend to multivariate cases, where the model includes many features. The main idea stays the same, but such problems become extremely difficult to visualize because of their high dimensionality.
Animation 18. Exhaustive search for finding the solution to simple linear regression (link to the code for generating the animation – animation by author)
Random search
The exhaustive search approach has one major drawback: it depends heavily on the grid step size. The grid covers the space uniformly, and although some regions are clearly unpromising, computations are still performed for poor coefficient combinations. Therefore, it might be beneficial to explore the landscape randomly, without a predefined grid (Animation 19).
Animation 19. Random search for finding the optimal set of coefficients in simple linear regression (link to the code for generating the animation – animation by author)
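The same toy problem, solved by random sampling instead of a grid (a sketch; the sampling range and budget are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)  # true b0=2.0, b1=1.5

def mse(b0, b1):
    return np.mean((y - (b0 + b1 * x)) ** 2)

# Sample coefficient pairs uniformly instead of walking a fixed grid
best_pair, best_err = None, np.inf
for _ in range(5000):
    b0, b1 = rng.uniform(-5.0, 5.0, size=2)
    err = mse(b0, b1)
    if err < best_err:
        best_pair, best_err = (b0, b1), err
```

With enough samples the best pair ends up near the optimum, but every sample costs a full error evaluation, which motivates the smarter methods below.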
One drawback of both random search and grid-based search is their computational cost, especially when the dataset is large and the number of features is high. Each iteration requires a full pass over the data to evaluate the error, so it makes sense to look for an approach that minimizes the number of iterations.
Using information about the direction
Instead of blindly trying random coefficient combinations, the approach can be improved by using information about the shape of the error function landscape and taking a step in the most promising direction based on the current value. This is especially relevant for the MSE error function in linear regression, because the error function is convex, which means it has only one global optimum.
To make the idea easier to see, we will simplify the problem and take a slice of the error surface along just one parameter, a one-dimensional array of error values, and use it as an example. As we move along this array, we can use the fact that the error value has already been computed at the previous step. Taking MSE in this example and comparing the current value with the previous one, we can determine which direction makes sense for the next step, as shown in Figure 57.
Figure 57. Descent using pairwise comparisons. Optimizing the coefficient values in the slice along the intercept parameter b0 (link to the code for generating the image – image by author)
We move along the slice from left to right, and if the error starts to increase, we turn and move in the opposite direction.
It makes sense to visualize this approach in motion. Start from a random initial guess, a randomly chosen point on the graph, and move to the right, thereby increasing the intercept coefficient. If the error starts to grow, the next step is taken in the opposite direction. During the search, we will also count how many times the metric is evaluated (Animation 20).
Animation 20. Descent using pairwise comparisons along a parabola. Examples are shown for two initial guesses, the yellow one and the green one (link to the code for generating the animation – animation by author)
It is important to note explicitly that in Animation 20 the step is always equal to one interval, one grid step, and no derivatives are used yet, anticipating the gradient descent algorithm. We simply compare metric values in pairs.
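The same pairwise-comparison walk in code, on a noise-free slice along the intercept (the grid step and starting point are arbitrary choices):

```python
import numpy as np

x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x   # noise-free toy data, the true intercept is 2.0

def mse_b0(b0):
    # Slice of the error surface: the slope is frozen at 1.5,
    # only the intercept b0 varies
    return np.mean((y - (b0 + 1.5 * x)) ** 2)

b0, step = -4.0, 0.5   # initial guess and grid step
prev = mse_b0(b0)
n_evals = 1            # count how many times the metric is evaluated
for _ in range(100):
    nxt = mse_b0(b0 + step)
    n_evals += 1
    if nxt >= prev:             # error grew: turn around
        step = -step
        back = mse_b0(b0 + step)
        n_evals += 1
        if back >= prev:        # both neighbors are worse: grid optimum reached
            break
    else:
        b0, prev = b0 + step, nxt
```

Starting from -4.0 with a step of 0.5, the walk settles at b0 = 2.0 after a handful of metric evaluations; halve the step and the count roughly doubles, which is the trade-off shown in Animation 21.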
The approach described above has one major drawback: it depends heavily on the grid size. For example, if the grid is fine, many steps will be needed to reach the optimum. On the other hand, if the grid is too coarse, the optimum will be missed (Animation 21).
Animation 21. Descent using pairwise comparisons: convergence speed and grid size (link to the code for generating the animation – animation by author)
So, we want the grid to be as dense as possible in order to descend to the minimum with high accuracy. At the same time, we want it to be as sparse as possible in order to reduce the number of iterations needed to reach the optimum. Using the derivative solves both of these problems.
Gradient descent
As the grid step becomes smaller in pairwise comparisons, we arrive at the limit-based definition of the derivative (Figure 58).
Figure 58. The gradient on a slice of the error function: in the one dimensional case, it is the derivative and shows the direction of change in MSE (link to the code for doing the computations – image by author)
Now it is time to surf across the error landscape. See the animation below, which shows the gradient and the anti-gradient vectors (Animation 22). As we can see, the step size can now be chosen freely, because we are no longer constrained by a regular grid [Goh, Gabriel. Why Momentum Really Works. 2017. https://distill.pub/2017/momentum/].
Animation 22. Exploring the gradient and anti-gradient in different parts of the error slice. Since we are no longer limited by the grid size, the step between iterations can now be chosen freely: larger for the first initial guess, the yellow point, and smaller for the second initial guess, the green point (link to the code for generating the animation – animation by author)
In multivariate spaces, for example when optimizing the intercept and slope coefficients at the same time, the gradient consists of partial derivatives (Figure 59).
Figure 59. Choosing the direction when the gradient is computed with respect to two coefficients (link to the code for generating the image – image by author)
It is now time to see gradient descent in action (Animation 23).
Animation 23. Gradient descent for finding the optimal set of coefficients in simple linear regression. In practice, the starting point is usually chosen at or near the coordinates 0, 0. In the examples that follow, however, I will use different starting points to make the visualizations less repetitive (link to the code for generating the animation – animation by author)
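A minimal gradient descent sketch for simple linear regression; the toy data, learning rate, and iteration count here are my own choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)

b0, b1 = 0.0, 0.0   # starting point
lr = 0.01           # learning rate

for _ in range(5000):
    y_hat = b0 + b1 * x
    # Partial derivatives of MSE = mean((y - y_hat)^2)
    # with respect to b0 and b1
    grad_b0 = -2.0 * np.mean(y - y_hat)
    grad_b1 = -2.0 * np.mean((y - y_hat) * x)
    # Step against the gradient (the anti-gradient direction)
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1
```

After enough iterations the coefficients land on the same values the OLS formula would give: the numerical route and the analytical route agree whenever both are applicable.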
See how gradient descent converges at different learning rates
Extra animation 5. Slowly moving toward the optimum with a learning rate of 0.06. The maximum number of iterations allowed is 25 (link to the code for generating the animation – animation by author)
Extra animation 6. Overshooting the optimum with a learning rate of 3.0 (link to the code for generating the animation – animation by author)
A useful feature of numerical methods is that the error function can be defined in different ways and, as a result, different properties of the model can be optimized (Figure 60).
Figure 60. A model can be optimized in different ways. Tukey’s biweight loss as a way to handle outliers (link to the code for doing the computations – image by author)
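For reference, here is a sketch of Tukey's biweight loss; the threshold c is a tuning choice, and 4.685 is a commonly used default:

```python
import numpy as np

def tukey_biweight(residuals, c=4.685):
    """Tukey's biweight loss: behaves like a (scaled) squared loss near
    zero, but saturates at a constant for |r| > c, so large outliers
    stop pulling on the fit."""
    r = np.asarray(residuals, dtype=float)
    inside = np.abs(r) <= c
    loss = np.full_like(r, c**2 / 6.0)  # saturated value for outliers
    loss[inside] = (c**2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return loss

print(tukey_biweight([0.0, 100.0]))  # the outlier's loss is capped at c^2/6
```

Compare this with MSE, where the loss of a residual of 100 would be 10,000 and would dominate the whole sum.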
When Tukey’s loss function is used, the optimization process looks as follows (Animation 24).
Animation 24. Replacing the MSE error function with Tukey’s loss function (link to the code for generating the animation – animation by author)
However, unlike the squared loss, Tukey’s loss function is not always convex, which means it can have local minima and saddle points where the optimization may get stuck (Animation 25).
Animation 25. Gradient descent is a local optimization method, so the starting point matters. Shown using Tukey’s loss function (link to the code for generating the animation – animation by author)
Now we move on to multivariate regression. If we look at the convergence history of the solution toward the optimal coefficients, we can see how the coefficients for the “important” features gradually increase, while the error gradually decreases as well (Figure 61).
Figure 61. The process of converging to the optimal solution in a multiple linear regression model (link to the code for generating the image – image by author)
Regularization
Recall the effect shown in Animation 5, where different training samples led to different estimated coefficients, even though we were trying to recover the same underlying relationship between the feature and the target. The model turned out to be unstable, meaning it was sensitive to the train/test split.
There is another problem as well: sometimes a model performs well on the training set but poorly on new data.
So, in this section, we will look at coefficient estimation from two perspectives:
- How regularization helps when different train/test splits lead to different coefficient estimates;
- How regularization helps the model generalize well to new data.
Keep in mind that our data is not great: there is multicollinearity, meaning correlation between features, which leads to numerically unstable coefficients (Figure 62).
Figure 62. Multicollinearity makes the model unstable: different training samples drawn from the same population lead to different results (link to the code for generating the image – image by author)
One way to improve numerical stability is to impose constraints on the coefficients, that is, to use regularization (Figure 63).
Figure 63. Imposing constraints on the values of the coefficients for the features in a linear regression model. Lasso and Ridge regression. Split 2 (image by author)
Regularization allows finer control over the training process: the feature coefficients take on more reasonable values. This also helps address possible overfitting, when the model performs much worse on new data than on the training set (Figure 64).
Figure 64. The convergence of coefficients under L1 regularization (Lasso) and L2 regularization (Ridge). Train/test split 2 (link to the code for generating the image – image by author)
At a certain point (Figure 64), the metric on the test set begins to rise and diverge from the metric on the training set, starting from iteration 10 of gradient descent with L2 regularization. This is another sign of overfitting. Still, for linear models, such behavior across gradient descent iterations is relatively rare, unlike in many other machine learning algorithms.
Now we can look at how the plots change for different coefficient values in Figure 65.
Figure 65. Coefficients of a multiple linear regression model obtained with Ridge regression, compared with coefficients obtained without regularization (link to the code for generating the image – image by author)
Figure 65 shows that with regularization, the coefficients become more even and no longer differ much, even when different training samples are used to fit the model.
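Ridge also has its own closed-form solution, which makes this stabilizing effect easy to check in plain numpy. A sketch on made-up, nearly collinear data (the value of alpha is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)   # nearly a copy of x1: multicollinearity
y = 3.0 * x1 + 3.0 * x2 + rng.normal(0, 1, n)
X = np.column_stack([x1, x2])

def fit(X, y, alpha):
    # Ridge closed form: b = (X.T X + alpha * I)^-1 X.T y
    # (alpha = 0 gives back plain OLS)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

b_ols = fit(X, y, alpha=0.0)  # collinearity makes these unstable
b_l2 = fit(X, y, alpha=1.0)   # ridge pulls the coefficients toward each other
```

With near-duplicate features, OLS can split the total effect between them almost arbitrarily, while the ridge penalty pushes both coefficients toward similar, reasonable values.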
Overfitting
The strength of regularization can be varied (Animation 26).
Animation 26. Scatter plot of predictions vs actual values, along with the metric values for models obtained with different levels of regularization (link to the code for generating the animation – animation by author)
Animation 26 shows the following:
- Row 1: The feature coefficients, the metrics on the training and test sets, and a plot comparing predictions with actual values for the model without regularization.
- Row 2: How Lasso regression behaves at different levels of regularization. The error on the test set decreases at first, but then the model gradually collapses to predicting the mean because the regularization becomes too strong, and the feature coefficients shrink to zero.
- Row 3: As the regularization becomes stronger, Ridge regression shows better and better error values on the test set, even though the error on the training set gradually increases.
The main takeaway from Animation 26 is this: with weak regularization, the model performs very well on the training set, but its quality drops noticeably on the test set. This is an example of overfitting (Figure 66).
Figure 66. Overfitting, when a model performs poorly on new data (image by author)
Here is an artificial but highly illustrative example based on generated features for polynomial regression (Animation 27).
Animation 27. Regularization with polynomial features, when the model learns to capture the important patterns instead of trying to fit the noise in the data. The data is synthetic: the underlying relationship is linear, noise is added to the training set, while the test set is left noise free (link to the code for generating the animation – animation by author)
Hyperparameter tuning
Above, we touched on a very important question: how to determine which value of the hyperparameter alpha is suitable for our dataset (since we can vary regularization strength). One option is to split the data into training and test sets, train n models on the training set, then evaluate the metric on the test set for each model. We then choose the one with the smallest test error (Figure 67).
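A sketch of this tuning loop, reusing the closed-form ridge solution (the dataset and the alpha grid are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = rng.normal(0, 1, (n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, n)

# Simple train/test split
X_tr, X_te = X[:40], X[40:]
y_tr, y_te = y[:40], y[40:]

def fit_ridge(X, y, alpha):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def mse(X, y, b):
    return np.mean((y - X @ b) ** 2)

# Train one model per alpha and keep the one with the lowest test error
alphas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {a: mse(X_te, y_te, fit_ridge(X_tr, y_tr, a)) for a in alphas}
best_alpha = min(scores, key=scores.get)
```

With overly strong regularization (alpha = 100 here) the test error rises sharply, so the selection naturally avoids it.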
Figure 67. Hyperparameter tuning by grid search, with metrics measured on the test set in order to find the optimal model coefficients (link to the code for generating the image – image by author)
However, the approach above creates a risk of tuning the model to a specific test set, which is why cross-validation is commonly used in machine learning (Figure 68).
Figure 68. Splitting the data into training, validation and test sets, and training the model on the data (link to the code for generating the image – image by author)
As Figure 68 shows, in cross-validation the metric is evaluated using the entire dataset, which makes comparisons more reliable. This is a very common approach in machine learning, and not only for linear regression models. If this topic interests you, the scikit-learn documentation on cross-validation is a good place to continue: https://scikit-learn.org/stable/modules/cross_validation.html.
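A minimal k-fold loop in plain numpy, just to make the mechanics concrete; in practice scikit-learn's `cross_val_score` and `GridSearchCV` do this for you:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = rng.normal(0, 1, (n, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 1, n)

def fit_ridge(X, y, alpha):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def cv_mse(X, y, alpha, k=5):
    # Split indices into k folds; each fold serves as the validation
    # set exactly once, so the whole dataset contributes to the metric
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        b = fit_ridge(X[tr_idx], y[tr_idx], alpha)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ b) ** 2))
    return float(np.mean(errors))

score = cv_mse(X, y, alpha=1.0)
```

Averaging the error over all folds gives a more reliable estimate than any single train/test split, at the cost of fitting the model k times.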
Linear regression is a whole world
In machine learning, it is connected with metrics, cross-validation, hyperparameter tuning, coefficient optimization with gradient descent, methods for filtering values and selecting features, and preprocessing.
In statistics and probability theory, it involves parameter estimation, residual distributions, prediction intervals, and statistical testing.
In linear algebra, it brings in vectors, matrix operations, projections onto feature subspaces, and much more.
Figure 69. Thank you for your attention! (image by author)
Conclusion
Thank you to everyone who made it this far.
We did not just get acquainted with a machine learning algorithm, but also with the toolkit needed to tune it carefully and diagnose its behavior. I hope this article will play its part in your journey into the world of machine learning and statistics. From here on, you sail on your own 🙂
If you enjoyed the visualizations and examples, and would like to use them in your own lectures or talks, please do. All materials and the source code used to generate them are available in the GitHub repository – https://github.com/Dreamlone/linear-regression
Sincerely yours, Mikhail Sarafanov

