This article is Part 2 of a series of three articles. In the first article, we introduced the Cross-Industry Standard Process for Data Mining (CRISP-DM), along with the concept of graphs for model visualization. In this article, we will take a closer look at a practical application for Linear Regression.

Conventions Used for this Article

The following subsection categories tailor the content to different audiences. The names come from my favorite race of humanoids in the Star Trek universe.

For Vulcans
Includes deep technical and academic content for ML practitioners. 

For Non-Vulcans
Includes content at a high level for managers and other business-level professionals who may work with ML practitioners.

External Materials

All code used to produce the visuals in this article can be cloned or downloaded from https://github.com/jbonfardeci/model-validation-blog-post.

For linear regression models, we test for homoscedasticity (same dispersion), aka constant variance, by plotting the model's residuals on the Y-axis against the model's predicted values on the X-axis (Figure 3 below). The model is valid if there is no discernible pattern, that is, if the dots are randomly scattered from left to right, as shown in the first plot in Figure 3. The model is invalid if there is a pattern indicating heteroscedasticity (different dispersion), aka non-constant variance, as shown in the second and third plots in Figure 3.
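
As a minimal sketch, and not code from the companion repository, a residual plot like the ones in Figure 3 can be drawn with StatsModels and Matplotlib. The simulated data, variable names, and plot styling below are assumptions made purely for illustration:

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # Simulated data: a linear signal plus random noise (illustrative only).
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, 200)
    y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 200)

    # Fit an OLS model with an intercept term.
    X = sm.add_constant(x)
    model = sm.OLS(y, X).fit()

    # Residuals on the Y-axis against predicted (fitted) values on the X-axis.
    plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
    plt.axhline(0, color="red", linestyle="--")
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs. Predicted Values")
    plt.show()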

Heteroscedasticity in a plot indicates that one or more of the Seven Classical Assumptions of Ordinary Least Squares (OLS) have been violated. The assumptions of OLS are:

1. The regression model is linear in the coefficients and the error term. 

2. The errors have a mean of zero. 

3. The predictors are uncorrelated with the errors. 

4. The errors are uncorrelated with each other. 

5. The error terms have constant variance. 

6. No predictor is a perfect linear function of other predictors. 

7. The errors are normally distributed. 

Think of the 5th assumption of OLS, constant variance, as more of a symptom: a pattern manifests itself in the errors when one or more of the other six assumptions of OLS are not met.

For linear regression models (models that estimate a numerical value from a set of predictors), data scientists and statisticians plot the model's errors (the vertical difference between actual and predicted values) against its predicted values. If there is no apparent pattern in the plot and all the dots appear randomly scattered from left to right, the model is valid, as shown in the first plot in Figure 3 below. 

This is known as homoscedasticity, a combination of the Greek roots "homo" (same) and "skedastikos" (dispersion). If the plot displays an evident pattern, such as the bowtie or fan shape shown in the second and third plots in Figure 3 below, the model is invalid. Those two plots display heteroscedasticity (different dispersion). 

Figure 3. Plots of a linear regression model's residuals (error terms) by its fitted values should display no discernible pattern. A valid model looks like a random cloud, as shown in the first plot, which displays homoscedasticity, meaning same dispersion.

Please note that linear regression models are not limited to OLS models. The assumptions about error terms for OLS hold true for any model that predicts a continuous numerical target, including decision trees, random forests, boosted trees, and neural networks.
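
As a hedged example of the same check applied outside OLS, the sketch below plots residuals against predictions for a random forest. The use of scikit-learn, the simulated data, and all parameter choices are assumptions for illustration and do not come from the article's repository:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor

    # Simulated data with a nonlinear signal plus noise (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 0] + rng.normal(0, 0.2, 300)

    # Any model that predicts a continuous target can be checked the same way:
    # plot its errors (actual minus predicted) against its predicted values.
    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    predicted = forest.predict(X)
    residuals = y - predicted

    plt.scatter(predicted, residuals, alpha=0.5)
    plt.axhline(0, color="red", linestyle="--")
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs. Predicted Values (Random Forest)")
    plt.show()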

Graphs for Model Evaluation

Consider the StatsModels OLS model in Figure 4 (below) with an Adjusted R-squared value of ~0.90 and a residual plot that shows constant variance with no discernible pattern. The Adjusted R-squared value simply means that ~90% of the variance in the target variable can be explained by its predictors.

Figure 4. Output from a StatsModels OLS Model
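
A minimal sketch of how a two-predictor model like the one summarized in Figure 4 might be fit with StatsModels appears below. The simulated data, coefficients, and variable names are illustrative assumptions, not the actual data behind Figure 4:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in for the data behind Figure 4 (illustrative only):
    # X1 drives Y strongly, while X2 adds only a weak, noisy contribution.
    rng = np.random.default_rng(7)
    n = 500
    df = pd.DataFrame({"X1": rng.normal(0, 1, n), "X2": rng.normal(0, 1, n)})
    df["Y"] = 3.0 + 2.5 * df["X1"] + 0.3 * df["X2"] + rng.normal(0, 0.8, n)

    # Fit the OLS model and inspect the summary, including Adjusted R-squared.
    ols = smf.ols("Y ~ X1 + X2", data=df).fit()
    print(ols.summary())
    print("Adjusted R-squared:", round(ols.rsquared_adj, 3))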

While residual plots are used to examine linear regression models for violations of the assumptions of OLS, they don’t provide information on how well a model matches the data. 

A relatively recent innovation and alternative is the marginal model plot. The marginal model plot is a practical graphical tool for visualizing how well a model fits the data by comparing a model’s predicted line of fit (ŷ pronounced “y-hat”) to the actual line of fit (Y). 

To create a marginal model plot, we overlay the model's predicted values (ŷ, pronounced "y-hat") on the actual values (Y) on the Y-axis, against any one of the continuous numerical predictors on the X-axis. We then apply the LOESS (aka LOWESS) smoother to both the actual values (Y) and the predicted values (ŷ). More about LOESS can be found at https://www.itl.nist.gov/div898/handbook/pmd/section1/pmd144.htm
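
One way to construct such a plot, sketched here under the assumption of StatsModels' LOWESS smoother and Matplotlib, is shown below. The helper function name, the smoothing fraction, and the colors are illustrative choices, not code from the article's repository:

    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def marginal_model_plot(x, y_actual, y_predicted, predictor_name="X"):
        # Overlay LOESS-smoothed actual values (Y) and predicted values (y-hat)
        # against a single continuous predictor.
        actual_smooth = lowess(y_actual, x, frac=0.5)
        predicted_smooth = lowess(y_predicted, x, frac=0.5)

        plt.scatter(x, y_actual, alpha=0.3, color="gray")
        plt.plot(actual_smooth[:, 0], actual_smooth[:, 1],
                 color="red", label="Actual (Y)")
        plt.plot(predicted_smooth[:, 0], predicted_smooth[:, 1],
                 color="blue", label="Predicted (y-hat)")
        plt.xlabel(predictor_name)
        plt.ylabel("Y")
        plt.legend()
        plt.title("Marginal Model Plot for " + predictor_name)
        plt.show()

Calling a helper like this once per continuous predictor, passing the model's fitted values as ŷ, produces one marginal model plot per predictor, analogous to Figures 5a and 5b.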

Figures 5a and 5b (below) show a marginal model plot for each of the two predictors in the OLS model from Figure 4 (above). As Figure 5a reveals, X1 is a close fit for Y: the blue line for predicted values lies very close to the red line for actual values. However, even though X2 is statistically significant (its p-value < 0.05), it is not a good predictor of Y, as evidenced by the distance between the blue predicted line and the red actual line.

In Figures 5a and 5b (below), a marginal model plot was created for each predictor (X1 and X2) in the OLS model specified in Figure 4. The blue lines represent the predicted values; the red lines represent the actual target values in the data. Because the blue line in Figure 5a is very close to the red line, X1 is an excellent predictor of the target variable. But as we see in Figure 5b, the blue line is not very close to the red line, meaning that X2 is not a very good predictor of the target variable.

Figure 5a. Marginal Model Plot for X1.

Figure 5b. Marginal Model Plot for X2.

Even though the predictor in Figure 5b (above) is statistically significant in the linear regression model, it isn't a good fit for the data after all! This example demonstrates the importance of using graphs to assess how well a model fits the data.

For Part 3 of this series, I’ll cover a practical example of marginal model plots as applied to Logistic Regression.
I hope you enjoyed this article.

Bibliography

Sanford Weisberg, Applied Linear Regression, 3rd Edition, pp. 185-190, 198, 2005

 

Written by:
John Bonfardeci
Lead Data Scientist, Senior Software and Data Engineer at Definitive Logic