Introduction 

With the prevalence of online courses in machine learning (ML) and the popularity of Python and R, there is no question about the enthusiasm for ML. It’s no longer the dark magic reserved for the “quants” who lurked in the shadows of financial companies pre-2008. ML is an exciting subject that promises a great deal of predictive power, and countless introductory online tutorials and evangelists make it seem as easy as calling model.fit(X).

While amazing ML libraries such as Scikit-Learn and StatsModels have abstracted away much of the complexity of ML, they don’t exempt the aspiring data scientist from the burden of proving that a trained model is a good fit for the data. Too many online tutorials never address the vital topic of model validation or which validation methods are the most reliable.

Judging model validity solely by predictive accuracy on a validation dataset is not enough. Deploying an invalid model to production can have disastrous results for an organization if decision-makers heed its faulty predictions.

In this article, we will summarize visual methods for model validation. We’ll also cover why they’re the most appropriate tests for assessing how well a model fits the data.

External Materials

All code used to produce the visuals in this article can be cloned or downloaded from https://github.com/jbonfardeci/model-validation-blog-post.

Model Evaluation

Model evaluation is one of a series of steps in CRISP-DM, the industry-recognized Cross-Industry Standard Process for Data Mining. While the “Data Mining” part of the name may seem irrelevant to ML, the steps defined by this standard ensure we follow best practices to produce valid models. See Figure 1 (below) for the major steps.

Figure 1. CRISP-DM Process. We iterate through these steps until we achieve the most accurate, parsimonious (easy to explain), and valid model.

As illustrated by the directional arrows in Figure 1 (above), we can iterate through the steps until we reach an optimal state. Discussing every step (and sub-step) of CRISP-DM is out of scope for this article, but model validation falls under the Model Evaluation step. You can read more about CRISP-DM at https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome.

Anscombe’s Quartet 

Statisticians and data scientists use many types of charts to describe models and assess their validity. In 1973, the well-known statistician Francis Anscombe set out to prove the importance of graphing data. He constructed four datasets, now known as Anscombe’s Quartet (Figure 2a below), that share the same mean, standard deviation, correlation, and regression line, yet each is qualitatively different.

Figure 2a. Anscombe’s Quartet. Each dataset above shares the same regression line, mean of y = 7.50, standard deviation of y = 1.94, r-value = 0.82, and r-squared = 0.67.
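
If you want to verify these shared statistics yourself, the short sketch below is a standalone example (separate from the code in the repository linked above, and assuming NumPy and SciPy are installed). It hard-codes Anscombe’s published values and uses SciPy’s linregress to recover the mean, standard deviation, correlation, and fitted line for each dataset.

# A standalone sketch (not the repository code) that reproduces the shared
# summary statistics behind Figure 2a from Anscombe's published values.
import numpy as np
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    fit = stats.linregress(x, y)  # slope, intercept, r-value, p-value, std. error
    # y.std() is the population standard deviation (ddof=0), matching the 1.94 above
    print(f"{name}: mean(y)={y.mean():.2f}  sd(y)={y.std():.2f}  "
          f"r={fit.rvalue:.2f}  r-squared={fit.rvalue ** 2:.2f}  "
          f"line: y = {fit.intercept:.2f} + {fit.slope:.2f}x")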

Figure 2b. Plots of residuals (error terms) by predicted values for Anscombe’s Quartet. Each dataset has a nearly identical residual sum of squares (RSS), but the error terms behave very differently. Only the top two models display homoscedasticity with no discernible pattern.

If we evaluated each model only by its regression line, with an r-value of 0.82 and an r-squared of 0.67, all four models would appear identical in terms of accuracy. In reality, only the first model is a valid linear model: the second is curvilinear, the third has a bad leverage point that skews the linear fit, and the fourth holds X constant for all but one observation. When we plot the residuals (error terms) against the predicted values for each model (Figure 2b above), the top two models appear identical, while the bottom two are very different even though their RSS values are nearly equal (24.65 and 24.63, respectively).
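
Residual-versus-predicted plots like those in Figure 2b take only a few lines to produce. The sketch below is a minimal illustration rather than the repository code: it continues from the quartet dictionary defined in the previous snippet, fits an ordinary least squares model to each dataset with StatsModels, and plots the residuals against the fitted values with Matplotlib.

# Continues from the `quartet` dictionary defined in the previous snippet.
# Fits OLS to each dataset and plots its residuals against its predicted values.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)

for ax, (name, (x, y)) in zip(axes.ravel(), quartet.items()):
    X = sm.add_constant(np.asarray(x, dtype=float))  # add the intercept column
    model = sm.OLS(np.asarray(y, dtype=float), X).fit()
    ax.scatter(model.fittedvalues, model.resid)      # residuals vs. predictions
    ax.axhline(0.0, linestyle="--")                  # reference line at zero error
    ax.set_title(f"Dataset {name}")
    ax.set_xlabel("Predicted value")
    ax.set_ylabel("Residual")

fig.tight_layout()
plt.show()

A valid fit should show a structureless, evenly spread cloud of points around the zero line; curvature, funnels, or isolated extreme points are the visual warnings that the summary statistics miss.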

Many of the validity tests used in data science reduce goodness of fit to a single number and a cutoff. For example, if the p-value (probability value) of a goodness-of-fit test is low, say below 0.05, we reject the model; if it is high, we fail to reject it. As we can see from Anscombe’s famous example, such tests do not really tell you anything qualitative about the model.
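
To make that concrete, the snippet below (again reusing the hard-coded quartet dictionary from the first sketch, not the repository code) prints two common single-number diagnostics for each dataset: r-squared and the p-value of the overall F-test of the regression. All four datasets report essentially the same “significant” numbers, even though only the first one is a valid linear fit.

# Reuses the `quartet` dictionary from the first snippet.
# Single-number diagnostics cannot tell the four datasets apart.
import numpy as np
import statsmodels.api as sm

for name, (x, y) in quartet.items():
    X = sm.add_constant(np.asarray(x, dtype=float))
    fit = sm.OLS(np.asarray(y, dtype=float), X).fit()
    print(f"{name}: r-squared = {fit.rsquared:.2f}, F-test p-value = {fit.f_pvalue:.4f}")

# Every dataset prints r-squared of about 0.67 and p of about 0.002, so by the
# numbers alone all four fits look equally acceptable.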

In Part 2 of this series, I’ll cover a practical example of marginal model plots applied to linear regression.
I hope you enjoyed this article.

 

Written by:
John Bonfardeci
Lead Data Scientist, Senior Software and Data Engineer at Definitive Logic