This article is Part 3 of a three-part series. In the previous article, I walked through example graphs for validating linear regression models. In this article, we will take a closer look at a practical application for Logistic Regression.

Conventions Used for this Article

The following subsection category names have been defined based on the preferences of the audience. The nomenclature was selected based on my favorite race of humanoids in the Star Trek universe.

For Vulcans
Includes deep technical and academic content for ML practitioners. 

For Non-Vulcans
Includes content at a high level for managers and other business-level professionals who may work with ML practitioners.

External Materials

All code used to produce the visuals in this article can be cloned or downloaded from

Validating Classification Models

Logistic Regression and Multiple Logistic Regression are types of classification models for two or more labels of a target variable. For classification models, the convention is to employ a goodness-of-fit (GoF) test to determine whether the model has been specified correctly. The null hypothesis (H0) of such a test is that the model fits the data. If the GoF test results in a p-value (probability value) that is less than the significance level, say alpha=0.05, we reject H0 and conclude that the model fits poorly. Otherwise, we fail to reject H0 and treat the fit as adequate.

The GoF test most commonly applied to classification models is the Hosmer-Lemeshow (HL) test. But the HL test has serious problems. In particular, even slight changes to its arbitrary hyperparameter, the number of groups into which observations are binned (and therefore the group size), can flip the result between a false negative and a false positive for GoF.

Furthermore, the HL test has been shown to produce wild swings in p-values on large data sets, where even trivial differences between observed and expected values become statistically significant.
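
Both weaknesses can be reproduced on simulated data. The following is a minimal sketch, not code from this series: the `hosmer_lemeshow` and `simulate` helpers, the 0.05 intercept shift, and the sample sizes are all assumptions chosen for illustration. It computes the HL statistic by hand with NumPy and SciPy, then evaluates the same slightly miscalibrated model with different group counts and with different sample sizes.

```python
import numpy as np
from scipy.stats import chi2


def hosmer_lemeshow(y_true, p_hat, n_groups=10):
    """Return (statistic, p_value) for the Hosmer-Lemeshow test.

    Observations are sorted by predicted probability and split into
    n_groups bins of near-equal size ("deciles of risk" when n_groups=10).
    """
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    order = np.argsort(p_hat)
    bins = np.array_split(order, n_groups)

    stat = 0.0
    for idx in bins:
        n_g = len(idx)
        observed = y_true[idx].sum()   # events actually seen in the group
        expected = p_hat[idx].sum()    # events the model expects in the group
        mean_p = expected / n_g
        # Squared difference scaled by the binomial variance of the group
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))

    # Conventional degrees of freedom for the HL test: groups minus two
    dof = n_groups - 2
    return stat, chi2.sf(stat, dof)


rng = np.random.default_rng(0)


def simulate(n):
    """Hypothetical data: the model misses a small intercept shift of 0.05."""
    x = rng.normal(size=n)
    p_true = 1.0 / (1.0 + np.exp(-(0.05 + x)))  # data-generating probabilities
    p_model = 1.0 / (1.0 + np.exp(-x))          # slightly miscalibrated model
    y = rng.binomial(1, p_true)
    return y, p_model


y_small, p_small_hat = simulate(2_000)
y_large, p_large_hat = simulate(500_000)

# Same data, same model, different number of groups
p_by_groups = {g: hosmer_lemeshow(y_small, p_small_hat, g)[1] for g in (8, 10, 12)}

# Same miscalibration, different sample size
_, p_small = hosmer_lemeshow(y_small, p_small_hat)
_, p_large = hosmer_lemeshow(y_large, p_large_hat)

print("p-values by group count (n=2,000):", p_by_groups)
print("p-value with n=2,000:  ", p_small)
print("p-value with n=500,000:", p_large)
```

In a run of this sketch, the p-value for the large sample should collapse toward zero even though the miscalibration is identical in both samples, while the p-values for 8, 10, and 12 groups on the small sample will generally differ from one another.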

The Hosmer-Lemeshow test detected a statistically significant degree of miscalibration in both models, due to the extremely large sample size of the models, as the differences between the observed and expected values within each group are relatively small.

~ Journal of Palliative Medicine. Volume 12, Number 2, 2009.

Per the quote above, the differences between the observed and expected values were relatively small, meaning the model explained the target variable Y reasonably well. But the HL GoF test said otherwise! For practical purposes, this is a false positive, or what’s known as a Type I Error. In other words, the test rejected the null hypothesis (H0) that the model fits even though the fit was adequate.

While we described marginal model plots in the context of linear regression models, they also work very well for classification models that predict the probability of an observation belonging to a class. This applies to popular classifier models such as logistic regression, decision trees, random forests, boosted trees, and Support Vector Machines. To create a marginal model plot for a classification model, we can use the same function described for linear regression models. Figures 6a-6c (below) show the result for a two-class logistic regression model.

In Figures 6a-6c below, we overlay the model’s predicted probability for each observation on the actual Y values on the Y-axis, plotted against any one of the continuous numerical predictors on the X-axis. Even though Y takes only a finite set of integer values (0, 1, 2, …, n) that indicate class labels, the LOESS (aka LOWESS) function “smooths” both the Y and predicted Y (ŷ) values so we can compare apples to apples.

Marginal Model Plot for Y & ŷ by X1

Figure 6a. X1 is a very good predictor of Y.

Marginal Model Plot for Y & ŷ by X2

Figure 6b. X2 is a good predictor of Y except for values between ~1.0 and ~3.5.

Marginal Model Plot for Y & ŷ by X3

Figure 6c. X3 is a poor predictor of Y.

In Figure 6a (above), the blue ‘Predicted’ line is very close to the red ‘Actual’ line, so X1 is an extremely good predictor of Y across all of its values. In Figure 6b (above), X2 is also a very good predictor of Y, except within the range of approximately 1.0 to 3.5 on the X-axis. In this case, we would investigate why X2 predicts poorly in this range, for example because of outliers. We may also look for interactions between X2 and another variable Xn. If an interaction is found, we would add a new term, X2 * Xn, to the linear formula, which may or may not improve the fit. Finally, in Figure 6c, predictor X3 is clearly a poor fit given the distance between the predicted and actual values, so we may decide it isn’t useful to keep in the model.
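
Whether an interaction term actually improves the fit can be checked by fitting the model with and without it. The sketch below is one possible way to do that, assuming the statsmodels formula API and simulated data. The column names `x2` and `xn`, the coefficients, and the use of AIC plus a likelihood-ratio test as the comparison are illustrative assumptions, not the procedure used for the figures above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Simulated data with a genuine x2 * xn interaction (hypothetical values)
rng = np.random.default_rng(7)
n = 3_000
data = pd.DataFrame({"x2": rng.normal(size=n), "xn": rng.normal(size=n)})
true_logit = 0.5 * data["x2"] - 0.5 * data["xn"] + 1.0 * data["x2"] * data["xn"]
prob = 1.0 / (1.0 + np.exp(-true_logit.to_numpy()))
data["y"] = rng.binomial(1, prob)

# Model without and with the interaction term (x2:xn is the product term)
reduced = smf.logit("y ~ x2 + xn", data=data).fit(disp=0)
full = smf.logit("y ~ x2 + xn + x2:xn", data=data).fit(disp=0)

# Likelihood-ratio test: one extra parameter, so one degree of freedom
lr_stat = 2.0 * (full.llf - reduced.llf)
lr_p_value = chi2.sf(lr_stat, 1)

print(f"AIC without interaction: {reduced.aic:.1f}")
print(f"AIC with interaction:    {full.aic:.1f}")
print(f"Likelihood-ratio p-value: {lr_p_value:.3g}")
```

A lower AIC and a small likelihood-ratio p-value suggest the interaction is worth keeping. Redrawing the marginal model plot for X2 with the new model's predictions is the visual counterpart of the same check.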

Key Takeaways

In summary, model validation is a serious responsibility. If ignored, the consequences for key decision-makers can be disastrous. Just because a model appears to be as accurate on the validation data set as it was on the training data set does not mean the model is a good fit for the data. Visual analysis of diagnostic plots is a more reliable way to evaluate goodness-of-fit (GoF) than relying on R-squared or p-values. I hope you enjoyed this series, and thank you for reading.


Hosmer D.W. and Lemeshow S. (1980) “A goodness-of-fit test for the multiple logistic regression model.” Communications in Statistics A10:1043-1069.

Allison, Paul. “Hosmer-Lemeshow Test for Logistic Regression: Statistical Horizons.” Statistical Horizons | Statistics Training That Makes Sense, Statistical Horizons, 27 Nov. 2019.

Journal of Palliative Medicine. Volume 12, Number 2, 2009.

Written by:
John Bonfardeci
Lead Data Scientist, Senior Software and Data Engineer at Definitive Logic