Evaluating Scorecard Models
You have put in a lot of hard work generating a scorecard model. Just imagine: you have looked through possibly hundreds of predictor variables and selected those that were most important to your model. You have discretized them, examining weight of evidence to verify that they have been properly prepared for use in developing your model. All of your discretization scripts have been used to prepare your predictors for building a logistic regression model, which is in turn used to create your scorecard. Now that you have your model, how do you determine whether it performs as you expect?
There is a host of statistics and graphs that you can use to help you determine if your model is performing at the level you expect. The Kolmogorov-Smirnov statistic measures how much the score distribution of the “goods” differs from that of the “bads,” and varies from a low of 0 to a high of 1.0. The Gini score reflects the overall unevenness in the relative frequencies of values along the range of scores, or a measure of the predictability of a model, and also ranges from a low of 0 to a high of 1.0. Divergence measures the overall separation between the “goods” and the “bads,” and ranges from a low of 0 to arbitrarily high positive values. The Hosmer-Lemeshow statistic is also a form of minimum-distance test incorporating Chi-Square values, and it is evaluated like an ordinary Chi-Square value. The Receiver Operating Characteristic (ROC) curve is created by plotting the true-positive rate (sensitivity) against the false-positive rate (1 − specificity). The area underneath the ROC curve varies from a low of 0 to a high of 1.0, the entire area between the axes, with 0.5 indicating no discriminating power. Finally, a lift chart helps you visualize the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model.
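As a rough illustration, the first two of these statistics can be computed directly from a model’s scores. The sketch below, using only NumPy and synthetic score data (the functions, score distributions, and numbers are all hypothetical, chosen for illustration), estimates the KS statistic from the empirical distributions and the area under the ROC curve from pairwise score comparisons; the Gini score is then commonly reported as 2 × AUC − 1.

```python
import numpy as np

def ks_statistic(scores_good, scores_bad):
    """Kolmogorov-Smirnov: the largest gap between the empirical CDFs
    of the 'good' and 'bad' score distributions (ranges 0 to 1)."""
    thresholds = np.sort(np.concatenate([scores_good, scores_bad]))
    cdf_good = np.searchsorted(np.sort(scores_good), thresholds,
                               side="right") / len(scores_good)
    cdf_bad = np.searchsorted(np.sort(scores_bad), thresholds,
                              side="right") / len(scores_bad)
    return np.max(np.abs(cdf_good - cdf_bad))

def auc_statistic(scores_good, scores_bad):
    """Area under the ROC curve via its rank interpretation: the
    probability that a random 'good' outscores a random 'bad'."""
    wins = (scores_good[:, None] > scores_bad[None, :]).sum()
    ties = (scores_good[:, None] == scores_bad[None, :]).sum()
    return (wins + 0.5 * ties) / (len(scores_good) * len(scores_bad))

# Synthetic example: goods score higher on average than bads.
rng = np.random.default_rng(0)
goods = rng.normal(650, 50, 1000)
bads = rng.normal(580, 50, 1000)
print(f"KS   = {ks_statistic(goods, bads):.3f}")
print(f"AUC  = {auc_statistic(goods, bads):.3f}")
print(f"Gini = {2 * auc_statistic(goods, bads) - 1:.3f}")
```

A model with no discriminating power would show KS near 0 and AUC near 0.5; the wider the separation between the two score distributions, the closer both climb toward 1.0.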
Needless to say, going through each of these would require more than one blog post, and you really need to contrast and compare the results of many of these statistics and graphs to see how well your model is performing. To whet your appetite for comparing your models, I’m going to stick to one of these options: the lift chart.
A lift chart is shown above. The X-axis is graduated in deciles, or bins of 10% of the total cases modeled. The Y-axis is graduated in the lift index, a factor expressing how much better the model performs in each decile. The model line is plotted by taking the ratio between the results predicted by our model and the results expected using no model.
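To make that plotting concrete, here is a minimal sketch of how the lift values behind such a chart might be computed, assuming you have a model score and an observed response flag for each customer (the data below is synthetic, generated so that the response probability rises with the score):

```python
import numpy as np

def decile_lift(scores, responded):
    """Lift index per decile: the response rate within each
    model-ranked decile divided by the overall (no-model) rate."""
    order = np.argsort(scores)[::-1]           # best-scored customers first
    responded = np.asarray(responded)[order]
    deciles = np.array_split(responded, 10)    # 10 bins of ~10% each
    base_rate = responded.mean()               # rate with no model at all
    return [bin_.mean() / base_rate for bin_ in deciles]

# Synthetic data: higher scores respond more often.
rng = np.random.default_rng(1)
scores = rng.uniform(0, 1, 10_000)
responded = rng.uniform(0, 1, 10_000) < scores * 0.4

for i, lift in enumerate(decile_lift(scores, responded), start=1):
    print(f"decile {i}: lift {lift:.2f}")
```

Because every customer lands in exactly one equal-sized bin, the lifts average out to 1.0 across the ten deciles; a useful model pushes the early deciles well above that line and the late deciles below it.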
In the lift chart, you can see that the lift values in the lower deciles are higher than the expected value plotted at 1.0, indicating that the model has relatively high predictive power. What does this mean? For now, let us focus on the first decile, the top 10% of customers as ranked by the model.
If we contacted a random 10% of our customers, using no model at all to decide whom to contact, we could expect to reach only 10% of the eventual responders. However, if we used our model to select the top-scoring 10% of our customer base, we could expect to reach between 22% and 24% of the responders. That is a lift of 2.2 to 2.4, meaning our model performs 2.2 to 2.4 times better than no model at all.
Does it make sense to use the model to select more customers to contact? If you contacted a random 80% of your customer base with no model, you would expect to reach 80% of the responders. Using the model to select those customers, you could expect to reach 96% of them, but that is only 1.2 times better than no model at all.
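Whether the model’s selection pays off can be checked with simple arithmetic once you attach costs and values to the campaign. The sketch below uses purely hypothetical unit economics (the customer counts, contact cost, and response value are invented for illustration) to compare the two scenarios above:

```python
def campaign_profit(frac_contacted, lift, total_responders, n_customers,
                    cost_per_contact, value_per_response):
    """Profit under assumed unit economics: the value of responders
    captured (capped at 100% of them) minus the cost of the contacts."""
    captured = total_responders * min(frac_contacted * lift, 1.0)
    cost = frac_contacted * n_customers * cost_per_contact
    return captured * value_per_response - cost

# Hypothetical economics: 100,000 customers, 5,000 eventual responders,
# $2 per contact, $60 of value per captured response.
for frac, lift in [(0.10, 2.2), (0.80, 1.2)]:
    profit = campaign_profit(frac, lift, 5_000, 100_000, 2.0, 60.0)
    print(f"contact {frac:.0%} at lift {lift}: profit ${profit:,.0f}")
```

Under one set of assumed numbers the wider campaign may still come out ahead; under another it may not, which is exactly why the final call belongs to the content expert.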
Is that lift of 1.2 worth the extra cost of contacting 70% more of your customer base? That’s for you, the content expert, to decide. With the tools made available to you through lift charts and a whole host of other statistics and graphs for evaluating your scorecard model, you have the insight to make more informed decisions for your company, maximizing your profit and reducing your risk.