Mitigating model bias in AI is a challenging topic. In practice, it can mean bringing the calibration curves of different subgroups closer together. There is no standard solution for fixing model bias; engineers are essentially asking how to make the model perform better for one or more subgroups. What does exist is a set of standard techniques for improving model performance that can be targeted at subgroups, after which we can observe how they affect subgroup miscalibration.
When evaluating machine learning models for algorithmic bias, one of the main things the team looks at is the calibration curve. A calibration curve measures whether the model score accurately reflects the probability that a sample belongs to the positive class. When calibration curves are compared across subgroups (e.g., plotted separately for men and women), we are essentially asking whether the model systematically over- or underestimates the chance of the outcome occurring for some subgroups. The figure below shows a sample calibration curve. The x-axis shows the model score, while the y-axis shows the fraction of samples labeled positive. The ideal curve lies on the y=x line. For example, of all the samples assigned a score of 0.6 by the model, 60% should be labeled positive. When the curve is above the y=x line, the model is under-predicting; that is, the model score is less than the probability of the sample being labeled positive. When the curve is below the y=x line, the model is over-predicting the probability.
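As a rough illustration of the comparison described above, the sketch below bins scores and computes one calibration curve per subgroup in plain Python. The function names, the equal-width binning, and the toy data are assumptions made for this example; it also assumes scores lie in [0, 1] and labels are 0 or 1.

```python
from collections import defaultdict


def calibration_curve(scores, labels, n_bins=10):
    """Return (mean_score, positive_fraction, count) for each non-empty bin."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # score sum, positives, count
    for score, label in zip(scores, labels):
        # Equal-width bins; a score of exactly 1.0 falls into the last bin.
        b = min(int(score * n_bins), n_bins - 1)
        bins[b][0] += score
        bins[b][1] += label
        bins[b][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in bins if n > 0]


def calibration_by_subgroup(scores, labels, groups, n_bins=10):
    """Compute one calibration curve per subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for score, label, group in zip(scores, labels, groups):
        by_group[group][0].append(score)
        by_group[group][1].append(label)
    return {g: calibration_curve(s, y, n_bins) for g, (s, y) in by_group.items()}


# Toy data, invented for illustration only.
scores = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

curves = calibration_by_subgroup(scores, labels, groups, n_bins=2)
for group, curve in curves.items():
    print(group, curve)
```

In this toy data, the positive fraction for subgroup "a" rises with the score, while for subgroup "b" it stays flat at 0.5 in both bins, which is the kind of disparity a per-subgroup plot would reveal.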
Calculating and plotting calibration curves is relatively easy, and both external and internal libraries can do it. If the calibration curves differ across subgroups, the question becomes how to align them.
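One way to turn "the curves differ" into a number is to summarize each subgroup's curve with the expected calibration error (ECE), the count-weighted average gap between positive fraction and mean score across bins. The sketch below is a minimal version under assumed names and toy data; the article does not prescribe ECE, so treat it as one possible summary rather than the team's metric.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Count-weighted average of |positive fraction - mean score| over bins."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # score sum, positives, count
    for score, label in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)
        bins[b][0] += score
        bins[b][1] += label
        bins[b][2] += 1
    total = len(scores)
    return sum(abs(pos / n - s / n) * n / total for s, pos, n in bins if n > 0)


def ece_by_subgroup(scores, labels, groups, n_bins=10):
    """Compute ECE separately for each subgroup label."""
    result = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = expected_calibration_error(
            [scores[i] for i in idx], [labels[i] for i in idx], n_bins
        )
    return result


# Toy data, invented for illustration only.
scores = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

ece = ece_by_subgroup(scores, labels, groups, n_bins=4)
gap = max(ece.values()) - min(ece.values())  # spread between subgroups
print(ece, gap)
```

A single number hides whether a subgroup is over- or under-predicted, so it complements the plotted curves rather than replacing them.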
When a model is generally uncalibrated, the standard treatment is to apply either Platt scaling or isotonic regression to rescale the model output to reflect probabilities.
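To make the rescaling step concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm in plain Python. The function names and toy data are assumptions for illustration. The sketch predicts with a step function, whereas library implementations typically interpolate between fitted points, and in practice the calibrator would be fit on held-out data rather than the training set.

```python
from itertools import groupby


def isotonic_fit(scores, labels):
    """Fit a non-decreasing step function from raw scores to label means."""
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    pairs = sorted(zip(scores, labels))
    for score, tied in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in tied]  # samples with identical scores share a block
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(model, score):
    """Return the calibrated value of the first block covering the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]  # scores above the fitted range use the last block


# Toy data, invented for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]

model = isotonic_fit(scores, labels)
print(model)
print(isotonic_predict(model, 0.25))
```

The out-of-order labels at scores 0.2 and 0.3 are pooled into a single block with value 0.5, which is how the method enforces a monotone mapping from score to probability.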
It might seem that simply applying these same techniques separately for each subgroup would be a straightforward way to ensure that the calibration curves are the same across subgroups. While it is okay to apply calibration to the overall model score, applying separate calibration to subgroups is not generally recommended, as it treats the symptom and not the underlying problem. Separate calibration also introduces procedural inconsistencies by subgroup, meaning we are using the subgroup label in making predictions. Depending on the context, this may not be acceptable from a policy standpoint. For instance, it may make sense to apply different score corrections by language, but if the subgroup were, say, ideology, it may not be defensible to do so for conservative vs. liberal content.
The use of calibration curves as a proxy to measure model performance across subgroups suggests that reducing disparities in calibration curves comes down to generally improving model performance, perhaps with an emphasis on the subgroup that is more miscalibrated. There is an implicit assumption that models are trained and evaluated on a different metric (e.g., precision at recall, FPR/FNR cost ratio), and that calibration provides another way to look at how the model is working. It is never correct to make the predictions worse for one group just to have equal calibrations.
Here are some things that might work:
Add More Data
- As mentioned above, subgroup miscalibration often occurs due to a lack of training samples for one subgroup relative to the others. So one solution is to collect more training samples for the underrepresented group. Of course, simply collecting more training samples doesn’t guarantee that the model will be able to produce more accurate predictions.
Try a More Sophisticated ML Architecture
- In some cases, the model is not expressive enough to capture some of the nuanced interactions between features, so it is worth exploring alternative model architectures. One option is to train a gradient boosted decision tree, which can capture more complex interactions between features and may improve subgroup calibration as a result.
Improve Feature Engineering
- Ideally, feature engineering would be the preferred way of reducing subgroup miscalibration. A canonical example is a hiring algorithm that learns that long periods without work are negative indicators of job performance. This can have a disproportionate effect on women, who are more likely to have gaps in employment for child or family care. A better feature, both for fairness and for model performance, would differentiate between involuntary unemployment and parental or family leave. Understanding how the meaning of the features varies across subgroups could help close gaps in performance between those groups.
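The hiring example can be sketched as a small feature transformation: instead of one "months without work" feature, the gap is split by reason. Everything here is hypothetical, including the `EmploymentGap` record, the reason categories, and the feature names; it also assumes the reason for each gap is actually available in the data.

```python
from dataclasses import dataclass


@dataclass
class EmploymentGap:
    months: int
    reason: str  # hypothetical categories, e.g. "parental_leave", "layoff"


# Hypothetical set of reasons treated as family or parental leave.
FAMILY_REASONS = {"family_leave", "parental_leave"}


def gap_features(gaps):
    """Split total gap months into family-leave months and all other months."""
    family = sum(g.months for g in gaps if g.reason in FAMILY_REASONS)
    other = sum(g.months for g in gaps if g.reason not in FAMILY_REASONS)
    return {"gap_months_family_leave": family, "gap_months_other": other}


# Toy candidate history, invented for illustration only.
candidate = [EmploymentGap(12, "parental_leave"), EmploymentGap(3, "layoff")]
features = gap_features(candidate)
print(features)
```

With the two features separated, the model can learn different weights for each kind of gap without needing the subgroup label itself.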
Train Separate Models for Each Subgroup*
- Training a separate model is loosely related to adding more data. If we are able to collect “enough” training data for each subgroup, it may make sense to simply build different models. Perhaps different features are relevant for different subgroups. Separating the models also makes it clearer that the emphasis is on improving model performance for each subgroup. On the other hand, training separate models does not scale well as more types of subgroups are considered. For instance, we may care about the gender and age of the post author in addition to the language of the post. Training separate models would then mean training a different model for every possible combination of subgroup identities (e.g., men aged 18-22 speaking English, women aged 18-22 speaking English, etc.).
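The scaling problem is simple arithmetic: the number of models is the product of the number of values on each subgroup axis. The short sketch below enumerates the combinations for a made-up set of axes; the specific languages and age bands are assumptions chosen only to illustrate the growth.

```python
from itertools import product

# Hypothetical subgroup axes and values, for illustration only.
subgroup_axes = {
    "language": ["English", "Spanish", "Hindi"],
    "gender": ["men", "women"],
    "age": ["18-22", "23-29", "30-39", "40+"],
}

# One separate model would be needed per combination of subgroup identities.
combos = list(product(*subgroup_axes.values()))
print(len(combos))  # 3 * 2 * 4 = 24
print(combos[0])
```

Each added axis multiplies the count, and the training data available for each combination shrinks accordingly.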
Add Subgroup Identity as a Feature*
- While adding the subgroup identity as a feature may raise some practical sensitivities, the academic literature suggests that supplying the model with more information is generally preferred (see The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning). Subgroup identity is usually most useful to the model when representativeness is an issue. However, consider the hiring algorithm case outlined above. Given a feature for long periods without work, adding gender as a new feature might allow the model to become better calibrated across subgroups, in that it learns that long periods without work have different meanings for men and women. Yet it would still underestimate the performance of men who took family leave and overestimate the performance of men who were involuntarily unemployed (and make the opposite mistake for women).
Training separate models and adding the subgroup identity as a feature are similar to applying calibration separately for each subgroup, in that they introduce procedural inconsistencies among subgroups. In general, these last two methods should be considered a last resort, since they don’t directly address the relative underperformance of a subgroup. By contrast, sampling additional data from under-represented groups, using more complex models that can capture more complex interactions between features, and doing better feature engineering are effective methods for mitigating bias.