One Model Many Scores

An Interactive Analysis of the Machine Learning Multiverse

This document is an interactive companion article illustrating a subset of the results from the paper "One Model Many Scores: Preventing Fairness Hacking and Evaluating the Influence of Model Design Decisions" by Jan Simson, Florian Pfisterer and Christoph Kern, to be presented at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) 2024 in Rio de Janeiro, Brazil, in June 2024.

A preprint of the paper is available at arxiv.org/abs/2308.16681; the source code of the analyses, as well as of this interactive document, is available at github.com/reliable-ai/fairml-multiverse.

Introduction

We keep the introduction here short and refer to the paper for a proper introduction to the topic and the case study used to generate the present results.

The basic idea of a multiverse analysis is to turn implicit decisions into explicit ones, traversing the garden of forking paths one encounters when conducting an analysis or building a machine learning system.

By systematically combining many decision options, we can evaluate the influence of these decisions on a certain outcome, in our case the performance and fairness of a machine learning model.

For this analysis we combined 9 potential design decisions and 3 evaluation decisions (see Table 1 in the full paper for an overview). Combining these decisions generates a multiverse of 61,440 different system designs / models and 28 evaluation strategies per model. Altogether, this results in 1,720,320 universes 🪐 of how a person may design and evaluate this particular system.
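To make the combinatorics concrete, here is a minimal sketch of how such a multiverse can be enumerated in Python. The decisions and options below are made-up stand-ins; the actual decisions and their settings are listed in Table 1 of the full paper.

```python
# A minimal sketch of enumerating a multiverse of system designs and
# evaluation strategies. The decisions below are hypothetical stand-ins.
from itertools import product

design_decisions = {
    "preprocessing": ["none", "scale"],
    "model": ["logreg", "random_forest", "gbm"],
    "feature_set": ["all", "drop_protected"],
}
evaluation_decisions = {
    "test_split": ["random", "temporal"],
    "subgroups": ["sex", "age"],
}

design_universes = list(product(*design_decisions.values()))
evaluation_universes = list(product(*evaluation_decisions.values()))

# Every combination of a system design with an evaluation strategy is one
# "universe": here 2 x 3 x 2 = 12 designs times 2 x 2 = 4 strategies = 48.
# In the paper, this is 61,440 designs x 28 strategies = 1,720,320 universes.
print(len(design_universes) * len(evaluation_universes))
```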

Please note that the inclusion of certain decisions or settings in our analysis does not mean we condone them. On the contrary, we strongly advise against applying some of these practices and include them only to raise awareness of their potential harm (see Section 2.2.1).

Multiverse Analysis

The steps to conduct a multiverse analysis for algorithmic fairness. Steps 1–4 apply to multiverse analyses in general, whereas steps 5 and 6 are new additions specific to algorithmic fairness.

Model Design

We split our analysis into two parts: decisions of model design and different evaluation strategies. In this first section we will look at the influence of different model design decisions on the performance and fairness of the resulting machine learning models using a default evaluation strategy.

You can first select which metrics you want to focus on. Pick a fairness metric (Equalized Odds Difference or Demographic Parity Difference) and a performance metric (F1 Score, Accuracy or Balanced Accuracy) from the dropdowns below.
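For reference, all of these metrics are standard and can be computed with scikit-learn and fairlearn. The snippet below is a small illustration with made-up predictions, not the evaluation code of the paper:

```python
# Computing the selectable metrics with scikit-learn and fairlearn.
# y_true, y_pred and the group labels are tiny made-up arrays.
import numpy as np
from sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_score
from fairlearn.metrics import (
    equalized_odds_difference,
    demographic_parity_difference,
)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

print("F1:", f1_score(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))

# Both fairness metrics lie in [0, 1]; 0 means perfect parity between groups.
print("Equalized odds diff.:",
      equalized_odds_difference(y_true, y_pred, sensitive_features=group))
print("Demographic parity diff.:",
      demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```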

This interactive analysis is powered by recently released, still-in-development software, which means it can sometimes be a bit unstable. We therefore recommend modifying these two dropdowns only before interacting with the plots below. In case of issues, please 🔄 refresh this page.

Variation of Performance and Fairness Metrics

This plot shows the distribution of the performance and fairness metrics you chose above for all system designs in the multiverse, using a default evaluation strategy.

The table below displays detailed information about all potential system designs in the plot above, including the particular decisions used in them. You can make a selection in the plot above to interactively filter the table below.

Once you have found a particularly interesting design, you can click a row in the table to see more detailed information about it in the next section.

Universes in your selection

Evaluation Strategies

You can pick a system / model from the table above to see how different evaluation strategies affect the exact same model's fairness metric.

You can see that for many models the fairness metric varies substantially depending on which evaluation strategy one chooses. The fairness metrics we used here can range from 0 to 1, and in many cases their whole range is covered just by different ways of evaluating the exact same model.
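As a hedged illustration of this effect, the sketch below scores one fixed set of predictions under different group definitions, which is just one of the ways an evaluation strategy can vary; the strategies actually varied in our analysis are described in Section 2.2.1 of the paper. All data below is simulated:

```python
# One fixed model, several evaluation strategies: the same predictions can
# look fair or unfair depending on how the groups are defined.
import numpy as np
from fairlearn.metrics import equalized_odds_difference

rng = np.random.default_rng(0)
n = 1000
sex = rng.choice(["female", "male"], n)
age = rng.choice(["young", "old"], n)
y_true = rng.integers(0, 2, n)
# A model whose errors happen to correlate with sex but not with age:
y_pred = np.where(sex == "male", y_true, rng.integers(0, 2, n))

strategies = {
    "grouped by sex": sex,
    "grouped by age": age,
    "grouped by sex x age": np.char.add(np.char.add(sex, "/"), age),
}
for name, groups in strategies.items():
    score = equalized_odds_difference(y_true, y_pred, sensitive_features=groups)
    print(f"{name}: {score:.2f}")  # same model, very different scores
```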

Variation of Fairness Metric for a Single Model

The plot below displays the distribution of the chosen fairness metric across different evaluation strategies for the model you chose from the table above.

Chosen Model / System 🛠️

These are the settings of your chosen ML system:

If you are curious what exactly each decision and its setting refer to, check out Section 2.2.1 in the full paper, which contains detailed explanations.

Conclusion

We hope this interactive analysis gave you an idea of how much design and evaluation choices can affect a model's metrics.

If this analysis sparked your interest, we recommend checking out the full paper, which contains further analyses. There, we demonstrate how to determine which decisions or decision combinations influence a metric the most and how one can examine just 1% of the multiverse and still discover the most important decisions.
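As a toy illustration of the sampling idea (this is not the paper's exact procedure, and the data below is simulated): one can draw a small random sample of universes and compare, per decision, how much of a metric's variance that decision's options explain.

```python
# Toy sketch: estimate decision importance from a 1% sample of universes by
# comparing the variance of per-option means to the total metric variance.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical multiverse results: each row is one universe.
full = pd.DataFrame({
    "scaler": rng.choice(["none", "standard"], 10_000),
    "model": rng.choice(["logreg", "tree", "gbm"], 10_000),
    "fairness": rng.random(10_000),
})
# Make one decision matter: "model" shifts the metric.
full["fairness"] += full["model"].map({"logreg": 0.0, "tree": 0.2, "gbm": 0.4})

sample = full.sample(frac=0.01, random_state=0)  # inspect just 1% of universes

for decision in ["scaler", "model"]:
    # Variance of the per-option means relative to the total variance,
    # as a crude importance score for this decision.
    between = sample.groupby(decision)["fairness"].mean().var()
    print(decision, round(between / sample["fairness"].var(), 3))
```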

Technical Details

This document is built using Mosaic and Observable Framework. Data is handled via DuckDB to allow for quick visualizations. All dynamic data processing and visualization happens in your browser.

To keep the size of the data small, all metrics were rounded to two decimal places. The full, non-rounded dataset is available in our repository for further analysis.
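If you want to work with the full dataset, it can also be queried directly with DuckDB from Python. The file name below is a placeholder; see the repository for the actual data files:

```python
# Minimal sketch of querying the full, non-rounded results with DuckDB,
# assuming the data has been downloaded from the repository.
# 'multiverse_results.parquet' is a placeholder file name.
import duckdb

con = duckdb.connect()
df = con.sql("SELECT * FROM 'multiverse_results.parquet' LIMIT 5").df()
print(df)
```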

Support

This work is supported by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research.

Logos: Konrad Zuse Schools of Excellence in reliable Artificial Intelligence · DAAD Zuse Schools · Federal Ministry of Education and Research