DIVERSE: Finding the Many Faces of AI Decision-Making
Our paper “DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration” has been accepted at ICLR 2026, one of the top venues for machine learning research. This is joint work with my PhD student Gilles Eerlings, Brent Zoomers, Jori Liesenborgs, and Gustavo Rovelo Ruiz at the Digital Future Lab. More details on the publication page.
The Problem: Many Roads Lead to the Same Accuracy
Imagine training an AI to recognize diseases in medical images. You might end up with a model that’s 95% accurate — but here’s the surprising part: there could be hundreds of different models that all achieve that same 95% accuracy, yet make their decisions in completely different ways.
One model might focus on the shape of a lesion. Another might rely on texture. A third might use color patterns. All equally accurate on average, but they might disagree on specific patients.
This phenomenon is called the Rashomon Effect (named after the classic film where witnesses give contradictory accounts of the same event). Understanding these alternative models matters because:
- For fairness: Two equally accurate models might treat certain groups of people differently
- For trust: If models disagree on your case, maybe the prediction isn’t as certain as it seems
- For safety: In high-stakes decisions (medical diagnosis, loan approvals), we need to know when AI is genuinely confident
The Challenge: Finding These Alternative Models is Expensive
The traditional way to find these alternative models is to train many versions from scratch — each time with different random starting points. But training a single deep learning model can take hours or days. Training hundreds? Impractical for most real-world applications.
Our Solution: DIVERSE
We developed DIVERSE, a method that discovers diverse alternative models without retraining from scratch. Think of it like this:
Instead of building many different houses, we take one house and systematically explore how rearranging the furniture changes how people move through it.
Technically, we add small “adjustment knobs” to an already-trained model and use an intelligent search algorithm to find settings that produce different predictions while maintaining the same overall accuracy.
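To make the idea concrete, here is a minimal sketch in plain NumPy. This is not the paper's implementation: a tiny linear classifier stands in for a trained deep network, and a simple evolutionary loop searches over small weight perturbations (the "adjustment knobs") that maximize disagreement with the base model while keeping accuracy within a tolerance. All data, dimensions, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained model: a linear classifier on synthetic data.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(int)
w_base = w_true + 0.1 * rng.normal(size=5)  # the "already-trained" weights

def predict(w):
    return (X @ w > 0).astype(int)

def accuracy(w):
    return (predict(w) == y).mean()

base_preds = predict(w_base)
base_acc = accuracy(w_base)
tolerance = 0.02  # stay within 2 accuracy points of the base model

def disagreement(v):
    """Fraction of inputs where the perturbed model flips its prediction."""
    return (predict(w_base + v) != base_preds).mean()

# Evolutionary search over small perturbation vectors. The zero vector is
# always admissible, so the population is never empty.
candidates = [np.zeros(5)] + [0.05 * rng.normal(size=5) for _ in range(20)]
population = [v for v in candidates
              if accuracy(w_base + v) >= base_acc - tolerance]
for _ in range(50):
    # Mutate each candidate, keep only settings within the accuracy tolerance,
    # then select the perturbations that induce the most disagreement.
    children = [v + 0.02 * rng.normal(size=5) for v in population]
    pool = [v for v in population + children
            if accuracy(w_base + v) >= base_acc - tolerance]
    pool.sort(key=disagreement, reverse=True)
    population = pool[:20]

best = population[0]
print(f"base acc={base_acc:.2f}, "
      f"alt acc={accuracy(w_base + best):.2f}, "
      f"disagreement={disagreement(best):.2f}")
```

Because every surviving candidate is filtered against the accuracy tolerance, each member of the final population is an "equally accurate" alternative model, and the fittest one changes its predictions on as many inputs as the search could find.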

The Results
We evaluated DIVERSE on MNIST, PneumoniaMNIST (medical imaging), and CIFAR-10. Compared to retraining from scratch, DIVERSE generates candidate models in minutes rather than hours — while achieving comparable diversity.

On MNIST, nearly all generated models remained within the accuracy tolerance. On the more challenging medical imaging and CIFAR-10 tasks, DIVERSE consistently outperformed dropout-based sampling and matched or exceeded retraining on key diversity metrics — particularly at higher tolerance levels.
Why This Matters
With DIVERSE, researchers and practitioners can now ask important questions that were previously too expensive to answer:
- “How certain is this AI really?” — If many alternative models agree, we can be more confident. If they disagree, we should be cautious.
- “Is this decision fair?” — We can check whether different models would treat different groups of people consistently.
- “Can we find a better model for this specific use case?” — Among equally accurate models, some might be more interpretable or more robust.
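The first question has a direct operational form: collect predictions from several equally accurate models and flag the inputs where they disagree. Here is a short sketch with hypothetical predictions (not data from the paper) using per-input majority agreement as the confidence signal.

```python
import numpy as np

# Predictions from three equally accurate models on the same five inputs.
# (Hypothetical values, for illustration only.)
preds = np.array([
    [0, 1, 1, 0, 1],   # model A
    [0, 1, 0, 0, 1],   # model B
    [0, 1, 1, 0, 0],   # model C
])

def agreement(preds):
    """Per-input fraction of models that vote for the majority class."""
    n_models = preds.shape[0]
    majority_votes = np.max(
        [np.sum(preds == c, axis=0) for c in np.unique(preds)], axis=0)
    return majority_votes / n_models

scores = agreement(preds)              # ~[1.0, 1.0, 0.67, 1.0, 0.67]
uncertain = np.where(scores < 1.0)[0]  # inputs 2 and 4: models disagree
print(scores, uncertain)
```

Inputs where all alternative models agree can be reported with confidence; inputs with low agreement are exactly the cases where the single-model prediction is less certain than it seems.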
Citation
@inproceedings{eerlings2026diverse,
  title={{DIVERSE}: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration},
  author={Eerlings, Gilles and Zoomers, Brent and Liesenborgs, Jori and {Rovelo Ruiz}, Gustavo and Luyten, Kris},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}