Paper accepted at ICLR 2026: DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration


DIVERSE: Finding the Many Faces of AI Decision-Making

Our paper “DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration” has been accepted at ICLR 2026, one of the top venues for machine learning research. This is joint work with my PhD student Gilles Eerlings, Brent Zoomers, Jori Liesenborgs, and Gustavo Rovelo Ruiz at the Digital Future Lab. More details on the publication page.

The Problem: Many Roads Lead to the Same Accuracy

Imagine training an AI to recognize diseases in medical images. You might end up with a model that’s 95% accurate — but here’s the surprising part: there could be hundreds of different models that all achieve that same 95% accuracy, yet make their decisions in completely different ways.

One model might focus on the shape of a lesion. Another might rely on texture. A third might use color patterns. All equally accurate on average, but they might disagree on specific patients.

This phenomenon is called the Rashomon Effect (named after the classic film where witnesses give contradictory accounts of the same event). Understanding these alternative models matters because:

  • For fairness: Two equally accurate models might treat certain groups of people differently
  • For trust: If models disagree on your case, maybe the prediction isn’t as certain as it seems
  • For safety: In high-stakes decisions (medical diagnosis, loan approvals), we need to know when AI is genuinely confident

The Challenge: Finding These Alternative Models is Expensive

The traditional way to find these alternative models is to train many versions from scratch — each time with different random starting points. But training a single deep learning model can take hours or days. Training hundreds? Impractical for most real-world applications.

Our Solution: DIVERSE

We developed DIVERSE, a method that discovers diverse alternative models without retraining from scratch. Think of it like this:

Instead of building many different houses, we take one house and systematically explore how rearranging the furniture changes how people move through it.

Technically, we add small “adjustment knobs” (FiLM-style modulation vectors) to an already-trained model and use an evolutionary search to find settings that produce different predictions while keeping overall accuracy within a small tolerance of the original.

FiLM placement strategies: after dense layers, after convolutional blocks, and on residual skip connections
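To make the idea concrete, here is a minimal, self-contained sketch of the general recipe, not the paper's implementation: a toy linear classifier stands in for the frozen trained network, FiLM-style vectors (a per-output scale `gamma` and shift `beta`) act as the "adjustment knobs", and a simple mutate-and-select loop searches for knob settings that disagree with the base model while staying within an accuracy tolerance. All names and the mutation scheme here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frozen "trained" classifier: logits = W @ x, predicted class = argmax.
W = rng.normal(size=(2, 3))
X = rng.normal(size=(200, 3))          # toy evaluation set
y = (W @ X.T).argmax(axis=0)           # use base predictions as reference labels

def predict(X, gamma, beta):
    # FiLM-style modulation: scale and shift the frozen model's logits.
    logits = gamma[:, None] * (W @ X.T) + beta[:, None]
    return logits.argmax(axis=0)

def accuracy(pred):
    return (pred == y).mean()

base_pred = predict(X, np.ones(2), np.zeros(2))  # identity knobs = base model

# Simplified evolutionary loop: mutate the modulation vectors, keep only
# candidates within the accuracy tolerance, and prefer the candidate that
# disagrees most with the base model's predictions.
tolerance = 0.05
best = (np.ones(2), np.zeros(2))
best_disagreement = 0.0
for _ in range(300):
    gamma = best[0] + 0.1 * rng.normal(size=2)
    beta = best[1] + 0.1 * rng.normal(size=2)
    pred = predict(X, gamma, beta)
    if accuracy(pred) >= 1.0 - tolerance:
        d = (pred != base_pred).mean()
        if d > best_disagreement:
            best, best_disagreement = (gamma, beta), d
```

The key property the sketch shares with DIVERSE is that the frozen weights `W` are never retrained; only the small modulation vectors evolve, which is why candidate models can be produced in minutes rather than hours.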

The Results

We evaluated DIVERSE on MNIST, PneumoniaMNIST (medical imaging), and CIFAR-10. Compared to retraining from scratch, DIVERSE generates candidate models in minutes rather than hours — while achieving comparable diversity.

Rashomon in action: equally accurate models disagree on the same input

On MNIST, nearly all generated models remained within the accuracy tolerance. On the more challenging medical imaging and CIFAR-10 tasks, DIVERSE consistently outperformed dropout-based sampling and matched or exceeded retraining on key diversity metrics — particularly at higher tolerance levels.

Why This Matters

With DIVERSE, researchers and practitioners can now ask important questions that were previously too expensive to answer:

  1. “How certain is this AI really?” — If many alternative models agree, we can be more confident. If they disagree, we should be cautious.

  2. “Is this decision fair?” — We can check whether different models would treat different groups of people consistently.

  3. “Can we find a better model for this specific use case?” — Among equally accurate models, some might be more interpretable or more robust.
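The first question above can be operationalized very simply once a set of equally accurate models is available: measure, per input, how many models vote for the majority prediction. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical class predictions from five equally accurate models on six
# inputs; rows = models, columns = inputs.
preds = np.array([
    [0, 1, 1, 2, 0, 1],
    [0, 1, 1, 2, 0, 2],
    [0, 1, 0, 2, 0, 1],
    [0, 1, 1, 2, 0, 0],
    [0, 1, 1, 2, 0, 1],
])

def agreement(preds):
    # Per input: fraction of models voting for the majority class.
    rates = []
    for col in preds.T:
        _, counts = np.unique(col, return_counts=True)
        rates.append(counts.max() / len(col))
    return np.array(rates)

rates = agreement(preds)
# Inputs where the models are unanimous can be trusted more; inputs with
# low agreement are exactly the cases where caution is warranted.
uncertain = np.where(rates < 1.0)[0]
```

Here the third and sixth inputs would be flagged: the models are equally accurate on average, yet they disagree on these specific cases.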

Citation

@inproceedings{eerlings2026diverse,
  title={{DIVERSE}: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration},
  author={Eerlings, Gilles and Zoomers, Brent and Liesenborgs, Jori and {Rovelo Ruiz}, Gustavo and Luyten, Kris},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}