10 LLMs Disagree - baki.io

When you ask ten language models the same question, you don’t get ten copies of the same answer. You get ten genuinely different interpretations - shaped by architecture, training data, and the emergent personality of each model.

Most systems try to resolve this into consensus. Chaos does the opposite: it celebrates disagreement as a feature, not a bug.

Key insight

Disagreement between models is signal, not noise. Each divergence maps a boundary in the latent space of possible answers.

The diversity hypothesis

If a single model gives you one perspective, ten models give you a landscape. Not an averaged, blurred landscape - a terrain with actual peaks and valleys, each one a different way of seeing.

chaos/orchestrator.py

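import asyncio

async def measure_diversity(prompt, ensemble):
    # Fan out the same prompt to every model concurrently
    responses = await asyncio.gather(*[
        model.generate(prompt)
        for model in ensemble
    ])

    # Measure semantic distance between responses.
    # pairwise_cosine is assumed to embed each response and return
    # an n x n distance matrix (helper not shown here).
    distances = pairwise_cosine(responses)
    diversity_score = distances.mean()
    return responses, diversity_score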
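
The snippet above leaves `pairwise_cosine` undefined, so here is a minimal, dependency-free sketch of how such a score could be computed. It is an illustration under assumptions, not the Chaos implementation: `bag_of_words`, `cosine_distance`, and this `diversity_score` function are hypothetical names, and plain token counts stand in for the sentence embeddings a real system would more likely use. It also averages over distinct pairs only, whereas a mean over a full distance matrix includes the zero self-distances on the diagonal.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    # Stand-in "embedding": lowercase token counts. This is an assumption
    # for illustration; a real system would likely use an embedding model.
    return Counter(text.lower().split())


def cosine_distance(a, b):
    # 1 - cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty response as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def diversity_score(responses):
    # Mean distance over distinct pairs only, so self-distances of zero
    # do not pull the score down.
    vectors = [bag_of_words(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

With this sketch, identical responses score approximately 0 and responses that share no tokens score 1.0. Token overlap is a crude proxy for semantic distance: two paraphrases with different wording would look far apart here, which is the main reason to swap in real embeddings.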

Tip

Higher diversity scores often correlate with questions that have no single correct answer - exactly the kind worth exploring.

Warning

Don’t confuse model diversity with hallucination. Diverse outputs from grounded models reveal genuine interpretive breadth, not errors.

