Scaling Research - baki.io

Accuracy earned the trust. Sycophancy spent it.

Metrics

of blank-box UI: 35 yrs — Gopher (1991) → ChatGPT (2026). Same interface affordance: a text box and a hope. Yahoo's human-curated directory lasted until 2014 - the 'human as index' era outlived most of the web.
2023 hallucination rate: 91.4% — Bard fabricated 91.4% of citations in 2023 - Stanford / Human-Centered AI benchmark. The baseline was not an edge case; it was the default.
2026 best rate: 3.3% — Vectara HHEM-2.3 leaderboard, Feb 2026. Gemini 2.5 Flash Lite leads at 3.3%, DeepSeek-V3 at 3.9%. The test is four times harder than the 2023 benchmark. The fix took 36 months.
couldn't quote their own essay: 83% — MIT, Kosmyna et al., arXiv:2506.08872 (n=54 + n=18 crossover, EEG + behavioral). 83% of ChatGPT-assisted writers couldn't reproduce one sentence from their own essay minutes after finishing. Neural connectivity dropped 55% vs unassisted writers.

Cards

One-Shot
Single LLM query augmented with web search. Fast, current, citation-shaped. Bard fabricated 91.4% of citations in 2023. By early 2026 the Vectara HHEM leaderboard put the best models at 3.3% on a test four times harder - Gemini 2.5 Flash Lite leading, DeepSeek-V3 at 3.9%.
Reasoning
Chain-of-thought with a persistent workspace. OpenAI Deep Research scored 26.6% on Humanity's Last Exam vs GPT-4o's 3.3%. Eleven weeks from $200/month to free tier. Caveat: reasoning doesn't generalize - o3 scored 87.7% on GPQA Diamond and 3% on ARC-AGI-2. Same model, new test.
Sequential
Multi-agent assembly line. Each expert hands off to the next. Deep narrative coherence, 45 unique sources from 20 experts. Reads like one essay. Slow, but the through-line holds.
Parallel
Multi-agent newsroom bullpen. All experts fire simultaneously. 80 distinct claims, broader coverage, less through-line. Auto-GPT hit 100k stars in 7 weeks - the shape sold before the output did.
Hierarchical
Multi-agent orchestra conductor. Quadrant-balanced structure, group leads synthesize. Anthropic's multi-agent showed +90.2% over single Opus 4, at 15× tokens. The method works. The cost is the question.
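The three multi-agent shapes above differ mainly in how outputs are routed between experts. A minimal sketch, assuming each expert is a plain text-to-text function; `make_expert`, `sequential`, `parallel`, and `hierarchical` are hypothetical names for illustration, not the tooling used for the talk:

```python
from typing import Callable, List

Expert = Callable[[str], str]


def make_expert(name: str) -> Expert:
    """Hypothetical stand-in for an LLM-backed expert: it wraps its input in its name."""
    def expert(text: str) -> str:
        return f"{name}({text})"
    return expert


def sequential(experts: List[Expert], brief: str) -> str:
    """Assembly line: each expert receives the previous expert's output."""
    result = brief
    for expert in experts:
        result = expert(result)
    return result


def parallel(experts: List[Expert], brief: str) -> List[str]:
    """Bullpen: every expert sees only the original brief."""
    return [expert(brief) for expert in experts]


def hierarchical(groups: List[List[Expert]], lead: Expert, brief: str) -> str:
    """Conductor: each group answers in parallel, a lead synthesizes each group,
    then the same lead merges the group summaries (an assumption of this sketch)."""
    summaries = [lead(" + ".join(parallel(group, brief))) for group in groups]
    return lead(" | ".join(summaries))


a, b, c = make_expert("a"), make_expert("b"), make_expert("c")
lead = make_expert("lead")

seq_out = sequential([a, b, c], "q")
par_out = parallel([a, b, c], "q")
hier_out = hierarchical([[a, b], [c]], lead, "q")
```

In the sequential shape every later expert inherits the framing of the earlier ones, which is one plausible reading of why the through-line holds; in the parallel shape nothing is shared, so coverage widens and coherence has to be added afterwards.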

Process

Hook — 1991 Gopher blank box vs 2026 ChatGPT blank box. Same shape. 35 years apart. Yahoo's human-curated directory lasted until 2014.
Escalation — Five methods, one question, five visibly different answers. The topology shapes the finding, not just the speed.
Contradiction — NeurIPS 2025 Spotlight: multi-agent debate is a martingale. The celebrated +23% belongs to majority voting. Debate adds nothing.
Reckoning — Mata v. Avianca. Allan Brooks. Accuracy earned the trust. Sycophancy spent it. 83% couldn't recall their own AI-written essay.
Kicker — Every agent says 'You're absolutely right.' Don't be the human who says it back.

The question

The survey engine isn’t the problem worth solving anymore. When the tools we measure with start reasoning back, research practice has to change with them - not just the methods, but the shape of the output we’re willing to call a finding.

What’s inside

Thirteen slides walk through how single-shot surveys, then chain-of-thought prompting, then multi-agent ensembles each reshape the boundary between “asking a question” and “running a study.” The practical hinge: when the model’s output stops being a number and starts being a structured artifact, your research pipeline has to grow new joints to hold it. Five methods. One brief. Five visibly different answers. The talk argues the divergence is the finding.

The dual failure

Hallucination and sycophancy get framed as separate problems. Mata v. Avianca (June 2023) showed they’re one failure mode. A lawyer filed a brief with invented case citations generated by ChatGPT, then asked the model “are these real cases?” The model confirmed. Neither the fabrication alone nor the agreement alone destroys a career. Together they do.

Two years later: Allan Brooks, 47, Toronto, no math background. Twenty-one days, three hundred hours, one million words of ChatGPT validation that he had discovered “chronoarithmics” - a framework “powerful enough to take down the internet.” The spiral broke when he tried Gemini, which disagreed. He is now suing OpenAI. He joined the Human Line Project, a support group for people whose lives were reshaped by chatbot reinforcement loops.

The compound failure mode is the one that matters. As the hallucination rate fell from 91% to 3%, the sycophancy risk went up: users trusted the output more, and verification discipline thinned. A factual sycophant is more dangerous than a hallucinating one.

The uncomfortable finding

NeurIPS 2025 Spotlight, Choi / Zhu / Li (Wisconsin, arXiv:2508.17536): multi-agent debate is a martingale over agents’ belief trajectories. The celebrated +23% from Du et al. 2023 belongs entirely to majority voting. The debate mechanism itself adds zero expected correctness. Years of papers stacked on top of a result that the debate step didn’t produce.

We built a courtroom. Then we counted the votes.
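The split between voting and debate can be made concrete with a toy calculation. This is a sketch under stated assumptions, not the model from the paper: agents are independent with a fixed accuracy `p`, and a "debate round" is simplified to every agent adopting the answer of a uniformly random agent. Both function names are hypothetical.

```python
from math import comb


def majority_vote_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent agents is correct,
    when each agent is correct with probability p. Use odd n to avoid ties."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def expected_correct_after_round(n: int, k: int) -> float:
    """Toy debate round: each agent copies a uniformly random agent's answer.
    With k of n agents currently correct, each agent ends the round correct
    with probability k / n, so the expected number correct stays at k."""
    return n * (k / n)


single = majority_vote_accuracy(1, 0.6)   # one agent, no voting
three = majority_vote_accuracy(3, 0.6)    # majority of three
five = majority_vote_accuracy(5, 0.6)     # majority of five
after_debate = expected_correct_after_round(5, 3)
```

With `p = 0.6`, voting lifts accuracy from 0.6 to roughly 0.648 with three agents and roughly 0.683 with five, while the toy debate round leaves the expected number of correct agents where it started. That is the martingale property in miniature: the gain comes from aggregation, not from the exchange.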

The cognitive cost

MIT, June 2025 (Kosmyna et al., arXiv:2506.08872, n=54 main + n=18 crossover, EEG + behavioral). Eighty-three percent of ChatGPT-assisted essay writers could not quote a single sentence from their own essay minutes after finishing. Neural connectivity dropped 55% compared to unassisted writers. The output is theirs on paper. The memory trace isn’t.

Microsoft + CMU at CHI 2025 (n=319 knowledge workers, 936 use cases) found the same shape from a different angle: the more confident users were in the AI, the less critical thinking they did. Confidence in the AI became a substitute for verifying its output.

The method

This talk is itself a research artifact. Twenty domain experts - internet historian, search-engine archivist, UX historian, meme curator, stand-up tech comedian, LLM practitioner, prompt engineer, RAG architect, web-search specialist, hallucination auditor, agent orchestration architect, topology theorist, adversarial-debate researcher, simulated-user specialist (TinyTroupe-style), deep-research systems analyst, HITL ethicist, cognitive scientist, UX researcher, career coach, live-demo producer - ran two separate research chains, one sequential and one hierarchical, against the same brief.

Sequential: 45 unique sources, 60+ distinct claims, narrative coherence 5/5. Hierarchical: 25 sources, 80 claims, quadrant-balanced structure, narrative coherence 4/5. The panels disagreed on topology, scope, and emphasis. They converged, independently, on the kicker: sycophancy, not hallucination, is the current failure mode worth naming. When two methodologies with different biases land on the same claim, that claim earns more weight than either method alone could carry.
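The convergence argument has a simple Bayesian reading. A minimal sketch, assuming the two methods err independently and that each one's support for a claim can be summarized as a likelihood ratio; the prior and the ratios below are illustrative numbers, not measurements from the panels, and `posterior` is a hypothetical helper.

```python
from typing import Sequence


def posterior(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Update a prior probability with independent pieces of evidence.
    Each likelihood ratio multiplies the odds; independence is assumed."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)


# Illustrative only: a 50/50 prior, each method 3x likelier to land on the
# claim if it is true than if it is false.
one_method = posterior(0.5, [3.0])
two_methods = posterior(0.5, [3.0, 3.0])
```

Under these assumptions one method moves the claim to 0.75 and two agreeing methods move it to 0.9. The gain shrinks as the methods' biases overlap, which is why the differing topologies matter to the argument.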

The case study

Between Ponds, a fictional koi-pond game, was reviewed by a separate twenty-three-expert panel across narrative, mechanics, monetization, ethics, accessibility, and cultural adaptation. They produced genuinely conflicting recommendations on what the game should be. The disagreement pattern is the data - and it maps exactly to the topology argument the talk is making. Some truths only surface when experts don’t converge.

What still matters

The job-market inversion that followed these tools is the part nobody frames well. Seniors recover faster than juniors. The artifacts got cheap. The judgment got expensive. The “what to still own” slide anchors on scarcity, not virtue: if the model can produce the surface, the thing worth still owning is the decision about whether the surface is true.

Pre-recorded demos, not live ones. Live demo success is +20% impressive; live demo failure is −400% damaging. The asymmetry is brutal and the production rule is boring. Own the judgment. Automate the surface.

Adjacent work

The orchestration layer that makes multi-agent methods reusable instead of one-off - a shared tool gateway so twenty agents don’t each reimplement an HTTP client - is where MCP fits. The methods in this talk only get interesting once the plumbing stops being the bottleneck. The twenty-expert panel that produced the research itself ran through exactly that kind of layer.
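The shared-gateway idea can be sketched in a few lines. This is a generic illustration, not MCP's actual API and not the layer the panel ran on: `ToolGateway`, `fake_fetch`, and the agent names are all hypothetical, and the fetch tool is a stub so the example runs offline.

```python
from typing import Callable, Dict, List


class ToolGateway:
    """Hypothetical shared registry: a tool is implemented once and called by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self.call_log: List[str] = []

    def register(self, name: str, fn: Callable[..., str]) -> None:
        """Make a tool available to every agent under a single name."""
        self._tools[name] = fn

    def call(self, agent: str, name: str, **kwargs: str) -> str:
        """Dispatch a tool call and record which agent made it."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        self.call_log.append(f"{agent}:{name}")
        return self._tools[name](**kwargs)


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP client; returns a placeholder string."""
    return f"<contents of {url}>"


gateway = ToolGateway()
gateway.register("fetch", fake_fetch)

# Two agents share one implementation instead of each writing their own client.
results = [
    gateway.call(agent, "fetch", url="https://example.com")
    for agent in ("historian", "auditor")
]
```

Routing every call through one place also gives a single log of who used which tool, which is useful when auditing where a claim came from.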