Frontier models are marketed as “best at everything.” In a multi-agent product, everything is actually dozens of narrow jobs: route this message to the right specialist, triage this inbound email, write this morning brief, pick the right tools for a BD workflow, continue a UX discovery doc.

We run Actually as an agentic stack — a chairman that routes work, specialist subagents with tools and skills, and high-frequency prose paths like the planner focus brief. So we built a deep eval framework and ran the same seven models through every specialist, using scenarios shaped like production and replays from real user traffic.

The headline: Claude Sonnet (and by extension Opus) is rarely the right default. On several high-volume paths, cheaper models match or beat it on quality, latency, and cost. On growth-partner tool work, Sonnet finished second to last, 13 points behind a model that costs a fifth as much per scenario.

Why “just use Sonnet everywhere” fails

A single premium model simplifies procurement and mental load. It also multiplies cost on paths where quality has already plateaued — routing, daily briefs, email triage — while hiding regressions on tasks where tool selection matters more than prose polish.

Agentic systems are not one chat completion. They are a portfolio of tasks with different failure modes:

  • Routing — pick the right delegate from a catalogue (accuracy is binary; latency matters on every message).
  • Tool agents — choose and execute MCP tools under persona constraints (wrong tool is worse than mediocre prose).
  • Prose generators — markdown briefs with strict contracts (links, headings, no hallucinated ticket counts).
  • Email flows — skills, ticket creation, and refusal to over-react on digests and newsletters.

If you evaluate models on a generic benchmark or a single chat prompt, you will over-pay for the easy paths and under-diagnose the hard ones.

How we test (not vibes)

Our shootout harness (internal ticket AD-084) runs against the live llm-service code paths — not a simplified API playground.

  • Production fidelity: same prompt builders as production (buildRoutingPrompt, buildFocusBriefUserPrompt, PA email agent loop with skills and standing rules).
  • Real scenarios: golden prompts tagged per subagent, plus production trace replays (actual inbound emails, actual council queries).
  • LLM-as-judge: Haiku scores prose on clarity, structure, grounding, and contract compliance; agent runs are scored on tool selection, execution, and outcomes.
  • Multi-faceted composite: quality, latency (ms), and cost ($/call) recorded for every run. No single number pretends to capture trade-offs.
  • Canonical seven-model lineup: Haiku 4.5, Sonnet 4.6, Groq GPT-OSS 120B, Gemma 4 26B, Qwen 3.6 Plus, Minimax M2.7, Mistral Small 2603, synced from config/eval/shootout_models.json.
  • Repeatable scripts: make model-shootout SUBAGENT=<slug>, make email-shootout, make focus-brief-shootout from the monorepo Makefile.
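
To make the "no single number" point concrete, here is a minimal Python sketch of per-run records rolled up per model while quality, latency, and cost stay on separate axes. The RunRecord fields, the summarize helper, and the sample values are illustrative assumptions, not the harness's actual schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One model run on one scenario (hypothetical schema)."""
    model: str
    scenario: str
    quality: float      # judge or agent score in [0, 1]
    latency_ms: float
    cost_usd: float


def summarize(runs):
    """Average each axis per model; deliberately no blended score."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run)
    return {
        model: {
            "quality": mean(r.quality for r in records),
            "latency_ms": mean(r.latency_ms for r in records),
            "cost_usd": mean(r.cost_usd for r in records),
            "runs": len(records),
        }
        for model, records in by_model.items()
    }


# Made-up sample data, only to show the shape of the output.
sample_runs = [
    RunRecord("model-a", "scenario-1", 1.0, 100.0, 0.01),
    RunRecord("model-a", "scenario-2", 0.5, 300.0, 0.03),
    RunRecord("model-b", "scenario-1", 0.8, 50.0, 0.002),
]
summary = summarize(sample_runs)
for model, stats in summary.items():
    print(model, stats)
```

Keeping the three axes separate means the trade-off is decided per task (for example, latency-first for routing) rather than baked into one weighted score.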

We cull models that fail smoke runs, fix eval-path bugs before blaming the model (more on Gemma below), and tune prompts before upsizing — Mistral on focus brief needed explicit link-format rules, not a bigger model.

All-Sonnet vs per-specialist selection

Illustrative daily load for an active solo operator: 100 chairman routes, 24 focus briefs, 40 PA emails, 12 growth-partner runs, 8 UX researcher tasks. Numbers are June 2026 shootout averages.

| Task | Sonnet $/call | Daily @ Sonnet | Our pick | Pick $/call | Daily @ pick | Quality (pick vs Sonnet) |
|---|---|---|---|---|---|---|
| Chairman routing | $0.021 | $2.08 | Mistral Small | $0.0009 | $0.09 | 100% vs 100% (tie) |
| Focus brief | $0.012 | $0.29 | Mistral Small | $0.002 | $0.05 | 82% vs 76% morning (pick wins) |
| PA inbound email | $0.194 | $7.76 | Haiku / Minimax | $0.032–0.052 | $1.28–2.08 | 79% vs 75% (tie/wins) |
| Growth partner (BD) | $0.046 | $0.55 | Groq GPT-OSS | $0.009 | $0.11 | 52% vs 39% (pick wins) |
| UX researcher | $0.144 | $1.15 | Minimax M2.7 | $0.015 | $0.12 | 65% vs 58% (pick wins) |
| Daily total | | ~$11.83 | | | ~$1.65–2.45 | ~5–7× cheaper |

Opus would widen the gap further; we did not include it in the lineup because the lesson already lands without it: uniform premium models multiply invoice on high-frequency paths where quality plateaus.
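
The daily totals are simply per-call cost times call volume, summed across tasks. A minimal Python sketch that reproduces the arithmetic from the figures above follows; the dictionary layout and function name are illustrative, and the Sonnet routing cost uses the unrounded $0.02077 from the routing results rather than the rounded $0.021.

```python
# Daily call volumes from the illustrative solo-operator load.
DAILY_CALLS = {
    "chairman_routing": 100,
    "focus_brief": 24,
    "pa_inbound_email": 40,
    "growth_partner": 12,
    "ux_researcher": 8,
}

# USD per call: (Sonnet, cheaper pick, pricier pick).
# Only PA email has two picks (Minimax at $0.032, Haiku at $0.052).
COST_PER_CALL = {
    "chairman_routing": (0.02077, 0.0009, 0.0009),
    "focus_brief": (0.012, 0.002, 0.002),
    "pa_inbound_email": (0.194, 0.032, 0.052),
    "growth_partner": (0.046, 0.009, 0.009),
    "ux_researcher": (0.144, 0.015, 0.015),
}


def daily_totals(calls, costs):
    """Return (all-Sonnet, picks low, picks high) daily spend in USD."""
    sonnet = sum(calls[task] * costs[task][0] for task in calls)
    pick_low = sum(calls[task] * costs[task][1] for task in calls)
    pick_high = sum(calls[task] * costs[task][2] for task in calls)
    return sonnet, pick_low, pick_high


sonnet, pick_low, pick_high = daily_totals(DAILY_CALLS, COST_PER_CALL)
print(f"All-Sonnet:           ${sonnet:.2f}/day")
print(f"Per-specialist picks: ${pick_low:.2f} to ${pick_high:.2f}/day")
print(f"Savings factor:       {sonnet / pick_high:.1f}x to {sonnet / pick_low:.1f}x")
```

Almost two thirds of the all-Sonnet bill comes from the PA email path alone, which is why the choice between Haiku and Minimax there moves the total more than any other row.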

Chairman routing

Job: read the user message, pick the first delegate from the subagent catalogue (UX researcher, growth partner, project manager, tech lead, support assistant, miniapp coder).

Scenarios (6): UX continuation, LinkedIn/BD priorities, ticket creation, tech golden-path setup, product how-to, interactive miniapp build.

Result: all seven models scored 100% routing accuracy on every scenario. Differentiation is purely cost and latency.

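| Model | Quality | $/route | Avg ms |
|---|---|---|---|
| Mistral Small 2603 | 100% | $0.00086 | 861 |
| Groq GPT-OSS 120B | 100% | $0.00086 | 674 |
| Gemma 4 26B | 100% | $0.00074 | 12,744 |
| Minimax M2.7 | 100% | $0.00170 | 6,201 |
| Haiku 4.5 | 100% | $0.00698 | 1,720 |
| Qwen 3.6 Plus | 100% | $0.00252 | 7,407 |
| Sonnet 4.6 | 100% | $0.02077 | 2,599 |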

Recommendation: Mistral Small or Groq GPT-OSS for chairman routing — sub-second to ~1s, ~$0.0009 per route. Sonnet adds no accuracy on this golden set.
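
When accuracy is tied, the selection reduces to a filter and a sort. The sketch below is one way to express that in Python using the routing results above; the 1,000 ms latency ceiling and the cost-then-latency tie-break are assumptions made for illustration, not a documented rule of the harness.

```python
# (model, routing accuracy, USD per route, average latency in ms)
ROUTING_RESULTS = [
    ("Mistral Small 2603", 1.00, 0.00086, 861),
    ("Groq GPT-OSS 120B", 1.00, 0.00086, 674),
    ("Gemma 4 26B", 1.00, 0.00074, 12744),
    ("Minimax M2.7", 1.00, 0.00170, 6201),
    ("Haiku 4.5", 1.00, 0.00698, 1720),
    ("Qwen 3.6 Plus", 1.00, 0.00252, 7407),
    ("Sonnet 4.6", 1.00, 0.02077, 2599),
]


def pick_router(results, max_ms=1000):
    """Keep top-accuracy models under the latency ceiling, cheapest first."""
    best_accuracy = max(accuracy for _, accuracy, _, _ in results)
    candidates = [
        row for row in results
        if row[1] == best_accuracy and row[3] <= max_ms
    ]
    # Sort by cost, then by latency to break cost ties.
    return sorted(candidates, key=lambda row: (row[2], row[3]))


shortlist = pick_router(ROUTING_RESULTS)
for model, accuracy, cost, ms in shortlist:
    print(f"{model}: {accuracy:.0%}, ${cost:.5f}/route, {ms} ms")
```

Under these assumptions the shortlist contains only Groq GPT-OSS and Mistral Small; Gemma is marginally cheaper per route but is excluded by the latency ceiling.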

Personal assistant — inbound email

Job: triage inbound mail — Sentry alerts, GitHub CI, newsletters, Search Console, meeting invites, Fireflies recaps — with skills, ticket creation rules, and LLM judge on outcomes.

Harness: 16 scenarios (golden + production trace replays), full agent mode with tools.

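| Model | Quality | $/email | Avg ms |
|---|---|---|---|
| Minimax M2.7 | 79% | $0.032 | 21,477 |
| Haiku 4.5 | 79% | $0.052 | 14,408 |
| Qwen 3.6 Plus | 77% | $0.052 | 41,226 |
| Sonnet 4.6 | 75% | $0.194 | 25,443 |
| Mistral Small | 72% | $0.022 | 12,502 |
| Gemma 4 26B | 71% | $0.017 | 34,882 |
| Groq GPT-OSS | 59% | $0.022 | 12,328 |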

Haiku and Minimax tie on quality; Sonnet is mid-pack at roughly six times the cost of Minimax and nearly nine times the cost of Mistral. Groq scored lowest on quality but is a viable fast/cheap tier: early 429 errors in batch shootouts were TPM (tokens-per-minute) saturation, not a production architecture flaw, and we added retry with rate-limit header parsing.
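
For readers who hit the same 429s, here is a minimal Python sketch of retry driven by a rate-limit header. It assumes the provider sends the standard HTTP Retry-After header in seconds and falls back to exponential backoff otherwise; the send callable, its (status, headers, body) return shape, and the defaults are hypothetical, and the headers our own retry parses are not reproduced here.

```python
import time


def retry_delay_seconds(headers, attempt, base=1.0, cap=60.0):
    """Prefer the server's Retry-After hint; otherwise back off exponentially."""
    lowered = {key.lower(): value for key, value in headers.items()}
    hint = lowered.get("retry-after")
    if hint is not None:
        try:
            return min(max(float(hint), 0.0), cap)
        except ValueError:
            pass  # non-numeric value such as an HTTP-date; use backoff instead
    return min(base * (2 ** attempt), cap)


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    send is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay_seconds(headers, attempt))
    return status, body


# Usage with a fake transport: one 429, then success.
fake_responses = iter([(429, {"retry-after": "1"}, None), (200, {}, "ok")])
waits = []
result = call_with_retry(lambda: next(fake_responses), sleep=waits.append)
print(result, waits)
```

Injecting the sleep function keeps the retry loop testable without real waiting, which matters when the same code runs inside a batch shootout.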

Growth partner — tool-heavy BD work

Job: LinkedIn post drafts, daily BD briefs, ICP refinement — scenarios that expect get_document, linkedin_list_posts, list_tickets, and related tools.

This is where the “most expensive = best” story breaks hardest.

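| Model | Quality | $/scenario | Notes |
|---|---|---|---|
| Groq GPT-OSS | 52% | $0.009 | Top tier on tool selection |
| Haiku 4.5 | 47% | $0.018 | |
| Gemma 4 26B | 47% | $0.007 | |
| Qwen 3.6 Plus | 47% | $0.023 | |
| Minimax M2.7 | 44% | $0.016 | |
| Mistral Small | 35% | $0.014 | |
| Sonnet 4.6 | 39% | $0.046 | Last among frontier candidates |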

Scores are dominated by tool selection and execution, not prose elegance. Sonnet's premium price buys no advantage when the task is “call the right MCP tools under persona constraints.”

Focus brief — high-frequency prose

Job: generate the planner morning/midday/shutdown brief — markdown prose from calendar, tickets, git activity, and email context. No tools. Strict contract: ticket links, meeting join URLs, no total ticket counts, no emoji.

Harness: production buildFocusBriefUserPrompt(), LLM judge on clarity / structure / grounding / contract, four scenarios including a heavy-context path (large git + email sections simulating a future PA-skill merge).
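
As a rough illustration of how four judge dimensions can become one percentage, the Python sketch below assumes the judge returns an integer from 0 to 5 per dimension and that the dimensions are weighted equally. Both the scale and the equal weighting are assumptions for the example; the harness's actual rubric maths is not described here.

```python
RUBRIC = ("clarity", "structure", "grounding", "contract")


def judge_percent(scores, max_per_dimension=5):
    """Combine per-dimension judge scores into a 0-100 percentage.

    Raises if the judge omitted a dimension, so a malformed judge reply
    is surfaced as a harness problem instead of a silent low score.
    """
    missing = [dim for dim in RUBRIC if dim not in scores]
    if missing:
        raise ValueError(f"judge output missing dimensions: {missing}")
    total = sum(scores[dim] for dim in RUBRIC)
    return round(100 * total / (max_per_dimension * len(RUBRIC)))


# Made-up judge reply, only to show the calculation.
example = {"clarity": 5, "structure": 4, "grounding": 4, "contract": 3}
print(judge_percent(example))
```
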

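| Model | Quality (4 scenarios) | $/brief | Avg ms |
|---|---|---|---|
| Sonnet 4.6 | 80% | $0.012 | 8,060 |
| Haiku 4.5 | 75% | $0.005 | 4,144 |
| Gemma 4 26B | 73% | $0.003 | 4,540 |
| Qwen 3.6 Plus | 71% | $0.003 | 5,652 |
| Mistral Small | 70% | $0.003 | 1,599 |
| Groq GPT-OSS | 62% | $0.003 | 2,045 |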

On a single morning scenario after prompt tuning, Haiku hit 90% judge score; Mistral hit 82% at 1.7s and $0.002 — our recommended path for a brief that runs every time someone opens the planner.

Mistral Small — after link-format prompt tuning (82%, 1.7s)

You're starting the week with two projects in flight and a protected deep-work block before 11:00.

## Today
Protect your first two hours for heads-down work: the Gmail OAuth refresh is in progress and due today, so open [Gmail OAuth refresh](http://ai.actually.local/tickets/t-1) now and finish the token-rotation checks in staging.

## Meetings
11:00 [Product sync](https://meet.google.com/abc-defg-hij) — with Rich, Nina, Mike, Drummond; weekly sprint review and blocker clearout.

Mistral before tuning — missing ticket URLs

Morning. You have one meeting and one ticket due today.

At 11:00 you have the Product sync with Rich, Nina, Mike, and Drummond.

(No markdown links; failed contract checks.)

Gemma eval bug (worth knowing if you run OpenRouter): first shootout scored Gemma at 0% — empty responses with exactly 2,048 completion tokens. Cause: eval path enabled OpenRouter reasoning: medium while capping tokens; Gemma burned the budget on hidden reasoning. Production never had this issue. Fix: dedicated focus-brief run mode with reasoning: none and 8,192 max tokens.
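
Two cheap guards would have flagged this before it reached a results table: treat an empty answer that consumed the whole token cap as a harness suspect rather than a 0% score, and diff the eval request settings against production. The Python sketch below shows both; the function names are hypothetical, and the production settings in the example are placeholders chosen for illustration, since only the broken eval settings (reasoning: medium, 2,048 tokens) and the fixed ones are stated above.

```python
def classify_run(text, completion_tokens, max_tokens):
    """Separate likely harness failures from genuine model failures."""
    if not text.strip():
        if completion_tokens >= max_tokens:
            # Budget fully spent with nothing visible: probably hidden
            # reasoning or truncation, so do not score this as 0%.
            return "harness_suspect"
        return "empty"
    return "ok"


def config_drift(prod, eval_cfg, keys=("reasoning", "max_tokens")):
    """Return {key: (prod value, eval value)} for settings that differ."""
    return {
        key: (prod.get(key), eval_cfg.get(key))
        for key in keys
        if prod.get(key) != eval_cfg.get(key)
    }


# Placeholder production settings (assumed for the example).
prod_settings = {"reasoning": "none", "max_tokens": 8192}
# The eval settings that produced the 0% score.
broken_eval = {"reasoning": "medium", "max_tokens": 2048}

print(classify_run("", 2048, 2048))
print(config_drift(prod_settings, broken_eval))
```
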

Prompt > model: we added LINK RULES, a reference example, and a user-prompt footer requiring [Name](url) for tickets. Mistral jumped from missing links to production-quality output without changing model tier.
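
Part of the focus-brief contract can also be checked deterministically, alongside the judge. The Python sketch below checks that every expected ticket and meeting appears as a [Name](url) markdown link and that no emoji slipped in. The function name, the idea of passing expected titles in, and the emoji ranges are assumptions for illustration; the emoji pattern is a rough heuristic, not a full Unicode emoji test.

```python
import re

# Markdown links of the form [label](http://... or https://...).
MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
# Rough heuristic: common pictograph and dingbat blocks only.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def check_brief(markdown, ticket_titles, meeting_titles):
    """Return a list of contract failures; an empty list means it passed."""
    linked = {label for label, _url in MD_LINK.findall(markdown)}
    failures = []
    for title in ticket_titles:
        if title not in linked:
            failures.append(f"ticket not linked: {title}")
    for title in meeting_titles:
        if title not in linked:
            failures.append(f"meeting not linked: {title}")
    if EMOJI.search(markdown):
        failures.append("emoji present")
    return failures


tuned = (
    "open [Gmail OAuth refresh](http://ai.actually.local/tickets/t-1) now\n"
    "11:00 [Product sync](https://meet.google.com/abc-defg-hij) with Rich"
)
untuned = (
    "Morning. You have one meeting and one ticket due today.\n"
    "At 11:00 you have the Product sync with Rich, Nina, Mike, and Drummond"
)
print(check_brief(tuned, ["Gmail OAuth refresh"], ["Product sync"]))
print(check_brief(untuned, ["Gmail OAuth refresh"], ["Product sync"]))
```

A check like this only covers the mechanical half of the contract; clarity, structure, and grounding still need the judge.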

UX researcher — agent with tools

Job: continue UX discovery — list projects, read documents, start skills, write IA sections.

Scenarios (4): heuristic research summary, project status read, IA phase start, heuristic audit framing.

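| Model | Quality | Tools | $/scenario | Avg ms |
|---|---|---|---|---|
| Minimax M2.7 | 65% | 50% | $0.015 | 21,817 |
| Sonnet 4.6 | 58% | 75% | $0.144 | 49,993 |
| Mistral Small | 56% | 38% | $0.009 | 9,482 |
| Haiku 4.5 | 46% | 38% | $0.010 | 6,716 |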

Sonnet won the prose-only research_summary scenario (88%), where it was the only model to reach for web_search, but collapsed to 2% on ux_heuristic_audit after a 118-second tool loop. Minimax led composite quality at roughly a tenth of Sonnet's cost; Mistral is the best cost/latency tradeoff for agent UX work at ~16× lower cost than Sonnet ($0.009 vs $0.144 per scenario).

What we are taking away

  • The moat is not “we use Claude.” It is task-specific evaluation at production fidelity — and the discipline to run Mistral on routing when Sonnet adds nothing but invoice.
  • Eval path must match production path. Reasoning flags, token caps, and provider routing that differ from prod will lie to you (Gemma's 0% focus-brief score).
  • Prompt contracts beat model upsizing for structured prose — link rules and a reference example fixed Mistral faster than switching to Sonnet.
  • Batch shootouts stress rate limits differently from prod. Groq 429s under back-to-back eval were not a production risk; single-email flows with tool gaps behave fine with retry.
  • LLM-as-judge scales prose QA where must_contain heuristics give everyone 100%. Multi-dimensional judge scores spread models meaningfully on focus brief.

Closing thought

If you are picking one model for your whole agent stack, ask what each task actually needs. Routing is not email triage is not BD tool selection is not a daily markdown brief. We measured each separately — and the most expensive model was not the best at most of them.

That is the work we are building in public: not bigger models for their own sake, but specialists with the right brain for the job, chosen with evidence.

Building an agentic stack?

If you want to talk about per-specialist model selection, eval harnesses, or how we run multi-agent workflows in production, we are happy to chat. No obligation.
