Product · 2026-05-02 · 6 min read

Why a panel of three beats one big reviewer

A single model trying to flag bugs, security risks, and architecture drift in one shot ends up doing all three poorly. Specialising the prompt — and arguing back from runtime failure modes — cuts noise dramatically.

By Quorum team

When we started Quorum, the obvious shape was a single reviewer: one prompt, one model, one pass over the diff. Reviewers like that look great in demos and fall apart in real repos. The reason is mundane — a generic "review this PR" prompt asks the model to multitask across three very different judgement calls, and it picks whichever is loudest in the diff.

What "one big reviewer" actually does

On a 200-line refactor PR, a single reviewer almost always picks up the renames and the extracted helpers and writes confident prose about maintainability. It rarely flags the off-by-one in a loop bound, or the input that flows into a SQL string two files over. Both of those need a different stance, adversarial and working backward from runtime failure modes, and they have to compete with the architecture commentary for attention.

We measured it on an internal corpus of 1,200 review runs. A single reviewer surfaced one likely bug per six PRs. The same model, prompted as one of three specialised reviewers (Correctness, Security, Architecture) and run in parallel, surfaced one per 2.4 PRs — without raising the noise rate.
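Mechanically, the panel is just a fan-out: the same diff goes to three specialised prompts at once and the results come back together. A minimal sketch in TypeScript, assuming a generic complete(prompt) call into whatever model you use; the function name and prompt text are illustrative, not our production harness.

    // Minimal fan-out: the same diff goes to three specialised prompts in parallel.
    // `complete` stands in for the actual model call; the prompt text is illustrative.
    const REVIEWERS = ["correctness", "security", "architecture"] as const;
    type Reviewer = (typeof REVIEWERS)[number];

    async function runPanel(
      diff: string,
      complete: (prompt: string) => Promise<string>,
    ): Promise<Record<Reviewer, string>> {
      const raw = await Promise.all(
        REVIEWERS.map((reviewer) =>
          complete(`You are the ${reviewer} reviewer. Stay inside your focus list.\n\n${diff}`),
        ),
      );
      // Zip reviewer names back onto their outputs.
      return Object.fromEntries(REVIEWERS.map((r, i) => [r, raw[i]])) as Record<Reviewer, string>;
    }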

What changes with three

  • Each reviewer has a focus list and a stance. Correctness argues backward from runtime failure modes. Security frames findings in OWASP / CWE terms. Architecture cites the repo conventions it can see in the diff context.
  • Findings are JSON, validated against a Zod schema, with severity and confidence. The aggregator does the cross-reviewer work — dedup, sort, truncate — instead of asking the model to do it.
  • Confidence below the floor never reaches the PR. We default to 0.75. It is the cheapest noise filter we have and the one with the largest effect on perceived quality. A sketch of the schema and the floor follows this list.
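The schema-plus-floor plumbing is small. Here is a minimal sketch assuming Zod on the aggregator side; the field names and enums are illustrative, not our exact schema.

    import { z } from "zod";

    // Illustrative finding shape: one record per issue a reviewer raises.
    const Finding = z.object({
      reviewer: z.enum(["correctness", "security", "architecture"]),
      file: z.string(),
      line: z.number().int().positive(),
      severity: z.enum(["low", "medium", "high"]),
      confidence: z.number().min(0).max(1),
      message: z.string(),
    });
    type Finding = z.infer<typeof Finding>;

    // Default confidence floor mentioned above.
    const CONFIDENCE_FLOOR = 0.75;

    function parseFindings(raw: unknown[]): Finding[] {
      const kept: Finding[] = [];
      for (const item of raw) {
        const parsed = Finding.safeParse(item);
        // Drop anything that fails validation or sits below the floor.
        if (parsed.success && parsed.data.confidence >= CONFIDENCE_FLOOR) {
          kept.push(parsed.data);
        }
      }
      return kept;
    }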

What we did not expect

The biggest surprise was that maintainers do not read three separate review posts well. The dedup-and-aggregate step matters more than the parallel fan-out. Two reviewers flagging the same line is a strong signal; we sort it to the top. Three reviewers each picking different battles in the same PR is overwhelming; we cap inline comments at 10 per review and let the rest live in the run history.
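That aggregation step is not much code either. A minimal sketch, reusing the Finding shape from the schema sketch above; the grouping key, tie-breaking, and cap are illustrative defaults, not the full production logic.

    const MAX_INLINE_COMMENTS = 10;

    function aggregate(findings: Finding[]): { inline: Finding[]; overflow: Finding[] } {
      // Group findings that land on the same file and line across reviewers.
      const byLocation = new Map<string, Finding[]>();
      for (const f of findings) {
        const key = `${f.file}:${f.line}`;
        byLocation.set(key, [...(byLocation.get(key) ?? []), f]);
      }

      // One comment per location; locations flagged by more reviewers
      // (then by higher confidence) sort to the top.
      const deduped = [...byLocation.values()]
        .sort((a, b) => b.length - a.length || b[0].confidence - a[0].confidence)
        .map((group) => group[0]);

      return {
        inline: deduped.slice(0, MAX_INLINE_COMMENTS),
        overflow: deduped.slice(MAX_INLINE_COMMENTS), // lives in the run history
      };
    }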

A panel beats a generalist when the work splits cleanly into specialities. Code review does. Most other reasoning tasks do not — be careful about copy-pasting this pattern.

If you are building reviewer tooling: start by writing the focus lists, not the prompts. The prompts fall out of the focus lists, and the focus lists are what your maintainers will actually argue about.
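A focus list can start as a plain typed object long before any prompt exists. The entries below show the shape we mean; they are examples, not our production lists.

    // Illustrative focus lists, one per reviewer.
    const focusLists = {
      correctness: [
        "off-by-one errors in loop bounds and slices",
        "null or undefined flowing into non-optional parameters",
        "error paths that swallow or mislabel failures",
      ],
      security: [
        "untrusted input reaching SQL, shell, or template strings",
        "missing authorization checks on new or changed endpoints",
        "secrets or tokens introduced into the diff",
      ],
      architecture: [
        "new modules that bypass existing layering conventions",
        "duplicated logic that already exists in shared helpers",
      ],
    } as const;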
