How accurate is your AI fact-checker? We tested five.

Combining the accuracy, cost, and reliability data, here’s where each model landed. Per-check cost is shown relative to the production model: we run GPT-5.4-mini at roughly a nickel a check, and every other model's cost is expressed as a multiple of that.
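As a rough worked example, the relative costs can be turned into approximate dollar figures. This is only a sketch: it assumes "roughly a nickel" means exactly $0.05, and it treats the approximate multiples shown on the cards (1×, ~2×, ~6×) as exact. The names `BASELINE_USD`, `RELATIVE_COST`, and `estimated_cost` are illustrative and not part of any Perclaim code.

```python
# Assumption: "roughly a nickel a check" is taken as exactly $0.05 for illustration.
BASELINE_USD = 0.05

# Approximate per-check cost multiples as listed on the model cards.
# The "~" values are rounded in the source, so the results are estimates.
RELATIVE_COST = {
    "GPT-5.4-mini": 1.0,       # baseline
    "Claude Haiku 4.5": 2.0,   # ~2x
    "Claude Sonnet 4.6": 6.0,  # ~6x
}


def estimated_cost(model: str, checks: int = 1, baseline_usd: float = BASELINE_USD) -> float:
    """Return the estimated cost in USD of running `checks` fact-checks on `model`."""
    if checks < 0:
        raise ValueError("checks must be non-negative")
    return RELATIVE_COST[model] * baseline_usd * checks


if __name__ == "__main__":
    for name in RELATIVE_COST:
        per_check = estimated_cost(name)
        per_thousand = estimated_cost(name, checks=1000)
        print(f"{name}: ~${per_check:.2f} per check, ~${per_thousand:.0f} per 1,000 checks")
```

Under these assumptions, a check that costs about five cents on the production model would cost about ten cents on Haiku and about thirty cents on Sonnet.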

Production primary

GPT-5.4-mini

OpenAI · with hosted web search

92.7%

Effective concordance with Sonnet

1×

Per-check cost (baseline)

Failure rate

The model running Perclaim today, across both the free and Pro tiers. It is reliable and cost-efficient, and on the manually adjudicated hard cases it was never factually wrong against Sonnet, just slightly less precise on calibration. This is the production default.

Production backup

Claude Haiku 4.5

Anthropic · with web search

96.5%

Effective concordance with Sonnet

~2×

Per-check cost (rel.)

1.6%

Failure rate (post-fix)

The biggest surprise of the evaluation. On accuracy, indistinguishable from Sonnet — zero factual errors against the reference on the substantive disagreements we adjudicated. Once its output-reliability problem was fixed (§ 5), it became the obvious cross-vendor backup: nearly Sonnet-grade judgment, a fraction of the cost, and a different vendor than the primary for resilience.

Reference model

Claude Sonnet 4.6

Anthropic · with dynamic-filter web search

100%

Reference baseline

~6×

Per-check cost (rel.)

Failure rate

The most accurate model we tested, and the yardstick the others are measured against. On 11 head-to-head substantive disagreements with GPT-5.4-mini, Sonnet was correct in 6 and never wrong. It cites specific fact-checkers (Lead Stories, PolitiFact, Snopes) with specific dates. We keep it as our evaluation reference rather than a production model — at several times the per-check cost, it doesn’t fit a sustainable consumer price point, and Haiku delivers comparable judgment for far less.

Not viable today

gpt-oss:120b

Open-weight · local hardware · no retrieval

76.8%

Effective concordance with Sonnet

power

Cost is electricity only

Hard failure rate

A capable reasoning model with no way to learn about events after its training cutoff. In this study, 95% of its “FALSE” verdicts were on posts that contained real, documented events. Cost is essentially free at runtime, but the systematic false-negative problem disqualifies it for current-events fact-checking. A separate web-retrieval system would have to be built before it could be considered.

Not viable today

gemma4:26b

Open-weight · local hardware · no retrieval

63.2%

Effective concordance with Sonnet

power

Cost is electricity only

17%

Total failure rate

Multimodal but offline. Same retrieval problem as gpt-oss, plus a much higher overall failure rate driven by JSON parse failures (22 of 152 rows) and hard errors. Same conclusion: not viable without a separate web-retrieval system.
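The failure figures on this card can be sanity-checked with a little arithmetic. The sketch below assumes the 17% total failure rate was measured over the same 152 rows as the JSON parse failures, which the card does not state explicitly. The "implied hard error" share is therefore an inference from rounded numbers, not a reported result, and the variable names are illustrative.

```python
# Figures taken from the gemma4:26b card.
PARSE_FAILURES = 22        # rows whose output failed JSON parsing
TOTAL_ROWS = 152           # rows evaluated
TOTAL_FAILURE_RATE = 0.17  # rounded headline figure from the card


def rate(count: int, total: int) -> float:
    """Return count / total as a fraction, guarding against an empty sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


parse_rate = rate(PARSE_FAILURES, TOTAL_ROWS)

# Assumption: the 17% total uses the same 152-row denominator, so whatever is
# not a parse failure is attributed to hard errors. This is an estimate only.
implied_hard_error_rate = TOTAL_FAILURE_RATE - parse_rate

if __name__ == "__main__":
    print(f"JSON parse failures: {parse_rate:.1%} of rows")
    print(f"Implied hard errors: ~{implied_hard_error_rate:.1%} of rows")
```

On that assumption, parse failures alone account for about 14.5% of rows, leaving roughly 2.5 percentage points of the 17% for hard errors.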

§ 1 · The Question · Which model should be doing the work?

§ 2 · The Method · Treating the best model as ground truth

§ 3 · The Headline · Two tiers, sharp divide

§ 4 · The Finding · The retrieval cliff is real

§ 5 · Five Days Later · The cheap model that almost wasn’t

§ 6 · The Models · What each one is good for

GPT-5.4-mini

Claude Haiku 4.5

Claude Sonnet 4.6

gpt-oss:120b

gemma4:26b

§ 7 · The Honest Bit · What this study isn’t

§ 8 · What This Means · For Perclaim users, in plain terms