How accurate is your AI fact‑checker? We tested five.
Behind every fact-check verdict is a language model making judgment calls. We ran 157 real social-media posts through five different models and adjudicated where they disagreed. What we found changed how we think about the cost–accuracy tradeoff — and about whether “offline” AI is a viable substitute at all.
Perclaim Research · 14 May 2026 · revised 19 May 2026
§ 1 · The Question: Which model should be doing the work?
Perclaim is a fact-checking application that reads social-media posts and tells you what’s true, false, misleading, or unverifiable. Behind that verdict is a language model — currently GPT-5.4-mini, with OpenAI’s hosted web search tool pulling in current-events context. But it’s not the only option. Other models claim better accuracy, lower costs, or the appeal of running entirely on your own hardware.
So we tested five of them, head-to-head, on the same 157 posts. Same input, same task, same scoring system. The corpus was political-heavy (Trump administration policy, Iran war coverage, election claims) because that’s what users actually submit. Some posts were text-only. Most carried images. The hardest were image-only posts where a screenshot of a Truth Social post or a quote-card meme carried the entire claim.
The five models were:
Claude Sonnet 4.6 — Anthropic’s mid-tier frontier model with the strongest available web search
Claude Haiku 4.5 — Anthropic’s smaller, cheaper frontier model with web search
GPT-5.4-mini — OpenAI’s current Perclaim production model
gpt-oss:120b — an open-weight reasoning model running locally, no web access
gemma4:26b — an open-weight multimodal model running locally, no web access
§ 2 · The Method: Treating the best model as ground truth
Fact-checking evaluations have a chicken-and-egg problem: to score a model’s accuracy you need ground-truth answers, but producing those requires human fact-checkers spending hours on each contested political claim. Inter-rater agreement on charged political content is itself noisy.
Our shortcut: use the most accurate model in the field as a proxy for ground truth, then check it. When Claude Sonnet 4.6 disagreed with GPT-5.4-mini on factually consequential cases, we manually reviewed each one with web-search verification. Across those 11 cases, Sonnet was correct in 6, tied in 5 (a legitimate calibration difference rather than an error on either side), and wrong in none. That gave us enough confidence to use Sonnet’s verdicts as a comparison anchor for the other models.
A caveat we own
Using Sonnet as the reference biases this study toward “models that look like Sonnet rank well.” It’s a proxy, not absolute truth. We report the decomposition of disagreements in detail below so calibration differences are visible separately from substantive ones.
§ 3 · The Headline: Two tiers, sharp divide
Here’s how each candidate agreed with Sonnet — first the raw verdict-match rate, then a more useful number: the rate at which the candidate reached the same factual conclusion, allowing for one-notch calibration differences (like calling something “MIXED” where Sonnet called it “MOSTLY_TRUE”) and category swaps (calling something “OPINION” instead of “UNVERIFIABLE”).
Effective concordance with Sonnet 4.6 (after collapsing 1-notch calibration shifts and orthogonal category swaps)
Haiku 4.5 (Anthropic, w/ web search): 96.5%
GPT-5.4-mini (OpenAI, w/ web search): 92.7%
gpt-oss:120b (open-weight, no retrieval): 76.8%
gemma4:26b (open-weight, no retrieval): 63.2%
The two retrieval-equipped frontier models — Haiku and GPT — cluster tightly with Sonnet, reaching the same factual conclusion 90+% of the time. The two offline models fall off a cliff, dropping to 63–77%. There is no intermediate tier in our sample. The divide is categorical.
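To make the "effective concordance" number concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration, not the study's actual scoring code: the ordering of the verdict scale (in particular where MISLEADING sits), the set of orthogonal categories, and the choice to count a missing verdict as a disagreement are all assumptions.

```python
from typing import Iterable, Optional, Tuple

# Assumed ordinal scale, from most to least true. The labels appear in the
# article, but this ordering (especially MISLEADING's position) is a guess.
ORDINAL = ["TRUE", "MOSTLY_TRUE", "MIXED", "MISLEADING", "MOSTLY_FALSE", "FALSE"]
# Assumed categories that sit outside the truth scale.
ORTHOGONAL = {"OPINION", "UNVERIFIABLE"}


def effectively_concordant(candidate: Optional[str], reference: str) -> bool:
    """Same conclusion, allowing a one-notch shift or an orthogonal category swap."""
    if candidate is None:
        # Assumption: an unparseable response counts as a disagreement.
        return False
    if candidate == reference:
        return True
    if candidate in ORTHOGONAL and reference in ORTHOGONAL:
        return True
    if candidate in ORDINAL and reference in ORDINAL:
        return abs(ORDINAL.index(candidate) - ORDINAL.index(reference)) <= 1
    return False


def concordance_rates(pairs: Iterable[Tuple[Optional[str], str]]) -> Tuple[float, float]:
    """Return (raw match rate, effective concordance) over (candidate, reference) pairs."""
    pairs = list(pairs)
    raw = sum(c == r for c, r in pairs) / len(pairs)
    effective = sum(effectively_concordant(c, r) for c, r in pairs) / len(pairs)
    return raw, effective
```

Under these assumptions, a “MIXED” against a reference “MOSTLY_TRUE” counts as concordant, while a “FALSE” against a reference “TRUE” does not.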
§ 4 · The Finding: The retrieval cliff is real
Here’s the part of the data that genuinely surprised us. Models that disagreed with Sonnet were doing so for fundamentally different reasons depending on whether they had web search.
For the frontier models (Haiku, GPT-5.4-mini): disagreements were mostly about calibration. Both models found the same facts. They just labeled them differently — calling a post “MIXED” where Sonnet called it “MOSTLY_TRUE,” or “MISLEADING” where Sonnet said “MOSTLY_FALSE.” Same factual reasoning, slightly different scoring rubric.
For the offline models (gpt-oss, gemma4): disagreements were about whether real events occurred at all. When a post mentioned something that happened after the model’s training cutoff, the offline model couldn’t verify it. So it defaulted to “FALSE.” Over and over.
95% of gpt-oss’s “FALSE” verdicts landed on rows where the retrieval-equipped reference identified real, documented events.
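The denominator of that statistic matters: it is the candidate's “FALSE” verdicts, not all rows. A small Python sketch of the calculation follows. The row fields `candidate` and `reference_found_real_event` are hypothetical names chosen for illustration; the study's actual data layout is not described.

```python
from typing import Iterable, Mapping, Optional


def false_on_real_event_share(rows: Iterable[Mapping]) -> Optional[float]:
    """Share of the candidate's FALSE verdicts that fall on rows where the
    reference identified a real, documented event.

    Each row is assumed to carry two hypothetical fields:
      - "candidate": the candidate model's verdict label
      - "reference_found_real_event": bool, set from the reference's analysis
    """
    false_rows = [r for r in rows if r["candidate"] == "FALSE"]
    if not false_rows:
        return None  # the share is undefined when there are no FALSE verdicts
    hits = sum(1 for r in false_rows if r["reference_found_real_event"])
    return hits / len(false_rows)
```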
Three real examples from the study:
The Vivek Ramaswamy primary win. Sonnet correctly identified that Ramaswamy won the 2026 Ohio GOP gubernatorial primary and that Amy Acton is the Democratic nominee. gpt-oss insisted: “Ramaswamy is a presidential candidate with no primary win, he is not running for Ohio governor, and Amy Acton is not on the ballot.” gpt-oss was using 2024 training data and couldn’t update.
The FireAid $100M concert. Sonnet confirmed the FireAid wildfire-aid concert raised approximately $100M, citing a House Judiciary Committee report. gemma4 declared: “There is no evidence that a ‘Pacific Palisades Fire Aid’ concert raised $100 million... No such large-scale fund or corresponding official statements exist in the public record.” The concert is a real, widely-covered event.
Virginia Democrats’ redistricting spending. Sonnet confirmed the approximately $62M figure from VPAP data. gemma4 said: “No official financial disclosures or reports from credible news organizations corroborate such an enormous expenditure.” VPAP data is public and well-documented.
When a fact-checker can’t look things up, it can’t tell “claim about a real event” from “fabricated claim.” It just denies both.
This isn’t a quirk to be tuned out with better prompting. It’s a structural limitation. An offline fact-checker is reading a news feed through training-data memories that may be a year or two stale. For current-events fact-checking, that’s disqualifying.
§ 5 · Five Days Later: The cheap model that almost wasn’t
The original study flagged one model as the most frustrating result: Haiku 4.5. On accuracy it was essentially indistinguishable from Sonnet — zero factual errors against the reference on every substantive disagreement we adjudicated — at a fraction of Sonnet’s cost. But it had a disqualifying flaw: roughly one in eleven responses came back as unparseable output. Sometimes malformed JSON; more often the model declining to answer in conversational prose instead of returning a structured verdict. An 8.9% failure rate is not something you put in front of users.
We treated that as an engineering problem rather than a capability gap — and five days after the original study, we fixed it. The fix was a parser-side recovery cascade: when a response fails to parse cleanly, a first pass tries to extract the verdict structurally even from broken JSON; if that finds nothing, a second pass recognizes a prose refusal and converts it into a clean “UNVERIFIABLE” verdict with the refusal preserved. No model change, no retraining — just better handling of the output the model already produces.
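The cascade described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Perclaim's parser: the `verdict` field name, the label list, the refusal phrases, and the `recovered_by` / `refusal_text` keys are all hypothetical.

```python
import json
import re

# Assumed label set; the article mentions these verdicts but not the full schema.
VERDICTS = {"TRUE", "MOSTLY_TRUE", "MIXED", "MISLEADING", "MOSTLY_FALSE",
            "FALSE", "OPINION", "UNVERIFIABLE"}
# Hypothetical field name for the verdict inside the model's JSON output.
VERDICT_FIELD = re.compile(r'"verdict"\s*:\s*"([A-Z_]+)"')
# Hypothetical phrases used to recognise a prose refusal.
REFUSAL_HINTS = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")


def parse_verdict(raw: str) -> dict:
    """Parse a model response, falling back through two recovery passes."""
    # Clean path: the response is valid JSON with a known verdict.
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and data.get("verdict") in VERDICTS:
            return {"verdict": data["verdict"], "recovered_by": None}
    except json.JSONDecodeError:
        pass

    # First pass: pull the verdict field out of broken JSON structurally.
    match = VERDICT_FIELD.search(raw)
    if match and match.group(1) in VERDICTS:
        return {"verdict": match.group(1), "recovered_by": "structural"}

    # Second pass: treat a prose refusal as UNVERIFIABLE, keeping the text.
    lowered = raw.lower().replace("\u2019", "'")  # normalise curly apostrophes
    if any(hint in lowered for hint in REFUSAL_HINTS):
        return {"verdict": "UNVERIFIABLE", "recovered_by": "refusal",
                "refusal_text": raw.strip()}

    # Nothing recoverable: this is what would count as a null verdict.
    return {"verdict": None, "recovered_by": None}
```

Recording which pass produced the verdict keeps recovered responses distinguishable from clean ones when failure rates are measured later.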
Haiku 4.5 null-verdict rate across three replays (original study, prompt fix, recovery cascade)
Original (14 May, pre-fix): 8.9%
Prompt fix (19 May, intermediate): 5.9%
Recovery cascade (19 May, final): 1.6%
The failure rate dropped from 8.9% to 1.6%, an 82% reduction. At that level, Haiku is reliable enough to deploy. It now runs in production as Perclaim’s cross-vendor backup: if the primary model has a transient error, the check fails over to Haiku rather than to a second model from the same vendor, so an outage or a systematic quirk at one provider doesn’t take the whole system down.
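The failover behaviour can be pictured with a short Python sketch. It assumes each model is wrapped in a callable that takes a post and returns a verdict dict; `TransientModelError`, `check_with_failover`, and the `served_by` key are hypothetical names, and the production routing logic may well differ.

```python
from typing import Callable, Dict

# Assumed shape: a checker takes the post text and returns a verdict dict.
Checker = Callable[[str], Dict[str, str]]


class TransientModelError(Exception):
    """Hypothetical marker for retryable provider failures (timeouts, 5xx, rate limits)."""


def check_with_failover(post: str, primary: Checker, backup: Checker) -> Dict[str, str]:
    """Run the primary checker; on a transient error, use the other vendor's model."""
    try:
        result = primary(post)
        served_by = "primary"
    except TransientModelError:
        # Only transient errors trigger failover; other exceptions propagate.
        result = backup(post)
        served_by = "backup"
    return {**result, "served_by": served_by}
```

Tagging each result with the model that served it makes it possible to check later how often the backup was actually used.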
§ 6 · The Models: What each one is good for
Combining accuracy, cost, and reliability data, here’s where each model landed. Per-check cost is shown relative to the production model: we run GPT-5.4-mini at roughly a nickel a check, and everything else is expressed as a multiple of that.
Production primary: GPT-5.4-mini
OpenAI · with hosted web search
Effective concordance with Sonnet: 92.7%
Per-check cost (baseline): 1×
Failure rate: 0%
The model running Perclaim today, across both the free and Pro tiers. Reliable, cost-efficient, and on manually-adjudicated hard cases, never factually wrong against Sonnet — just slightly less precise on calibration. This is the production default.
Production backup: Claude Haiku 4.5
Anthropic · with web search
Effective concordance with Sonnet: 96.5%
Per-check cost (rel.): ~2×
Failure rate (post-fix): 1.6%
The biggest surprise of the evaluation. On accuracy, indistinguishable from Sonnet — zero factual errors against the reference on the substantive disagreements we adjudicated. Once its output-reliability problem was fixed (§ 5), it became the obvious cross-vendor backup: nearly Sonnet-grade judgment, a fraction of the cost, and a different vendor than the primary for resilience.
Reference model: Claude Sonnet 4.6
Anthropic · with dynamic-filter web search
Effective concordance: 100% (reference baseline)
Per-check cost (rel.): ~6×
Failure rate: 0%
The most accurate model we tested, and the yardstick the others are measured against. On 11 head-to-head substantive disagreements with GPT-5.4-mini, Sonnet was correct in 6 and never wrong. It cites specific fact-checkers (Lead Stories, PolitiFact, Snopes) with specific dates. We keep it as our evaluation reference rather than a production model — at several times the per-check cost, it doesn’t fit a sustainable consumer price point, and Haiku delivers comparable judgment for far less.
Not viable today: gpt-oss:120b
Open-weight · local hardware · no retrieval
Effective concordance with Sonnet: 76.8%
Per-check cost: power (cost is electricity only)
Hard failure rate: 6%
A capable reasoning model with no way to learn about events after its training cutoff. 95% of its “FALSE” verdicts in this study were on posts that contained real, documented events. Cost is essentially free at runtime, but the systematic false-negative problem disqualifies it for current-events fact-checking. Would require a separate web-retrieval system to be built before it could be considered.
Not viable today: gemma4:26b
Open-weight · local hardware · no retrieval
Effective concordance with Sonnet: 63.2%
Per-check cost: power (cost is electricity only)
Total failure rate: 17%
Multimodal but offline. Same retrieval problem as gpt-oss, plus a much higher overall failure rate driven by JSON parse failures (22 of 152 rows) and hard errors. Same conclusion: not viable without a separate web-retrieval system.
§ 7 · The Honest Bit: What this study isn’t
We’re publishing this because we think it’s a useful picture, but it has real limits.
It’s a sample of 157 posts. Topically heavy on US politics (Trump administration, Iran, elections). Models may calibrate differently on science, health, finance, or local news.
Sonnet isn’t ground truth. It’s the most accurate model among the five we tested. Using it as the reference biases the matrix toward models that calibrate similarly. We show the decomposition (calibration vs. substantive disagreement) so this is visible.
The adjudicator was another LLM. Manual case-by-case review was performed by an AI assistant with web-search verification. Independent human fact-checker adjudication would strengthen the calibration check. We list this as future work.
Each model was run once. Some fraction of observed disagreement is run-to-run sampling noise, not a real difference. A multi-replay variance study is a future work item. (The Haiku follow-up made this concrete: of eleven failures in one run, seven recovered on their own in the next, with no intervention.)
Cost numbers are eval-derived. They come from evaluation runs, not production traffic, and we report them only as rough relative multiples rather than precise figures.
§ 8 · What This Means: For Perclaim users, in plain terms
If you’re using Perclaim today, the model checking your posts is GPT-5.4-mini — a frontier model with web search. Our evaluation says that’s a defensible choice: it reaches the same factual conclusions as the best available model 93% of the time, makes no factual errors against that reference on hard cases, and runs at a cost that lets us hold the free tier open. If it ever has a transient problem, your check fails over to Claude Haiku 4.5 — a different vendor’s model, with near-identical accuracy in our tests — so a hiccup at one provider doesn’t leave you without an answer.
We looked hard at whether a more expensive model should sit behind a Pro tier, and decided against it: the most accurate model we tested costs several times more per check without being more accurate on the cases that actually matter. It is tighter on calibration, not on facts. So Pro and free run the same strong model. What Pro buys you is higher usage limits and the surrounding features, not a different brain behind the verdict. We’d rather put a frontier model in front of everyone than gate accuracy behind a paywall.
And if you’ve wondered whether an “open-source” or “self-hosted” AI could do this work: not yet. Not on current events. Until we build the architecture for a model to search the web independently of OpenAI or Anthropic, the offline candidates will continue to deny real events they have no way to learn about. We’re working on it.