Devlog · 2026 · ProvenanceBench
the default RAG metric scores a correct "i don't know" as NaN — so i built a benchmark where it's a pass
in high-stakes domains, a confident wrong answer from a RAG system isn't an embarrassment — it can be a compliance event. so when the documents don't support an answer, the system has to say so: "i don't know" is the required output, not a failure.
here's the problem. the default RAG faithfulness metric, the one in RAGAS, returns NaN for a correct refusal. not a low score. not a pass. NaN. and it's on purpose. from the maintainer, closing issue #794:
"this is intentional. since the system refuses to answer it does not make sense to score it … therefore it's NaN."
a refusal produces no grounded statements, so faithful_statements / num_statements divides by zero and the metric gives up. which means: the one behavior my domain most requires is invisible to the standard metric. that's backwards.
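a minimal sketch of where that NaN comes from, assuming the score is nothing more than the ratio above. `faithfulness_ratio` is a hypothetical stand-in written for this post, not the RAGAS implementation.

```python
import math


def faithfulness_ratio(statement_supported):
    """Hypothetical stand-in: share of extracted statements the context supports.

    statement_supported holds one boolean per statement found in the answer.
    """
    num_statements = len(statement_supported)
    if num_statements == 0:
        # a refusal yields no statements, so the ratio is 0/0: undefined
        return math.nan
    return sum(statement_supported) / num_statements


answered = faithfulness_ratio([True, True, False])  # two of three statements supported
refused = faithfulness_ratio([])                    # a refusal: nothing to score

print(answered, refused)
```

the refusal doesn't land at 0.0 or 1.0, it falls out of the metric entirely, so any average over a test set has to either drop it or propagate the NaN.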
so i built ProvenanceBench — a faithfulness + justified-abstention benchmark for regulated documentation, where a correct refusal is a first-class pass, and credited only when the system abstains for the right reason.
run it yourself in ten seconds — no install, no API key: make demo.
not the first refusal benchmark — here's what's new, and what isn't
this stands on work i cite, not around it. a regulator, a frontier lab, and a frontier paper already wrote down the rule: a system that doesn't know should say so. and the eval world has built pieces of the measurement — AbstentionBench scores abstention as correct, UAEval4RAG names six categories of unanswerable request, RefusalBench shows frontier models score below 50% on multi-document refusal, ALCE scores citation quality.
scoring refusal next to citations, per item, isn't new either — CReSt and Trust-Align already do that. what i haven't found is this specific slice: a regulated corpus where a justified refusal is a first-class pass and is credited only for the right reason — the taxonomy-matched one. that's an open gap as far as i can tell — an absence-of-evidence claim, not a proof — and it's what ProvenanceBench targets. synthetic data only — the method is the point, not the data — and every gold label is adversarially re-checked against the corpus before it ships (a benchmark about unsupported claims can't ship unsupported gold labels).
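the "right reason" rule can be sketched as a per-item check. this is an illustration under assumptions, not the harness code: the `Gold` and `Output` types, the field names, and the category labels (`underspecified`, `false_premise`, `not_in_corpus`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gold:
    answerable: bool
    reason: Optional[str] = None  # expected abstention category when not answerable


@dataclass
class Output:
    abstained: bool
    reason: Optional[str] = None  # category the system gave for abstaining
    answer_correct: bool = False  # set by an answer checker when the system answered


def item_passes(gold: Gold, out: Output) -> bool:
    """Pass/fail for one item: a refusal counts only with the matching category."""
    if gold.answerable:
        # refusing an answerable question is a failure, not a safe default
        return (not out.abstained) and out.answer_correct
    # unanswerable: the system must abstain AND name the gold category
    return out.abstained and out.reason == gold.reason


gold = Gold(answerable=False, reason="false_premise")
right_reason = item_passes(gold, Output(abstained=True, reason="false_premise"))
wrong_reason = item_passes(gold, Output(abstained=True, reason="not_in_corpus"))
answered_anyway = item_passes(gold, Output(abstained=False, answer_correct=True))

print(right_reason, wrong_reason, answered_anyway)
```

the design choice is that there is no partial credit for a refusal with the wrong category: a system that refuses everything with one stock reason can't pass more than the items that happen to carry that label.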
what broke when i ran it
i ran four systems on the same 72 cases: a naive baseline that always answers, the cite-or-refuse gate method (the "grounded" column), and two real LLMs (claude haiku and sonnet, via subscription; the harness is vendor-neutral). four things fell out, and i'm leading with them because they're the point. (the claude column below is haiku; sonnet is within a case or two of it on every row.)
| on 72 cases | naive | grounded | claude |
|---|---|---|---|
| overall correct | 0.29 | 0.67 | 0.93 |
| "joint score" (gameable) | 0.70 | 0.22 | 0.94 |
| abstains for right reason | 0.00 | 0.28 | 0.78 |
| over-refuses answerable | 0.00 | 0.81 | 0.00 |
| citations faithful (judge) | — | — | 0.92 |
1. the obvious score is gameable. the natural "joint score" (the field's convention, weighting answer-accuracy 0.7) rewards a system that never abstains. the always-answer baseline beats the grounded one on the joint — 0.70 to 0.22 — while scoring 0.29 against the grounded system's 0.67 on "did it do the right thing on every item." so the headline can't be the joint. a benchmark about knowing when to stop can't use a metric you win by never stopping. and this isn't unique to my benchmark — BenchJack (Dawn Song et al., 2026) shows agents can score near-perfect on most agent benchmarks without solving a single task, and surveys of agent-as-a-judge evaluation name the same gameability as an open problem. eval gets gamed the moment looking-right and being-right come apart.
2. knowing when to refuse is far easier than knowing why. both models abstain on ~89% of the cases they should, but name the right reason only ~78–80% of the time. they can tell they shouldn't answer; they can't always say whether the question was underspecified, built on a false premise, or simply not in the docs. and scale barely helps: a right-reason gap between the two models that looked real on an earlier 45-case run (0.71 vs 0.82) collapsed to about one case when i grew the set to 72. it was small-sample noise. (the grounded baseline is far worse: it abstains 91% of the time but gets the reason right only 28% of the time.) this reproduces RefusalBench's "detection ≫ categorization" and its "neither scale nor extended reasoning improves refusal", on real models, in a regulated setting.
3. mis-calibration runs both ways. the naive system over-answers everything. the cite-or-refuse gate over-refuses — 0.81 on naturally-phrased answerable questions, because its conservative guard trips on ordinary words the corpus doesn't happen to use. the LLMs are the best-calibrated of the lot, but they still over-answer a handful of cases they should have refused — most of them underspecified, the category RefusalBench and UAEval4RAG both rate hardest. you have to test where a behavior should occur and where it shouldn't, or you only see half the failure.
4. when it answers, the citations hold. the faithfulness judge confirmed nearly every answer cited a span that actually supported the claim (sonnet 26/26, haiku 23/25). so in this regulated setting the failure mode isn't bad citations — it's the abstention decision: when to stop and why. that's the thing worth measuring, and the thing the default metric throws away.
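findings 1 and 3 can be reproduced on a toy set. this is a sketch under assumptions: it treats the joint as a plain weighted sum (0.7 on answer accuracy, the remaining 0.3 on abstention accuracy), which may not match the harness formula, it ignores the right-reason check, and the records are made up for illustration, not benchmark data.

```python
def summarize(records, w=0.7):
    """records: list of (answerable, abstained, answer_correct) tuples.

    Assumes the set contains at least one answerable and one unanswerable item.
    """
    answerable = [r for r in records if r[0]]
    unanswerable = [r for r in records if not r[0]]

    answered_right = sum(1 for _, abstained, ok in answerable if not abstained and ok)
    refused_right = sum(1 for _, abstained, _ in unanswerable if abstained)

    answer_acc = answered_right / len(answerable)
    abstain_acc = refused_right / len(unanswerable)
    return {
        # weighted sum: mostly driven by answer accuracy
        "joint": w * answer_acc + (1 - w) * abstain_acc,
        # per-item view: did the system take the right action on each item?
        "overall": (answered_right + refused_right) / len(records),
        # calibration failures in each direction
        "over_refuse": sum(1 for _, abstained, _ in answerable if abstained) / len(answerable),
        "over_answer": 1 - abstain_acc,
    }


# toy set: 3 answerable items, 7 unanswerable ones
always_answer = [(True, False, True)] * 3 + [(False, False, False)] * 7
cautious = (
    [(True, False, True)]            # answers one answerable item correctly
    + [(True, True, False)] * 2      # over-refuses the other two
    + [(False, True, False)] * 6     # abstains on six unanswerable items
    + [(False, False, False)]        # over-answers one
)

always_scores = summarize(always_answer)
cautious_scores = summarize(cautious)
print(always_scores)
print(cautious_scores)
```

on this toy set the always-answer system wins the joint while losing the per-item score, and the two error rates show the failure in each direction: one system only over-answers, the other mostly over-refuses.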
why i'm putting it in the open
i don't think the number that matters is a leaderboard win. these runs are two small models on a small synthetic corpus, and i'm saying so. what i think is worth shipping is the shape of the measurement: refusal as a passing answer, credited only for the right reason, on the kind of documents where that's the whole job — and a side-by-side that shows the standard metric can't score it.
the repo is the source of truth. it runs fully offline; the model side runs on a subscription with no API key; every prior-art claim is cited and re-verified; every gold label is checked against the corpus at load time. if it scores a refusal as a pass, it's because the system refused — for the right reason.
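a load-time gold check might look roughly like this. it's a sketch, not the repo's loader: the case schema (`id`, `answerable`, `doc_id`, `span`, `reason`), the verbatim-substring test, and the taxonomy labels are all assumptions made for illustration.

```python
# hypothetical abstention categories, for illustration only
TAXONOMY = {"underspecified", "false_premise", "not_in_corpus"}


def validate_gold(cases, corpus):
    """Return a list of problems found in the gold labels (empty means clean).

    cases: list of dicts in the assumed schema described above.
    corpus: dict mapping doc_id to the document text.
    """
    errors = []
    for case in cases:
        cid = case["id"]
        if case["answerable"]:
            doc = corpus.get(case["doc_id"])
            if doc is None:
                errors.append(f"{cid}: unknown doc_id {case['doc_id']!r}")
            elif case["span"] not in doc:
                # the gold supporting span must appear verbatim in the cited doc
                errors.append(f"{cid}: gold span not found in {case['doc_id']}")
        elif case.get("reason") not in TAXONOMY:
            # an unanswerable case needs a reason from the taxonomy
            errors.append(f"{cid}: missing or unknown abstention reason")
    return errors


corpus = {"policy-01": "Records must be retained for seven years after closure."}
cases = [
    {"id": "c1", "answerable": True, "doc_id": "policy-01", "span": "retained for seven years"},
    {"id": "c2", "answerable": True, "doc_id": "policy-01", "span": "retained for ten years"},
    {"id": "c3", "answerable": False, "reason": "not_in_corpus"},
    {"id": "c4", "answerable": False, "reason": "unsure"},
]

problems = validate_gold(cases, corpus)
for problem in problems:
    print(problem)
```

a loader built this way would refuse to run if `problems` is non-empty, so an unsupported gold label fails before any system is scored against it.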
the spec, the taxonomy, the harness, and every run's report:
github.com/shryu1994/provenance-bench