Devlog · 2026 · cite-or-refuse
the FDA and Anthropic wrote down the same rule my RAG runs on — and my own review caught it breaking that rule, twice
within about a year, three groups that don't talk to each other wrote down the same rule. a frontier lab put it in an engineering blog. a widely-cited paper put it in its abstract. and a drug regulator put it in its first warning letter to cite overreliance on AI.
the rule: a system that doesn't know should be able to say so — and you have to check that it actually does.
i built a small open-source thing on exactly that rule. then, running an adversarial review before publishing, i caught my own system breaking it. twice. this is that story, with the receipts — because a post about citation discipline that doesn't cite its own sources would be its own punchline.
the three who wrote it down
Anthropic's engineering team, in Demystifying evals for AI agents, says it almost as an instruction:
"give the LLM a way out, like providing an instruction to return 'Unknown' when it doesn't have enough information."
and then the part most people skip — the test, not just the behavior:
"Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization."
so: give it an out, and measure that it takes the out when it should. that second sentence is the whole game.
OpenAI's paper Why Language Models Hallucinate explains why the out is so hard to get: standard evals reward confident guessing and score "i don't know" the same as a wrong answer. a model that abstains looks worse on the leaderboard than a model that bluffs. the proposed fix is to credit calibrated abstention — make "i don't know" able to score points.
and then the one that isn't a blog or a paper. the FDA's first warning letter to cite overreliance on AI — issued to Purolea Cosmetics Lab, a contract manufacturer of homeopathic over-the-counter drugs, on April 2, 2026. read it closely and the striking thing is what it doesn't say. it doesn't say the AI was wrong. it says the company used AI to draft regulated documents and then had no one accountable for verifying the output was correct. the cited violation is 21 CFR 211.22(c) — the quality unit's responsibility — not a model defect.
three different vocabularies. one shape. the output has to be able to abstain, and a process has to verify it does. a lab calls it an eval. a paper calls it calibrated abstention. a regulator calls it accountability. it's the same rule.
the thing i built
cite-or-refuse is a grounded RAG with exactly one rule: cite a real source, or honestly say "not in the docs." never guess.
the part i actually cared about: i made refusing a first-class eval. honestly saying "i don't know" scores as a PASS, not a gap. every reply is forced into one of three shapes, which is what makes faithfulness checkable instead of vibes:
answer— a retrieved sentence addresses the question, with a citation.not_in_sources— the docs don't support an answer, so it says so.out_of_scope— the question isn't this assistant's job.
it runs fully offline. no API key, no network. synthetic data — a fictional product — because the method is the point, not the data.
that is, almost line for line, the rule those three authorities had each already written down. i'm not claiming i got there first — i didn't; all three predate this repo. which would make for a tidy "great minds" ending, except building on a rule and obeying it are not the same thing.
the two times my own system broke it
before publishing i ran an adversarial review — not on the corpus, on the system. i wanted to know where it lied. it found out, fast.
"can i password-protect a share link?"
the docs talk about share links. they never once mention passwords. a faithful system should refuse. mine cheerfully cited the generic share-link sentence — because the question and the sentence both contain "share" and "link," and that overlap was enough to score them as relevant. the system built to stop confident hallucination was confidently hallucinating, on its own demo.
so i patched it, and — this is the part that matters — i ran the review again. the second round caught that my patch was a patch, not a fix: it still hallucinated on undocumented sub-features, and my README was quietly overclaiming what the system could do. that second round is the entire reason i trust the project now.
root cause: the relevance score gave zero weight to the one word that made the question unanswerable — "password." retrieval rewarded the words that matched and never penalized the word that existed nowhere in the docs. exactly OpenAI's point, one layer down: the scoring rewarded a confident match and was blind to the missing piece.
the fix is two cheap gates: an undocumented-term gate (a question naming something that appears nowhere in the corpus gets refused — matched words don't get to drown out the unmatched one), plus light morphological normalization so the gate isn't fooled by plurals and word forms. verified after: zero confident hallucinations on the review's probes, zero over-refusals on cases that should answer. the failing paraphrases are pinned into the eval as fixed cases so they can't regress.
why three authorities and one small repo end up in the same place
in the kind of regulated documents nobody wants to read, the FDA letter isn't an abstraction — it's exactly the failure mode this is about. "i don't know" is not a weakness to engineer away; it's the control. and a control you can't measure isn't a control.
what the convergence actually tells you isn't that i was early — i wasn't; i built on a rule three authorities had already written down. it's that the easy part is agreeing with the rule. a lab, a paper, and a regulator each landed on it without coordinating, and i agreed with all of them. the hard part is the thing the FDA letter is really about and the thing my own review caught: checking that the system obeys the rule when no one is watching. the agreement is cheap. the verification is the work.
my system agreed with itself, too. then it broke its own rule twice, and only an adversarial pass made it honest. that's the whole point — not that i wrote the rule down, but that i tried to break it before anyone else could.
the code, the eval, and the full saga (including the first round's failure):
github.com/shryu1994/cite-or-refuse →