i shipped a RAG that refuses to answer — then caught it breaking its own rule

the most common way a RAG system fails in production isn't a wrong answer. it's a confident hallucination — a fluent, well-formatted reply to a question the documents don't actually answer. it looks right. it cites something. it's just made up.

so i built a small open-source thing called cite-or-refuse. one rule: cite a real source, or honestly say not in the docs. never guess. and — the part i actually care about — i made refusing a first-class eval. honestly saying "i don't know" scores as a PASS, not a gap.

every reply is forced into one of three shapes, which is what makes faithfulness checkable instead of vibes:

answer — a retrieved sentence actually addresses the question, with a citation.
not_in_sources — the docs don't support an answer. so it says so.
out_of_scope — the question isn't this assistant's job.

it runs fully offline. no API key, no network. synthetic data only — the corpus is a fictional product, because the method is the point, not the data.

the part i didn't plan for

before publishing, i ran an adversarial review on it. not on the corpus — on the system. i wanted to know where it lied.

it caught my own system confidently answering a question it had no source for. the exact failure the whole project exists to prevent.

"can i password-protect a share link?"

the docs talk about share links. they never once mention passwords. a faithful system should refuse. mine cheerfully cited the generic share-link sentence — because the question and the sentence share the words "share" and "link," and that was enough to score them as relevant.

so the thing built to stop confident hallucination was confidently hallucinating. on its own demo. that's the honest version of the story, and it's a better story than "i built a thing and it worked."

the fix wasn't the first fix

first pass, i patched it. felt fine. so i ran the review again — a second round, fresh eyes — and it caught that my fix was a patch, not a fix. it still hallucinated on undocumented sub-features, and my README was quietly overclaiming what the system could do.

that second round is the whole reason i trust the project at all.

root cause, once i actually found it: the relevance score gave zero weight to the one word that made the question unanswerable — "password." retrieval rewarded the words that matched and never penalized the word that didn't exist anywhere in the docs.

the real fix is two cheap gates working together:

an undocumented-term gate — if a question names something that appears nowhere in the corpus, refuse. don't let matched words drown out the unmatched one.
light morphological normalization so the gate doesn't get fooled by plurals and word forms.

verified: zero confident hallucinations on the review's adversarial probes, zero over-refusals on the cases that should answer. the eval is 11/11 — those 11 deliberately mix all three kinds, including the exact paraphrases the review caught it failing, pinned as N3–N6 so they can't regress. 20 offline tests around it.

why i'm writing this down

in the kind of regulated documents nobody wants to read, a confident wrong answer isn't an embarrassment — it's the failure mode that actually hurts someone. "i don't know" is a feature, and it has to be measurable.

the demo is small on purpose. the thing i wanted to show isn't a clever retriever. it's a way of working: make refusal a measurable requirement, then adversarially try to make your own system break its one rule — twice — before you let anyone else see it.

it caught me. that's the point.

code, the eval, and the whole writeup (including the saga above):

github.com/shryu1994/cite-or-refuse →