Writing

Devlogs on grounded RAG, refusal-as-eval, and honesty-engineering.

i ran my prompt-analysis tool on my own logs. 82% of my "questions" weren't me.

promptprint measures how you question your AI coding agent, 100% locally. I dogfooded it and found it was overcounting my questions ~5x — most of the log was machine traffic, not me. The fix, and why the tool now prints its own filter ratio (and leaves the residue it can't cleanly sort in plain sight) instead of reporting a tidier number.

the default RAG metric scores a correct "i don't know" as NaN — so i built a benchmark where it's a pass

ProvenanceBench: a faithfulness + justified-abstention benchmark for regulated docs, where a correct refusal is a pass — credited only for the right reason. What broke when I ran three systems, including a real model: the obvious score is gameable, and knowing when to refuse is far easier than knowing why.

the FDA and Anthropic wrote down the same rule my RAG runs on — and my own review caught it breaking that rule, twice

A regulator, a frontier lab, and a widely-cited paper converged on the same rule: a system that doesn't know should say so — and you have to verify that it does. cite-or-refuse runs on it; an adversarial review still caught two violations before launch.

i shipped a RAG that refuses to answer — then caught it breaking its own rule

Making refusal a first-class eval — and the adversarial review that caught the system confidently hallucinating on its own demo, twice, before launch.