Why output-only QA fails
A disengaged human produces labels that look exactly like engaged ones. A human with a frontier model open in the next tab produces labels that are, by construction, indistinguishable from genuine human judgment. Output review cannot catch this: the output is the disguise. The only place to verify quality is upstream of the label, in the human, at the moment of judgment.
How it's different
Vs attention checks: attention checks inspect the label and are gameable. Sensie inspects the human and cannot be defeated by going through the motions.
Vs inter-rater agreement (IAA): IAA is retrospective and can't tell you why a rater was off. Sensie is per-session and prospective.
Vs LLM-as-judge: model graders compare outputs to outputs. At the bottom of the eval stack is a human whose judgment is ground truth. Sensie verifies that the human actually gave it.
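
To make the inter-rater agreement point concrete, here is a minimal sketch of Cohen's kappa, one standard IAA statistic. It is not Sensie's method or part of the eval harness; the function name `cohens_kappa` and the two label lists are made up for illustration. The thing to notice is that the statistic is computed after labeling is finished and collapses everything into a single number, with no indication of which rater drifted or why.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two raters on the same eight items.
rater_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
rater_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

The two raters disagree on two of eight items, yet the result says nothing about whether either of them was paying attention on those items, or on the six where they happened to agree.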
What partners get
- Per-rater, per-session alignment score
- Pre-registered baseline study on your domain
- Open eval harness for reproducibility: github.com/sensie-app/sensie-eval-harness
Evidence
9 PhD-led research trials · 18,000+ sessions · 83.6% post-calibration accuracy · 2 granted US patents + 1 filing.