From Scores to Decisions - Oliver Hennhoefer

Anomaly detectors often stop at a score. The score may rank observations in a useful way, but it does not say what level of evidence should trigger an action, nor how often that action is expected to be wrong under a stated reference condition.

Conformal inference is useful here because it turns a score into a calibrated quantity. In anomaly detection, that often means converting a raw anomaly score \(s(x)\) into a p-value relative to the scores \(s_1, \dots, s_n\) of \(n\) calibration observations, where a larger score means a more anomalous observation:

\[
p(x) = \frac{1 + \sum_{i=1}^{n} \mathbf{1}\{s_i \ge s(x)\}}{n + 1}.
\]

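The resulting value is not magic: its validity rests on the assumptions behind the calibration set, chiefly that the calibration scores come from normal data that is exchangeable with new normal observations. Under those assumptions it gives the downstream decision a statistical meaning: flagging whenever the p-value is at most α raises a false alarm on a normal observation with probability at most α.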
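
The p-value formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `conformal_p_values` is hypothetical, and it assumes that larger scores mean more anomalous and that the calibration scores come from normal data.

```python
import numpy as np


def conformal_p_values(calibration_scores, test_scores):
    """Conformal p-values, assuming larger scores mean 'more anomalous'."""
    cal = np.sort(np.asarray(calibration_scores, dtype=float))
    test = np.atleast_1d(np.asarray(test_scores, dtype=float))
    n = cal.size
    # searchsorted(side="left") counts calibration scores strictly below each
    # test score, so n minus that count is the number with s_i >= s(x).
    n_greater_equal = n - np.searchsorted(cal, test, side="left")
    return (1 + n_greater_equal) / (n + 1)


calibration = [0.1, 0.2, 0.3, 0.4]
print(conformal_p_values(calibration, [0.35, 0.05, 0.9, 0.3]))
```

With n calibration scores the smallest attainable p-value is 1 / (n + 1), so the size of the calibration set bounds how strong the evidence can ever look.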

The Software Boundary

For research software, I like the boundary where a library accepts the model’s scores and returns calibrated evidence. That keeps the anomaly detector flexible: it can be a classical method, a neural model, or a domain-specific scoring rule. The calibration layer handles the part that ad hoc thresholds leave implicit.

This is the design direction behind nonconform: keep the scoring model separate, expose the calibration assumptions clearly, and make the final decision rule inspectable.
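
To make the boundary concrete, here is one way such an interface could look. It is a sketch under stated assumptions and not nonconform's actual API: `CalibratedEvidence` and `calibrate_scores` are hypothetical names, the scoring model is any callable that maps data to scores, and larger scores are assumed to mean more anomalous.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass(frozen=True)
class CalibratedEvidence:
    """What the calibration layer hands to whoever makes the decision."""
    p_values: np.ndarray  # one conformal p-value per test observation
    flagged: np.ndarray   # boolean decisions: p_value <= alpha
    alpha: float          # the level the decision rule used
    n_calibration: int    # calibration set size behind the p-values


def calibrate_scores(
    score_fn: Callable[[np.ndarray], np.ndarray],
    calibration_data: np.ndarray,
    test_data: np.ndarray,
    alpha: float = 0.1,
) -> CalibratedEvidence:
    """Score both data sets with the same model, then calibrate.

    Assumes calibration_data holds normal observations only.
    """
    cal = np.asarray(score_fn(calibration_data), dtype=float)
    test = np.asarray(score_fn(test_data), dtype=float)
    # For each test score, count the calibration scores at least as large.
    counts = (cal[None, :] >= test[:, None]).sum(axis=1)
    p_values = (1 + counts) / (cal.size + 1)
    return CalibratedEvidence(p_values, p_values <= alpha, alpha, cal.size)


# Any scoring rule fits the boundary; absolute value stands in for a model.
evidence = calibrate_scores(
    np.abs, np.array([-1.0, 0.5, 2.0, -0.2]), np.array([3.0, 0.1]), alpha=0.25
)
print(evidence)
```

Returning the level and the calibration set size alongside the decisions is the point of the design: the caller can see which rule produced each flag rather than inheriting a threshold buried in the model.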

Sequential Settings

The same issue appears when tests arrive over time. A single threshold can look reasonable in isolation and still behave poorly when repeated decisions accumulate. Online FDR procedures address that by controlling the error rate across a stream of hypotheses, adapting the available testing budget as evidence arrives.
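
As one illustration of how a testing budget can adapt, the sketch below implements LOND, a simple online FDR rule in which the test level at step t is a fixed share of the overall level multiplied by one plus the number of discoveries so far. LOND is swapped in here as an example and is not necessarily the procedure the author has in mind. The function name `lond` and the choice of a budget sequence proportional to 1/t² are assumptions, and the FDR guarantee depends on the conditions of the original analysis, such as independent p-values.

```python
import math


def lond(p_values, alpha=0.1):
    """LOND sketch: returns (levels, decisions) for a stream of p-values."""
    levels, decisions = [], []
    discoveries = 0
    for t, p in enumerate(p_values, start=1):
        # beta_t sums to alpha over t = 1, 2, ... because sum(1/t^2) = pi^2/6.
        beta_t = alpha * 6.0 / (math.pi ** 2 * t ** 2)
        # Each earlier discovery enlarges the level available at this step.
        alpha_t = beta_t * (discoveries + 1)
        reject = p <= alpha_t
        levels.append(alpha_t)
        decisions.append(reject)
        discoveries += int(reject)
    return levels, decisions


levels, decisions = lond([0.001, 0.5, 0.02, 0.0001], alpha=0.1)
print(decisions)  # [True, False, False, True]
```

In this stream the third p-value of 0.02 would pass a fixed 0.05 threshold but is not rejected, because the level available at that step is only about 0.0135. Returning the per-step levels alongside the decisions keeps that accounting visible.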

The common thread is simple: scores are useful, but decisions need accounting. Good APIs should make that accounting visible.