False Discovery Rate Control¶
FDR control is the decision layer for batch anomaly detection. Use it when you turn many conformal p-values into many anomaly flags and want to control the expected fraction of false flags among the points you investigate.
What is FDR and Why Does It Matter?¶
When you test many observations for anomalies, some will look anomalous by chance even if they are truly normal. If you tested 1,000 truly normal observations one by one at alpha = 0.05 using well-calibrated p-values, you would expect about 50 false positives (1,000 × 0.05) before any multiple-testing correction.
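The arithmetic above can be checked with a small simulation. This is a minimal sketch that assumes well-calibrated null p-values behave like Uniform(0, 1) draws; the variable names are illustrative and not part of nonconform.

```python
import random

n_tests = 1_000
alpha = 0.05

# Assumption: under the null, a well-calibrated p-value is Uniform(0, 1).
rng = random.Random(0)
null_p_values = [rng.random() for _ in range(n_tests)]

# Expected count of per-test false positives without any correction.
expected_false_positives = n_tests * alpha
# Count observed in this particular simulated draw.
observed_false_positives = sum(p <= alpha for p in null_p_values)

print(f"Expected false positives: {expected_false_positives:.0f}")
print(f"Observed in this draw:    {observed_false_positives}")
```

The observed count varies from draw to draw around the expected value, which is why the guarantees on this page are stated in expectation.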
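The false discovery proportion in one realized batch is the fraction of false positives among all observations you flag as anomalies (taken as 0 when nothing is flagged):
$$\mathrm{FDP} = \frac{\text{false positives}}{\max(\text{false positives} + \text{true positives},\ 1)}$$
More precisely, that displayed fraction is the realized false discovery proportion (FDP); FDR is the expected FDP over repeated data draws:
$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}]$$
An equivalent operational interpretation of the expected proportion is:
$$\mathrm{FDR} = \mathbb{E}\left[\frac{\text{investigations spent on false alarms}}{\text{total investigations}}\right]$$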
False Discovery Rate (FDR) control adjusts the selection threshold so that the expected false-positive proportion among discoveries stays below a target level, such as 5%, when the p-values and dependence assumptions are valid. This differs from controlling false positives per individual test: FDR controls the average error proportion among the points you actually flag.
Example
Suppose your pipeline flags 100 observations as anomalies with
alpha = 0.05 FDR control and the statistical assumptions hold.
- Target: expected false discovery proportion at or below 5%
- Realized false alarms in one batch can be lower or higher
Now compare this to an uncontrolled setup that flags 200 observations, where 50 are false positives:
- False positives: 50/200 = 25% realized FDP
- This means 1 in 4 investigations is wasted effort
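The realized FDP in the uncontrolled example can be computed directly when ground truth is available. The helper below is a hypothetical illustration (it is not nonconform's metrics API) and treats the FDP as 0 when nothing is flagged.

```python
def false_discovery_proportion(is_anomaly, flagged):
    """Realized FDP: false flags divided by total flags (0.0 if nothing is flagged)."""
    n_flagged = sum(1 for flag in flagged if flag)
    n_false = sum(
        1 for truth, flag in zip(is_anomaly, flagged) if flag and not truth
    )
    return n_false / max(n_flagged, 1)


# Uncontrolled example from the text: 200 flags, 50 of them false.
is_anomaly = [True] * 150 + [False] * 50
flagged = [True] * 200

fdp = false_discovery_proportion(is_anomaly, flagged)
print(f"Realized FDP: {fdp:.0%}")
```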
Quick Start¶
detector.select() is the recommended single-call entry point. It combines
p-value computation with the appropriate FDR-controlled selection procedure,
automatically dispatching to weighted selection when a weight_estimator is
configured:
detector.fit(X_train)
mask = detector.select(X_test, alpha=0.05)
For the weighted case with custom pruning:
from nonconform.enums import Pruning
mask = detector.select(
X_test,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
When you need raw p-values for custom downstream analysis (multi-alpha sweeps,
diagnostics, or a separately justified combination workflow), use
compute_p_values(...) plus SciPy BH:
from scipy.stats import false_discovery_control
p_values = detector.compute_p_values(X_test)
decisions = false_discovery_control(p_values, method="bh") <= 0.05
Note
detector.last_result is populated by the most recent
detector.compute_p_values(...) or detector.select(...) call.
See Weighted Conformal Selection below for
a complete runnable example.
Selection Entry Points¶
Primary (recommended): detector.select(X_test, alpha=...) - dispatches
automatically based on detector configuration; no manual result-bundle
handling required.
Advanced/low-level options (for custom workflows):
- Standard (exchangeable): apply BH directly via scipy.stats.false_discovery_control(...) to conformal p-values.
- Weighted (covariate shift with importance weights): weighted_false_discovery_control(result=...) or weighted_false_discovery_control_from_arrays(...).
Parameter Roles (delta vs alpha)¶
When using ConditionalEmpirical, keep these roles separate:
- delta: calibration confidence/failure budget inside the conditional p-value map.
- alpha: target FDR level in the final selection rule.
They do not need to be equal. A common pattern is to tune delta for p-value
calibration behavior and alpha for operational false discovery tolerance.
Guarantee Scope for BH-Style Selection¶
BH-style selection applied to conformal p-values has guarantees that depend on:
- how valid/calibrated those p-values are,
- exchangeability (or the relevant data-shift assumptions for weighted methods),
- and BH dependence assumptions (independence or PRDS).
For standard split conformal outlier p-values, Bates et al. prove the PRDS property needed for BH under their assumptions. This does not mean arbitrary post-processing is safe: shared calibration data can make generic p-value combination procedures invalid without additional justification.
In other words, the selection routine itself does not create validity from invalid inputs; it preserves guarantees under the assumptions above.
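For orientation, the standard split conformal outlier p-value studied by Bates et al. can be sketched in a few lines. This is a minimal illustration, assuming the convention that higher scores are more anomalous; the function name is hypothetical and this is not how nonconform computes p-values internally.

```python
def conformal_p_value(test_score, calib_scores):
    """Split conformal p-value, assuming higher score = more anomalous."""
    n = len(calib_scores)
    # Count calibration scores at least as extreme as the test score.
    n_at_least_as_extreme = sum(1 for s in calib_scores if s >= test_score)
    return (1 + n_at_least_as_extreme) / (n + 1)


# Illustrative calibration scores from held-out normal data.
calib_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.5, 0.3, 0.6, 0.25]

for score in (0.05, 0.55, 0.9):
    print(f"score={score}: p-value={conformal_p_value(score, calib_scores):.2f}")
```

Every test point is compared against the same calibration set, which is why the resulting p-values are dependent rather than independent. The smallest attainable p-value is 1 / (n + 1), so a small calibration set limits how many discoveries BH can make at a strict alpha.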
| Input situation | Recommended path |
|---|---|
| Standard exchangeable conformal p-values | detector.select(...) or SciPy BH on compute_p_values(...) |
| Weighted covariate-shift workflow | detector.select(...) with a weight_estimator so WCS is used |
| Arbitrary dependent or post-processed p-values | Do not assume BH validity without a separate justification |
| Streaming decisions over time | Use an online FDR method, not a fixed-batch BH shortcut |
Strict validation for weighted inputs
Weighted FDR routines fail fast on invalid inputs.
They raise ValueError when:
- score/weight arrays are not 1D numeric arrays of matching lengths
- any score/weight/p-value contains non-finite values
- any weight is negative
- total calibration weight is not strictly positive
- result.metadata["kde"] is present but malformed (missing keys, invalid shapes, non-monotone grid/CDF, or non-positive total weight)
from scipy.stats import false_discovery_control
from nonconform.fdr import (
weighted_false_discovery_control,
weighted_false_discovery_control_from_arrays,
)
# Standard BH selection from explicit p-values
cs_mask = false_discovery_control(result.p_values, method="bh") <= 0.05
# Strict WCS from cached result bundle
wcs_from_result = weighted_false_discovery_control(
result=result,
alpha=0.05,
)
# Strict WCS from explicit arrays
wcs_mask = weighted_false_discovery_control_from_arrays(
p_values=result.p_values,
test_scores=result.test_scores,
calib_scores=result.calib_scores,
test_weights=result.test_weights,
calib_weights=result.calib_weights,
alpha=0.05,
)
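If you assemble score and weight arrays yourself, it can help to run similar checks before calling the weighted routines. The function below is a hypothetical pre-check written for this guide; it mirrors some of the conditions listed above but is not part of nonconform and does not reproduce the library's exact validation.

```python
import math


def validate_weighted_inputs(scores, weights):
    """Hypothetical pre-check; not part of nonconform's API."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have matching lengths")
    for value in list(scores) + list(weights):
        if not math.isfinite(value):
            raise ValueError("scores and weights must be finite")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if sum(weights) <= 0:
        raise ValueError("total weight must be strictly positive")


validate_weighted_inputs([0.2, 0.7], [1.0, 0.5])  # valid inputs pass silently

try:
    validate_weighted_inputs([0.2, 0.7], [1.0, -0.5])
except ValueError as exc:
    print(f"Rejected: {exc}")
```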
Basic Usage¶
from nonconform import ConformalDetector, Split
from pyod.models.lof import LOF
detector = ConformalDetector(
detector=LOF(),
strategy=Split(n_calib=0.2),
aggregation="median",
seed=42,
)
detector.fit(X_train)
# FDR-controlled selection at 5% - single call
discoveries = detector.select(X_test, alpha=0.05)
print(f"FDR-controlled discoveries: {discoveries.sum()}")
Weighted Conformal Selection¶
When calibration and test distributions differ in a way that matches the
covariate-shift assumptions, configure a weight_estimator and call
select() - it automatically dispatches to Weighted Conformalized Selection
(WCS):
from nonconform import ConformalDetector, JackknifeBootstrap, logistic_weight_estimator
from nonconform.enums import Pruning
from pyod.models.iforest import IForest
detector = ConformalDetector(
detector=IForest(random_state=1),
strategy=JackknifeBootstrap(n_bootstraps=50),
weight_estimator=logistic_weight_estimator(),
seed=1,
)
detector.fit(X_train)
selected = detector.select(
X_test,
alpha=0.1,
pruning=Pruning.DETERMINISTIC,
seed=1,
)
print(f"Selected points: {selected.sum()} / {len(selected)}")
The pruning parameter controls the second-stage WCS pruning rule.
DETERMINISTIC uses a fixed rule. HOMOGENEOUS and HETEROGENEOUS are randomized
and use shared and independent randomness across test points, respectively.
Set seed for reproducible randomized pruning decisions.
Available Methods¶
For direct BH control on conformal p-values, use
scipy.stats.false_discovery_control. SciPy documents method="bh" for
Benjamini-Hochberg and method="by" for the more conservative
Benjamini-Yekutieli dependency-robust adjustment.
Benjamini-Hochberg (BH)¶
- Method: 'bh'
- Description: Most commonly used FDR control method
- Assumptions: Independent tests, or tests satisfying positive regression dependence on subsets (PRDS). In plain terms, PRDS means small p-values tend to occur together in a positively dependent way; it is stricter than generic "positive dependence." Standard split conformal outlier p-values satisfy PRDS in the Bates et al. setting.
- Usage: false_discovery_control(p_values, method='bh')
from scipy.stats import false_discovery_control
# BH control on conformal p-values
bh_adjusted = false_discovery_control(p_values, method='bh')
bh_discoveries = (bh_adjusted <= 0.05).sum()
print(f"BH discoveries: {bh_discoveries}")
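Thresholding BH-adjusted p-values at alpha selects the same points as the classical BH step-up rule. The pure-Python sketch below implements both views so the equivalence can be seen on a small example; the helper names are illustrative, and in practice SciPy's false_discovery_control is the function to use.

```python
def bh_step_up(p_values, alpha):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


def bh_adjust(p_values):
    """BH-adjusted p-values: running minimum of p_(i) * m / i from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
alpha = 0.05

via_step_up = bh_step_up(p_values, alpha)
via_adjusted = [p_adj <= alpha for p_adj in bh_adjust(p_values)]
print(f"Same selections: {via_step_up == via_adjusted}, discoveries: {sum(via_step_up)}")
```

Note that 0.039, 0.041, and 0.042 are all below 0.05 yet are not selected: BH compares each ordered p-value against its own rank-dependent threshold rather than against alpha itself.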
Setting FDR Levels¶
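Set the target FDR level with the alpha argument of detector.select(...). SciPy's false_discovery_control has no alpha argument; in the manual workflow, threshold the BH-adjusted p-values at alpha instead: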
from scipy.stats import false_discovery_control
# Different FDR levels
fdr_levels = [0.01, 0.05, 0.1, 0.2]
for alpha in fdr_levels:
    discoveries = (false_discovery_control(p_values, method="bh") <= alpha).sum()
    print(f"FDR level {alpha}: {discoveries} discoveries")
When to Use FDR Control¶
Use FDR control whenever you make more than one test-level anomaly decision. This includes both batch decisions made simultaneously and decisions accumulated over time.
Core Rule¶
- One test: a per-test threshold may be enough.
- Multiple tests: control FDR to bound the expected fraction of false discoveries among flagged points.
Why¶
- Controlled false discoveries: bounds expected false-positive proportion among detections.
- Practical power trade-off: usually more powerful than stricter family-wise error control.
- Scales to many tests: suitable for modern high-throughput anomaly workflows.
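The power trade-off against family-wise error control can be illustrated by comparing Bonferroni (a standard family-wise method) with BH on the same p-values. This is a small pure-Python sketch with made-up p-values; the function names are illustrative.

```python
def bonferroni_count(p_values, alpha):
    """Family-wise control: every p-value is compared against alpha / m."""
    m = len(p_values)
    return sum(p <= alpha / m for p in p_values)


def bh_count(p_values, alpha):
    """BH step-up: number of rejections is the largest rank k with p_(k) <= k * alpha / m."""
    m = len(p_values)
    k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank * alpha / m:
            k = rank
    return k


# Illustrative p-values: five fairly small ones and three clearly null ones.
p_values = [0.001, 0.004, 0.011, 0.013, 0.02, 0.3, 0.5, 0.7]
alpha = 0.05

n_bonferroni = bonferroni_count(p_values, alpha)
n_bh = bh_count(p_values, alpha)
print(f"Bonferroni discoveries: {n_bonferroni}")
print(f"BH discoveries:         {n_bh}")
```

BH selects more points here because its threshold grows with rank, while Bonferroni holds every test to the strictest threshold. The price is a weaker guarantee: an expected proportion of false discoveries rather than a bound on the probability of any false discovery.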
Sequential Note¶
If decisions are made over time (not a fixed batch), use procedures designed for online settings (see Online FDR Control for Streaming Data).
Integration with Conformal Prediction¶
select() dispatches automatically - standard or weighted - based on the
detector's configuration:
from nonconform import ConformalDetector, Split, logistic_weight_estimator
from nonconform.enums import Pruning
from pyod.models.lof import LOF
base_detector = LOF()
strategy = Split(n_calib=0.2)
# Standard: BH-style FDR selection on conformal p-values
standard_detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
seed=42,
)
standard_detector.fit(X_train)
standard_mask = standard_detector.select(X_test, alpha=0.05)
# Weighted: WCS (handles covariate shift via importance weights)
weighted_detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42,
)
weighted_detector.fit(X_train)
weighted_mask = weighted_detector.select(
X_test,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
print(f"Standard detections: {standard_mask.sum()}")
print(f"Weighted detections: {weighted_mask.sum()}")
Performance Evaluation¶
Evaluate the effectiveness of FDR control using nonconform's built-in metrics:
from scipy.stats import false_discovery_control
from nonconform.metrics import false_discovery_rate, statistical_power
def evaluate_fdr_control(p_values, true_labels, alpha=0.05):
    """Evaluate FDR control performance."""
    # Apply FDR control
    discoveries = false_discovery_control(p_values, method="bh") <= alpha
    # Calculate metrics using nonconform functions
    empirical_fdr = false_discovery_rate(true_labels, discoveries)
    power = statistical_power(true_labels, discoveries)
    return {
        'discoveries': discoveries.sum(),
        'empirical_fdr': empirical_fdr,
        'power': power
    }
# Example usage
results = evaluate_fdr_control(p_values, y_true, alpha=0.05)
print(f"Discoveries: {results['discoveries']}")
print(f"Empirical FDR: {results['empirical_fdr']:.3f}")
print(f"Statistical Power: {results['power']:.3f}")
Best Practices¶
1. Choose Appropriate FDR Level¶
- Very strict: alpha = 0.01 only when false positives are extremely costly (often too strict for exploratory workflows)
- Standard: alpha = 0.05 for most applications
- Exploratory / higher-recall: alpha = 0.10 when missing anomalies is costlier than investigating additional false positives
2. Method Selection¶
- Use detector.select(...) for most conformal workflows
- Use BH via SciPy for manual p-value thresholding workflows
- Use BY only when you need a conservative fallback for dependence that is not covered by the BH assumptions and you accept reduced power
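The power cost of BY can be quantified: BY at level alpha is equivalent to running BH at level alpha / c(m), where c(m) = 1 + 1/2 + ... + 1/m is the harmonic sum over the number of tests m. The sketch below computes that factor for an illustrative batch size; the function name is hypothetical.

```python
def by_correction_factor(m):
    """Harmonic sum c(m) = sum_{i=1}^{m} 1/i used by Benjamini-Yekutieli."""
    return sum(1.0 / i for i in range(1, m + 1))


m = 1_000      # number of tests in the batch (illustrative)
alpha = 0.05   # nominal FDR level

c_m = by_correction_factor(m)
effective_bh_level = alpha / c_m
print(f"c(m) = {c_m:.3f}, BH-equivalent level = {effective_bh_level:.5f}")
```

For a batch of 1,000 points the factor is roughly 7.5, so BY behaves like BH run at well under 1% instead of 5%. That is the reduced power referred to above.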
3. Combine with Domain Knowledge¶
from scipy.stats import false_discovery_control
# Incorporate prior knowledge about anomaly prevalence
expected_anomaly_rate = 0.02 # 2% expected anomalies
adjusted_alpha = min(0.05, expected_anomaly_rate * 2) # Adjust FDR level
discoveries = false_discovery_control(p_values, method="bh") <= adjusted_alpha
4. Monitor Performance¶
# Track the realized false discovery proportion per batch over time.
# Assumptions: data_batches yields (batch, true_labels_batch) pairs, and
# evaluate_fdr_control from "Performance Evaluation" is in scope.
fdr_history = []
for batch, true_labels_batch in data_batches:
    p_vals = detector.compute_p_values(batch)
    if len(true_labels_batch) > 0:  # If ground truth available
        metrics = evaluate_fdr_control(p_vals, true_labels_batch, alpha=0.05)
        fdr_history.append(metrics['empirical_fdr'])
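When reading such a history, keep in mind that each entry is a realized FDP and only the long-run average is what FDR control targets. The simulation below is a self-contained sketch under simplifying assumptions that are not taken from nonconform: independent Uniform(0, 1) null p-values, very small anomaly p-values, and a pure-Python BH helper with an illustrative name.

```python
import random


def bh_reject(p_values, alpha):
    """BH step-up rule returning a boolean rejection list."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


rng = random.Random(0)
alpha, n_null, n_anomalies, n_batches = 0.05, 80, 20, 500

fdp_history = []
for _ in range(n_batches):
    # Assumption: nulls are Uniform(0, 1); anomalies have tiny p-values.
    null_p = [rng.random() for _ in range(n_null)]
    anomaly_p = [rng.random() * 1e-4 for _ in range(n_anomalies)]
    rejected = bh_reject(null_p + anomaly_p, alpha)
    false_flags = sum(rejected[:n_null])  # first n_null entries are nulls
    fdp_history.append(false_flags / max(sum(rejected), 1))

mean_fdp = sum(fdp_history) / n_batches
print(f"Mean FDP over batches: {mean_fdp:.3f} (target alpha = {alpha})")
print(f"Largest single-batch FDP: {max(fdp_history):.3f}")
```

Individual batches can exceed alpha even when the procedure is working as intended, so a single high value in fdr_history is not by itself evidence of a calibration problem.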
Common Pitfalls¶
1. Inappropriate Independence Assumptions¶
- BH assumes independence or PRDS; arbitrary dependence between p-values is not covered
- If your p-values fall outside these assumptions, re-check them or move to a method designed for your dependence structure (for example BY)
2. Multiple Rounds of Testing¶
- Don't apply FDR control multiple times to the same data
- If doing sequential testing, use specialized methods
Online FDR Control for Streaming Data¶
For streaming settings where decisions must be made as data arrives, the optional online-fdr package provides sequential procedures that control FDR over a stream of hypotheses, subject to the assumptions listed under Statistical Assumptions for Online FDR.
Do not conflate this with martingale alarm thresholds such as
ville_threshold or restarted_ville_threshold in
Exchangeability Martingales: those provide
anytime false-alarm control on evidence processes, not FDR control across
multiple tested hypotheses.
Installation and Basic Usage¶
# Install FDR dependencies
# pip install nonconform[fdr]
from online_fdr.investing.alpha.alpha import Gai
# Example with streaming conformal p-values
def streaming_anomaly_detection(data_stream, detector, alpha=0.05):
    """Online FDR control for streaming anomaly detection."""
    # Initialize online FDR method
    # GAI: alpha-investing style online FDR control
    online_fdr = Gai(alpha=alpha, wealth=alpha / 2)
    discoveries = []
    for batch in data_stream:
        # Get p-values for current batch
        p_values = detector.compute_p_values(batch)
        # Apply online FDR control
        for p_val in p_values:
            decision = online_fdr.test_one(float(p_val))
            discoveries.append(decision)
    return discoveries
LORD (Levels based On Recent Discovery) Method¶
from online_fdr.investing.lord.three import LordThree
# LORD 3: alpha allocation adapts over the testing stream
lord_fdr = LordThree(alpha=0.05, wealth=0.04, reward=0.05)
# Process streaming data with temporal adaptation
for t, (batch, p_values) in enumerate(stream_with_pvalues):
    for p_val in p_values:
        # LORD adapts rejection threshold based on recent discoveries
        reject = lord_fdr.test_one(float(p_val))
        if reject:
            print(f"Anomaly detected at time {t} with p-value {p_val:.4f}")
Statistical Assumptions for Online FDR¶
Key Requirements:
- Independence assumption: Test statistics should be independent or satisfy specific dependency structures
- Sequential testing: Methods designed for sequential hypothesis testing scenarios
- Temporal stability: Underlying anomaly detection model should be reasonably stable
When NOT to use online FDR:
- Strong temporal dependencies in p-values without proper correction
- Concept drift affecting p-value calibration
- Non-stationary data streams requiring model retraining
Best practice: Combine with windowed model retraining and exchangeability monitoring for robust streaming anomaly detection.
References¶
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
- Benjamini, Y., & Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing under Dependency. The Annals of Statistics, 29(4), 1165-1188.
- Bates, S., Candès, E., Lei, L., Romano, Y., & Sesia, M. (2023). Testing for Outliers with Conformal p-values. The Annals of Statistics, 51(1), 149-178.
- Jin, Y., & Candès, E. J. (2023). Model-free Selective Inference Under Covariate Shift via Weighted Conformal p-values. Biometrika, 110(4), 1090-1106.
- SciPy documentation. scipy.stats.false_discovery_control.
Next Steps¶
- Learn about weighted conformal p-values for handling distribution shift
- Explore different conformalization strategies for various scenarios
- Read about best practices for robust anomaly detection