Weighted Conformal P-values¶
Handle covariate shift between calibration and test data when the weighted conformal assumptions are appropriate.
Executive Summary
When to use: Your test data comes from a different feature distribution than your calibration data, while the anomaly mechanism is still stable.
How it works: Weighted conformal prediction estimates how much the feature distributions differ and reweights the calibration evidence accordingly.
Quick start:
from nonconform import ConformalDetector, Split, logistic_weight_estimator
detector = ConformalDetector(
detector=your_detector,
strategy=Split(n_calib=0.3),
weight_estimator=logistic_weight_estimator(), # Add this
seed=42
)
Key assumption: Only the feature distribution P(X) changes. The
relationship between features and anomaly status, P(Y | X), must stay
stable. You also need independent samples, enough feature-support overlap,
and good enough weights. If distributions are too far apart, weighting can
become unstable and guarantees can degrade.
Overview¶
Weighted conformal p-values extend classical conformal prediction to covariate
shift scenarios [Jin & Candès, 2023; Tibshirani et al.,
2019]. The marginal feature distribution P(X) may change between
calibration and test data, but the conditional relationship P(Y | X) must stay
stable. You also need independent calibration/test samples and enough support
overlap between feature distributions; when shift is too extreme, estimated
density ratios become unstable and weighted conformal adjustment may fail.
The weighted p-values provide per-hypothesis calibration under the paper's assumptions. For multiple simultaneous anomaly decisions, pair them with Weighted Conformalized Selection (WCS); this is the documented finite-sample FDR path under covariate shift [Jin & Candès, 2023]. If the weights are learned rather than known, the guarantee is no longer automatically exact: it depends on weight quality, and the paper's estimated-weight bounds should be read as allowing some FDR inflation.
ConformalDetector(weight_estimator=...) estimates importance weights by
distinguishing calibration samples from the current test batch, then uses those
weights to compute weighted p-values and WCS selections.
Basic Usage¶
import numpy as np
from nonconform import ConformalDetector, Split, logistic_weight_estimator
from pyod.models.lof import LOF
# Initialize base detector
base_detector = LOF()
strategy = Split(n_calib=0.2)
# Create weighted conformal detector
detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42,
)
# Fit on training data and get weighted p-values
# By default, prediction refits the weight model for each batch
p_values = detector.fit(X_train).compute_p_values(X_test)
How It Works¶
The weighted conformal method works through the following steps:
1. Calibration¶
During fitting, the detector:
- Uses the specified strategy to split data and train models
- Computes calibration scores on held-out calibration data
- Stores calibration samples for later weight computation
2. Weight Estimation¶
During prediction, the detector:
- Fits the configured likelihood-ratio estimator (typically a probabilistic binary domain classifier) to distinguish calibration from test samples
- Uses predicted probabilities/scores from that estimator to compute importance weights
- Applies weights to both calibration and test instances
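The step from classifier output to importance weight can be sketched as follows. This is a minimal illustration of the standard density-ratio identity, not nonconform's internal code: the helper name `probabilities_to_weights` and its arguments are assumptions made for this example. Given an estimated probability p(x) that a sample came from the test batch, the weight is proportional to p(x) / (1 - p(x)), rescaled by the calibration-to-test size ratio. The rescaling assumes an unweighted classifier; with balanced class weights the prior correction would differ.

```python
def probabilities_to_weights(p_test, n_calib, n_test, eps=1e-9):
    """Turn P(sample is from the test batch | x) into density-ratio weights.

    Hypothetical helper: w(x) = p / (1 - p) * (n_calib / n_test).
    """
    weights = []
    for p in p_test:
        p = min(max(p, eps), 1.0 - eps)  # keep the odds finite
        weights.append(p / (1.0 - p) * (n_calib / n_test))
    return weights


# With equal set sizes, a point the classifier cannot place (p = 0.5) gets weight 1,
# a test-like point (p = 0.8) gets weight 4, a calibration-like point (p = 0.2) gets 0.25.
weights = probabilities_to_weights([0.5, 0.8, 0.2], n_calib=100, n_test=100)
```

Probabilities close to 0 or 1 produce very small or very large weights, which is why the estimators described later clip them.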
For explicit control of this state transition, you can precompute weights once and reuse them:
detector.fit(X_train)
detector.prepare_weights_for(X_test_shifted)
p_values = detector.compute_p_values(X_test_shifted, refit_weights=False)
By default, this reuse path verifies exact batch content identity. If you need
maximum throughput on very large batches and can guarantee your own batch identity
discipline, set verify_prepared_batch_content=False when constructing
ConformalDetector to validate only batch size.
3. Weighted P-value Calculation¶
The p-values are computed using weighted empirical distribution functions. By default, nonconform uses the classical (non-randomized) formula. The randomized variant [Jin & Candès, 2023] handles ties more gracefully:
# Randomized weighted p-value calculation (Jin & Candes 2023)
import numpy as np
def weighted_p_value(test_score, calibration_scores, calibration_weights, test_weight):
    """
    Calculate weighted conformal p-value with randomized tie handling.
    """
    # Sum the weights of calibration scores strictly greater than the test score
    weighted_greater = np.sum(calibration_weights[calibration_scores > test_score])
    # Handle ties: add random fraction of tied weights
    tied_weights = np.sum(calibration_weights[calibration_scores == test_score])
    u = np.random.uniform(0, 1)
    # Randomized formula: strictly greater + U * (tied + test weight)
    numerator = weighted_greater + u * (tied_weights + test_weight)
    denominator = np.sum(calibration_weights) + test_weight
    return numerator / denominator
Classical vs. Randomized
By default, Empirical() uses tie_break="classical" (non-randomized
formula). Valid values are "classical" and "randomized" (or
TieBreakMode.CLASSICAL / TieBreakMode.RANDOMIZED). None is not valid.
For randomized smoothing as shown above, use
Empirical(tie_break="randomized"). With small calibration sets,
randomized smoothing is less conservative than the classical formula and
adds run-to-run variability; set a seed when reproducibility matters.
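For contrast, the classical formula can be sketched by fixing the random fraction in the randomized function above to 1, so that tied calibration weights and the test weight are always counted in full. This is an illustrative sketch under that assumption, written in plain Python; `classical_weighted_p_value` is a hypothetical name and not the library's implementation.

```python
def classical_weighted_p_value(test_score, calib_scores, calib_weights, test_weight):
    """Non-randomized weighted conformal p-value (higher score = more anomalous)."""
    # Weight of calibration points scoring at least as high as the test point
    weighted_at_least = sum(
        w for s, w in zip(calib_scores, calib_weights) if s >= test_score
    )
    # The test point always counts itself, so the p-value is never zero
    numerator = weighted_at_least + test_weight
    denominator = sum(calib_weights) + test_weight
    return numerator / denominator


# With unit weights this reduces to the usual (1 + count) / (n + 1) p-value.
p = classical_weighted_p_value(3.5, [1.0, 2.0, 3.0, 4.0], [1.0] * 4, 1.0)
print(p)  # 0.4
```

Because the test weight is always added in full, the smallest attainable p-value is bounded away from zero, which is the source of the extra conservativeness with small calibration sets.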
When to Use Weighted Conformal¶
Covariate Shift Scenarios¶
Use weighted conformal detection when the shift is primarily in P(X) and not in P(Y|X), for example:
- Domain Adaptation: Training on one domain, testing on another with stable anomaly mechanism
- Sampling/Selection Shift: Deployment sampling differs from calibration sampling (population mix changes)
- Subgroup Mixture Shift: Different subgroup prevalence between calibration and test data
- Time-based Deployment Changes: Different time periods, only if the change is mostly covariate shift and P(Y|X) is still approximately stable
Do not treat generic temporal drift as automatically suitable for weighted conformal. If the anomaly mechanism itself changes (P(Y|X) shift), weighting alone is insufficient.
Examples Where Covariate Shift May Occur¶
# Example 1: Time-separated data
# Use this only if P(Y|X) is approximately stable across periods
detector.fit(X_train_2020)
p_values_2024 = detector.compute_p_values(X_test_2024)
# Example 2: Geographic shift
# Training on US data, testing on European data
detector.fit(X_us)
p_values_europe = detector.compute_p_values(X_europe)
# Example 3: Sensor/population shift
# Suitable when feature distribution changed but anomaly semantics stayed stable
detector.fit(X_before_drift)
p_values_after_drift = detector.compute_p_values(X_after_drift)
Comparison with Standard Conformal¶
# Standard conformal detector
standard_detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
seed=42
)
# Weighted conformal detector
weighted_detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42,
)
# Fit both on training data
standard_detector.fit(X_train)
weighted_detector.fit(X_train)
# Compare on shifted test data using the current one-step API
from nonconform.enums import Pruning
standard_mask = standard_detector.select(X_test_shifted, alpha=0.05)
weighted_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
print(f"Standard conformal detections: {standard_mask.sum()}")
print(f"Weighted conformal detections: {weighted_mask.sum()}")
Different Aggregation Strategies¶
The choice of aggregation method can affect performance under covariate shift:
# Compare different aggregation methods
from nonconform.enums import Pruning
aggregation_methods = [
"mean",
"median",
"maximum",
]
for agg_method in aggregation_methods:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation=agg_method,
        weight_estimator=logistic_weight_estimator(),
        seed=42,
    )
    detector.fit(X_train)
    wcs_mask = detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{agg_method}: {wcs_mask.sum()} discoveries")
Note: Aggregation is applied to the raw anomaly scores from each model before conformal p-values are computed. P-values are not averaged; the aggregated score is turned into a single p-value per point.
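The order of operations in the note can be illustrated with a small plain-Python sketch. The score matrix is made up for the example, and the layout (one row per fitted model, one column per test point) is an assumption about how to picture the data, not a description of nonconform's internal arrays.

```python
from statistics import median

# Hypothetical raw anomaly scores: one row per fitted model, one column per test point
scores_per_model = [
    [0.10, 0.90, 0.40],
    [0.20, 0.80, 0.50],
    [0.60, 0.70, 0.30],
]

# Aggregate column-wise so each test point ends up with a single score
aggregated = [median(column) for column in zip(*scores_per_model)]
print(aggregated)  # [0.2, 0.8, 0.4]
```

Each aggregated score is then compared against the calibration scores once, which yields one p-value per test point.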
Weight Estimators¶
nonconform provides two weight estimator factory functions for handling covariate shift:
logistic_weight_estimator¶
Uses logistic regression to estimate likelihood ratios between calibration and test distributions:
from nonconform import logistic_weight_estimator
detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42,
)
When to use:
- Linear or moderately complex covariate shifts
- High-dimensional data where interpretability matters
- Fast weight estimation is needed
- Default choice for most applications
Parameters:
- regularization: Regularization strength ('auto' or float C value)
- clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping.
- class_weight: Class weights for LogisticRegression (default: 'balanced')
- max_iter: Maximum iterations (default: 1000)
forest_weight_estimator¶
Uses random forest classification to estimate likelihood ratios:
from nonconform import forest_weight_estimator
detector = ConformalDetector(
detector=base_detector,
strategy=strategy,
aggregation="median",
weight_estimator=forest_weight_estimator(n_estimators=100, max_depth=10),
seed=42,
)
When to use:
- Complex, non-linear covariate shifts
- Feature interactions are important
- More robust to outliers in feature space
- When you have sufficient calibration data (hundreds+ samples)
Parameters:
- n_estimators: Number of trees (default: 100)
- max_depth: Maximum tree depth (default: 5)
- min_samples_leaf: Minimum samples in leaf (default: 10)
- clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping.
Comparison¶
# Compare weight estimators on complex shift
from nonconform import (
    ConformalDetector,
    Split,
    forest_weight_estimator,
    logistic_weight_estimator,
)
from nonconform.enums import Pruning
estimators = {
    'Logistic': logistic_weight_estimator(),
    'Forest': forest_weight_estimator(n_estimators=100),
}
for name, weight_est in estimators.items():
    detector = ConformalDetector(
        detector=base_detector,
        strategy=Split(n_calib=0.2),
        aggregation="median",
        weight_estimator=weight_est,
        seed=42,
    )
    detector.fit(X_train)
    wcs_mask = detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{name}: {wcs_mask.sum()} discoveries")
General recommendations:
- Start with logistic_weight_estimator() (faster, more interpretable)
- Switch to forest_weight_estimator() if:
- Distribution shift is highly non-linear
- You have >500 calibration samples
- Logistic weights show poor discrimination
BootstrapBaggedWeightEstimator¶
Wraps any base weight estimator with bootstrap bagging for improved stability in extreme imbalance scenarios. It is most relevant when the calibration set is much larger than the test batch, where standalone importance weights can become spiky and overly influential.
BootstrapBaggedWeightEstimator currently uses scoring_mode="frozen" (default and only supported mode). After fit(calibration_samples, test_samples), it can return stored weights only for that exact calibration/test batch pair; scoring arbitrary new batches requires refitting.
from nonconform import forest_weight_estimator
from nonconform.weighting import BootstrapBaggedWeightEstimator
# Bootstrap bagging with forest base (best for extreme imbalance)
weight_est = BootstrapBaggedWeightEstimator(
base_estimator=forest_weight_estimator(n_estimators=50),
n_bootstraps=50,
clip_quantile=0.05,
)
detector = ConformalDetector(
detector=base_detector,
strategy=Split(n_calib=1000),
aggregation="median",
weight_estimator=weight_est,
seed=42,
)
How It Works¶
Bootstrap bagging creates an ensemble of weight estimators:
- For each bootstrap iteration (n_bootstraps times):
    - Resample both calibration and test sets to balanced size
    - Fit the base estimator on the bootstrap sample
    - Score all original instances in that batch
    - Store log-weights for each instance
- After all iterations:
    - Aggregate using geometric mean (exp of mean log-weights)
    - Apply clipping to maintain bounded weights
Every instance receives exactly n_bootstraps weight estimates, ensuring
symmetric scoring coverage regardless of calibration/test set size ratios.
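The aggregation step described above (geometric mean via the mean of log-weights) can be sketched in plain Python. The helper name and the input layout are assumptions for illustration; resampling, fitting, and clipping are left out.

```python
import math


def aggregate_bootstrap_weights(weights_per_bootstrap):
    """Geometric mean across bootstrap rounds, computed in log space.

    weights_per_bootstrap: list of rounds, each a list of positive weights
    (one per instance). Hypothetical helper for illustration only.
    """
    n_rounds = len(weights_per_bootstrap)
    aggregated = []
    for instance_weights in zip(*weights_per_bootstrap):
        mean_log = sum(math.log(w) for w in instance_weights) / n_rounds
        aggregated.append(math.exp(mean_log))
    return aggregated


# One spiky estimate (100.0) is damped: the geometric mean of 1 and 100 is 10,
# whereas the arithmetic mean would be 50.5.
rounds = [[1.0, 2.0], [100.0, 2.0]]
agg = aggregate_bootstrap_weights(rounds)
```

Averaging in log space is what keeps a single extreme bootstrap estimate from dominating the final weight.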
When to Use¶
DO use BootstrapBaggedWeightEstimator when:
- Extreme imbalance: Large calibration set (>1000) with small test batches (<50)
    - Common in online/streaming detection
    - Example: 1000 calibration samples, 25 test instances
- High-stakes applications: Where weight quality is critical
    - Medical diagnosis with small patient batches
    - Fraud detection with limited transactions
    - Safety-critical systems
- Severe covariate shift: When base estimators produce extreme weights
DO NOT use for:
- Balanced or moderate imbalance: Little or no benefit (in the balanced benchmark below, weight standard deviation was 2.5% worse) for a 2-5x computational overhead
- Large test sets: Benefits diminish with larger batches
- Latency-sensitive production: Significant computational cost (up to 20-50x slower with a forest base estimator)
Performance Benchmarks¶
The numbers below are illustrative development benchmarks for the estimator configuration shown here. They are not statistical guarantees and should not be carried over to a new dataset without local validation.
Balanced Scenario (1000 calib vs 1000 test)¶
| Metric | Base | Bagged-50 | Improvement |
|---|---|---|---|
| Weight Std | 2.884 | 2.957 | -2.5% (worse) |
| Extreme Weights | 0 | 0 | No change |
| Time | 0.14s | 0.34s | 2.4x slower |
Verdict: Not recommended for balanced sets.
Extreme Imbalance (1000 calib vs 25 test)¶
| Metric | Logistic Base | Logistic Bagged-50 | Improvement |
|---|---|---|---|
| Weight Std | 1.604 | 0.841 | 48% better |
| Extreme Weights | 612 | 385 | 37% reduction |
| Recall | 0.067 | 0.200 | 3x better |
| Time | 0.14s | 0.34s | 2.4x slower |

| Metric | Forest Base | Forest Bagged-50 | Improvement |
|---|---|---|---|
| Weight Std | 0.153 | 0.259 | Slightly higher but stable |
| Extreme Weights | 599 | 0 | 100% elimination |
| Recall | 0.333 | 1.000 | Better in this benchmark |
| FDR | 0.000 | 0.190 | Acceptable trade-off |
| Time | 0.24s | 6.4s | 27x slower |
Verdict: Consider for extreme imbalance when the added runtime is acceptable. Validate on your own labeled data before adopting it as a default.
Configuration Parameters¶
n_bootstraps (default: 100):
- Number of bootstrap iterations
- Higher = more stable, but slower
- Recommended: 20-50 for small test batches, 50-100 for critical applications

clip_quantile (default: 0.05):
- Adaptive quantile-based clipping
- Clips to (quantile, 1-quantile) percentiles
- Use when weight distribution is unknown
- Set to None to disable clipping
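Quantile-based clipping can be sketched as below. This is an illustration under stated assumptions: the helper `clip_to_quantiles` is hypothetical, and the linear-interpolation quantile rule is one common choice, not necessarily the rule nonconform applies.

```python
def clip_to_quantiles(weights, clip_quantile=0.05):
    """Clip weights to the [q, 1 - q] empirical quantile range."""
    ordered = sorted(weights)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics
        position = q * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        return ordered[lower] * (1 - fraction) + ordered[upper] * fraction

    low, high = quantile(clip_quantile), quantile(1 - clip_quantile)
    return [min(max(w, low), high) for w in weights]


# Made-up weights with one tiny and one huge value
weights = [0.01, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 250.0, 1.1]
clipped = clip_to_quantiles(weights, clip_quantile=0.1)
```

In this example the outlying values 0.01 and 250.0 are pulled in to the 10th and 90th percentile values (0.5 and 5.0), while the weights in between are unchanged.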
Advanced Example: Streaming Detection¶
For online/streaming anomaly detection with small batches:
from nonconform import ConformalDetector, Split, forest_weight_estimator
from nonconform.enums import Pruning
from nonconform.weighting import BootstrapBaggedWeightEstimator
from pyod.models.iforest import IForest
# Configuration for small batch streaming
weight_est = BootstrapBaggedWeightEstimator(
base_estimator=forest_weight_estimator(n_estimators=50, max_depth=10),
n_bootstraps=50,
clip_quantile=0.05, # Adaptive clipping
)
detector = ConformalDetector(
detector=IForest(),
strategy=Split(n_calib=1000), # Large calibration set
aggregation="median",
weight_estimator=weight_est,
seed=42,
)
# Train on historical data
detector.fit(X_historical)
# Process small incoming batches
# stream_data is a placeholder for your own batch iterator
for X_batch in stream_data(batch_size=25):
    discoveries = detector.select(
        X_batch,
        alpha=0.1,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"Detected {discoveries.sum()} anomalies in batch of {len(X_batch)}")
Cost-Benefit Analysis¶
| Configuration | Time | Quality | Use Case |
|---|---|---|---|
| Logistic (Base) | 0.14s | Baseline | Standard balanced scenarios |
| Logistic + Bagging(50) | 0.34s | +48% weight stability | Moderate imbalance, quality focus |
| Forest (Base) | 0.24s | Good for non-linear | Standard scenarios |
| Forest + Bagging(50) | 6.4s | Best in this benchmark | Extreme imbalance, quality focus |
Recommendation: Use forest_weight_estimator + BootstrapBaggedWeightEstimator when:
- Calibration set is 40x larger than test batch (e.g., 1000:25)
- Missing anomalies is very costly
- Computational budget allows 20-50x overhead
- Online/streaming detection with small batches
Decision Guide¶
Which weight estimator should I use?
┌─ Is your test batch very small (<50) AND calibration large (>1000)?
│
├─ YES → BootstrapBaggedWeightEstimator(
│ forest_weight_estimator(50), n_bootstraps=50
│ )
│ Cost: High (6-7s), Quality: Best in the illustrative benchmark
│
└─ NO → Standard weight estimators
│
├─ Linear/moderate shift → logistic_weight_estimator()
│ Cost: Low (0.14s), Quality: Good
│
└─ Complex/non-linear shift → forest_weight_estimator(50)
Cost: Medium (0.24s), Quality: Better
Treat the cost and quality labels above as examples from one benchmark, not portable promises.
Strategy Selection¶
Different strategies can be used with weighted conformal detection:
from nonconform import CrossValidation, JackknifeBootstrap
# JaB+ strategy for stability
jab_strategy = JackknifeBootstrap(n_bootstraps=50)
jab_detector = ConformalDetector(
detector=base_detector,
strategy=jab_strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42
)
# Cross-validation strategy for efficiency
cv_strategy = CrossValidation(k=5)
cv_detector = ConformalDetector(
detector=base_detector,
strategy=cv_strategy,
aggregation="median",
weight_estimator=logistic_weight_estimator(),
seed=42
)
Weighted Conformalized Selection¶
Weighted conformal p-values provide per-hypothesis calibration under covariate shift assumptions. To obtain finite-sample FDR control across many test points, combine them with Weighted Conformalized Selection (WCS), whose guarantee holds under the independence, support-overlap, and weight-quality assumptions of Jin & Candès [Jin & Candès, 2023]:
from nonconform.enums import Pruning
wcs_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
print(f"WCS-selected anomalies: {wcs_mask.sum()} of {len(wcs_mask)}")
After any call to compute_p_values(), score_samples(), or select(), the detector caches
the relevant arrays (p_values, scores, weights) inside detector.last_result.
Passing this object to weighted_false_discovery_control avoids plumbing the raw
arrays manually.
For explicit array-first workflows, use:
from nonconform.enums import Pruning
from nonconform.fdr import (
weighted_false_discovery_control_from_arrays,
)
from nonconform.scoring import calculate_weighted_p_val
# WCS from precomputed p-values + arrays
wcs_from_arrays = weighted_false_discovery_control_from_arrays(
p_values=weighted_detector.last_result.p_values,
test_scores=weighted_detector.last_result.test_scores,
calib_scores=weighted_detector.last_result.calib_scores,
test_weights=weighted_detector.last_result.test_weights,
calib_weights=weighted_detector.last_result.calib_weights,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
# WCS with explicit empirical p-value computation
p_values_empirical = calculate_weighted_p_val(
scores=weighted_detector.last_result.test_scores,
calibration_set=weighted_detector.last_result.calib_scores,
test_weights=weighted_detector.last_result.test_weights,
calib_weights=weighted_detector.last_result.calib_weights,
tie_break="classical",
)
wcs_empirical = weighted_false_discovery_control_from_arrays(
p_values=p_values_empirical,
test_scores=weighted_detector.last_result.test_scores,
calib_scores=weighted_detector.last_result.calib_scores,
test_weights=weighted_detector.last_result.test_weights,
calib_weights=weighted_detector.last_result.calib_weights,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
)
Pruning Modes¶
The pruning parameter controls the second-stage WCS pruning rule [Jin &
Candès, 2023]:
Pruning.DETERMINISTIC¶
wcs_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42, # seed has no effect for deterministic mode
)
Behavior: Uses the deterministic WCS pruning rule without additional randomization.
When to use:
- Reproducibility is critical
- You don't want any randomness in selections
- Reporting results that must be exactly reproducible
Trade-off: May be slightly conservative (reject fewer hypotheses) compared to randomized methods.
Pruning.HOMOGENEOUS¶
wcs_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.HOMOGENEOUS,
seed=42, # controls randomization
)
Behavior: Draws a single uniform random variable and applies the same pruning randomization across test instances.
When to use:
- Default randomized method
- Want randomized WCS pruning under the method's assumptions
- Acceptable to have some randomness
Trade-off: Less conservative than DETERMINISTIC, but results vary across random seeds.
Pruning.HETEROGENEOUS¶
wcs_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.HETEROGENEOUS,
seed=42, # controls randomization
)
Behavior: Draws independent uniform random variables for each test instance. Provides the most flexible pruning randomization.
When to use:
- You want the less conservative randomized pruning option
- Research settings where run-to-run variance is acceptable
Trade-off: Highest variance across random seeds, and often less conservative than deterministic pruning.
Comparison of Pruning Methods¶
from nonconform.enums import Pruning
pruning_methods = [
Pruning.DETERMINISTIC,
Pruning.HOMOGENEOUS,
Pruning.HETEROGENEOUS
]
for pruning_method in pruning_methods:
    wcs_mask = weighted_detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=pruning_method,
        seed=42,
    )
    print(f"{pruning_method.name}: {wcs_mask.sum()} detections")
Expected relationship: In the WCS theory, every selection made under deterministic pruning is also made under either randomized variant, so deterministic pruning is usually the most conservative. The two randomized variants can differ by data set and seed.
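The three modes differ only in how the pruning randomization variable is drawn, which the following plain-Python sketch makes explicit. It does not implement WCS itself: the function name and the lowercase mode strings are hypothetical stand-ins for the `Pruning` members, and treating the deterministic mode as a variable fixed at 1 is an assumption based on the Jin & Candès description.

```python
import random


def draw_pruning_variables(mode, n_test, seed=None):
    """Return one pruning variable per test instance for the given mode."""
    rng = random.Random(seed)
    if mode == "deterministic":
        return [1.0] * n_test              # no randomization
    if mode == "homogeneous":
        shared = rng.random()              # one draw shared by all instances
        return [shared] * n_test
    if mode == "heterogeneous":
        return [rng.random() for _ in range(n_test)]  # independent draws
    raise ValueError(f"unknown pruning mode: {mode!r}")


xi_det = draw_pruning_variables("deterministic", 5)
xi_homo = draw_pruning_variables("homogeneous", 5, seed=42)
xi_hete = draw_pruning_variables("heterogeneous", 5, seed=42)
```

Since every randomized draw lies below 1, the randomized variants never apply a stricter pruning threshold than the deterministic rule, which matches the containment described above.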
Performance Considerations¶
Computational Cost¶
Weighted conformal detection has additional overhead:
- Weight estimation (fitting the configured weight estimator, for example logistic regression)
- Weighted p-value computation
import time
# Compare computation times
def time_detector(detector, X_train, X_test):
    # perf_counter is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()
    detector.fit(X_train)
    fit_time = time.perf_counter() - start_time
    start_time = time.perf_counter()
    detector.compute_p_values(X_test)
    predict_time = time.perf_counter() - start_time
    return fit_time, predict_time
# Standard vs Weighted timing
standard_fit, standard_pred = time_detector(standard_detector, X_train, X_test)
weighted_fit, weighted_pred = time_detector(weighted_detector, X_train, X_test)
print(f"Standard: Fit={standard_fit:.2f}s, Predict={standard_pred:.2f}s")
print(f"Weighted: Fit={weighted_fit:.2f}s, Predict={weighted_pred:.2f}s")
print(f"Overhead: {((weighted_fit + weighted_pred) / (standard_fit + standard_pred) - 1) * 100:.1f}%")
Memory Usage¶
Weighted conformal detection requires storing:
- Calibration samples for weight computation
- Calibration scores for p-value calculation

For large datasets, consider:
- Using a subset of calibration samples for weight estimation
- Implementing online/streaming versions
Best Practices¶
1. Validate Covariate Shift¶
Always check whether a feature-distribution shift is actually present, and then separately decide whether the anomaly mechanism is still stable:
# Use statistical tests to detect shift
from scipy.stats import ks_2samp
def detect_feature_shift(X_train, X_test):
    """Detect feature-distribution shift in individual features."""
    shift_detected = []
    p_values = []
    for i in range(X_train.shape[1]):
        statistic, p_value = ks_2samp(X_train[:, i], X_test[:, i])
        shift_detected.append(p_value < 0.05)
        p_values.append(p_value)
    print(f"Features with significant shift: {sum(shift_detected)}/{len(shift_detected)}")
    return shift_detected, p_values
shift_features, shift_p_values = detect_feature_shift(X_train, X_test_shifted)
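Testing every feature at the 0.05 level means that, with many features, a few will be flagged by chance even when nothing has shifted. One simple guard is a Bonferroni correction on the per-feature p-values, sketched here in plain Python. The helper name and the example p-values are made up for illustration; the list stands in for the second return value of detect_feature_shift.

```python
def bonferroni_shift_flags(feature_p_values, alpha=0.05):
    """Flag features whose KS p-value survives a Bonferroni correction."""
    n_features = len(feature_p_values)
    threshold = alpha / n_features  # each test gets an equal share of alpha
    return [p < threshold for p in feature_p_values]


# Hypothetical per-feature p-values
example_p_values = [0.001, 0.03, 0.20, 0.04]
flags = bonferroni_shift_flags(example_p_values)
print(flags)  # [True, False, False, False]
```

Three of the four features fall below 0.05 individually, but only the first survives the corrected threshold of 0.0125. Per-feature tests also miss shifts that appear only in the joint distribution, so a clean result here does not rule out covariate shift.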
2. Combine with Weighted Conformalized Selection¶
from nonconform.enums import Pruning
wcs_mask = weighted_detector.select(
X_test_shifted,
alpha=0.05,
pruning=Pruning.DETERMINISTIC,
seed=42,
)
print(f"WCS-controlled discoveries: {wcs_mask.sum()}")
3. Monitor Weight Quality¶
Extreme weights can indicate poor weight estimation:
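import numpy as np
def check_weight_quality(detector, X_test):
    """Check for extreme weights that might indicate poor estimation."""
    # Computing p-values populates detector.last_result with the estimated weights
    detector.compute_p_values(X_test)
    result = detector.last_result
    weights = np.concatenate([result.calib_weights, result.test_weights])
    # Rule of thumb: weights should typically be between 0.1 and 10
    # Extreme weights (< 0.01 or > 100) suggest problems
    n_extreme = int(np.sum((weights < 0.01) | (weights > 100)))
    print(f"Weight range: [{weights.min():.4f}, {weights.max():.4f}]")
    print(f"Extreme weights: {n_extreme}/{len(weights)}")
    return n_extreme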
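A complementary summary of weight concentration is the Kish effective sample size, (sum of w)^2 / sum of w^2, which tells roughly how many equally weighted calibration points the weighted set is worth. The sketch below is plain Python with made-up weights; `effective_sample_size` is a hypothetical helper, not part of nonconform.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# 100 equal weights behave like 100 samples
uniform = effective_sample_size([1.0] * 100)
# One dominant weight collapses the effective size to about 3.96
spiky = effective_sample_size([1.0] * 99 + [99.0])
```

An effective sample size far below the nominal calibration size suggests that a handful of points dominate the weighted p-values, which is a cue to revisit clipping, the weight estimator, or the support-overlap assumption.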
4. Use Appropriate Base Detectors¶
Some detectors work better with weighted conformal:
- Good: Distance-based methods (LOF, KNN) that are sensitive to distribution
- Moderate: Tree-based methods (Isolation Forest) that are somewhat robust
- Challenging: Neural networks that might already adapt to shift
Advanced Applications¶
Multi-domain Adaptation¶
# Handle multiple domains with different shift patterns
domains = ['domain_A', 'domain_B', 'domain_C']
domain_detectors = {}
for domain in domains:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation="median",
        weight_estimator=logistic_weight_estimator(),
        seed=42
    )
    detector.fit(X_train)  # Common training set
    domain_detectors[domain] = detector
# Predict on domain-specific test sets with WCS
from nonconform.enums import Pruning
for domain in domains:
    X_test_domain = load_domain_data(domain)  # Load domain-specific test data
    wcs_mask = domain_detectors[domain].select(
        X_test_domain,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{domain}: {wcs_mask.sum()} discoveries")
Illustrative Online Reweighting Pattern¶
The following is an engineering pattern for repeated batch processing. The WCS guarantee still applies only to batches that satisfy the covariate-shift, independence, support-overlap, and weight-quality assumptions.
from nonconform.enums import Pruning
# Refit weights on each incoming batch when assumptions are plausible
def online_weighted_detection(detector, data_stream, window_size=1000):
    """Online weighted conformal detection with sliding window."""
    detections = []
    for i, (X_batch, _) in enumerate(data_stream):
        if i == 0:
            # Initialize with first batch
            detector.fit(X_batch)
        else:
            # Use sliding window for calibration
            if i * len(X_batch) > window_size:
                start_idx = (i * len(X_batch)) - window_size
                # get_recent_data is a placeholder for your own data-access helper
                X_calib = get_recent_data(start_idx, window_size)
                detector.fit(X_calib)
        # Predict on current batch with WCS
        wcs_mask = detector.select(
            X_batch,
            alpha=0.05,
            pruning=Pruning.DETERMINISTIC,
            seed=42,
        )
        detections.append(wcs_mask.sum())
    return detections
Troubleshooting¶
Common Issues¶
- Poor Weight Estimation
    - Insufficient calibration data
    - High-dimensional data with small samples
    - Solution: Increase calibration size or use dimensionality reduction
- Extreme P-values
    - All p-values near 0 or 1
    - Solution: Check for severe covariate shift, poor support overlap, or model mismatch
- Inconsistent Results
    - High variance in detection counts
    - Solution: Use bootstrap strategy or increase sample size
Debugging Tools¶
def debug_weighted_conformal(detector, X_train, X_test):
    """Debug weighted conformal detection issues."""
    print("=== Weighted Conformal Debug Report ===")
    # Check data properties
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")
    print(f"Feature dimensions: {X_train.shape[1]}")
    # Fit detector
    detector.fit(X_train)
    # Check calibration set size
    print(f"Calibration samples: {len(detector.calibration_set)}")
    if len(detector.calibration_set) < 50:
        print("WARNING: Small calibration set may lead to unreliable weights")
    # Get predictions
    p_values = detector.compute_p_values(X_test)
    # Check p-value distribution
    print(f"P-value range: [{p_values.min():.4f}, {p_values.max():.4f}]")
    print(f"P-value mean: {p_values.mean():.4f}")
    print(f"P-value std: {p_values.std():.4f}")
    if p_values.std() < 0.01:
        print("WARNING: Very low p-value variance - check for issues")
    print("=== End Debug Report ===")
# Example usage
debug_weighted_conformal(weighted_detector, X_train, X_test_shifted)
References¶
- Jin, Y., & Candès, E. J. (2023). Model-free Selective Inference Under Covariate Shift via Weighted Conformal p-values. Biometrika, 110(4), 1090-1106. [Foundational paper on weighted conformal inference and WCS procedure]
- Tibshirani, R. J., Barber, R. F., Candès, E., & Ramdas, A. (2019). Conformal Prediction Under Covariate Shift. Advances in Neural Information Processing Systems, 32. [Early work on conformal prediction with covariate shift]
- Genovese, C. R., Roeder, K., & Wasserman, L. (2006). False Discovery Control with p-value Weighting. Biometrika, 93(3), 509-524. [Theoretical foundation for weighted FDR control]
Next Steps¶
- Learn about FDR control for multiple testing scenarios
- Explore different conformalization strategies for various use cases
- Read about best practices for robust anomaly detection
- Check the troubleshooting guide for common issues
- See input validation for parameter constraints and edge cases