Weighted Conformal P-values¶

Handle covariate shift between calibration and test data when the weighted conformal assumptions are appropriate.

Executive Summary

When to use: Your test data comes from a different feature distribution than your calibration data, while the anomaly mechanism is still stable.

How it works: Weighted conformal prediction estimates how much the feature distributions differ and reweights the calibration evidence accordingly.

Quick start:

from nonconform import ConformalDetector, Split, logistic_weight_estimator

detector = ConformalDetector(
    detector=your_detector,
    strategy=Split(n_calib=0.3),
    weight_estimator=logistic_weight_estimator(),  # Add this
    seed=42
)

Key assumption: Only the feature distribution P(X) changes. The relationship between features and anomaly status, P(Y | X), must stay stable. You also need independent samples, enough feature-support overlap, and good enough weights. If distributions are too far apart, weighting can become unstable and guarantees can degrade.

Overview¶

Weighted conformal p-values extend classical conformal prediction to covariate shift scenarios [Jin & Candès, 2023; Tibshirani et al., 2019]. The marginal feature distribution P(X) may change between calibration and test data, but the conditional relationship P(Y | X) must stay stable. You also need independent calibration/test samples and enough support overlap between feature distributions; when shift is too extreme, estimated density ratios become unstable and weighted conformal adjustment may fail.

The weighted p-values provide per-hypothesis calibration under the paper's assumptions. For multiple simultaneous anomaly decisions, pair them with Weighted Conformalized Selection (WCS); this is the documented finite-sample FDR path under covariate shift [Jin & Candès, 2023]. If the weights are learned rather than known, the guarantee is no longer automatically exact: it depends on weight quality, and the paper's estimated-weight bounds should be read as allowing some FDR inflation.

ConformalDetector(weight_estimator=...) estimates importance weights by distinguishing calibration samples from the current test batch, then uses those weights to compute weighted p-values and WCS selections.

Basic Usage¶

import numpy as np
from nonconform import ConformalDetector, Split, logistic_weight_estimator

from pyod.models.lof import LOF

# Initialize base detector
base_detector = LOF()
strategy = Split(n_calib=0.2)

# Create weighted conformal detector
detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

# Fit on training data and get weighted p-values
# By default, prediction refits the weight model for each batch
p_values = detector.fit(X_train).compute_p_values(X_test)

How It Works¶

The weighted conformal method works through the following steps:

1. Calibration¶

During fitting, the detector:

- Uses the specified strategy to split data and train models
- Computes calibration scores on held-out calibration data
- Stores calibration samples for later weight computation

2. Weight Estimation¶

During prediction, the detector:

- Fits the configured likelihood-ratio estimator (typically a probabilistic binary domain classifier) to distinguish calibration from test samples
- Uses predicted probabilities/scores from that estimator to compute importance weights
- Applies weights to both calibration and test instances

For explicit control of this state transition, you can precompute weights once and reuse them:

detector.fit(X_train)
detector.prepare_weights_for(X_test_shifted)
p_values = detector.compute_p_values(X_test_shifted, refit_weights=False)

By default, this reuse path verifies exact batch content identity. If you need maximum throughput on very large batches and can guarantee your own batch identity discipline, set verify_prepared_batch_content=False when constructing ConformalDetector to validate only batch size.
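As a rough illustration of the weight-estimation step in this section, the sketch below turns domain-classifier probabilities into importance weights using the standard odds-ratio identity. It is not nonconform's implementation: the function name, the `eps` clipping, and the assumption of a classifier trained without class reweighting are illustrative choices.

```python
import numpy as np


def probabilities_to_weights(p_test, n_calib, n_test, eps=1e-6):
    """Convert P(sample came from the test batch | x) into density-ratio weights.

    Assumes the classifier was trained on the pooled calibration + test data
    without class reweighting (hypothetical setup, not the library's API).
    """
    # Keep probabilities away from 0 and 1 so the odds stay finite
    p = np.clip(np.asarray(p_test, dtype=float), eps, 1.0 - eps)
    odds = p / (1.0 - p)
    # Undo the class prior n_test / n_calib that the classifier absorbed
    return odds * (n_calib / n_test)


p = np.array([0.5, 0.8, 0.2])
weights = probabilities_to_weights(p, n_calib=100, n_test=100)
print(weights)
```

If the classifier is trained with balanced class weights, the prior is already equalized and the `n_calib / n_test` correction would not apply in this form.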

3. Weighted P-value Calculation¶

The p-values are computed using weighted empirical distribution functions. By default, nonconform uses the classical (non-randomized) formula. The randomized variant [Jin & Candès, 2023] handles ties more gracefully:

# Randomized weighted p-value calculation (Jin & Candes 2023)
import numpy as np

def weighted_p_value(test_score, calibration_scores, calibration_weights, test_weight):
    """
    Calculate weighted conformal p-value with randomized tie handling.
    """
    # Count calibration scores strictly greater than test score
    weighted_greater = np.sum(calibration_weights[calibration_scores > test_score])

    # Handle ties: add random fraction of tied weights
    tied_weights = np.sum(calibration_weights[calibration_scores == test_score])
    u = np.random.uniform(0, 1)

    # Randomized formula: strictly greater + U * (tied + test weight)
    numerator = weighted_greater + u * (tied_weights + test_weight)
    denominator = np.sum(calibration_weights) + test_weight

    return numerator / denominator

Classical vs. Randomized

By default, Empirical() uses tie_break="classical" (non-randomized formula). Valid values are "classical" and "randomized" (or TieBreakMode.CLASSICAL / TieBreakMode.RANDOMIZED). None is not valid. For randomized smoothing as shown above, use Empirical(tie_break="randomized"). With small calibration sets, randomized smoothing is less conservative than the classical formula and adds run-to-run variability; set a seed when reproducibility matters.
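For comparison with the randomized function above, here is a minimal sketch of the classical (non-randomized) weighted p-value. It assumes the same convention (higher score means more anomalous) and corresponds to fixing U = 1 in the randomized formula; the function name is illustrative and the library's internal implementation may differ in detail.

```python
import numpy as np


def classical_weighted_p_value(test_score, calibration_scores,
                               calibration_weights, test_weight):
    """Classical weighted conformal p-value (ties counted in full)."""
    calibration_scores = np.asarray(calibration_scores, dtype=float)
    calibration_weights = np.asarray(calibration_weights, dtype=float)

    # Weight of calibration scores at least as extreme as the test score
    at_least = np.sum(calibration_weights[calibration_scores >= test_score])

    # The test point always counts itself with its own weight
    numerator = at_least + test_weight
    denominator = np.sum(calibration_weights) + test_weight
    return numerator / denominator


scores = np.array([0.1, 0.4, 0.4, 0.9])
p_unweighted = classical_weighted_p_value(0.4, scores, np.ones(4), 1.0)
p_weighted = classical_weighted_p_value(
    0.4, scores, np.array([1.0, 1.0, 1.0, 3.0]), 1.0
)
print(p_unweighted, p_weighted)
```

With all weights equal to 1, this reduces to the familiar (count + 1) / (n + 1) conformal p-value.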

When to Use Weighted Conformal¶

Covariate Shift Scenarios¶

Use weighted conformal detection when the shift is primarily in P(X) and not in P(Y|X), for example:

Domain Adaptation: Training on one domain, testing on another with stable anomaly mechanism
Sampling/Selection Shift: Deployment sampling differs from calibration sampling (population mix changes)
Subgroup Mixture Shift: Different subgroup prevalence between calibration and test data
Time-based Deployment Changes: Different time periods, only if the change is mostly covariate shift and P(Y|X) is still approximately stable

Do not treat generic temporal drift as automatically suitable for weighted conformal. If the anomaly mechanism itself changes (P(Y|X) shift), weighting alone is insufficient.

Examples Where Covariate Shift May Occur¶

# Example 1: Time-separated data
# Use this only if P(Y|X) is approximately stable across periods
detector.fit(X_train_2020)
p_values_2024 = detector.compute_p_values(X_test_2024)

# Example 2: Geographic shift
# Training on US data, testing on European data
detector.fit(X_us)
p_values_europe = detector.compute_p_values(X_europe)

# Example 3: Sensor/population shift
# Suitable when feature distribution changed but anomaly semantics stayed stable
detector.fit(X_before_drift)
p_values_after_drift = detector.compute_p_values(X_after_drift)

Comparison with Standard Conformal¶

# Standard conformal detector
standard_detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    seed=42
)

# Weighted conformal detector
weighted_detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

# Fit both on training data
standard_detector.fit(X_train)
weighted_detector.fit(X_train)

# Compare on shifted test data using the current one-step API
from nonconform.enums import Pruning

standard_mask = standard_detector.select(X_test_shifted, alpha=0.05)
weighted_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"Standard conformal detections: {standard_mask.sum()}")
print(f"Weighted conformal detections: {weighted_mask.sum()}")

Different Aggregation Strategies¶

The choice of aggregation method can affect performance under covariate shift:

# Compare different aggregation methods
from nonconform.enums import Pruning

aggregation_methods = [
    "mean",
    "median",
    "maximum",
]

for agg_method in aggregation_methods:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation=agg_method,
        weight_estimator=logistic_weight_estimator(),
        seed=42,
    )
    detector.fit(X_train)
    wcs_mask = detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{agg_method}: {wcs_mask.sum()} discoveries")

Note: Aggregation is applied to the raw anomaly scores from each model before conformal p-values are computed. P-values are not averaged; the aggregated score is turned into a single p-value per point.
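The ordering described in the note can be sketched as follows. This is a simplified, unweighted illustration with hypothetical names, assuming each test point is scored by several models and calibration scores are already available as a single array; it is not the library's code path.

```python
import numpy as np


def aggregated_p_values(test_scores_per_model, calib_scores, aggregation=np.median):
    """Aggregate raw scores across models first, then compute one p-value per point."""
    # Shape (n_models, n_test) -> (n_test,): one aggregated score per test point
    test_agg = aggregation(np.asarray(test_scores_per_model, dtype=float), axis=0)
    calib_scores = np.asarray(calib_scores, dtype=float)

    # Unweighted classical p-value for brevity: (count + 1) / (n + 1)
    counts = (calib_scores[None, :] >= test_agg[:, None]).sum(axis=1)
    return (counts + 1) / (len(calib_scores) + 1)


test_scores_per_model = [[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]]  # 3 models, 2 points
calib_scores = [0.1, 0.25, 0.5, 0.6]
p_vals = aggregated_p_values(test_scores_per_model, calib_scores)
print(p_vals)
```

Swapping `np.median` for `np.mean` or `np.max` changes the aggregated score, and therefore the p-value, without ever averaging p-values.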

Weight Estimators¶

nonconform provides two weight estimator factory functions for handling covariate shift:

logistic_weight_estimator¶

Uses logistic regression to estimate likelihood ratios between calibration and test distributions:

from nonconform import logistic_weight_estimator

detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

When to use:

- Linear or moderately complex covariate shifts
- High-dimensional data where interpretability matters
- Fast weight estimation is needed
- Default choice for most applications

Parameters:

- regularization: Regularization strength ('auto' or float C value)
- clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping.
- class_weight: Class weights for LogisticRegression (default: 'balanced')
- max_iter: Maximum iterations (default: 1000)

forest_weight_estimator¶

Uses random forest classification to estimate likelihood ratios:

from nonconform import forest_weight_estimator

detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=forest_weight_estimator(n_estimators=100, max_depth=10),
    seed=42,
)

When to use:

- Complex, non-linear covariate shifts
- Feature interactions are important
- More robust to outliers in feature space
- When you have sufficient calibration data (hundreds+ samples)

Parameters:

- n_estimators: Number of trees (default: 100)
- max_depth: Maximum tree depth (default: 5)
- min_samples_leaf: Minimum samples in leaf (default: 10)
- clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping.

Comparison¶

# Compare weight estimators on complex shift
from nonconform import logistic_weight_estimator, forest_weight_estimator
from nonconform.enums import Pruning

estimators = {
    'Logistic': logistic_weight_estimator(),
    'Forest': forest_weight_estimator(n_estimators=100),
}

for name, weight_est in estimators.items():
    detector = ConformalDetector(
        detector=base_detector,
        strategy=Split(n_calib=0.2),
        aggregation="median",
        weight_estimator=weight_est,
        seed=42,
    )
    detector.fit(X_train)
    wcs_mask = detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{name}: {wcs_mask.sum()} discoveries")

General recommendations:

- Start with logistic_weight_estimator() (faster, more interpretable)
- Switch to forest_weight_estimator() if:
  - Distribution shift is highly non-linear
  - You have >500 calibration samples
  - Logistic weights show poor discrimination

BootstrapBaggedWeightEstimator¶

Wraps any base weight estimator with bootstrap bagging for improved stability in extreme imbalance scenarios. It is most relevant when the calibration set is much larger than the test batch, where standalone importance weights can become spiky and overly influential.

BootstrapBaggedWeightEstimator currently uses scoring_mode="frozen" (default and only supported mode). After fit(calibration_samples, test_samples), it can return stored weights only for that exact calibration/test batch pair; scoring arbitrary new batches requires refitting.

from nonconform import forest_weight_estimator
from nonconform.weighting import BootstrapBaggedWeightEstimator

# Bootstrap bagging with forest base (best for extreme imbalance)
weight_est = BootstrapBaggedWeightEstimator(
    base_estimator=forest_weight_estimator(n_estimators=50),
    n_bootstraps=50,
    clip_quantile=0.05,
)

detector = ConformalDetector(
    detector=base_detector,
    strategy=Split(n_calib=1000),
    aggregation="median",
    weight_estimator=weight_est,
    seed=42,
)

How It Works¶

Bootstrap bagging creates an ensemble of weight estimators:

- For each bootstrap iteration (n_bootstraps times):
  - Resample both calibration and test sets to balanced size
  - Fit the base estimator on the bootstrap sample
  - Score all original instances in that batch
  - Store log-weights for each instance
- After all iterations:
  - Aggregate using geometric mean (exp of mean log-weights)
  - Apply clipping to maintain bounded weights

Every instance receives exactly n_bootstraps weight estimates, ensuring symmetric scoring coverage regardless of calibration/test set size ratios.
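The bagging loop above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the BootstrapBaggedWeightEstimator source: `fit_and_score` is a hypothetical callable standing in for the base estimator, "balanced size" is taken to mean the smaller of the two set sizes, and the final clipping step is omitted.

```python
import numpy as np


def bagged_weights(calib, test, fit_and_score, n_bootstraps=20, seed=0):
    """Geometric-mean aggregation of bootstrap weight estimates."""
    rng = np.random.default_rng(seed)
    n_balanced = min(len(calib), len(test))  # assumed meaning of "balanced size"
    all_x = np.concatenate([calib, test])
    log_w = np.zeros((n_bootstraps, len(all_x)))

    for b in range(n_bootstraps):
        # Resample both sets with replacement to the same size
        c_idx = rng.integers(0, len(calib), size=n_balanced)
        t_idx = rng.integers(0, len(test), size=n_balanced)
        # Fit on the bootstrap sample, score every original instance
        w = fit_and_score(calib[c_idx], test[t_idx], all_x)
        log_w[b] = np.log(w)

    # Geometric mean = exp of the mean log-weight
    return np.exp(log_w.mean(axis=0))


def gaussian_shift_ratio(calib_sample, test_sample, x):
    """Toy 1-D base estimator: density ratio of two unit-variance Gaussians."""
    mu_c, mu_t = calib_sample.mean(), test_sample.mean()
    return np.exp((mu_t - mu_c) * x - 0.5 * (mu_t**2 - mu_c**2))


data_rng = np.random.default_rng(0)
calib = data_rng.normal(0.0, 1.0, size=200)
test = data_rng.normal(1.0, 1.0, size=20)
weights = bagged_weights(calib, test, gaussian_shift_ratio, n_bootstraps=25, seed=1)
print(weights.shape, weights.min(), weights.max())
```

Averaging in log space keeps a single extreme bootstrap estimate from dominating the aggregate, which is the motivation for the geometric mean.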

When to Use¶

DO use BootstrapBaggedWeightEstimator when:

- Extreme imbalance: Large calibration set (>1000) with small test batches (<50)
  - Common in online/streaming detection
  - Example: 1000 calibration samples, 25 test instances
- High-stakes applications: Where weight quality is critical
  - Medical diagnosis with small patient batches
  - Fraud detection with limited transactions
  - Safety-critical systems
- Severe covariate shift: When base estimators produce extreme weights

DO NOT use for:

- Balanced or moderate imbalance: Little or no benefit (weight stability changed by only 2-3% in the balanced benchmark below), which doesn't justify 2-5x computational overhead with a logistic base
- Large test sets: Benefits diminish with larger batches
- Latency-sensitive production: Significant computational cost (20-50x slower with a forest base)

Performance Benchmarks¶

The numbers below are illustrative development benchmarks for the estimator configuration shown here. They are not statistical guarantees and should not be carried over to a new dataset without local validation.

Balanced Scenario (1000 calib vs 1000 test)¶

Metric	Base	Bagged-50	Improvement
Weight Std	2.884	2.957	-2.5% (worse)
Extreme Weights	0	0	No change
Time	0.14s	0.34s	2.4x slower

Verdict: Not recommended for balanced sets.

Extreme Imbalance (1000 calib vs 25 test)¶

Metric	Logistic Base	Logistic Bagged-50	Improvement
Weight Std	1.604	0.841	48% better
Extreme Weights	612	385	37% reduction
Recall	0.067	0.200	3x better
Time	0.14s	0.34s	2.4x slower

Metric	Forest Base	Forest Bagged-50	Improvement
Weight Std	0.153	0.259	Slightly higher but stable
Extreme Weights	599	0	100% elimination
Recall	0.333	1.000	Better in this benchmark
FDR	0.000	0.190	Acceptable trade-off
Time	0.24s	6.4s	27x slower

Verdict: Consider for extreme imbalance when the added runtime is acceptable. Validate on your own labeled data before adopting it as a default.

Configuration Parameters¶

n_bootstraps (default: 100):

- Number of bootstrap iterations
- Higher = more stable, but slower
- Recommended: 20-50 for small test batches, 50-100 for critical applications

clip_quantile (default: 0.05):

- Adaptive quantile-based clipping
- Clips weights to the range between their quantile and 1-quantile empirical quantiles
- Use when weight distribution is unknown
- Set to None to disable clipping
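To make the clipping behavior concrete, the following is a minimal sketch of quantile-based clipping. The function name is hypothetical, and whether the library computes the quantiles over calibration and test weights jointly or separately is not specified here, so treat this as an illustration of the idea only.

```python
import numpy as np


def clip_weights(weights, clip_quantile=0.05):
    """Clip weights to their [q, 1 - q] empirical quantile range."""
    weights = np.asarray(weights, dtype=float)
    if clip_quantile is None:
        # None disables clipping entirely
        return weights
    lower, upper = np.quantile(weights, [clip_quantile, 1.0 - clip_quantile])
    return np.clip(weights, lower, upper)


raw = np.array([0.001, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 250.0])
clipped = clip_weights(raw, clip_quantile=0.05)
print(clipped)
```

Only the two extreme weights move; values already inside the quantile range are left unchanged.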

Advanced Example: Streaming Detection¶

For online/streaming anomaly detection with small batches:

from nonconform import ConformalDetector, Split, forest_weight_estimator
from nonconform.enums import Pruning
from nonconform.weighting import BootstrapBaggedWeightEstimator
from pyod.models.iforest import IForest

# Configuration for small batch streaming
weight_est = BootstrapBaggedWeightEstimator(
    base_estimator=forest_weight_estimator(n_estimators=50, max_depth=10),
    n_bootstraps=50,
    clip_quantile=0.05,  # Adaptive clipping
)

detector = ConformalDetector(
    detector=IForest(),
    strategy=Split(n_calib=1000),  # Large calibration set
    aggregation="median",
    weight_estimator=weight_est,
    seed=42,
)

# Train on historical data
detector.fit(X_historical)

# Process small incoming batches
# (stream_data is a placeholder for your own batch iterator)
for X_batch in stream_data(batch_size=25):
    discoveries = detector.select(
        X_batch,
        alpha=0.1,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )

    print(f"Detected {discoveries.sum()} anomalies in batch of {len(X_batch)}")

Cost-Benefit Analysis¶

Configuration	Time	Quality	Use Case
Logistic (Base)	0.14s	Baseline	Standard balanced scenarios
Logistic + Bagging(50)	0.34s	+48% weight stability	Moderate imbalance, quality focus
Forest (Base)	0.24s	Good for non-linear	Standard scenarios
Forest + Bagging(50)	6.4s	Best in this benchmark	Extreme imbalance, quality focus

Recommendation: Use forest_weight_estimator + BootstrapBaggedWeightEstimator when:

- Calibration set is 40x larger than test batch (e.g., 1000:25)
- Missing anomalies is very costly
- Computational budget allows 20-50x overhead
- Online/streaming detection with small batches

Decision Guide¶

Which weight estimator should I use?

┌─ Is your test batch very small (<50) AND calibration large (>1000)?
│
├─ YES → BootstrapBaggedWeightEstimator(
│         forest_weight_estimator(50), n_bootstraps=50
│       )
│       Cost: High (6-7s), Quality: Best in the illustrative benchmark
│
└─ NO → Standard weight estimators
    │
    ├─ Linear/moderate shift → logistic_weight_estimator()
    │                          Cost: Low (0.14s), Quality: Good
    │
    └─ Complex/non-linear shift → forest_weight_estimator(50)
                                   Cost: Medium (0.24s), Quality: Better

Treat the cost and quality labels above as examples from one benchmark, not portable promises.

Strategy Selection¶

Different strategies can be used with weighted conformal detection:

from nonconform import CrossValidation, JackknifeBootstrap

# JaB+ strategy for stability
jab_strategy = JackknifeBootstrap(n_bootstraps=50)
jab_detector = ConformalDetector(
    detector=base_detector,
    strategy=jab_strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42
)

# Cross-validation strategy for efficiency
cv_strategy = CrossValidation(k=5)
cv_detector = ConformalDetector(
    detector=base_detector,
    strategy=cv_strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42
)

Weighted Conformal Selection¶

Weighted conformal p-values provide per-hypothesis calibration under covariate shift assumptions. To obtain finite-sample FDR control across many test points, combine them with Weighted Conformalized Selection (WCS) under the independence, support-overlap, and weight-quality assumptions of [Jin & Candès, 2023]:

from nonconform.enums import Pruning

wcs_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"WCS-selected anomalies: {wcs_mask.sum()} of {len(wcs_mask)}")

After any call to compute_p_values(), score_samples(), or select(), the detector caches the relevant arrays (p_values, scores, weights) inside detector.last_result. Passing this object to weighted_false_discovery_control avoids plumbing the raw arrays manually.

For explicit array-first workflows, use:

from nonconform.enums import Pruning
from nonconform.fdr import (
    weighted_false_discovery_control_from_arrays,
)
from nonconform.scoring import calculate_weighted_p_val

# WCS from precomputed p-values + arrays
wcs_from_arrays = weighted_false_discovery_control_from_arrays(
    p_values=weighted_detector.last_result.p_values,
    test_scores=weighted_detector.last_result.test_scores,
    calib_scores=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

# WCS with explicit empirical p-value computation
p_values_empirical = calculate_weighted_p_val(
    scores=weighted_detector.last_result.test_scores,
    calibration_set=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    tie_break="classical",
)
wcs_empirical = weighted_false_discovery_control_from_arrays(
    p_values=p_values_empirical,
    test_scores=weighted_detector.last_result.test_scores,
    calib_scores=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
)

Pruning Modes¶

The pruning parameter controls the second-stage WCS pruning rule [Jin & Candès, 2023]:

Pruning.DETERMINISTIC¶

wcs_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,  # seed has no effect for deterministic mode
)

Behavior: Uses the deterministic WCS pruning rule without additional randomization.

When to use:

- Reproducibility is critical
- You don't want any randomness in selections
- Reporting results that must be exactly reproducible

Trade-off: May be slightly conservative (reject fewer hypotheses) compared to randomized methods.

Pruning.HOMOGENEOUS¶

wcs_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.HOMOGENEOUS,
    seed=42,  # controls randomization
)

Behavior: Draws a single uniform random variable and applies the same pruning randomization across test instances.

When to use:

- You want a default randomized method
- You want randomized WCS pruning under the method's assumptions
- Some randomness in the selections is acceptable

Trade-off: Less conservative than DETERMINISTIC, but results vary across random seeds.

Pruning.HETEROGENEOUS¶

wcs_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.HETEROGENEOUS,
    seed=42,  # controls randomization
)

Behavior: Draws independent uniform random variables for each test instance. Provides the most flexible pruning randomization.

When to use:

- You want the less conservative randomized pruning option
- Research settings where run-to-run variance is acceptable

Trade-off: Highest variance across random seeds, and often less conservative than deterministic pruning.

Comparison of Pruning Methods¶

from nonconform.enums import Pruning

pruning_methods = [
    Pruning.DETERMINISTIC,
    Pruning.HOMOGENEOUS,
    Pruning.HETEROGENEOUS
]

for pruning_method in pruning_methods:
    wcs_mask = weighted_detector.select(
        X_test_shifted,
        alpha=0.05,
        pruning=pruning_method,
        seed=42,
    )

    print(f"{pruning_method.name}: {wcs_mask.sum()} detections")

Expected relationship: In the WCS theory, the selections made by deterministic pruning are contained in those made by each randomized variant, so deterministic pruning is usually the most conservative. The two randomized variants can differ by dataset and seed.

Performance Considerations¶

Computational Cost¶

Weighted conformal detection has additional overhead:

- Weight estimation with the configured weight estimator (for example, logistic regression)
- Weighted p-value computation

import time

# Compare computation times
def time_detector(detector, X_train, X_test):
    # perf_counter() is a monotonic clock intended for interval timing
    start_time = time.perf_counter()
    detector.fit(X_train)
    fit_time = time.perf_counter() - start_time

    start_time = time.perf_counter()
    p_values = detector.compute_p_values(X_test)
    predict_time = time.perf_counter() - start_time

    return fit_time, predict_time

# Standard vs Weighted timing
standard_fit, standard_pred = time_detector(standard_detector, X_train, X_test)
weighted_fit, weighted_pred = time_detector(weighted_detector, X_train, X_test)

print(f"Standard: Fit={standard_fit:.2f}s, Predict={standard_pred:.2f}s")
print(f"Weighted: Fit={weighted_fit:.2f}s, Predict={weighted_pred:.2f}s")
print(f"Overhead: {((weighted_fit + weighted_pred) / (standard_fit + standard_pred) - 1) * 100:.1f}%")

Memory Usage¶

Weighted conformal detection requires storing:

- Calibration samples for weight computation
- Calibration scores for p-value calculation

For large datasets, consider:

- Using a subset of calibration samples for weight estimation
- Implementing online/streaming versions
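One way to take a subset of calibration samples is uniform subsampling without replacement, sketched below. The helper name is hypothetical and the sketch does not show how the subset would be passed to nonconform; note also that weights still have to be evaluated for every calibration point whose score enters the p-value, so only the estimator's fitting cost shrinks.

```python
import numpy as np


def subsample_rows(X, max_rows, seed=0):
    """Return at most max_rows rows of X, sampled uniformly without replacement."""
    X = np.asarray(X)
    if len(X) <= max_rows:
        return X
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max_rows, replace=False)
    # Sorting keeps the original row order, which helps when debugging
    return X[np.sort(idx)]


X_demo = np.arange(40).reshape(20, 2)
X_small = subsample_rows(X_demo, max_rows=5, seed=0)
print(X_small.shape)
```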

Best Practices¶

1. Validate Covariate Shift¶

Always check whether a feature-distribution shift is actually present, and then separately decide whether the anomaly mechanism is still stable:

# Use statistical tests to detect shift
from scipy.stats import ks_2samp

def detect_feature_shift(X_train, X_test):
    """Detect feature-distribution shift in individual features."""
    shift_detected = []
    p_values = []

    for i in range(X_train.shape[1]):
        statistic, p_value = ks_2samp(X_train[:, i], X_test[:, i])
        shift_detected.append(p_value < 0.05)
        p_values.append(p_value)

    print(f"Features with significant shift: {sum(shift_detected)}/{len(shift_detected)}")
    return shift_detected, p_values

shift_features, shift_p_values = detect_feature_shift(X_train, X_test_shifted)
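Testing every feature at the 0.05 level will flag some features by chance when there are many of them. A simple, conservative remedy is a Bonferroni correction on the per-feature p-values, sketched here with a hypothetical helper that takes the p-values returned by a function like the one above.

```python
def bonferroni_shift_flags(p_values, alpha=0.05):
    """Flag features whose p-value survives a Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    # Each test is run at level alpha / m so the family-wise error stays <= alpha
    threshold = alpha / m
    return [p < threshold for p in p_values]


example_p_values = [0.001, 0.03, 0.2, 0.04]
flags = bonferroni_shift_flags(example_p_values, alpha=0.05)
print(f"Features with significant shift after correction: {sum(flags)}/{len(flags)}")
```

A detected shift in P(X) says nothing about whether P(Y | X) is stable; that still has to be judged separately.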

2. Combine with Weighted Conformal Selection¶

from nonconform.enums import Pruning

wcs_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"WCS-controlled discoveries: {wcs_mask.sum()}")

3. Monitor Weight Quality¶

Extreme weights can indicate poor weight estimation:

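import numpy as np

def check_weight_quality(detector, X_test, low=0.01, high=100.0):
    """Flag extreme weights from an already fitted weighted detector."""
    # compute_p_values() caches the weights it used in detector.last_result
    detector.compute_p_values(X_test)
    weights = np.concatenate([
        detector.last_result.calib_weights,
        detector.last_result.test_weights,
    ])
    # Rule of thumb: weights should typically be between 0.1 and 10
    # Extreme weights (< 0.01 or > 100) suggest problems
    n_extreme = int(np.sum((weights < low) | (weights > high)))
    print(f"Weight range: [{weights.min():.3g}, {weights.max():.3g}]; extreme: {n_extreme}")
    return n_extreme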
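A complementary diagnostic is the Kish effective sample size of the calibration weights, which summarizes how many calibration points effectively carry the evidence after reweighting. The sketch below works on any array of non-negative weights; the function name is illustrative and not part of nonconform.

```python
import numpy as np


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)


uniform = np.ones(100)
spiky = np.concatenate([np.ones(99), [100.0]])

ess_uniform = effective_sample_size(uniform)
ess_spiky = effective_sample_size(spiky)
print(f"Uniform weights ESS: {ess_uniform:.1f}")
print(f"One dominant weight ESS: {ess_spiky:.1f}")
```

An effective sample size far below the nominal calibration size suggests that a few points dominate, which is a sign of poor support overlap or unstable weight estimates.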

4. Use Appropriate Base Detectors¶

Some detectors work better with weighted conformal:

- Good: Distance-based methods (LOF, KNN) that are sensitive to distribution
- Moderate: Tree-based methods (Isolation Forest) that are somewhat robust
- Challenging: Neural networks that might already adapt to shift

Advanced Applications¶

Multi-domain Adaptation¶

# Handle multiple domains with different shift patterns
domains = ['domain_A', 'domain_B', 'domain_C']
domain_detectors = {}

for domain in domains:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation="median",
        weight_estimator=logistic_weight_estimator(),
        seed=42
    )
    detector.fit(X_train)  # Common training set
    domain_detectors[domain] = detector

# Predict on domain-specific test sets with WCS
from nonconform.enums import Pruning

for domain in domains:
    X_test_domain = load_domain_data(domain)  # Load domain-specific test data
    wcs_mask = domain_detectors[domain].select(
        X_test_domain,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{domain}: {wcs_mask.sum()} discoveries")

Illustrative Online Reweighting Pattern¶

The following is an engineering pattern for repeated batch processing. The WCS guarantee still applies only to batches that satisfy the covariate-shift, independence, support-overlap, and weight-quality assumptions.

from nonconform.enums import Pruning

# Refit weights on each incoming batch when assumptions are plausible
def online_weighted_detection(detector, data_stream, window_size=1000):
    """Online weighted conformal detection with sliding window."""
    detections = []

    for i, (X_batch, _) in enumerate(data_stream):
        if i == 0:
            # Initialize with first batch
            detector.fit(X_batch)
        else:
            # Use sliding window for calibration
            # (get_recent_data is a placeholder for your own buffer lookup)
            if i * len(X_batch) > window_size:
                start_idx = (i * len(X_batch)) - window_size
                X_calib = get_recent_data(start_idx, window_size)
                detector.fit(X_calib)

            # Predict on current batch with WCS
            wcs_mask = detector.select(
                X_batch,
                alpha=0.05,
                pruning=Pruning.DETERMINISTIC,
                seed=42,
            )
            detections.append(wcs_mask.sum())

    return detections
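The pattern above leaves the storage of recent data to the caller. One possible buffer, sketched here as an assumption rather than a library feature, keeps the most recent rows in a fixed-size deque; `RecentWindow` and its methods are hypothetical names, and its interface differs from the `get_recent_data(start_idx, window_size)` call used above.

```python
from collections import deque

import numpy as np


class RecentWindow:
    """Fixed-size buffer holding the most recent feature rows."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest rows automatically
        self._rows = deque(maxlen=window_size)

    def extend(self, X_batch):
        # Iterating over a 2-D array yields one row at a time
        self._rows.extend(np.asarray(X_batch))

    def as_array(self):
        if not self._rows:
            return np.empty((0, 0))
        return np.stack(list(self._rows))


window = RecentWindow(window_size=5)
window.extend(np.arange(6).reshape(3, 2))
window.extend(np.arange(6, 14).reshape(4, 2))
recent = window.as_array()
print(recent)
```

Calling `window.extend(X_batch)` after each batch and refitting on `window.as_array()` would give the sliding-window behavior, subject to the same assumptions about the recent data being suitable for calibration.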

Troubleshooting¶

Common Issues¶

- Poor Weight Estimation
  - Insufficient calibration data
  - High-dimensional data with small samples
  - Solution: Increase calibration size or use dimensionality reduction
- Extreme P-values
  - All p-values near 0 or 1
  - Solution: Check for severe covariate shift, poor support overlap, or model mismatch
- Inconsistent Results
  - High variance in detection counts
  - Solution: Use bootstrap strategy or increase sample size

Debugging Tools¶

def debug_weighted_conformal(detector, X_train, X_test):
    """Debug weighted conformal detection issues."""
    print("=== Weighted Conformal Debug Report ===")

    # Check data properties
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")
    print(f"Feature dimensions: {X_train.shape[1]}")

    # Fit detector
    detector.fit(X_train)

    # Check calibration set size
    print(f"Calibration samples: {len(detector.calibration_set)}")

    if len(detector.calibration_set) < 50:
        print("WARNING: Small calibration set may lead to unreliable weights")

    # Get predictions
    p_values = detector.compute_p_values(X_test)

    # Check p-value distribution
    print(f"P-value range: [{p_values.min():.4f}, {p_values.max():.4f}]")
    print(f"P-value mean: {p_values.mean():.4f}")
    print(f"P-value std: {p_values.std():.4f}")

    if p_values.std() < 0.01:
        print("WARNING: Very low p-value variance - check for issues")

    print("=== End Debug Report ===")

# Example usage
debug_weighted_conformal(weighted_detector, X_train, X_test_shifted)

References¶

Jin, Y., & Candès, E. J. (2023). Model-free Selective Inference Under Covariate Shift via Weighted Conformal p-values. Biometrika, 110(4), 1090-1106. [Foundational paper on weighted conformal inference and WCS procedure]
Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal Prediction Under Covariate Shift. Advances in Neural Information Processing Systems, 32. [Early work on conformal prediction with covariate shift]
Genovese, C. R., Roeder, K., & Wasserman, L. (2006). False Discovery Control with p-value Weighting. Biometrika, 93(3), 509-524. [Theoretical foundation for weighted FDR control]

Next Steps¶

Learn about FDR control for multiple testing scenarios
Explore different conformalization strategies for various use cases
Read about best practices for robust anomaly detection
Check the troubleshooting guide for common issues
See input validation for parameter constraints and edge cases