Skip to content

Weighted Conformal P-values

Handle distribution shift between training and test data while maintaining statistical guarantees.

Executive Summary

When to use: Your test data comes from a different distribution than your training data (e.g., different time period, different sensor, different domain).

How it works: Weighted conformal prediction estimates how much the distributions differ and reweights the calibration data accordingly.

Quick start:

from nonconform import ConformalDetector, Split, logistic_weight_estimator

detector = ConformalDetector(
    detector=your_detector,
    strategy=Split(n_calib=0.3),
    weight_estimator=logistic_weight_estimator(),  # Add this
    seed=42
)

Key assumption: Only the feature distribution P(X) changes—the relationship between features and anomaly status P(Y|X) must stay the same. You also need sufficient feature-support overlap between calibration and test data; if distributions are too far apart, weighting can become unstable and guarantees can degrade.

Overview

Weighted conformal p-values extend classical conformal prediction to handle covariate shift scenarios [Jin & Candès, 2023; Tibshirani et al., 2019]. Key assumption: the marginal distribution P(X) may change between calibration and test data, while the conditional distribution P(Y|X) – the relationship between features and anomaly status – remains constant. This assumption is crucial for the validity of weighted conformal inference. A second practical requirement is sufficient support overlap between calibration and test feature distributions; when shift is too extreme, estimated density ratios become unstable and weighted conformal adjustment may fail. When assumptions hold you can pair the p-values with Weighted Conformal Selection (WCS) to obtain rigorous False Discovery Rate control under distribution shift [Jin & Candès, 2023].

The ConformalDetector with a weight_estimator parameter automatically estimates importance weights to distinguish between calibration and test samples, then uses these weights to compute adjusted p-values.

Basic Usage

import numpy as np
from nonconform import ConformalDetector, Split, logistic_weight_estimator

from pyod.models.lof import LOF

# Initialize base detector
base_detector = LOF()
strategy = Split(n_calib=0.2)

# Create weighted conformal detector
detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

# Fit on training data and get weighted p-values
# By default, prediction refits the weight model for each batch
p_values = detector.fit(X_train).compute_p_values(X_test)

How It Works

The weighted conformal method works through the following steps:

1. Calibration

During fitting, the detector: - Uses the specified strategy to split data and train models - Computes calibration scores on held-out calibration data - Stores calibration samples for later weight computation

2. Weight Estimation

During prediction, the detector: - Fits the configured likelihood-ratio estimator (typically a probabilistic binary domain classifier) to distinguish calibration from test samples - Uses predicted probabilities/scores from that estimator to compute importance weights - Applies weights to both calibration and test instances

For explicit control of this state transition, you can precompute weights once and reuse them:

detector.fit(X_train)
detector.prepare_weights_for(X_test_shifted)
p_values = detector.compute_p_values(X_test_shifted, refit_weights=False)

By default, this reuse path verifies exact batch content identity. If you need maximum throughput on very large batches and can guarantee your own batch identity discipline, set verify_prepared_batch_content=False when constructing ConformalDetector to validate only batch size.

3. Weighted P-value Calculation

The p-values are computed using weighted empirical distribution functions. By default, nonconform uses the classical (non-randomized) formula. The randomized variant [Jin & Candès, 2023] handles ties more gracefully:

# Randomized weighted p-value calculation (Jin & Candes 2023)
import numpy as np

def weighted_p_value(test_score, calibration_scores, calibration_weights, test_weight):
    """
    Calculate weighted conformal p-value with randomized tie handling.
    """
    # Count calibration scores strictly greater than test score
    weighted_greater = np.sum(calibration_weights[calibration_scores > test_score])

    # Handle ties: add random fraction of tied weights
    tied_weights = np.sum(calibration_weights[calibration_scores == test_score])
    u = np.random.uniform(0, 1)

    # Randomized formula: strictly greater + U * (tied + test weight)
    numerator = weighted_greater + u * (tied_weights + test_weight)
    denominator = np.sum(calibration_weights) + test_weight

    return numerator / denominator

Classical vs. Randomized

By default, Empirical() uses tie_break="classical" (non-randomized formula). Valid values are "classical" and "randomized" (or TieBreakMode.CLASSICAL / TieBreakMode.RANDOMIZED). None is not valid. For randomized smoothing as shown above, use Empirical(tie_break="randomized"). Note that with small calibration sets, randomized smoothing can produce anti-conservative p-values.

When to Use Weighted Conformal

Covariate Shift Scenarios

Use weighted conformal detection when the shift is primarily in P(X) and not in P(Y|X), for example:

  1. Domain Adaptation: Training on one domain, testing on another with stable anomaly mechanism
  2. Sampling/Selection Shift: Deployment sampling differs from calibration sampling (population mix changes)
  3. Subgroup Mixture Shift: Different subgroup prevalence between calibration and test data
  4. Time-based Deployment Changes: Different time periods, only if the change is mostly covariate shift and P(Y|X) is still approximately stable

Do not treat generic temporal drift as automatically suitable for weighted conformal. If the anomaly mechanism itself changes (P(Y|X) shift), weighting alone is insufficient.

Examples Where Covariate Shift May Occur

# Example 1: Time-separated data
# Use this only if P(Y|X) is approximately stable across periods
detector.fit(X_train_2020)
p_values_2024 = detector.compute_p_values(X_test_2024)

# Example 2: Geographic shift
# Training on US data, testing on European data
detector.fit(X_us)
p_values_europe = detector.compute_p_values(X_europe)

# Example 3: Sensor/population shift
# Suitable when feature distribution changed but anomaly semantics stayed stable
detector.fit(X_before_drift)
p_values_after_drift = detector.compute_p_values(X_after_drift)

Comparison with Standard Conformal

# Standard conformal detector
standard_detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    seed=42
)

# Weighted conformal detector
weighted_detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

# Fit both on training data
standard_detector.fit(X_train)
weighted_detector.fit(X_train)

# Compare on shifted test data using the current one-step API
from nonconform.enums import Pruning

standard_mask = standard_detector.select(X_test_shifted, alpha=0.05)
weighted_mask = weighted_detector.select(
    X_test_shifted,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"Standard conformal detections: {standard_mask.sum()}")
print(f"Weighted conformal detections: {weighted_mask.sum()}")

Different Aggregation Strategies

The choice of aggregation method can affect performance under distribution shift:

# Compare different aggregation methods
from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

aggregation_methods = [
    "mean",
    "median",
    "maximum",
]

for agg_method in aggregation_methods:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation=agg_method,
        weight_estimator=logistic_weight_estimator(),
        seed=42,
    )
    detector.fit(X_train)
    _ = detector.compute_p_values(X_test_shifted)

    wcs_mask = weighted_false_discovery_control(
        result=detector.last_result,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{agg_method}: {wcs_mask.sum()} discoveries")

Note: Aggregation is applied to the raw anomaly scores from each model before conformal p-values are computed. P-values are not averaged; the aggregated score is turned into a single p-value per point.

Weight Estimators

nonconform provides two weight estimator factory functions for handling covariate shift:

logistic_weight_estimator

Uses logistic regression to estimate likelihood ratios between calibration and test distributions:

from nonconform import logistic_weight_estimator

detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42,
)

When to use: - Linear or moderately complex distribution shifts - High-dimensional data where interpretability matters - Fast weight estimation is needed - Default choice for most applications

Parameters: - regularization: Regularization strength ('auto' or float C value) - clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping. - class_weight: Class weights for LogisticRegression (default: 'balanced') - max_iter: Maximum iterations (default: 1000)

forest_weight_estimator

Uses random forest classification to estimate likelihood ratios:

from nonconform import forest_weight_estimator

detector = ConformalDetector(
    detector=base_detector,
    strategy=strategy,
    aggregation="median",
    weight_estimator=forest_weight_estimator(n_estimators=100, max_depth=10),
    seed=42,
)

When to use: - Complex, non-linear distribution shifts - Feature interactions are important - More robust to outliers in feature space - When you have sufficient calibration data (hundreds+ samples)

Parameters: - n_estimators: Number of trees (default: 100) - max_depth: Maximum tree depth (default: 5) - min_samples_leaf: Minimum samples in leaf (default: 10) - clip_quantile: Quantile for weight clipping (default: 0.05). Set to None to disable clipping.

Comparison

# Compare weight estimators on complex shift
from nonconform import logistic_weight_estimator, forest_weight_estimator

estimators = {
    'Logistic': logistic_weight_estimator(),
    'Forest': forest_weight_estimator(n_estimators=100),
}

for name, weight_est in estimators.items():
    detector = ConformalDetector(
        detector=base_detector,
        strategy=Split(n_calib=0.2),
        aggregation="median",
        weight_estimator=weight_est,
        seed=42,
    )
    detector.fit(X_train)
    _ = detector.compute_p_values(X_test_shifted)

    wcs_mask = weighted_false_discovery_control(
        result=detector.last_result,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{name}: {wcs_mask.sum()} discoveries")

General recommendations: - Start with logistic_weight_estimator() (faster, more interpretable) - Switch to forest_weight_estimator() if: - Distribution shift is highly non-linear - You have >500 calibration samples - Logistic weights show poor discrimination

BootstrapBaggedWeightEstimator

Wraps any base weight estimator with bootstrap bagging for improved stability in extreme imbalance scenarios. It is most relevant when the calibration set is much larger than the test batch, where standalone importance weights can become spiky and overly influential:

BootstrapBaggedWeightEstimator currently uses scoring_mode="frozen" (default and only supported mode). After fit(calibration_samples, test_samples), it can return stored weights only for that exact calibration/test batch pair; scoring arbitrary new batches requires refitting.

from nonconform import forest_weight_estimator
from nonconform.weighting import BootstrapBaggedWeightEstimator

# Bootstrap bagging with forest base (best for extreme imbalance)
weight_est = BootstrapBaggedWeightEstimator(
    base_estimator=forest_weight_estimator(n_estimators=50),
    n_bootstraps=50,
    clip_quantile=0.05,
)

detector = ConformalDetector(
    detector=base_detector,
    strategy=Split(n_calib=1000),
    aggregation="median",
    weight_estimator=weight_est,
    seed=42,
)

How It Works

Bootstrap bagging creates an ensemble of weight estimators:

  1. For each bootstrap iteration (n_bootstraps times):
  2. Resample both calibration and test sets to balanced size
  3. Fit the base estimator on the bootstrap sample
  4. Score ALL original instances (perfect coverage)
  5. Store log-weights for each instance

  6. After all iterations:

  7. Aggregate using geometric mean (exp of mean log-weights)
  8. Apply clipping to maintain bounded weights

Every instance receives exactly n_bootstraps weight estimates, ensuring symmetric coverage regardless of set size ratios.

When to Use

DO use BootstrapBaggedWeightEstimator when:

  1. Extreme imbalance: Large calibration set (>1000) with small test batches (<50)
  2. Common in online/streaming detection
  3. Example: 1000 calibration samples, 25 test instances

  4. High-stakes applications: Where weight quality is critical

  5. Medical diagnosis with small patient batches
  6. Fraud detection with limited transactions
  7. Safety-critical systems

  8. Severe distribution shift: When base estimators produce extreme weights

DO NOT use for:

  1. Balanced or moderate imbalance: Marginal benefit (2-3% improvement) doesn't justify 2-5x computational overhead
  2. Large test sets: Benefits diminish with larger batches
  3. Latency-sensitive production: Significant computational cost (20-50x slower)

Performance Benchmarks

Empirical testing shows context-dependent value:

Balanced Scenario (1000 calib vs 1000 test)
Metric Base Bagged-50 Improvement
Weight Std 2.884 2.957 -2.5% (worse)
Extreme Weights 0 0 No change
Time 0.14s 0.34s 2.4x slower

Verdict: Not recommended for balanced sets.

Extreme Imbalance (1000 calib vs 25 test)
Metric Logistic Base Logistic Bagged-50 Improvement
Weight Std 1.604 0.841 48% better
Extreme Weights 612 385 37% reduction
Recall 0.067 0.200 3x better
Time 0.14s 0.34s 2.4x slower
Metric Forest Base Forest Bagged-50 Improvement
Weight Std 0.153 0.259 Slightly higher but stable
Extreme Weights 599 0 100% elimination
Recall 0.333 1.000 Perfect detection
FDR 0.000 0.190 Acceptable trade-off
Time 0.24s 6.4s 27x slower

Verdict: Strongly recommended for extreme imbalance. Best combination: forest_weight_estimator + Bagging.

Configuration Parameters

n_bootstraps (default: 100): - Number of bootstrap iterations - Higher = more stable, but slower - Recommended: 20-50 for small test batches, 50-100 for critical applications

clip_quantile (default: 0.05): - Adaptive quantile-based clipping - Clips to (quantile, 1-quantile) percentiles - Use when weight distribution is unknown - Set to None to disable clipping

Advanced Example: Streaming Detection

For online/streaming anomaly detection with small batches:

from nonconform import ConformalDetector, Split, forest_weight_estimator
from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control
from nonconform.weighting import BootstrapBaggedWeightEstimator
from pyod.models.iforest import IForest

# Configuration for small batch streaming
weight_est = BootstrapBaggedWeightEstimator(
    base_estimator=forest_weight_estimator(n_estimators=50, max_depth=10),
    n_bootstraps=50,
    clip_quantile=0.05,  # Adaptive clipping
)

detector = ConformalDetector(
    detector=IForest(),
    strategy=Split(n_calib=1000),  # Large calibration set
    aggregation="median",
    weight_estimator=weight_est,
    seed=42,
)

# Train on historical data
detector.fit(X_historical)

# Process small incoming batches
for X_batch in stream_data(batch_size=25):
    p_values = detector.compute_p_values(X_batch)

    # Apply weighted FDR control
    discoveries = weighted_false_discovery_control(
        result=detector.last_result,
        alpha=0.1,
        pruning=Pruning.DETERMINISTIC,
        seed=42
    )

    print(f"Detected {discoveries.sum()} anomalies in batch of {len(X_batch)}")

Cost-Benefit Analysis

Configuration Time Quality Use Case
Logistic (Base) 0.14s Baseline Standard balanced scenarios
Logistic + Bagging(50) 0.34s +48% weight stability Moderate imbalance, quality focus
Forest (Base) 0.24s Good for non-linear Standard scenarios
Forest + Bagging(50) 6.4s Perfect detection Extreme imbalance, premium quality

Recommendation: Use forest_weight_estimator + BootstrapBaggedWeightEstimator when: - Calibration set is 40x larger than test batch (e.g., 1000:25) - Missing anomalies is very costly - Computational budget allows 20-50x overhead - Online/streaming detection with small batches

Decision Guide

Which weight estimator should I use?

┌─ Is your test batch very small (<50) AND calibration large (>1000)?
├─ YES → BootstrapBaggedWeightEstimator(
│         forest_weight_estimator(50), n_bootstraps=50
│       )
│       Cost: High (6-7s), Quality: Best (perfect detection)
└─ NO → Standard weight estimators
    ├─ Linear/moderate shift → logistic_weight_estimator()
    │                          Cost: Low (0.14s), Quality: Good
    └─ Complex/non-linear shift → forest_weight_estimator(50)
                                   Cost: Medium (0.24s), Quality: Better

Strategy Selection

Different strategies can be used with weighted conformal detection:

from nonconform import CrossValidation, JackknifeBootstrap

# JaB+ strategy for stability
jab_strategy = JackknifeBootstrap(n_bootstraps=50)
jab_detector = ConformalDetector(
    detector=base_detector,
    strategy=jab_strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42
)

# Cross-validation strategy for efficiency
cv_strategy = CrossValidation(k=5)
cv_detector = ConformalDetector(
    detector=base_detector,
    strategy=cv_strategy,
    aggregation="median",
    weight_estimator=logistic_weight_estimator(),
    seed=42
)

Weighted Conformal Selection

Weighted conformal p-values are valid on their own. To obtain finite-sample FDR control under covariate shift, combine them with Weighted Conformal Selection (WCS) [Jin & Candès, 2023]:

from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

# Collect weighted p-values and cached statistics
weighted_detector.compute_p_values(X_test_shifted)

wcs_mask = weighted_false_discovery_control(
    result=weighted_detector.last_result,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"WCS-selected anomalies: {wcs_mask.sum()} of {len(wcs_mask)}")

After any call to compute_p_values() or score_samples(), the detector caches the relevant arrays (p_values, scores, weights) inside detector.last_result. Passing this object to weighted_false_discovery_control avoids plumbing the raw arrays manually.

For explicit array-first workflows, use:

from nonconform.enums import Pruning
from nonconform.fdr import (
    weighted_false_discovery_control_from_arrays,
)
from nonconform.scoring import calculate_weighted_p_val

# WCS from precomputed p-values + arrays
wcs_from_arrays = weighted_false_discovery_control_from_arrays(
    p_values=weighted_detector.last_result.p_values,
    test_scores=weighted_detector.last_result.test_scores,
    calib_scores=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

# WCS with explicit empirical p-value computation
p_values_empirical = calculate_weighted_p_val(
    scores=weighted_detector.last_result.test_scores,
    calibration_set=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    tie_break="classical",
)
wcs_empirical = weighted_false_discovery_control_from_arrays(
    p_values=p_values_empirical,
    test_scores=weighted_detector.last_result.test_scores,
    calib_scores=weighted_detector.last_result.calib_scores,
    test_weights=weighted_detector.last_result.test_weights,
    calib_weights=weighted_detector.last_result.calib_weights,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
)

Pruning Modes

The pruning parameter controls how ties and randomization are handled in the WCS procedure [Jin & Candès, 2023]:

Pruning.DETERMINISTIC

wcs_mask = weighted_false_discovery_control(
    result=weighted_detector.last_result,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,  # seed has no effect for deterministic mode
)

Behavior: Uses a deterministic threshold without randomization. When there are tied p-values at the threshold, includes all or none based on deterministic rule.

When to use: - Reproducibility is critical - You don't want any randomness in selections - Reporting results that must be exactly reproducible

Trade-off: May be slightly conservative (reject fewer hypotheses) compared to randomized methods.

Pruning.HOMOGENEOUS

wcs_mask = weighted_false_discovery_control(
    result=weighted_detector.last_result,
    alpha=0.05,
    pruning=Pruning.HOMOGENEOUS,
    seed=42,  # controls randomization
)

Behavior: Draws a single uniform random variable and applies the same randomized threshold to all test instances. Handles ties by probabilistically including tied instances.

When to use: - Default randomized method - Want exact FDR control in expectation - Acceptable to have some randomness

Trade-off: Less conservative than DETERMINISTIC, but results vary across random seeds.

Pruning.HETEROGENEOUS

wcs_mask = weighted_false_discovery_control(
    result=weighted_detector.last_result,
    alpha=0.05,
    pruning=Pruning.HETEROGENEOUS,
    seed=42,  # controls randomization
)

Behavior: Draws independent uniform random variables for each test instance. Provides the most flexible randomization.

When to use: - Maximum power (fewer false negatives) - Most aggressive FDR control - Research settings where slight variance is acceptable

Trade-off: Highest variance across random seeds, but best expected power.

Comparison of Pruning Methods

from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

pruning_methods = [
    Pruning.DETERMINISTIC,
    Pruning.HOMOGENEOUS,
    Pruning.HETEROGENEOUS
]

weighted_detector.compute_p_values(X_test_shifted)

for pruning_method in pruning_methods:
    wcs_mask = weighted_false_discovery_control(
        result=weighted_detector.last_result,
        alpha=0.05,
        pruning=pruning_method,
        seed=42,
    )

    print(f"{pruning_method.name}: {wcs_mask.sum()} detections")

Expected relationship: Typically HETEROGENEOUS ≥ HOMOGENEOUS ≥ DETERMINISTIC in terms of number of detections, though this can vary with data.

Performance Considerations

Computational Cost

Weighted conformal detection has additional overhead: - Weight estimation via logistic regression - Weighted p-value computation

import time

# Compare computation times
def time_detector(detector, X_train, X_test):
    start_time = time.time()
    detector.fit(X_train)
    fit_time = time.time() - start_time

    start_time = time.time()
    p_values = detector.compute_p_values(X_test)
    predict_time = time.time() - start_time

    return fit_time, predict_time

# Standard vs Weighted timing
standard_fit, standard_pred = time_detector(standard_detector, X_train, X_test)
weighted_fit, weighted_pred = time_detector(weighted_detector, X_train, X_test)

print(f"Standard: Fit={standard_fit:.2f}s, Predict={standard_pred:.2f}s")
print(f"Weighted: Fit={weighted_fit:.2f}s, Predict={weighted_pred:.2f}s")
print(f"Overhead: {((weighted_fit + weighted_pred) / (standard_fit + standard_pred) - 1) * 100:.1f}%")

Memory Usage

Weighted conformal detection requires storing: - Calibration samples for weight computation - Calibration scores for p-value calculation

For large datasets, consider: - Using a subset of calibration samples for weight estimation - Implementing online/streaming versions

Best Practices

1. Validate Distribution Shift

Always check if distribution shift is actually present:

# Use statistical tests to detect shift
from scipy.stats import ks_2samp

def detect_feature_shift(X_train, X_test):
    """Detect distribution shift in individual features."""
    shift_detected = []
    p_values = []

    for i in range(X_train.shape[1]):
        statistic, p_value = ks_2samp(X_train[:, i], X_test[:, i])
        shift_detected.append(p_value < 0.05)
        p_values.append(p_value)

    print(f"Features with significant shift: {sum(shift_detected)}/{len(shift_detected)}")
    return shift_detected, p_values

shift_features, shift_p_values = detect_feature_shift(X_train, X_test_shifted)

2. Combine with Weighted Conformal Selection

from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

weighted_p_values = weighted_detector.compute_p_values(X_test_shifted)
wcs_mask = weighted_false_discovery_control(
    result=weighted_detector.last_result,
    alpha=0.05,
    pruning=Pruning.DETERMINISTIC,
    seed=42,
)

print(f"WCS-controlled discoveries: {wcs_mask.sum()}")

3. Monitor Weight Quality

Extreme weights can indicate poor weight estimation:

def check_weight_quality(detector, X_calib, X_test):
    """Check for extreme weights that might indicate poor estimation."""
    # This is a conceptual example - actual implementation would require
    # access to the internal weights computed by the detector

    # Rule of thumb: weights should typically be between 0.1 and 10
    # Extreme weights (< 0.01 or > 100) suggest problems
    pass

4. Use Appropriate Base Detectors

Some detectors work better with weighted conformal: - Good: Distance-based methods (LOF, KNN) that are sensitive to distribution - Moderate: Tree-based methods (Isolation Forest) that are somewhat robust - Challenging: Neural networks that might already adapt to shift

Advanced Applications

Multi-domain Adaptation

# Handle multiple domains with different shift patterns
domains = ['domain_A', 'domain_B', 'domain_C']
domain_detectors = {}

for domain in domains:
    detector = ConformalDetector(
        detector=base_detector,
        strategy=strategy,
        aggregation="median",
        weight_estimator=logistic_weight_estimator(),
        seed=42
    )
    detector.fit(X_train)  # Common training set
    domain_detectors[domain] = detector

# Predict on domain-specific test sets with WCS
from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

for domain in domains:
    X_test_domain = load_domain_data(domain)  # Load domain-specific test data
    _ = domain_detectors[domain].compute_p_values(X_test_domain)
    wcs_mask = weighted_false_discovery_control(
        result=domain_detectors[domain].last_result,
        alpha=0.05,
        pruning=Pruning.DETERMINISTIC,
        seed=42,
    )
    print(f"{domain}: {wcs_mask.sum()} discoveries")

Online Adaptation

from nonconform.enums import Pruning
from nonconform.fdr import weighted_false_discovery_control

# Adapt to gradual distribution shift over time
def online_weighted_detection(detector, data_stream, window_size=1000):
    """Online weighted conformal detection with sliding window."""
    detections = []

    for i, (X_batch, _) in enumerate(data_stream):
        if i == 0:
            # Initialize with first batch
            detector.fit(X_batch)
        else:
            # Use sliding window for calibration
            if i * len(X_batch) > window_size:
                start_idx = (i * len(X_batch)) - window_size
                X_calib = get_recent_data(start_idx, window_size)
                detector.fit(X_calib)

            # Predict on current batch with WCS
            _ = detector.compute_p_values(X_batch)
            wcs_mask = weighted_false_discovery_control(
                result=detector.last_result,
                alpha=0.05,
                pruning=Pruning.DETERMINISTIC,
                seed=42,
            )
            detections.append(wcs_mask.sum())

    return detections

Troubleshooting

Common Issues

  1. Poor Weight Estimation
  2. Insufficient calibration data
  3. High-dimensional data with small samples
  4. Solution: Increase calibration size or use dimensionality reduction

  5. Extreme P-values

  6. All p-values near 0 or 1
  7. Solution: Check for severe distribution shift or model mismatch

  8. Inconsistent Results

  9. High variance in detection counts
  10. Solution: Use bootstrap strategy or increase sample size

Debugging Tools

def debug_weighted_conformal(detector, X_train, X_test):
    """Debug weighted conformal detection issues."""
    print("=== Weighted Conformal Debug Report ===")

    # Check data properties
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")
    print(f"Feature dimensions: {X_train.shape[1]}")

    # Fit detector
    detector.fit(X_train)

    # Check calibration set size
    print(f"Calibration samples: {len(detector.calibration_set)}")

    if len(detector.calibration_set) < 50:
        print("WARNING: Small calibration set may lead to unreliable weights")

    # Get predictions
    p_values = detector.compute_p_values(X_test)

    # Check p-value distribution
    print(f"P-value range: [{p_values.min():.4f}, {p_values.max():.4f}]")
    print(f"P-value mean: {p_values.mean():.4f}")
    print(f"P-value std: {p_values.std():.4f}")

    if p_values.std() < 0.01:
        print("WARNING: Very low p-value variance - check for issues")

    print("=== End Debug Report ===")

# Example usage
debug_weighted_conformal(weighted_detector, X_train, X_test_shifted)

References

  • Jin, Y., & Candès, E. J. (2023). Model-free Selective Inference Under Covariate Shift via Weighted Conformal p-values. Biometrika, 110(4), 1090-1106. arXiv:2307.09291. [Foundational paper on weighted conformal inference and WCS procedure]

  • Tibshirani, R. J., Barber, R. F., Candes, E., & Ramdas, A. (2019). Conformal Prediction Under Covariate Shift. Advances in Neural Information Processing Systems, 32. arXiv:1904.06019. [Early work on conformal prediction with covariate shift]

  • Genovese, C. R., Roeder, K., & Wasserman, L. (2006). False Discovery Control with p-value Weighting. Biometrika, 93(3), 509-524. [Theoretical foundation for weighted FDR control]

Next Steps