Choosing Calibration Strategies¶

This guide helps you choose a calibration strategy based on data size, runtime, memory, and how clean a validity story you need. The recommendations are starting points, not universal optima.

Strategy Overview¶

nonconform provides one simple split baseline and a family of resampling strategies for cases where a holdout calibration split would waste too much data:

Strategy	Speed	Data Efficiency	Validity Story	Best For
Split	High	Medium	Cleanest	Large datasets, production baselines
CV+	Medium	High	Resampling-based	Practical small-data default
Jackknife+	Low	Very high	Resampling-based	Very small datasets
JackknifeBootstrap (JaB+)	Low	High	Looser resampling bound	Bootstrap stability

Guarantee note: Split conformal is the cleanest strict finite-sample baseline. Resampling strategies such as cross-conformal, CV+, Jackknife+, and JaB+ use data more efficiently and often work well in practice, but their guarantees are weaker, approximate, asymptotic, or looser depending on the method. mode="plus" is the validity-oriented default for these families; mode="single_model" is lighter but weakens the validity story further.

Detailed Strategy Characteristics¶

Split Conformal¶

When to use:

- Large training datasets (>5,000 samples)
- Real-time or production environments requiring fast inference
- When computational resources are limited
- Initial prototyping and development

Advantages:

- Fastest training and inference
- Minimal memory usage
- Simple to understand and implement
- Predictable computational cost

Disadvantages:

- Uses only a subset of data for calibration
- May be less reliable with small datasets
- No theoretical optimality guarantees

Configuration example:

from nonconform import Split

# For large datasets
strategy = Split(n_calib=0.2)  # Use 20% for calibration

# For fixed calibration size
strategy = Split(n_calib=2000)  # Use exactly 2000 samples
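To make the role of the calibration set concrete, here is a minimal sketch of the standard split-conformal p-value. It is not nonconform's internal code: `conformal_p_value` is a hypothetical helper, and it assumes that higher scores mean more anomalous samples.

```python
def conformal_p_value(calib_scores, test_score):
    """Split-conformal p-value; higher scores mean more anomalous."""
    # Count calibration scores at least as extreme as the test score.
    n_at_least = sum(1 for s in calib_scores if s >= test_score)
    # The +1 terms count the test point itself, which keeps p > 0.
    return (1 + n_at_least) / (len(calib_scores) + 1)


calib_scores = [0.1, 0.2, 0.3, 0.4]  # scores of held-out normal samples
p_extreme = conformal_p_value(calib_scores, 0.9)   # 1 / 5
p_typical = conformal_p_value(calib_scores, 0.25)  # 3 / 5
```

Only the held-out scores enter the p-value, which is why the calibration size chosen through `n_calib` matters as much as the training size.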

Data-Efficient Resampling¶

When to use:

- Small to medium datasets where a fixed calibration holdout is too costly
- Applications where every observation should help train at least one model
- Research workflows where you can spend extra computation for smoother results
- Production workflows where the memory and latency costs are acceptable

Advantages:

- Avoids permanently reserving a calibration-only subset
- Lets each observation contribute through folds, leave-one-out fits, or bootstrap out-of-bag structure
- Often improves practical power when data is scarce
- mode="plus" gives the most defensible resampling option in this package

Disadvantages:

- More computationally expensive than Split
- Memory usage can grow with folds, leave-one-out models, or bootstraps
- Guarantees are weaker, approximate, asymptotic, or looser than the clean split-conformal baseline, depending on the method
- Method choice depends on dataset size and compute budget

Configuration examples:

from nonconform import CrossValidation, JackknifeBootstrap

# Practical default for limited data
strategy = CrossValidation(k=5, mode="plus")

# Leave-one-out variant for very small datasets
strategy = CrossValidation.jackknife(mode="plus")

# Bootstrap variant for stability analysis
strategy = JackknifeBootstrap(n_bootstraps=100, mode="plus")

How to choose inside the family:

Method	Good First Use	Watch Out For
CV+	Limited data with practical compute	More folds cost more model fits
Jackknife+	Very small data	Leave-one-out fitting can be expensive
JaB+	Bootstrap stability or noisy data	Too few bootstraps can be unstable
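The idea that every observation can contribute a calibration score is easiest to see in a toy version. The sketch below is an illustration only, not the library's implementation: `out_of_fold_scores` is a hypothetical helper, the "model" is just the training mean, and folds are assigned by index modulo k.

```python
from statistics import mean


def out_of_fold_scores(values, k):
    """Score each value with a toy model fitted on the other folds.

    The toy model is the training mean; the score is the absolute
    distance from it. Fold membership is index modulo k.
    """
    scores = [0.0] * len(values)
    for fold in range(k):
        train = [v for i, v in enumerate(values) if i % k != fold]
        center = mean(train)  # "fit" on everything outside the fold
        for i, v in enumerate(values):
            if i % k == fold:
                scores[i] = abs(v - center)  # score held-out points only
    return scores


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_scores = out_of_fold_scores(values, k=3)             # 3 fits, 6 scores
loo_scores = out_of_fold_scores(values, k=len(values))  # 6 fits, 6 scores
```

Both calls return one score per observation; the leave-one-out call pays for it with one fit per observation, which is the cost the table above warns about.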

Decision Framework¶

The thresholds below are practical defaults. Use labeled validation data when available and compare strategies on empirical FDR, power, runtime, and memory.

1. Dataset Size Considerations¶

Large datasets (>10,000 samples):

- Primary choice: Split (fast, efficient)
- Alternative: JackknifeBootstrap (if speed is not the top priority)

Medium datasets (1,000-10,000 samples):

- Primary choice: JackknifeBootstrap (balanced robustness and practicality)
- Alternative: CV+ (if you want lower compute than larger-bootstrap setups; Jackknife+ needs one fit per observation, which is costly at this size)

Small datasets (<1,000 samples):

- Primary choice: Jackknife+
- Alternative: Jackknife (for the smallest datasets)
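The size tiers above can be written down as a small lookup, which is handy when a pipeline has to pick a starting strategy automatically. `recommend_strategy` is a hypothetical helper, not part of nonconform, and the treatment of the exact boundaries is an assumption.

```python
def recommend_strategy(n_samples):
    """Return the primary choice from the dataset-size tiers.

    Boundary handling (exactly 1,000 or 10,000 samples counts as
    "medium") is an assumption; the guide leaves it open.
    """
    if n_samples > 10_000:
        return "Split"
    if n_samples >= 1_000:
        return "JackknifeBootstrap"
    return "Jackknife+"


for n in (200, 5_000, 50_000):
    print(n, recommend_strategy(n))
```

Treat the result as a first candidate to benchmark, not as a final decision.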

2. Performance Requirements¶

Real-time applications (latency <100ms):

- Use Split conformal
- Pre-compute calibration sets where possible
- Consider caching fitted detectors

Batch processing (latency <10s):

- Jackknife+ or JackknifeBootstrap
- Optimize based on accuracy requirements

Offline analysis (no latency constraints):

- Any strategy based on accuracy needs
- JackknifeBootstrap for maximum robustness

3. Accuracy vs Speed Trade-offs¶

Maximum speed (production systems):

from nonconform import Split

# Fastest configuration
strategy = Split(n_calib=1000)  # Fixed size for predictable performance

Balanced (general applications):

from nonconform import JackknifeBootstrap

# Good robustness with practical defaults
strategy = JackknifeBootstrap(n_bootstraps=100)

Maximum robustness checks (research/high-rigor applications):

from nonconform import JackknifeBootstrap

# More resampling stability, but slower
strategy = JackknifeBootstrap(n_bootstraps=200)

Advanced Considerations¶

Data Distribution Properties¶

Exchangeable data (IID assumption holds):

- All strategies work well
- Choose based on computational constraints

Non-exchangeable data (distribution shift):

- Consider weighted conformal detection only when the shift is plausibly covariate shift with support overlap
- JackknifeBootstrap strategy may provide additional robustness
- Monitor calibration performance over time
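For intuition about what weighting changes, the following sketch shows a common textbook form of the weighted conformal p-value. It is an assumption that this matches what any particular library computes: `weighted_p_value` is a hypothetical helper, the weights stand for density ratios that you would have to estimate separately, and higher scores are taken to mean more anomalous.

```python
def weighted_p_value(calib_scores, calib_weights, test_score, test_weight):
    """Weighted conformal p-value; higher scores mean more anomalous.

    Weights stand for density ratios w(x) = p_test(x) / p_calib(x),
    which must be estimated separately.
    """
    mass_at_least = sum(
        w for s, w in zip(calib_scores, calib_weights) if s >= test_score
    )
    return (mass_at_least + test_weight) / (sum(calib_weights) + test_weight)


scores = [0.1, 0.2, 0.3, 0.4]
# With all weights equal to 1 this is the ordinary conformal p-value.
p_uniform = weighted_p_value(scores, [1, 1, 1, 1], 0.35, 1)  # 2 / 5
# Up-weighting the high-scoring calibration point raises the p-value.
p_shifted = weighted_p_value(scores, [1, 1, 1, 4], 0.35, 1)  # 5 / 8
```

The support-overlap condition matters because a calibration region with no test mass, or the reverse, makes the density ratio zero or undefined.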

Heterogeneous data (mixed distributions):

- JackknifeBootstrap recommended
- Jackknife+ as alternative
- Avoid Split with very diverse training sets

Computational Resource Planning¶

Memory constraints:

- Split: O(n_calib) memory usage
- Jackknife+: O(n_train) memory usage
- Cross-Validation: O(k × n_test) inference peak; O(k) stored models + O(n_train) calibration scores
- JackknifeBootstrap: O(n_train × n_bootstraps) memory usage (includes permanent _oob_mask storage)

CPU considerations:

- Split: Single model training
- Jackknife+: n_train + 1 model trainings
- Cross-Validation: n_folds model trainings
- JackknifeBootstrap: n_bootstraps model trainings
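A back-of-the-envelope calculator turns the counts above into numbers for a given dataset. Both helpers are hypothetical, the fit counts are copied from the list above, and one byte per mask entry is an assumption (it matches a NumPy boolean array).

```python
def model_fits(strategy, n_train, k=5, n_bootstraps=100):
    """Number of model trainings, using the counts listed above."""
    counts = {
        "Split": 1,
        "Jackknife+": n_train + 1,
        "CV+": k,
        "JaB+": n_bootstraps,
    }
    return counts[strategy]


def oob_mask_bytes(n_train, n_bootstraps, bytes_per_flag=1):
    """Rough size of an n_train x n_bootstraps out-of-bag mask.

    One byte per flag is an assumption (it matches a NumPy bool array).
    """
    return n_train * n_bootstraps * bytes_per_flag


names = ("Split", "CV+", "JaB+", "Jackknife+")
fits = {name: model_fits(name, n_train=2_000) for name in names}
mask_mb = oob_mask_bytes(2_000, 100) / 1e6  # 0.2 MB
```

At 2,000 training samples the mask is negligible, while the 2,001 fits of Jackknife+ usually dominate the budget.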

Strategy Transition Guide¶

From Research to Production¶

Development phase: Use JackknifeBootstrap for robust results
Validation phase: Compare with Jackknife+ for speed assessment
Production phase: Deploy with Split when latency, memory, and simple validation are the priorities
Monitoring phase: Validate that Split maintains required accuracy

Handling Performance Degradation¶

If you observe degraded performance after strategy changes:

Check calibration set size: Ensure adequate samples for reliable calibration
Validate data assumptions: Verify exchangeability hasn't changed
Monitor drift: Use weighted conformal only when detected drift matches the covariate-shift assumptions
Adjust parameters: Tune strategy-specific parameters

Common Pitfalls¶

Split Conformal¶

Don't: Use with very small datasets (<500 samples)
Don't: Use fixed small calibration sets with varying dataset sizes
Do: Use proportional calibration sizing for consistency
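One way to see why small calibration sets hurt: a conformal p-value built from n_calib scores can never be smaller than 1 / (n_calib + 1). The sketch below, with hypothetical helper names, computes that floor and the calibration size needed to reach a given threshold.

```python
import math


def smallest_p_value(n_calib):
    """Smallest p-value a calibration set of this size can produce."""
    return 1 / (n_calib + 1)


def min_calib_for_level(level):
    """Smallest n_calib whose minimum p-value is at or below `level`."""
    return math.ceil(1 / level) - 1


floor_99 = smallest_p_value(99)          # 1 / 100
needed_005 = min_calib_for_level(0.05)   # 19 calibration samples
needed_0001 = min_calib_for_level(0.001) # 999 calibration samples
```

Multiple-testing corrections can push the per-test threshold well below the nominal alpha, so the required calibration size grows with the number of test points evaluated together.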

Resampling Strategies¶

Don't: Use too many folds with small datasets (overfitting risk)
Don't: Treat mode="single_model" as equivalent to plus-style resampling
Don't: Forget that Jackknife+ requires one fit per observation
Don't: Use too few bootstraps (<20) for robust estimates
Do: Balance folds, leave-one-out fits, or bootstraps against your compute budget
Do: Monitor bootstrap stability when using JaB+
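The warning about too few bootstraps can be quantified under one assumption: each bootstrap sample draws n_train observations with replacement. Whether nonconform samples exactly this way is not confirmed here, and `prob_never_out_of_bag` is a hypothetical helper.

```python
def prob_never_out_of_bag(n_train, n_bootstraps):
    """Chance that one observation is in-bag in every bootstrap sample."""
    # Probability of being left out of a single bootstrap sample;
    # approaches exp(-1), about 0.368, as n_train grows.
    p_oob_once = (1 - 1 / n_train) ** n_train
    return (1 - p_oob_once) ** n_bootstraps


for b in (5, 20, 100):
    print(b, prob_never_out_of_bag(1_000, b))
```

An observation that is never out-of-bag has no held-out score to contribute, so with very few bootstraps a noticeable share of the data drops out of calibration.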

Benchmarking Your Choice¶

Always validate your strategy choice with performance metrics:

from nonconform import ConformalDetector, CrossValidation, JackknifeBootstrap, Split
from nonconform.metrics import false_discovery_rate, statistical_power

# Compare strategies on your data.
# your_detector, X_train, X_test, and y_test are placeholders you supply.
strategies = {
    "Split": Split(n_calib=0.2),
    "CV+": CrossValidation(k=5, mode="plus"),
    "Jackknife+": CrossValidation.jackknife(mode="plus"),
    "JaB+": JackknifeBootstrap(n_bootstraps=100, mode="plus"),
}

for name, strategy in strategies.items():
    detector = ConformalDetector(
        detector=your_detector,
        strategy=strategy,
        seed=42
    )
    detector.fit(X_train)
    decisions = detector.select(X_test, alpha=0.1)

    # Evaluate FDR-controlled decisions
    fdr = false_discovery_rate(y_test, decisions)
    power = statistical_power(y_test, decisions)

    print(f"{name}: FDR={fdr:.3f}, Power={power:.3f}")

Choose the strategy that best meets your requirements for FDR control, statistical power, runtime, and memory. When in doubt, keep Split as a baseline: it is easier to reason about, and it makes assumption failures easier to spot.
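For readers who want to check the numbers by hand, the two metrics can be sketched with their usual definitions. These helpers are illustrative and are not the implementations in `nonconform.metrics`; the names `empirical_fdr` and `empirical_power` are hypothetical, and labels are assumed to be 1 for anomalies and 0 for normal samples.

```python
def empirical_fdr(y_true, decisions):
    """Share of flagged samples that are actually normal (0 if none flagged)."""
    flagged = [t for t, d in zip(y_true, decisions) if d]
    if not flagged:
        return 0.0
    return sum(1 for t in flagged if t == 0) / len(flagged)


def empirical_power(y_true, decisions):
    """Share of true anomalies that were flagged (0 if there are none)."""
    hits = [d for t, d in zip(y_true, decisions) if t == 1]
    if not hits:
        return 0.0
    return sum(1 for d in hits if d) / len(hits)


y_true = [0, 0, 0, 1, 1, 1, 1, 0]     # 1 = anomaly
decisions = [0, 1, 0, 1, 1, 1, 0, 0]  # 1 = flagged
fdr = empirical_fdr(y_true, decisions)      # 1 false alarm of 4 flagged
power = empirical_power(y_true, decisions)  # 3 of 4 anomalies found
```

A single test set gives one noisy estimate of each metric, so averaging over several random splits or seeds gives a fairer comparison between strategies.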

References¶

For the statistical background behind these recommendations, see Conformalization Strategies.