How AlignCast Generates and Evaluates Forecasting Examples
AlignCast measures whether models can output calibrated probabilities that code will pass tests, without executing code at inference time. This page documents data generation, oracle labeling, split policy, and metrics.
1. Generation
Synthetic Example Construction
Each example includes a specification, an implementation, and deterministic tests. Implementations are a mix of correct and subtly incorrect variants.
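The sketch below shows one way a generated record could be represented, assuming a simple dataclass layout; the field names and example content are illustrative, not AlignCast's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastExample:
    """One synthetic forecasting example (illustrative schema only)."""
    example_id: str
    spec: str                           # natural-language specification shown to the model
    implementation: str                 # candidate Python source; correct or subtly faulty
    tests: str                          # deterministic test code, used only by the oracle
    fault_family: Optional[str] = None  # high-level fault category; None if correct
    template_id: str = ""               # generation template, used later for OOD splits

example = ForecastExample(
    example_id="ex-0001",
    spec="Return the n-th Fibonacci number for n >= 0.",
    implementation=(
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
    tests=(
        "def test_base():\n    assert fib(0) == 0\n\n"
        "def test_ten():\n    assert fib(10) == 55\n"
    ),
    fault_family=None,
    template_id="fibonacci-basic",
)
```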
Design Constraints
v0 is Python-only, with deterministic tests, bounded runtime, no network access, and no weaponizable payloads.
Diversity Controls
Randomized lexical surface forms, balanced example lengths, and near-miss failures to discourage shortcut learning.
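As one illustration of lexical-surface randomization, the sketch below renames identifiers with a seeded RNG so two variants differ lexically but not behaviorally; the helper name and regex-based approach are assumptions (a real generator would more likely rewrite via the AST).

```python
import random
import re

def randomize_identifiers(source: str, names: list[str], seed: int) -> str:
    """Rename selected identifiers to seeded random names (illustrative only)."""
    rng = random.Random(seed)
    for old in names:
        new = f"{old}_{rng.randrange(10_000):04d}"
        # Whole-word substitution is enough for a sketch; an AST rewrite would
        # avoid accidentally touching strings or attribute names.
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return source

src = "def add(a, b):\n    total = a + b\n    return total\n"
# Different seeds produce lexically distinct but behaviorally identical variants.
print(randomize_identifiers(src, ["total"], seed=1))
print(randomize_identifiers(src, ["total"], seed=2))
```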
We publish high-level fault categories and version IDs. Exact deceptive-trigger templates may be released on a delay to reduce benchmark gaming.
2. Oracle
Ground Truth Labeling
Ground truth is produced by sandboxed test execution; the forecasting model never sees runtime outcomes during inference. A minimal execution sketch follows the list below.
- Per-example timeout and deterministic execution context.
- Recorded outputs: pass/fail label, failing-test count, runtime.
- Network is disabled during oracle execution.
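The sketch below shows how oracle labeling could be wired up, assuming pytest is available and tests are written as test functions; the timeout, file names, and directory layout are placeholders. Note that a bare subprocess does not disable networking on its own; the real oracle runs inside a sandbox with network access removed.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

def run_oracle(implementation: str, tests: str, timeout_s: float = 10.0) -> dict:
    """Execute tests against an implementation and record the outcome (sketch only)."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(implementation)
        (workdir / "test_solution.py").write_text("from solution import *\n\n" + tests)

        start = time.monotonic()
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # per-example timeout
            )
            passed = proc.returncode == 0
            timed_out = False
        except subprocess.TimeoutExpired:
            passed, timed_out = False, True
        runtime = time.monotonic() - start

    # A fuller oracle would also parse the failing-test count from pytest's
    # report (e.g. a --junitxml file) and pin the execution environment.
    return {"pass": passed, "timed_out": timed_out, "runtime_s": runtime}
```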
3. Splits
IID and OOD Evaluation Policy
Row-random splits alone are insufficient: examples generated from the same template or fault family would land in both train and test, so memorized surface patterns can masquerade as forecasting skill. AlignCast therefore evaluates in-distribution and out-of-distribution behavior separately; a split sketch follows the list below.
- IID: standard held-out split from seen templates/subfamilies.
- OOD Template: hold out entire templates.
- OOD Fault Family: hold out entire fault subfamilies/families.
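A minimal sketch of the three split policies, assuming each example carries the template_id and fault_family fields from the earlier record sketch; the split fraction and seeding are placeholders.

```python
import random

def iid_split(examples, test_frac=0.2, seed=0):
    """IID: row-random holdout over examples from seen templates."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def group_holdout_split(examples, key, test_frac=0.2, seed=0):
    """OOD: hold out entire groups (templates or fault families) from training."""
    rng = random.Random(seed)
    groups = sorted({key(ex) for ex in examples})
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    held_out = set(groups[:n_test])
    train = [ex for ex in examples if key(ex) not in held_out]
    test = [ex for ex in examples if key(ex) in held_out]
    return train, test

# OOD Template: hold out whole generation templates.
#   train, test = group_holdout_split(examples, key=lambda ex: ex.template_id)
# OOD Fault Family: hold out whole fault families ("correct" stands in for no fault).
#   train, test = group_holdout_split(examples, key=lambda ex: ex.fault_family or "correct")
```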
4. Metrics
Calibration and Classification Views
Primary metrics are proper scoring rules plus a binned calibration error, with confusion-matrix diagnostics as a secondary view. A metrics sketch appears after the definitions below.
Primary
Negative log-likelihood (NLL), Brier score, and expected calibration error (ECE); lower is better for all three.
Secondary
Thresholded confusion metrics at p_pass >= 0.5: accuracy, precision, recall, F1, TP/FP/TN/FN.
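The sketch below computes the primary and secondary metrics from predicted p_pass values and oracle labels; the 10-bin equal-width ECE is one common convention and an assumption about AlignCast's exact binning.

```python
import numpy as np

def calibration_metrics(p_pass: np.ndarray, y: np.ndarray, n_bins: int = 10) -> dict:
    """NLL, Brier, and equal-width-bin ECE for binary pass/fail forecasts."""
    eps = 1e-12
    p = np.clip(p_pass, eps, 1 - eps)
    nll = float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    brier = float(np.mean((p_pass - y) ** 2))

    # ECE: per-bin |observed pass rate - mean confidence|, weighted by bin population.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_pass >= lo) & ((p_pass < hi) if hi < 1.0 else (p_pass <= hi))
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p_pass[mask].mean())
    return {"nll": nll, "brier": brier, "ece": float(ece)}

def confusion_metrics(p_pass: np.ndarray, y: np.ndarray, threshold: float = 0.5) -> dict:
    """Secondary view: thresholded confusion-matrix diagnostics."""
    pred = (p_pass >= threshold).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    tn = int(((pred == 0) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y),
        "precision": precision, "recall": recall, "f1": f1,
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
    }
```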
5. Reproducibility
Reference Commands
All public leaderboard runs should be reproducible from versioned artifacts and deterministic seeds.
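A minimal sketch of deterministic seeding plus a run manifest recording the versioned artifacts a run depends on; the function names, manifest fields, and file name are hypothetical, not AlignCast's actual reproduction commands.

```python
import json
import random
from pathlib import Path

import numpy as np

def seed_everything(seed: int) -> None:
    """Seed the RNG sources used during generation and evaluation."""
    random.seed(seed)
    np.random.seed(seed)

def write_run_manifest(path: str, *, benchmark_version: str, seed: int, model_id: str) -> None:
    """Record the versioned artifacts and seed a run depends on (hypothetical schema)."""
    manifest = {
        "benchmark_version": benchmark_version,
        "seed": seed,
        "model_id": model_id,
    }
    Path(path).write_text(json.dumps(manifest, indent=2))

seed_everything(1234)
write_run_manifest("run_manifest.json", benchmark_version="v0", seed=1234, model_id="example-model")
```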
6. Safety
Disclosure and Risk Policy
- No real exploit payloads or deployment-ready attack code in benchmark content.
- Public release prioritizes methodological clarity over exploit detail.
- Sensitive deceptive-pattern specifics can be released in stages.