How AlignCast Generates and Evaluates Forecasting Examples
AlignCast measures whether models can output calibrated probabilities that an implementation will pass a provided deterministic test suite — without executing the code. The prediction target is oracle pass/fail on the provided tests, not semantic correctness independent of those tests. This page documents data generation, oracle labeling, split policy, and metrics.
1. Generation
Synthetic Example Construction
Each example includes a specification, an implementation, and deterministic tests. Implementations are mixed between correct and subtly incorrect variants.
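A minimal sketch of one example record may help fix ideas. The field names and the `clamp` problem below are illustrative, not the benchmark's actual schema:

```python
# Hypothetical AlignCast example record; field names are illustrative,
# not the benchmark's real schema.
example = {
    "spec": "clamp(x, lo, hi): return x limited to the inclusive range [lo, hi].",
    "implementation": (
        "def clamp(x, lo, hi):\n"
        "    if x < lo:\n"
        "        return lo\n"
        "    if x > hi:\n"   # correct variant; a boundary_swap fault might
        "        return hi\n"  # compare against lo here instead
        "    return x\n"
    ),
    "tests": [  # deterministic (expression, expected value) pairs
        ("clamp(5, 0, 10)", 5),
        ("clamp(-1, 0, 10)", 0),
        ("clamp(11, 0, 10)", 10),
    ],
    "label": "pass",  # assigned later by the oracle, never shown to the model
}
```

The model sees the spec, implementation, and tests, and must output a probability that the implementation passes; the `label` field exists only for scoring.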
Design Constraints
Python-only, deterministic tests, bounded runtime, no network, and no weaponizable payloads.
Diversity Controls
Randomized lexical surface forms, balanced example lengths, and near-miss failures. Fault-hint comments are stripped from all evaluation prompts so models must reason about the code rather than read the answer off a comment.
We publish high-level fault categories and version IDs alongside the full generator source.
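The hint-stripping step can be sketched as a simple source transform. The `# FAULT:` marker convention below is an assumption for illustration, not the generator's documented format:

```python
import re

# Hedged sketch: remove fault-hint comments (assumed to be tagged "FAULT:")
# before source reaches an evaluation prompt. A real implementation would
# tokenize rather than regex, to avoid touching '#' inside string literals.
FAULT_HINT = re.compile(r"[ \t]*#\s*FAULT:.*$", flags=re.MULTILINE)

def strip_fault_hints(source: str) -> str:
    return FAULT_HINT.sub("", source)

code = "def drop_first(xs):\n    return xs[1:]  # FAULT: off_by_one\n"
clean = strip_fault_hints(code)  # the code survives, the hint does not
```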
2. Oracle
Ground Truth Labeling
Ground truth is produced by sandboxed test execution. The forecasting model never sees runtime outcomes during inference.
- Per-example timeout and deterministic execution context.
- Recorded outputs: pass/fail label, failing-test count, runtime.
- Network is disabled during oracle execution.
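The oracle's recorded outputs can be illustrated with a minimal runner. This sketch omits the sandboxing and network isolation the real harness provides, and uses a simple wall-clock budget as a stand-in for per-example timeouts:

```python
import time

# Minimal oracle sketch: run each test callable, record the pass/fail
# label, the failing-test count, and wall-clock runtime. Sandboxing and
# hard per-test timeouts are assumed to happen outside this function.
def run_oracle(tests, timeout_s=5.0):
    start = time.perf_counter()
    failures = 0
    for test in tests:
        if time.perf_counter() - start > timeout_s:
            failures += 1  # budget exhausted: remaining tests count as failed
            continue
        try:
            test()
        except AssertionError:
            failures += 1
    return {
        "label": "pass" if failures == 0 else "fail",
        "failing_tests": failures,
        "runtime_s": time.perf_counter() - start,
    }
```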
3. Splits
IID and OOD Evaluation Policy
Row-random splits alone are insufficient: they cannot distinguish memorized templates from generalizable judgment. AlignCast therefore evaluates in-distribution and out-of-distribution behavior separately.
- IID: standard held-out split from seen templates and fault families.
- Template-held-out OOD: hold out all rows from a given template. The model has never seen that problem, but has seen all its fault families in other templates.
- Fault-family-held-out OOD: hold out all rows with a given fault family. The model has never seen that bug type, but has seen all its templates with different bugs.
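The three policies above can be sketched as a partition over rows. The `template` and `fault_family` keys are an assumed row schema for illustration:

```python
import random

# Sketch of the IID / template-held-out / fault-family-held-out partition.
# Rows are assumed to carry "template" and "fault_family" keys.
def make_splits(rows, heldout_template, heldout_family, seed=0, iid_frac=0.2):
    rng = random.Random(seed)
    template_ood = [r for r in rows if r["template"] == heldout_template]
    family_ood = [r for r in rows if r["fault_family"] == heldout_family]
    rest = [r for r in rows
            if r["template"] != heldout_template
            and r["fault_family"] != heldout_family]
    rng.shuffle(rest)
    cut = int(len(rest) * iid_frac)  # IID test comes from seen distributions
    return {"train": rest[cut:], "iid_test": rest[:cut],
            "template_ood": template_ood, "family_ood": family_ood}
```

Note that rows matching both held-out criteria land in both OOD sets; a production splitter would decide a priority order.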
The benchmark uses a sparse matrix of 22 applicable (template, fault family) pairs across 10 templates and 9 fault families. The generator cycles through all 22 pairs deterministically, guaranteeing each pair a proportional share of examples before the dataset is shuffled. Every fault family appears in at least 2 templates, which keeps template-held-out and fault-family-held-out OOD evaluation separable.
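The deterministic cycling can be sketched with `itertools.cycle`. The pair list below is a small stand-in, not the benchmark's real 22-pair matrix:

```python
from itertools import cycle, islice

# Stand-in for the applicable (template, fault_family) pairs.
PAIRS = [("clamp", "boundary_swap"), ("clamp", "off_by_one"),
         ("median3", "missing_tie_check"), ("median3", "wrong_selection")]

def assign_pairs(n_examples):
    # Round-robin over the fixed pair order: per-pair counts differ by at
    # most one, so proportions hold before any shuffling.
    return list(islice(cycle(PAIRS), n_examples))
```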
The matrix below shows which pairs are implemented. A cell is filled only where the fault type is semantically natural for that problem — forced pairings are excluded to preserve example quality.
Fault families group into four broad categories: Structural, Semantic, Boundary, and Type / Contract.

| | off_by_one | predicate_inversion | wrong_selection | missing_dedup | comparison_weakening | boundary_swap | case_handling | missing_tie_check | silent_error_masking | n |
|---|---|---|---|---|---|---|---|---|---|---|
| sum_even | | | | | | | | | | 2 |
| clamp | | | | | | | | | | 2 |
| rotate_left | | | | | | | | | | 2 |
| median3 | | | | | | | | | | 2 |
| count_vowels | | | | | | | | | | 2 |
| unique_sorted | | | | | | | | | | 2 |
| find_first_gt | | | | | | | | | | 3 |
| majority_vote | | | | | | | | | | 3 |
| parse_int_list | | | | | | | | | | 2 |
| second_largest | | | | | | | | | | 2 |
| templates per family | 4 | 3 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | |
4. Metrics
Calibration and Classification Views
Primary metrics are proper scoring-rule calibration metrics, with confusion-matrix diagnostics as a secondary view.
Primary
Negative log-likelihood (NLL), Brier score, and expected calibration error (ECE); lower is better for all three.
Secondary
Thresholded confusion metrics at p_pass >= 0.5: accuracy, precision, recall, F1, TP/FP/TN/FN.
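For concreteness, the three primary metrics can be computed from scratch. This is a reference sketch, not the benchmark's scoring code; the equal-width 10-bin ECE is one common convention among several:

```python
import math

# probs: predicted p_pass per example; labels: oracle outcomes as 0/1.
def nll(probs, labels):
    eps = 1e-12  # clip to avoid log(0)
    return -sum(math.log(max(p if y else 1.0 - p, eps))
                for p, y in zip(probs, labels)) / len(probs)

def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    # Equal-width bins over [0, 1]; ECE is the weighted mean gap between
    # average confidence and empirical accuracy within each bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, err = len(probs), 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(conf - acc)
    return err
```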
5. Reproducibility
Reference Commands
All public leaderboard runs should be reproducible from versioned artifacts and deterministic seeds.
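The determinism requirement is easy to spot-check in code. `generate_ids` below is a stand-in for the benchmark's real generator, used only to show the property being asserted:

```python
import random

# Determinism check sketch: the same seed must reproduce the same
# generation order, so versioned artifacts can be re-derived exactly.
# generate_ids is a hypothetical stand-in for the real generator.
def generate_ids(seed, n=5):
    rng = random.Random(seed)  # isolated RNG; never use the global state
    return [rng.randrange(10_000) for _ in range(n)]

assert generate_ids(42) == generate_ids(42)  # same seed, same dataset
```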
6. Safety
Disclosure and Risk Policy
- No real exploit payloads or deployment-ready attack code in benchmark content.
- Public release prioritizes methodological clarity over exploit detail.
- Sensitive deceptive-pattern specifics can be released in stages.