Public Methodology

How AlignCast Generates and Evaluates Forecasting Examples

AlignCast measures whether models can output calibrated probabilities that an implementation will pass a provided deterministic test suite — without executing the code. The prediction target is oracle pass/fail on the provided tests, not semantic correctness independent of those tests. This page documents data generation, oracle labeling, split policy, and metrics.

Benchmark definition: benchgen_spec_canonical.md · Taxonomy reference (non-normative): fault_taxonomy_v1.md


1. Generation

Synthetic Example Construction

Each example includes a specification, an implementation, and deterministic tests. Implementations are mixed between correct and subtly incorrect variants.

Design Constraints

Python-only, deterministic tests, bounded runtime, no network, and no weaponizable payloads.

Diversity Controls

Randomized lexical surface, balanced lengths, and near-miss failures. Fault-hint comments are stripped from all evaluation prompts so models must reason about the code rather than read the answer out of a comment.
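The hint-stripping step can be sketched as a simple comment filter. This assumes hint comments follow a `# FAULT: ...` convention; the real generator's marker may differ.

```python
import re

def strip_fault_hints(source: str, marker: str = "FAULT") -> str:
    """Remove comment lines that would reveal the injected fault.

    The `# FAULT: ...` marker convention here is illustrative, not the
    generator's actual tag.
    """
    cleaned = []
    for line in source.splitlines():
        # Drop whole-line hint comments entirely.
        if re.match(rf"\s*#\s*{marker}\b", line):
            continue
        # Strip trailing hint comments, keep the code before them.
        cleaned.append(re.sub(rf"\s*#\s*{marker}\b.*$", "", line))
    return "\n".join(cleaned)
```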

We publish high-level fault categories and version IDs alongside the full generator source.


2. Oracle

Ground Truth Labeling

Ground truth is produced by sandboxed test execution. The forecasting model never sees runtime outcomes during inference.
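The oracle step amounts to running implementation plus tests in a child process and recording pass/fail. A minimal sketch: the real sandbox additionally isolates the process (no network, bounded memory); here only a wall-clock timeout is enforced.

```python
import os
import subprocess
import sys
import tempfile

def oracle_label(implementation: str, tests: str, timeout_s: int = 10) -> bool:
    """Return True iff the implementation passes its deterministic tests.

    Sketch only: a production harness adds network and resource isolation
    on top of the timeout used here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "case.py")
        with open(path, "w") as f:
            f.write(implementation + "\n\n" + tests + "\n")
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        # Exit code 0 means every assertion in the test block passed.
        return proc.returncode == 0
```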


3. Splits

IID and OOD Evaluation Policy

Row-random splits alone cannot distinguish in-distribution performance from generalization, so AlignCast evaluates in-distribution and out-of-distribution behavior separately.

The benchmark uses a sparse matrix of 22 applicable (template, fault family) pairs across 10 templates and 9 fault families. The generator cycles through all pairs deterministically, guaranteeing each appears in proportion before the dataset is shuffled. Every fault family appears in at least 2 templates, enabling separable template-held-out and fault-family-held-out OOD evaluation.
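The deterministic pair cycling described above can be sketched as follows. The `APPLICABLE` map is a two-template stand-in for the real 22-pair matrix, and the seeded shuffle stands in for the post-cycling dataset shuffle.

```python
import itertools
import random

# Hypothetical applicability map: template -> applicable fault families.
# The real 22-pair matrix lives in the generator source; two rows shown.
APPLICABLE = {
    "sum_even": ["off_by_one", "predicate_inversion"],
    "find_first_gt": ["off_by_one", "comparison_weakening", "boundary_swap"],
}

def plan_examples(n: int, seed: int = 42):
    """Cycle deterministically through all (template, fault family) pairs
    so each appears in proportion, then shuffle with a fixed seed."""
    pairs = [(t, f) for t, faults in APPLICABLE.items() for f in faults]
    plan = [p for p, _ in zip(itertools.cycle(pairs), range(n))]
    random.Random(seed).shuffle(plan)
    return plan
```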

The matrix pairs each template with the fault families for which that fault type is semantically natural; forced pairings are excluded to preserve example quality, so unmatched cells are not coded. Fault families fall into four categories: Structural, Semantic, Boundary, and Type / Contract. The row and column totals of the 22-pair matrix are:

Applicable pairs per template:

    sum_even        2
    clamp           2
    rotate_left     2
    median3         2
    count_vowels    2
    unique_sorted   2
    find_first_gt   3
    majority_vote   3
    parse_int_list  2
    second_largest  2

Templates per fault family:

    off_by_one            4
    predicate_inversion   3
    wrong_selection       2
    missing_dedup         2
    comparison_weakening  3
    boundary_swap         2
    case_handling         2
    missing_tie_check     2
    silent_error_masking  2

Full design rationale: docs/fault_matrix_v1.html
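Given the matrix, a held-out split reduces to filtering on the held-out axis. A minimal sketch, assuming each example record carries `template` and `fault_family` fields (the real schema may differ):

```python
def held_out_split(examples, held_out, axis="template"):
    """Split examples into an IID pool and an OOD eval set.

    `axis` is either "template" (template-held-out) or "fault_family"
    (fault-family-held-out); `held_out` is the set of held-out values.
    Field names are assumed, not taken from the benchmark schema.
    """
    iid = [e for e in examples if e[axis] not in held_out]
    ood = [e for e in examples if e[axis] in held_out]
    return iid, ood
```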


4. Metrics

Calibration and Classification Views

Primary metrics are proper scoring rules (NLL, Brier) plus a binned calibration error (ECE), with confusion-matrix diagnostics as a secondary view.

Primary

NLL, Brier, and ECE (lower is better).
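The primary metrics can be computed directly from forecast probabilities and oracle labels. A minimal sketch; the ECE here uses ten equal-width bins, which is one common convention and may differ from the benchmark's exact binning.

```python
import math

def nll(probs, labels, eps=1e-12):
    """Negative log-likelihood of the observed pass/fail outcomes."""
    return -sum(
        math.log(max(p if y else 1 - p, eps)) for p, y in zip(probs, labels)
    ) / len(probs)

def brier(probs, labels):
    """Mean squared error between forecast p_pass and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)   # mean forecast in bin
            acc = sum(y for _, y in b) / len(b)    # empirical pass rate
            err += len(b) / total * abs(acc - conf)
    return err
```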

Secondary

Thresholded confusion metrics at p_pass >= 0.5: accuracy, precision, recall, F1, TP/FP/TN/FN.
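The secondary view thresholds the same forecasts. A sketch of the confusion-matrix diagnostics at p_pass >= 0.5, where "positive" means predicted pass:

```python
def confusion_metrics(probs, labels, threshold=0.5):
    """Threshold p_pass and report confusion-matrix diagnostics."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for pr, y in zip(preds, labels) if pr and y)
    fp = sum(1 for pr, y in zip(preds, labels) if pr and not y)
    tn = sum(1 for pr, y in zip(preds, labels) if not pr and not y)
    fn = sum(1 for pr, y in zip(preds, labels) if not pr and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "accuracy": (tp + tn) / len(labels),
            "precision": precision, "recall": recall, "f1": f1}
```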


5. Reproducibility

Reference Commands

All public leaderboard runs should be reproducible from versioned artifacts and deterministic seeds.

python scripts/generate_benchmark.py --n 100 --seed 42 --out bench/bench_100_seed42.jsonl
python scripts/run_benchmark.py --bench bench/bench_100_seed42.jsonl --model openai/gpt-4o-mini --run-dir runs/example --resume
python scripts/build_website_payload.py --summary runs/example/summary.json --bench-file bench/bench_100_seed42.jsonl --seed 42

Recommended metadata: generator_version, template_version, fault_taxonomy_version, model_version.
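An illustrative metadata record: the field names follow the recommendation above, while the values are placeholders, not real version identifiers.

```python
# Illustrative run-metadata record; values are placeholders.
run_metadata = {
    "generator_version": "v1.0.0",          # placeholder version string
    "template_version": "v1",               # placeholder
    "fault_taxonomy_version": "fault_taxonomy_v1",
    "model_version": "openai/gpt-4o-mini",
    "seed": 42,
}
```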


6. Safety

Disclosure and Risk Policy