How AlignCast Generates and Evaluates Forecasting Examples
AlignCast measures whether models can output calibrated probabilities that an implementation will pass a provided deterministic test suite — without executing the code. The prediction target is oracle pass/fail on the provided tests, not semantic correctness independent of those tests. This page documents data generation, oracle labeling, split policy, and metrics.
1. Generation
Synthetic Example Construction
Each example includes a specification, an implementation, and deterministic tests. Implementations are mixed between correct and subtly incorrect variants.
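A minimal sketch of one example record may help fix ideas. The field names and the `clamp` problem below are illustrative, not the benchmark's actual schema:

```python
# Hypothetical AlignCast example record; field names are illustrative,
# not the benchmark's real schema.
example = {
    "spec": "clamp(x, lo, hi): return x limited to the inclusive range [lo, hi].",
    "implementation": (
        "def clamp(x, lo, hi):\n"
        "    if x < lo:\n"
        "        return lo\n"
        "    if x > hi:\n"   # correct variant; a boundary_swap fault might
        "        return hi\n"  # compare against lo here instead
        "    return x\n"
    ),
    "tests": [  # deterministic (expression, expected value) pairs
        ("clamp(5, 0, 10)", 5),
        ("clamp(-1, 0, 10)", 0),
        ("clamp(11, 0, 10)", 10),
    ],
    "label": "pass",  # assigned later by the oracle, never shown to the model
}
```

The model sees the spec, implementation, and tests, and must output a probability that the implementation passes; the `label` field exists only for scoring.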
Design Constraints
Python-only, deterministic tests, bounded runtime, no network, and no weaponizable payloads.
Diversity Controls
Randomized lexical surface forms, balanced example lengths, and near-miss failures. Fault-hint comments are stripped from all evaluation prompts so models must reason about the code rather than read the answer off a comment.
We publish high-level fault categories and version IDs alongside the full generator source.
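The hint-stripping step can be sketched as a simple source transform. The `# FAULT:` marker convention below is an assumption for illustration, not the generator's documented format:

```python
import re

# Hedged sketch: remove fault-hint comments (assumed to be tagged "FAULT:")
# before source reaches an evaluation prompt. A real implementation would
# tokenize rather than regex, to avoid touching '#' inside string literals.
FAULT_HINT = re.compile(r"[ \t]*#\s*FAULT:.*$", flags=re.MULTILINE)

def strip_fault_hints(source: str) -> str:
    return FAULT_HINT.sub("", source)

code = "def drop_first(xs):\n    return xs[1:]  # FAULT: off_by_one\n"
clean = strip_fault_hints(code)  # the code survives, the hint does not
```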
2. Oracle
Ground Truth Labeling
Ground truth is produced by sandboxed test execution. The forecasting model never sees runtime outcomes during inference.
- Per-example timeout and deterministic execution context.
- Recorded outputs: pass/fail label, failing-test count, runtime.
- Network is disabled during oracle execution.
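The oracle's recorded outputs can be illustrated with a minimal runner. This sketch omits the sandboxing and network isolation the real harness provides, and uses a simple wall-clock budget as a stand-in for per-example timeouts:

```python
import time

# Minimal oracle sketch: run each test callable, record the pass/fail
# label, the failing-test count, and wall-clock runtime. Sandboxing and
# hard per-test timeouts are assumed to happen outside this function.
def run_oracle(tests, timeout_s=5.0):
    start = time.perf_counter()
    failures = 0
    for test in tests:
        if time.perf_counter() - start > timeout_s:
            failures += 1  # budget exhausted: remaining tests count as failed
            continue
        try:
            test()
        except AssertionError:
            failures += 1
    return {
        "label": "pass" if failures == 0 else "fail",
        "failing_tests": failures,
        "runtime_s": time.perf_counter() - start,
    }
```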
3. Splits
IID and OOD Evaluation Policy
Row-random splits alone are insufficient: they cannot distinguish memorized templates from generalizable judgment. AlignCast therefore evaluates in-distribution and out-of-distribution behavior separately.
- IID: standard held-out split from seen templates and fault families.
- Template-held-out OOD: hold out all rows from a given template. The model has never seen that problem, but has seen all its fault families in other templates.
- Fault-family-held-out OOD: hold out all rows with a given fault family. The model has never seen that bug type, but has seen all its templates with different bugs.
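The three policies above can be sketched as a partition over rows. The `template` and `fault_family` keys are an assumed row schema for illustration:

```python
import random

# Sketch of the IID / template-held-out / fault-family-held-out partition.
# Rows are assumed to carry "template" and "fault_family" keys.
def make_splits(rows, heldout_template, heldout_family, seed=0, iid_frac=0.2):
    rng = random.Random(seed)
    template_ood = [r for r in rows if r["template"] == heldout_template]
    family_ood = [r for r in rows if r["fault_family"] == heldout_family]
    rest = [r for r in rows
            if r["template"] != heldout_template
            and r["fault_family"] != heldout_family]
    rng.shuffle(rest)
    cut = int(len(rest) * iid_frac)  # IID test comes from seen distributions
    return {"train": rest[cut:], "iid_test": rest[:cut],
            "template_ood": template_ood, "family_ood": family_ood}
```

Note that rows matching both held-out criteria land in both OOD sets; a production splitter would decide a priority order.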
The benchmark uses a sparse matrix of 22 applicable (template, fault family) pairs across 10 templates and 9 fault families. The generator cycles through all 22 pairs deterministically, guaranteeing each pair a proportional share of examples before the dataset is shuffled. Every fault family appears in at least 2 templates, which keeps template-held-out and fault-family-held-out OOD evaluation separable.
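The deterministic cycling can be sketched with `itertools.cycle`. The pair list below is a small stand-in, not the benchmark's real 22-pair matrix:

```python
from itertools import cycle, islice

# Stand-in for the applicable (template, fault_family) pairs.
PAIRS = [("clamp", "boundary_swap"), ("clamp", "off_by_one"),
         ("median3", "missing_tie_check"), ("median3", "wrong_selection")]

def assign_pairs(n_examples):
    # Round-robin over the fixed pair order: per-pair counts differ by at
    # most one, so proportions hold before any shuffling.
    return list(islice(cycle(PAIRS), n_examples))
```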
The matrix below shows which pairs are implemented. A cell is filled only where the fault type is semantically natural for that problem — forced pairings are excluded to preserve example quality.
Fault families group into four broad categories: Structural, Semantic, Boundary, and Type / Contract.

| | off_by_one | predicate_inversion | wrong_selection | missing_dedup | comparison_weakening | boundary_swap | case_handling | missing_tie_check | silent_error_masking | n |
|---|---|---|---|---|---|---|---|---|---|---|
| sum_even | | | | | | | | | | 2 |
| clamp | | | | | | | | | | 2 |
| rotate_left | | | | | | | | | | 2 |
| median3 | | | | | | | | | | 2 |
| count_vowels | | | | | | | | | | 2 |
| unique_sorted | | | | | | | | | | 2 |
| find_first_gt | | | | | | | | | | 3 |
| majority_vote | | | | | | | | | | 3 |
| parse_int_list | | | | | | | | | | 2 |
| second_largest | | | | | | | | | | 2 |
| templates per family | 4 | 3 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | |
4. Metrics
Calibration and Classification Views
Primary metrics are proper scoring-rule calibration metrics, with confusion-matrix diagnostics as a secondary view.
Primary
Negative log-likelihood (NLL), Brier score, and expected calibration error (ECE); lower is better for all three.
Secondary
Thresholded confusion metrics at p_pass >= 0.5: accuracy, precision, recall, F1, TP/FP/TN/FN.
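For concreteness, the three primary metrics can be computed from scratch. This is a reference sketch, not the benchmark's scoring code; the equal-width 10-bin ECE is one common convention among several:

```python
import math

# probs: predicted p_pass per example; labels: oracle outcomes as 0/1.
def nll(probs, labels):
    eps = 1e-12  # clip to avoid log(0)
    return -sum(math.log(max(p if y else 1.0 - p, eps))
                for p, y in zip(probs, labels)) / len(probs)

def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def ece(probs, labels, n_bins=10):
    # Equal-width bins over [0, 1]; ECE is the weighted mean gap between
    # average confidence and empirical accuracy within each bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, err = len(probs), 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(conf - acc)
    return err
```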
5. Reproducibility
Reference Commands
All public leaderboard runs should be reproducible from versioned artifacts and deterministic seeds.
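The determinism requirement is easy to spot-check in code. `generate_ids` below is a stand-in for the benchmark's real generator, used only to show the property being asserted:

```python
import random

# Determinism check sketch: the same seed must reproduce the same
# generation order, so versioned artifacts can be re-derived exactly.
# generate_ids is a hypothetical stand-in for the real generator.
def generate_ids(seed, n=5):
    rng = random.Random(seed)  # isolated RNG; never use the global state
    return [rng.randrange(10_000) for _ in range(n)]

assert generate_ids(42) == generate_ids(42)  # same seed, same dataset
```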
6. Safety
Disclosure and Risk Policy
- No real exploit payloads or deployment-ready attack code in benchmark content.
- Public release prioritizes methodological clarity over exploit detail.
- Sensitive deceptive-pattern specifics can be released in stages.