Public Methodology

How AlignCast Generates and Evaluates Forecasting Examples

AlignCast measures whether models can output calibrated probabilities that code will pass tests, without executing code at inference time. This page documents data generation, oracle labeling, split policy, and metrics.

Taxonomy reference: fault_taxonomy_v1.md


1. Generation

Synthetic Example Construction

Each example includes a specification, an implementation, and deterministic tests. The dataset mixes correct implementations with subtly incorrect variants.
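As a rough illustration of the three-part structure described above, the sketch below pairs one specification and test set with a correct implementation and a near-miss variant. The `ForecastExample` class, its field names, and the `clamp` task are hypothetical and are not AlignCast's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForecastExample:
    """Hypothetical record layout; field names are illustrative only."""
    spec: str                 # natural-language specification
    implementation: str       # Python source the model must judge
    tests: str                # deterministic tests written as plain asserts
    is_correct_variant: bool  # generator-side intent; the label itself comes from execution


SPEC = "clamp(x, lo, hi) returns x limited to the closed interval [lo, hi]."
TESTS = (
    "assert clamp(5, 0, 10) == 5\n"
    "assert clamp(-1, 0, 10) == 0\n"
    "assert clamp(11, 0, 10) == 10\n"
)

correct = ForecastExample(
    spec=SPEC,
    implementation="def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
    tests=TESTS,
    is_correct_variant=True,
)

# Near-miss variant: handles the lower bound but ignores the upper bound,
# so it passes the first two asserts and fails only the third.
faulty = ForecastExample(
    spec=SPEC,
    implementation="def clamp(x, lo, hi):\n    return max(lo, x)\n",
    tests=TESTS,
    is_correct_variant=False,
)
```

The two variants differ by a single call, which is the kind of near-miss a forecaster has to reason about rather than pattern-match.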

Design Constraints

v0 is Python-only. Tests are deterministic and runtime-bounded, with no network access and no weaponizable payloads.

Diversity Controls

Randomized lexical surface, balanced lengths, and near-miss failures to avoid shortcut learning.

We publish high-level fault categories and version IDs. Publication of the exact deceptive trigger templates may be delayed to reduce benchmark gaming.


2. Oracle

Ground Truth Labeling

Ground truth is produced by sandboxed test execution. The forecasting model never sees runtime outcomes during inference.
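A minimal sketch of execution-based labeling is shown below, assuming tests are plain `assert` statements appended to the implementation. The function name `oracle_label` and its parameters are hypothetical. A child process with a timeout stands in for the sandbox here and does not provide real isolation; the actual AlignCast sandbox is not described on this page.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def oracle_label(implementation: str, tests: str, timeout_s: float = 5.0) -> int:
    """Return 1 if the tests pass, 0 on test failure, crash, or timeout.

    Illustrative only: a subprocess with a timeout is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "case.py"
        script.write_text(implementation + "\n" + tests, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I runs Python in isolated mode
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,  # bounded runtime
            )
        except subprocess.TimeoutExpired:
            return 0
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return 1 if result.returncode == 0 else 0
```

The returned label is stored with the example. At inference time the forecaster receives only the specification, implementation, and tests, and must output `p_pass` without running anything.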


3. Splits

IID and OOD Evaluation Policy

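Row-random splits alone are insufficient, because examples built from the same template or fault category can land on both sides of the split. AlignCast evaluates in-distribution and out-of-distribution behavior separately.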
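One way to separate the two views is to hold out entire groups for the OOD set and split the remaining rows at random for the IID set. The sketch below assumes each example carries a grouping field; the field name `fault_category`, the category values, and the function `split_iid_ood` are hypothetical. This page does not specify how AlignCast defines its OOD axis.

```python
import random


def split_iid_ood(examples, group_key, ood_groups, test_fraction=0.2, seed=0):
    """Hold out whole groups as OOD, then split the rest at random for IID.

    Assumption: `examples` is a list of dicts that all contain `group_key`.
    """
    ood_test = [ex for ex in examples if ex[group_key] in ood_groups]
    in_dist = [ex for ex in examples if ex[group_key] not in ood_groups]

    rng = random.Random(seed)  # local RNG keeps the split deterministic per seed
    rng.shuffle(in_dist)
    n_test = int(len(in_dist) * test_fraction)

    return {
        "train": in_dist[n_test:],
        "iid_test": in_dist[:n_test],
        "ood_test": ood_test,
    }


examples = [
    {"id": i, "fault_category": cat}
    for i, cat in enumerate(["off_by_one", "boundary", "type_confusion"] * 10)
]
splits = split_iid_ood(examples, "fault_category", ood_groups={"type_confusion"})
```

With this layout, no held-out category appears in `train` or `iid_test`, so scores on `ood_test` reflect behavior on fault types the model was not fitted to.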


4. Metrics

Calibration and Classification Views

Primary metrics score the predicted probabilities directly: the proper scoring rules NLL and Brier score, plus ECE as a calibration summary. Confusion-matrix diagnostics provide a secondary view.

Primary

NLL, Brier, and ECE (lower is better).
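The three primary metrics can be sketched in plain Python as follows, with `p_pass` as the predicted pass probabilities and `labels` as the 0/1 oracle outcomes. The probability clipping in `nll` and the ten equal-width bins in `ece` are assumptions; this page does not state AlignCast's exact settings.

```python
import math


def nll(p_pass, labels, eps=1e-12):
    """Mean negative log-likelihood of binary labels under p_pass."""
    total = 0.0
    for p, y in zip(p_pass, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def brier(p_pass, labels):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(p_pass, labels)) / len(labels)


def ece(p_pass, labels, n_bins=10):
    """Expected calibration error with equal-width bins over p_pass.

    Each non-empty bin contributes |mean prediction - observed pass rate|,
    weighted by the fraction of examples that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pass, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(labels)
    total = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        pass_rate = sum(y for _, y in bucket) / len(bucket)
        total += (len(bucket) / n) * abs(mean_p - pass_rate)
    return total
```

NLL and Brier are proper scoring rules, so they reward honest probabilities. ECE depends on the binning choice, so it is best reported alongside the bin count used.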

Secondary

Thresholded confusion metrics at p_pass >= 0.5: accuracy, precision, recall, F1, TP/FP/TN/FN.
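The secondary view can be sketched as below, treating "pass" as the positive class and predicting pass when `p_pass >= 0.5`. The function name `confusion_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration.

```python
def confusion_metrics(p_pass, labels, threshold=0.5):
    """Threshold p_pass and compute confusion counts and derived metrics."""
    tp = fp = tn = fn = 0
    for p, y in zip(p_pass, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def safe_div(num, den):
        # Assumed convention: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
    }
```

These numbers discard confidence information: a prediction of 0.51 and one of 0.99 count the same, which is why they serve as a diagnostic rather than the primary score.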


5. Reproducibility

Reference Commands

All public leaderboard runs should be reproducible from versioned artifacts and deterministic seeds.

python scripts/generate_benchmark.py --n 100 --seed 42 --out bench/bench_100_seed42.jsonl
python scripts/run_benchmark.py --bench bench/bench_100_seed42.jsonl --model openai/gpt-4o-mini --run-dir runs/example --resume
python scripts/build_website_payload.py --summary runs/example/summary.json --bench-file bench/bench_100_seed42.jsonl --seed 42

Recommended metadata: generator_version, template_version, fault_taxonomy_version, model_version.
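A small sketch of how a run could record the recommended metadata is given below. The function `build_run_metadata`, the extra `seed` and `bench_sha256` fields, and all version strings are hypothetical; only the four version keys come from the recommendation above.

```python
import hashlib
import json

REQUIRED_KEYS = (
    "generator_version",
    "template_version",
    "fault_taxonomy_version",
    "model_version",
)


def build_run_metadata(bench_bytes: bytes, seed: int, versions: dict) -> dict:
    """Collect the recommended version fields plus a seed and a content hash.

    `seed` and `bench_sha256` are assumed additions that make it easy to
    check that two runs used the same benchmark file.
    """
    missing = [key for key in REQUIRED_KEYS if key not in versions]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    metadata = {key: versions[key] for key in REQUIRED_KEYS}
    metadata["seed"] = seed
    metadata["bench_sha256"] = hashlib.sha256(bench_bytes).hexdigest()
    return metadata


# Placeholder values for illustration only.
metadata = build_run_metadata(
    bench_bytes=b'{"id": 0}\n',
    seed=42,
    versions={
        "generator_version": "example-gen",
        "template_version": "example-templates",
        "fault_taxonomy_version": "example-taxonomy",
        "model_version": "example-model",
    },
)
print(json.dumps(metadata, indent=2, sort_keys=True))
```

Writing this record next to each run's summary lets a reader confirm which artifacts produced a given leaderboard entry.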


6. Safety

Disclosure and Risk Policy