How-to: Submit a leaderboard entry
Phase-0 placeholder
The populated leaderboard lands with M5 once the Stage-1 (AS + OM) spike rows are in. This page documents the submission protocol so that external contributors can prepare entries against the same contract used by the in-tree spikes.
Leaderboard entries follow the same preregistration discipline as every
other spike in the project: the hypothesis, threshold, and comparison
conditions are committed before the run starts, the YAML is tagged in
git, and the resulting rows are rendered into the leaderboard table by
chamber-render-tables. The pre-registration YAML schema, the result
archive schema, and the bootstrap / HRS pipeline live in
chamber.evaluation:
- results.py defines the Pydantic models for SpikeRun, EpisodeResult, and LeaderboardEntry.
- prereg.py validates the YAML and verifies the git-tag SHA per ADR-007 §Discipline.
- bootstrap.py ships the cluster and paired-cluster bootstrap used for the ≥20 pp gap test (reviewer P1-9).
- hrs.py computes the per-axis HRS vector and the headline scalar per ADR-008 §Decision.
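To make the gap test concrete, here is a minimal sketch of a paired-cluster bootstrap, written against the standard library only. It is not the code in chamber.evaluation.bootstrap: the function name, the input layout (a mapping from seed to per-episode success flags), the homogeneous-minus-heterogeneous sign convention, and the percentile interval are all assumptions made for illustration. The idea is that seeds are the clusters, and each resample draws seeds with replacement and reuses the same draw for both conditions so that the pairing is preserved.

```python
import random
from statistics import mean


def paired_cluster_bootstrap_gap(homo, hetero, n_resamples=2000, seed=0, alpha=0.05):
    """Return (point, lo, hi) for the success-rate gap in percentage points.

    homo / hetero: dict mapping seed id -> list of per-episode successes (0.0 or 1.0).
    Hypothetical helper; the gap is assumed to be homogeneous minus heterogeneous.
    """
    seeds = sorted(set(homo) & set(hetero))
    if not seeds:
        raise ValueError("the two conditions share no seeds, so they cannot be paired")

    def gap(drawn):
        # Mean of per-seed means, so every seed (cluster) carries equal weight.
        a = mean(mean(homo[s]) for s in drawn)
        b = mean(mean(hetero[s]) for s in drawn)
        return 100.0 * (a - b)

    rng = random.Random(seed)  # fixed seed makes the resampling reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample whole seeds with replacement; the same draw is used for
        # both conditions, which is what makes the bootstrap "paired".
        drawn = [rng.choice(seeds) for _ in seeds]
        gaps.append(gap(drawn))

    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return gap(seeds), lo, hi
```

A caller could then compare the lower bound against the preregistered threshold (for example, 20 pp). Whether the project's test uses the lower bound or the point estimate is not stated on this page, so treat that comparison as an assumption too.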
Prerequisites
Install CHAMBER with the eval optional extra so the renderer can
emit the rliable-style performance-profile column alongside the
native IQM and optimality-gap columns:
uv sync --extra eval
The eval extra pulls in rliable (Agarwal et al. 2021). IQM and
optimality gap are computed natively in
chamber.evaluation.bootstrap and do
not require the extra; only performance profiles do. Without the
extra, chamber-render-tables still produces a valid leaderboard
row — the performance-profile column is emitted as None and a
RuntimeWarning points back here.
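The fallback described above follows a common optional-dependency pattern. The sketch below shows one way it could look; it is not the renderer's actual code. The function name, the compute callback, and the warning text are hypothetical, and only importlib.util.find_spec and warnings.warn are real standard-library calls.

```python
import importlib.util
import warnings


def performance_profile_or_none(compute, backend="rliable"):
    """Return compute() if the optional backend is importable, else None.

    Hypothetical helper: `compute` stands in for whatever builds the
    performance-profile column once the backend is available.
    """
    if importlib.util.find_spec(backend) is None:
        # Degrade gracefully: the row stays valid, the column is simply None.
        warnings.warn(
            f"{backend} is not installed; the performance-profile column is "
            "emitted as None. Install it with `uv sync --extra eval`.",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
    return compute()
```

Checking for the module with find_spec, instead of importing it eagerly, keeps the IQM and optimality-gap paths free of any import cost from the optional extra.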
Protocol
1. Copy the nearest existing pre-registration YAML from spikes/preregistration/ as a template for your method's entry. The YAML schema is validated by chamber.evaluation.prereg.load_prereg; the required fields are axis, condition_pair (homogeneous / heterogeneous ids), seeds, episodes_per_seed, estimator, bootstrap_method (defaults to cluster), failure_policy, and git_tag.
2. Edit the hypothesis, threshold, comparison conditions, and the method: name that will appear in the leaderboard row. Set bootstrap_method: cluster unless you have a written reason to use hierarchical (alias) or iid (power-calc only; not admitted to the leaderboard).
3. Commit the YAML and create a signed git tag of the form leaderboard-<method>-<stage>-<date>. Editing the YAML after the tag exists is a project anti-pattern: chamber.evaluation.prereg.verify_git_tag refuses any submission whose on-disk blob SHA disagrees with the SHA stored at the tag, so re-tag with a new YAML instead.
4. Run the spike via chamber-spike run --axis <axis> against the tagged YAML. See Run a spike with a custom hypothesis for the end-to-end flow, including the M2 comm-degradation surface that the Stage-2 CM rows consume.
5. Compose the leaderboard entry with chamber-eval. The HRS bundle per ADR-008 §Decision covers the surviving ADR-007 axes, so pass one spike-run archive per surviving axis in a single invocation; the CLI builds the full HRS vector + scalar over the union (reviewer P1-3). Example:

       chamber-eval stage1_as.json stage1_om.json stage2_cm.json \
           --method-id concerto --output entry.json

   Passing a single spike-run archive is still supported, but the rendered row is tagged [PARTIAL: <axis>] so a one-axis result is never mistaken for a complete HRS-bundle row. If your run legitimately covers the same axis twice (e.g. two AS spikes with different control rates), pass --allow-duplicate-axes and the axis name is suffixed with the spike_id in the rendered output for disambiguation; without the flag the CLI exits with status 2. The pipeline (cluster bootstrap → paired-cluster gap test → HRS vector → HRS scalar) uses concerto.training.seeding.derive_substream for deterministic resampling; identical inputs and seed produce byte-identical outputs.
6. Render the headline tables with chamber-render-tables --leaderboard entry.json and (if your spike emits the ADR-014 three-table safety report) chamber-render-tables --safety-report three_tables.json [--fmt latex].
7. Open a PR that adds the tagged YAML and the result archive under spikes/results/. The CI gate re-renders the leaderboard table from the tagged result archives; no hand-edit of the README is required.
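Before tagging, it can help to run a local preflight over the checks that steps 1 to 3 describe. The sketch below is an illustration only and does not call load_prereg or verify_git_tag. The helper names are hypothetical, the field list is copied from step 1, bootstrap_method is treated as optional because it defaults to cluster, and the blob hash assumes a repository using Git's default SHA-1 object format.

```python
import hashlib

# Field names taken from step 1; bootstrap_method is omitted here because
# the page says it defaults to "cluster".
REQUIRED_FIELDS = (
    "axis",
    "condition_pair",
    "seeds",
    "episodes_per_seed",
    "estimator",
    "failure_policy",
    "git_tag",
)
# "iid" is power-calc only and is not admitted to the leaderboard.
LEADERBOARD_BOOTSTRAP_METHODS = {"cluster", "hierarchical"}


def check_prereg(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    method = doc.get("bootstrap_method", "cluster")
    if method not in LEADERBOARD_BOOTSTRAP_METHODS:
        problems.append(f"bootstrap_method {method!r} is not admitted to the leaderboard")
    return problems


def git_blob_sha1(data: bytes) -> str:
    """Hash bytes the way `git hash-object` does for a blob (SHA-1 repos)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def matches_tagged_blob(yaml_bytes: bytes, tagged_sha: str) -> bool:
    """True if the on-disk YAML still has the blob SHA recorded at the tag."""
    return git_blob_sha1(yaml_bytes) == tagged_sha.strip().lower()
```

In use, yaml_bytes would be the raw contents of the preregistration file, and tagged_sha the blob id reported by Git for that path at the tag. A mismatch means the file was edited after tagging, which is the case step 3 says to resolve by re-tagging with a new YAML.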
External contributors who do not have write access to the upstream repository can attach the signed result archive to a PR as a release asset and reference it from the preregistration YAML.
The HRS vector is emitted alongside the scalar on every entry per ADR-008 §Decision (reviewer P1-8); the renderer refuses entries that carry only the scalar, so consumers can always recompute the headline under a different weighting without re-running the spikes.
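As an illustration of why the vector matters, the sketch below recomputes a headline scalar from a per-axis vector under caller-supplied weights. The aggregation used here, a normalized weighted mean, is an assumption: the actual scalar is defined in ADR-008 §Decision and is not reproduced on this page. The function name, the example axis keys, and the example values are hypothetical as well.

```python
from typing import Dict, Optional


def reweighted_scalar(vector: Dict[str, float],
                      weights: Optional[Dict[str, float]] = None) -> float:
    """Collapse a per-axis vector to one number with a weighted mean.

    Assumed aggregation for illustration; not the ADR-008 definition.
    """
    if not vector:
        raise ValueError("the per-axis vector is empty")
    if weights is None:
        # Default to uniform weights over the axes present in the vector.
        weights = {axis: 1.0 for axis in vector}
    missing = sorted(set(vector) - set(weights))
    if missing:
        raise ValueError(f"no weight given for axes: {missing}")
    total = sum(weights[axis] for axis in vector)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[axis] * vector[axis] for axis in vector) / total


# Example with made-up values: the same vector under two weightings.
example = {"AS": 0.5, "OM": 1.0}
uniform = reweighted_scalar(example)
as_heavy = reweighted_scalar(example, {"AS": 3.0, "OM": 1.0})
```

Because the entry carries the full vector, a consumer can run this kind of recomputation offline against entry.json without access to the original spike archives.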