
How-to: Submit a leaderboard entry

Phase-0 placeholder

The populated leaderboard lands with M5 once the Stage-1 (AS + OM) spike rows are in. This page documents the submission protocol so that external contributors can prepare entries against the same contract used by the in-tree spikes.

Leaderboard entries follow the same preregistration discipline as every other spike in the project: the hypothesis, threshold, and comparison conditions are committed before the run starts, the YAML is tagged in git, and the resulting rows are rendered into the leaderboard table by chamber-render-tables.

The pre-registration YAML schema, the result archive schema, and the bootstrap / HRS pipeline live in chamber.evaluation: results.py defines the Pydantic models for SpikeRun / EpisodeResult / LeaderboardEntry; prereg.py validates the YAML and verifies the git-tag SHA per ADR-007 §Discipline; bootstrap.py ships the cluster + paired-cluster bootstrap used for the ≥20 pp gap test (reviewer P1-9); hrs.py computes the per-axis HRS vector and the headline scalar per ADR-008 §Decision.
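For readers who have not met the paired-cluster bootstrap before, the idea can be sketched in a few lines. This is an illustration only, not the code in chamber.evaluation.bootstrap: the function name, the choice of seeds as clusters, the percentile interval, and the rule of comparing the lower bound against 0.20 are all assumptions made for the example, and the episode outcomes are toy values.

```python
import random
from statistics import mean


def paired_cluster_gap_ci(homogeneous, heterogeneous, n_resamples=2000,
                          alpha=0.05, rng_seed=0):
    """Percentile CI for mean(homogeneous) - mean(heterogeneous).

    Both arguments map a seed id to that seed's per-episode outcomes (0 or 1).
    Whole seeds are resampled with replacement, and the same draw is applied
    to both conditions, which is what makes the bootstrap paired.
    """
    seeds = sorted(set(homogeneous) & set(heterogeneous))
    if not seeds:
        raise ValueError("the two conditions share no seeds")

    def gap(draw):
        # Pool every episode of the drawn seeds, then difference the means.
        homo = [x for s in draw for x in homogeneous[s]]
        hetero = [x for s in draw for x in heterogeneous[s]]
        return mean(homo) - mean(hetero)

    rng = random.Random(rng_seed)  # fixed seed, so reruns are reproducible
    gaps = sorted(gap([rng.choice(seeds) for _ in seeds])
                  for _ in range(n_resamples))
    lo = gaps[int(alpha / 2 * n_resamples)]
    hi = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return gap(seeds), lo, hi


# Toy data: three seeds, four episodes each (hypothetical, not project results).
homogeneous = {0: [1, 1, 1, 1], 1: [1, 1, 1, 0], 2: [1, 1, 1, 1]}
heterogeneous = {0: [0, 0, 1, 0], 1: [0, 1, 0, 0], 2: [0, 0, 0, 0]}
point, lo, hi = paired_cluster_gap_ci(homogeneous, heterogeneous)

# Assumed decision rule: the gap clears 20 pp if the CI lower bound does.
clears_threshold = lo >= 0.20
```

Resampling whole seeds rather than individual episodes keeps the within-seed correlation intact, which is the reason a cluster scheme is preferred over an iid one for this kind of test.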

Prerequisites

Install CHAMBER with the eval optional extra so the renderer can emit the rliable-style performance-profile column alongside the native IQM and optimality-gap columns:

uv sync --extra eval

The eval extra pulls in rliable (Agarwal et al. 2021). IQM and optimality gap are computed natively in chamber.evaluation.bootstrap and do not require the extra; only performance profiles do. Without the extra, chamber-render-tables still produces a valid leaderboard row — the performance-profile column is emitted as None and a RuntimeWarning points back here.
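As a reminder of what the two native columns measure, here is a minimal sketch of the interquartile mean and the optimality gap as defined by Agarwal et al. (2021). It is not the chamber.evaluation.bootstrap implementation; the function names and the default target score gamma = 1.0 are assumptions for illustration.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top quarters dropped)."""
    xs = sorted(scores)
    if not xs:
        raise ValueError("scores must be non-empty")
    cut = int(0.25 * len(xs))  # number of values trimmed from each end
    kept = xs[cut:len(xs) - cut]
    return sum(kept) / len(kept)


def optimality_gap(scores, gamma=1.0):
    """Average shortfall below the target score gamma.

    Scores above gamma are clipped to gamma, so exceeding the target
    cannot offset a shortfall elsewhere.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(gamma - min(s, gamma) for s in scores) / len(scores)
```

Both statistics need only the per-run scores, which is consistent with them being available without the eval extra.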

Protocol

  1. Copy the nearest existing pre-registration YAML from spikes/preregistration/ as a template for your method's entry. The YAML schema is validated by chamber.evaluation.prereg.load_prereg; the required fields are axis, condition_pair (homogeneous / heterogeneous ids), seeds, episodes_per_seed, estimator, bootstrap_method (defaults to cluster), failure_policy, and git_tag.
  2. Edit the hypothesis, threshold, comparison conditions, and the method: name that will appear in the leaderboard row. Set bootstrap_method: cluster unless you have a written reason to use hierarchical (alias) or iid (power-calc only — not admitted to the leaderboard).
  3. Commit the YAML and create a signed git tag of the form leaderboard-<method>-<stage>-<date>. Editing the YAML after the tag exists is a project anti-pattern: chamber.evaluation.prereg.verify_git_tag refuses any submission whose on-disk blob SHA disagrees with the SHA stored at the tag, so commit a new YAML and create a new tag instead.
  4. Run the spike via chamber-spike run --axis <axis> against the tagged YAML. See Run a spike with a custom hypothesis for the end-to-end flow, including the M2 comm-degradation surface that the Stage-2 CM rows consume.
  5. Compose the leaderboard entry with chamber-eval. The HRS bundle per ADR-008 §Decision covers the surviving ADR-007 axes, so pass one spike-run archive per surviving axis in a single invocation — the CLI builds the full HRS vector + scalar over the union (reviewer P1-3). Example:

    chamber-eval stage1_as.json stage1_om.json stage2_cm.json \
      --method-id concerto --output entry.json
    

    Passing a single spike-run archive is still supported, but the rendered row is tagged [PARTIAL: <axis>] so a one-axis result is never mistaken for a complete HRS-bundle row. If your run legitimately covers the same axis twice (e.g. two AS spikes with different control rates), pass --allow-duplicate-axes and the axis name is suffixed with the spike_id in the rendered output for disambiguation; without the flag the CLI exits with status 2. The pipeline (cluster bootstrap → paired-cluster gap test → HRS vector → HRS scalar) uses concerto.training.seeding.derive_substream for deterministic resampling; identical inputs and seed produce byte-identical outputs.
  6. Render the headline tables with chamber-render-tables --leaderboard entry.json and (if your spike emits the ADR-014 three-table safety report) chamber-render-tables --safety-report three_tables.json [--fmt latex].
  7. Open a PR that adds the tagged YAML and the result archive under spikes/results/. The CI gate re-renders the leaderboard table from the tagged result archives; no hand-edit of the README is required.
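Before running the spike in step 4, it can be useful to check locally that the YAML on disk still matches the blob recorded at the tag from step 3. The sketch below is not verify_git_tag itself, whose implementation is not reproduced on this page; the helper names are hypothetical. It assumes a repository using git's default SHA-1 object format, a path given relative to the repository root, and no line-ending or filter conversion on the file.

```python
import hashlib
import subprocess
from pathlib import Path


def git_blob_sha(data: bytes) -> str:
    """SHA-1 that git assigns to a blob with these contents."""
    header = b"blob %d\0" % len(data)  # git hashes "blob <size>\0" + contents
    return hashlib.sha1(header + data).hexdigest()


def yaml_matches_tag(yaml_path: str, tag: str) -> bool:
    """Compare the on-disk file against the blob stored at <tag>:<yaml_path>."""
    tagged_sha = subprocess.run(
        ["git", "rev-parse", f"{tag}:{yaml_path}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return git_blob_sha(Path(yaml_path).read_bytes()) == tagged_sha
```

A False result means the file was edited after tagging, which is the situation the protocol asks you to resolve with a new YAML and a new tag rather than by moving the existing tag.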

External contributors who do not have write access to the upstream repository can attach the signed result archive to a PR as a release asset and reference it from the preregistration YAML.

The HRS vector is emitted alongside the scalar on every entry per ADR-008 §Decision (reviewer P1-8); the renderer refuses entries that carry only the scalar, so consumers can always recompute the headline under a different weighting without re-running the spikes.
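To illustrate why keeping the vector matters, the sketch below recomputes a headline number from a per-axis vector under two weightings. The aggregation rule in ADR-008 is not reproduced on this page, so the normalised weighted mean used here is an assumption, as are the function name and the axis values, which are toy numbers rather than results.

```python
def reweighted_scalar(vector, weights):
    """Normalised weighted mean of per-axis scores (assumed aggregation rule)."""
    missing = set(vector) - set(weights)
    if missing:
        raise KeyError(f"no weight given for axes: {sorted(missing)}")
    total = sum(weights[axis] for axis in vector)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(vector[axis] * weights[axis] for axis in vector) / total


# Hypothetical per-axis vector; not taken from any real entry.
vector = {"AS": 0.8, "OM": 0.6, "CM": 0.4}

equal = reweighted_scalar(vector, {"AS": 1, "OM": 1, "CM": 1})
as_heavy = reweighted_scalar(vector, {"AS": 2, "OM": 1, "CM": 1})
```

With only the scalar stored, the second number could not be obtained without re-running the spikes.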