Evaluation

CHAMBER is committed to reproducibility and statistical discipline from the first leaderboard entry. This page records the contract: how many seeds, which aggregate metrics, which guard rails. The rationale anchors are ADR-007 (≥20pp gap rule for axis admission, as amended by ADR-026 — see the note below), ADR-009 (partner-zoo stratum sizing), and ADR-014 (safety-reporting tables). The peer-reviewed evidence is on the literature page §5.

Amended by ADR-026 (2026-06-15)

The ≥20pp gap is necessary but not sufficient. Under ADR-026 (coupling-validity criterion), an axis is admitted, or promoted to Validated, only on a task that meets the coupling-validity criterion: the manipulated heterogeneity must be coupled to the outcome through the cooperation the task demands, as demonstrated by a pre-registered coupling positive-control. The Stage-1 action-space and observation-modality axes, as operationalized, are not coupling-valid; their coupling-valid re-operationalization is a Phase-2 co-carry design.

1. Seeds and reporting

CHAMBER leaderboard entries report multi-seed runs with 95% cluster bootstrap confidence intervals on every published metric. Episodes within a seed are correlated (same partner roll-out, same env-reset stream), so a pooled iid bootstrap understates the CI; the implementation in src/chamber/evaluation/bootstrap.py resamples seeds (the cluster level) with replacement, then resamples episodes within each resampled seed. The minimum seed counts are:

Run class	Minimum seeds	Source
Stage-1 / Stage-2 axis spike (Phase 0)	5	ADR-007 §Implementation staging
Stage-3 axis spike (Phase 0)	5	ADR-007 §Implementation staging
Phase-1 leaderboard entry	16	ADR-009 §partner-zoo stratum sizing

Each metric (episode success rate, violation rate, conformal λ mean, inter-robot-collision rate, force-limit violation rate) is reported as a point estimate with a 95% cluster-bootstrap CI computed across the seed budget above. Submissions that report fewer seeds, omit the CI, or use a pooled iid bootstrap on episode-level data are not admitted to the leaderboard.
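The two-stage resampling described in this section can be sketched as follows. This is a minimal, dependency-free illustration, not the code in src/chamber/evaluation/bootstrap.py: the function name `cluster_bootstrap_ci`, its parameters, and the percentile-interval choice are assumptions made for the example.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(episodes_by_seed, n_resamples=2000, alpha=0.05, rng=None):
    """Two-stage percentile bootstrap for the mean of a per-episode metric.

    episodes_by_seed maps seed -> list of per-episode values (e.g. 0/1 success).
    Returns (point_estimate, ci_low, ci_high). Hypothetical helper for illustration.
    """
    rng = rng or random.Random(0)
    seeds = sorted(episodes_by_seed)
    stats = []
    for _ in range(n_resamples):
        pooled = []
        # Stage 1: resample seeds (the cluster level) with replacement.
        for seed in rng.choices(seeds, k=len(seeds)):
            episodes = episodes_by_seed[seed]
            # Stage 2: resample episodes within the drawn seed.
            pooled.extend(rng.choices(episodes, k=len(episodes)))
        stats.append(mean(pooled))
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = mean(v for eps in episodes_by_seed.values() for v in eps)
    return point, low, high
```

Because whole seeds are drawn first, seed-to-seed variation enters every resample; a pooled iid bootstrap over the same episodes would ignore it and typically produce a narrower interval.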

1.1 Homogeneous-vs-heterogeneous pairing

The ≥20pp gap test from ADR-007 §Validation criteria is computed on matched pairs, not on pooled means. chamber.evaluation.bootstrap.pacluster_bootstrap takes an iterable of paired episodes — homogeneous and heterogeneous rollouts sharing (seed, episode_idx, initial_state_seed) — and resamples first at the seed (cluster) level, then within each resampled seed at the paired-episode level. Pairing by initial state seed plus partner seed removes the dominant source of cross-condition variance (different initial configurations being rolled out for homogeneous vs heterogeneous): the gap statistic is the within-pair delta, not the cross-pool mean difference.
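As a rough sketch of the pairing step only, the snippet below joins homogeneous and heterogeneous outcomes on the shared key and keeps the within-pair delta grouped by seed. The helper names, the dict-based input layout, and the sign convention (homogeneous minus heterogeneous) are assumptions for illustration and are not taken from pacluster_bootstrap.

```python
from collections import defaultdict
from statistics import mean


def paired_deltas_by_seed(homo, hetero):
    """Join two {(seed, episode_idx, initial_state_seed): success} dicts.

    Returns seed -> list of within-pair deltas (homo - hetero).
    Keys present in only one condition are dropped: they have no pair.
    """
    deltas = defaultdict(list)
    for key in sorted(homo.keys() & hetero.keys()):
        seed = key[0]
        deltas[seed].append(homo[key] - hetero[key])
    return dict(deltas)


def pooled_gap(deltas_by_seed):
    """Point estimate of the gap: mean within-pair delta over all pairs."""
    return mean(d for ds in deltas_by_seed.values() for d in ds)
```

The per-seed delta lists are the unit that gets resampled: seeds first, then paired episodes within each drawn seed, as described in §1.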

This page exists, in part, because Henderson et al. (2018) [henderson2018matters] catalogued a set of evaluation anti-patterns the project explicitly refuses to fall into:

single-seed bar charts;
mean returns without a confidence interval;
cherry-picked checkpoints (best-of-N reporting without disclosing N);
undocumented hyperparameter sweeps masked as a single "tuned" configuration;
comparison against re-implementations of baselines instead of the baseline authors' released code.

See henderson2018matters in literature.md §5 for the citation.

2. Aggregate metrics

Beyond mean ± 95% CI, the CHAMBER leaderboard reports the rliable-style robust aggregate metrics introduced by Agarwal et al. (2021) [agarwal2021precipice]: interquartile mean (IQM), optimality gap, and performance profiles. IQM is the mean of the middle 50% of scores (the top and bottom quartiles are discarded), which makes it robust to outliers in either tail.

Optimality gap reports the expected shortfall below a target threshold: optimality_gap(X, τ) = E[max(τ - X, 0)]. This is the Agarwal et al. 2021 / rliable definition: not the empirical CDF at τ, but the average amount by which the distribution falls short of τ.
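A small, dependency-free sketch of the two scalar aggregates may help make the definitions concrete. The function names are illustrative and are not claimed to match chamber.evaluation.bootstrap.aggregate_metrics; the trimming rule (drop int(0.25 · n) scores from each tail) is the usual trimmed-mean convention and is an assumption about how ties at the quartile boundary are handled.

```python
from statistics import mean


def iqm(scores):
    """Interquartile mean: mean of what remains after trimming 25% from each tail."""
    xs = sorted(scores)
    cut = int(0.25 * len(xs))  # number of scores dropped from each end
    return mean(xs[cut:len(xs) - cut])


def optimality_gap(scores, tau):
    """Expected shortfall below tau: E[max(tau - X, 0)]."""
    return mean(max(tau - x, 0.0) for x in scores)


scores = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0, 1.0]
middle_half_mean = iqm(scores)              # mean of [0.4, 0.5, 0.6, 0.8]
shortfall = optimality_gap(scores, tau=1.0)  # scores at or above tau contribute 0
```

Scores at or above τ contribute nothing to the optimality gap, so the metric cannot be improved by pushing already-successful runs further past the target.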

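Performance profiles visualise the full score distribution as the fraction of runs scoring above each threshold τ (the empirical tail distribution, i.e. one minus the CDF), surfacing dispersion that point estimates hide.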
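For intuition, a performance profile in the Agarwal et al. (2021) sense can be computed as the fraction of runs whose score exceeds each threshold, which is one minus the empirical CDF. The helper below is a hypothetical sketch, not the rliable API or the project's renderer.

```python
def performance_profile(scores, taus):
    """Fraction of scores strictly above each threshold in taus."""
    n = len(scores)
    return [sum(1 for x in scores if x > tau) / n for tau in taus]


profile = performance_profile([0.2, 0.5, 0.5, 0.9], taus=[0.0, 0.5, 1.0])
# profile == [1.0, 0.25, 0.0]
```

Plotting the returned fractions against the thresholds gives a curve that starts near 1 and falls toward 0; two methods with the same mean can have visibly different curves.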

The rliable contract is pinned in ADR-014 §Decision and mirrored verbatim here so that the two documents stay in sync. In the docs/reference/ outline this page is §3, so the §3.1 cited in the quoted text is the seed-count table in §1 of this page, and the present section is §3.2:

Aggregate metrics across seeds use rliable-style robust statistics (Agarwal et al. 2021): interquartile mean, optimality gap, and bootstrap performance profiles, in addition to mean ± 95% bootstrap CI. The minimum-seed count per cell is the figure committed in docs/reference/evaluation.md §3.1. This is the explicit avoidance of the reporting anti-patterns catalogued by Henderson et al. 2018.

The anti-patterns catalogued by Henderson et al. (2018) are listed in §1 above.

The Phase-1 leaderboard renderer chamber-render-tables must emit a per-axis IQM column when the rliable package is available as an optional dependency, and should additionally emit optimality-gap and performance-profile artefacts in the same run. The renderer is the contractual surface — its CLI flags and output schema are the implementation work that closes this contract, scheduled as a Phase-1 follow-up (see ADR-014 for the three-table-format scaffold the renderer fills in).

chamber.evaluation.bootstrap.aggregate_metrics computes IQM and optimality gap natively (rliable-compatible definitions) so the leaderboard remains renderable without the optional extra; performance profiles delegate to rliable when the extra is installed and return None with a RuntimeWarning otherwise.

Native IQM and optimality gap are computed without optional dependencies. Performance profiles delegate to rliable; install via uv sync --extra eval to enable them.
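The "return None with a RuntimeWarning" behaviour for a missing optional extra can be sketched like this. The function name, the `extra_module` parameter, and the warning text are assumptions for illustration; the body that runs when the extra is present is a stand-in, since the actual rliable call is not shown in this page.

```python
import importlib.util
import warnings


def performance_profiles_or_none(scores, taus, extra_module="rliable"):
    """Return profiles if the optional extra is importable, else warn and return None."""
    if importlib.util.find_spec(extra_module) is None:
        warnings.warn(
            f"{extra_module} is not installed; performance profiles are skipped "
            "(install with: uv sync --extra eval)",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
    # Stand-in for the delegated call: fraction of scores above each threshold.
    n = len(scores)
    return [sum(1 for x in scores if x > tau) / n for tau in taus]
```

Checking availability with importlib.util.find_spec avoids importing the package just to discover that it is absent, and callers can treat None as "artefact not produced" without catching an exception.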

See agarwal2021precipice in literature.md §5 for the citation.

3. Pre-registration and statistical guard rails

Every Phase-0 axis spike runs against a pre-registration YAML committed to spikes/preregistration/ before launch. The YAML fixes: the hypothesis, the homogeneous baseline pair, the heterogeneous condition, the seed list, the metric, the analysis formula, and the ≥20pp gap threshold that decides admission to the v1 benchmark. Editing the YAML after a spike has launched is a project anti-pattern (see ADR-007 §Validation criteria); the corrective action is to re-launch with a new YAML and a new git tag.

The ≥20pp gap rule from ADR-007 §Validation criteria is the project's binary admission criterion: an axis survives Phase 0 if it produces a ≥20 percentage-point gap in episode success rate between homogeneous and heterogeneous agent pairs on at least one benchmark scenario, measured at the seed budget above. The 20pp threshold and the seed budget jointly determine the minimum detectable effect size (MDE) the spike is powered to find. With a binary success metric, 5 seeds at 100 evaluation episodes each give 500 paired trials, sufficient to discriminate a 20pp gap from null at standard significance levels; because episodes within a seed are correlated (§1), the effective sample size is smaller than the raw trial count. 16 seeds at the Phase-1 sample size tighten the MDE substantially and are the level required for leaderboard admission. The underlying evaluation-comparison framework is Jordan et al. (2020) [jordan2020evaluating]; see also literature.md §5.
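A back-of-envelope check of the 5 × 100 budget is sketched below. It uses a plain two-proportion normal approximation that treats the conditions as unpaired, plus the standard design-effect correction 1 + (m − 1)·ICC for clustered samples. The success rates (0.7 vs 0.5) and the intra-seed correlation (0.05) are hypothetical values chosen for the example, not measurements from any CHAMBER spike, and this is not the project's pre-registered analysis formula.

```python
import math


def design_effect(episodes_per_seed, icc):
    """Variance inflation for clustered samples: 1 + (m - 1) * ICC."""
    return 1.0 + (episodes_per_seed - 1) * icc


def gap_z_score(p_homo, p_hetero, n_per_condition, deff=1.0):
    """z statistic for a difference of two proportions at effective size n / deff."""
    n_eff = n_per_condition / deff
    se = math.sqrt(p_homo * (1 - p_homo) / n_eff + p_hetero * (1 - p_hetero) / n_eff)
    return (p_homo - p_hetero) / se


n = 5 * 100  # 5 seeds x 100 episodes per condition
z_iid = gap_z_score(0.7, 0.5, n)  # ignores within-seed correlation
z_clustered = gap_z_score(0.7, 0.5, n, deff=design_effect(100, icc=0.05))
```

Under these assumptions z_clustered is noticeably smaller than z_iid, which illustrates why the contract insists on cluster-aware intervals and a larger seed budget for leaderboard entries.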

The pre-registration YAML template — including the seed list, the hypothesis, the analysis formula, and the threshold — lives in spikes/preregistration/; see the run-spike how-to for the operational flow.

3.3 Run purpose and bootstrap policy

Each pre-registration YAML declares a run_purpose, one of:

`run_purpose`	Meaning	`bootstrap_method: iid` allowed?
`leaderboard`	Default. Admitted to the public CHAMBER ranking and the HRS bundle.	No — rejected at load time.
`power`	Power-analysis run used to size cluster-aware bootstraps; not admitted to the leaderboard.	Yes.
`debug`	Local debug / smoke run; not admitted to the leaderboard.	Yes.

run_purpose defaults to leaderboard when the field is omitted, so every existing YAML is treated as a leaderboard entry and inherits the strictest bootstrap rules without explicit migration. New YAMLs must set run_purpose explicitly — copy the value from the run-spike how-to template — so reviewers can tell at a glance whether a run is bound for the leaderboard.

The iid-not-allowed rule is enforced in chamber.evaluation.prereg.PreregistrationSpec by a Pydantic model_validator: a YAML that pairs run_purpose: leaderboard with bootstrap_method: iid fails to load with a ValidationError quoting "iid bootstrap is not permitted for leaderboard entries; use cluster (default) or hierarchical." The rule is the same one stated in §1 — a pooled iid bootstrap on seed-clustered episode data understates the CI width — restated at the schema layer so an incorrectly configured spike refuses to launch rather than silently producing a CI that cannot be admitted.
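The load-time rule can be illustrated with a dependency-free stand-in. The real check lives in a Pydantic model_validator on PreregistrationSpec and surfaces as a ValidationError; the sketch below swaps in a standard-library dataclass that raises ValueError, and the class name `PreregSketch` and its two fields are assumptions that cover only the part of the schema discussed here.

```python
from dataclasses import dataclass

_IID_MESSAGE = (
    "iid bootstrap is not permitted for leaderboard entries; "
    "use cluster (default) or hierarchical."
)
_RUN_PURPOSES = {"leaderboard", "power", "debug"}


@dataclass(frozen=True)
class PreregSketch:
    # Omitting run_purpose yields the strictest setting, as described in §3.3.
    run_purpose: str = "leaderboard"
    bootstrap_method: str = "cluster"

    def __post_init__(self):
        if self.run_purpose not in _RUN_PURPOSES:
            raise ValueError(f"unknown run_purpose: {self.run_purpose!r}")
        # Cross-field rule: leaderboard runs may not use the pooled iid bootstrap.
        if self.run_purpose == "leaderboard" and self.bootstrap_method == "iid":
            raise ValueError(_IID_MESSAGE)
```

Constructing `PreregSketch(run_purpose="power", bootstrap_method="iid")` succeeds, while `PreregSketch(bootstrap_method="iid")` fails immediately because the default purpose is leaderboard, so a misconfigured spike is rejected before any compute is spent.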

3.4 Stewardship and machine-readability

Beyond per-experiment statistics, CHAMBER artifacts (datasets, policies, leaderboard entries) follow the FAIR principles for scientific data management and stewardship (Wilkinson et al. 2016 [wilkinson2016fair]): every artifact is Findable (Zenodo DOI on every release), Accessible (Apache-2.0 licence + public mirror), Interoperable (uv.lock-pinned dependencies + SCHEMA_VERSION-pinned wire format), and Reusable (CITATION.cff + SBOM + this reporting contract).

See wilkinson2016fair in literature.md §5 and the canonical entry in refs.bib for the bibliographic record.