How to run CONCERTO on a GPU host
This page is the recipe for reproducing the M4a draft-zoo seed
checkpoint on a Linux + CUDA box. The Mac-CPU tutorial in
tutorials/hello-spike.md covers
day-to-day development; this page is what an outside contributor with
a fresh CUDA-capable Linux box and Docker installed reads to
reproduce the canonical Phase-0 artefact.
Reproduction takes ~2 hours of GPU wall-time + ~10 minutes of build
time. It targets ADR-001 §Validation criteria (the
rig-validated panda_wristcam + fetch + allegro_hand_right env)
and writes the Phase-0 M4a draft-zoo seed checkpoint
(local://artifacts/happo_seed7_step50k.pt, per
ADR-009 §Decision).
Prerequisites
- Linux (Ubuntu 22.04 or equivalent) — native, OR Windows 11 +
WSL2 (Ubuntu) + Docker Engine. The WSL2 path is CUDA-only (no
Vulkan); pass
-e CHAMBER_RENDER_BACKEND=noneon thedocker runline as described in "Run the zoo-seed training" below. - NVIDIA GPU with CUDA 12.x driver. The image uses the
nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04base; minimum driver version per NVIDIA's CUDA compatibility table is 535.54. -
Docker ≥ 24, with
nvidia-container-toolkitinstalled and configured so--gpus allworks:docker run --rm --gpus all nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04 nvidia-smiIf this prints
nvidia-smioutput, the GPU is wired correctly. -make,git, ~12 GB of disk for the image + caches.
If your box doesn't satisfy the above, the
tutorials/hello-spike.md CPU recipe
runs the same training stack against the mpe_cooperative_push env
on Mac / Linux laptop CPU in ~5 minutes.
1. Clone
git clone https://github.com/fsafaei/concerto.git
cd concerto
2. Build the GPU image
The repository ships two Dockerfiles:
Dockerfile.cpu— Ubuntu 26.04 base; no CUDA. Used by the.devcontainer/Mac dev box. Not what you want here.Dockerfile.gpu—nvidia/cudabase with three stages (dev/prod/gpu-host). Thegpu-hoststage installslibvulkan1+mesa-vulkan-drivers+libegl1+libgles2, which SAPIEN / ManiSkill v3 need to initialise the rig (ADR-001 §Risks).
Build the gpu-host stage:
docker build -f Dockerfile.gpu --target gpu-host -t concerto:gpu .
First build takes ~10 minutes (driven by uv sync --frozen
--no-dev); subsequent builds reuse the apt + uv layer caches.
3. Smoke-test the rig
Inside a container against your GPU, run the ADR-001 Stage-0 smoke test. This verifies the rig is talking to SAPIEN correctly before you spend GPU wall-time on a 100k-frame training run.
docker run --rm --gpus all -v "$PWD:/workspace" -w /workspace \
concerto:gpu make smoke
Expected: Tier-2 real-SAPIEN tests pass (the Mac CI run skips them
via sapien_gpu_available(); here they execute).
4. Run the zoo-seed training
This is the load-bearing step. It runs the 100k-frame ego-AHT HAPPO
training on the Stage-0 env at seed=7 and publishes the step-50000
checkpoint under the canonical name ADR-009 §Decision
specifies:
docker run --rm --gpus all -v "$PWD:/workspace" -w /workspace \
concerto:gpu make zoo-seed-gpu
On Windows 11 + WSL2 + Docker Engine, add -e
CHAMBER_RENDER_BACKEND=none to the same command:
docker run --rm --gpus all \
-e CHAMBER_RENDER_BACKEND=none \
-v "$PWD:/workspace" -w /workspace \
concerto:gpu make zoo-seed-gpu
CHAMBER_RENDER_BACKEND defaults to "gpu" so the original command
above still works unchanged on native Linux hosts with a real NVIDIA
Vulkan driver. Setting it to "none" disables SAPIEN's Vulkan render
path; CUDA still drives PyTorch (HAPPO training) and the Stage-0 env
uses obs_mode="state_dict" so no rendered frames are ever requested.
What the target does (scripts/repro/zoo_seed.sh):
- Probes the device. Loud-fails (exit 2) if
torch_device() != "cuda". - Invokes
chamber-spike train --config configs/training/ego_aht_happo/stage0_smoke.yaml. - On training success, copies
./artifacts/artifacts/<run_id>_step50000.pt(+ sidecar) to the M4a-contract path./artifacts/artifacts/happo_seed7_step50k.pt. - Computes the SHA-256 of the published
.ptand writes it toscripts/repro/artifacts/happo_seed7_step50k.pt.sha256. - Prints a closing line directing you to commit the SHA file via the Branch-5 manifest PR.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | Training completed, artefact published, SHA written. |
| 1 | Training failed or the published artefact wasn't found. |
| 2 | CPU-only host (torch_device() != "cuda"). |
Total GPU wall-time on a single consumer GPU: ~2 hours, plus ~1 minute of artefact handling.
Why is
--check-guaranteenot used here?The Stage-0 smoke env (
make_stage0_training_env/_Stage0SmokeEnv) loads an empty scene with no task and no reward function — it exists to validate the rig (wrapper dispatch, action routing, observation namespacing, GPU + CUDA + SAPIEN + Vulkan integration), per ADR-001 §Validation criteria. Reward stays at 0.0, so the slope test inconcerto.training.empirical_guaranteewould always trip withslope = 0,passed = False.The empirical-guarantee trip-wire is the [ADR-002 risk-mitigation
1]adr-002 gate for learnable tasks; running it on a zero-
reward rig-validation env is a category error. The Stage-1 spike protocol (ADR-007 §Stage 1) is where the gap test against a learnable task happens, and the Stage-1 runner wires
--check-guaranteeinto its own script.
5. Publish + verify
After the run, you should have on disk:
./artifacts/artifacts/happo_seed7_step50k.pt(the .pt payload)./artifacts/artifacts/happo_seed7_step50k.pt.json(the SHA-256- provenance sidecar, per
save_checkpoint) scripts/repro/artifacts/happo_seed7_step50k.pt.sha256(the hex digest, ready forgit commit)
Commit the .sha256 file and the .pt + sidecar (the M4a Phase-0
draft-zoo artefacts the project hosts out-of-tree per
plan/04 §3.8) via the Branch-5 manifest PR
(T4b.14). After that PR merges, future contributors verify the
artefact integrity with:
make zoo-seed-verify
which loads the .pt through
load_checkpoint and asserts the recomputed
SHA-256 matches the committed manifest.
Troubleshooting
Cannot connect to the Docker daemon— Docker is not running. Start it withsudo systemctl start docker(Linux) or via Docker Desktop.could not select device driver "" with capabilities: [gpu]—nvidia-container-toolkitis not installed or not configured. See NVIDIA's install guide.ChamberEnvCompatibilityError: SAPIEN/Vulkan initialisation failed— the container started but couldn't reach the GPU's Vulkan stack. On native Linux hosts with NVIDIA Vulkan drivers, check the host driver version (must be ≥535.54) and that--gpus allis on thedocker runline. On hosts where CUDA is available but Vulkan is not (e.g. Windows 11 + WSL2 + Docker Engine), setCHAMBER_RENDER_BACKEND=noneviadocker run -e CHAMBER_RENDER_BACKEND=none …; the simulator's render path is skipped (the env usesobs_mode="state_dict"so no rendered frames are needed) while CUDA continues to drive HAPPO training.