run_study
from llenergymeasure import run_study
Concept
run_study runs a structured set of experiments defined in a YAML study config and returns a
StudyResult containing all measurements together with summary statistics. It is the right
entry point whenever you need more than one experiment: sweeping over models, engines, or
parameters; repeating a configuration multiple times for statistical reliability; or comparing
across an axis in a single reproducible bundle.
The distinction from run_experiment is straightforward: run_experiment
measures one configuration and returns a single result; run_study measures many configurations
(expanded from sweep declarations at YAML parse time) and returns them together with study-level
metadata, a manifest on disk, and a result bundle you can share or archive.
run_study always writes a manifest.json to disk as a documented side-effect. The manifest is
both a resumption checkpoint (if the study is interrupted) and an audit trail linking each result
file to its config hash.
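The manifest is ordinary JSON, so it can be inspected directly. A minimal sketch, assuming a study directory of results/my-study (a made-up path; the manifest's exact schema is owned by the library, so the sketch makes no assumptions about its keys):

```python
import json
from pathlib import Path

# Pretty-print whatever the manifest contains; we only assume it is JSON on disk.
manifest = json.loads(Path("results/my-study/manifest.json").read_text())
print(json.dumps(manifest, indent=2))
```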
Simple usage
from llenergymeasure import run_study
study_result = run_study("study.yaml")
for r in study_result.experiments:
print(f"{r.model_name} / {r.engine}: {r.mj_per_tok_total:.3f} mJ/tok")
study.yaml (minimal multi-experiment form):
study_name: gpt2-comparison
experiments:
  - task:
      model: gpt2
      engine: transformers
  - task:
      model: gpt2-medium
      engine: transformers
Sweep usage
Sweeps are declared with a sweep: key in the YAML. The loader expands the Cartesian product
at parse time into a flat experiments list before any experiment runs.
study_name: model-sweep
sweep:
  axes:
    - field: task.model
      values:
        - gpt2
        - gpt2-medium
        - gpt2-large
    - field: engine
      values:
        - transformers
study_result = run_study("sweep.yaml")
print(f"Ran {study_result.summary.completed} / {study_result.summary.total_experiments} experiments")
print(f"Total energy: {study_result.summary.total_energy_j:.1f} J")
Parameter table
| Parameter | Type | Default | Description |
|---|---|---|---|
| config | str \| Path \| StudyConfig | (required) | YAML file path or a pre-built StudyConfig object. |
| skip_preflight | bool | False | Skip Docker pre-flight checks. Useful in CI or remote-daemon setups. |
| progress | ProgressCallback \| None | None | Progress callback. Receives per-experiment begin/end events and per-step events from worker processes. |
| resume_dir | Path \| None | None | Explicit study directory to resume. Overrides resume. |
| resume | bool | False | Auto-detect the most recent resumable study in output_dir and resume from the last checkpoint. |
| output_dir | Path \| None | None | Base directory used by auto-detect resume only. Ignored when resume_dir is given. |
| skip_set | set[tuple[str, int]] \| None | None | Set of (config_hash, cycle) pairs to skip. Populated automatically when resuming; callers rarely need to set this. |
| no_lock | bool | False | Skip GPU advisory lock acquisition. Equivalent to the --no-lock CLI flag. |
| config_path | Path \| None | None | Original YAML path for artefact copying when config is a StudyConfig object. Preserved in _study-artefacts/ for reproducibility. |
| cli_overrides | dict[str, Any] \| None | None | Flat dict of CLI flag overrides written to per-experiment _resolution.json sidecars. Rarely needed outside the CLI. |
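As an illustration of the callback and the CI-oriented flags: the sketch below simply logs each event, since the event payload's shape is not documented here and is not assumed.

```python
def on_progress(event) -> None:
    # Receives per-experiment begin/end events and per-step events
    # from worker processes; we just log them.
    print(f"[progress] {event}")

study_result = run_study(
    "study.yaml",
    skip_preflight=True,  # e.g. CI or remote-daemon setups, per the table above
    progress=on_progress,
)
```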
Returns
StudyResult - a Pydantic model:
study_result.experiments # list[ExperimentResult] - one per completed experiment
study_result.summary.completed # int - number of experiments that succeeded
study_result.summary.failed # int - number that failed
study_result.summary.total_energy_j # float - summed energy across all experiments
study_result.summary.total_wall_time_s # float - total wall-clock time
study_result.result_files # list[str] - paths to result.json files on disk
study_result.study_name # str | None
study_result.study_design_hash # str | None - 16-char SHA-256 of the experiment list
study_result.measurement_protocol # dict - execution config snapshot (n_cycles, order, etc.)
study_result.skipped_experiments # list[dict] - configs that failed validation at expand time
Each item in experiments is an ExperimentResult. See
Results schema for the on-disk layout.
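Because result_files holds plain paths, the raw on-disk records can be reloaded independently of the Python objects; a minimal sketch:

```python
import json
from pathlib import Path

# One JSON document per completed experiment (layout per "Results schema").
raw_results = [json.loads(Path(p).read_text()) for p in study_result.result_files]
```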
Common patterns
Filter results by engine
transformers_results = [r for r in study_result.experiments if r.engine == "transformers"]
Compare energy across models
import statistics
by_model: dict[str, list[float]] = {}
for r in study_result.experiments:
    # Skip experiments with no per-token figure rather than averaging in a misleading 0.0
    if r.mj_per_tok_total is not None:
        by_model.setdefault(r.model_name, []).append(r.mj_per_tok_total)
for model, values in by_model.items():
    print(f"{model}: mean {statistics.mean(values):.3f} mJ/tok (n={len(values)})")
Export to a DataFrame
import pandas as pd

rows = [
    {
        "model": r.model_name,
        "engine": r.engine,
        "energy_j": r.total_energy_j,
        "throughput": r.avg_tokens_per_second,
        "mj_per_tok": r.mj_per_tok_total,
    }
    for r in study_result.experiments
]
df = pd.DataFrame(rows)
Resume an interrupted study
# Picks up from the last completed experiment automatically
study_result = run_study("sweep.yaml", resume=True)
Raises
| Exception | When |
|---|---|
| ConfigError | Invalid config path or YAML parse error. |
| PreFlightError | Multi-engine study without Docker; Docker not running. |
| StudyError | resume=True but no resumable study found; config drift detected on resume (study hash changed). |
| pydantic.ValidationError | A field value fails validation. Passes through unchanged. |
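A defensive wrapper might look like the sketch below. The import location of the exception classes is an assumption here; adjust it to wherever the package actually exports them.

```python
# Assumed import path for the exception classes -- verify against the package
from llenergymeasure import ConfigError, PreFlightError, StudyError

try:
    study_result = run_study("study.yaml")
except ConfigError as exc:
    print(f"Config problem: {exc}")
except PreFlightError as exc:
    print(f"Environment not ready (Docker?): {exc}")
except StudyError as exc:
    print(f"Resume failed or config drift: {exc}")
```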
Pitfalls
Multi-engine studies require Docker. A study whose experiment list references both
engine: transformers and engine: vllm needs Docker; without it, run_study raises
PreFlightError during pre-flight checks, before any inference begins.
Result bundle on disk. Every run_study call creates a timestamped directory under
./results/ (or output.results_dir from the YAML). That directory is not cleaned up
automatically. Budget for disk space when sweeping large grids. See
Results schema for the exact layout.
Skipped configs. If a sweep axis combination fails Pydantic validation (e.g.
engine=tensorrt with a dtype that is not supported), the invalid combination is
recorded in study_result.skipped_experiments and in _study-artefacts/skipped_configs.log,
but the rest of the study continues.
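A quick audit loop for anything dropped at expand time (the dict contents are not specified here, so the sketch prints each entry whole rather than assuming key names):

```python
for entry in study_result.skipped_experiments:
    print(entry)  # one dict per config that failed validation
```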
n_cycles vs list length. study_result.summary.total_experiments reflects the expanded
cycle count (number of unique configurations × n_cycles), while summary.unique_configurations
is the count of distinct configurations before cycling. Both live on the summary; see the
worked example below.
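For instance, assuming the three-model sweep above with n_cycles: 5 in the execution block:

```python
# 3 unique configurations, each repeated for 5 cycles:
assert study_result.summary.unique_configurations == 3
assert study_result.summary.total_experiments == 15  # 3 * 5
```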
See also
- run_experiment - single-experiment convenience wrapper
- StudyConfig - the config model accepted and returned by the loader
- ExperimentResult - the per-experiment result type
- Study config reference - YAML syntax (sweep axes, execution block)
- Results schema - on-disk layout