
run_study

from llenergymeasure import run_study

Concept

run_study runs a structured set of experiments defined in a YAML study config and returns a StudyResult containing all measurements together with summary statistics. It is the right entry point whenever you need more than one experiment: sweeping over models, engines, or parameters; repeating a configuration multiple times for statistical reliability; or comparing across an axis in a single reproducible bundle.

The distinction from run_experiment is straightforward: run_experiment measures one configuration and returns a single result; run_study measures many configurations (expanded from sweep declarations at YAML parse time) and returns them together with study-level metadata, a manifest on disk, and a result bundle you can share or archive.

run_study always writes a manifest.json to disk as a documented side-effect. The manifest is both a resumption checkpoint (if the study is interrupted) and an audit trail linking each result file to its config hash.
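
Side by side, the two entry points look like the sketch below. It assumes run_experiment accepts a single-experiment config path in the same way run_study accepts a study config; see the run_experiment page for its exact signature.

from llenergymeasure import run_experiment, run_study

result = run_experiment("experiment.yaml")   # one configuration -> one ExperimentResult
bundle = run_study("study.yaml")             # many configurations -> StudyResult + manifest + result bundle on disk
print(len(bundle.experiments))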


Simple usage

from llenergymeasure import run_study

study_result = run_study("study.yaml")

for r in study_result.experiments:
    print(f"{r.model_name} / {r.engine}: {r.mj_per_tok_total:.3f} mJ/tok")

study.yaml (minimal multi-experiment form):

study_name: gpt2-comparison

experiments:
  - task:
      model: gpt2
      engine: transformers

  - task:
      model: gpt2-medium
      engine: transformers

Sweep usage

Sweeps are declared with a sweep: key in the YAML. The loader expands the Cartesian product at parse time into a flat experiments list before any experiment runs.

study_name: model-sweep

sweep:
  axes:
    - field: task.model
      values:
        - gpt2
        - gpt2-medium
        - gpt2-large

    - field: engine
      values:
        - transformers

study_result = run_study("sweep.yaml")

print(f"Ran {study_result.summary.completed} / {study_result.summary.total_experiments} experiments")
print(f"Total energy: {study_result.summary.total_energy_j:.1f} J")

Parameter table

config : str | Path | StudyConfig (required)
    YAML file path or a pre-built StudyConfig object.

skip_preflight : bool = False
    Skip Docker pre-flight checks. Useful in CI or remote-daemon setups.

progress : ProgressCallback | None = None
    Progress callback. Receives per-experiment begin/end events and per-step events from worker processes.

resume_dir : Path | None = None
    Explicit study directory to resume. Overrides resume.

resume : bool = False
    Auto-detect the most recent resumable study in output_dir and resume from the last checkpoint.

output_dir : Path | None = None
    Base directory used by auto-detect resume only. Ignored when resume_dir is given.

skip_set : set[tuple[str, int]] | None = None
    Set of (config_hash, cycle) pairs to skip. Populated automatically when resuming; callers rarely need to set this.

no_lock : bool = False
    Skip GPU advisory lock acquisition. Equivalent to the --no-lock CLI flag.

config_path : Path | None = None
    Original YAML path for artefact copying when config is a StudyConfig object. Preserved in _study-artefacts/ for reproducibility.

cli_overrides : dict[str, Any] | None = None
    Flat dict of CLI flag overrides written to per-experiment _resolution.json sidecars. Rarely needed outside the CLI.
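
A call combining a few of these parameters might look like the sketch below (the resume_dir path is hypothetical; substitute an actual study directory):

from pathlib import Path
from llenergymeasure import run_study

study_result = run_study(
    "sweep.yaml",
    skip_preflight=True,                                # e.g. in CI, where Docker checks are handled elsewhere
    resume_dir=Path("results/2025-06-01_model-sweep"),  # hypothetical directory; takes precedence over resume=
)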

Returns

StudyResult - a Pydantic model:

study_result.experiments # list[ExperimentResult] - one per completed experiment
study_result.summary.completed # int - number of experiments that succeeded
study_result.summary.failed # int - number that failed
study_result.summary.total_energy_j # float - summed energy across all experiments
study_result.summary.total_wall_time_s # float - total wall-clock time
study_result.result_files # list[str] - paths to result.json files on disk
study_result.study_name # str | None
study_result.study_design_hash # str | None - 16-char SHA-256 of the experiment list
study_result.measurement_protocol # dict - execution config snapshot (n_cycles, order, etc.)
study_result.skipped_experiments # list[dict] - configs that failed validation at expand time

Each item in experiments is an ExperimentResult. See Results schema for the on-disk layout.
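
The files listed in result_files can also be loaded directly from disk; a minimal sketch (the JSON layout itself is documented in Results schema):

import json
from pathlib import Path

results_on_disk = [json.loads(Path(p).read_text()) for p in study_result.result_files]
print(f"Loaded {len(results_on_disk)} result files")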


Common patterns

Filter results by engine

transformers_results = [r for r in study_result.experiments if r.engine == "transformers"]

Compare energy across models

import statistics

by_model: dict[str, list[float]] = {}
for r in study_result.experiments:
    by_model.setdefault(r.model_name, []).append(r.mj_per_tok_total or 0.0)

for model, values in by_model.items():
    print(f"{model}: mean {statistics.mean(values):.3f} mJ/tok (n={len(values)})")

Export to a DataFrame

import pandas as pd

rows = [
    {
        "model": r.model_name,
        "engine": r.engine,
        "energy_j": r.total_energy_j,
        "throughput": r.avg_tokens_per_second,
        "mj_per_tok": r.mj_per_tok_total,
    }
    for r in study_result.experiments
]
df = pd.DataFrame(rows)

Resume an interrupted study

# Picks up from the last completed experiment automatically
study_result = run_study("sweep.yaml", resume=True)

Raises

ConfigError
    Invalid config path or YAML parse error.

PreFlightError
    Multi-engine study without Docker; Docker not running.

StudyError
    resume=True but no resumable study found; config drift detected on resume (study hash changed).

pydantic.ValidationError
    A field value fails validation. Passes through unchanged.

Pitfalls

Multi-engine studies require Docker. A study that references both engine: transformers and engine: vllm in its experiment list cannot run without it: run_study raises PreFlightError during the pre-flight check, before any inference begins.

Result bundle on disk. Every run_study call creates a timestamped directory under ./results/ (or output.results_dir from the YAML). That directory is not cleaned up automatically. Budget for disk space when sweeping large grids. See Results schema for the exact layout.

Skipped configs. If a sweep axis combination fails Pydantic validation (e.g. engine=tensorrt with a dtype that is not supported), the invalid combination is recorded in study_result.skipped_experiments and in _study-artefacts/skipped_configs.log, but the rest of the study continues.
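
The same information is available on the returned object, without reading the log file:

for skipped in study_result.skipped_experiments:
    # each entry is the config that failed validation at expand time
    print(skipped)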

n_cycles vs list length. study_result.summary.total_experiments reflects the expanded cycle count (len(experiments) * n_cycles). summary.unique_configurations is the number of distinct configurations (pre-cycle). Both are in the summary.
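
A quick sanity check of that relationship, assuming every configuration shares the same n_cycles:

s = study_result.summary
n_cycles = s.total_experiments // s.unique_configurations  # assumes a uniform cycle count across configs
print(f"{s.unique_configurations} configs x {n_cycles} cycles = {s.total_experiments} experiments")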


See also