
run_study

from llenergymeasure import run_study

Concept

run_study runs a structured set of experiments defined in a YAML study config and returns a StudyResult containing all measurements together with summary statistics. It is the right entry point whenever you need more than one experiment: sweeping over models, engines, or parameters; repeating a configuration multiple times for statistical reliability; or comparing across an axis in a single reproducible bundle.

The distinction from run_experiment is straightforward: run_experiment measures one configuration and returns a single result; run_study measures many configurations (expanded from sweep declarations at YAML parse time) and returns them together with study-level metadata, a manifest on disk, and a result bundle you can share or archive.

run_study always writes a manifest.json to disk as a documented side-effect. The manifest is both a resumption checkpoint (if the study is interrupted) and an audit trail linking each result file to its config hash.
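
Side by side, the two entry points look like the sketch below. It assumes run_experiment accepts a single-experiment config path in the same way run_study accepts a study config; see the run_experiment page for its exact signature.

from llenergymeasure import run_experiment, run_study

result = run_experiment("experiment.yaml")   # one configuration -> one ExperimentResult
bundle = run_study("study.yaml")             # many configurations -> StudyResult + manifest + result bundle on disk
print(len(bundle.experiments))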


Simple usage

from llenergymeasure import run_study

study_result = run_study("study.yaml")

for r in study_result.experiments:
    print(f"{r.model_name} / {r.engine}: {r.mj_per_tok_total:.3f} mJ/tok")

study.yaml (minimal multi-experiment form):

study_name: gpt2-comparison

experiments:
  - task:
      model: gpt2
      engine: transformers

  - task:
      model: gpt2-medium
      engine: transformers

Sweep usage

Sweeps are declared with a sweep: key in the YAML. The loader expands the Cartesian product at parse time into a flat experiments list before any experiment runs.

study_name: model-sweep

sweep:
  axes:
    - field: task.model
      values:
        - gpt2
        - gpt2-medium
        - gpt2-large

    - field: engine
      values:
        - transformers

study_result = run_study("sweep.yaml")

print(f"Ran {study_result.summary.completed} / {study_result.summary.total_experiments} experiments")
print(f"Total energy: {study_result.summary.total_energy_j:.1f} J")

Parameter table

config : str | Path | StudyConfig (required)
    YAML file path or a pre-built StudyConfig object.

skip_preflight : bool = False
    Skip Docker pre-flight checks. Useful in CI or remote-daemon setups.

progress : ProgressCallback | None = None
    Progress callback. Receives per-experiment begin/end events and per-step events from worker processes.

resume_dir : Path | None = None
    Explicit study directory to resume. Overrides resume.

resume : bool = False
    Auto-detect the most recent resumable study in output_dir and resume from the last checkpoint.

output_dir : Path | None = None
    Base directory used by auto-detect resume only. Ignored when resume_dir is given.

skip_set : set[tuple[str, int]] | None = None
    Set of (config_hash, cycle) pairs to skip. Populated automatically when resuming; callers rarely need to set this.

no_lock : bool = False
    Skip GPU advisory lock acquisition. Equivalent to the --no-lock CLI flag.

config_path : Path | None = None
    Original YAML path for artefact copying when config is a StudyConfig object. Preserved in _study-artefacts/ for reproducibility.

cli_overrides : dict[str, Any] | None = None
    Flat dict of CLI flag overrides written to per-experiment _resolution.json sidecars. Rarely needed outside the CLI.
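
A call combining a few of these parameters might look like the sketch below (the resume_dir path is hypothetical; substitute an actual study directory):

from pathlib import Path
from llenergymeasure import run_study

study_result = run_study(
    "sweep.yaml",
    skip_preflight=True,                                # e.g. in CI, where Docker checks are handled elsewhere
    resume_dir=Path("results/2025-06-01_model-sweep"),  # hypothetical directory; takes precedence over resume=
)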

Returns

StudyResult - a Pydantic model:

study_result.experiments # list[ExperimentResult] - one per completed experiment
study_result.summary.completed # int - number of experiments that succeeded
study_result.summary.failed # int - number that failed
study_result.summary.total_energy_j # float - summed energy across all experiments
study_result.summary.total_wall_time_s # float - total wall-clock time
study_result.result_files # list[str] - paths to result.json files on disk
study_result.study_name # str | None
study_result.study_design_hash # str | None - 16-char SHA-256 of the experiment list
study_result.measurement_protocol # dict - execution config snapshot (n_cycles, order, etc.)
study_result.skipped_experiments # list[dict] - configs that failed validation at expand time

Each item in experiments is an ExperimentResult. See Results schema for the on-disk layout.
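
The files listed in result_files can also be loaded directly from disk; a minimal sketch (the JSON layout itself is documented in Results schema):

import json
from pathlib import Path

results_on_disk = [json.loads(Path(p).read_text()) for p in study_result.result_files]
print(f"Loaded {len(results_on_disk)} result files")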


Common patterns

Filter results by engine

transformers_results = [r for r in study_result.experiments if r.engine == "transformers"]

Compare energy across models

import statistics

by_model: dict[str, list[float]] = {}
for r in study_result.experiments:
    by_model.setdefault(r.model_name, []).append(r.mj_per_tok_total or 0.0)

for model, values in by_model.items():
    print(f"{model}: mean {statistics.mean(values):.3f} mJ/tok (n={len(values)})")

Export to a DataFrame

import pandas as pd

rows = [
    {
        "model": r.model_name,
        "engine": r.engine,
        "energy_j": r.total_energy_j,
        "throughput": r.avg_tokens_per_second,
        "mj_per_tok": r.mj_per_tok_total,
    }
    for r in study_result.experiments
]
df = pd.DataFrame(rows)

Resume an interrupted study

# Picks up from the last completed experiment automatically
study_result = run_study("sweep.yaml", resume=True)

Raises

ConfigError
    Invalid config path or YAML parse error.

PreFlightError
    Multi-engine study without Docker; Docker not running.

StudyError
    resume=True but no resumable study found; config drift detected on resume (study hash changed).

pydantic.ValidationError
    A field value fails validation. Passes through unchanged.

Pitfalls

Multi-engine studies require Docker. A study that references both engine: transformers and engine: vllm in its experiment list cannot run without it: run_study raises PreFlightError during the pre-flight check, before any inference begins.

Result bundle on disk. Every run_study call creates a timestamped directory under ./results/ (or output.results_dir from the YAML). That directory is not cleaned up automatically. Budget for disk space when sweeping large grids. See Results schema for the exact layout.

Skipped configs. If a sweep axis combination fails Pydantic validation (e.g. engine=tensorrt with a dtype that is not supported), the invalid combination is recorded in study_result.skipped_experiments and in _study-artefacts/skipped_configs.log, but the rest of the study continues.
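
The same information is available on the returned object, without reading the log file:

for skipped in study_result.skipped_experiments:
    # each entry is the config that failed validation at expand time
    print(skipped)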

n_cycles vs list length. study_result.summary.total_experiments reflects the expanded cycle count (len(experiments) * n_cycles). summary.unique_configurations is the number of distinct configurations (pre-cycle). Both are in the summary.
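
A quick sanity check of that relationship, assuming every configuration shares the same n_cycles:

s = study_result.summary
n_cycles = s.total_experiments // s.unique_configurations  # assumes a uniform cycle count across configs
print(f"{s.unique_configurations} configs x {n_cycles} cycles = {s.total_experiments} experiments")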


See also