ExperimentResult

from llenergymeasure import ExperimentResult

Concept

ExperimentResult is the data structure returned by run_experiment and contained in StudyResult.experiments. It is a frozen Pydantic model that aggregates measurements across all GPU processes into a single user-facing record.

For single-GPU experiments, the aggregation is trivial. For multi-GPU experiments, energy values are summed across processes and throughput values are averaged; the raw per-process data is preserved in process_results for downstream analysis.
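The aggregation rule can be sketched in a few lines of plain Python. This is a sketch with hypothetical per-process numbers, not the library's internal code:

```python
# Hypothetical per-process measurements from a 2-GPU run.
per_process = [
    {"energy_j": 1200.0, "tokens_per_second": 85.0},
    {"energy_j": 1150.0, "tokens_per_second": 83.0},
]

# Energy is summed across processes; throughput is averaged.
total_energy_j = sum(p["energy_j"] for p in per_process)
avg_tokens_per_second = sum(p["tokens_per_second"] for p in per_process) / len(per_process)

print(total_energy_j)         # 2350.0
print(avg_tokens_per_second)  # 84.0
```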

ExperimentResult mirrors the on-disk result.json schema closely - the JSON on disk is produced by model.model_dump(mode="json") and shares the same field names and units. See Results schema for the full on-disk layout including manifest.json and timeseries.parquet.

ExperimentResult is almost always returned by the harness, not constructed by users.


Fields

Identity

| Field | Type | Description |
| --- | --- | --- |
| schema_version | str | Result schema version (current: "3.0"). |
| experiment_id | str | Unique identifier for this experiment run. |
| measurement_config_hash | str | 16-char SHA-256 hex of the ExperimentConfig (environment fields excluded). Matches the hash in the result directory name on disk. |
| llenergymeasure_version | str \| None | Package version that produced this result. |

Engine and model

| Field | Type | Description |
| --- | --- | --- |
| engine | str | Inference engine used: "transformers", "vllm", or "tensorrt". |
| engine_version | str \| None | Engine version string for reproducibility (e.g. "4.47.0" for Transformers). |
| model_name | str | Model name or path used. |

Measurement methodology

| Field | Type | Description |
| --- | --- | --- |
| measurement_methodology | "total" \| "steady_state" \| "windowed" | What was measured: the full run, the steady-state window after warmup, or an explicit time window. |
| steady_state_window | tuple[float, float] \| None | (start_sec, end_sec) relative to experiment start. None when methodology="total". |
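Since the window is Optional, guard before unpacking it. A sketch with a hypothetical window value:

```python
# Hypothetical steady-state window, in seconds relative to experiment start.
# The field is None when measurement_methodology == "total".
steady_state_window = (12.5, 72.5)

if steady_state_window is not None:
    start_sec, end_sec = steady_state_window
    window_len_sec = end_sec - start_sec  # 60.0 s of steady-state measurement
```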

Core metrics

| Field | Type | Units | Description |
| --- | --- | --- | --- |
| total_tokens | int | tokens | Total tokens generated across all processes. |
| total_energy_j | float | joules | Total GPU energy (summed across processes). |
| energy_adjusted_j | float \| None | joules | Baseline-subtracted energy attributable to inference. None when no baseline was taken. |
| total_inference_time_sec | float | seconds | Wall time for the inference phase. |
| avg_tokens_per_second | float | tok/s | Throughput (averaged across processes). |
| avg_energy_per_token_j | float | J/tok | Mean energy per token. |
| mj_per_tok_total | float \| None | mJ/tok | Millijoules per token from total (unadjusted) energy. |
| mj_per_tok_adjusted | float \| None | mJ/tok | Millijoules per token from baseline-adjusted energy. None when energy_adjusted_j is None. |
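The per-token metrics are simple derivations from the totals. A sketch with made-up numbers to show how the units relate (not the library's exact code):

```python
# Made-up totals, to show how the derived metrics relate.
total_tokens = 10_000
total_energy_j = 250.0

avg_energy_per_token_j = total_energy_j / total_tokens  # J/tok
mj_per_tok_total = avg_energy_per_token_j * 1000.0      # mJ/tok

print(avg_energy_per_token_j)  # 0.025 J/tok
print(mj_per_tok_total)        # 25.0 mJ/tok
```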

FLOPs metrics

| Field | Type | Description |
| --- | --- | --- |
| total_flops | float | Estimated FLOPs. Derived from model config (reference metadata, not measured). |
| flops_per_output_token | float \| None | FLOPs per decode token. None when total_flops=0 or output_tokens=0. |
| flops_per_input_token | float \| None | FLOPs per prefill token. None when total_flops=0 or input_tokens=0. |
| flops_per_second | float \| None | FLOPs throughput (total_flops / inference_time_sec). None when time=0 or flops=0. |

Energy detail

| Field | Type | Description |
| --- | --- | --- |
| baseline_power_w | float \| None | Idle GPU power in watts measured before the experiment. None when baseline measurement is disabled. |
| energy_per_device_j | list[float] \| None | Per-GPU energy breakdown. Currently populated by the Zeus sampler only. None for NVML and CodeCarbon. |
| energy_breakdown | EnergyBreakdown \| None | Detailed breakdown with baseline adjustment intervals. |

Multi-GPU

| Field | Type | Description |
| --- | --- | --- |
| multi_gpu | MultiGPUMetrics \| None | Multi-GPU aggregate metrics. None for single-GPU experiments. |
| process_results | list[RawProcessResult] | Raw per-process measurements (single item for single-GPU). |
| aggregation | AggregationMetadata \| None | Aggregation method and quality flags (populated for multi-GPU runs). |

Quality and reproducibility

| Field | Type | Description |
| --- | --- | --- |
| measurement_warnings | list[str] | Quality warnings (e.g. short duration, thermal drift detected). |
| warmup_excluded_samples | int \| None | Prompts excluded during warmup. None when methodology="total". |
| reproducibility_notes | str | Fixed disclaimer about NVML measurement accuracy (±5%). |
| thermal_throttle | ThermalThrottleInfo \| None | GPU thermal and power throttle events during the run. |
| warmup_result | WarmupResult \| None | Warmup convergence result (populated when CV convergence detection is enabled). |

Timing

| Field | Type | Description |
| --- | --- | --- |
| start_time | datetime | Earliest process start time (UTC). |
| end_time | datetime | Latest process end time (UTC). |

Sidecar

| Field | Type | Description |
| --- | --- | --- |
| timeseries | str \| None | Relative filename of the timeseries Parquet sidecar (e.g. "timeseries.parquet"). None when timeseries saving is disabled. |
| latency_stats | LatencyStatistics \| None | TTFT/ITL statistics from streaming inference. None for non-streaming engines. |
| extended_metrics | ExtendedEfficiencyMetrics \| None | Extended efficiency metrics (TPOT, memory, GPU utilisation). Always present when the harness runs successfully; fields within are None when not computable. |

Properties

ExperimentResult exposes two computed properties:

| Property | Type | Description |
| --- | --- | --- |
| duration_sec | float | Total experiment duration in seconds (end_time - start_time). |
| tokens_per_joule | float | Overall energy efficiency (total_tokens / total_energy_j). 0.0 when total_energy_j is zero. |
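Both properties are straightforward derivations from the fields above; a self-contained sketch with hypothetical timestamps and totals:

```python
from datetime import datetime, timezone

# Hypothetical timestamps and totals, standing in for result fields.
start_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
end_time = datetime(2024, 1, 1, 12, 5, 30, tzinfo=timezone.utc)
total_tokens, total_energy_j = 10_000, 250.0

duration_sec = (end_time - start_time).total_seconds()                 # 330.0
tokens_per_joule = total_tokens / total_energy_j if total_energy_j else 0.0  # 40.0
```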

Common patterns

Extract the headline efficiency metrics

result = run_experiment(model="gpt2", engine="transformers")

print(f"Energy (total): {result.total_energy_j:.2f} J")
print(f"Energy (adjusted): {result.energy_adjusted_j or 'N/A'}")
mj = result.mj_per_tok_total  # Optional: guard before applying a format spec
print(f"mJ/tok (total): {mj:.3f}" if mj is not None else "mJ/tok (total): N/A")
print(f"mJ/tok (adjusted): {result.mj_per_tok_adjusted or 'N/A'}")
print(f"Throughput: {result.avg_tokens_per_second:.1f} tok/s")
print(f"FLOPs/s: {result.flops_per_second or 'N/A'}")

Compare two results

a = run_experiment(model="gpt2", engine="transformers")
b = run_experiment(model="gpt2-medium", engine="transformers")

ratio = b.mj_per_tok_total / a.mj_per_tok_total
print(f"gpt2-medium is {ratio:.2f}x more expensive per token than gpt2")

Serialise to JSON

import json

with open("result.json", "w") as f:
    json.dump(result.model_dump(mode="json"), f, indent=2, default=str)

The on-disk result.json written by run_experiment / run_study uses this same serialisation. Loading it back:

from pathlib import Path

data = json.loads(Path("results/study_name/001_c0_.../result.json").read_text())
loaded = ExperimentResult(**data)

Check for quality warnings

if result.measurement_warnings:
    for w in result.measurement_warnings:
        print(f"Warning: {w}")

if result.thermal_throttle and result.thermal_throttle.throttle_detected:
    print("Thermal throttling detected - results may be unreliable")

Pitfalls

energy_adjusted_j and mj_per_tok_adjusted are None when baseline is disabled. If measurement.baseline.enabled=False, neither field is populated. Always guard with if result.energy_adjusted_j is not None before using them.

energy_per_device_j is only populated by the Zeus sampler. With energy_sampler="nvml" or energy_sampler="codecarbon" (or "auto" resolving to either), energy_per_device_j is None. Use process_results[i].energy_metrics for per-process energy when not using Zeus.

extended_metrics fields can be None within an otherwise-present object. The ExtendedEfficiencyMetrics object is always attached but individual sub-fields (e.g. TTFT, memory bandwidth) are None when the data required to compute them was not available (non-streaming inference, no memory bandwidth counters, etc.).

flops_per_* are reference estimates, not measured values. FLOPs are estimated from model config (parameter count and sequence lengths) via AutoConfig, not from hardware counters. They are useful for relative comparisons but not for absolute roofline analysis.
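For intuition, a common back-of-envelope estimate for decode is roughly 2 FLOPs per parameter per generated token. The sketch below uses that generic approximation; it is not necessarily the exact formula the library applies:

```python
# Generic approximation: ~2 FLOPs per parameter per decode token.
# Not necessarily the formula llenergymeasure uses internally.
n_params = 124_000_000  # e.g. a gpt2-scale model
output_tokens = 1_000

total_flops = 2.0 * n_params * output_tokens
flops_per_output_token = total_flops / output_tokens  # == 2 * n_params
```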

Frozen model - no mutation. ExperimentResult has frozen=True. Attempting to set a field raises ValidationError. Use model_copy(update=...) to derive a modified copy.
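The frozen/model_copy behaviour can be demonstrated with a minimal stand-in model (MiniResult below is hypothetical, defined only to show the mechanics):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class MiniResult(BaseModel):
    """Hypothetical stand-in for ExperimentResult's frozen behaviour."""
    model_config = ConfigDict(frozen=True)
    total_tokens: int

r = MiniResult(total_tokens=100)

try:
    r.total_tokens = 200  # frozen model: assignment raises ValidationError
except ValidationError:
    pass

# Derive a modified copy instead of mutating in place.
r2 = r.model_copy(update={"total_tokens": 200})
```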


See also

  • run_experiment - the function that returns an ExperimentResult
  • run_study - returns StudyResult containing list[ExperimentResult]
  • Results schema - the on-disk result.json schema this mirrors