ExperimentResult

from llenergymeasure import ExperimentResult

Concept

ExperimentResult is the data structure returned by run_experiment and contained in StudyResult.experiments. It is a frozen Pydantic model that aggregates measurements across all GPU processes into a single user-facing record.

For single-GPU experiments, the aggregation is trivial. For multi-GPU experiments, energy values are summed across processes and throughput values are averaged; the raw per-process data is preserved in process_results for downstream analysis.
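The aggregation rule can be sketched in a few lines of plain Python. This is a sketch with hypothetical per-process numbers, not the library's internal code:

```python
# Hypothetical per-process measurements from a 2-GPU run.
per_process = [
    {"energy_j": 1200.0, "tokens_per_second": 85.0},
    {"energy_j": 1150.0, "tokens_per_second": 83.0},
]

# Energy is summed across processes; throughput is averaged.
total_energy_j = sum(p["energy_j"] for p in per_process)
avg_tokens_per_second = sum(p["tokens_per_second"] for p in per_process) / len(per_process)

print(total_energy_j)         # 2350.0
print(avg_tokens_per_second)  # 84.0
```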

ExperimentResult mirrors the on-disk result.json schema closely - the JSON on disk is produced by model.model_dump(mode="json") and shares the same field names and units. See Results schema for the full on-disk layout including manifest.json and timeseries.parquet.

ExperimentResult is almost always returned by the harness, not constructed by users.


Fields

Identity

| Field | Type | Description |
| --- | --- | --- |
| schema_version | str | Result schema version (current: "3.0"). |
| experiment_id | str | Unique identifier for this experiment run. |
| measurement_config_hash | str | 16-char SHA-256 hex of the ExperimentConfig (environment fields excluded). Matches the hash in the result directory name on disk. |
| llenergymeasure_version | str \| None | Package version that produced this result. |

Engine and model

| Field | Type | Description |
| --- | --- | --- |
| engine | str | Inference engine used: "transformers", "vllm", or "tensorrt". |
| engine_version | str \| None | Engine version string for reproducibility (e.g. "4.47.0" for Transformers). |
| model_name | str | Model name or path used. |

Measurement methodology

| Field | Type | Description |
| --- | --- | --- |
| measurement_methodology | "total" \| "steady_state" \| "windowed" | What was measured: the full run, the steady-state window after warmup, or an explicit time window. |
| steady_state_window | tuple[float, float] \| None | (start_sec, end_sec) relative to experiment start. None when methodology="total". |
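Since the window is Optional, guard before unpacking it. A sketch with a hypothetical window value:

```python
# Hypothetical steady-state window, in seconds relative to experiment start.
# The field is None when measurement_methodology == "total".
steady_state_window = (12.5, 72.5)

if steady_state_window is not None:
    start_sec, end_sec = steady_state_window
    window_len_sec = end_sec - start_sec  # 60.0 s of steady-state measurement
```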

Core metrics

| Field | Type | Units | Description |
| --- | --- | --- | --- |
| total_tokens | int | tokens | Total tokens generated across all processes. |
| total_energy_j | float | joules | Total GPU energy (summed across processes). |
| energy_adjusted_j | float \| None | joules | Baseline-subtracted energy attributable to inference. None when no baseline was taken. |
| total_inference_time_sec | float | seconds | Wall time for the inference phase. |
| avg_tokens_per_second | float | tok/s | Throughput (averaged across processes). |
| avg_energy_per_token_j | float | J/tok | Mean energy per token. |
| mj_per_tok_total | float \| None | mJ/tok | Millijoules per token from total (unadjusted) energy. |
| mj_per_tok_adjusted | float \| None | mJ/tok | Millijoules per token from baseline-adjusted energy. None when energy_adjusted_j is None. |
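The per-token metrics are simple derivations from the totals. A sketch with made-up numbers to show how the units relate (not the library's exact code):

```python
# Made-up totals, to show how the derived metrics relate.
total_tokens = 10_000
total_energy_j = 250.0

avg_energy_per_token_j = total_energy_j / total_tokens  # J/tok
mj_per_tok_total = avg_energy_per_token_j * 1000.0      # mJ/tok

print(avg_energy_per_token_j)  # 0.025 J/tok
print(mj_per_tok_total)        # 25.0 mJ/tok
```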

FLOPs metrics

| Field | Type | Description |
| --- | --- | --- |
| total_flops | float | Estimated FLOPs. Derived from model config (reference metadata, not measured). |
| flops_per_output_token | float \| None | FLOPs per decode token. None when total_flops=0 or output_tokens=0. |
| flops_per_input_token | float \| None | FLOPs per prefill token. None when total_flops=0 or input_tokens=0. |
| flops_per_second | float \| None | FLOPs throughput (total_flops / inference_time_sec). None when time=0 or flops=0. |

Energy detail

| Field | Type | Description |
| --- | --- | --- |
| baseline_power_w | float \| None | Idle GPU power in watts measured before the experiment. None when baseline measurement is disabled. |
| energy_per_device_j | list[float] \| None | Per-GPU energy breakdown. Currently populated by the Zeus sampler only. None for NVML and CodeCarbon. |
| energy_breakdown | EnergyBreakdown \| None | Detailed breakdown with baseline adjustment intervals. |

Multi-GPU

| Field | Type | Description |
| --- | --- | --- |
| multi_gpu | MultiGPUMetrics \| None | Multi-GPU aggregate metrics. None for single-GPU experiments. |
| process_results | list[RawProcessResult] | Raw per-process measurements (single item for single-GPU). |
| aggregation | AggregationMetadata \| None | Aggregation method and quality flags (populated for multi-GPU runs). |

Quality and reproducibility

| Field | Type | Description |
| --- | --- | --- |
| measurement_warnings | list[str] | Quality warnings (e.g. short duration, thermal drift detected). |
| warmup_excluded_samples | int \| None | Prompts excluded during warmup. None when methodology="total". |
| reproducibility_notes | str | Fixed disclaimer about NVML measurement accuracy (±5%). |
| thermal_throttle | ThermalThrottleInfo \| None | GPU thermal and power throttle events during the run. |
| warmup_result | WarmupResult \| None | Warmup convergence result (populated when CV convergence detection is enabled). |

Timing

| Field | Type | Description |
| --- | --- | --- |
| start_time | datetime | Earliest process start time (UTC). |
| end_time | datetime | Latest process end time (UTC). |

Sidecar

| Field | Type | Description |
| --- | --- | --- |
| timeseries | str \| None | Relative filename of the timeseries Parquet sidecar (e.g. "timeseries.parquet"). None when timeseries saving is disabled. |
| latency_stats | LatencyStatistics \| None | TTFT/ITL statistics from streaming inference. None for non-streaming engines. |
| extended_metrics | ExtendedEfficiencyMetrics \| None | Extended efficiency metrics (TPOT, memory, GPU utilisation). Always present when the harness runs successfully; fields within are None when not computable. |

Properties

ExperimentResult exposes two computed properties:

| Property | Type | Description |
| --- | --- | --- |
| duration_sec | float | Total experiment duration in seconds (end_time - start_time). |
| tokens_per_joule | float | Overall energy efficiency (total_tokens / total_energy_j). 0.0 when total_energy_j is zero. |
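Both properties are straightforward derivations from the fields above; a self-contained sketch with hypothetical timestamps and totals:

```python
from datetime import datetime, timezone

# Hypothetical timestamps and totals, standing in for result fields.
start_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
end_time = datetime(2024, 1, 1, 12, 5, 30, tzinfo=timezone.utc)
total_tokens, total_energy_j = 10_000, 250.0

duration_sec = (end_time - start_time).total_seconds()                 # 330.0
tokens_per_joule = total_tokens / total_energy_j if total_energy_j else 0.0  # 40.0
```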

Common patterns

Extract the headline efficiency metrics

result = run_experiment(model="gpt2", engine="transformers")

print(f"Energy (total): {result.total_energy_j:.2f} J")
print(f"Energy (adjusted): {result.energy_adjusted_j or 'N/A'}")
mj = result.mj_per_tok_total  # Optional: guard before applying a format spec
print(f"mJ/tok (total): {mj:.3f}" if mj is not None else "mJ/tok (total): N/A")
print(f"mJ/tok (adjusted): {result.mj_per_tok_adjusted or 'N/A'}")
print(f"Throughput: {result.avg_tokens_per_second:.1f} tok/s")
print(f"FLOPs/s: {result.flops_per_second or 'N/A'}")

Compare two results

a = run_experiment(model="gpt2", engine="transformers")
b = run_experiment(model="gpt2-medium", engine="transformers")

ratio = b.mj_per_tok_total / a.mj_per_tok_total
print(f"gpt2-medium is {ratio:.2f}x more expensive per token than gpt2")

Serialise to JSON

import json

with open("result.json", "w") as f:
    json.dump(result.model_dump(mode="json"), f, indent=2, default=str)

The on-disk result.json written by run_experiment / run_study uses this same serialisation. Loading it back:

from pathlib import Path

data = json.loads(Path("results/study_name/001_c0_.../result.json").read_text())
loaded = ExperimentResult(**data)

Check for quality warnings

if result.measurement_warnings:
    for w in result.measurement_warnings:
        print(f"Warning: {w}")

if result.thermal_throttle and result.thermal_throttle.throttle_detected:
    print("Thermal throttling detected - results may be unreliable")

Pitfalls

energy_adjusted_j and mj_per_tok_adjusted are None when baseline is disabled. If measurement.baseline.enabled=False, neither field is populated. Always guard with if result.energy_adjusted_j is not None before using them.

energy_per_device_j is only populated by the Zeus sampler. With energy_sampler="nvml" or energy_sampler="codecarbon" (or "auto" resolving to either), energy_per_device_j is None. Use process_results[i].energy_metrics for per-process energy when not using Zeus.

extended_metrics fields can be None within an otherwise-present object. The ExtendedEfficiencyMetrics object is always attached but individual sub-fields (e.g. TTFT, memory bandwidth) are None when the data required to compute them was not available (non-streaming inference, no memory bandwidth counters, etc.).

flops_per_* are reference estimates, not measured values. FLOPs are estimated from model config (parameter count and sequence lengths) via AutoConfig, not from hardware counters. They are useful for relative comparisons but not for absolute roofline analysis.
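For intuition, a common back-of-envelope estimate for decode is roughly 2 FLOPs per parameter per generated token. The sketch below uses that generic approximation; it is not necessarily the exact formula the library applies:

```python
# Generic approximation: ~2 FLOPs per parameter per decode token.
# Not necessarily the formula llenergymeasure uses internally.
n_params = 124_000_000  # e.g. a gpt2-scale model
output_tokens = 1_000

total_flops = 2.0 * n_params * output_tokens
flops_per_output_token = total_flops / output_tokens  # == 2 * n_params
```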

Frozen model - no mutation. ExperimentResult has frozen=True. Attempting to set a field raises ValidationError. Use model_copy(update=...) to derive a modified copy.
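The frozen/model_copy behaviour can be demonstrated with a minimal stand-in model (MiniResult below is hypothetical, defined only to show the mechanics):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class MiniResult(BaseModel):
    """Hypothetical stand-in for ExperimentResult's frozen behaviour."""
    model_config = ConfigDict(frozen=True)
    total_tokens: int

r = MiniResult(total_tokens=100)

try:
    r.total_tokens = 200  # frozen model: assignment raises ValidationError
except ValidationError:
    pass

# Derive a modified copy instead of mutating in place.
r2 = r.model_copy(update={"total_tokens": 200})
```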


See also

  • run_experiment - the function that returns an ExperimentResult
  • run_study - returns StudyResult containing list[ExperimentResult]
  • Results schema - the on-disk result.json schema this mirrors