How to Read LLenergyMeasure Output
When LLenergyMeasure runs a measurement, it prints a summary to the terminal and saves a detailed result file. This guide explains what each number means - in plain language.
If you have not yet run a measurement, see the Quick start first.
What You Will See
After running a measurement, the terminal prints output like this:
Result: gpt2_20260507_143208
Energy
Total 1,928 J
Baseline 12.3 W
Adjusted 723 J
Performance
Throughput 312 tok/s
FLOPs 4.21e+11 (roofline, medium)
Timing
Duration 1m 38s
Warmup 5 prompts excluded
A full result file is also saved to the results/ folder. The sections below explain each metric.
The Experiment ID
Result: gpt2_20260507_143208
This is a unique identifier for this specific measurement. It encodes the
model name and a UTC timestamp in YYYYMMDD_HHMMSS form. For full
configuration provenance, see the effective_config block inside
result.json - it records every setting that influenced the measurement,
including engine choice, dtype, and engine defaults.
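If you need the model name and timestamp programmatically, the ID can be split back into its parts. A minimal sketch, assuming the layout is always model name followed by two underscore-separated date/time fields (the helper below is illustrative, not part of the tool):

```python
from datetime import datetime

def parse_experiment_id(experiment_id: str) -> tuple[str, datetime]:
    """Split an ID like 'gpt2_20260507_143208' into model name and UTC timestamp.

    Assumes the last two underscore-separated fields are the date and time;
    everything before them is the model name (which may itself contain underscores).
    """
    parts = experiment_id.rsplit("_", 2)
    model = parts[0]
    timestamp = datetime.strptime(parts[1] + parts[2], "%Y%m%d%H%M%S")
    return model, timestamp

model, ts = parse_experiment_id("gpt2_20260507_143208")
# model is "gpt2"; ts is 2026-05-07 14:32:08
```

Using `rsplit` rather than `split` keeps model names with underscores intact.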
Energy Metrics
Energy is measured in joules (J) - a standard unit of energy. One joule equals one watt of power consumed for one second.
Total (J) - Raw GPU energy
The total electrical energy drawn by the GPU during the entire measurement period, from first prompt to last.
In practice: This includes both the energy for actual inference work and the energy the GPU would have consumed anyway just by being switched on ("idle power").
Baseline (W) - Idle GPU power
The power the GPU draws when doing nothing - measured before the experiment starts and used as a reference point.
Analogy: Like measuring a car's fuel consumption at idle before a test drive, so you can subtract it from the total and isolate the fuel used specifically for driving.
In practice: This is reported in watts (W), not joules, because it is a power level rather than a total energy amount.
Adjusted (J) - Net inference energy
The most meaningful energy metric: total energy minus the idle energy (baseline power multiplied by duration). This represents the energy specifically attributable to running the AI model.
Formula: Adjusted = Total - (Baseline x Duration)
In practice: Use the adjusted figure when comparing models. Two models running on the same hardware for the same task may have different baseline subtractions; the adjusted figure puts them on equal footing.
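The formula can be applied directly to the printed figures. A quick sketch with illustrative numbers (the function is not part of the tool):

```python
def adjusted_energy_j(total_j: float, baseline_w: float, duration_s: float) -> float:
    """Net inference energy: total GPU energy minus the idle-power component.

    baseline_w (watts) times duration_s (seconds) is the energy the GPU would
    have drawn anyway; subtracting it isolates the inference work.
    """
    return total_j - baseline_w * duration_s

# Illustrative values: 1,928 J total, 12.3 W idle, 98 s duration.
print(adjusted_energy_j(1928.0, 12.3, 98.0))  # ≈ 722.6 J
```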
Performance Metrics
Throughput (tok/s) - Output speed
How many output tokens the model generated per second, averaged across the entire experiment.
In practice: Higher throughput means faster responses. A model producing 312 tok/s completes 100 prompts roughly 3× faster than a model at 100 tok/s.
Note: Throughput is measured across all prompts in the experiment. A single short prompt may feel fast even at low throughput; the experiment-level figure reflects sustained performance.
FLOPs - Computational work
An estimate of the number of floating-point calculations the model performed. Reported in scientific notation (e.g., 4.21e+11 = 421 billion FLOPs).
The result also shows:
- Method (e.g., roofline) - how the FLOPs were estimated
- Confidence (e.g., medium) - how reliable the estimate is
In practice: FLOPs are most useful for comparing models of different sizes. A larger model will naturally have higher FLOPs. If two models have similar FLOPs but very different energy, that suggests a hardware or configuration efficiency difference rather than a model complexity difference.
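One way to turn energy and FLOPs into a single efficiency figure is joules per billion operations. A hypothetical helper, using illustrative values (not from a real run):

```python
def joules_per_gflop(adjusted_j: float, total_flops: float) -> float:
    """Energy per billion floating-point operations.

    A rough efficiency figure: similar FLOPs but very different energy across
    two runs points at a hardware or configuration difference, not model size.
    """
    return adjusted_j / (total_flops / 1e9)

# Illustrative: 723 J of adjusted energy over 4.21e+11 FLOPs.
print(joules_per_gflop(723.0, 4.21e11))  # ≈ 1.72 J per GFLOP
```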
Timing Metrics
Duration - Total experiment time
The wall-clock time from the start of the first prompt to the end of the last, including all processing time.
In practice: Duration × Baseline power gives the idle energy component (which is subtracted to produce the Adjusted energy figure).
Warmup - Excluded prompts
The number of prompts run at the start that were excluded from the reported metrics.
Why: GPUs do not reach a stable temperature and clock speed immediately. The first few prompts run while the hardware is still warming up, which produces unrepresentative measurements. Warmup prompts are run first and then discarded, ensuring the reported metrics reflect steady-state operation.
In practice: If the experiment ran 100 prompts and 5 were warmup, the metrics are based on 95 prompts. The total_prompts field in the result file shows the total including warmup; the metrics are calculated from the non-warmup prompts only.
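The exclusion amounts to slicing off the first few prompts before aggregating. A sketch with hypothetical per-prompt energy readings (note the warmer first prompts):

```python
def steady_state_metrics(per_prompt_energy_j: list[float], warmup: int) -> dict:
    """Drop the first `warmup` prompts, then aggregate only the rest.

    Mirrors the warmup exclusion described above: reported figures come
    from the steady-state prompts only.
    """
    measured = per_prompt_energy_j[warmup:]
    return {
        "prompts_measured": len(measured),
        "total_energy_j": sum(measured),
        "mean_energy_j": sum(measured) / len(measured),
    }

# Illustrative: 7 prompts, first 2 are warmup (and draw noticeably more energy).
stats = steady_state_metrics([9.1, 8.4, 7.2, 7.1, 7.3, 7.0, 7.2], warmup=2)
```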
The Result File
The full result is saved as a JSON file in the results/ directory. Key fields:
| Field | What it means |
|---|---|
| total_energy_j | Total GPU energy in joules (same as "Total" in terminal output) |
| energy_adjusted_j | Baseline-subtracted energy in joules (same as "Adjusted" in terminal output). null when no baseline was measured. |
| baseline_power_w | Idle GPU power in watts (same as "Baseline"). null when baseline disabled. |
| mj_per_tok_total | Millijoules per token from raw (unadjusted) energy. |
| mj_per_tok_adjusted | Millijoules per token from baseline-adjusted energy. The right field for cross-experiment comparisons. |
| avg_tokens_per_second | Output throughput in tokens/second (same as "Throughput"). |
| total_inference_time_sec | Wall-clock inference time in seconds (same as "Duration"). |
| total_tokens | Total tokens processed across all prompts. |
| total_flops | Estimated total floating-point operations. |
| effective_config | The exact configuration used (model, dtype, engine, etc.). |
The effective_config section is particularly important for reproducibility - it records every setting that influenced the measurement, including defaults that were not explicitly specified.
For the full schema, see Reference: results schema.
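When post-processing result files, a small helper can pick the right per-token field and fall back gracefully. A sketch assuming only the field names listed in the table above (the filename in the comment is a guess at the naming pattern, not guaranteed):

```python
def per_token_mj(result: dict) -> float:
    """Prefer the baseline-adjusted per-token figure; fall back to the raw one.

    mj_per_tok_adjusted is null (None after JSON parsing) when no baseline
    was measured, in which case the unadjusted figure is the best available.
    """
    value = result.get("mj_per_tok_adjusted")
    return value if value is not None else result["mj_per_tok_total"]

# In real use: result = json.load(open("results/gpt2_20260507_143208.json"))
# Illustrative record with only the fields used here:
result = {"mj_per_tok_total": 2.95, "mj_per_tok_adjusted": 2.52}
print(per_token_mj(result))  # 2.52
```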
Comparing Results Meaningfully
Raw numbers only make sense in context. Here is how to compare results fairly:
Use energy per token, not total energy. A run with 100 prompts will use roughly twice the energy of a run with 50 prompts. To compare two experiments, divide adjusted energy by total output tokens: this gives joules per token, which is comparable regardless of experiment size.
Match prompt counts and input lengths. Output token counts (and therefore energy) vary with input length. Comparing a run with 100 short prompts against a run with 100 long prompts is not a like-for-like comparison.
Note the hardware. A result from an A100 GPU is not directly comparable to a result from a consumer GPU. The result file records the GPU model in effective_config.
Check the dtype setting. Running at float16 (16-bit) typically uses less energy than float32 (32-bit). Results should use the same dtype to be comparable.
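The checks above can be scripted when comparing many result files. A sketch: the per-token division uses fields from the table earlier, but the gpu and dtype key names inside effective_config are assumptions here - check your own result files for the exact keys:

```python
def j_per_token(result: dict) -> float:
    """Adjusted joules per output token: comparable across experiment sizes."""
    return result["energy_adjusted_j"] / result["total_tokens"]

def comparable(a: dict, b: dict) -> bool:
    """Check two runs used the same GPU and dtype before comparing energy.

    The 'gpu' and 'dtype' keys are assumed names for illustration.
    """
    ca, cb = a["effective_config"], b["effective_config"]
    return ca.get("gpu") == cb.get("gpu") and ca.get("dtype") == cb.get("dtype")

# Illustrative result records (not real measurements):
run_a = {"energy_adjusted_j": 723.0, "total_tokens": 30576,
         "effective_config": {"gpu": "A100", "dtype": "float16"}}
run_b = {"energy_adjusted_j": 1100.0, "total_tokens": 61000,
         "effective_config": {"gpu": "A100", "dtype": "float16"}}

if comparable(run_a, run_b):
    print(j_per_token(run_a), "vs", j_per_token(run_b))  # J per output token
```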
Order of Magnitude Context
To put energy figures in context (approximate, for orientation only):
| Scenario | Approximate energy |
|---|---|
| One GPT-2 inference (single prompt) | ~1-10 joules |
| 100 GPT-2 inferences | ~100-1,000 joules |
| One large model (70B) inference | ~500-5,000 joules |
| Smartphone full charge | ~15,000 joules |
| Boiling 1 litre of water | ~330,000 joules |
These figures vary significantly with hardware, prompt length, and configuration. They are intended to give a sense of scale, not precise values.
Further Reading
- What We Measure and Why It Matters - conceptual background on the three metrics
- Quick start - step-by-step guide to running measurements
- Comparison with Other Benchmarks - how these results relate to MLPerf, AI Energy Score, and other benchmarks