Concepts in 2 minutes
LLenergyMeasure measures how implementation choices drive LLM inference efficiency. Six terms set up everything else.
Experiment vs study
An experiment is a single measurement point: one model, one engine, one dtype, one set of generation parameters. It produces one result file.
A study is a multi-experiment wrapper: a flat list of experiments (typically derived from a sweep specification) that share execution settings and run together as a batch.
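To make the relationship concrete, here is a minimal sketch of how a sweep specification expands into a flat experiment list. The field names (model, engine, dtype) follow the terms above, but the dictionary layout is hypothetical, not the actual config schema; see the reference linked below.

```python
from itertools import product

# Hypothetical sweep specification: each axis is a list of values.
sweep = {
    "model": ["meta-llama/Llama-3.1-8B"],
    "engine": ["transformers", "vllm"],
    "dtype": ["float16", "bfloat16"],
}

# A study is the flat cross-product: one experiment per combination,
# all sharing the same execution settings.
experiments = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
# 1 model x 2 engines x 2 dtypes -> 4 experiments, 4 result files
```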
For the full config reference, see Study config.
The four layers
Engine - the inference framework that loads and runs the model. LLenergyMeasure supports transformers, vllm, and tensorrt (with sglang planned). Each engine runs inside its own Docker container.
Sampler - the source of energy measurements. The default is nvml (NVIDIA Management Library); alternatives are zeus and codecarbon. Sampler choice is independent of engine choice.
Runner - the execution environment. local runs the experiment directly on the host; docker invokes the per-engine container.
Harness - the measurement coordinator (MeasurementHarness). It owns the measurement window, warmup, baseline subtraction, and result assembly. The harness is engine-agnostic; engines are thin inference plugins. A sketch of this flow follows below.
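The division of labour is easiest to see as code. The sketch below is illustrative, not LLenergyMeasure's actual implementation: it shows an NVML-style sampler (via the real pynvml package, whose energy counter requires a Volta-or-newer GPU) and a harness that measures an idle baseline, warms up the engine, then brackets the measurement window. The engine object and its generate method are hypothetical stand-ins for an engine plugin.

```python
import time
import pynvml

def gpu_energy_mj(handle) -> int:
    # Cumulative GPU energy since driver load, in millijoules (Volta+).
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

def measure(engine, prompts, warmup_runs=1, baseline_s=5.0):
    """Illustrative harness flow: baseline -> warmup -> measured window."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # 1. Idle baseline: derive average idle power from the energy counter.
    e0 = gpu_energy_mj(handle)
    time.sleep(baseline_s)
    baseline_w = (gpu_energy_mj(handle) - e0) / 1000.0 / baseline_s

    # 2. Warmup: run inference outside the measurement window so
    #    compilation and cache effects don't pollute the numbers.
    for _ in range(warmup_runs):
        engine.generate(prompts)  # hypothetical engine-plugin API

    # 3. Measurement window: read the energy counter around inference.
    e_start, t_start = gpu_energy_mj(handle), time.monotonic()
    outputs = engine.generate(prompts)
    duration = time.monotonic() - t_start
    total_j = (gpu_energy_mj(handle) - e_start) / 1000.0

    # 4. Baseline subtraction: attribute only the above-idle energy.
    net_j = total_j - baseline_w * duration

    pynvml.nvmlShutdown()
    return outputs, net_j, duration
```

Because the harness only ever calls the engine through its generate entry point, swapping transformers for vllm (or nvml for zeus) changes nothing in this flow.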
For the architecture behind these layers, see Architecture overview.
What we measure
Energy (joules) - total GPU energy during inference, with idle-power baseline subtracted where measured. The headline answer to "how much electricity did this configuration consume?"
Throughput (tokens/s) - output tokens per second across all prompts. The headline answer to "how fast is this configuration?"
FLOPs - estimated floating-point operations for the run. Reported as a validity check (largely invariant across implementations of the same model); not a headline efficiency metric.
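Given raw counters from a run, the headline metrics reduce to simple arithmetic. The numbers below are made up for illustration, and the FLOPs line uses the common ~2 x parameters x generated-tokens approximation for decoder inference, which may differ from the tool's own estimator.

```python
# Hypothetical raw numbers from one experiment.
total_energy_j = 4200.0    # GPU energy over the measurement window
baseline_power_w = 60.0    # measured idle power
duration_s = 30.0
output_tokens = 12_000
n_params = 8e9             # 8B-parameter model

# Energy: subtract the idle baseline from the measured total.
net_energy_j = total_energy_j - baseline_power_w * duration_s  # 2400 J

# Throughput: output tokens per second across all prompts.
throughput = output_tokens / duration_s                        # 400 tok/s

# FLOPs: rough 2 * params * generated-tokens decode estimate
# (a validity check, largely implementation-invariant).
flops = 2 * n_params * output_tokens                           # ~1.9e14
```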
For the full methodology, see What we measure.