Terminology reference

This is the canonical glossary for LLenergyMeasure: definitions for both the measurement domain (energy, GPU, LLM inference) and the project's own terminology (engines, runners, invariants). Cross-references use anchors within this page.

For canonical naming decisions internal to the engine-invariants pipeline (e.g. "invariant" not "rule", "validated corpus" not "vendored corpus"), see Contributing: miner pipeline and Reference: invariants corpus format.


A

Adjusted energy

Total measured GPU energy minus the baseline contribution: adjusted_energy_j = total_energy_j - baseline_power_w * total_inference_time_sec. Represents the energy specifically attributable to inference work, independent of what the GPU draws at idle. See mj_per_tok_adjusted in result.json for the per-token form.
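
The adjustment is a one-line subtraction. A minimal sketch (the helper function below is illustrative, not part of the LLenergyMeasure API):

```python
# Illustrative arithmetic only; the field names follow this glossary,
# but the helper itself is hypothetical.
def adjusted_energy_j(total_energy_j: float,
                      baseline_power_w: float,
                      total_inference_time_sec: float) -> float:
    """Energy attributable to inference, net of the GPU's idle draw."""
    return total_energy_j - baseline_power_w * total_inference_time_sec

# Example: 900 J measured over 10 s on a GPU idling at 30 W
# -> 900 - 30 * 10 = 600 J attributable to inference.
print(adjusted_energy_j(900.0, 30.0, 10.0))  # 600.0
```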

Attention backend

The kernel implementation used for the attention computation within an inference engine. Common values: flash_attn, flashinfer, sdpa, eager, flash_attention_2. Configured via engine-scoped parameters such as transformers.attn_implementation or vllm.attention.backend.

Attention backend is an industry-standard sub-system term (used by HuggingFace, vLLM, and TensorRT-LLM) and is distinct from the inference engine. Selecting the attention backend does not change which engine runs the model.

B

Backend

Historical term for either an inference framework or an energy source. Retired in LLenergyMeasure documentation. See engine (inference framework: transformers/vllm/tensorrt) or energy sampler (energy source: NVML/Zeus/CodeCarbon).

The phrase attention backend remains standard for kernel selection (FlashAttention, xformers, sdpa, eager) and is distinct from the inference engine. See attention backend.

Baseline

The idle power the GPU draws when not running inference - analogous to a car engine idling. Measured in Watts (W) before each experiment by running a lightweight container that polls NVML without loading a model. Subtracted from the total energy to produce adjusted energy. See also thermal stabilisation.

BF16

See dtype.


C

CodeCarbon

An optional energy measurement library that estimates total system power (CPU + DRAM + GPU) by combining hardware counters and software models. Available as the codecarbon extra: pip install llenergymeasure[codecarbon]. Less reliable inside Docker containers than NVML. Falls back to zero without raising an error if hardware access fails.


D

Decode / prefill

The two phases of autoregressive LLM inference:

  • Prefill - the model processes the entire input prompt in one forward pass, populating the KV-cache. Compute is proportional to prompt length. Power draw spikes here.
  • Decode - the model generates one output token at a time, attending over the KV-cache entries from prefill and prior decode steps. Compute per token is roughly constant; the KV-cache grows with each step.

Energy consumption is distributed unevenly between the two phases. For short prompts and long outputs, decode dominates. For long prompts and short outputs (e.g. classification tasks), prefill dominates. LLenergyMeasure measures aggregate energy across both phases.
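
A back-of-envelope compute model illustrates why the balance shifts. The sketch below uses the common approximation of roughly 2 FLOPs per parameter per token and ignores the attention term; the model size and token counts are illustrative assumptions, not LLenergyMeasure output:

```python
# Rule-of-thumb compute per phase: ~2 FLOPs per parameter per token,
# ignoring the attention term (which grows with sequence length).
# Model size and token counts are illustrative assumptions.
params = 7e9  # 7B-parameter model

def phase_flops(n_tokens: int) -> float:
    return 2 * params * n_tokens

# Long prompt, short output (classification-like): prefill dominates.
print(f"{phase_flops(2000):.1e} prefill vs {phase_flops(10):.1e} decode")

# Short prompt, long output (chat-like): decode dominates.
print(f"{phase_flops(50):.1e} prefill vs {phase_flops(1000):.1e} decode")
```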

Decoder

In LLenergyMeasure configuration, decoder: refers to the sampling strategy block (whether to sample, temperature, top-p, etc.) rather than the transformer decoder stack. See sampler.

Dtype

The numerical precision format used to store model weights and activations:

dtype            Bits   Relative VRAM   Notes
float32          32     1x              Full precision; rarely used for inference; not supported by vLLM or TRT-LLM
float16 (FP16)   16     0.5x            Standard inference precision on older hardware
bfloat16 (BF16)  16     0.5x            Preferred on Ampere+ (A100, 4090); wider dynamic range than FP16

Dtype affects energy primarily through memory bandwidth: lower precision means smaller weights, faster memory reads, and lower power during memory-bound decode. For quantisation (INT8, W4A16), see quantisation.
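
A weights-only estimate makes the VRAM effect concrete (illustrative arithmetic; activations and KV-cache come on top):

```python
# Back-of-envelope weight memory by dtype. Weights only; illustrative.
params = 7e9  # 7B-parameter model (assumed for the example)

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("bfloat16", 2)]:
    print(f"{dtype}: {params * bytes_per_param / 1e9:.0f} GB")
# float32: 28 GB, float16: 14 GB, bfloat16: 14 GB
```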


E

Energy

The total electrical work done, measured in Joules. Equal to average power multiplied by duration. Energy is the headline metric for comparing inference efficiency: it accounts for both power draw and run time. Do not confuse with power, which is a rate.

See also FLOPs vs FLOPS.

Engine

The ML library responsible for loading the model and running inference. LLenergyMeasure currently supports three engines:

Engine        Library                    Runner
transformers  HuggingFace Transformers   Docker (or local)
vllm          vLLM                       Docker only
tensorrt      TensorRT-LLM               Docker only

The engine is set via engine: in the YAML or --engine on the CLI. Each engine runs inside its own Docker image (except Transformers in local mode). See also runner.
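
A minimal sketch of engine selection in a study YAML, loaded with Python for illustration. Only the engine: key is defined by this entry; the model field and its value are placeholder assumptions:

```python
import yaml

config = yaml.safe_load("""
engine: vllm            # one of: transformers, vllm, tensorrt
model: meta-llama/Llama-3.1-8B-Instruct   # illustrative placeholder
""")
assert config["engine"] in {"transformers", "vllm", "tensorrt"}
# Equivalent on the CLI: --engine vllm
```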


F

FLOPs vs FLOPS

Two distinct terms:

  • FLOPs (Floating Point Operations; the lowercase "s" is a plural marker, not "per second") - a count of mathematical operations performed. Reported in GFLOPs or TFLOPs.
  • FLOPS (Floating Point Operations Per Second) - a throughput rate (hardware peak performance). Reported in TFLOPS.

LLenergyMeasure reports FLOPs estimates (operation counts), not FLOPS (hardware throughput rates). See What we measure: FLOPs for the estimation method.
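
The distinction in miniature (numbers are illustrative; LLenergyMeasure's actual estimation method is described in the linked page):

```python
# FLOPs is a count of work done; FLOPS divides that count by time.
flops_done = 7.2e12      # operations performed during a run (illustrative)
elapsed_sec = 0.1        # wall-clock duration of the run

rate = flops_done / elapsed_sec
print(f"{flops_done / 1e12:.1f} TFLOPs of work at {rate / 1e12:.0f} TFLOPS")
# -> 7.2 TFLOPs of work at 72 TFLOPS
```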


G

GPU contention

A condition where another process is using the GPU simultaneously with LLenergyMeasure, causing higher baseline power draw, latency variance, and sampling jitter. The recommended approach is to measure on a dedicated GPU with no other workloads active. See Measurement warnings for how contention manifests in practice.


I

INT8

See quantisation.


J

Joule

The SI unit of energy. One joule equals one watt applied for one second. GPU energy for a single inference is typically in the range of 0.1-100 J depending on model size, throughput, and run duration.

Conversions: 1 Wh = 3,600 J. 1 kWh = 3,600,000 J. A 300 W GPU drawing full power for 10 minutes consumes 180,000 J (50 Wh).

LLenergyMeasure reports energy in Joules (field total_energy_j) and millijoules per token (mj_per_tok_adjusted).


K

KV-cache

The key-value cache stores intermediate attention tensors from the prefill phase and prior decode steps, so the model does not recompute them for each new token. Reuse is the primary mechanism for efficient autoregressive generation.

KV-cache size grows linearly with sequence length and batch size. It is a major consumer of GPU VRAM for large models. vLLM uses paged attention to manage the KV-cache as a pool of fixed-size blocks, reducing fragmentation. vllm.engine.gpu_memory_utilization controls how much VRAM is pre-allocated for the KV-cache.

Energy relevance: a larger KV-cache increases VRAM traffic and therefore memory bandwidth energy. Prefix caching (reusing KV tensors across requests) reduces energy for repeated prompt prefixes.
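
A rule-of-thumb size calculation shows why the KV-cache matters for VRAM. The model shape below is an illustrative assumption (roughly Llama-2-7B-like); real models vary, especially with grouped-query attention:

```python
# Rule-of-thumb KV-cache size: 2 tensors (K and V) per layer, per token.
n_layers, n_kv_heads, head_dim = 32, 32, 128   # illustrative model shape
bytes_per_elem = 2                              # FP16/BF16 cache
seq_len, batch = 4096, 1

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"{kv_bytes / 1e9:.1f} GB")               # ~2.1 GB for one 4096-token sequence
```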


N

NVML

NVIDIA Management Library - the C library that exposes GPU hardware counters, including power draw, temperature, clock speeds, and throttle state. Accessed from Python via pynvml (nvidia-ml-py). NVML is the default energy sampler in LLenergyMeasure: it polls power every 100 ms and integrates over time to compute energy.

NVML power readings have approximately ±5% accuracy (hardware vendor spec). All energy figures reported by LLenergyMeasure carry this intrinsic uncertainty.
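
A minimal sketch of the poll-and-integrate approach using pynvml (not LLenergyMeasure's actual sampler implementation):

```python
# Read GPU power every 100 ms and integrate over time. Sketch only.
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

energy_j, interval = 0.0, 0.1
for _ in range(50):                       # ~5 s measurement window
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
    energy_j += power_w * interval        # rectangle-rule integration
    time.sleep(interval)

pynvml.nvmlShutdown()
print(f"{energy_j:.1f} J over ~5 s")
```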

See also RAPL, CodeCarbon.


P

Power

The instantaneous rate of energy consumption, measured in Watts (W). NVML reports GPU power in Watts at 100 ms intervals. Energy is power integrated over time. A GPU drawing 300 W for 2 seconds consumes 600 J.

High power draw indicates the GPU is compute-bound; low power with low throughput may indicate memory bandwidth saturation or throttling. See thermal stabilisation.

Prefill

See decode / prefill.


Q

Quantisation

Reducing the bit-width of model weights (and optionally activations) to save VRAM and increase throughput at the cost of a small accuracy loss. Common schemes supported by LLenergyMeasure:

Scheme      Bits                              Supported by
INT8        8                                 Transformers (BitsAndBytes), TRT-LLM
W4A16_AWQ   4 (weights) / 16 (activations)    Transformers (BitsAndBytes), vLLM, TRT-LLM
W4A16_GPTQ  4 (weights) / 16 (activations)    Transformers (BitsAndBytes), vLLM, TRT-LLM
W8A16       8 (weights) / 16 (activations)    TRT-LLM
FP8         8 (float)                         vLLM (KV cache), TRT-LLM (SM >= 8.9 only)

FP8 requires SM >= 8.9 (Ada Lovelace or Hopper). A100 (SM 8.0) does not support native FP8 inference.
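
One way to check eligibility before requesting FP8 is PyTorch's device-capability query. This is a sketch, not how LLenergyMeasure performs the check:

```python
import torch

# SM 8.9 = Ada Lovelace, 9.0 = Hopper; tuple comparison handles both.
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (8, 9):
    print("FP8-capable GPU (Ada Lovelace / Hopper or newer)")
else:
    print(f"SM {major}.{minor}: no native FP8 (e.g. A100 is SM 8.0)")
```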

For full engine-specific quantisation options and invalid combinations, see Reference: engine configuration.


R

RAPL

Running Average Power Limit - an Intel CPU hardware counter that reports cumulative CPU and DRAM energy. Not currently used by LLenergyMeasure (which measures GPU energy via NVML), but relevant context when interpreting total system energy. RAPL is used by CodeCarbon to estimate CPU+DRAM contributions.

Runner

Two senses in this codebase: (1) llem execution mode (local or docker); (2) GitHub Actions CI runner (self-hosted, ubuntu-latest). Both senses are used deliberately; the surrounding context makes clear which is meant.

llem runner - the execution context that wraps the engine. LLenergyMeasure has two runner types:

  • local - runs the engine in the current Python process. Only supported for the Transformers engine. Suitable for development and testing without Docker.
  • docker - runs the engine inside a dedicated Docker container. Required for vLLM and TensorRT-LLM (Docker-only by design). Default for all engines in production use.

Set via runners: in the study YAML. Multi-engine studies without Docker are blocked at config load time.
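
A minimal sketch of the field in a study YAML. The exact schema is defined by the study config reference; the scalar form below is an assumption for illustration only:

```python
import yaml

study = yaml.safe_load("""
engine: transformers
runners: docker   # hypothetical scalar form; or: local (Transformers only)
""")
print(study["runners"])  # docker
```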

CI runner - the GitHub Actions compute host (e.g. self-hosted GPU runner, ubuntu-latest). Used in contributing and CI documentation only.


S

Sampler / decoder config

In LLenergyMeasure, decoder: in the YAML configures the generation sampling strategy: whether to sample (do_sample: true) or use greedy decoding, and parameters such as temperature, top_p, top_k, repetition_penalty. Sampling strategy has only a minor effect on energy per token: it changes the post-forward-pass token selection (filtering and sampling over the output logits), not the forward pass itself, which dominates the energy cost.

Not to be confused with the energy sampler.
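
A sketch of a decoder: block using the parameters named above. Values are illustrative; defaults come from the engine, not from this example:

```python
import yaml

decoder = yaml.safe_load("""
decoder:
  do_sample: true        # false -> greedy decoding
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1
""")["decoder"]
print(decoder["temperature"])  # 0.7
```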

Energy sampler

The component that measures GPU power during inference. Three samplers:

  • NVML (default) - polls GPU power via pynvml at 100 ms intervals
  • Zeus - uses the Zeus library for per-batch energy attribution (optional extra)
  • CodeCarbon - estimates total system power including CPU and DRAM (optional extra)

Configured via energy_sampler: in the YAML. Set to null for throughput-only mode (no energy measurement).

In older docs, energy samplers were sometimes called "energy backends"; the term is retired in favour of sampler.


T

Temperature (generation)

The decoder.temperature parameter controls the randomness of token sampling. Temperature = 1.0 samples from the raw model distribution; lower values concentrate probability on high-likelihood tokens; temperature = 0 is equivalent to greedy decoding. Energy is not materially affected by temperature because the same forward pass runs regardless.
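
The effect is confined to a rescaling of the logits before sampling, as this sketch shows (NumPy used for illustration; the three-token vocabulary is an assumption):

```python
import numpy as np

def sample_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature          # T < 1 sharpens, T > 1 flattens
    exp = np.exp(scaled - scaled.max())    # numerically stable softmax
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])
print(sample_probs(logits, 1.0))   # raw model distribution
print(sample_probs(logits, 0.5))   # more peaked; T -> 0 approaches greedy
```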

Thermal stabilisation / warmup

Two related concepts:

  • Warmup - running several inference passes before the measurement window begins, to bring the model's KV-cache and GPU compute units to steady state. LLenergyMeasure checks whether latency has stabilised using the coefficient of variation (CV) of successive prompt latencies.
  • Thermal floor wait - a configurable pause (warmup.thermal_floor_seconds, default 60 s) after the warmup prompts, during which the GPU idles and cools toward a stable baseline temperature before measurement starts.

Skipping warmup produces measurements from a cold-start state that are not representative of sustained inference. See Methodology: measurement warnings for the "warmup did not converge" warning.
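
The convergence idea in miniature: warmup is considered stable once the coefficient of variation of recent latencies falls below a threshold. The window and threshold below are illustrative assumptions, not LLenergyMeasure's defaults:

```python
import statistics

def has_converged(latencies_ms: list[float], cv_threshold: float = 0.05) -> bool:
    """CV = sample standard deviation / mean over a recent window."""
    mean = statistics.mean(latencies_ms)
    return statistics.stdev(latencies_ms) / mean < cv_threshold

print(has_converged([210, 180, 160, 155]))   # False: still settling
print(has_converged([151, 150, 152, 149]))   # True: steady state
```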

Throughput

Output tokens produced per second (output_tokens_per_sec in result.json). Distinct from latency (time to first token or time per request): throughput is a rate across the full batch. Higher throughput means more tokens are produced in the same wall-clock time.

For single-sequence inference at batch_size=1 (typical on consumer hardware), throughput and latency are closely related. At higher batch sizes (vLLM continuous batching), throughput can increase significantly while per-sequence latency rises.

Token

The basic unit of LLM input and output. A tokeniser (e.g. BPE, sentencepiece) maps text to integer token IDs. Roughly 1 token ≈ 0.75 English words, but this varies by language and tokeniser. Prompt length and output length are both measured in tokens.

Energy and throughput figures in LLenergyMeasure are normalised per token (mj_per_tok_adjusted, output_tokens_per_sec) to allow comparison across different workloads.


V

Validated corpus

The engine-invariants corpus after the validation-CI gate has replayed each invariant against the live library. Stored as invariants.validated.yaml. Not to be confused with the proposed corpus (invariants.proposed.yaml), which contains declared expectations before CI observation. See Reference: invariants corpus format for the full format specification.


W

Warmup

See thermal stabilisation / warmup.


Engine-invariants pipeline terminology

For terminology specific to the miner pipeline, validation gate, and corpus naming, see Contributing: miner pipeline and Reference: invariants corpus format. Those pages are the canonical definition source and must be consulted before introducing new names in code or documentation.

