Benchmarking energy consumption, throughput, and FLOPs in LLM inference
A Python framework for measuring what actually matters when deploying large language models. Deployment choices—parallelism, batching, precision, and inference backend—can induce 50×+ variation in energy-per-token for the same model. This tool quantifies that variation across multiple backends (PyTorch, vLLM, TensorRT).
GitHub Repository · Research Findings
Contents: Overview · Quick Start · Feature Evolution · Configuration · Architecture · Citation
| Category | Metrics |
|---|---|
| Energy | GPU energy (NVML), CPU energy (RAPL), RAM energy, total system (CodeCarbon), CO₂ emissions |
| Throughput | Tokens/second, latency/token, time to first token, batch throughput |
| Compute | FLOPs/token (measured via calflops), FLOPs (analytical), peak GPU memory, device utilisation |
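Per-token energy figures are derived from these raw totals; the arithmetic is a straightforward unit conversion. A minimal sketch (the helper name is illustrative, not from the tool):

```python
def energy_per_token_kwh(total_joules: float, total_tokens: int) -> float:
    """Convert a raw energy reading (Joules) into kWh per generated token.

    1 kWh = 3.6e6 J, so energy_per_token = (J / 3.6e6) / tokens.
    """
    if total_tokens <= 0:
        raise ValueError("token count must be positive")
    return (total_joules / 3.6e6) / total_tokens


# Example: 4 GPUs drawing ~250 W for 60 s while generating 4096 tokens
joules = 250.0 * 60.0 * 4                  # 60,000 J total
print(energy_per_token_kwh(joules, 4096))  # ~4.07e-6 kWh/token
```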
```
lem experiment <config.yaml>
lem results show <experiment_id>
```

The CLI (lem) supports grid searches over precision, batch size, parallelism, and backends for systematic deployment analysis. Any HuggingFace model works out of the box.
See the GitHub README for detailed installation and usage instructions.
Ground-up rewrite with modern patterns and production-grade backends:
- Backend-specific configuration nested under the backend name (e.g. vllm.gpu_memory_utilization)
- Docker workflow via setup.sh, named volumes, and multi-backend profiles

Deployment-focused release:
Quality assurance milestone:
User-friendly command-line interface:
Major refactoring establishing the modern codebase:
- Package renamed: llm-bench → llm-energy-measure → llenergymeasure (final, Jan 2026)
- CLI shortened to lem for convenience

Note: The package was further renamed to llenergymeasure in January 2026 with the introduction of multi-backend support.
Stable multi-model benchmarking validated on production hardware (4× A100-40GB):
Foundation release establishing the measurement pipeline:
Active development areas:
LLenergyMeasure supports three inference backends, each optimised for different deployment scenarios. All backends share the same configuration interface and measurement infrastructure, enabling direct performance comparison.
| Backend | Use Case | Strengths | Limitations |
|---|---|---|---|
| PyTorch | Research, prototyping, custom models | Native HuggingFace integration, tensor/pipeline parallelism, full model control | Lower throughput than production servers |
| vLLM | Production deployments, high-throughput serving | Continuous batching, PagedAttention, optimised KV cache, streaming support | Limited model family support vs HuggingFace |
| TensorRT | NVIDIA GPUs, latency-critical applications | Maximum single-query performance, kernel fusion, FP8 support | NVIDIA-only, longer compilation time |
The native PyTorch backend uses HuggingFace Transformers with Accelerate for distributed inference. Ideal for research and exploration.
Key features:
- Tensor parallelism via tp_plan="auto" for supported architectures (Llama, Mistral, Qwen, etc.)

Configuration example:
```yaml
backend: pytorch

pytorch:
  attention_implementation: flash_attention_2  # or: sdpa, eager
  torch_compile: false        # PyTorch 2.0 compilation
  use_cache: true             # KV cache
  assisted_generation: false  # Speculative decoding
```
When to use: Model exploration, architecture research, custom model modifications, or when working with models not yet supported by vLLM/TensorRT.
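As a rough illustration, the pytorch: block above maps onto a handful of HF Transformers keyword arguments. A hedged sketch of that mapping (the function is illustrative, not part of the tool, which resolves these via its validated config layer):

```python
def pytorch_backend_kwargs(cfg: dict) -> dict:
    """Translate the pytorch: config block into the keyword arguments
    one would hand to HF Transformers. Illustrative stand-in only."""
    return {
        # Transformers' from_pretrained argument is named attn_implementation
        "attn_implementation": cfg.get("attention_implementation", "sdpa"),
        "use_cache": cfg.get("use_cache", True),           # generate() kwarg
        "torch_compile": cfg.get("torch_compile", False),  # applied via torch.compile
        "assisted_generation": cfg.get("assisted_generation", False),
    }


print(pytorch_backend_kwargs({"attention_implementation": "flash_attention_2"}))
```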
The vLLM backend is a production-grade inference server optimised for high-throughput LLM serving with continuous batching and PagedAttention.
Key features:
Configuration example:
```yaml
backend: vllm

vllm:
  gpu_memory_utilization: 0.9  # GPU memory allocation (0.0–1.0)
  max_model_len: 4096          # Maximum sequence length
  enable_prefix_caching: true  # Cache common prompt prefixes
  enforce_eager: false         # true skips CUDA graph capture (debug)
  kv_cache_dtype: auto         # KV cache quantisation: auto, fp8, fp8_e5m2
  speculative_model: null      # Speculative decoding model
```
When to use: Production deployments, high-throughput serving (>10 QPS), batch inference workloads, or when maximising GPU utilisation is critical.
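For orientation, the vllm: block above corresponds closely to vLLM's LLM constructor arguments. A sketch under that assumption (vLLM and a GPU are required to actually build the engine; the helper name is illustrative):

```python
# The vllm: config block maps one-to-one onto LLM(...) keyword arguments.
vllm_cfg = {
    "gpu_memory_utilization": 0.9,
    "max_model_len": 4096,
    "enable_prefix_caching": True,
    "enforce_eager": False,
}


def build_engine(model: str, cfg: dict):
    """Construct a vLLM engine from the config block. The import is lazy
    so this sketch parses even without vLLM installed."""
    from vllm import LLM
    return LLM(model=model, **cfg)
```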
The TensorRT backend uses NVIDIA TensorRT-LLM for maximum single-query performance with aggressive kernel fusion and low-level optimisations.
Key features:
Configuration example:
```yaml
backend: tensorrt

tensorrt:
  max_batch_size: 8
  max_input_len: 1024
  max_output_len: 512
  enable_trt_overlap: true  # Overlap compute and data transfer
  kv_cache_free_gpu_mem_fraction: 0.9
```
When to use: Latency-critical applications, NVIDIA GPU deployments (A100/H100), maximum single-query performance, or when targeting specific NVIDIA architecture optimisations.
The multi-backend architecture enables controlled comparisons of inference efficiency across engines.
This design isolates backend-level implementation effects from model-level computational requirements, enabling rigorous efficiency analysis.
| Category | Parameter | Options |
|---|---|---|
| Backend | backend | pytorch, vllm, tensorrt |
| Precision | fp_precision | float32, float16, bfloat16 |
| Quantisation | load_in_4bit / load_in_8bit | bool (PyTorch); backend-specific for vLLM/TensorRT |
| Batching | batching.strategy | static, dynamic, sorted_static, sorted_dynamic |
| Parallelism | sharding.strategy | none, tensor_parallel, pipeline_parallel |
| Traffic | traffic_simulation.mode | constant, poisson |
| Decoder | decoder.preset | deterministic, standard, creative, factual |
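The poisson traffic mode draws inter-arrival gaps from an exponential distribution at the target QPS. A minimal sketch of how such arrival times can be generated (illustrative, not the tool's implementation):

```python
import random


def poisson_arrivals(target_qps: float, n_requests: int, seed: int = 0) -> list[float]:
    """Generate absolute arrival times (seconds) for a Poisson process:
    inter-arrival gaps are exponentially distributed with mean 1/QPS."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(target_qps)
        times.append(t)
    return times


times = poisson_arrivals(target_qps=10.0, n_requests=1000)
print(times[-1] / 1000)  # mean inter-arrival gap, ~0.1 s at 10 QPS
```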
Configs use YAML with backend-specific parameters nested under backend name (e.g., vllm.gpu_memory_utilization). The _extends directive enables inheritance—override only what changes across experiments.
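Conceptually, resolving _extends amounts to a recursive dictionary merge in which the child's keys win. A minimal sketch (illustrative, not the tool's loader):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base (override wins), mimicking
    how an _extends-style directive can be resolved."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


base = {"model": "llama-3.2-3B", "batching": {"batch_size": 16, "strategy": "static"}}
child = {"batching": {"batch_size": 32}}  # would carry _extends: base.yaml
print(deep_merge(base, child))
# batch_size is overridden to 32; strategy is kept from base
```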
The tool follows a configuration-driven, three-stage pipeline designed for reproducible distributed benchmarking.
┌────────────────────────────────────────────────────────────────────────────────┐
│ MEASUREMENT PIPELINE │
└────────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┴─────────────────────┐
▼ ▼
┌────────────────────────────────┐ ┌────────────────────────────────┐
│ 1. CONFIGURATION │ │ 2. EXECUTION │
│ ──────────────────────────── │ │ ──────────────────────────── │
│ • Model & precision │────────▶│ • HuggingFace Accelerate │
│ • Hardware sharding │ │ • Tensor/pipeline parallelism │
│ • Generation parameters │ │ • Barrier synchronisation │
│ • YAML inheritance │ │ • Per-process metric tracking │
└────────────────────────────────┘ └───────────────┬────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ GPU 0 │ │ GPU 1 │ │ GPU N │
│ ─────── │ │ ─────── │ │ ─────── │
│ Energy │ │ Energy │ │ Energy │
│ Tokens │ │ Tokens │ │ Tokens │
│ Memory │ │ Memory │ │ Memory │
│ FLOPs │ │ FLOPs │ │ FLOPs │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└─────────────────────┼─────────────────────┘
▼
┌────────────────────────────────┐
│ 3. AGGREGATION │
│ ──────────────────────────── │
│ • Late aggregation pattern │
│ • Raw per-GPU results saved │
│ • Flexible post-hoc analysis │
│ • CSV/JSON export │
└────────────────────────────────┘
Declarative YAML configuration with inheritance via _extends enables reproducible experiments without code changes.
Configuration Inheritance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONFIGURATION INHERITANCE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ base.yaml │ Base configuration with sensible defaults
│ ─────────────────── │
│ model: llama-3.2-3B │
│ precision: float16 │
│ batch_size: 16 │
│ num_gpus: 1 │
└──────────┬──────────┘
│
│ _extends: base.yaml
▼
┌─────────────────────┐ ┌─────────────────────┐
│ multi-gpu.yaml │ │ quantized.yaml │
│ ─────────────────── │ │ ─────────────────── │
│ num_gpus: 4 │ │ quantization: 4bit │
│ sharding: tensor_ │ │ precision: null │
│ parallel │ │ │
└──────────┬──────────┘ └─────────────────────┘
│
│ _extends: multi-gpu.yaml
▼
┌─────────────────────┐
│ experiment.yaml │ Final experiment: inherits all, overrides batch
│ ─────────────────── │
│ batch_size: 32 │
│ traffic: poisson │
└─────────────────────┘
Example Configuration:
```yaml
# === CORE ===
config_name: llama-3.2-3b-vllm-benchmark
model_name: meta-llama/Llama-3.2-3B
backend: vllm                # pytorch | vllm | tensorrt

# === PRECISION ===
fp_precision: float16        # float32 | float16 | bfloat16

# === QUANTIZATION (PyTorch) ===
quantization:
  load_in_4bit: false
  load_in_8bit: false

# === GPU SETUP ===
gpus: [0, 1, 2, 3]
num_processes: 4

# === BATCHING ===
batching:
  strategy: sorted_dynamic   # static | dynamic | sorted_static | sorted_dynamic
  batch_size: 32
  max_tokens_per_batch: 4096 # Token budget for dynamic strategies

# === TRAFFIC SIMULATION ===
traffic_simulation:
  enabled: true
  mode: poisson              # constant | poisson
  target_qps: 10.0

# === DECODER ===
decoder:
  preset: deterministic      # deterministic | standard | creative | factual
  max_new_tokens: 256

# === SHARDING (Multi-GPU) ===
sharding:
  strategy: tensor_parallel  # none | tensor_parallel | pipeline_parallel
  num_shards: 4

# === BACKEND-SPECIFIC CONFIG ===
vllm:
  gpu_memory_utilization: 0.9
  enable_prefix_caching: true
  max_model_len: 4096
```
All configuration is Pydantic-validated at load time—invalid configs fail fast with clear error messages. Backend-specific parameters are nested under the backend name.
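The fail-fast behaviour can be pictured as a set of checks that run before anything touches a GPU. A simplified stand-in for the Pydantic layer (function and constants are illustrative, not the tool's schemas):

```python
VALID_BACKENDS = {"pytorch", "vllm", "tensorrt"}
VALID_PRECISIONS = {"float32", "float16", "bfloat16"}


def validate_config(cfg: dict) -> list[str]:
    """Collect clear, human-readable errors before any model is loaded."""
    errors = []
    if cfg.get("backend") not in VALID_BACKENDS:
        errors.append(f"backend must be one of {sorted(VALID_BACKENDS)}")
    if cfg.get("fp_precision") not in VALID_PRECISIONS:
        errors.append(f"fp_precision must be one of {sorted(VALID_PRECISIONS)}")
    util = cfg.get("vllm", {}).get("gpu_memory_utilization", 0.9)
    if not 0.0 < util <= 1.0:
        errors.append("vllm.gpu_memory_utilization must be in (0, 1]")
    return errors


# A typo in the backend name fails immediately with a clear message
print(validate_config({"backend": "vlm", "fp_precision": "float16"}))
```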
The runner orchestrates multi-GPU inference via HuggingFace Accelerate with precise lifecycle management.
Execution Flow:
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXECUTION LIFECYCLE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. INITIALISATION │
│ ─────────────────────────────────────────────────────────────────────── │
│ • Load model onto GPU(s) with specified sharding strategy │
│ • Configure distributed backend (NCCL/Gloo) │
│ • Initialise CodeCarbon energy tracker │
│ • Set up per-process metric collectors │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 2. WARM-UP (results discarded) │
│ ─────────────────────────────────────────────────────────────────────── │
│ • 3 dummy forward passes │
│ • Triggers CUDA lazy initialisations │
│ • Stabilises GPU clock frequencies │
│ • Populates KV cache │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. MEASUREMENT │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ for each batch in dataloader: │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ barrier_sync() # Synchronise all GPUs │ │ │
│ │ │ start_batch_timer() │ │ │
│ │ │ outputs = model.generate(batch, **gen_config) │ │ │
│ │ │ stop_batch_timer() │ │ │
│ │ │ record_tokens(outputs) # Per-process counting │ │ │
│ │ │ record_memory() # Peak GPU memory │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ • Energy tracked continuously via CodeCarbon │
│ • Each GPU process maintains independent metrics │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 4. COLLECTION │
│ ─────────────────────────────────────────────────────────────────────── │
│ • Stop CodeCarbon tracker │
│ • Gather per-GPU metrics via distributed primitives │
│ • Compute FLOPs (see estimation pipeline below) │
│ • Save raw results to JSON (one file per GPU) │
└─────────────────────────────────────────────────────────────────────────────┘
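The measurement stage above reduces to a warm-up-then-time loop. A single-process sketch with a stub in place of model.generate (barrier synchronisation and energy tracking are omitted; names are illustrative):

```python
import time


def run_measurement(batches, generate, warmup: int = 3) -> dict:
    """Skeleton of the measurement loop: warm-up passes are discarded,
    then each batch is timed and token counts are recorded."""
    for _ in range(warmup):
        generate(batches[0])              # warm-up, results discarded
    tokens, elapsed = 0, 0.0
    for batch in batches:
        start = time.perf_counter()       # start_batch_timer()
        outputs = generate(batch)
        elapsed += time.perf_counter() - start
        tokens += sum(len(o) for o in outputs)  # record_tokens()
    return {"total_tokens": tokens,
            "tokens_per_second": tokens / max(elapsed, 1e-9)}


def echo(batch):
    """Stub generator: pretend each prompt doubles in length."""
    return [s * 2 for s in batch]


print(run_measurement([["ab", "cd"], ["efg"]], echo))
```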
The late aggregation pattern is central to the design philosophy: raw per-GPU results are preserved before any aggregation.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LATE AGGREGATION PATTERN │
└─────────────────────────────────────────────────────────────────────────────┘
Raw Results (preserved) Aggregated Results
───────────────────────── ─────────────────────────
┌─────────────────────┐
│ gpu_0_results.json │───┐
│ • energy: 0.0023 kWh│ │
│ • tokens: 1024 │ │
│ • memory: 12.4 GB │ │
└─────────────────────┘ │
│
┌─────────────────────┐ │ ┌─────────────────────────────────┐
│ gpu_1_results.json │───┼────▶│ experiment_summary.json │
│ • energy: 0.0021 kWh│ │ │ ─────────────────────────────── │
│ • tokens: 1024 │ │ │ • total_energy: 0.0089 kWh │
│ • memory: 11.8 GB │ │ │ • total_tokens: 4096 │
└─────────────────────┘ │ │ • tokens_per_second: 142.3 │
│ │ • energy_per_token: 2.17e-6 kWh │
┌─────────────────────┐ │ │ • peak_memory: 12.4 GB │
│ gpu_2_results.json │───┤ │ • flops_per_token: 1.2e9 │
│ • energy: 0.0024 kWh│ │ └─────────────────────────────────┘
│ • tokens: 1024 │ │
│ • memory: 12.1 GB │ │
└─────────────────────┘ │
│
┌─────────────────────┐ │
│ gpu_3_results.json │───┘
│ • energy: 0.0021 kWh│
│ • tokens: 1024 │
│ • memory: 11.9 GB │
└─────────────────────┘
Benefits:
─────────
✓ Debug anomalous GPU behaviour
✓ Re-aggregate for different analyses
✓ Full reproducibility from raw data
✓ Identify load imbalances
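The pattern can be sketched in a few lines: read the preserved per-GPU files, then derive the summary after the fact. Illustrative only; field names follow the diagram above:

```python
import json
import tempfile
from pathlib import Path


def aggregate(result_files: list[Path]) -> dict:
    """Late aggregation: the per-GPU JSON files stay on disk untouched,
    and the experiment summary is computed from them post hoc."""
    raws = [json.loads(p.read_text()) for p in result_files]
    energy = sum(r["energy"] for r in raws)
    tokens = sum(r["tokens"] for r in raws)
    return {
        "total_energy_kwh": energy,
        "total_tokens": tokens,
        "energy_per_token_kwh": energy / tokens,
        "peak_memory_gb": max(r["memory"] for r in raws),
    }


# Demo with two synthetic per-GPU result files
tmp = Path(tempfile.mkdtemp())
for i, (e, t, m) in enumerate([(0.0023, 1024, 12.4), (0.0021, 1024, 11.8)]):
    (tmp / f"gpu_{i}_results.json").write_text(
        json.dumps({"energy": e, "tokens": t, "memory": m}))
print(aggregate(sorted(tmp.glob("gpu_*_results.json"))))
```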
The tool collects three categories of metrics, each with multiple measurement sources.
┌─────────────────────────────────────────────────────────────────────────────┐
│ METRIC COLLECTION │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ ENERGY │ │ THROUGHPUT │ │ COMPUTE │
│ ───────────────────── │ │ ───────────────────── │ │ ───────────────────── │
│ │ │ │ │ │
│ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │ ┌───────────────────┐ │
│ │ GPU Energy (NVML) │ │ │ │ Tokens/second │ │ │ │ FLOPs/token │ │
│ │ Per-device Joules │ │ │ │ End-to-end rate │ │ │ │ (see pipeline) │ │
│ └───────────────────┘ │ │ └───────────────────┘ │ │ └───────────────────┘ │
│ │ │ │ │ │
│ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │ ┌───────────────────┐ │
│ │ CPU Energy (RAPL) │ │ │ │ Latency/token │ │ │ │ Peak GPU Memory │ │
│ │ Package + DRAM │ │ │ │ Mean, P50, P99 │ │ │ │ Per-device max │ │
│ └───────────────────┘ │ │ └───────────────────┘ │ │ └───────────────────┘ │
│ │ │ │ │ │
│ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │ ┌───────────────────┐ │
│ │ RAM Energy │ │ │ │ Time to First │ │ │ │ Device Util % │ │
│ │ System memory │ │ │ │ Token (TTFT) │ │ │ │ Compute + memory │ │
│ └───────────────────┘ │ │ └───────────────────┘ │ │ └───────────────────┘ │
│ │ │ │ │ │
│ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │ │
│ │ CO₂ Emissions │ │ │ │ Batch Throughput │ │ │ │
│ │ Grid carbon int. │ │ │ │ Requests/second │ │ │ │
│ └───────────────────┘ │ │ └───────────────────┘ │ │ │
│ │ │ │ │ │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
CodeCarbon Native timing calflops + fallbacks
FLOPs estimation uses a three-strategy fallback chain for robustness across different model architectures.
┌─────────────────────────────────────────────────────────────────────────────┐
│ FLOPs ESTIMATION PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Input: Model │
│ (architecture, params, seq_len) │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Strategy 1: calflops │
│ ───────────────────────────────── │
│ • Traces actual computation graph │
│ • Most accurate for supported │
│ architectures │
│ • Handles custom attention │
└──────────────────┬──────────────────┘
│
┌─────────────┴─────────────┐
│ │
✓ Success ✗ Failure
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────────────────────┐
│ Return FLOPs │ │ Strategy 2: Analytical │
└─────────────────┘ │ ───────────────────────────────── │
│ • Architecture-specific formulas │
│ • Transformer: 2 × params × tokens │
│ • Accounts for attention, FFN │
└──────────────────┬──────────────────┘
│
┌─────────────┴─────────────┐
│ │
✓ Success ✗ Failure
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────────────────────┐
│ Return FLOPs │ │ Strategy 3: Parameter-based │
└─────────────────┘ │ ───────────────────────────────── │
│ • Fallback: 2 × params × tokens │
│ • Works for any model │
│ • Less accurate but guaranteed │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────┐
│ Return FLOPs │
└─────────────────┘
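The fallback chain above is essentially try-in-order logic. A compact sketch with simplified stand-ins for the three strategies (the real calflops path traces the model's computation graph):

```python
def estimate_flops(model_info: dict) -> tuple[str, float]:
    """Try strategies in order of accuracy; return the first that succeeds."""
    def traced(info):
        # Stand-in for calflops tracing, which needs a live model object
        raise NotImplementedError("tracing unsupported in this sketch")

    def analytical(info):
        # Transformer forward pass: ~2 * params * tokens
        # (attention/FFN correction terms omitted in this sketch)
        return 2.0 * info["params"] * info["tokens"]

    def parameter_based(info):
        # Guaranteed fallback: same heuristic, works for any model
        return 2.0 * info["params"] * info["tokens"]

    for name, strategy in [("calflops", traced),
                           ("analytical", analytical),
                           ("parameter", parameter_based)]:
        try:
            return name, strategy(model_info)
        except (NotImplementedError, KeyError):
            continue
    raise RuntimeError("all FLOPs strategies failed")


print(estimate_flops({"params": 3e9, "tokens": 256}))
# falls through to the analytical strategy
```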
src/llenergymeasure/
├── cli/ # Typer CLI with subcommand modules
│ ├── experiment.py # Experiment execution commands
│ ├── campaign.py # Multi-config campaign orchestration
│ ├── config.py # Config validation and display
│ └── results.py # Results management and export
├── config/
│ ├── loader.py # YAML parsing with _extends inheritance
│ ├── validation.py # Pydantic schemas for all backends
│ ├── presets.py # Decoder presets (deterministic, creative, etc.)
│ └── introspection.py # Parameter provenance tracking
├── core/
│ ├── runner.py # Distributed inference orchestration
│ ├── energy.py # CodeCarbon integration, energy backends
│ ├── flops.py # Three-strategy FLOPs estimation
│ └── metrics.py # Throughput and latency collectors
├── backends/
│ ├── pytorch/ # Native PyTorch backend with TP/PP
│ ├── vllm/ # vLLM backend with continuous batching
│ └── tensorrt/ # TensorRT backend with kernel fusion
├── domain/
│ ├── config.py # ExperimentConfig, backend-specific configs
│ ├── results.py # InferenceResults, EnergyMetrics, etc.
│ └── enums.py # Precision, ShardingStrategy, BatchingMode, Backend
├── orchestration/
│ ├── orchestrator.py # ExperimentOrchestrator with DI
│ ├── campaign.py # Campaign orchestration for multi-config runs
│ ├── context.py # ExperimentContext lifecycle management
│ └── scheduler.py # Daemon mode, interval/time-based scheduling
└── results/
├── persistence.py # JSON/CSV save and load
├── aggregation.py # Late aggregation logic
└── export.py # Result formatting and export
Several components support pluggable strategies for flexible experimentation:
| Subsystem | Strategies | Purpose |
|---|---|---|
| Backend | pytorch, vllm, tensorrt | Production-grade inference engines with different optimisation profiles |
| Batching | static, dynamic, sorted_static, sorted_dynamic | MLPerf-aligned request aggregation with optional token budgets |
| Traffic | constant, poisson | Simulate production load patterns at configurable QPS |
| Sharding | none, tensor_parallel, pipeline_parallel | Distribute model layers across GPUs |
| FLOPs | calflops → analytical → parameter-based | Three-strategy fallback for robust estimation |
| Decoder | deterministic, standard, creative, factual | Preset sampling configurations |
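As one example of these pluggable strategies, sorted_dynamic batching can be pictured as sorting prompts by length and greedily packing them under a token budget, so neighbours have similar lengths and padding waste shrinks. An illustrative sketch (simplified relative to the tool):

```python
def sorted_dynamic_batches(prompt_lens: list[int],
                           max_tokens_per_batch: int) -> list[list[int]]:
    """Sort prompts by length, then greedily pack them into batches
    whose summed length stays under the token budget."""
    batches, current, budget = [], [], 0
    for n in sorted(prompt_lens):
        if current and budget + n > max_tokens_per_batch:
            batches.append(current)   # budget exceeded: flush batch
            current, budget = [], 0
        current.append(n)
        budget += n
    if current:
        batches.append(current)
    return batches


print(sorted_dynamic_batches([512, 30, 480, 25, 60], max_tokens_per_batch=600))
# → [[25, 30, 60, 480], [512]]
```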
This architecture enables measuring how backend choice, parallelism, batching, and precision interact—factors the research found can induce 4-6× variation in energy consumption within realistic deployment constraints, with total variation exceeding 50× across the full parameter space.
If you use this tool in research, please cite:
Baker, H. (2025). The Implementation Gap: Inducing Variation in LLM Inference-time Energy Efficiency for Fixed Computational Workloads. Master of Data Science for Public Policy thesis, Hertie School.
Last updated: January 2026