import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';
# Multi-engine implementation-parameter study
This is the flagship tutorial. We'll measure how four representative implementation choices - numerical precision, batching, attention backend, KV-cache reuse - affect the energy and efficiency profile of the same model across Transformers, vLLM, and TensorRT-LLM. By the end you'll have a structured result you can chart, and you'll know how to design your own implementation-parameter sweeps.
These four are illustrative, not exhaustive. Every parameter declared by an engine is exposed programmatically and can be swept; we picked four that exercise different layers of the stack. The point of the exercise is the workflow, not the parameter list.
The framing matters. This is the question llem is built to answer:
given a fixed open-source model, how do downstream implementation
choices shape its inference cost? Most adjacent tools optimise the
other axis - "given a fixed implementation, how do models compare?"
That's a useful question too, but it's not this one.
Compute time: ~30 minutes on a single A100-class GPU once Docker images are pulled, plus ~5 minutes of one-time TensorRT-LLM engine compilation on first run. The compiled engine is cached, so subsequent runs of the same TRT-LLM cell skip compilation.
## Prerequisites
- A working install of `llenergymeasure` - see How to install.
- Docker + NVIDIA Container Toolkit operational and `llem doctor` passing - see Docker setup.
- All three engine images either built locally or pullable from GHCR - see Contributing > Development for the build pattern.
- ~30 GB of free disk for caches (model weights + TRT-LLM compiled engines).
- You've completed the first-measurement tutorial, so the shape of `llem run` and `result.json` is familiar.
## Step 1 - Sketch the question
Before writing config, sketch what you expect. This is a research discipline, not a measurement step.
The four parameters in this tutorial are chosen because they exercise different layers of the inference stack and have plausible - but not pre-assumed - energy effects:
| Parameter | What it changes | Expected effect on energy |
|---|---|---|
| dtype (float16 vs bfloat16) | Per-element precision of weights/activations | Probably small for a 0.5B model; bf16 may shift error patterns without changing FLOPs |
| Batch size / max_num_seqs / max_batch_size | How many prompts share a kernel call | Larger batches → higher GPU utilisation → lower J/token (until VRAM saturates) |
| Attention backend (sdpa vs flash_attention_2; flash_attn vs flashinfer) | Kernel implementation of attention | Flash-attn-style backends typically reduce HBM traffic → lower energy |
| KV-cache reuse (vLLM prefix caching; TRT-LLM block reuse) | Whether prefix tokens are recomputed across overlapping prompts | Workload-dependent - large effect if prompts share prefix, near-zero otherwise |
We don't pre-assume the answers. The point of measurement is to find them. But writing them down before running is what separates "I ran a benchmark" from "I tested a hypothesis."
## Step 2 - Read the study config
The shipped config lives at `configs/tutorials/tutorial-multi-engine.yaml`. Walk through it section by section.
```yaml
study_name: tutorial-multi-engine

runners:
  transformers: docker
  vllm: docker
  tensorrt: docker
```
All three engines in Docker. runners is what pins each engine to its
isolated image, so the host doesn't need to import any engine. This is
the only correct way to compare engines side-by-side - without
isolation, library-version conflicts (e.g. transformers ↔ vllm pinned
versions) cross-contaminate the measurement.
:::caution Multi-engine studies require Docker
Running a multi-engine study without Docker raises a PreFlightError before any inference starts. Each engine needs its own isolated image to prevent library-version conflicts from affecting the measurements. See Docker setup if your environment is not yet configured.
:::
```yaml
task:
  model: Qwen/Qwen2.5-0.5B
  random_seed: 42
  dataset:
    source: aienergyscore
    n_prompts: 30
    order: interleaved
  max_input_tokens: 256
  max_output_tokens: 256
```
Same model + same prompts + same token budget across every cell of the sweep. This is what makes cross-cell comparison legitimate - only the varied parameters differ.
max_input_tokens and max_output_tokens control the FLOPs budget.
If you let prompt and generation lengths float, you can't separate
"this implementation is more efficient" from "this implementation
generated less text."
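As a quick back-of-envelope check of that budget (assuming every prompt is allowed to generate the full `max_output_tokens`, and that a result's `total_tokens` counts generated tokens - both assumptions here, not guarantees):

```python
# Rough per-cell token budget, under the assumptions stated above.
n_prompts = 30
max_output_tokens = 256
expected_output_tokens = n_prompts * max_output_tokens
print(expected_output_tokens)  # 7680 - matches total_tokens in the sample result in Step 4
```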
```yaml
measurement:
  energy_sampler: auto
  baseline:
    enabled: true
    duration_seconds: 30.0
  warmup:
    enabled: true
    n_warmup: 3
    thermal_floor_seconds: 30.0
```
energy_sampler: auto probes the host and picks the highest-fidelity
sampler available (NVML → Zeus → CodeCarbon, in that order). The
30-second baseline measures idle GPU power before each experiment so
the result file's Adjusted energy figure reflects only the inference
work - see How to interpret results.
The sweep section is where the implementation parameters live:
```yaml
sweep:
  # 1. Numerical precision - applies to all three engines.
  transformers.dtype: [float16, bfloat16]
  vllm.dtype: [float16, bfloat16]
  tensorrt.dtype: [float16, bfloat16]

  # 2. Batching strategy - engine-native parameter names.
  transformers.batch_size: [4, 16]
  vllm.engine.max_num_seqs: [64, 256]
  tensorrt.max_batch_size: [4, 16]

  # 3. Attention backend - measurable energy effect on prefill-heavy work.
  transformers.attn_implementation: [sdpa, flash_attention_2]
  vllm.attention.backend: [flash_attn, flashinfer]

  # 4. KV-cache reuse - affects steady-state throughput and energy.
  vllm.engine.enable_prefix_caching: [true, false]
  tensorrt.kv_cache_reuse:
    - {}
    - tensorrt.kv_cache_config.enable_block_reuse: true
      tensorrt.kv_cache_config.free_gpu_memory_fraction: 0.9
```
A subtle but important point: each axis is engine-scoped by its
key prefix (transformers., vllm., tensorrt.). When the sweep
expander processes an experiment cell whose engine is vllm, only
vllm.* axes are applied to it; the transformers.* and
tensorrt.* axes are skipped. This is what makes cross-engine sweeps
sensible - you don't end up with vllm.engine.max_num_seqs=64 mixed
into a Transformers experiment.
The tensorrt.kv_cache_reuse group illustrates a dependent group:
within-group entries are alternatives (unioned, not crossed), while
the group as a whole is crossed against other axes. The empty {}
is the baseline; the second entry sets two related fields together.
This is the right way to sweep parameters that travel in pairs.
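To make those two rules concrete, here is a minimal, self-contained sketch of engine-scoped filtering and dependent-group handling. It is not the library's actual sweep expander - the function name and the trimmed-down `axes` dict are hypothetical - but it implements the logic described above:

```python
from itertools import product

def expand_for_engine(engine: str, axes: dict[str, list]) -> list[dict]:
    """Cross only the axes whose key prefix matches this engine.

    Each axis value list holds alternatives; a dependent group is an axis
    whose alternatives are dicts (several fields that travel together).
    """
    scoped = {k: v for k, v in axes.items() if k.startswith(f"{engine}.")}
    cells = []
    for combo in product(*scoped.values()):
        cell: dict = {}
        for key, value in zip(scoped.keys(), combo):
            if isinstance(value, dict):   # dependent group: apply its fields together
                cell.update(value)
            else:                         # plain axis: one key, one value
                cell[key] = value
        cells.append(cell)
    return cells

# Hypothetical, trimmed-down version of the sweep section above.
axes = {
    "vllm.dtype": ["float16", "bfloat16"],
    "vllm.engine.enable_prefix_caching": [True, False],
    "tensorrt.dtype": ["float16", "bfloat16"],
    "tensorrt.kv_cache_reuse": [
        {},
        {"tensorrt.kv_cache_config.enable_block_reuse": True,
         "tensorrt.kv_cache_config.free_gpu_memory_fraction": 0.9},
    ],
}

print(len(expand_for_engine("vllm", axes)))      # 4 - tensorrt.* axes are skipped
print(len(expand_for_engine("tensorrt", axes)))  # 4 - the {} baseline counts as one alternative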
## Step 3 - Dry-run, then run
Before kicking off the real run, validate the config and see how many experiments will actually execute:
```bash
llem run configs/tutorials/tutorial-multi-engine.yaml --dry-run
```
Dry-run is a CLI-only path; the Python API does not currently expose an
equivalent flag on run_study.
The dry-run resolves the sweep, applies engine-scoped filtering, deduplicates equivalent cells, and prints a manifest. You should see something like:
```text
Study: tutorial-multi-engine
Resolved: 36 experiments (84 expanded → 36 after dedup)
Per-engine breakdown:
  transformers: 16 (dtype × batch_size × attn × cycles)
  vllm:         16 (dtype × max_num_seqs × attn × prefix_caching)
  tensorrt:      4 (dtype × max_batch_size × kv_cache_reuse - 1 cycle)
VRAM estimate (per engine): ~4 GB peak (Qwen2.5-0.5B in bf16)
Estimated wall-clock: 28 min (excluding TRT-LLM first-build ~5 min)
```
Sample output above; your numbers will differ depending on host resolution of the Cartesian product, dedup hit rate, and TRT-LLM compilation cache state.
If the resolved count and per-engine breakdown match your expectation, launch the real run:
<Tabs groupId="interface">
<TabItem value="cli" label="CLI">

```bash
llem run configs/tutorials/tutorial-multi-engine.yaml
```

</TabItem>
<TabItem value="python" label="Python">

```python
from llenergymeasure import run_study

study_result = run_study("configs/tutorials/tutorial-multi-engine.yaml")
print(f"Completed {study_result.summary.completed} experiments")
```

</TabItem>
</Tabs>
You'll see a progress indicator with experiment counters and the
running cell's identifier. Each result lands in
results/tutorial-multi-engine_<timestamp>/<NNN_cN_*>/result.json.
## Step 4 - Inspect the manifest and a single result
After the run completes, the study directory looks roughly like this:
```text
results/tutorial-multi-engine_2026-05-07T14-32-08/
├── manifest.json                         # study-level: timing, config, completion
├── 001_c0_qwen-transformers_a1b2c3.../   # one experiment cell
│   ├── result.json                       # all metrics + resolved config
│   ├── effective_config.json             # final config used (after expansion)
│   └── timeseries.parquet                # GPU power/thermal/memory samples
├── 002_c0_qwen-transformers_d4e5f6.../
└── ... (36 cells)
```
manifest.json is the study-level record. It contains the resolved
experiment list, study timing, completion status per cell, and the
effective study-level config. This is what you load when you want to
reason about the study as a whole rather than one cell.
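For a first, schema-agnostic look at the study-level record, you can load it directly - this sketch deliberately prints only top-level keys rather than assuming specific field names inside `manifest.json`:

```python
import json
from pathlib import Path

study_dir = Path("results/tutorial-multi-engine_2026-05-07T14-32-08")

# Load the study-level manifest and list its top-level keys before
# relying on any particular field.
with (study_dir / "manifest.json").open() as f:
    manifest = json.load(f)

print(sorted(manifest.keys()))
print(len(list(study_dir.glob("*/result.json"))), "result files on disk")
```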
A single result.json looks like (truncated for readability):
```json
{
  "experiment_id": "qwen-transformers-bf16-bs16-fa2-2026-05-07T14-32-08",
  "engine": "transformers",
  "model": "Qwen/Qwen2.5-0.5B",
  "total_tokens": 7680,
  "total_inference_time_sec": 9.8,
  "avg_tokens_per_second": 783.7,
  "total_energy_j": 891.4,
  "baseline_power_w": 12.3,
  "mj_per_tok_total": 116.1,
  "mj_per_tok_adjusted": 100.4,
  "total_flops": 1.18e+12,
  "flops_per_output_token": 1.54e+8,
  "energy_breakdown": { "...": "..." },
  "effective_config": { "...": "..." }
}
```
Sample numbers above; real values depend on hardware + prompt sample. The structure is stable.
The two energy-per-token figures are the headline:
- `mj_per_tok_total` - millijoules per output token, raw GPU energy
- `mj_per_tok_adjusted` - same, with idle baseline subtracted
For cross-cell comparison the adjusted figure is the right pick - it isolates inference work from the cost of having a GPU plugged in. The full reasoning is on the methodology page and the energy-measurement explanation.
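The relationship between the two figures can be reconstructed from the other fields in the sample result above - a sketch of the idea, not the library's exact computation (the methodology page has the authoritative definition):

```python
# Values copied from the sample result.json above.
total_energy_j = 891.4
baseline_power_w = 12.3
inference_time_s = 9.8
total_tokens = 7680

# Subtract the energy the idle GPU would have drawn over the same window,
# then convert joules per token to millijoules per token.
adjusted_energy_j = total_energy_j - baseline_power_w * inference_time_s
print(round(total_energy_j / total_tokens * 1000, 1))     # ~116.1 mJ/tok (total)
print(round(adjusted_energy_j / total_tokens * 1000, 1))  # ~100.4 mJ/tok (adjusted)
```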
## Step 5 - Compare across engines in Python
Loading and grouping results needs nothing beyond the standard library - each `result.json` is plain JSON. Drop this snippet into a Python file alongside your study directory:
```python
import json
from collections import defaultdict
from pathlib import Path

study_dir = Path("results/tutorial-multi-engine_2026-05-07T14-32-08")

# Load every result.json under the study.
results = []
for p in study_dir.glob("*/result.json"):
    with p.open() as f:
        results.append(json.load(f))

# Group by (engine, dtype) and average mJ/token (adjusted).
groups: dict[tuple[str, str], list[float]] = defaultdict(list)
for r in results:
    key = (r["engine"], r["effective_config"][r["engine"]]["dtype"])
    groups[key].append(r["mj_per_tok_adjusted"])

print(f"{'engine':<14} {'dtype':<10} {'mJ/tok (adj, mean)':>20} {'n':>4}")
for (engine, dtype), values in sorted(groups.items()):
    mean = sum(values) / len(values)
    print(f"{engine:<14} {dtype:<10} {mean:>20.2f} {len(values):>4}")
```
You should see something like:
```text
engine         dtype        mJ/tok (adj, mean)    n
tensorrt       bfloat16                   72.4    2
tensorrt       float16                    68.9    2
transformers   bfloat16                  102.3    8
transformers   float16                   108.7    8
vllm           bfloat16                   84.1    8
vllm           float16                    86.5    8
```
Sample numbers above; the ordering of magnitudes is what's directionally meaningful, not the precise values. On A100 you typically see TRT-LLM lowest, vLLM middle, Transformers highest for the per-token energy figure - but the gap between dtypes is often within noise for a 0.5B model.
To go further: group by (engine, batch_size_effective) and plot
mJ/token vs batch size, or pivot on attn_implementation /
attention.backend to compare attention kernels within each engine.
The result.json schema makes this kind of analysis a few lines of
Python.
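As one example of "going further", the same loading loop can be pivoted on the attention axis. The nested location of the backend inside `effective_config` differs per engine and is an assumption here rather than something this page documents, so the lookup is written defensively:

```python
from collections import defaultdict

# Reuses the `results` list built in the snippet above.
# Assumption: swept attention keys appear inside effective_config under the
# engine's own section; .get() keeps missing keys from raising.
attn_groups: dict[tuple[str, str], list[float]] = defaultdict(list)
for r in results:
    engine_cfg = r["effective_config"].get(r["engine"], {})
    backend = (
        engine_cfg.get("attn_implementation")              # transformers axis
        or engine_cfg.get("attention", {}).get("backend")  # vllm axis
        or "default"                                       # engines without a swept backend
    )
    attn_groups[(r["engine"], backend)].append(r["mj_per_tok_adjusted"])

for (engine, backend), values in sorted(attn_groups.items()):
    mean = sum(values) / len(values)
    print(f"{engine:<14} {backend:<18} {mean:8.2f} mJ/tok (n={len(values)})")
```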
## Step 6 - What you've learned and where to go next
You've now exercised the full llem workflow:
- Designing an implementation-parameter sweep with engine-scoped axes and dependent groups
- Resolving that sweep with `--dry-run` to validate the experiment count and VRAM estimates before running
- Running a multi-engine study with isolated Docker per engine
- Inspecting both the study-level manifest and individual result.json files
- Comparing results in Python using the universal output metrics (mJ/token, tokens/sec) that allow cross-engine comparison
The shape of every research workflow with llem looks like this. The
specifics - which parameters, which model, which task - change with
your question.
### Sister recipes (How-to)
- Run with vLLM (Docker) - single-engine recipe
- Run with TensorRT-LLM (Docker) - single-engine recipe
- Interpret results - field-by-field walkthrough of `result.json`
- Troubleshoot - when a cell fails or a metric looks wrong
### Reference
- Study config - full sweep / runner / measurement field listing
- CLI - every `llem run` flag (resume, fail-fast, etc.)
- Engine configuration - per-engine parameter spaces
### Conceptual depth (Explanation)
- Methodology - warmup, baseline, thermal management
- What we measure - energy / throughput / FLOPs
- Parameter discovery - how the engine-introspected parameter spaces are mined
- Comparison context - relationship to MLPerf, AI Energy Score