import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';
# Run an experiment with TensorRT-LLM (Docker)
TensorRT-LLM compiles models into optimised TensorRT engines, then runs inference against those engines. The first run compiles the engine (several minutes); subsequent runs with the same config load the cached engine and start inference immediately.
:::caution TensorRT-LLM requires Docker and a Turing-or-newer GPU

The engine runs inside a Docker container and requires an NVIDIA GPU with SM >= 7.5 (Turing or newer). FP8 quantisation requires SM >= 8.9 (Ada Lovelace or newer); on A100 (SM 8.0), use INT8 or W4A16_AWQ instead.

:::
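The SM gating above can be expressed as a small helper for a quick sanity check. This is a sketch, not part of `llenergymeasure`; the function name and mode strings are hypothetical:

```python
def supported_quant_modes(sm_major: int, sm_minor: int) -> list[str]:
    """Return quantisation modes usable on a GPU, per the SM gates above."""
    sm = sm_major * 10 + sm_minor
    if sm < 75:          # pre-Turing: not supported at all
        return []
    modes = ["fp16", "int8", "w4a16_awq"]
    if sm >= 89:         # Ada Lovelace or newer: FP8 also available
        modes.append("fp8")
    return modes

# A100 is SM 8.0: INT8 / W4A16_AWQ are fine, FP8 is not
print(supported_quant_modes(8, 0))
```

You can find your GPU's SM version in NVIDIA's CUDA compute-capability tables (e.g. H100 is 9.0, RTX 4090 is 8.9).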
## Prerequisites
- `llenergymeasure` installed (host-side orchestrator)
- Docker + NVIDIA Container Toolkit - see Docker setup
- TensorRT-LLM Docker image built or pullable from GHCR - see Contributing > Development
- NVIDIA GPU with SM >= 7.5 (Turing or newer; e.g. RTX 2000-series, A100, H100)
## 1. Create a config file
Minimal:

```yaml
engine: tensorrt

task:
  model: meta-llama/Llama-2-7b-hf
  dataset:
    source: aienergyscore
    n_prompts: 50

runners:
  tensorrt: docker
```
With explicit quantisation and engine caching:

```yaml
engine: tensorrt

task:
  model: meta-llama/Llama-2-7b-hf
  dataset:
    source: aienergyscore
    n_prompts: 50

runners:
  tensorrt: docker

tensorrt:
  max_batch_size: 8
  dtype: bfloat16
  quant_config:
    quant_algo: W4A16_AWQ
```
Equivalently, from Python without a config file:

```python
from llenergymeasure import run_experiment

result = run_experiment(
    model="meta-llama/Llama-2-7b-hf",
    engine="tensorrt",
    n_prompts=50,
)
print(result)
```
## 2. Run the experiment
<Tabs>
<TabItem value="cli" label="CLI">

```bash
llem run experiment.yaml
```

</TabItem>
<TabItem value="python" label="Python">

```python
from llenergymeasure import run_experiment

result = run_experiment("experiment.yaml")
```

</TabItem>
</Tabs>
What happens:

- Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility, SM-version check.
- The TensorRT-LLM Docker image is pulled on first run (`ghcr.io/henrycgbaker/llenergymeasure/tensorrt:v0.9.0`).
- The container compiles the TensorRT engine from the model weights. First run only - this takes several minutes. Progress is shown in the terminal.
- The compiled engine is cached on disk (`~/.cache/tensorrt_llm` inside the container, mounted from the host).
- Inference runs against the compiled engine.
- Results are printed to stdout and saved to `results/`.
:::tip Engine caching

The compiled engine is keyed to your config (model, dtype, `max_batch_size`, `tp_size`, etc.). Running the same experiment config again skips compilation and starts inference immediately. Changing any compile-time parameter triggers a new build.

:::
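One way to picture this keying: hash only the compile-time parameters, so runtime-only changes reuse the cached engine. This is an illustrative sketch, not the actual cache implementation, and the set of compile-time keys shown is an assumption:

```python
import hashlib
import json

def engine_cache_key(config: dict) -> str:
    """Hash compile-time parameters only; runtime-only settings don't force a rebuild."""
    compile_time = {k: config[k] for k in
                    ("model", "dtype", "max_batch_size", "tp_size") if k in config}
    blob = json.dumps(compile_time, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

a = engine_cache_key({"model": "meta-llama/Llama-2-7b-hf", "dtype": "bfloat16",
                      "max_batch_size": 8, "n_prompts": 50})
b = engine_cache_key({"model": "meta-llama/Llama-2-7b-hf", "dtype": "bfloat16",
                      "max_batch_size": 8, "n_prompts": 500})   # runtime-only change
print(a == b)  # same key: no rebuild
```

Changing `dtype` or `max_batch_size` would produce a different key and hence a fresh compilation.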
## HF pre-quantised checkpoints
TensorRT-LLM cannot load Hugging Face AWQ or GPTQ checkpoints directly:
the weight-key layout in HF's serialisation differs from what TensorRT-LLM
expects, so model load raises `KeyError: 'weight'`.
Pre-flight catches this and refuses the run with an actionable error. To
benchmark a pre-quantised checkpoint, convert it once with `trtllm-build`
and point the experiment at the build output:
```bash
trtllm-build \
  --checkpoint_dir <path-to-converted-checkpoint> \
  --output_dir /shared/engines/qwen2.5-7b-awq
```
```yaml
task:
  model: Qwen/Qwen2.5-7B-Instruct-AWQ  # original HF id, for tokenizer + metadata

tensorrt:
  engine_path: /shared/engines/qwen2.5-7b-awq
```
With `engine_path` set, the pre-flight gate is skipped because the engine
is already in TensorRT-LLM's native format.
## 3. Read the results
The output format matches other engines. The result file includes
`engine: tensorrt` and a `build_metadata` section with engine compilation
time, GPU architecture, and TRT-LLM version. See
How to interpret results for a field-by-field walkthrough.
## Related
- Tutorial: Your first measurement - start here if you've never run `llem`
- How to: run with vLLM - sister recipe for the vLLM engine
- Reference: engine configuration - every TensorRT-LLM-specific config field
- Reference: invariants (TensorRT) - mined parameter constraints