import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Run an experiment with TensorRT-LLM (Docker)

TensorRT-LLM compiles models into optimised TensorRT engines, then runs inference against those engines. The first run compiles the engine (several minutes); subsequent runs with the same config load the cached engine and start inference immediately.

:::caution TensorRT-LLM requires Docker and a Turing-or-newer GPU

The engine runs inside a Docker container and requires an NVIDIA GPU with SM >= 7.5 (Turing or newer). FP8 quantisation requires SM >= 8.9 (Ada Lovelace or newer); on A100 (SM 8.0), use INT8 or W4A16_AWQ instead.

:::
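
If you are unsure which SM version your GPU reports, a quick check with PyTorch (assuming it is installed on the host) looks like this:

```python
import torch

# Compute capability of GPU 0, e.g. (8, 0) for A100, (8, 9) for Ada Lovelace.
major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}.{minor}")

if (major, minor) < (7, 5):
    print("Below SM 7.5: TensorRT-LLM will not run on this GPU.")
elif (major, minor) < (8, 9):
    print("FP8 unavailable on this GPU; prefer INT8 or W4A16_AWQ.")
```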

## Prerequisites

- llenergymeasure installed (host-side orchestrator)
- Docker + NVIDIA Container Toolkit - see Docker setup (a quick verification sketch follows this list)
- TensorRT-LLM Docker image built or pullable from GHCR - see Contributing > Development
- NVIDIA GPU with SM >= 7.5 (Turing or newer; e.g. RTX 2000-series, A100, H100)
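
One way to confirm the Docker + NVIDIA Container Toolkit prerequisite from the host is to run nvidia-smi inside a throwaway CUDA container (a sketch; the CUDA image tag here is just an example):

```python
import subprocess

# If the NVIDIA Container Toolkit is wired up correctly, nvidia-smi inside a
# GPU-enabled container should list the host GPU(s). Any CUDA base image with
# nvidia-smi works; the tag below is an example only.
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "nvidia/cuda:12.4.1-base-ubuntu22.04", "nvidia-smi"],
    check=True,
)
```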

## 1. Create a config file

Minimal:

```yaml
engine: tensorrt
task:
  model: meta-llama/Llama-2-7b-hf
  dataset:
    source: aienergyscore
    n_prompts: 50
runners:
  tensorrt: docker
```

With explicit quantisation and engine caching:

```yaml
engine: tensorrt
task:
  model: meta-llama/Llama-2-7b-hf
  dataset:
    source: aienergyscore
    n_prompts: 50
runners:
  tensorrt: docker
tensorrt:
  max_batch_size: 8
  dtype: bfloat16
  quant_config:
    quant_algo: W4A16_AWQ
```
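
For intuition on why a 4-bit weight format (W4A16 keeps activations in 16-bit) is attractive on a single GPU, here is a back-of-the-envelope estimate of weight memory; illustrative arithmetic only, ignoring activations and the KV cache:

```python
params = 7e9  # Llama-2-7B parameter count (approximate)

bf16_gb  = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
w4a16_gb = params * 0.5 / 1e9  # 4 bits per weight   -> ~3.5 GB

print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
print(f"W4A16 weights: ~{w4a16_gb:.1f} GB")
```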

Alternatively, run the minimal experiment directly from the Python API:

```python
from llenergymeasure import run_experiment

result = run_experiment(
    model="meta-llama/Llama-2-7b-hf",
    engine="tensorrt",
    n_prompts=50,
)
print(result)
```

## 2. Run the experiment

```bash
llem run experiment.yaml
```

Or from the Python API:

```python
from llenergymeasure import run_experiment

result = run_experiment("experiment.yaml")
```

What happens:

  1. Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility, SM-version check.
  2. The TensorRT-LLM Docker image is pulled on first run (ghcr.io/henrycgbaker/llenergymeasure/tensorrt:v0.9.0).
  3. The container compiles the TensorRT engine from the model weights. First run only - this takes several minutes. Progress is shown in the terminal.
  4. The compiled engine is cached on disk (~/.cache/tensorrt_llm inside the container, mounted from the host).
  5. Inference runs against the compiled engine.
  6. Results are printed to stdout and saved to results/.

:::tip Engine caching

The compiled engine is keyed to your config (model, dtype, max_batch_size, tp_size, etc.). Running the same experiment config again skips compilation and starts inference immediately. Changing any compile-time parameter triggers a new build.

:::
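
Conceptually, the cache behaves as if the compile-time parameters were hashed into a key, so identical parameters map to the same cached engine. The sketch below is purely illustrative and is not llenergymeasure's actual cache scheme:

```python
import hashlib
import json

def engine_cache_key(model: str, dtype: str, max_batch_size: int, tp_size: int) -> str:
    """Illustrative only: a stable hash over compile-time parameters.
    The real cache layout used by llenergymeasure/TensorRT-LLM may differ."""
    blob = json.dumps(
        {"model": model, "dtype": dtype,
         "max_batch_size": max_batch_size, "tp_size": tp_size},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# Same parameters -> same key -> the cached engine is reused.
print(engine_cache_key("meta-llama/Llama-2-7b-hf", "bfloat16", 8, 1))
```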

## HF pre-quantised checkpoints

TensorRT-LLM cannot load Hugging Face AWQ or GPTQ checkpoints directly: the weight-key layout in HF's serialisation differs from what TensorRT-LLM expects, so model load raises KeyError: 'weight'.

Pre-flight catches this and refuses the run with an actionable error. To benchmark a pre-quantised checkpoint, convert it once with trtllm-build and point the experiment at the build output:

```bash
trtllm-build \
  --checkpoint_dir <path-to-converted-checkpoint> \
  --output_dir /shared/engines/qwen2.5-7b-awq
```

Then point the experiment config at the built engine:

```yaml
task:
  model: Qwen/Qwen2.5-7B-Instruct-AWQ  # original HF id, for tokenizer + metadata
tensorrt:
  engine_path: /shared/engines/qwen2.5-7b-awq
```

With engine_path set, the pre-flight gate is skipped because the engine is already in TensorRT-LLM's native format.
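
Before launching, it can help to confirm the directory actually looks like a trtllm-build output. The expected layout below (a config.json plus rank*.engine files) is an assumption based on typical TensorRT-LLM builds; adjust for your build:

```python
from pathlib import Path

engine_dir = Path("/shared/engines/qwen2.5-7b-awq")

# Typical trtllm-build output: config.json plus one engine file per rank.
has_config  = (engine_dir / "config.json").is_file()
has_engines = any(engine_dir.glob("rank*.engine"))

if not (has_config and has_engines):
    raise SystemExit(f"{engine_dir} does not look like a built TensorRT-LLM engine")
print("Engine directory looks OK")
```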

## 3. Read the results

The output format matches other engines. The result file includes engine: tensorrt and a build_metadata section with engine compilation time, GPU architecture, and TRT-LLM version. See How to interpret results for a field-by-field walkthrough.
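
A minimal sketch of pulling the TensorRT-specific fields out of a result file, assuming results are written as JSON under results/ (check your actual output path and format):

```python
import json
from pathlib import Path

# Grab the most recently written result file under results/.
latest = max(Path("results").glob("**/*.json"), key=lambda p: p.stat().st_mtime)
result = json.loads(latest.read_text())

# Field names follow the description above; the exact schema may differ.
print(result["engine"])          # "tensorrt"
print(result["build_metadata"])  # compile time, GPU architecture, TRT-LLM version
```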