import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Run an experiment with vLLM (Docker)
This recipe runs a single measurement against the vLLM engine. Use it
when you want to measure inference under vLLM's continuous-batching
runtime rather than the HuggingFace `transformers` backend.
:::caution vLLM requires Docker
The vLLM engine runs inside a Docker container. Attempting to run vLLM
without Docker raises a `PreFlightError` during pre-flight checks. Ensure
Docker and the NVIDIA Container Toolkit are installed before proceeding.
:::
## Prerequisites
- `llenergymeasure` installed (host-side orchestrator)
- Docker + NVIDIA Container Toolkit - see Docker setup
- vLLM Docker image built or pullable from GHCR - see Contributing > Development
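
Before the first run, it's worth confirming that containers can see the GPU at all. A quick sanity check (the CUDA image tag below is only an example; any image that ships `nvidia-smi` works):

```bash
# Should print the host GPU table from inside a container; if this errors,
# the NVIDIA Container Toolkit is not set up correctly.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```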
## 1. Create a config file
<Tabs>
<TabItem value="yaml" label="YAML config">

Create `experiment.yaml`:

```yaml
engine: vllm
task:
  model: gpt2
  dataset:
    source: aienergyscore
    n_prompts: 50
runners:
  vllm: docker
```

</TabItem>
<TabItem value="python" label="Python API">

The Python API takes the same settings as keyword arguments and runs the
experiment directly, with no config file:

```python
from llenergymeasure import run_experiment

result = run_experiment(
    model="gpt2",
    engine="vllm",
    n_prompts=50,
)
print(result)
```

</TabItem>
</Tabs>
## 2. Run the experiment
<Tabs>
<TabItem value="cli" label="CLI">

```bash
llem run experiment.yaml
```

</TabItem>
<TabItem value="python" label="Python">

```python
from llenergymeasure import run_experiment

result = run_experiment("experiment.yaml")
```

</TabItem>
</Tabs>
What happens:
- Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility inside the container, CUDA/driver compatibility.
- The vLLM Docker image is pulled on first run (`ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0`).
- The container launches, runs the experiment, and streams results back.
- Results are printed to stdout and saved to `results/`.
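
Since the image pull happens lazily, you can fetch it ahead of time so the first measurement isn't dominated by the download; the tag is the one listed above:

```bash
docker pull ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0
```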
## 3. Read the results
The output format matches the Transformers track. The key difference is
`engine: vllm` in the experiment ID and result file. See
How to interpret results for the field-by-field walkthrough.
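
If `results/` accumulates runs from several engines, a plain-text search is a quick way to pick out the vLLM ones. This assumes the result files are text-based; this page doesn't specify their format:

```bash
# List result files that mention the vLLM engine.
grep -rl 'vllm' results/
```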
## Related
- Tutorial: Your first measurement - start here if you've never run `llem`
- How to: run with TensorRT-LLM - sister recipe for the TRT-LLM engine
- Reference: engine configuration - every vLLM-specific config field