
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

# Run an experiment with vLLM (Docker)

This recipe runs a single measurement against the vLLM engine. Use it when you want to measure inference under vLLM's continuous-batching runtime rather than HuggingFace transformers.

:::caution vLLM requires Docker

The vLLM engine runs inside a Docker container. Attempting to run vLLM without Docker raises a `PreFlightError` during pre-flight checks. Ensure Docker and the NVIDIA Container Toolkit are installed before proceeding.

:::
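If the check fails, the error is raised before any model is loaded. The snippet below sketches how that might be caught from the Python API; note that the import path of `PreFlightError` is an assumption, so check where your installed version actually exposes it.

```python
# Sketch: handle the pre-flight failure described above.
# The import path of PreFlightError is assumed, not confirmed.
from llenergymeasure import PreFlightError, run_experiment

try:
    result = run_experiment(model="gpt2", engine="vllm", n_prompts=50)
except PreFlightError as exc:
    # Raised when Docker or the NVIDIA Container Toolkit is unavailable.
    print(f"Pre-flight check failed: {exc}")
```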

## Prerequisites

- `llenergymeasure` installed, which provides the `llem` CLI and the `run_experiment` Python API.
- Docker and the NVIDIA Container Toolkit installed.
- An NVIDIA GPU visible to Docker, with a compatible CUDA driver.
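A quick way to confirm Docker can see the GPU before starting is to run `nvidia-smi` inside a throwaway CUDA container. The snippet below is a minimal sketch using only the standard library; the CUDA image tag is just an example and is not something this project requires.

```python
# Minimal check: Docker is installed and can expose the GPU to a container.
import subprocess

subprocess.run(["docker", "--version"], check=True)
subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "nvidia/cuda:12.4.1-base-ubuntu22.04",  # example image tag
        "nvidia-smi",
    ],
    check=True,
)
```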

## 1. Create a config file

<Tabs>
<TabItem value="yaml" label="YAML">

Create `experiment.yaml`:

```yaml
engine: vllm
task:
  model: gpt2
  dataset:
    source: aienergyscore
    n_prompts: 50
runners:
  vllm: docker
```

</TabItem>
<TabItem value="python" label="Python">

```python
from llenergymeasure import run_experiment

result = run_experiment(
    model="gpt2",
    engine="vllm",
    n_prompts=50,
)
print(result)
```

</TabItem>
</Tabs>
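If you want to sanity-check the YAML before handing it to the runner, a few lines of PyYAML are enough. This is only a convenience sketch; the key layout simply mirrors the config shown above and is not an API of the project.

```python
# Convenience sketch: load and spot-check experiment.yaml (requires PyYAML).
import yaml

with open("experiment.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["engine"] == "vllm"
assert cfg["runners"]["vllm"] == "docker"
print(cfg["task"]["model"], cfg["task"]["dataset"]["n_prompts"])
```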

## 2. Run the experiment

<Tabs>
<TabItem value="cli" label="CLI">

```bash
llem run experiment.yaml
```

</TabItem>
<TabItem value="python" label="Python">

```python
from llenergymeasure import run_experiment

result = run_experiment("experiment.yaml")
```

</TabItem>
</Tabs>

What happens:

  1. Pre-flight checks run: Docker CLI, NVIDIA Container Toolkit, GPU visibility inside container, CUDA/driver compatibility.
  2. The vLLM Docker image is pulled on first run (`ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0`); you can pre-pull it ahead of time, as sketched after this list.
  3. The container launches, runs the experiment, and streams results back.
  4. Results are printed to stdout and saved to results/.
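Pulling the image during the first run can take a while on slow connections. The sketch below pre-pulls it from Python; it is equivalent to running `docker pull` yourself and uses only the standard library.

```python
# Pre-pull the vLLM runner image so the first experiment doesn't wait on it.
import subprocess

IMAGE = "ghcr.io/henrycgbaker/llenergymeasure/vllm:v0.9.0"
subprocess.run(["docker", "pull", IMAGE], check=True)
```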

## 3. Read the results

The output format matches the Transformers track. The key difference is `engine: vllm` in the experiment ID and result file. See How to interpret results for the field-by-field walkthrough.
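To confirm that a saved result really came from the vLLM track, you can inspect the file written under `results/`. The sketch below assumes results are stored as JSON and expose an `engine` field; both the filename pattern and the schema are assumptions, so adjust it to whatever your version actually writes.

```python
# Assumption-heavy sketch: read the newest result file and check its engine.
import json
from pathlib import Path

latest = max(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)
data = json.loads(latest.read_text())
print(latest.name, "-> engine:", data.get("engine"))  # expect "vllm"
```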