# Engine extensibility
The inference stack evolves quickly: vLLM, TRT-LLM, and SGLang each ship multiple releases per quarter, and new engines appear regularly. Adding one to LLenergyMeasure should require as little bespoke code as possible. This page lists exactly what a contributor must produce and what the pipeline generates automatically.
For the underlying protocol contract that makes this possible, see Harness-plugin model. For how schemas and invariants stay current when a version bumps, see Auto-refresh pipeline.
## The contract
A new engine implements the `EnginePlugin` Protocol defined in `src/llenergymeasure/engines/protocol.py:35`. The protocol is `@runtime_checkable`, so any class that provides the required methods satisfies it without inheritance.
The six required methods are:
| Method | Responsibility |
|---|---|
| `load_model(config, on_substep)` | Load weights into GPU memory; return an opaque model object |
| `run_warmup_prompt(config, model, prompt)` | Run one warmup inference; return latency in ms (or `0.0` to use kernel-only warmup) |
| `run_inference(config, model, prompts)` | Run batch inference; return an `InferenceOutput` |
| `cleanup(model)` | Release GPU memory |
| `check_hardware(config)` | Return compatibility errors (empty list when compatible); must never raise or allocate |
| `capture_observed_params(config, model, output)` | Return a dict of effective engine/sampling params for observed-config tracking |
The harness calls these in a fixed order. The plugin never interacts with the energy sampler, FLOPs estimator, or result model.
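For orientation, a minimal sketch of the protocol's shape. The method names and responsibilities match the table above, but the type annotations here are assumptions, not the verbatim contents of `protocol.py`:

```python
# Sketch of the EnginePlugin shape; annotations are illustrative assumptions.
from typing import Any, Callable, Protocol, runtime_checkable


@runtime_checkable
class EnginePlugin(Protocol):
    def load_model(self, config: Any, on_substep: Callable[[str], None]) -> Any:
        """Load weights into GPU memory; return an opaque model object."""
        ...

    def run_warmup_prompt(self, config: Any, model: Any, prompt: str) -> float:
        """Run one warmup inference; return latency in ms (0.0 = kernel-only warmup)."""
        ...

    def run_inference(self, config: Any, model: Any, prompts: list[str]) -> Any:
        """Run batch inference; return an InferenceOutput."""
        ...

    def cleanup(self, model: Any) -> None:
        """Release GPU memory."""
        ...

    def check_hardware(self, config: Any) -> list[str]:
        """Return compatibility errors (empty when compatible); never raise or allocate."""
        ...

    def capture_observed_params(self, config: Any, model: Any, output: Any) -> dict[str, Any]:
        """Return effective engine/sampling params for observed-config tracking."""
        ...
```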
## What a contributor must produce
The following six items are required for a new engine. Items 1-4 are manual; items 5-6 are generated by the auto-refresh pipeline after the rest is in place.
### 1. Plugin class
Create `src/llenergymeasure/engines/<engine>/plugin.py` with a class implementing `EnginePlugin`. The three existing engines are concrete examples:

- `TransformersEngine` (`src/llenergymeasure/engines/transformers/plugin.py:23`) - CV-based warmup, batched `model.generate()`, HuggingFace weight loading.
- `VLLMEngine` (`src/llenergymeasure/engines/vllm/plugin.py:29`) - single-pass kernel warmup (return `0.0`), OpenAI-compatible server, Docker-only.
- `TensorRTEngine` (`src/llenergymeasure/engines/tensorrt/plugin.py:120`) - engine compilation outside the NVML window, TRT-LLM executor pattern.
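A hypothetical skeleton that satisfies the contract structurally; every engine call here is stubbed with plain Python, so nothing below is real inference code:

```python
# Hypothetical plugin skeleton; all engine interactions are stand-ins.
from typing import Any, Callable


class MyEnginePlugin:
    """Satisfies EnginePlugin structurally; no inheritance needed."""

    def load_model(self, config: Any, on_substep: Callable[[str], None]) -> Any:
        on_substep("loading weights")       # progress callback provided by the harness
        return {"handle": "opaque-model"}   # stand-in for the real engine handle

    def run_warmup_prompt(self, config: Any, model: Any, prompt: str) -> float:
        return 0.0                          # 0.0 opts into kernel-only warmup

    def run_inference(self, config: Any, model: Any, prompts: list[str]) -> Any:
        completions = [p.upper() for p in prompts]  # stand-in for the batch call
        return completions                  # real code returns an InferenceOutput

    def cleanup(self, model: Any) -> None:
        model.clear()                       # release the handle / GPU memory

    def check_hardware(self, config: Any) -> list[str]:
        return []                           # empty list means compatible; never raise

    def capture_observed_params(self, config: Any, model: Any, output: Any) -> dict[str, Any]:
        return {"warmup": "kernel-only"}    # effective params, illustrative only
```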
### 2. Dockerfile
Create `docker/Dockerfile.<engine>` for the per-engine container. The transformers engine shows the pattern: `docker/Dockerfile.transformers`.
Key requirements:

- Multi-stage build with a `runtime` target (used by CI for caching).
- Pin the engine library version via an `ARG`; the version is sourced from `engine_versions/<engine>.yaml` at build time and by Renovate for automated bumps.
- Install `llenergymeasure` with the relevant extras (`[vllm]`, `[tensorrt]`, etc.) so the plugin and sampler dependencies are present.
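A condensed sketch of that pattern, showing only the `runtime` target. The base image, placeholder package name, and install commands are assumptions, not copied from `docker/Dockerfile.transformers`:

```dockerfile
# Illustrative sketch only; base image and package names are assumptions.
ARG SOME_ENGINE_VERSION=0.0.0

# The real files are multi-stage; CI caches the `runtime` target.
FROM python:3.11-slim AS runtime
ARG SOME_ENGINE_VERSION
# Pin the engine library; the value is sourced from engine_versions/<engine>.yaml.
RUN pip install "some-engine==${SOME_ENGINE_VERSION}"
# Extras pull in the plugin and sampler dependencies.
RUN pip install "llenergymeasure[some-engine]"
```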
### 3. Engine declaration
Add the new engine to the `Engine` enum in `src/llenergymeasure/config/ssot.py:41`:
```python
class Engine(str, Enum):
    TRANSFORMERS = "transformers"
    VLLM = "vllm"
    TENSORRT = "tensorrt"
    SGLANG = "sglang"  # new
```
The `Engine` enum is the single source of truth for engine identifiers throughout the codebase. The CI matrix in `engine-pipeline.yml` derives the fan-out from this enum automatically.
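The fan-out can be sanity-checked locally; this assumes only the enum shown above:

```python
# List the engine identifiers the CI matrix fans out over.
from llenergymeasure.config.ssot import Engine

print([e.value for e in Engine])
# ['transformers', 'vllm', 'tensorrt', 'sglang'] once SGLANG is added
```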
### 4. Engine config model
Add a Pydantic config model in `src/llenergymeasure/config/engine_configs.py`.
Existing models (`TransformersConfig`, `VLLMConfig`, `TensorRTConfig`) show the shape: a top-level config class that composes sampling, scheduling, and engine-specific sub-models.

The model is what users put under the `engine_config:` key in their study YAML. Fields should mirror the native engine parameters; the schema discovery pipeline (item 6 below) will verify alignment.
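A minimal sketch of that shape, using the hypothetical SGLang model from the forecast below. Field names and sub-models are illustrative assumptions, not the real contents of `engine_configs.py`:

```python
# Illustrative config model; field names and sub-models are assumptions.
from pydantic import BaseModel, Field


class SGLangSampling(BaseModel):
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = Field(default=128, ge=1)


class SGLangConfig(BaseModel):
    """Composed like the existing models: engine knobs plus a sampling sub-model."""

    model: str
    tp_size: int = 1  # assumed mirror of a native SGLang parallelism knob
    sampling: SGLangSampling = Field(default_factory=SGLangSampling)
```

The matching study YAML fragment would then look something like:

```yaml
# Illustrative study YAML fragment for the model above.
engine: sglang
engine_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  tp_size: 2
  sampling:
    temperature: 0.7
    max_new_tokens: 256
```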
### 5. Invariants YAML (auto-generated)
`src/llenergymeasure/engines/<engine>/invariants.proposed.yaml` and `invariants.validated.yaml` are produced by the miner pipeline running inside the engine container. Do not hand-author these - run the pipeline or let CI generate them on the first engine-pipeline PR. See Auto-refresh pipeline and Contributing: miner pipeline.
### 6. Schema JSON (auto-generated)
`src/llenergymeasure/engines/<engine>/schema.discovered.json` is produced by the schema introspector running inside the engine container. Same policy as invariants - generated, not authored. The introspectors live in `scripts/engine_introspectors/`.
## What is automated
Once items 1-4 exist and a PR is opened, the engine-pipeline CI surface (`engine-pipeline.yml` + `_engine-invariants-cell.yml` + `_engine-schemas-cell.yml`) handles:
- Schema discovery - runs the engine introspector inside the container, writes `schema.discovered.json`.
- Invariant mining - runs the miner inside the container, produces `invariants.proposed.yaml`, validates against the rule corpus to produce `invariants.validated.yaml`.
- Generated documentation - regenerates `docs/user/generated/invariants-<engine>.md`, `curation-<engine>.md`, and `schema-<engine>.md` from the artefact files.
- Bot writeback - `llem-ci-bot` commits the regenerated files to the PR branch so they are never stale when a PR merges.
- Image build and cache - builds the engine image and pushes a cached layer to GHCR so subsequent runs start from a warm cache.
The weekly schedule trigger in `engine-pipeline.yml` also runs a no-cache drift-detection rebuild every Monday, so version drift is caught even without a PR.
## What stays manual
- The plugin class (`plugin.py`) - framework-specific inference code cannot be generated.
- The Dockerfile - base image selection, multi-stage structure, and library version pinning require human judgement.
- The Pydantic config model - field selection (which engine knobs to expose to users) is a product decision.
- Edge-case handling - hardware checks, graceful degradation, compatibility errors for unsupported configurations. These are engine-specific.
## Worked forecast: SGLang as the next engine
SGLang is the planned next engine. Given the current contract, the delivery checklist is:
Manual work:
- `src/llenergymeasure/engines/sglang/plugin.py` - an `SGLangEngine` class implementing `EnginePlugin`. SGLang uses a server model similar to vLLM, so `VLLMEngine` is the closest reference; return `0.0` from `run_warmup_prompt` to use kernel-only warmup.
- `docker/Dockerfile.sglang` - base image from SGLang's official release container; version pinned via `ARG SGLANG_VERSION`, sourced from `engine_versions/sglang.yaml` (a sketch of that file follows this list).
- `Engine.SGLANG = "sglang"` in `ssot.py` and an `SGLangConfig` Pydantic model in `engine_configs.py`.
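The version pin file is small; a plausible sketch, with key names assumed from the pinning pattern described above rather than taken from an existing `engine_versions/*.yaml`:

```yaml
# Hypothetical engine_versions/sglang.yaml; key names are assumptions.
engine: sglang
version: "0.0.0"  # placeholder; the real pin is bumped by Renovate
```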
Automated after the above:
- Invariants YAML - miner pipeline generates on first CI run.
- Schema JSON - introspector generates on first CI run.
- CI matrix fan-out - `engine-pipeline.yml` picks up `sglang` from the SSOT automatically; no new workflow files needed.
- Generated documentation - `invariants-sglang.md`, `curation-sglang.md`, and `schema-sglang.md` written by the bot.
The manual surface is three checklist items. Everything downstream is automated.
## Why this matters
Keeping measurement methodology current with upstream engine APIs requires that per-engine bespoke work stay small. The harness-plugin boundary achieves this: engine authors write inference code, not measurement code; methodology authors update the harness once, not three times.