# Engine extensibility
The inference stack evolves quickly: vLLM, TRT-LLM, and SGLang each ship multiple releases per quarter, and new engines appear regularly. Adding one to LLenergyMeasure should require as little bespoke code as possible. This page lists exactly what a contributor must produce and what the pipeline generates automatically.
For the underlying protocol contract that makes this possible, see Harness-plugin model. For how schemas and invariants stay current when a version bumps, see Auto-refresh pipeline.
## The contract
A new engine implements the `EnginePlugin` Protocol defined in `src/llenergymeasure/engines/protocol.py:35`. The protocol is `@runtime_checkable`, so any class that provides the required methods satisfies it without inheritance.
The six required methods are:
| Method | Responsibility |
|---|---|
| `load_model(config, on_substep)` | Load weights into GPU memory; return an opaque model object |
| `run_warmup_prompt(config, model, prompt)` | Run one warmup inference; return latency in ms (or `0.0` to use kernel-only warmup) |
| `run_inference(config, model, prompts)` | Run batch inference; return an `InferenceOutput` |
| `cleanup(model)` | Release GPU memory |
| `check_hardware(config)` | Return compatibility errors (empty list when compatible); must never raise or allocate |
| `capture_observed_params(config, model, output)` | Return a dict of effective engine/sampling params for observed-config tracking |
The harness calls these in a fixed order. The plugin never interacts with the energy sampler, FLOPs estimator, or result model.
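For orientation, a minimal sketch of the protocol's shape. The method names and responsibilities match the table above, but the type annotations here are assumptions, not the verbatim contents of `protocol.py`:

```python
# Sketch of the EnginePlugin shape; annotations are illustrative assumptions.
from typing import Any, Callable, Protocol, runtime_checkable


@runtime_checkable
class EnginePlugin(Protocol):
    def load_model(self, config: Any, on_substep: Callable[[str], None]) -> Any:
        """Load weights into GPU memory; return an opaque model object."""
        ...

    def run_warmup_prompt(self, config: Any, model: Any, prompt: str) -> float:
        """Run one warmup inference; return latency in ms (0.0 = kernel-only warmup)."""
        ...

    def run_inference(self, config: Any, model: Any, prompts: list[str]) -> Any:
        """Run batch inference; return an InferenceOutput."""
        ...

    def cleanup(self, model: Any) -> None:
        """Release GPU memory."""
        ...

    def check_hardware(self, config: Any) -> list[str]:
        """Return compatibility errors (empty when compatible); never raise or allocate."""
        ...

    def capture_observed_params(self, config: Any, model: Any, output: Any) -> dict[str, Any]:
        """Return effective engine/sampling params for observed-config tracking."""
        ...
```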
## What a contributor must produce
The following six items are required for a new engine. Items 1-4 are manual; items 5-6 are generated by the auto-refresh pipeline after the rest is in place.
### 1. Plugin class
Create `src/llenergymeasure/engines/<engine>/plugin.py` with a class implementing `EnginePlugin`. The three existing engines are concrete examples:

- `TransformersEngine` (`src/llenergymeasure/engines/transformers/plugin.py:23`) - CV-based warmup, batched `model.generate()`, HuggingFace weight loading.
- `VLLMEngine` (`src/llenergymeasure/engines/vllm/plugin.py:29`) - single-pass kernel warmup (return `0.0`), OpenAI-compatible server, Docker-only.
- `TensorRTEngine` (`src/llenergymeasure/engines/tensorrt/plugin.py:120`) - engine compilation outside the NVML window, TRT-LLM executor pattern.
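A hypothetical skeleton that satisfies the contract structurally; every engine call here is stubbed with plain Python, so nothing below is real inference code:

```python
# Hypothetical plugin skeleton; all engine interactions are stand-ins.
from typing import Any, Callable


class MyEnginePlugin:
    """Satisfies EnginePlugin structurally; no inheritance needed."""

    def load_model(self, config: Any, on_substep: Callable[[str], None]) -> Any:
        on_substep("loading weights")       # progress callback provided by the harness
        return {"handle": "opaque-model"}   # stand-in for the real engine handle

    def run_warmup_prompt(self, config: Any, model: Any, prompt: str) -> float:
        return 0.0                          # 0.0 opts into kernel-only warmup

    def run_inference(self, config: Any, model: Any, prompts: list[str]) -> Any:
        completions = [p.upper() for p in prompts]  # stand-in for the batch call
        return completions                  # real code returns an InferenceOutput

    def cleanup(self, model: Any) -> None:
        model.clear()                       # release the handle / GPU memory

    def check_hardware(self, config: Any) -> list[str]:
        return []                           # empty list means compatible; never raise

    def capture_observed_params(self, config: Any, model: Any, output: Any) -> dict[str, Any]:
        return {"warmup": "kernel-only"}    # effective params, illustrative only
```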
### 2. Dockerfile
Create `docker/Dockerfile.<engine>` for the per-engine container. The transformers engine shows the pattern: `docker/Dockerfile.transformers`.
Key requirements:

- Multi-stage build with a `runtime` target (used by CI for caching).
- Pin the engine library version via an `ARG`; the version is sourced from `engine_versions/<engine>.yaml` at build time and by Renovate for automated bumps.
- Install `llenergymeasure` with the relevant extras (`[vllm]`, `[tensorrt]`, etc.) so the plugin and sampler dependencies are present.
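A condensed sketch of that pattern, showing only the `runtime` target. The base image, placeholder package name, and install commands are assumptions, not copied from `docker/Dockerfile.transformers`:

```dockerfile
# Illustrative sketch only; base image and package names are assumptions.
ARG SOME_ENGINE_VERSION=0.0.0

# The real files are multi-stage; CI caches the `runtime` target.
FROM python:3.11-slim AS runtime
ARG SOME_ENGINE_VERSION
# Pin the engine library; the value is sourced from engine_versions/<engine>.yaml.
RUN pip install "some-engine==${SOME_ENGINE_VERSION}"
# Extras pull in the plugin and sampler dependencies.
RUN pip install "llenergymeasure[some-engine]"
```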
### 3. Engine declaration
Add the new engine to the `Engine` enum in `src/llenergymeasure/config/ssot.py:41`:
```python
class Engine(str, Enum):
    TRANSFORMERS = "transformers"
    VLLM = "vllm"
    TENSORRT = "tensorrt"
    SGLANG = "sglang"  # new
```
The `Engine` enum is the single source of truth for engine identifiers throughout the codebase. The CI matrix in `engine-pipeline.yml` derives the fan-out from this enum automatically.
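The fan-out can be sanity-checked locally; this assumes only the enum shown above:

```python
# List the engine identifiers the CI matrix fans out over.
from llenergymeasure.config.ssot import Engine

print([e.value for e in Engine])
# ['transformers', 'vllm', 'tensorrt', 'sglang'] once SGLANG is added
```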
### 4. Engine config model
Add a Pydantic config model in `src/llenergymeasure/config/engine_configs.py`.
Existing models (`TransformersConfig`, `VLLMConfig`, `TensorRTConfig`) show the shape: a top-level config class that composes sampling, scheduling, and engine-specific sub-models.

The model is what users put under the `engine_config:` key in their study YAML. Fields should mirror the native engine parameters; the schema discovery pipeline (item 6 below) will verify alignment.
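A minimal sketch of that shape, using the hypothetical SGLang model from the forecast below. Field names and sub-models are illustrative assumptions, not the real contents of `engine_configs.py`:

```python
# Illustrative config model; field names and sub-models are assumptions.
from pydantic import BaseModel, Field


class SGLangSampling(BaseModel):
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = Field(default=128, ge=1)


class SGLangConfig(BaseModel):
    """Composed like the existing models: engine knobs plus a sampling sub-model."""

    model: str
    tp_size: int = 1  # assumed mirror of a native SGLang parallelism knob
    sampling: SGLangSampling = Field(default_factory=SGLangSampling)
```

The matching study YAML fragment would then look something like:

```yaml
# Illustrative study YAML fragment for the model above.
engine: sglang
engine_config:
  model: meta-llama/Llama-3.1-8B-Instruct
  tp_size: 2
  sampling:
    temperature: 0.7
    max_new_tokens: 256
```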
### 5. Invariants YAML (auto-generated)
`src/llenergymeasure/engines/<engine>/invariants.proposed.yaml` and `invariants.validated.yaml` are produced by the miner pipeline running inside the engine container. Do not hand-author these - run the pipeline or let CI generate them on the first engine-pipeline PR. See Auto-refresh pipeline and Contributing: miner pipeline.
### 6. Schema JSON (auto-generated)
`src/llenergymeasure/engines/<engine>/schema.discovered.json` is produced by the schema introspector running inside the engine container. Same policy as invariants - generated, not authored. The introspectors live in `scripts/engine_introspectors/`.
## What is automated
Once items 1-4 exist and a PR is opened, the engine-pipeline CI surface (`engine-pipeline.yml` + `_engine-invariants-cell.yml` + `_engine-schemas-cell.yml`) handles:
- Schema discovery - runs the engine introspector inside the container, writes `schema.discovered.json`.
- Invariant mining - runs the miner inside the container, produces `invariants.proposed.yaml`, validates against the rule corpus to produce `invariants.validated.yaml`.
- Generated documentation - regenerates `docs/user/generated/invariants-<engine>.md`, `curation-<engine>.md`, and `schema-<engine>.md` from the artefact files.
- Bot writeback - `llem-ci-bot` commits the regenerated files to the PR branch so they are never stale when a PR merges.
- Image build and cache - builds the engine image and pushes a cached layer to GHCR so subsequent runs start from a warm cache.
The weekly schedule trigger in `engine-pipeline.yml` also runs a no-cache drift-detection rebuild every Monday, so version drift is caught even without a PR.
## What stays manual
- The plugin class (`plugin.py`) - framework-specific inference code cannot be generated.
- The Dockerfile - base image selection, multi-stage structure, and library version pinning require human judgement.
- The Pydantic config model - field selection (which engine knobs to expose to users) is a product decision.
- Edge-case handling - hardware checks, graceful degradation, compatibility errors for unsupported configurations. These are engine-specific.
## Worked forecast: SGLang as the next engine
SGLang is the planned next engine. Given the current contract, the delivery checklist is:
Manual work:
- `src/llenergymeasure/engines/sglang/plugin.py` - an `SGLangEngine` class implementing `EnginePlugin`. SGLang uses a server model similar to vLLM, so `VLLMEngine` is the closest reference; return `0.0` from `run_warmup_prompt` to use kernel-only warmup.
- `docker/Dockerfile.sglang` - base image from SGLang's official release container; version pinned via `ARG SGLANG_VERSION`, sourced from `engine_versions/sglang.yaml` (a sketch of that file follows this list).
- `Engine.SGLANG = "sglang"` in `ssot.py` and an `SGLangConfig` Pydantic model in `engine_configs.py`.
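The version pin file is small; a plausible sketch, with key names assumed from the pinning pattern described above rather than taken from an existing `engine_versions/*.yaml`:

```yaml
# Hypothetical engine_versions/sglang.yaml; key names are assumptions.
engine: sglang
version: "0.0.0"  # placeholder; the real pin is bumped by Renovate
```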
Automated after the above:
- Invariants YAML - miner pipeline generates on first CI run.
- Schema JSON - introspector generates on first CI run.
- CI matrix fan-out - `engine-pipeline.yml` picks up `sglang` from the SSOT automatically; no new workflow files needed.
- Generated documentation - `invariants-sglang.md`, `curation-sglang.md`, and `schema-sglang.md` written by the bot.
The manual surface is three checklist items. Everything downstream is automated.
## Why this matters
Keeping measurement methodology current with upstream engine APIs requires that per-engine bespoke work stay small. The harness-plugin boundary achieves this: engine authors write inference code, not measurement code; methodology authors update the harness once, not three times.