Engine extensibility

The inference stack evolves quickly: vLLM, TRT-LLM, and SGLang each ship multiple releases per quarter, and new engines appear regularly. Adding one to LLenergyMeasure should require as little bespoke code as possible. This page lists exactly what a contributor must produce and what the pipeline generates automatically.

For the underlying protocol contract that makes this possible, see Harness-plugin model. For how schemas and invariants stay current when a version bumps, see Auto-refresh pipeline.

The contract

A new engine implements the EnginePlugin Protocol defined in src/llenergymeasure/engines/protocol.py:35. The protocol is @runtime_checkable, so any class that provides the required methods satisfies it without inheritance.

The six required methods are:

  • load_model(config, on_substep) - load weights into GPU memory; return an opaque model object.
  • run_warmup_prompt(config, model, prompt) - run one warmup inference; return latency in ms (or 0.0 to use kernel-only warmup).
  • run_inference(config, model, prompts) - run batch inference; return an InferenceOutput.
  • cleanup(model) - release GPU memory.
  • check_hardware(config) - return compatibility errors (an empty list when compatible); must never raise or allocate.
  • capture_observed_params(config, model, output) - return a dict of effective engine/sampling params for observed-config tracking.

The harness calls these in a fixed order. The plugin never interacts with the energy sampler, FLOPs estimator, or result model.
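The fixed call order can be sketched as follows. This is a hypothetical outline, not the real harness: the actual orchestration in src/llenergymeasure lives elsewhere and also drives the energy sampler around run_inference.

```python
# Hypothetical sketch of the fixed order in which the harness invokes a
# plugin; function name and error handling are illustrative assumptions.
def run_experiment(plugin, config, prompts):
    errors = plugin.check_hardware(config)            # 1. compatibility gate
    if errors:
        raise RuntimeError("; ".join(errors))
    model = plugin.load_model(config, on_substep=lambda step: None)  # 2. load
    try:
        plugin.run_warmup_prompt(config, model, prompts[0])          # 3. warmup
        output = plugin.run_inference(config, model, prompts)        # 4. inference
        observed = plugin.capture_observed_params(config, model, output)  # 5. capture
        return output, observed
    finally:
        plugin.cleanup(model)                         # 6. always release GPU memory
```

Note that cleanup sits in a finally block: the model must be released even when inference fails, or subsequent experiments inherit a polluted GPU.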

What a contributor must produce

The following six items are required for a new engine. Items 1-4 are manual; items 5-6 are generated by the auto-refresh pipeline once items 1-4 are in place.

1. Plugin class

Create src/llenergymeasure/engines/<engine>/plugin.py with a class implementing EnginePlugin. The three existing engines (transformers, vllm, tensorrt) provide concrete examples under the same path pattern.
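A structural skeleton of such a class might look like the following. Everything here is a placeholder sketch: the class name, the on_substep callback signature, and the return values are assumptions, not the project's real inference code.

```python
# Hypothetical plugin skeleton; satisfies EnginePlugin structurally,
# no inheritance required (the protocol is @runtime_checkable).
from typing import Any


class ExampleEngine:
    def load_model(self, config: Any, on_substep) -> Any:
        on_substep("loading weights")   # progress callback (assumed signature)
        return {"handle": "opaque model object"}

    def run_warmup_prompt(self, config: Any, model: Any, prompt: str) -> float:
        return 0.0  # 0.0 opts into kernel-only warmup

    def run_inference(self, config: Any, model: Any, prompts: list) -> Any:
        # Stand-in for a real InferenceOutput
        return {"texts": [p.upper() for p in prompts]}

    def cleanup(self, model: Any) -> None:
        model.clear()  # real plugins free GPU memory here

    def check_hardware(self, config: Any) -> list:
        return []  # empty list means compatible; never raise here

    def capture_observed_params(self, config: Any, model: Any, output: Any) -> dict:
        return {"temperature": 0.0}
```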

2. Dockerfile

Create docker/Dockerfile.<engine> for the per-engine container. The transformers engine shows the pattern: docker/Dockerfile.transformers.

Key requirements:

  • Multi-stage build with a runtime target (used by CI for caching).
  • Pin the engine library version via an ARG; the version is sourced from engine_versions/<engine>.yaml at build time and by Renovate for automated bumps.
  • Install llenergymeasure with the relevant extras ([vllm], [tensorrt], etc.) so the plugin and sampler dependencies are present.
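Put together, a per-engine Dockerfile might be shaped like this. The base image, stage names, and package names below are assumptions for illustration, not the project's actual docker/Dockerfile.transformers.

```dockerfile
# Hypothetical sketch of docker/Dockerfile.<engine>.
# Version sourced from engine_versions/<engine>.yaml at build time
# (and bumped by Renovate).
ARG ENGINE_VERSION=0.4.0

FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS base
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*

# CI builds and caches the `runtime` target.
FROM base AS runtime
ARG ENGINE_VERSION
RUN pip install "some-engine==${ENGINE_VERSION}" \
    && pip install "llenergymeasure[some-engine]"
```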

3. Engine declaration

Add the new engine to the Engine enum in src/llenergymeasure/config/ssot.py:41:

class Engine(str, Enum):
    TRANSFORMERS = "transformers"
    VLLM = "vllm"
    TENSORRT = "tensorrt"
    SGLANG = "sglang"  # new

The Engine enum is the single source of truth for engine identifiers throughout the codebase. The CI matrix in engine-pipeline.yml derives the fan-out from this enum automatically.

4. Engine config model

Add a Pydantic config model in src/llenergymeasure/config/engine_configs.py. Existing models (TransformersConfig, VLLMConfig, TensorRTConfig) show the shape: a top-level config class that composes sampling, scheduling, and engine-specific sub-models.

The model is what users put under the engine_config: key in their study YAML. Fields should mirror the native engine parameters; the schema discovery pipeline (item 6 below) will verify alignment.
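The shape of such a model might look like the sketch below. The field names (temperature, top_p, max_new_tokens, dtype) are common engine knobs used for illustration; they are assumptions, not the project's actual schema.

```python
# Hypothetical engine config sketch in the style of engine_configs.py:
# a top-level class composing a sampling sub-model.
from pydantic import BaseModel, Field


class SamplingConfig(BaseModel):
    temperature: float = Field(0.0, ge=0.0)
    top_p: float = Field(1.0, gt=0.0, le=1.0)
    max_new_tokens: int = Field(128, ge=1)


class ExampleEngineConfig(BaseModel):
    model_id: str
    dtype: str = "bfloat16"
    sampling: SamplingConfig = SamplingConfig()
```

In a study YAML this would appear under engine_config: with the same field names, and Pydantic validation rejects out-of-range values before any GPU work starts.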

5. Invariants YAML (auto-generated)

src/llenergymeasure/engines/<engine>/invariants.proposed.yaml and invariants.validated.yaml are produced by the miner pipeline running inside the engine container. Do not hand-author these - run the pipeline or let CI generate them on the first engine-pipeline PR. See Auto-refresh pipeline and Contributing: miner pipeline.

6. Schema JSON (auto-generated)

src/llenergymeasure/engines/<engine>/schema.discovered.json is produced by the schema introspector running inside the engine container. Same policy as invariants - generated, not authored. The introspectors live in scripts/engine_introspectors/.
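The core idea behind an introspector can be sketched with the standard library: walk a callable's signature and emit JSON. The real introspectors in scripts/engine_introspectors/ walk the engine's actual API surface inside its container; this toy version only shows the principle.

```python
# Hypothetical introspector sketch: dump one callable's parameters to JSON.
import inspect
import json


def discover_schema(fn) -> str:
    params = {}
    for name, p in inspect.signature(fn).parameters.items():
        params[name] = {
            "default": None if p.default is inspect.Parameter.empty else p.default,
            "kind": p.kind.name,
        }
    return json.dumps({"callable": fn.__qualname__, "params": params}, indent=2)


def generate(prompt, temperature=0.0, max_new_tokens=128):
    """Stand-in for an engine's generation entry point."""


schema = discover_schema(generate)
```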

What is automated

Once items 1-4 exist and a PR is opened, the engine-pipeline CI surface (engine-pipeline.yml + _engine-invariants-cell.yml + _engine-schemas-cell.yml) handles:

  • Schema discovery - runs the engine introspector inside the container, writes schema.discovered.json.
  • Invariant mining - runs the miner inside the container, produces invariants.proposed.yaml, validates against the rule corpus to produce invariants.validated.yaml.
  • Generated documentation - regenerates docs/user/generated/invariants-<engine>.md, curation-<engine>.md, and schema-<engine>.md from the artefact files.
  • Bot writeback - llem-ci-bot commits the regenerated files to the PR branch so they are never stale when a PR merges.
  • Image build and cache - builds the engine image and pushes a cached layer to GHCR so subsequent runs start from a warm cache.

The weekly schedule trigger in engine-pipeline.yml also runs a no-cache drift detection rebuild every Monday, so version drift is caught even without a PR.

What stays manual

  • The plugin class (plugin.py) - framework-specific inference code cannot be generated.
  • The Dockerfile - base image selection, multi-stage structure, and library version pinning require human judgement.
  • The Pydantic config model - field selection (which engine knobs to expose to users) is a product decision.
  • Edge-case handling - hardware checks, graceful degradation, compatibility errors for unsupported configurations. These are engine-specific.
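The hardware-check contract (return errors, never raise or allocate) can be illustrated with a small sketch. The probe injection and field name below are assumptions; real plugins query the engine or driver directly.

```python
# Hypothetical check_hardware sketch: collect errors, never raise.
def check_hardware(config: dict, probe) -> list:
    """`probe` is an injected callable returning free GPU memory in GiB;
    injecting it keeps the check allocation-free and unit-testable."""
    errors = []
    try:
        free_gib = probe()
    except Exception as exc:  # the contract forbids raising
        return [f"could not query GPU: {exc}"]
    required = config.get("min_free_gib", 0)
    if free_gib < required:
        errors.append(f"need {required} GiB free, found {free_gib} GiB")
    return errors
```

Returning a list (rather than raising) lets the harness report every incompatibility at once and skip the configuration gracefully.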

Worked forecast: SGLang as the next engine

SGLang is the planned next engine. Given the current contract, the delivery checklist is:

Manual work:

  1. src/llenergymeasure/engines/sglang/plugin.py - SGLangEngine class implementing EnginePlugin. SGLang uses a server model similar to vLLM, so VLLMEngine is the closest reference; return 0.0 from run_warmup_prompt to use kernel-only warmup.
  2. docker/Dockerfile.sglang - base image from SGLang's official release container; version pinned via ARG SGLANG_VERSION; sources version from engine_versions/sglang.yaml.
  3. Engine.SGLANG = "sglang" in ssot.py and an SGLangConfig Pydantic model in engine_configs.py.

Automated after the above:

  1. Invariants YAML - miner pipeline generates on first CI run.
  2. Schema JSON - introspector generates on first CI run.
  3. CI matrix fan-out - engine-pipeline.yml picks up sglang from the SSOT automatically; no new workflow files needed.
  4. Generated documentation - invariants-sglang.md, curation-sglang.md, schema-sglang.md written by the bot.

The manual surface is two new files plus small edits to two existing modules. Everything downstream is automated.

Why this matters

Keeping measurement methodology current with upstream engine APIs requires that per-engine bespoke work stay small. The harness-plugin boundary achieves this: engine authors write inference code, not measurement code; methodology authors update the harness once, not three times.