
Engine introspection pipelines

Each engine in LLenergyMeasure exposes two complementary introspection pipelines: one that discovers the engine's typed parameter schema, and one that mines the engine's validator behaviour. Both pipelines share the same shape - probe, then introspect or mine, then write a deterministic artefact - but compute different things. Together they produce the full per-engine inventory: the typed surface plus the constraints that govern it.

Why both pipelines exist

An engine surfaces information about its parameters across two channels. The typed schema (Pydantic models, dataclass fields, msgspec structs) tells you which parameters exist, what types they accept, and what defaults they ship with. The validator behaviour (validator methods, _verify_args calls, conditional raises) tells you which combinations of parameter values are rejected, normalised, or warned about. Schema discovery extracts the first; invariant mining extracts the second.

Either pipeline alone would understate the engine. Without schema discovery the runtime cannot align user fields with the engine's actual parameter surface; without invariant mining it cannot reject invalid combinations before paying the cost of engine initialisation. Running both per engine, against the pinned upstream library version, is what makes the engine first-class as a measurement axis.
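A toy config class (hypothetical, purely for illustration) shows the two channels side by side: the typed fields are what schema discovery reads, and the validator method is what invariant mining observes.

```python
from dataclasses import dataclass, fields

# Hypothetical toy config, not a real engine class: it carries both channels.
@dataclass
class ToyConfig:
    num_beams: int = 1           # typed-schema channel: name, type, default
    num_beam_groups: int = 1

    def validate(self) -> None:  # validator channel: which combinations are rejected
        if self.num_beams % self.num_beam_groups != 0:
            raise ValueError("num_beams should be divisible by num_beam_groups")

# Schema discovery reads the typed surface:
schema = {f.name: (f.type, f.default) for f in fields(ToyConfig)}

# Invariant mining observes validator behaviour:
try:
    ToyConfig(num_beams=3, num_beam_groups=2).validate()
    rejected = None
except ValueError as exc:
    rejected = str(exc)
```

Neither channel subsumes the other: `schema` says nothing about the divisibility rule, and `rejected` says nothing about types or defaults.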

The shared shape

Both pipelines follow the same four-stage workflow. The probe gates the work: if the landmark is missing, downstream stages skip. The producer either introspects or mines. The validate stage exists for invariant mining only (schema discovery is deterministic by construction). The writeback stage emits one artefact per pipeline, committed back to the PR branch by the bot.

Per-stage comparison:

| Stage | Schema discovery | Invariant mining |
|---|---|---|
| Probe | Landmark check (engine + class symbols importable) | Landmark check (engine + class symbols importable) |
| Producer | Inspect typed APIs: inspect.signature, Pydantic model_json_schema(), dataclasses.fields(), msgspec.json.schema() | AST walk of validator methods + dynamic Cartesian probing + type-system lifting |
| Validate | Deterministic by construction; no validate stage | Replay each rule against the live library inside the engine container; classify outcomes |
| Output (per-engine) | engines/<engine>/schema.discovered.json | engines/<engine>/invariants.proposed.yaml + engines/<engine>/invariants.validated.yaml |
| Format spec (reference) | Schema discovered format | Invariants corpus format |
| Per-engine digest (auto-generated) | reference/engines/schema-<engine>.md, reference/engines/curation-<engine>.md | reference/engines/invariants-<engine>.md |

Both pipelines run inside the engine's Docker image. The host has no engine libraries (import transformers, import vllm, import tensorrt_llm all fail by design), so every engine's introspection must run in the matching container.

Schema discovery in depth

Schema discovery introspects the engine's native Python API surface to produce a typed inventory of all configurable parameters. The introspector imports the live library, walks its config classes, and emits a deterministic JSON envelope.

What the producer does

The introspector is engine-specific: each engine has a module under scripts/engine_introspectors/ that knows how to walk its own config surface. The shared envelope and helpers live in scripts/engine_introspectors/_common.py.
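A minimal sketch of what such an introspector does, using only stdlib inspection. The `toy_engine_config` callable and the `introspect_params` helper are hypothetical stand-ins; the real introspectors walk engine-specific config classes and emit the shared envelope from `_common.py`.

```python
import inspect
import json

def introspect_params(callable_obj):
    """Minimal introspector sketch: walk a callable's signature and emit
    one record per parameter (name, type, default)."""
    empty = inspect.Parameter.empty
    records = []
    for name, p in inspect.signature(callable_obj).parameters.items():
        records.append({
            "name": name,
            "type": None if p.annotation is empty
                    else getattr(p.annotation, "__name__", str(p.annotation)),
            "default": None if p.default is empty else p.default,
        })
    return records

# Stand-in for a real engine config constructor (hypothetical):
def toy_engine_config(max_tokens: int = 16, temperature: float = 1.0):
    ...

envelope = {"engine": "toy", "engine_params": introspect_params(toy_engine_config)}
print(json.dumps(envelope, sort_keys=True))
```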

Determinism

Schema discovery is deterministic by construction. The introspector reads the library's own type annotations and Pydantic schemas; the same library version always yields the same JSON. The LLENERGY_DISCOVERY_FROZEN_AT environment variable additionally pins the discovered_at timestamp to a stable anchor (typically the author date of the most recent commit touching any input path) so CI re-runs do not produce a fresh wallclock timestamp on every invocation.
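The timestamp-pinning rule can be sketched as a one-line preference (the helper name here is hypothetical; only the LLENERGY_DISCOVERY_FROZEN_AT variable is from the source):

```python
import os
from datetime import datetime, timezone

def discovery_timestamp() -> str:
    """Sketch of the frozen-timestamp rule: prefer the anchor pinned in
    LLENERGY_DISCOVERY_FROZEN_AT; otherwise fall back to wallclock UTC,
    which would change on every CI re-run."""
    frozen = os.environ.get("LLENERGY_DISCOVERY_FROZEN_AT")
    return frozen if frozen else datetime.now(timezone.utc).isoformat()
```

Pinning the variable to a commit-derived anchor makes two runs against the same inputs byte-identical.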

What the artefact contains

The artefact comprises a top-level envelope, two parameter sections (engine_params, sampling_params), and a discovery_limitations list documenting fields that introspection could not recover. The full reference is the schema discovered format.

What consumes it

  • scripts/check_pydantic_matches_discovered.py - the drift checker; flags Pydantic fields in engine_configs.py with no corresponding discovered entry.
  • scripts/generate_curation_doc.py, scripts/generate_schema_doc.py - the doc generators; build the docs/reference/engines/{schema,curation}-{engine}.md digests from the loaded schema.
  • The runtime parameter-discovery layer reads the schema at config-validation time. See parameter discovery.

Change classification

When discovery re-runs against a bumped library version, the diff is classified by scripts/diff_discovered_schemas.py:

| Change type | Classification | Example |
|---|---|---|
| Field added | safe | New enable_chunked_prefill parameter |
| Description updated | safe | Docstring clarification |
| Default changed | safe | gpu_memory_utilization: 0.9 -> 0.95 |
| Type widened | safe | int -> int \| None |
| Field removed | breaking | Deprecated parameter dropped |
| Type narrowed | breaking | int \| None -> int |
| Enum value removed | breaking | Quantisation mode dropped |

Metadata fields (discovered_at, engine_commit_sha, image_ref, base_image_ref) are excluded from classification because they change on every run.
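The classification table reduces to a small lookup. The change-type identifiers below are hypothetical (the real diff script may spell them differently); the safe/breaking split and the metadata exclusion list follow the table above.

```python
# Hypothetical change-type identifiers illustrating the classification table.
SAFE = {"field_added", "description_updated", "default_changed", "type_widened"}
BREAKING = {"field_removed", "type_narrowed", "enum_value_removed"}
METADATA_FIELDS = {"discovered_at", "engine_commit_sha", "image_ref", "base_image_ref"}

def classify(change_type, field=None):
    """Sketch of the diff classifier: metadata fields are excluded from
    classification, everything else maps to safe or breaking."""
    if field in METADATA_FIELDS:
        return None  # excluded: changes on every run
    if change_type in SAFE:
        return "safe"
    if change_type in BREAKING:
        return "breaking"
    raise ValueError(f"unknown change type: {change_type}")
```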

Invariant mining in depth

Invariant mining extracts validation rules from engine library source code by combining static AST analysis, dynamic combinatorial probing, and type-system lifting. The output is a corpus of invariants - one constraint per rule - that the runtime uses to reject invalid configs before engine initialisation.

Component overview

Three producers, then merge, then replay against the live library:

  • The static miner walks the AST of validator methods.
  • The dynamic miner instantiates config classes with combinatorial probe values and observes raise / no-raise patterns.
  • The lift modules extract constraints directly from type-system metadata (Pydantic FieldInfo, msgspec Meta, stdlib Literal).

Their outputs land in staging, then build_corpus.py merges and deduplicates by fingerprint. validate_invariants.py replays every rule against the live library inside the engine container; confirmed rules ship in the validated YAML, quarantined rules land in _staging/_failed_*.yaml.

Static miner

The static miner reads engine library source via inspect.getsource() plus ast.parse() and walks the AST of known validator methods. It does not call constructors or run the validator methods. The library is still imported (to get source file paths), but no config classes are instantiated.

Why AST walking is necessary: pure dynamic introspection cannot recover the shape of cross-field predicates. The dynamic miner sees the message "num_beams should be divisible by num_beam_groups" but cannot determine that the underlying check is num_beams % num_beam_groups != 0. The static miner reads the predicate structure directly from the AST.
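A stripped-down sketch of a ConditionalRaise-style detector shows how the predicate structure is recovered directly from source. The toy validator source and the helper are hypothetical; only the `if X: raise` pattern is from the source.

```python
import ast
import textwrap

SOURCE = textwrap.dedent("""
    def validate(self):
        if self.num_beams % self.num_beam_groups != 0:
            raise ValueError("num_beams should be divisible by num_beam_groups")
""")

def conditional_raises(source: str):
    """Sketch of a ConditionalRaise-style detector: yield (predicate, message)
    for every `if <pred>: raise Exc(msg)` body in the source."""
    rules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If) and len(node.body) == 1:
            stmt = node.body[0]
            if (isinstance(stmt, ast.Raise) and isinstance(stmt.exc, ast.Call)
                    and stmt.exc.args):
                msg = stmt.exc.args[0]
                rules.append((ast.unparse(node.test),
                              msg.value if isinstance(msg, ast.Constant) else None))
    return rules
```

The detector recovers "self.num_beams % self.num_beam_groups != 0" as a structured predicate, which dynamic observation of the error message alone cannot do.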

For each if body in a validator method, the miner runs five pattern detectors. Each targets a specific source pattern and emits a rule of a specific severity:

| Detector | Pattern matched | Emitted severity |
|---|---|---|
| ConditionalRaiseDetector | if X: raise SomeException(msg) | error |
| ConditionalSelfAssignDetector | if X: self.A = B (silent normalisation) | dormant |
| ConditionalWarningsWarnDetector | if X: warnings.warn(msg) | warn |
| ConditionalLoggerWarningDetector | if X: logger.warning(msg) | warn |
| MinorIssuesDictAssignDetector | HF-specific: if X: minor_issues[key] = msg | dormant |

Three filters guard against false positives: the predicate must reference a public field via self.<field>, self-assign targets must be public fields, and a representative kwargs_positive dict must be synthetically derivable from the predicate.

Static miner depth is fixed at 1: it walks one level of helper calls (for example WatermarkingConfig.validate, SynthIDTextWatermarkingConfig.validate) but does not trace through general function calls in the validator body. This avoids unbounded call-graph traversal while capturing the most common engine validation patterns.

Dynamic miner

The dynamic miner instantiates config classes with combinatorial probe values and observes raise / no-raise patterns. It then runs predicate inference on the resulting table of (kwargs, error_message) rows.

Small clusters (for example three fields, three values each) get full Cartesian coverage; large clusters fall back to Hypothesis's from_type value generator with a fixed seed. Hypothesis is used only as a deterministic value generator, not as a property-based test runner. The miner pipeline must be deterministic: the same library version plus miner code must produce the same corpus. Randomness would break Renovate-driven library-bump diffs.
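A full-Cartesian probe round over a toy config (hypothetical class; only the probe-then-observe shape is from the source) produces exactly the (kwargs, error_message) table that predicate inference consumes:

```python
import itertools
from dataclasses import dataclass

@dataclass
class ToyConfig:  # hypothetical probe target
    num_beams: int = 1
    num_beam_groups: int = 1
    def __post_init__(self):
        if self.num_beams % self.num_beam_groups != 0:
            raise ValueError("num_beams should be divisible by num_beam_groups")

GRID = {"num_beams": [1, 2, 3], "num_beam_groups": [1, 2, 3]}

rows = []
for values in itertools.product(*GRID.values()):
    kwargs = dict(zip(GRID.keys(), values))
    try:
        ToyConfig(**kwargs)
        rows.append((kwargs, None))          # no-raise row
    except ValueError as exc:
        rows.append((kwargs, str(exc)))      # raise row with captured message

# The error rows align exactly with num_beams % num_beam_groups != 0,
# which is what the cross-field divisibility template would infer.
error_rows = [k for k, msg in rows if msg]
```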

Given the probe-row table, the dynamic miner infers one rule per distinct error-message class using seven predicate templates (in order of preference):

| Template | Example | Fires when |
|---|---|---|
| Cross-field divisibility | a % b != 0 | error rows align with divisibility failure |
| Cross-field comparison | a > b | error rows align with comparison |
| Cross-field equality gate | a == V AND b == W | error rows correlate with combined field values |
| Type allowlist | type(a) not in {T1, T2} | error rows correlate with field type |
| Single-field range | a < 0 | error rows correlate with one field crossing a threshold |
| Single-field equality | a == V | error rows correlate with one field having a specific value |
| Value allowlist | a not in {v1, v2, ...} | error rows correlate with field value not in a set |

The dynamic miner errs toward recall: when multiple templates fit the evidence, it emits all plausible candidates. The validation-CI gate prunes false positives downstream.

Lift modules

The three lift modules extract constraints from type-system metadata without requiring probe rounds. They are independent stages that run alongside AST walking and probing.

| Type-system axis | Lift module | Engines using it |
|---|---|---|
| pydantic.BaseModel / pydantic.dataclasses | _pydantic_lift.py | vLLM (27 pydantic-dataclasses); TRT-LLM (TrtLlmArgs, including Literal-typed enum fields) |
| msgspec.Struct | _msgspec_lift.py | vLLM (SamplingParams) |
| stdlib @dataclass | _dataclass_lift.py | transformers (GenerationConfig, BitsAndBytesConfig); vLLM (EngineArgs, 175 fields); TRT-LLM (BuildConfig, QuantConfig) |

The Pydantic lift walks model_json_schema() and FieldInfo.metadata (Pydantic v2), emitting one rule per annotated-types constraint or Literal[...] allowlist found on a field. The msgspec lift walks msgspec.inspect.type_info() and per-field Constraints objects, mapping Meta(ge=, le=, ...) to the same operator vocabulary as the Pydantic lift. The dataclass lift walks dataclasses.fields() and extracts Literal[a, b, c] annotations - plain stdlib dataclasses carry no numeric-bound metadata, so it is limited to value-allowlist rules.
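The dataclass lift's Literal extraction can be sketched in a few lines. The config class here is a hypothetical stand-in; only the fields-plus-Literal mechanism is from the source.

```python
from dataclasses import dataclass, fields
from typing import Literal, get_args, get_origin

@dataclass
class ToyBuildConfig:  # hypothetical stand-in for an engine config class
    precision: Literal["fp16", "bf16", "fp8"] = "fp16"
    max_batch_size: int = 8   # no Literal, no metadata -> no rule liftable

def lift_allowlists(cls):
    """Sketch of a dataclass lift: emit one value-allowlist rule per
    Literal-annotated field. Plain stdlib dataclasses carry no numeric-bound
    metadata, so this is the only rule shape available."""
    rules = {}
    for f in fields(cls):
        if get_origin(f.type) is Literal:
            rules[f.name] = set(get_args(f.type))
    return rules
```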

Per-engine miner comparison

The three engines have structurally different config surfaces, which determines which miners each uses:

| Engine | Static miner | Dynamic miner | Lift modules |
|---|---|---|---|
| transformers | GenerationConfig.validate(), BitsAndBytesConfig.post_init(); ~1700 LoC walked | Cartesian cluster probing | dataclass_lift (GenerationConfig, BitsAndBytesConfig) |
| vLLM | SamplingParams._verify_args(); ~20 validator methods | Cartesian + Hypothesis supplement | pydantic_lift (27 vllm.config.* classes); msgspec_lift (SamplingParams); dataclass_lift (EngineArgs) |
| TRT-LLM | BaseLlmArgs.validate_*(); ~11 validator methods | skipped (constructor yields zero raises) | pydantic_lift (TrtLlmArgs); dataclass_lift (BuildConfig, QuantConfig) |

TRT-LLM has no dynamic miner because empirical probing of TrtLlmArgs(**kwargs) constructors produced zero raises: TRT-LLM performs construction-time validation in a much more permissive way than transformers or vLLM. Its constraints are primarily enforced in validator methods (covered by the static miner) and at engine build time (hardware-gated, not corpus rules).

Build corpus: merge and dedup

build_corpus.py is the orchestration entrypoint. It runs all miners, collects staging files, merges them, deduplicates by fingerprint, and calls the validation-CI gate.

The deduplication key is (engine, severity, match_fields). Two rules with the same fingerprint are treated as the same constraint discovered by two independent paths (cross-validation). The merger keeps one rule with the primary added_by source and records the secondary source in cross_validated_by.

When static and dynamic miners both emit a rule with the same fingerprint, fields are merged by source preference:

| Field | Source that wins |
|---|---|
| match.fields predicate | static miner (more specific operators) |
| message_template | dynamic miner (real library text) |
| observed_messages | dynamic miner (real captured emissions) |
| kwargs_positive / kwargs_negative | static miner (derived from conditional) |
| miner_source.line_at_scan | static miner (real source line) |
| references | union (all evidence preserved) |
| id | first source's id is canonical |
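A sketch of the merge step for a fingerprint collision. The rule-dict field names here are hypothetical simplifications of the corpus schema; the source preferences follow the table above.

```python
# Hypothetical flat rule dicts; the real corpus uses a richer YAML schema.
DYNAMIC_WINS = {"message_template", "observed_messages"}

def merge_rules(static_rule: dict, dynamic_rule: dict) -> dict:
    """Sketch of fingerprint-collision merging: keep one rule, take each
    field from its preferred source, union the evidence, and record the
    secondary source as cross-validation."""
    merged = dict(static_rule)  # static is primary; its id stays canonical
    for field in DYNAMIC_WINS:
        if field in dynamic_rule:
            merged[field] = dynamic_rule[field]
    merged["references"] = sorted(set(static_rule.get("references", []))
                                  | set(dynamic_rule.get("references", [])))
    merged["cross_validated_by"] = dynamic_rule.get("added_by")
    return merged
```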

Validation-CI gate

The validation-CI gate runs after merge. For every rule, it replays kwargs_positive and kwargs_negative against the live library inside the engine's Docker container, then checks three contracts:

  • positive_raises - CaptureBuffers.exception_type must not be None after running with kwargs_positive.
  • message_template_match - CaptureBuffers.exception_message must contain rule.message_template (the static fragment, with template variables removed).
  • negative_does_not_raise - running with kwargs_negative must produce a CaptureBuffers whose exception_type is None.
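The three contracts reduce to a short replay loop. This is a simplified sketch: `construct` stands in for instantiating the live config class inside the container, and plain booleans stand in for the CaptureBuffers fields.

```python
def replay(rule, construct):
    """Sketch of the validation-CI gate's three contracts; `construct` is
    a stand-in for building the live engine config inside its container."""
    outcomes = {}
    try:  # positive_raises: kwargs_positive must raise
        construct(**rule["kwargs_positive"])
        outcomes["positive_raises"] = False
        outcomes["message_template_match"] = False
    except Exception as exc:
        outcomes["positive_raises"] = True
        # message_template_match: static fragment must appear in the live message
        outcomes["message_template_match"] = rule["message_template"] in str(exc)
    try:  # negative_does_not_raise: kwargs_negative must construct cleanly
        construct(**rule["kwargs_negative"])
        outcomes["negative_does_not_raise"] = True
    except Exception:
        outcomes["negative_does_not_raise"] = False
    return outcomes
```

A rule ships in the validated YAML only when all three outcomes hold; any divergence quarantines it.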

Exit codes from validate_invariants.py:

  • 0 - all rules confirmed.
  • 1 - one or more divergences; validated YAML still written for diagnostic purposes.
  • 2 - hard error (corpus malformed, engine not importable).

The full format spec for the corpus YAMLs the pipeline produces is invariants corpus format.

Predicate-inference template coverage

The seven dynamic-miner templates were derived empirically from the transformers corpus. When the static miner encounters an AST predicate it cannot translate, it logs the dropped sub-clause (without failing). A monthly audit of the unparsed-predicate log drives empirical template expansion - templates are only added when a real rule shape appears three or more times.

The templates not adopted from Daikon's full library (linear arithmetic ternary z = ax + by + c, sortedness, sequence-equality) cover scientific-computing trace patterns not seen in engine config classes.

Renovate-driven refresh loop (parallel re-fire)

Library version bumps trigger both pipelines automatically. Renovate watches the engine SSOT (engine_versions/<engine>.yaml) plus the docker/Dockerfile.* files and opens a PR bumping the relevant version fields. The PR fans out to both cells in parallel; each cell probes, then runs its producer, then writes its artefact. The bot commits the combined artefacts back to the PR branch and posts a diff summary as a PR comment. A maintainer reviews and merges.

Cross-pipeline state lives on PR labels. The last cell to finish performs an atomic writeback covering both pipelines' artefacts plus the regenerated docs digests, in a single push. There is no separate summariser workflow.

When the bumped library version falls outside a miner's pinned envelope (miner_pins.{static|dynamic|discovery} in the SSOT), the producer raises MinerVersionMismatchError and CI fails. This is intentional: it forces a maintainer to update the miner against the new library version before the corpus is regenerated. The full structural CI mechanics, including the per-cell artefact contract and the human checkpoint after digest review, live in pipeline architecture.

Fail-loud import contract (shared across pipelines)

Both pipelines depend on the same fail-loud contract. Every miner and introspector module must resolve its version envelope from the engine SSOT and validate it at import time. This is a structural contract, not a guideline.

# Every *_miner.py must resolve its envelope from the engine's SSOT:
import importlib.metadata

from scripts.engine_miners._ssot import load_miner_pin

_envelope = load_miner_pin("transformers", "static")  # SpecifierSet

# And call this at import time (check_installed_version is the shared
# fail-loud helper; it raises MinerVersionMismatchError on mismatch):
check_installed_version(
    "transformers",
    importlib.metadata.version("transformers"),
    _envelope,
)

The envelope itself lives in engine_versions/{engine}.yaml under miner_pins.{static|dynamic|discovery} - one pin per producer role. There is no per-module TESTED_AGAINST_VERSIONS constant; Renovate updates the SSOT and every producer reads through load_miner_pin.

If the installed library version falls outside the envelope, the producer raises MinerVersionMismatchError - a hard CI failure. If an expected class or method is missing from the library source (for example, a class was renamed in a library refactor), it raises MinerLandmarkMissingError - also a hard CI failure.
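A self-contained sketch of the check itself, simplified to a [lower, upper) envelope over dotted versions; the real helper compares against a SpecifierSet loaded from the SSOT, and the exception class is reproduced here only for illustration.

```python
class MinerVersionMismatchError(RuntimeError):
    """Hard CI failure: installed library is outside the pinned envelope."""

def _key(version: str):
    # Simplified: handles only plain dotted numeric versions.
    return tuple(int(part) for part in version.split("."))

def check_installed_version(name: str, installed: str,
                            lower: str, upper: str) -> None:
    """Simplified sketch of the fail-loud check: no fallback, no silent
    skip; an out-of-envelope version raises immediately."""
    if not (_key(lower) <= _key(installed) < _key(upper)):
        raise MinerVersionMismatchError(
            f"{name} {installed} is outside the pinned envelope [{lower}, {upper})"
        )
```

Because the check runs at import time, an out-of-envelope library fails the whole producer before any mining starts, rather than producing a corpus mined by an untested miner.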

A previous extractor that swallowed ImportError and returned [] silently degraded into "no rules found for this engine", which masked a broken extractor. The fail-loud contract makes that impossible. The behaviour is pinned in place by _fixpoint_test.py, which synthesises one malformed rule per gate-soundness check (positive_raises, message_template_match, negative_does_not_raise) and asserts the validation-CI gate records a divergence for each. Removing any of the three checks fails the fixpoint test loudly.

See also