
Invariants corpus format

Format specification for the YAML invariants corpus: top-level envelope, per-rule fields, valid values, schema-version history, and an annotated example.

The corpus is the artefact emitted by the invariant-mining pipeline. For the conceptual treatment of how the pipeline produces it (and how it parallels schema discovery), see engine introspection pipelines. For the runtime side that consumes the corpus at config-validation time, see parameter discovery.


File locations

src/llenergymeasure/engines/{engine}/
├── invariants.proposed.yaml Maintainer-seeded corpus, post-mining
└── invariants.validated.yaml CI-validated overlay, post-validate-replay

src/llenergymeasure/engines/{engine}/_staging/ (gitignored, miner-only)
├── {engine}_static_miner.yaml Per-miner staging output (not committed)
├── {engine}_dynamic_miner.yaml
└── _failed_validation_{engine}.yaml Quarantined rules

src/llenergymeasure/config/engine_invariants/
└── loader.py Runtime consumer

The two corpus files form a lifecycle pair: the miner pipeline writes the proposed YAML, then validate_invariants.py replays each invariant inside the engine's Docker image and writes the validated YAML. The runtime loader overlays validated observations onto the proposed corpus, so consumers see CI-confirmed behaviour where available and the declared shape elsewhere.
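
The overlay step can be pictured as a per-rule merge. The following is a minimal sketch, not the loader's actual code: it assumes rules are matched by id and that a validated entry shallowly replaces the fields of the proposed entry. The function name overlay_corpus and the sample rule ids are hypothetical.

```python
from copy import deepcopy


def overlay_corpus(proposed: dict, validated: dict) -> dict:
    """Return a copy of the proposed corpus with validated fields laid on top.

    Assumption: rules are matched by `id`, and the merge is shallow
    (a validated top-level field replaces the proposed one wholesale).
    """
    merged = deepcopy(proposed)
    validated_by_id = {rule["id"]: rule for rule in validated.get("invariants", [])}
    for rule in merged.get("invariants", []):
        override = validated_by_id.get(rule["id"])
        if override is not None:
            rule.update(deepcopy(override))
    return merged


# Hypothetical two-rule corpus; only rule_a has a validated observation.
proposed = {
    "engine": "transformers",
    "invariants": [
        {"id": "rule_a", "expected_outcome": {"outcome": "warn"}},
        {"id": "rule_b", "expected_outcome": {"outcome": "error"}},
    ],
}
validated = {
    "engine": "transformers",
    "invariants": [
        {"id": "rule_a", "expected_outcome": {"outcome": "dormant_announced"}},
    ],
}
merged = overlay_corpus(proposed, validated)
```

Under these assumptions, rule_a carries the validated outcome while rule_b keeps its declared shape, and the proposed input is left unmodified.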


Top-level corpus envelope

schema_version: 1.0.0 # Major version must match loader's SUPPORTED_MAJOR_VERSION
engine: transformers # Engine name; must match a known engine key
engine_version: 4.56.0 # Library version the corpus was mined and validated against
mined_at: '2026-04-25T18:01:18Z' # ISO 8601 timestamp of last full mine run
invariants:
- ... # List of rule entries
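
As an illustration of the major-version rule on schema_version, here is a small sketch. It assumes SUPPORTED_MAJOR_VERSION is an integer equal to 1 and that the envelope has already been parsed into a dict; check_schema_version is a hypothetical helper name, not a function documented for loader.py.

```python
SUPPORTED_MAJOR_VERSION = 1  # assumed value, chosen to match schema_version 1.0.0


def check_schema_version(envelope: dict) -> int:
    """Raise ValueError unless the corpus major version matches the loader's."""
    # "1.0.0" is a string after YAML parsing; take the part before the first dot.
    major = int(str(envelope["schema_version"]).split(".", 1)[0])
    if major != SUPPORTED_MAJOR_VERSION:
        raise ValueError(
            f"unsupported corpus schema major version {major}; "
            f"loader supports {SUPPORTED_MAJOR_VERSION}"
        )
    return major


envelope = {"schema_version": "1.0.0", "engine": "transformers", "invariants": []}
major = check_schema_version(envelope)
```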

Rule schema: annotated example

Below is a complete rule from the transformers corpus with every field annotated.

- id: transformers_beam_search_num_beams_not_divisible_by_num_beam_groups
  # Unique identifier for this rule.
  # Convention: {engine}_{subject_field}_{condition_slug}
  # Used as the key in divergence reports and error messages.

  engine: transformers
  # Engine this rule applies to.
  # Must match the corpus envelope's engine field.

  library: transformers
  # Python package name (importlib.metadata.version uses this).
  # Matches engine for single-library engines;
  # may differ for engines that alias a library.

  invariant_under_test: "GenerationConfig.__init__ flags `num_beams` (num beams not divisible by num beam groups)"
  # Human-readable description of what library behaviour this rule captures.
  # Format: {NativeType}.{method} flags {field} ({condition})

  severity: error
  # One of: error | warn | dormant
  # error - engine raises; loader rejects before initialisation
  # warn - engine announces; loader warns the user
  # dormant - engine silently normalises; loader annotates the config

  native_type: transformers.GenerationConfig
  # The fully-qualified class name the rule's predicate applies to.
  # Used by the validation-CI gate to know which class to instantiate.

  miner_source:
    path: transformers/generation/configuration_utils.py
    # Relative path within the library's source tree.
    method: __init__
    # Method name where the AST detector found this rule.
    line_at_scan: 361
    # Line number in the source at mined_at time.
    # Will drift when the library is updated; used for human inspection only.

  match:
    engine: transformers
    # Must match rule.engine (redundant, for grep-ability).
    fields:
      transformers.sampling.num_beams:
        not_divisible_by: "@num_beam_groups"
        # Field paths are dotted, resolved against ExperimentConfig.
        # Operator: not_divisible_by - fires when a % b != 0.
        # @num_beam_groups is a @field_ref: resolved as a sibling field.

  kwargs_positive:
    num_beams: 2
    num_beam_groups: 3
    # kwargs that trigger the rule (should cause the engine to raise/warn/normalise).
    # Passed directly to the native_type constructor in the validation-CI gate.
    # 2 is not divisible by 3, so the rule fires.

  kwargs_negative:
    num_beams: 4
    num_beam_groups: 2
    # kwargs that do NOT trigger the rule (should pass cleanly).
    # 4 is divisible by 2, so the rule does not fire.

  expected_outcome:
    outcome: error
    # One of: dormant_silent | dormant_announced | warn | error | pass
    emission_channel: none
    # How the engine signals the issue.
    # One of: warnings_warn | logger_warning | logger_warning_once |
    # minor_issues_dict | none | runtime_exception
    normalised_fields: []
    # For dormant rules: which fields the engine silently normalises.
    # Empty for error rules.

  message_template: >
    `num_beams` has to be divisible by `num_beam_groups`, but got
    `num_beams`={declared_value} and `num_beam_groups`={declared_value}.
  # The static fragment of the library's error message.
  # Used by the validation-CI gate's message_template_match check:
  # the gate asserts this fragment appears in the live library's exception message.
  # Template variables ({declared_value}, {effective_value}, etc.) are
  # substituted when the rule fires at validation time.

  references:
    - "transformers.GenerationConfig.__init__() - observed via construction-time ValueError"
  # Human-readable provenance citations. Free-form strings.
  # Useful for tracking down the library source line that motivated the rule.

  added_by: dynamic_miner
  # Provenance: which pipeline component produced this rule.
  # See AddedBy values below.

  added_at: '2026-04-25'
  # Date (YYYY-MM-DD) when this rule was added to the corpus.

  cross_validated_by: []
  # Optional. Other miner sources that independently emitted a rule with the
  # same fingerprint (engine + severity + match.fields).
  # Set by build_corpus.py when two miners agree; empty for single-source rules.
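
To make the roles of kwargs_positive, kwargs_negative, and the message fragment concrete, here is a rough sketch of a replay check. It is not the validation-CI gate itself. FakeGenerationConfig is a stand-in written for this example (not the real transformers class), replay_rule is a hypothetical helper, and message_fragment is an illustrative key holding the static part of message_template.

```python
class FakeGenerationConfig:
    """Stand-in for the rule's native_type; NOT the real transformers class."""

    def __init__(self, num_beams: int = 1, num_beam_groups: int = 1):
        if num_beams % num_beam_groups != 0:
            raise ValueError(
                "`num_beams` has to be divisible by `num_beam_groups`, but got "
                f"`num_beams`={num_beams} and `num_beam_groups`={num_beam_groups}."
            )


def replay_rule(native_type, rule: dict) -> dict:
    """Construct native_type with both kwarg sets and record what happened."""
    result = {
        "positive_raised": False,
        "message_matched": False,
        "negative_raised": False,
    }
    try:
        native_type(**rule["kwargs_positive"])
    except Exception as exc:
        result["positive_raised"] = True
        # The static fragment must appear in the live exception message.
        result["message_matched"] = rule["message_fragment"] in str(exc)
    try:
        native_type(**rule["kwargs_negative"])
    except Exception:
        result["negative_raised"] = True
    return result


rule = {
    "kwargs_positive": {"num_beams": 2, "num_beam_groups": 3},
    "kwargs_negative": {"num_beams": 4, "num_beam_groups": 2},
    # Hypothetical key: the static part of message_template.
    "message_fragment": "`num_beams` has to be divisible by `num_beam_groups`",
}
result = replay_rule(FakeGenerationConfig, rule)
```

For an error rule, a healthy replay under these assumptions has positive_raised and message_matched true and negative_raised false.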

Field reference

id

Unique identifier for the rule. Used in error messages, divergence reports, and logs. Convention:

{engine}_{native_type_slug}_{condition_slug}

Examples:

  • transformers_beam_search_num_beams_not_divisible_by_num_beam_groups
  • vllm_sampling_temperature_out_of_range
  • tensorrt_quantization_fp8_not_supported_on_sm80

severity

| Value | When it fires | User sees |
|---|---|---|
| error | Engine would raise at construction / validate time | ValueError (surfaced as Pydantic ValidationError) before initialisation |
| warn | Engine announces a suboptimal setting but proceeds | Warning message |
| dormant | Engine silently normalises or ignores the field | Annotation: "field X will be coerced to Y" |

expected_outcome.outcome

| Value | Meaning |
|---|---|
| dormant_silent | Engine silently normalises; no user-visible emission |
| dormant_announced | Engine writes to minor_issues dict / logger, but config runs |
| warn | Engine calls warnings.warn(...) or equivalent |
| error | Engine raises at construct / validate time |
| pass | Predicate matched but engine handles it cleanly (positive-reference rules) |

expected_outcome.emission_channel

| Value | Meaning |
|---|---|
| warnings_warn | Python warnings.warn(...) |
| logger_warning | stdlib logger .warning(...) |
| logger_warning_once | stdlib logger .warning_once(...) |
| minor_issues_dict | HF's internal minor_issues dict (user-observable via strict-mode raise or log) |
| none | No user-visible emission (silent coercion or bare raise with no warning prefix) |
| runtime_exception | Exception raised at engine construct / runtime |

added_by

The provenance of the rule - which pipeline component produced it.

| added_by value | Source |
|---|---|
| static_miner | AST walking of validator methods |
| dynamic_miner | Combinatorial probing (raise/no-raise observation) |
| pydantic_lift | model_json_schema() + FieldInfo.metadata |
| msgspec_lift | msgspec.inspect.type_info() + Meta constraints |
| dataclass_lift | dataclasses.fields() + Literal[...] annotations |
| manual_seed | Hand-written by a maintainer (pipeline-failure debt; use sparingly, add justification comment) |
| runtime_warning | Proposed by the feedback loop from observed logger.warning_once emissions (needs human generalisation before landing) |
| observed_collision | Proposed by the feedback loop from config-hash collision detection (needs human generalisation before landing) |

miner_source

The {path, method, line_at_scan} record pointing back to the library source.

  • path: relative path within the library's source tree (e.g. transformers/generation/configuration_utils.py).
  • method: the method name where the detector found this rule.
  • line_at_scan: line number at the time of mining. This will drift when the library is updated; it is for human inspection only, not machine comparison.

cross_validated_by

When two or more miners independently emit a rule with the same fingerprint (same engine + severity + match.fields), build_corpus.py keeps one rule as primary (added_by = primary source) and records the secondary source in cross_validated_by. Cross-validation is evidence that the rule is real: independent paths agree.
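
The fingerprint-and-merge behaviour described above can be sketched as follows. This is an illustrative approximation, not build_corpus.py: it assumes the first rule seen for a fingerprint becomes the primary (the source does not say how the primary is chosen), and the function names are hypothetical.

```python
import json


def fingerprint(rule: dict) -> str:
    """Stable key built from engine + severity + match.fields."""
    return json.dumps(
        [rule["engine"], rule["severity"], rule["match"]["fields"]],
        sort_keys=True,
    )


def merge_cross_validated(rules: list) -> list:
    """Keep one rule per fingerprint; record agreeing sources on the primary.

    Assumption: the first rule encountered for a fingerprint is the primary.
    """
    primary_by_fp = {}
    for rule in rules:
        fp = fingerprint(rule)
        primary = primary_by_fp.get(fp)
        if primary is None:
            kept = dict(rule)
            kept["cross_validated_by"] = list(rule.get("cross_validated_by", []))
            primary_by_fp[fp] = kept
            continue
        source = rule["added_by"]
        if source != primary["added_by"] and source not in primary["cross_validated_by"]:
            primary["cross_validated_by"].append(source)
    return list(primary_by_fp.values())


fields = {"transformers.sampling.num_beams": {"not_divisible_by": "@num_beam_groups"}}
rules = [
    {"engine": "transformers", "severity": "error",
     "match": {"fields": fields}, "added_by": "dynamic_miner"},
    {"engine": "transformers", "severity": "error",
     "match": {"fields": fields}, "added_by": "static_miner"},
    {"engine": "transformers", "severity": "warn",
     "match": {"fields": {"transformers.sampling.num_beams": 1}},
     "added_by": "static_miner"},
]
merged_rules = merge_cross_validated(rules)
```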


Match predicate operators

Full reference for the match.fields operator keys:

| Operator key | Fires when | Notes |
|---|---|---|
| "==" / "equals" | field == value | Word and symbol forms are aliases |
| "!=" / "not_equal" | field != value | None-safe (does not fire if field is None) |
| "<" | field < value | None-safe |
| "<=" | field <= value | None-safe |
| ">" | field > value | None-safe |
| ">=" | field >= value | None-safe |
| "in" | field in [v1, v2, ...] | Spec must be list/tuple/set |
| "not_in" | field not in [v1, v2, ...] | None-safe; spec must be list/tuple/set |
| "present" | field is not None | |
| "absent" | field is None | |
| "type_is" | type(field).__name__ in name_set | Accepts string or list of strings |
| "type_is_not" | type(field).__name__ not in name_set | None-safe |
| "divisible_by" | field % divisor == 0 | Both operands must be non-bool ints; divisor = 0 → False |
| "not_divisible_by" | field % divisor != 0 | Both operands must be non-bool ints; divisor = 0 → False |

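The operator semantics in the table can be approximated by a small evaluator. This is a sketch under stated assumptions rather than the loader's implementation: evaluate is a hypothetical name, "None-safe" is read as "returns False when the field is None", and a non-collection spec for in / not_in is assumed to raise TypeError.

```python
import operator

_COMPARISONS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def _is_plain_int(value) -> bool:
    # bool is a subclass of int, so it has to be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _require_collection(spec):
    if not isinstance(spec, (list, tuple, set)):
        raise TypeError("spec must be a list, tuple, or set")
    return spec


def evaluate(op: str, field, spec=None) -> bool:
    """Return True when the predicate `op` fires for `field` against `spec`."""
    if op in ("==", "equals"):
        return field == spec
    if op in ("!=", "not_equal"):
        return field is not None and field != spec
    if op in _COMPARISONS:
        return field is not None and _COMPARISONS[op](field, spec)
    if op == "in":
        return field in _require_collection(spec)
    if op == "not_in":
        return field is not None and field not in _require_collection(spec)
    if op == "present":
        return field is not None
    if op == "absent":
        return field is None
    if op in ("type_is", "type_is_not"):
        names = {spec} if isinstance(spec, str) else set(spec)
        if op == "type_is":
            return type(field).__name__ in names
        return field is not None and type(field).__name__ not in names
    if op in ("divisible_by", "not_divisible_by"):
        # Non-int operands and a zero divisor never fire, for either operator.
        if not (_is_plain_int(field) and _is_plain_int(spec)) or spec == 0:
            return False
        remainder = field % spec
        return remainder == 0 if op == "divisible_by" else remainder != 0
    raise ValueError(f"unknown operator: {op!r}")
```

With the annotated example's values, evaluate("not_divisible_by", 2, 3) fires and evaluate("not_divisible_by", 4, 2) does not.
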
Bare value shorthand

A bare value (not a dict) in the match.fields spec is shorthand for equality:

# These two are equivalent:
transformers.sampling.num_beams: 1
transformers.sampling.num_beams:
  "==": 1

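One way to handle the shorthand is to expand it before evaluation, so later code only ever sees the explicit operator form. The helper below is a hypothetical sketch; the source does not say whether the loader normalises specs this way.

```python
def normalise_fields(fields: dict) -> dict:
    """Expand bare values in a match.fields mapping into {"==": value}."""
    return {
        path: spec if isinstance(spec, dict) else {"==": spec}
        for path, spec in fields.items()
    }


shorthand = {"transformers.sampling.num_beams": 1}
explicit = {"transformers.sampling.num_beams": {"==": 1}}
expanded = normalise_fields(shorthand)
```
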
@field_ref cross-field references

Any operator value that starts with @ is resolved as a field reference:

transformers.sampling.num_beams:
  not_divisible_by: "@num_beam_groups"
  # "@num_beam_groups" resolves as a sibling:
  # config.transformers.sampling.num_beam_groups

transformers.sampling.num_beams:
  not_divisible_by: "@transformers.sampling.num_beam_groups"
  # Dotted ref resolves from the config root.
  # Equivalent to the sibling form when the parent namespace is the same.
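
The two resolution modes can be sketched against a plain nested dict. This is an illustration only: the real resolver works on ExperimentConfig, and the helper names lookup and resolve_operand, as well as returning None for a missing path, are assumptions made for this example.

```python
def lookup(config: dict, dotted_path: str):
    """Walk a nested dict by dotted path; return None if any part is missing."""
    node = config
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve_operand(config: dict, field_path: str, value):
    """Resolve "@..." operator values; pass literal values through unchanged."""
    if not (isinstance(value, str) and value.startswith("@")):
        return value
    ref = value[1:]
    if "." not in ref:
        # Sibling form: reuse the parent namespace of the field being matched.
        parent, _, _ = field_path.rpartition(".")
        ref = f"{parent}.{ref}" if parent else ref
    return lookup(config, ref)


config = {"transformers": {"sampling": {"num_beams": 2, "num_beam_groups": 3}}}
field_path = "transformers.sampling.num_beams"
sibling = resolve_operand(config, field_path, "@num_beam_groups")
dotted = resolve_operand(config, field_path, "@transformers.sampling.num_beam_groups")
```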

manual_seed rules

manual_seed rules are hand-written by a maintainer. They exist for coverage gaps where the miner pipeline cannot mechanically derive the constraint (e.g. a type-check rule in a library method the static miner does not walk, or a constraint that requires understanding semantics the AST cannot express).

manual_seed is pipeline-failure debt: the right long-term fix is extending the miner to cover the gap. Each manual_seed entry should carry a justification comment explaining why the miner cannot cover it and what would be needed to close the gap.

- id: bitsandbytes_load_in_4bit_and_8bit_mutually_exclusive
  ...
  added_by: manual_seed
  # Justification: BitsAndBytesConfig.__init__ checks load_in_4bit AND load_in_8bit
  # in the same branch, but the dynamic miner's BNB cluster was not added in the
  # refactor (scope decision). Extend the dynamic miner with a bitsandbytes_quant
  # cluster to close this gap.

Schema version history

| Version | Changes |
|---|---|
| 1.0.0 | Initial release. added_by as single string, cross_validated_by optional list, mined_at top-level field. Replaces pre-1.0 walked_at field name. |

Corpus invariants

The CI pipeline enforces these invariants on every corpus file:

  1. schema_version major must equal SUPPORTED_MAJOR_VERSION in loader.py.
  2. Every added_by value must be in the AddedBy Literal.
  3. Every severity value must be in {"error", "warn", "dormant"}.
  4. Every expected_outcome.outcome must be in Outcome.
  5. Every expected_outcome.emission_channel must be in EmissionChannel.
  6. All id values within one corpus file must be unique.
  7. Dormant rules must converge to a stable fixpoint (verified by _fixpoint_test.py).
  8. Vendor-CI gate: kwargs_positive must trigger the rule, message_template must match, kwargs_negative must not trigger.
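
The enumeration and uniqueness checks (items 2 to 6) lend themselves to a short static pass. The sketch below is not the CI code: check_corpus is a hypothetical helper, and the allowed-value sets are copied from the tables in this document rather than imported from the AddedBy, Outcome, and EmissionChannel types.

```python
from collections import Counter

# Allowed values copied from the tables in this document.
SEVERITIES = {"error", "warn", "dormant"}
OUTCOMES = {"dormant_silent", "dormant_announced", "warn", "error", "pass"}
EMISSION_CHANNELS = {
    "warnings_warn", "logger_warning", "logger_warning_once",
    "minor_issues_dict", "none", "runtime_exception",
}
ADDED_BY = {
    "static_miner", "dynamic_miner", "pydantic_lift", "msgspec_lift",
    "dataclass_lift", "manual_seed", "runtime_warning", "observed_collision",
}


def check_corpus(corpus: dict) -> list:
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    rules = corpus.get("invariants", [])
    for rule_id, count in Counter(rule["id"] for rule in rules).items():
        if count > 1:
            problems.append(f"duplicate id: {rule_id}")
    for rule in rules:
        outcome = rule.get("expected_outcome", {})
        checks = [
            ("severity", rule.get("severity"), SEVERITIES),
            ("added_by", rule.get("added_by"), ADDED_BY),
            ("expected_outcome.outcome", outcome.get("outcome"), OUTCOMES),
            ("expected_outcome.emission_channel",
             outcome.get("emission_channel"), EMISSION_CHANNELS),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                problems.append(f"{rule['id']}: invalid {name}: {value!r}")
    return problems


good_rule = {
    "id": "rule_a", "severity": "error", "added_by": "dynamic_miner",
    "expected_outcome": {"outcome": "error", "emission_channel": "none"},
}
bad_rule = dict(good_rule, severity="fatal")  # duplicate id and unknown severity
clean = check_corpus({"invariants": [good_rule]})
dirty = check_corpus({"invariants": [good_rule, bad_rule]})
```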

See also