# Changelog
All notable changes to this project are documented here.
Format follows Keep a Changelog.
Versioning follows Semantic Versioning (0.x pre-release series).
Minor version bumps (0.x.0) mark milestone completions. Breaking changes can occur between any two 0.x releases.
## Unreleased

## v0.10.0 - TBD
Post-v0.9.0 work: engine-coupling restructure, engine-invariants pipeline, Docusaurus docs site, per-engine layout, and CI hardening.
### Breaking Changes
- `engine: pytorch` renamed to `engine: transformers` throughout YAML, CLI, and Python API. The engine runs HuggingFace Transformers `.generate()`; PyTorch is the tensor substrate, not the engine, and the rename aligns with `pip install transformers` and the library that owns the inference API. Migrate with:

  ```bash
  sed -i 's/engine: pytorch/engine: transformers/g; s/^pytorch:/transformers:/g' your-study.yaml
  ```

  Affected: YAML engine value, YAML section key, `PyTorchConfig` class, `ENGINE_PYTORCH` constant, `[pytorch]` extra, `LLEM_RUNNER_PYTORCH`/`LLEM_IMAGE_PYTORCH` env vars, Docker image tags. Preserved (PyTorch the library, unchanged): `import torch`, `torch_dtype`, `pytorch/pytorch:*` base images, `PYTORCH_VERSION` build args, `torch_compile_backend` field. (#261)
- `backend:` field and `--backend` flag renamed to `engine:` in YAML configs, CLI, and result JSON. Aligns terminology with how vLLM, TRT-LLM, and HuggingFace use "engine" natively. Migrate with:

  ```bash
  sed -i 's/^\(\s*\)backend:/\1engine:/g' your-study.yaml
  ```

  Affected: YAML field, CLI flag (`-b` becomes `-e`), result JSON fields `"backend"` and `"backend_version"`, Python symbols `BackendPlugin`, `BackendError`, `BACKEND_*` constants, `get_backend()`, `detect_default_backend()`. (#260)
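  Taken together with the `engine: pytorch` rename above, a minimal before/after of the renamed keys (the surrounding structure is hypothetical, not the exact schema):

  ```yaml
  # Before (v0.9.x keys) -- hypothetical minimal experiment config
  backend: pytorch
  pytorch:
    dtype: float16

  # After both migrations (v0.10.0 keys)
  engine: transformers
  transformers:
    dtype: float16
  ```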
- `tensorrt.tp_size` renamed to `tensorrt.tensor_parallel_size` to match `TrtLlmArgs` native naming. `transformers.tp_size` is unchanged (follows the `accelerate` convention). (#269)
- Typed-field curation for engine configs. Applies the maximalist rubric "type anything with a plausible energy/throughput/latency path" to each engine's Pydantic surface. Dropped fields remain settable via YAML (`extra="allow"` passthrough unless noted). (#270)
  - Transformers: drops `revision` (reproducibility metadata) and `trust_remote_code` (security toggle); adds `allow_tf32`, `autocast_enabled`, `autocast_dtype`, `low_cpu_mem_usage`.
  - vLLM: drops `sampling.max_tokens` and `beam_search.max_tokens` (duplicates of `ExperimentConfig.max_output_tokens`); adds `num_scheduler_steps`, `max_seq_len_to_capture`, `distributed_executor_backend`; replaces flat speculative fields with nested `VLLMSpeculativeConfig`.
  - TensorRT-LLM: drops `engine_path`, `TensorRTCalibConfig`, `TensorRTBuildCacheConfig`, `sampling.return_perf_metrics`, and `backend: Literal["trt"]`; adds `pipeline_parallel_size` and `max_num_tokens`.
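  As an illustration of the passthrough, a hypothetical snippet (the `revision` value is purely illustrative):

  ```yaml
  transformers:
    allow_tf32: true   # typed, validated field (new in this release)
    revision: main     # dropped from the typed surface, still accepted via extra="allow"
  ```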
- Engines (vLLM, TensorRT-LLM) now run exclusively inside Docker. Host extras `[vllm]` and `[tensorrt]` removed; only `[transformers]` remains host-installable. (#498)
- `dtype:` and `decoder:` fields migrated into per-engine sub-configs. Top-level `ExperimentConfig.dtype` and `ExperimentConfig.decoder` have moved to each engine's own configuration section. (#290, #291)
- `--dtype` and `--batch-size` CLI flags removed. Both fields are now set via YAML config only. (#292)
- `precision:` field renamed to `dtype:` with standard value strings (e.g. `float16`, `bfloat16` instead of the prior enum). (#196)
### Added

- `llem doctor` CLI command reports per-engine image status (OK / MISMATCH / UNVERIFIED / UNREACHABLE) and exits non-zero on mismatch for CI gating. (#256)
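  A minimal CI gate sketch relying only on the documented exit-code behaviour (the surrounding script is illustrative):

  ```bash
  # llem doctor exits non-zero on MISMATCH, failing the CI job here.
  llem doctor || { echo "engine image mismatch - rebuild required"; exit 1; }
  ```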
- Host/container schema fingerprint verification: Docker images stamped at build time with a `llem.expconf.schema.fingerprint` OCI label. Mismatches abort with a rebuild hint. Bypassable via `LLEM_SKIP_IMAGE_CHECK=1`. (#256)
- `SchemaLoader` class (`llenergymeasure.config.SchemaLoader`) reads vendored engine schemas via `importlib.resources` with per-instance caching and major-version envelope validation. (#268)
- Engine parameter discovery script (`scripts/discover_engine_schemas.py`) introspects installed engine packages inside their Docker images. Supports `vllm`, `tensorrt`, `transformers`, and `--all`. (#266)
- Vendored engine parameter schemas at `src/llenergymeasure/engines/{vllm,tensorrt,transformers}/`. Regenerate with `make discover-schema ENGINE=<engine>`. (#266)
- Per-engine sub-package layout (`src/llenergymeasure/engines/<engine>/`) co-locating runtime data, schema JSON, and engine invariants YAML. (#570)
- Per-engine SSOT for library version pins (`engine_versions/`) used by Renovate, Dockerfiles, and the invariant-mining pipeline. (#477)
- Engine invariants mining pipeline: static and dynamic miners for all three engines extract validation rules as a reproducible corpus. (#375, #434, #444)
- Vendor-replay CI gate validates the corpus against live engine packages; the TensorRT gate runs on a self-hosted GPU runner. (#414, #440, #447)
- `probe` primitive for the binary miner-reusability check. (#482)
- `ConfigProbe` protocol and per-engine `probe_config()` implementations. (#293)
- Configurable per-experiment timeout via `study_execution.experiment_timeout_seconds` (default 600 s), replacing the previous `max(n_prompts * 2, 600)` heuristic. Both local and Docker paths honour the same field. (#250)
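  For example (the YAML nesting follows the documented field path; the value shown is illustrative):

  ```yaml
  study_execution:
    experiment_timeout_seconds: 1800  # default: 600; honoured by both local and Docker paths
  ```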
- Disk-persisted baseline power cache with configurable strategy and TTL enforcement. (#242, #243)
- Per-study JSONL log capturing runtime warnings and container stderr. (#395)
- `llem report-gaps` command proposes corpus rules from runtime observations. (#397)
- Study robustness features: circuit breaker, resume-on-failure, GPU locks, container lifecycle management. (#214)
- Live per-experiment progress display with Rich panels and sub-bullet heartbeats. (#152, #165)
- `.env`-based runtime config and configurable `device_map` default. (#275)
- `trust_remote_code` opt-in via `LLEM_TRUST_REMOTE_CODE` env var. (#274)
- TRT-LLM build cache configurable via `LLEM_TRT_BUILD_CACHE_{ENABLED,DIR}` env vars. (#277)
- Tensor parallelism fields (`tp_plan`, `tp_size`) for the Transformers engine. (#161)
- Cross-field operators in the vendored-rules loader. (#410)
- Docusaurus documentation site at `website/` serving user, methodology, API, and architecture docs. (#566)
- Per-engine discovered-schema Markdown digest rendered to `docs/`. (#560)
- Architecture documentation suite in `docs/architecture/`. (#433)
- Per-engine engine-invariants and engine-schemas CI workflows with cross-pipeline coordination (consolidated from predecessor mine + vendor + parameter-discovery workflows). (#484, #486)
- Engine-pipeline orchestrator (`engine-pipeline.yml`) as the single reusable workflow entry point. (#514, #573)
- Cloudflare Pages PR preview deploy workflow. (#575)
- SSOT audit trail and GHCR image retention policies. (#546)
### Changed

- Re-typed `tensorrt.backend` as `Literal["trt", "pytorch", "_autodeploy"] | None` (reverses a prior incorrect curation-pass drop; `None` lets TRT-LLM auto-pick the runtime path). (#276)
- Engine-invariants pipeline consolidated from separate mine + vendor + parameter-discovery workflows into a single orchestrated flow with sequential downstream pipelines. (#484, #573)
- `study_execution` field names updated (execution fields renamed; `reverse`/`latin_square` ordering modes added). (#190)
- Dataset restructured into a nested `DatasetConfig` sub-model. (#195)
- `OutputConfig` extracted from `ExperimentConfig` as a separate sub-model. (#203)
- `EnergyConfig` flattened to `energy_sampler` + `gpu_telemetry` fields. (#201)
- `study_name` field replaces the generic `name` field in study configs. (#182)
- `n_prompts` default reduced to 50; `max_output_tokens` default bumped to 256. (#175, #213)
- Renovate customManager retargeted from Dockerfile ARGs to the `engine_versions/` SSOT. (#481)
- First-party `Dockerfile.vllm` and `Dockerfile.tensorrt` replaced with upstream-direct images plus volume mounts. (#509)
### Fixed

- `ImportError: cuKernelGetName` when importing `tensorrt_llm`: `LD_LIBRARY_PATH` ordering placed the bundled compat CUDA 12.2 library ahead of the host-driver mount. Fixed by reordering so the host-driver mount takes precedence over `/usr/local/cuda/compat/lib`. (#264)
- Miner `added_at` timestamp lost on re-mine; f-string `message_template` fields now rendered correctly. (#523)
- `Dockerfile.transformers` stale references to the old `[pytorch]` extra and header comments corrected. (#265)
- Config hash mismatch in Docker study runs resolved. (#176)
- Non-matching engine sections stripped correctly during multi-engine grid expansion. (#171)
- Docker auto-elevation enforced for multi-engine studies. (#172)
- Baseline cache path resolved before Docker bind-mount. (#248)
### Removed

- Internal helper `llenergymeasure.study.runner._calculate_timeout` (replaced by direct config reads). (#529)
- First-party `Dockerfile.vllm` and `Dockerfile.tensorrt` engine images. (#509)
- Predecessor CI workflows: `auto-mine.yml`, `vendor-tensorrt.yml`, `vendor-vllm.yml`, `parameter-discovery.yml`, and their predecessors. (#483, #485)
## v0.9.0 - 2026-03-20
Docker infrastructure, vLLM engine, TensorRT-LLM engine, package restructure, test hardening, and CI.
### Added
- NVML GPU memory residual check before experiment dispatch (threshold 1 GB), preventing stale-process contamination. (#24, #26)
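  A minimal sketch of the same idea using `pynvml` directly (threshold handling simplified; not the project's actual implementation):

  ```python
  import pynvml

  RESIDUAL_THRESHOLD_BYTES = 1 * 1024**3  # 1 GB, matching the dispatch check

  def gpu_memory_is_clean(gpu_index: int) -> bool:
      """Return False when residual memory from stale processes exceeds the threshold."""
      pynvml.nvmlInit()
      try:
          handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
          used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
          return used < RESIDUAL_THRESHOLD_BYTES
      finally:
          pynvml.nvmlShutdown()
  ```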
- Docker runner infrastructure: container lifecycle management, volume mounts, GPU index resolution. (#27, #124)
- Docker pre-flight environment checks. (#28)
- TensorRT-LLM Docker image rewrite with CUDA 12.6.2 upgrade. (#114)
- `TensorRTConfig` expanded to the full TRT-LLM parameter schema. (#115)
- `mpirun` injection for TensorRT-LLM tensor parallelism. (#116)
- `BackendPlugin.validate_config` protocol method. (#121)
- `TensorRTBackend` implementation registered in `get_backend()`. (#122)
- `TensorRTConfig.engine_path` for pre-compiled engine loading. (#143)
- 9-layer import-linter architecture enforcement in CI. (#135, #144)
### Changed
- Package restructured with file moves, import rewrites, and layer boundary fixes. (#133, #134)
- Prompt loading moved outside the NVML measurement window. (#145)
- Shared backend helpers extracted; dead warmup code removed. (#140)
- Test suite restructured; `importorskip` guards added for optional dependencies. (#137, #138)
### Fixed

- `accelerate` restored as a `[pytorch]` optional dependency (accidentally dropped). (#132)
- Runner mode auto-detection (local vs Docker) on startup. (#146)
- Silent `NVMLError`, payload detection, and empty `gpu_indices` guard. (#141)
### Removed
- Dead code, stale type annotations, and unused dependencies. (#130)
## v0.8.0 - 2026-02-27
Multi-experiment study sweeps.
### Added
- `run_study()` public API for multi-experiment studies. (#23)
- `StudyConfig` with sweep grammar (grid and cycle ordering). (#23)
- YAML-driven parameter sweeps across models, engines, and precisions (see the sketch after this list). (#23)
- `StudyRunner` with sequential experiment dispatch. (#23)
- Study-level aggregation and result collection. (#23)
- Manifest-based progress tracking with resume support. (#23)
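A hypothetical study config illustrating the sweep grammar above (field names are assumptions, not the exact schema of this release):

```yaml
name: dtype-sweep
sweep:
  ordering: grid                    # grid or cycle, per the sweep grammar
  model: [gpt2, facebook/opt-1.3b]  # cartesian product across these axes
  backend: [pytorch, vllm]
  precision: [float16, bfloat16]
```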
## v0.7.0 - 2026-02-27
First end-to-end single-experiment release.
### Added
- `run_experiment()` public API. (#22)
- `ExperimentConfig` to `ExperimentResult` pipeline. (#22)
- Energy measurement via CodeCarbon and Zeus backends. (#22)
- Extended metrics: TPOT, TEI, memory efficiency. (#22)
- Streaming latency measurement (TTFT / ITL). (#22)
- Results persistence in Parquet format. (#22)
## Historical (pre-0.x)
The entries below predate the current 0.x versioning scheme introduced in early 2026. They describe the research prototype and early CLI rewrites that were restructured and re-versioned starting from v0.1.0. Version numbers v1.x and v2.0.0 referenced here are legacy labels from that era; they do not correspond to any published release under the current scheme. The 2026-03-04 history reset remapped these to sequential 0.x tags (v0.1.0-v0.6.0) for consistency with the current versioning scheme.
## v0.6.0 (2025-12-29) - formerly v1.16.0
Production-ready containerisation with full GPU support and streamlined developer experience.
### Added
- Multi-stage Dockerfile with `nvidia/cuda:12.4.1-runtime-ubuntu22.04` base image (builder, runtime, and dev stages).
- Docker Compose profiles separating production and development workflows (`lem-app`, `lem-dev`).
- VS Code devcontainer configuration with GPU passthrough and Ruff/Pylance extensions.
- Makefile targets for common Docker operations (`make docker-build`, `make experiment`, `make datasets`).
### Changed
- CI workflow reliability improved with concurrency groups preventing parallel releases.
- Dev container runs as root, eliminating permission complexity with virtual environments.
### Fixed
- Docker CUDA 12.4 base image aligned with host driver requirements.
- Volume permission errors resolved by running dev containers as root.
- Deprecated `torch_dtype` parameter replaced with `dtype` in model loading.
- Removed obsolete `TRANSFORMERS_CACHE` environment variable (superseded by `HF_HOME`).
- CodeCarbon pandas `FutureWarning` suppressed.
- `nvidia-smi` GPU utilisation parsing handles `[N/A]` values gracefully.
## v0.5.0 (2025-12-21) - formerly v1.15.0
Comprehensive test coverage ensuring reliability across all components.
### Added
- End-to-end CLI tests (8 tests) validating complete benchmark workflows.
- Integration tests (47 tests) covering non-GPU workflows.
- Methodology documentation (`docs/methodology.md`) explaining the measurement approach.
### Changed
- Total test count: 416 passing tests (unit + integration + e2e).
- All tests run without GPU access using mocked/simulated data.
### Removed

- `requirements.txt` (306 frozen packages); all dependencies now managed via the Poetry lockfile.
## v0.4.0 (2025-12-21) - formerly v1.13.0
User-friendly command-line interface replacing legacy entry points.
### Added
- Typer-based CLI (`lem`) with subcommands: `experiment`, `aggregate`, `config validate`, `config show`, `results list`, `results show`, `datasets`.
- `ExperimentOrchestrator` with protocol-based dependency injection.
- `ExperimentContext` dataclass for runtime state management.
- Accelerate launcher with configurable retry logic.
- 25 CLI tests and 27 orchestration unit tests.
### Removed

- Legacy `MAIN_*.py` entry points (6 files).
## v0.3.0 (2025-12-20) - formerly v1.10.0
Major architectural refactor establishing clean module boundaries.
### Breaking Changes
- Package renamed: `llm-bench` to `lem`. All imports now use `llenergymeasure`.
### Added
- Energy backend plugin registry with automatic CodeCarbon registration.
- `FlopsEstimator` with a three-strategy fallback chain (calflops, architecture, parameter estimate), each strategy returning a confidence level (see the sketch after this list).
- Results aggregation with temporal overlap detection and GPU attribution verification.
- Export functionality for CSV and JSON formats.
- 296 unit tests covering all new modules.
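A minimal sketch of the fallback-chain shape described above (strategy internals elided; names and signatures are illustrative, not the actual API):

```python
from typing import Callable

# Ordered (strategy, confidence) pairs: calflops first, parameter estimate last.
# Each strategy returns a FLOPs count or raises to hand off to the next one.
def estimate_flops(
    model: object,
    strategies: list[tuple[Callable[[object], float], str]],
) -> tuple[float, str]:
    for strategy, confidence in strategies:
        try:
            return strategy(model), confidence
        except Exception:
            continue  # fall through to the next, lower-confidence strategy
    raise RuntimeError("all FLOPs estimation strategies failed")
```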
### Changed

- Replaced `print()` statements with Loguru structured logging.
## v0.2.0 (2025-05-17) - formerly v1.0.0
Research phase complete - stable multi-model benchmarking validated on production hardware.
### Added
- Multi-model experiment support with scenario-based configuration.
- Experiment suite CSV export with consistent naming conventions.
- Failed experiment detection with cycle tracking and automatic retry.
- Minimum output token enforcement for comparable generation lengths.
- Large model stability improvements (gradient checkpointing, CUDA cache clearing).
- Data wrangling pipelines for experiment result analysis (Pandas-based).
- Plotting functionality for efficiency metrics visualisation.
- FLOPs caching preventing redundant calculations.
## v0.1.0 (2025-03-22) - formerly v0.5.0
Core measurement functionality establishing the foundation for all subsequent development.
### Added
- Distributed results aggregation across multiple GPUs with per-process JSON files.
- FLOPs calculation with quantisation awareness and `calflops` integration.
- Robust process cleanup with signal handlers and distributed barrier synchronisation.
- Optimum benchmark integration for standardised measurements.
### Changed
- Distributed execution stability improved: proper NCCL initialisation and teardown.
- Major directory restructuring separating config, core, and result handling.