# Engine configuration

This page documents the per-engine YAML configuration surface. Each experiment selects exactly one engine via the top-level `engine:` field and configures it through a same-named block (`transformers:`, `vllm:`, `tensorrt:`). The fields documented below are the ones declared on the engine's Pydantic model in `src/llenergymeasure/config/engine_configs.py`; unknown fields are forwarded to the underlying engine via `extra="allow"`, so newer engine parameters work without a new llenergymeasure release.
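Because of `extra="allow"`, a field that is not documented on this page is still passed through to the engine. As a sketch, a constructor argument added in a future vLLM release could be set directly; `hypothetical_new_flag` below is illustrative, not a real parameter:

```yaml
vllm:
  engine:
    gpu_memory_utilization: 0.9
    hypothetical_new_flag: true  # unknown to llenergymeasure; forwarded verbatim to vllm.LLM()
```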

For study-level controls (sweeps, runners, cycles, output) see `study-config.md`. For the auto-generated parameter inventories (full type tables straight from engine introspection), see the per-engine schema pages in this section.

## Top-level shape

A single experiment YAML combines a `task:` block, the `engine:` selector, a `measurement:` block, an optional engine-specific section, and a few optional top-level fields:

```yaml
task:
  model: gpt2
  dataset:
    source: aienergyscore
    n_prompts: 100
    order: interleaved
  max_input_tokens: 256
  max_output_tokens: 256
  random_seed: 42

engine: transformers

measurement:
  warmup:
    enabled: true
    n_warmup: 5
  baseline:
    enabled: true
    duration_seconds: 30.0
  energy_sampler: auto

transformers:
  batch_size: 4
  dtype: bfloat16
  attn_implementation: sdpa

# Optional
sampling_preset: deterministic  # deterministic | standard | creative | factual
lora:
  adapter_id: org/lora-adapter
  merge_weights: false
passthrough_kwargs:
  trust_remote_code: true
```

The engine-specific section must match the `engine:` field; mixing `engine: vllm` with a `transformers:` section is a configuration error (models.py:441-464). When `engine:` is set without a matching section, the engine's own defaults are used.
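As a sketch of the pairing rule, the first snippet parses while the second is rejected at config-load time:

```yaml
# OK: selector and section agree
engine: vllm
vllm:
  dtype: bfloat16
```

```yaml
# Configuration error: selector says vllm, section says transformers
engine: vllm
transformers:
  batch_size: 4
```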

`dtype` lives inside the engine section (`transformers.dtype`, `vllm.dtype`, `tensorrt.dtype`) because each engine accepts a different subset of dtypes (ssot.py:156-160). There is no top-level `dtype:` field.

`runners:` and `images:` are study-level fields and not valid in a single-experiment YAML; they belong on `StudyConfig` (models.py:740-759). See `study-config.md`.

## Common fields (all engines)

These fields are declared on `ExperimentConfig` and apply identically across engines.

### `task:`

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `model` | str (required) | - | HuggingFace model ID or local path | models.py:265-270 |
| `dataset.source` | str | aienergyscore | Built-in dataset alias or .jsonl path | models.py:198-203 |
| `dataset.n_prompts` | int >= 1 | 100 | Number of prompts to load or generate | models.py:204-209 |
| `dataset.order` | interleaved \| grouped \| shuffled | interleaved | Prompt ordering strategy | models.py:210-216 |
| `max_input_tokens` | int >= 1 \| null | 256 | Input truncation cap; null disables | models.py:276-284 |
| `max_output_tokens` | int >= 1 \| null | 256 | Output token budget; null generates to EOS or context limit | models.py:285-293 |
| `random_seed` | int | 42 | Per-experiment seed for inference RNG and dataset ordering | models.py:294-297 |

### `measurement:`

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `warmup.enabled` | bool | true | Enable warmup phase before measurement | models.py:79 |
| `warmup.n_warmup` | int >= 1 | 5 | Number of full-length warmup prompts | models.py:81-85 |
| `warmup.thermal_floor_seconds` | float >= 30.0 | 60.0 | Minimum post-warmup wait for thermal stabilisation | models.py:86-90 |
| `warmup.convergence_detection` | bool | false | Enable adaptive CV-based convergence (additive to n_warmup) | models.py:93-96 |
| `warmup.cv_threshold` | float [0.01, 0.5] | 0.05 | CV target for convergence | models.py:97-102 |
| `warmup.max_prompts` | int >= 5 | 20 | Safety cap for CV mode | models.py:103-107 |
| `warmup.window_size` | int >= 3 | 3 | Sliding window size for CV calculation | models.py:108-112 |
| `warmup.min_prompts` | int >= 1 | 5 | Minimum prompts before checking convergence | models.py:113-117 |
| `baseline.enabled` | bool | true | Measure idle GPU power before experiments | models.py:142 |
| `baseline.duration_seconds` | float [5.0, 120.0] | 30.0 | Baseline measurement window | models.py:143-148 |
| `baseline.strategy` | cached \| validated \| fresh | validated | Caching strategy for the baseline measurement | models.py:149-156 |
| `baseline.cache_ttl_seconds` | float >= 60.0 | 7200.0 | Cached baseline lifetime (cached/validated only) | models.py:157-164 |
| `baseline.validation_interval` | int >= 1 | 5 | Re-validate every N experiments (validated only) | models.py:165-171 |
| `baseline.drift_threshold` | float [0.01, 0.50] | 0.10 | Drift fraction that triggers re-measurement (validated only) | models.py:172-180 |
| `energy_sampler` | auto \| nvml \| zeus \| codecarbon \| null | auto | Energy sampler; null disables energy measurement | models.py:320-327 |
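For instance, a configuration that enables adaptive warmup on top of the fixed warmup prompts and keeps the validated baseline strategy might look like this (values illustrative):

```yaml
measurement:
  warmup:
    enabled: true
    n_warmup: 5
    convergence_detection: true   # adaptive CV-based warmup, additive to n_warmup
    cv_threshold: 0.05
    max_prompts: 20               # safety cap for CV mode
  baseline:
    enabled: true
    strategy: validated
    validation_interval: 5        # re-validate the cached baseline every 5 experiments
    drift_threshold: 0.10         # re-measure if idle power drifts by more than 10%
  energy_sampler: nvml            # pin the sampler instead of auto-detection
```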

### Top-level optional fields

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `sampling_preset` | deterministic \| standard \| creative \| factual \| null | null | Merges preset values into the active engine's `sampling:` section at parse time. Explicit YAML values take precedence. | models.py:367-375, ssot.py:27-32 |
| `lora.adapter_id` | str \| null | null | HuggingFace Hub adapter ID (mutually exclusive with adapter_path) | models.py:232 |
| `lora.adapter_path` | str \| null | null | Local path to adapter weights | models.py:233 |
| `lora.merge_weights` | bool | false | Merge adapter weights into base model at load time | models.py:234-236 |
| `passthrough_kwargs` | dict \| null | null | Extra kwargs forwarded to the engine; keys must not collide with ExperimentConfig fields | models.py:395-399 |

## Transformers engine (`transformers:`)

Loads a model via `AutoModelForCausalLM.from_pretrained()` and generates with `model.generate()`. All fields default to `null`, meaning "use the engine's own default". Unknown fields under `transformers:` are forwarded to the underlying HuggingFace APIs (engine_configs.py:94).

### Parameters

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `batch_size` | int >= 1 | 1 | Prompts per forward pass | engine_configs.py:100-104 |
| `dtype` | float32 \| float16 \| bfloat16 | bfloat16 | Model compute dtype | engine_configs.py:110-113 |
| `attn_implementation` | sdpa \| flash_attention_2 \| flash_attention_3 \| eager | sdpa | Attention kernel | engine_configs.py:119-124 |
| `torch_compile` | bool | false | Enable torch.compile | engine_configs.py:130-133 |
| `torch_compile_mode` | str | default | default \| reduce-overhead \| max-autotune. Requires `torch_compile=true`. | engine_configs.py:134-137 |
| `torch_compile_backend` | str | inductor | torch.compile backend. Requires `torch_compile=true`. | engine_configs.py:138-141 |
| `load_in_4bit` | bool | false | BitsAndBytes 4-bit quantisation | engine_configs.py:147-150 |
| `load_in_8bit` | bool | false | BitsAndBytes 8-bit quantisation (mutually exclusive with load_in_4bit) | engine_configs.py:151-154 |
| `bnb_4bit_compute_dtype` | float16 \| bfloat16 \| float32 | float32 | Compute dtype for 4-bit. Requires `load_in_4bit=true`. | engine_configs.py:155-158 |
| `bnb_4bit_quant_type` | nf4 \| fp4 | nf4 | 4-bit quantisation type. Requires `load_in_4bit=true`. | engine_configs.py:159-162 |
| `bnb_4bit_use_double_quant` | bool | false | Double quantisation. Requires `load_in_4bit=true`. | engine_configs.py:163-166 |
| `use_cache` | bool | true | Enable KV cache during generation | engine_configs.py:172-175 |
| `cache_implementation` | static \| offloaded_static \| sliding_window | dynamic | KV cache strategy; static enables CUDA graphs. Requires `use_cache` to be true or unset. | engine_configs.py:176-179 |
| `num_beams` | int >= 1 | 1 | Beam search width (1 = greedy/sampling) | engine_configs.py:185-189 |
| `early_stopping` | bool | false | Stop beam search when all beams hit EOS | engine_configs.py:190-193 |
| `length_penalty` | float | 1.0 | Beam length penalty (>1 favours longer, <1 shorter) | engine_configs.py:194-197 |
| `no_repeat_ngram_size` | int >= 0 | 0 | Prevent n-gram repetition (0 = disabled) | engine_configs.py:203-207 |
| `prompt_lookup_num_tokens` | int >= 1 \| null | null | Prompt-lookup speculative decoding tokens | engine_configs.py:213-217 |
| `device_map` | str | auto | Device placement strategy | engine_configs.py:223-226 |
| `max_memory` | dict | null | Per-device memory limits, e.g. `{0: "10GiB", cpu: "50GiB"}` | engine_configs.py:227-230 |
| `low_cpu_mem_usage` | bool | false | Load weights incrementally to reduce peak CPU RAM | engine_configs.py:248-251 |
| `allow_tf32` | bool \| null | null | Allow TF32 matmul on Ampere+ | engine_configs.py:236-239 |
| `autocast_enabled` | bool | false | Enable torch.autocast during generation | engine_configs.py:240-243 |
| `autocast_dtype` | float16 \| bfloat16 | bfloat16 | AMP dtype (used when `autocast_enabled=true`) | engine_configs.py:244-247 |
| `tp_plan` | auto \| null | null | Native HF tensor parallelism plan (HF >= 4.50). Mutually exclusive with device_map; requires torchrun launch. | engine_configs.py:257-264 |
| `tp_size` | int >= 1 | WORLD_SIZE | Tensor parallel ranks. Used only when `tp_plan` is set. | engine_configs.py:265-273 |
| `sampling.*` | sub-config | null | `model.generate()` sampling kwargs (see below) | engine_configs.py:279-285 |
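As a sketch, the HF-native tensor-parallel fields pair like this; `tp_plan` is mutually exclusive with `device_map`, and the run must be launched via torchrun (which sets `WORLD_SIZE`):

```yaml
transformers:
  dtype: bfloat16
  tp_plan: auto   # HF >= 4.50; do not combine with device_map
  tp_size: 2      # optional; defaults to WORLD_SIZE
```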

### `transformers.sampling:` sub-section

Maps to `model.generate()` kwargs. Field names mirror HuggingFace's native conventions (`top_k=0` for disabled, `do_sample` controls greedy vs sampling) (engine_configs.py:41-79).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `temperature` | float [0.0, 2.0] | HF default | Sampling temperature (0 = greedy) | engine_configs.py:54-56 |
| `do_sample` | bool \| null | HF default | Enable sampling. Greedy is inferred from temperature=0 when null. | engine_configs.py:57-62 |
| `top_k` | int >= 0 | 50 | HF convention: 0 = disabled | engine_configs.py:63-67 |
| `top_p` | float [0.0, 1.0] | 1.0 | Nucleus sampling threshold (1.0 = disabled) | engine_configs.py:68-70 |
| `repetition_penalty` | float [0.1, 10.0] | 1.0 | Repetition penalty (1.0 = no penalty) | engine_configs.py:71-73 |
| `min_p` | float [0.0, 1.0] | null | Minimum probability filter | engine_configs.py:74-76 |
| `min_new_tokens` | int >= 1 | HF default | Minimum output tokens | engine_configs.py:77-79 |
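For instance, a stochastic sampling configuration in HF conventions (values illustrative):

```yaml
transformers:
  sampling:
    do_sample: true
    temperature: 0.7
    top_k: 0                  # HF convention: 0 disables top-k
    top_p: 0.9
    repetition_penalty: 1.1
```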

## vLLM engine (`vllm:`)

vLLM exposes a two-API surface (the `vllm.LLM()` constructor and `SamplingParams`); the configuration mirrors that split. Beam search uses a separate `vllm.beam_search:` block, mutually exclusive with `vllm.sampling:` (engine_configs.py:842-850).

### Top-level vLLM fields

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `dtype` | float16 \| bfloat16 \| auto | bfloat16 | Model dtype; auto infers from weights. float32 is not supported. | engine_configs.py:817-823 |
| `engine.*` | sub-config | null | `vllm.LLM()` constructor arguments | engine_configs.py:824-827 |
| `sampling.*` | sub-config | null | `vllm.SamplingParams()` arguments | engine_configs.py:828-833 |
| `beam_search.*` | sub-config | null | `vllm.BeamSearchParams()` arguments (mutually exclusive with sampling) | engine_configs.py:834-840 |

### `vllm.engine:` sub-section

Loaded once at model initialisation. Unknown fields are forwarded to `vllm.LLM()` (engine_configs.py:434).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `gpu_memory_utilization` | float [0.0, 1.0) | 0.9 | GPU memory fraction reserved for KV cache | engine_configs.py:440-447 |
| `swap_space` | float >= 0.0 | 4 | CPU swap space in GiB for KV cache offload | engine_configs.py:448-455 |
| `cpu_offload_gb` | float >= 0.0 | 0 | CPU RAM in GiB to offload model weights to | engine_configs.py:456-463 |
| `block_size` | 8 \| 16 \| 32 | 16 | KV cache block size in tokens | engine_configs.py:469-475 |
| `kv_cache_dtype` | auto \| fp8 \| fp8_e5m2 \| fp8_e4m3 | auto | KV cache storage dtype; fp8 halves VRAM on Ampere+ | engine_configs.py:476-482 |
| `enforce_eager` | bool | false | Disable CUDA graphs, always use eager mode | engine_configs.py:488-494 |
| `enable_chunked_prefill` | bool | false | Chunk large prefills across scheduler iterations | engine_configs.py:495-501 |
| `max_num_seqs` | int >= 1 | 256 | Max concurrent sequences per scheduler iteration | engine_configs.py:507-514 |
| `max_num_batched_tokens` | int >= 1 | auto | Max tokens processed per scheduler iteration. Must be >= max_model_len when both are set. | engine_configs.py:515-522 |
| `max_model_len` | int >= 1 | model default | Max sequence length (input + output) | engine_configs.py:523-530 |
| `num_scheduler_steps` | int >= 1 | 1 | Multi-step scheduling: decode N steps before returning to scheduler | engine_configs.py:531-535 |
| `tensor_parallel_size` | int >= 1 | 1 | Number of GPUs to shard the model across | engine_configs.py:541-547 |
| `pipeline_parallel_size` | int >= 1 | 1 | Pipeline parallel stages | engine_configs.py:548-552 |
| `distributed_executor_backend` | mp \| ray | mp | Multi-GPU executor backend | engine_configs.py:553-556 |
| `enable_prefix_caching` | bool | false | Automatic prefix caching for shared prompt prefixes | engine_configs.py:562-565 |
| `quantization` | awq \| gptq \| fp8 \| fp8_e5m2 \| fp8_e4m3 \| marlin \| bitsandbytes | null | Quantisation method (requires pre-quantised checkpoint) | engine_configs.py:566-571 |
| `max_seq_len_to_capture` | int >= 1 | 8192 | Maximum sequence length eligible for CUDA graph capture | engine_configs.py:577-581 |
| `speculative_config.*` | sub-config | null | Speculative decoding (see below) | engine_configs.py:587-593 |
| `offload_group_size` | int >= 0 | 0 | Groups of layers for CPU offloading | engine_configs.py:599-603 |
| `offload_num_in_group` | int >= 1 | 1 | Layers offloaded per group | engine_configs.py:604-608 |
| `offload_prefetch_step` | int >= 0 | 1 | Prefetch steps ahead for CPU offload | engine_configs.py:609-613 |
| `offload_params` | list[str] | null | Specific parameter names to offload | engine_configs.py:614-617 |
| `disable_custom_all_reduce` | bool | false | Disable custom all-reduce for multi-GPU | engine_configs.py:623-626 |
| `kv_cache_memory_bytes` | int >= 1 | null | Absolute KV cache size in bytes; mutually exclusive with gpu_memory_utilization | engine_configs.py:632-639 |
| `compilation_config` | dict | null | Full passthrough to vLLM CompilationConfig (no validation) | engine_configs.py:645-651 |
| `attention.*` | sub-config | null | Attention backend selection (see below) | engine_configs.py:657-660 |

#### `vllm.engine.speculative_config:` sub-section

Mirrors vLLM's native `speculative_config` shape; unknown fields are forwarded (engine_configs.py:353).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `model` | str | null | Draft model name or path | engine_configs.py:355-358 |
| `num_speculative_tokens` | int >= 1 | null | Tokens to draft per speculative step | engine_configs.py:359-363 |
| `method` | str | null | Speculative method (e.g. draft_model, ngram, medusa, eagle) | engine_configs.py:364-371 |
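A sketch of an n-gram speculative decoding setup; `prompt_lookup_max` is a vLLM-side key that is not declared above and would pass through via `extra="allow"` (values illustrative):

```yaml
vllm:
  engine:
    speculative_config:
      method: ngram
      num_speculative_tokens: 4
      prompt_lookup_max: 4   # undeclared here; forwarded to vLLM unchanged
```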

#### `vllm.engine.attention:` sub-section

Maps to vLLM's `AttentionConfig`. All fields default to `null` (vLLM auto-selects). Unknown fields are forwarded (engine_configs.py:382).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `backend` | str | auto | Attention backend (flash_attn, flashinfer, ...) | engine_configs.py:384-387 |
| `flash_attn_version` | int | auto | Flash attention version | engine_configs.py:388-391 |
| `flash_attn_max_num_splits_for_cuda_graph` | int | auto | Max splits for CUDA graph with flash attention | engine_configs.py:392-395 |
| `use_prefill_decode_attention` | bool | true | Use prefill-decode attention | engine_configs.py:396-399 |
| `use_prefill_query_quantization` | bool | false | Quantise queries during prefill | engine_configs.py:400-403 |
| `use_cudnn_prefill` | bool | false | Use cuDNN for prefill | engine_configs.py:404-407 |
| `disable_flashinfer_prefill` | bool | false | Disable FlashInfer for prefill | engine_configs.py:408-411 |
| `disable_flashinfer_q_quantization` | bool | false | Disable FlashInfer query quantisation | engine_configs.py:412-415 |
| `use_trtllm_attention` | bool | false | Use TensorRT-LLM attention backend | engine_configs.py:416-419 |
| `use_trtllm_ragged_deepseek_prefill` | bool | false | Use TRT-LLM ragged DeepSeek prefill | engine_configs.py:420-423 |

### `vllm.sampling:` sub-section

Maps to `vllm.SamplingParams()`. `max_tokens` is intentionally absent; it is bridged from `task.max_output_tokens` at execution time (engine_configs.py:702-704). `top_k` follows vLLM's -1-for-disabled convention.

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `temperature` | float [0.0, 2.0] | vLLM default | Sampling temperature (0 = greedy) | engine_configs.py:710-712 |
| `top_k` | int | -1 | -1 = disabled (vLLM convention) | engine_configs.py:713-716 |
| `top_p` | float [0.0, 1.0] | 1.0 | Nucleus sampling threshold | engine_configs.py:717-719 |
| `repetition_penalty` | float [0.1, 10.0] | 1.0 | Repetition penalty | engine_configs.py:720-722 |
| `min_p` | float [0.0, 1.0] | null | Minimum probability filter | engine_configs.py:723-725 |
| `min_tokens` | int >= 0 | 0 | Minimum output tokens before EOS allowed | engine_configs.py:728-732 |
| `presence_penalty` | float [-2.0, 2.0] | 0.0 | Penalises tokens that appear at all | engine_configs.py:733-741 |
| `frequency_penalty` | float [-2.0, 2.0] | 0.0 | Penalises tokens proportional to frequency | engine_configs.py:742-750 |
| `ignore_eos` | bool | false | Continue generating past EOS (forces full max_tokens generation) | engine_configs.py:751-757 |
| `n` | int >= 1 | 1 | Number of output sequences per prompt | engine_configs.py:758-762 |
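For example, a sampling setup that exercises the vLLM-specific conventions (values illustrative):

```yaml
vllm:
  sampling:
    temperature: 0.8
    top_k: -1              # -1 disables top-k (vLLM convention)
    top_p: 0.95
    presence_penalty: 0.5
    ignore_eos: true       # generate the full budget bridged from task.max_output_tokens
```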

### `vllm.beam_search:` sub-section

Mutually exclusive with `vllm.sampling:`. When set, the engine uses `BeamSearchParams` instead of `SamplingParams` (engine_configs.py:842-850). `max_tokens` is bridged from `task.max_output_tokens`.

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `beam_width` | int >= 1 | vLLM default | Number of beams | engine_configs.py:779-783 |
| `length_penalty` | float | 1.0 | Length penalty (>1 favours longer, <1 shorter) | engine_configs.py:784-787 |
| `early_stopping` | bool | false | Stop when beam_width complete sequences are found | engine_configs.py:788-791 |

## TensorRT-LLM engine (`tensorrt:`)

TensorRT-LLM compiles a model into an optimised engine on first use; subsequent runs reuse the cached engine. Compile-time fields are baked into the engine, and changing one invalidates the cache. The nested sub-configs mirror TRT-LLM's own API split: `quant_config`, `kv_cache_config`, `scheduler_config`, `sampling` (engine_configs.py:991-995).

**Note on the internal `backend` field.** TRT-LLM's own `tensorrt.backend` field selects the runtime mode within TRT-LLM (`trt`, `pytorch`, `_autodeploy`). It is distinct from the top-level `engine:` selector that picks transformers / vLLM / TensorRT-LLM. The field name preserves TRT-LLM's native vocabulary (engine_configs.py:1069-1079).
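To make the distinction concrete, a sketch that pins both selectors explicitly:

```yaml
engine: tensorrt      # top-level selector: run on the TensorRT-LLM engine

tensorrt:
  backend: pytorch    # TRT-LLM-internal runtime: eager PyTorch path, no AOT compile
```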

### Compile-time parameters

Changing any of these triggers a fresh engine build.

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `max_batch_size` | int >= 1 | 8 | Maximum batch size the engine accepts | engine_configs.py:1025-1029 |
| `tensor_parallel_size` | int >= 1 | 1 | Number of GPUs to shard across | engine_configs.py:1030-1038 |
| `pipeline_parallel_size` | int >= 1 | 1 | Pipeline parallel stages | engine_configs.py:1039-1043 |
| `max_input_len` | int >= 1 | 1024 | Maximum input sequence length | engine_configs.py:1044-1048 |
| `max_seq_len` | int >= 1 | 2048 | Maximum total sequence length (input + output) | engine_configs.py:1049-1053 |
| `max_num_tokens` | int >= 1 | auto | Maximum tokens the engine handles per iteration | engine_configs.py:1054-1058 |
| `dtype` | float16 \| bfloat16 | auto | Model compute dtype. float32 is not supported. | engine_configs.py:1059-1064 |
| `fast_build` | bool | false | Reduced-optimisation build for faster compilation | engine_configs.py:1065-1068 |
| `backend` | trt \| pytorch \| _autodeploy | TRT-LLM auto-picks | TRT-LLM's internal runtime selector. trt is the AOT-compiled engine (best steady-state); pytorch is TRT-LLM's eager runtime (no compile, supports newer architectures); _autodeploy is experimental. Respects `TLLM_USE_TRT_ENGINE`. | engine_configs.py:1069-1079 |

### `tensorrt.quant_config:` sub-section

Quantisation is applied at engine compile time; changing any field here triggers a recompile. Uses TRT-LLM's native `QuantAlgo` enum names (engine_configs.py:858-887).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `quant_algo` | see below | null (no quantisation) | Quantisation algorithm | engine_configs.py:868-883 |
| `kv_cache_quant_algo` | FP8 \| INT8 | null | KV cache quantisation algorithm | engine_configs.py:884-887 |

Valid `quant_algo` values:

| Value | Description | Source |
| --- | --- | --- |
| `FP8` | FP8 weight + activation quantisation. Requires SM >= 8.9 (Ada Lovelace or Hopper); not supported on A100 (SM 8.0). | engine_configs.py:870 |
| `INT8` | INT8 smooth quantisation | engine_configs.py:871 |
| `W4A16_AWQ` | 4-bit AWQ weights, FP16 activations | engine_configs.py:872 |
| `W4A16_GPTQ` | 4-bit GPTQ weights, FP16 activations | engine_configs.py:873 |
| `W8A16` | 8-bit weights, FP16 activations | engine_configs.py:874 |
| `W8A16_GPTQ` | 8-bit GPTQ weights, FP16 activations | engine_configs.py:875 |
| `W4A8_AWQ` | 4-bit AWQ weights, INT8 activations | engine_configs.py:876 |
| `NO_QUANT` | Explicitly disable quantisation | engine_configs.py:877 |

### `tensorrt.kv_cache_config:` sub-section

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `enable_block_reuse` | bool | false | Enable KV cache block reuse across requests | engine_configs.py:895-898 |
| `free_gpu_memory_fraction` | float [0.0, 1.0] | 0.9 | Fraction of free GPU memory to allocate for KV cache | engine_configs.py:899-904 |
| `max_tokens` | int >= 1 | auto | Maximum total tokens in the KV cache | engine_configs.py:905-909 |
| `host_cache_size` | int >= 0 | 0 | Host (CPU) cache size in bytes for KV cache offloading (0 = disabled) | engine_configs.py:910-914 |

### `tensorrt.scheduler_config:` sub-section

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `capacity_scheduling_policy` | GUARANTEED_NO_EVICT \| MAX_UTILIZATION \| STATIC_BATCH | GUARANTEED_NO_EVICT | Scheduling capacity policy | engine_configs.py:922-932 |

Policy semantics:

- `GUARANTEED_NO_EVICT` - guarantees no request eviction; may reduce throughput.
- `MAX_UTILIZATION` - maximises GPU utilisation; may evict requests under memory pressure.
- `STATIC_BATCH` - fixed batch size; useful for reproducible benchmarking.

### `tensorrt.sampling:` sub-section

Maps to `tensorrt_llm.SamplingParams`. `top_k` uses TRT-LLM's 0-for-disabled convention (matching HuggingFace, not vLLM) (engine_configs.py:935-981).

| Field | Type | Default | Description | Source |
| --- | --- | --- | --- | --- |
| `temperature` | float [0.0, 2.0] | TRT default | Sampling temperature (0 = greedy) | engine_configs.py:949-951 |
| `top_k` | int >= 0 | TRT default | TRT-LLM convention: 0 = disabled | engine_configs.py:952-956 |
| `top_p` | float [0.0, 1.0] | 1.0 | Nucleus sampling threshold | engine_configs.py:957-959 |
| `repetition_penalty` | float [0.1, 10.0] | 1.0 | Repetition penalty | engine_configs.py:960-962 |
| `min_p` | float [0.0, 1.0] | null | Minimum probability filter | engine_configs.py:963-965 |
| `min_tokens` | int >= 0 | 0 | Minimum output tokens before EOS allowed | engine_configs.py:968-972 |
| `n` | int >= 1 | 1 | Number of output sequences per prompt | engine_configs.py:973-977 |
| `ignore_eos` | bool | false | Continue generating past EOS | engine_configs.py:978-981 |

## Validation rules

The Pydantic models enforce these rules at config-load time. The full catalogue of invalid combinations, including engine-runtime invariants mined from upstream libraries, is in `invalid-combos.md`.

### Cross-engine

- The engine-specific section must match `engine:` (models.py:441-464).
- `passthrough_kwargs` keys must not collide with `ExperimentConfig` field names (models.py:466-481).

### Transformers

- `load_in_4bit` and `load_in_8bit` are mutually exclusive (engine_configs.py:291-298).
- `torch_compile_mode` and `torch_compile_backend` require `torch_compile=true` (engine_configs.py:300-307).
- `bnb_4bit_*` fields require `load_in_4bit=true` (engine_configs.py:309-318).
- `cache_implementation` requires `use_cache` to be true or unset (engine_configs.py:320-327).
- `tp_plan` and `device_map` are mutually exclusive (engine_configs.py:329-337).
- `attn_implementation: flash_attention_2`/`flash_attention_3` requires dtype `float16` or `bfloat16` (models.py:487-508).

### vLLM

- `kv_cache_memory_bytes` and `gpu_memory_utilization` are mutually exclusive (engine_configs.py:666-674).
- `max_num_batched_tokens` must be >= `max_model_len` when both are set (engine_configs.py:676-689).
- `beam_search` and `sampling` sections are mutually exclusive (engine_configs.py:842-850).
- `dtype: float32` is rejected by the field's Literal type (engine_configs.py:817-823).

### TensorRT-LLM

- `dtype: float32` is rejected by the field's Literal type (engine_configs.py:1059-1064).
- `quant_algo: FP8` requires SM >= 8.9 (Ada Lovelace or Hopper); not supported on A100 (SM 8.0). This is a hardware-side check that surfaces as a runtime error, not a config-load error.

## Engine x dtype matrix

From `ssot.DTYPE_SUPPORT` (ssot.py:156-160):

| Engine | float32 | float16 | bfloat16 |
| --- | --- | --- | --- |
| transformers | yes | yes | yes |
| vllm | no | yes | yes |
| tensorrt | no | yes | yes |

## Worked examples

### Minimal Transformers experiment

```yaml
task:
  model: gpt2

engine: transformers
```

All other fields fall back to defaults: the `aienergyscore` dataset, 100 prompts, 256-token input/output caps, `bfloat16`, `sdpa` attention, `auto` energy sampler.

### Transformers with 4-bit quantisation and FlashAttention

```yaml
task:
  model: meta-llama/Llama-2-7b-hf
  dataset:
    n_prompts: 50

engine: transformers

transformers:
  dtype: bfloat16
  batch_size: 4
  load_in_4bit: true
  bnb_4bit_compute_dtype: bfloat16
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true
  attn_implementation: flash_attention_2
```

### vLLM with prefix caching and FP8 KV cache

```yaml
task:
  model: meta-llama/Llama-2-7b-hf
  max_input_tokens: 1024
  max_output_tokens: 256

engine: vllm

vllm:
  dtype: bfloat16
  engine:
    gpu_memory_utilization: 0.9
    enable_prefix_caching: true
    kv_cache_dtype: fp8
    block_size: 16
  sampling:
    temperature: 0.0
    n: 1
```

### vLLM with beam search

`vllm.beam_search:` replaces `vllm.sampling:`; the two are mutually exclusive.

```yaml
task:
  model: gpt2

engine: vllm

vllm:
  engine:
    enforce_eager: false
  beam_search:
    beam_width: 4
    length_penalty: 1.0
    early_stopping: false
```

### TensorRT-LLM with AWQ quantisation

```yaml
task:
  model: meta-llama/Llama-2-7b-hf

engine: tensorrt

tensorrt:
  dtype: bfloat16
  max_batch_size: 8
  max_input_len: 1024
  max_seq_len: 2048
  tensor_parallel_size: 1
  quant_config:
    quant_algo: W4A16_AWQ
  kv_cache_config:
    free_gpu_memory_fraction: 0.9
    enable_block_reuse: true
  scheduler_config:
    capacity_scheduling_policy: GUARANTEED_NO_EVICT
```

### Sampling preset (preset values merged into the engine's sampling section)

```yaml
task:
  model: gpt2
engine: transformers
sampling_preset: deterministic  # sets temperature: 0.0 under transformers.sampling
```

`sampling_preset` is expanded at parse time into the active engine's `sampling:` sub-section. Explicit YAML values take precedence over preset values (models.py:405-433).
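As a sketch of the precedence rule, the explicit `temperature` below wins over whatever the preset would set for that key, while the preset's remaining keys still apply (the preset contents live in ssot.py:27-32):

```yaml
task:
  model: gpt2
engine: transformers
sampling_preset: creative

transformers:
  sampling:
    temperature: 0.4   # explicit value overrides the preset's temperature
```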

## See also