
TensorRT-LLM Engine Schema

Engine version: 0.21.0
Discovered at: 2026-05-06T20:20:57+02:00
Discovery method: TrtLlmArgs.model_json_schema() + dataclasses.fields(SamplingParams)
Schema version: 1.0.0

Summary: 60 engine parameters, 47 sampling parameters.

Discovery limitations

  • engine_params - BuildConfig is not a Pydantic model; it appears as Optional[object] in the schema. Affected field: build_config.
  • sampling_params - SamplingParams is a dataclass, so no per-field descriptions are available.
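
The snippet below is a minimal sketch of the discovery step named above. It assumes TensorRT-LLM 0.21.0 is installed; the TrtLlmArgs import path is an assumption and may differ across releases.

```python
import dataclasses

from tensorrt_llm import SamplingParams
from tensorrt_llm.llmapi.llm_args import TrtLlmArgs  # import path is an assumption

# TrtLlmArgs is a Pydantic v2 model, so it can emit a JSON schema.
# BuildConfig is not a Pydantic model, which is why `build_config`
# surfaces only as Optional[object] here.
engine_schema = TrtLlmArgs.model_json_schema()
print(len(engine_schema["properties"]))  # expected: 60

# SamplingParams is a plain dataclass: names, types, and defaults
# are recoverable, but there are no per-field descriptions.
sampling_fields = dataclasses.fields(SamplingParams)
print(len(sampling_fields))  # expected: 47
```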

Engine Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| model | string | - | The path to the model checkpoint or the model name from the Hugging Face Hub. |
| tokenizer | `string \| None` | - | |
| tokenizer_mode | Literal['auto', 'slow'] | auto | The mode to initialize the tokenizer. |
| skip_tokenizer_init | boolean | false | Whether to skip the tokenizer initialization. |
| trust_remote_code | boolean | false | Whether to trust the remote code. |
| tensor_parallel_size | integer | 1 | The tensor parallel size. |
| dtype | string | auto | The data type to use for the model. |
| revision | `string \| None` | - | |
| tokenizer_revision | `string \| None` | - | |
| pipeline_parallel_size | integer | 1 | The pipeline parallel size. |
| context_parallel_size | integer | 1 | The context parallel size. |
| gpus_per_node | `integer \| None` | - | |
| moe_cluster_parallel_size | `integer \| None` | - | |
| moe_tensor_parallel_size | `integer \| None` | - | |
| moe_expert_parallel_size | `integer \| None` | - | |
| enable_attention_dp | boolean | false | Enable attention data parallel. |
| cp_config | `object \| None` | - | |
| load_format | Literal['auto', 'dummy'] | auto | The format to load the model. |
| enable_lora | boolean | false | Enable LoRA. |
| ⚠️ max_lora_rank | `integer \| None` | - | |
| ⚠️ max_loras | integer | 4 | The maximum number of LoRA adapters. |
| ⚠️ max_cpu_loras | integer | 4 | The maximum number of LoRA adapters on CPU. |
| lora_config | `LoraConfig \| None` | - | |
| enable_prompt_adapter | boolean | false | Enable prompt adapter. |
| max_prompt_adapter_token | integer | 0 | The maximum number of prompt adapter tokens. |
| quant_config | `QuantConfig \| None` | - | |
| kv_cache_config | KvCacheConfig | - | KV cache config. |
| enable_chunked_prefill | boolean | false | Enable chunked prefill. |
| guided_decoding_backend | `string \| None` | - | |
| batched_logits_processor | Optional[tensorrt_llm.sampling_params.BatchedLogitsProcessor] | - | Batched logits processor. |
| iter_stats_max_iterations | `integer \| None` | - | |
| request_stats_max_iterations | `integer \| None` | - | |
| peft_cache_config | `PeftCacheConfig \| None` | - | |
| scheduler_config | SchedulerConfig | - | Scheduler config. |
| cache_transceiver_config | `CacheTransceiverConfig \| None` | - | |
| speculative_config | `LookaheadDecodingConfig \| MedusaDecodingConfig \| EagleDecodingConfig \| …` | | |
| batching_type | `BatchingType \| None` | - | |
| normalize_log_probs | boolean | false | Normalize log probabilities. |
| max_batch_size | `integer \| None` | - | |
| max_input_len | `integer \| None` | - | |
| max_seq_len | `integer \| None` | - | |
| max_beam_width | `integer \| None` | - | |
| max_num_tokens | `integer \| None` | - | |
| gather_generation_logits | boolean | false | Gather generation logits. |
| num_postprocess_workers | integer | 0 | The number of processes used for postprocessing the generated tokens, including detokenization. |
| postprocess_tokenizer_dir | `string \| None` | - | |
| reasoning_parser | `string \| None` | - | |
| garbage_collection_gen0_threshold | integer | 20000 | Threshold for Python garbage collection of generation-0 objects. Lower values trigger more frequent garbage collection. |
| ⚠️ decoding_config | Optional[DecodingConfig] | - | The decoding config. |
| backend | `string \| None` | - | |
| ⚠️ auto_parallel | boolean | false | Enable auto parallel mode. |
| ⚠️ auto_parallel_world_size | `integer \| None` | - | |
| enable_tqdm | boolean | false | Enable the tqdm progress bar. |
| workspace | `string \| None` | - | |
| enable_build_cache | Union[tensorrt_llm.llmapi.build_cache.BuildCacheConfig, bool] | false | Enable build cache. |
| extended_runtime_perf_knob_config | `ExtendedRuntimePerfKnobConfig \| None` | - | |
| calib_config | `CalibConfig \| None` | - | |
| embedding_parallel_mode | string | SHARDING_ALONG_VOCAB | The embedding parallel mode. |
| fast_build | boolean | false | Enable fast build. |
| build_config | Optional[tensorrt_llm.builder.BuildConfig] | - | Build config. |
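
As a rough sketch, these engine parameters map onto keyword arguments of the LLM API constructor. The model name below is illustrative, and the exact constructor surface may vary by release.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# A few engine parameters from the table, passed as LLM(...) kwargs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # `model`: HF Hub name or checkpoint path
    tensor_parallel_size=1,                    # `tensor_parallel_size`
    dtype="auto",                              # `dtype`
    trust_remote_code=False,                   # `trust_remote_code`
    kv_cache_config=KvCacheConfig(             # `kv_cache_config`
        free_gpu_memory_fraction=0.9,
    ),
)
```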

Sampling Parameters

| Field | Type | Default | Description |
|---|---|---|---|
| end_id | `int \| None` | - | |
| pad_id | `int \| None` | - | |
| max_tokens | int | 32 | |
| bad | `str \| list[str] \| None` | | |
| bad_token_ids | `list[int] \| None` | - | |
| stop | `str \| list[str] \| None` | | |
| stop_token_ids | `list[int] \| None` | - | |
| include_stop_str_in_output | bool | false | |
| embedding_bias | `Tensor \| None` | - | |
| logits_processor | `LogitsProcessor \| list[LogitsProcessor] \| None` | | |
| apply_batched_logits_processor | bool | false | |
| n | int | 1 | |
| best_of | `int \| None` | - | |
| use_beam_search | bool | false | |
| top_k | `int \| None` | - | |
| top_p | `float \| None` | - | |
| top_p_min | `float \| None` | - | |
| top_p_reset_ids | `int \| None` | - | |
| top_p_decay | `float \| None` | - | |
| seed | `int \| None` | - | |
| temperature | `float \| None` | - | |
| min_tokens | `int \| None` | - | |
| beam_search_diversity_rate | `float \| None` | - | |
| repetition_penalty | `float \| None` | - | |
| presence_penalty | `float \| None` | - | |
| frequency_penalty | `float \| None` | - | |
| length_penalty | `float \| None` | - | |
| early_stopping | `int \| None` | - | |
| no_repeat_ngram_size | `int \| None` | - | |
| min_p | `float \| None` | - | |
| beam_width_array | `list[int] \| None` | - | |
| logprobs | `int \| None` | - | |
| prompt_logprobs | `int \| None` | - | |
| return_context_logits | bool | false | |
| return_generation_logits | bool | false | |
| exclude_input_from_output | bool | true | |
| return_encoder_output | bool | false | |
| return_perf_metrics | bool | false | |
| additional_model_outputs | `list[AdditionalModelOutput] \| None` | - | |
| lookahead_config | `LookaheadDecodingConfig \| None` | - | |
| guided_decoding | `GuidedDecodingParams \| None` | - | |
| ignore_eos | bool | false | |
| detokenize | bool | true | |
| add_special_tokens | bool | true | |
| truncate_prompt_tokens | `int \| None` | - | |
| skip_special_tokens | bool | true | |
| spaces_between_special_tokens | bool | true | |
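
The sketch below builds a SamplingParams instance from fields in the table above. The values are illustrative, not recommended settings.

```python
from tensorrt_llm import SamplingParams

params = SamplingParams(
    max_tokens=64,    # table default is 32
    temperature=0.8,
    top_p=0.95,
    stop=["###"],     # `stop`: str | list[str] | None
    seed=1234,
)
# outputs = llm.generate(["Hello, my name is"], params)
# (assumes the `llm` object from the engine parameters example)
```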