Running on a single or consumer GPU

LLenergyMeasure is designed for multi-GPU datacentre hardware, but many researchers work with a single consumer card. This page covers the key knobs for making runs productive on 8-24 GB VRAM, and is honest about where the team has not yet validated specific numbers.


Choosing a model that fits your VRAM

Rule of thumb: in full precision (float32) a model needs approximately 4 GB of VRAM per billion parameters for weights alone (e.g. a 7B model needs ~28 GB). In float16/bfloat16 this halves to ~2 GB per billion parameters; with 4-bit quantisation it drops to ~0.5 GB per billion.

VRAM      Float16 / BF16 limit    4-bit quantised limit
8 GB      ~3B parameters          ~13B parameters
16 GB     ~7B parameters          ~30B parameters
24 GB     ~12B parameters         ~45B parameters

TODO: needs hardware capture on 24 GB / 16 GB / 8 GB cards - the numbers above are rule-of-thumb estimates and do not account for KV-cache, activations, or framework overhead. Validate on RTX 4090, RTX 4080, and RTX 3080 before treating as exact.

If a model does not fit, LLenergyMeasure raises an OOM error before measurement starts (on Transformers) or refuses to initialise (on vLLM). Either way, no partial measurement is wasted.


Reducing n_prompts for shorter runs

The default task.dataset.n_prompts: 100 runs 100 prompts through the engine. On a fast datacentre GPU this takes 30-90 seconds. On a consumer card with a large model it may take several minutes.

For quick iteration, reduce to 10-25 prompts:

task:
  dataset:
    n_prompts: 10

Or via the CLI flag:

llem run study.yaml --n-prompts 10

Fewer prompts reduce statistical stability. For publication-quality measurements, use at least 50-100 prompts and set study_execution.n_cycles: 3 to run the full workload three times and report the median.
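
For example, a publication-style setting might look like the sketch below (this assumes study_execution is a top-level block in study.yaml, as the dotted path above suggests):

task:
  dataset:
    n_prompts: 100      # keep the full default for stable statistics

study_execution:
  n_cycles: 3           # repeat the full workload three times; report the median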


Quantisation options

Quantisation reduces VRAM by compressing model weights to lower precision. The available options depend on the engine:

Transformers engine - BitsAndBytes:

engine: transformers
transformers:
  load_in_4bit: true    # ~0.5 GB per billion params
  # or
  load_in_8bit: true    # ~1 GB per billion params

4-bit and 8-bit are mutually exclusive. 4-bit uses BitsAndBytes quantisation applied at load time (with an optional NF4 data type); it does not require an AWQ/GPTQ-style pre-quantised checkpoint. Energy measurements on quantised models reflect real inference cost - the GPU still executes the dequantised computation.

vLLM engine - pre-quantised checkpoints:

engine: vllm
vllm:
  engine:
    quantization: awq   # or gptq

vLLM requires a pre-quantised model checkpoint (e.g. TheBloke/*-AWQ on HuggingFace Hub). It does not quantise on the fly.
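
As a sketch, a run against a published AWQ checkpoint could look like this (the model field name is an assumption here; check Reference: engine configuration for the exact key):

engine: vllm
model: TheBloke/Llama-2-7B-AWQ   # pre-quantised checkpoint from the Hub (field name assumed)
vllm:
  engine:
    quantization: awq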

TensorRT-LLM engine - build-time quantisation:

engine: tensorrt
tensorrt:
  quant_config:
    quant_algo: W4A16_AWQ   # or INT8, W4A16_GPTQ, W8A16

TRT-LLM compiles a quantised engine at build time. FP8 requires compute capability SM 8.9 or newer (Ada Lovelace, Hopper); it is NOT supported on A100 or on consumer cards older than the RTX 40-series.

For full quantisation options and valid combinations, see Reference: engine configuration.


Thermal stabilisation on consumer cards

Consumer GPUs (RTX series) have smaller heatsinks and higher thermal density than datacentre cards (A100, H100). The warmup and thermal-floor wait behaviour differs in practice:

  • Shorter time-to-throttle: consumer cards often reach thermal limits within 60-90 seconds of sustained load, whereas A100s can sustain full load indefinitely with adequate rack cooling.
  • More aggressive boost clocking: consumer GPUs boost above their base clock and then throttle back. This creates more latency variance during warmup, which means the "warmup did not converge" warning is more common on consumer hardware.
  • Thermal floor wait: the default warmup.thermal_floor_seconds: 60.0 may be too short on a system that throttled during warmup. If you see persistent convergence failures, try 90-120 seconds:

warmup:
  thermal_floor_seconds: 90.0
  max_prompts: 20

For the "Warmup did not converge" warning and what it means for your results, see Methodology: measurement warnings.

TODO: validate specific thermal profiles on RTX 4090 / 4080 / 3080. Document typical time-to-throttle and recommended thermal_floor_seconds per card class.


Engine trade-offs at low VRAM

Not all engines are equal at low VRAM. A short guide:

Transformers - lowest barrier. Runs in Docker with --gpus all or locally if you have PyTorch installed. Supports BitsAndBytes quantisation. Slowest throughput at batch_size=1 (the low-VRAM setting). Good for model compatibility and flexibility.

vLLM - higher throughput via continuous batching and paged attention, but pre-allocates most VRAM for the KV cache at startup. Reduce vllm.engine.gpu_memory_utilization (e.g. 0.7) to leave headroom:

vllm:
  engine:
    gpu_memory_utilization: 0.7
    max_model_len: 2048   # also caps KV-cache size

TensorRT-LLM - highest throughput but requires 5-15 minutes of engine compilation on first run. Most beneficial for models you will run repeatedly. Less flexible than Transformers for ad-hoc measurement.

Rule of thumb for consumer hardware:

  • Quick model survey: use Transformers with 4-bit quantisation (see the combined sketch after this list).
  • Sustained throughput measurement: use vLLM with reduced gpu_memory_utilization.
  • Repeated runs of a fixed config: invest in TRT-LLM compilation.
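
Putting these together, a minimal quick-survey sketch for a single consumer card, using only fields shown on this page:

engine: transformers
transformers:
  load_in_4bit: true            # ~0.5 GB per billion params

task:
  dataset:
    n_prompts: 10               # quick iteration; raise to 50-100 for reportable numbers

warmup:
  thermal_floor_seconds: 90.0   # longer floor for throttle-prone consumer cards
  max_prompts: 20

Run it with llem run study.yaml, optionally overriding --n-prompts on the command line.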

TODO: document observed throughput ratios between engines on consumer cards (RTX 4090 Transformers vs vLLM for 7B models). Target: concrete numbers from hardware capture.


See also