For serving open-weights models

If you run open-weights inference - on your own hardware, rented GPUs, or a CI runner - LLenergyMeasure measures the efficiency of your serving stack. The same methodology that supports academic comparisons applies to evaluating and tuning your own deployment.


What you can answer

For your model on your hardware: which configuration choices most reduce energy and per-token cost. Implementation parameters are discovered programmatically from each engine, not curated to a shortlist, so the answer space stays as wide as the engine's own configuration surface.
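As a rough illustration of what programmatic discovery can look like - a sketch of the general technique, not LLenergyMeasure's actual mechanism - the snippet below enumerates an engine's configuration surface by introspection. It assumes vLLM exposes its engine arguments as a dataclass named EngineArgs at vllm.engine.arg_utils, which is true of recent versions but may move between releases.

```python
# Sketch of programmatic parameter discovery (not LLenergyMeasure's
# actual mechanism). Assumes vLLM's EngineArgs is a dataclass
# importable from vllm.engine.arg_utils; the path may vary by version.
import dataclasses

from vllm.engine.arg_utils import EngineArgs


def discover_parameters(engine_args_cls: type = EngineArgs) -> dict:
    """Map every engine argument name to its declared type and default."""
    return {
        f.name: (f.type, f.default)
        for f in dataclasses.fields(engine_args_cls)
    }


# Each discovered field is a candidate axis for an efficiency sweep, so
# the search space tracks the engine's own configuration surface rather
# than a hand-curated shortlist.
for name, (ftype, default) in sorted(discover_parameters().items()):
    print(f"{name}: type={ftype!r}, default={default!r}")
```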


Where to start


Reading the outputs for deployment decisions

Academic comparisons emphasise relative numbers: is engine A more efficient than engine B for this model? Deployment decisions also need absolute numbers: joules per token converts into watts at your serving load, which in turn converts into electricity cost and capacity planning.

mj_per_tok_adjusted is the most portable per-token figure: idle GPU draw is subtracted, so what remains is the energy your inference work itself is responsible for. Multiply by your daily token volume for a deployment-shaped energy figure. The full output-reading guide is at how-to: interpret results.
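As a back-of-envelope sketch, the conversion chain might look like the following. Only mj_per_tok_adjusted comes from the tool's output; the serving load, token volume, and tariff are made-up inputs you would replace with your own.

```python
# Back-of-envelope conversion from a measured per-token figure to
# deployment numbers. All inputs except mj_per_tok_adjusted are
# hypothetical placeholders.

mj_per_tok_adjusted = 85.0    # millijoules/token, from the tool's output
tokens_per_second = 400       # steady serving load (assumed)
tokens_per_day = 30_000_000   # daily token volume (assumed)
price_per_kwh = 0.15          # electricity tariff (assumed)

joules_per_token = mj_per_tok_adjusted / 1000.0

# Power attributable to inference at your serving load: J/tok * tok/s = W
watts = joules_per_token * tokens_per_second

# Daily energy and electricity cost attributable to inference
# (1 kWh = 3.6e6 J).
kwh_per_day = joules_per_token * tokens_per_day / 3.6e6
cost_per_day = kwh_per_day * price_per_kwh

print(f"{watts:.0f} W at load, {kwh_per_day:.2f} kWh/day, "
      f"{cost_per_day:.2f} per day")
```

With these placeholder inputs the arithmetic gives roughly 34 W of inference-attributable draw and under one kWh per day; the point is the shape of the calculation, not the numbers.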


Operator-specific guidance is on the roadmap

Worked examples for the questions a serving team actually asks - "which quant for Llama-3?", "is migrating to vLLM worth it for my workload?", "how much energy does my reasoning model spend on tokens the user never sees?" - are not yet written. The broader product-positioning conversation is tracked in issue #626.