Master of Data Science for Public Policy Thesis, 2025 · Hertie School of Governance · Advisor: Prof. Lynn Kaack · Data Science Thesis Award 2025
Download Thesis (PDF) · GitHub: LLenergyMeasure
The adoption of large language models across digital services has led to increased scrutiny of their energy costs. Data-centre demand from AI workloads is projected to more than quadruple by 2030, potentially accounting for 9-12% of total energy demand in the US. Crucially, inference-time consumption now dominates AI energy usage: Google and Meta report that 60-70% of their AI-driven energy consumption is inference-related, and Amazon Web Services attributes 80-90% of its ML cloud compute to inference. Accurately estimating and reducing inference-time energy is therefore vital to broader AI sustainability efforts.
Within the ML community, attention to AI's environmental impact has grown substantially. The Green AI movement, formalised as an efficiency-oriented alternative to the prevailing Red AI paradigm of "performance at any cost," has catalysed research quantifying the environmental impacts of machine learning. Complementary Sustainable AI frameworks advocate for holistic, lifecycle-wide approaches to AI sustainability, spanning training, inference, and deployment.
A common simplifying assumption is to take the number of FLOPs (floating-point operations) required per generated token as a proxy for inference-time energy costs. This assumption is closely aligned with parameter-counting heuristics that dominate model selection processes and continues to influence emerging AI policy discourse.
However, FLOPs are analytically computed as deterministic functions of model architecture and input-output characteristics:
FLOPs = f(number of parameters, input length, output length)
FLOPs quantify the number of arithmetic operations required to generate a given output, but they do not capture the energy efficiency with which those operations are executed. While correlated, FLOPs and inference-time energy consumption are conceptually distinct: energy consumption is shaped both by the computational workload itself and by how efficiently the hardware executes it.
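As a concrete sketch, a common approximation for decoder-only transformers charges roughly 2 FLOPs per parameter per generated token, plus an attention term that grows with context length. The constants and example figures below are illustrative assumptions, not values from the thesis:

```python
def flops_per_token(n_params: int, n_layers: int, d_model: int, context_len: int) -> int:
    """Approximate forward-pass FLOPs to generate one token.

    Uses the standard estimate of ~2 FLOPs per parameter (one multiply,
    one add per weight), plus an attention term that scales with the
    current context length. Purely analytical: nothing here depends on
    batching, precision, or hardware.
    """
    dense = 2 * n_params                               # matrix-multiply FLOPs
    attention = 4 * n_layers * d_model * context_len   # rough attention-score cost
    return dense + attention

# Illustrative numbers loosely in the range of a ~1B-parameter model
print(f"{flops_per_token(1_000_000_000, 16, 2048, 500):.3e}")
```

The function is deterministic in architecture and sequence length, which is precisely why FLOPs cannot register deployment-level differences.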
The core insight: two systems with the same theoretical compute requirement may exhibit markedly different energy profiles. Implementation-level factors (how models are deployed) can induce substantial variation in energy consumption, even when FLOP counts remain constant.
To illustrate: the same model, with identical FLOPs-per-token, can exhibit dramatically different energy profiles:
| Configuration | FLOPs per Token | Energy per Token | Ratio |
|---|---|---|---|
| Single GPU, batch=1, FP32 | X | 1.0× (baseline) | baseline |
| Single GPU, batch=32, FP16 | X | 0.15× | 6.7× more efficient |
| 4 GPUs, batch=1, FP32 | X | 4.2× | 4.2× less efficient |
Same model, same FLOPs, wildly different energy costs. How you deploy matters as much as what you deploy.
FLOP-counting narrows policy attention to immutable model attributes, conceptually restricting the scope of intervention to the moment of model selection. This framing overlooks downstream system-level implementation decisions and tradeoffs that shape the energy efficiency of real-world deployments. From a benchmarking perspective, the neglect of implementation-level variation translates to a lack of standardised test-time controls, creating opportunities for motivated actors to present artificially efficient performance metrics by testing under unrealistic configurations.
This study presents an empirical analysis of LLM inference-time efficiency, holding computational workload constant (ensuring fixed FLOPs-per-token) and measuring energy-per-token under a range of deployment configurations.
The configuration space is explored through a within-model grid search across 2,112 unique configurations:
| Component | Details |
|---|---|
| Models | LLaMA 3.2-1B and LLaMA 3.2-3B (lightweight, open-weight LLMs) |
| Task | Causal language modelling with fixed 500-token input/output sequences |
| Dataset | 128 prompts from Hugging Face's AI Energy Score benchmark (WikiText, OSCAR, UltraChat) |
The grid search varied implementation parameters selected for their accessibility to LLM service providers and prior evidence of energy effects: batch size, tensor parallelism, numerical precision, decoding strategy, and simulated request latency.
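A configuration grid of this kind can be enumerated with `itertools.product`. The parameter levels below are hypothetical placeholders for illustration, not the thesis's actual grid (which comprised 2,112 configurations):

```python
from itertools import product

# Hypothetical parameter levels -- placeholders, not the thesis's actual grid
grid = {
    "model": ["Llama-3.2-1B", "Llama-3.2-3B"],
    "batch_size": [1, 2, 4, 8, 16, 32],
    "tensor_parallel": [1, 2, 3, 4],
    "precision": ["fp32", "fp16", "int8", "int4"],
    "decoding": ["greedy", "sampling"],
}

# Cartesian product of all levels -> one dict per unique configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
```

Each resulting dict is a complete, runnable deployment specification, which is what makes a within-model comparison at fixed FLOPs-per-token possible.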
All experiments ran on a shared single-node server.
Energy measurements were obtained via the CodeCarbon software profiler, reflecting the emerging standard in ML energy analysis (±10-15% accuracy). The full configuration space was executed twice, in non-contiguous experimental cycles over separate weeks, to mitigate environmental noise from operating on a shared server.
Each run was initiated via a fresh command-line invocation to fully reinitialise the distributed environment, with three dummy forward passes to trigger lazy initialisations (discarded from measurement).
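The per-run protocol (warm-up passes excluded from measurement, energy normalised by tokens generated) can be sketched as below. `run_inference` and `read_energy_kwh` are hypothetical hooks standing in for the model call and a CodeCarbon-style cumulative energy counter; they are not APIs from the thesis codebase:

```python
def run_measured(config, n_warmup=3, n_prompts=128,
                 run_inference=None, read_energy_kwh=None):
    """Measure energy-per-token (kWh) for one deployment configuration.

    run_inference(config) runs one prompt and returns tokens generated;
    read_energy_kwh() returns cumulative energy consumed so far.
    Warm-up passes trigger lazy initialisations and are discarded.
    """
    for _ in range(n_warmup):            # dummy forward passes, not measured
        run_inference(config)

    start = read_energy_kwh()            # counter snapshot after warm-up
    tokens = 0
    for _ in range(n_prompts):
        tokens += run_inference(config)

    return (read_energy_kwh() - start) / tokens
```

Snapshotting the counter only after warm-up keeps one-off initialisation costs out of the steady-state energy-per-token figure.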
Through a comprehensive grid search of implementation parameters, this research demonstrates substantial variability in energy efficiency:
| Model | Mean (kWh/token) | Max/Min Fold | 95/5 Percentile Fold | CV (%) |
|---|---|---|---|---|
| LLaMA-3.2-1B | 1.44×10⁻⁶ | 516.5× | 61.0× | 127.7% |
| LLaMA-3.2-3B | 2.64×10⁻⁶ | 293.5× | 51.0× | 123.5% |
Both models exhibit positively skewed energy outcome distributions, with long right tails capturing high-energy outliers. While the two models show considerable overlap in their energy profiles (highlighting that deployment choices can outweigh architectural complexity), the 3B model incurs a higher median energy-per-token.
Notably, when normalised by each model's mean, the 1B model's coefficient of variation (127.7%) exceeds that of the 3B (123.5%), indicating that smaller models can be just as sensitive to implementation-level decisions.
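The dispersion metrics in the table (fold ranges and coefficient of variation) reduce to a short helper; the nearest-rank percentile lookup here is a simplification of the interpolated percentiles a statistics library would use:

```python
import statistics

def dispersion(samples):
    """Summarise variability in a list of energy-per-token measurements."""
    s = sorted(samples)
    mean = statistics.fmean(s)
    cv = statistics.pstdev(s) / mean * 100   # coefficient of variation, %
    fold_max_min = s[-1] / s[0]              # max/min fold
    p5 = s[int(0.05 * (len(s) - 1))]         # crude nearest-rank percentiles
    p95 = s[int(0.95 * (len(s) - 1))]
    return mean, cv, fold_max_min, p95 / p5
```

The 95/5 percentile fold is the more robust of the two ratios, since the max/min fold is driven entirely by the two most extreme configurations.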
Distribution of energy outcomes showing substantial variability within each model.
While the explored parameter space includes many impractical configurations, simulating six plausible deployment scenarios yields more actionable figures:
| Scenario Type | 1B Fold Range | 3B Fold Range |
|---|---|---|
| Realistic deployments | 4.3× (CV: 61%) | 4.9× (CV: 58%) |
Even within realistic production constraints, implementation choices induce 4-5× variation in energy-per-token.
To contextualise what these energy costs mean in practice, the table below compares the number of 300-token LLM responses equivalent to familiar energy expenditures (for the 3B model):
| Appliance | Usage | Least Efficient | Most Efficient |
|---|---|---|---|
| iPhone 16 | Full charge | 7 responses | 19 responses |
| MacBook Pro (M3) | Full charge | 40 responses | 116 responses |
| WiFi Router | 24 hours | 64 responses | 186 responses |
| HD Streaming | 1 hour | 35 responses | 103 responses |
| Google Search | 1 query | 0.13 responses | 0.39 responses |
| Electric Kettle | 1 litre boil | 67 responses | 129 responses |
Note: These figures are illustrative, based on estimated energy values for consumer appliances.
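These equivalences reduce to a one-line conversion. In the sketch below, the 0.07 kWh laptop-charge figure is an assumed illustrative value, while 2.64×10⁻⁶ kWh/token is the 3B model's mean energy-per-token reported in the results:

```python
def equivalent_responses(appliance_kwh: float, energy_per_token_kwh: float,
                         tokens_per_response: int = 300) -> float:
    """How many 300-token LLM responses fit in one appliance's energy budget?"""
    return appliance_kwh / (energy_per_token_kwh * tokens_per_response)

# Assumed ~0.07 kWh laptop charge vs. the 3B model's mean energy-per-token
print(round(equivalent_responses(0.07, 2.64e-6), 1))
```

Note that a more efficient configuration (lower kWh/token) yields a larger number of responses per unit of appliance energy, which is why the "Most Efficient" column holds the larger figures.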
Increasing the number of processes over which model layers are distributed leads to a moderately super-linear increase in energy-per-token. Tensor parallelism has the largest effect of all tested parameters, with both models more than doubling energy consumption with the addition of a single extra process, then scaling up to 4× and 6× under 3 and 4 processes respectively.
This behaviour reflects a naive, unoptimised implementation and is consistent with known limitations when model size is small relative to available GPU capacity. As (underutilised) GPUs are added, inter-device communication and synchronisation overheads grow while each device continues to draw near-constant baseline power, so total energy rises faster than useful work.
Both models' variance in energy outcomes across the entire grid search also grows with parallelism, reflecting both scale and increased instability from communication-heavy, fragmented scheduling. Leveraging more sophisticated parallel execution strategies would likely alter these scaling dynamics.
Normalised throughput across different batch sizes.
Batch size reveals a steep inverse relationship with energy-per-token: in the small-batch regime, fixed overheads dominate, while beyond a certain point the curve plateaus, reflecting compute and memory-bandwidth saturation. The lower bound is reached for both models at around 10% of the double-batch baseline (a 10× range, excluding single batches).
Energy-per-token falls as throughput rises, demonstrating that batching-driven efficiency gains stem primarily from throughput improvements. Larger batches improve device utilisation and partially offset inefficiencies from parallelism.
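The throughput link follows from a simple identity: at a given average power draw, energy-per-token is power divided by throughput. A minimal sketch, in which the 300 W and 100 tokens/s figures are illustrative assumptions:

```python
def energy_per_token_kwh(avg_power_watts: float, throughput_tokens_per_s: float) -> float:
    """Energy-per-token from power draw and throughput.

    E/token = P / throughput gives joules per token; divide by
    3.6e6 J/kWh to express it in kWh per token.
    """
    joules_per_token = avg_power_watts / throughput_tokens_per_s
    return joules_per_token / 3.6e6

# e.g. a 300 W device at 100 tokens/s -> 3 J/token, on the order of 1e-6 kWh
print(energy_per_token_kwh(300, 100))
```

The identity makes the mechanism explicit: raising throughput lowers energy-per-token only so long as power draw does not rise proportionately, which is exactly the saturation behaviour described above.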
Batch size moderates the effect of tensor parallelism on energy efficiency.
Effect of numerical precision on normalised energy consumption.
Energy-per-token does not decrease monotonically with precision reductions: some quantised formats consumed more energy than higher-precision baselines where the inference kernels were poorly optimised for them.
This finding highlights that quantisation benefits depend heavily on how well the inference framework is optimised for lower-precision operations.
Decoding strategy exhibited minimal direct energy impact, with all curves remaining within 5% of baseline.
This suggests LLM providers can select sampling strategies based on application-specific criteria (output quality, diversity) rather than energy efficiency concerns.
The relationship between latency and energy consumption varies by precision level.
Energy efficiency steadily degrades as simulated latency increases, with larger declines under bursty conditionsâalbeit with modest overall effect sizes. Because this latency models exogenous communication delays (network congestion, scheduling jitter) rather than compute bottlenecks, the efficiency loss reflects reduced throughput: GPUs remain powered longer but perform less useful work per unit time.
Burstiness further exacerbates inefficiency, though the relationship is not uniform. Smaller burst sizes (5-8 queries) exhibit consistently higher inefficiencies, whereas larger bursts (10-20 queries) resemble constant latency profiles, suggesting modern GPUs exploit burst-locality to regain uninterrupted computation stretches.
Prior work has considered throughput and device utilisation rates as determinants of inference-time energy efficiency. This study proposes that together they capture computational and non-computational aspects of energy efficiency in deployed LLM systems.
The convex relationship between throughput and energy-per-token.
For systems characterised by throughput-inefficiency (operating at low throughput, within their throughput-power Pareto frontier), small throughput increases yield large reductions in energy-per-token. Beyond an inflection point, indicating proximity to the system's throughput-efficiency frontier, further throughput gains no longer improve energy efficiency, but instead require proportionate increases in power draw.
This convex non-linearity allows the optimal throughput configuration set to be conceptualised as a constrained optimisation problem under service-level objective (SLO) conditions.
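One way to sketch that constrained optimisation: among configurations that meet a latency SLO, pick the one with the lowest energy-per-token. All throughput, latency, and energy figures below are hypothetical:

```python
# Hypothetical (throughput tokens/s, p95 latency ms, energy kWh/token) per config
configs = {
    "batch=1,fp16":  (40.0,  25,  5.0e-6),
    "batch=8,fp16":  (220.0, 60,  1.1e-6),
    "batch=32,fp16": (480.0, 180, 0.9e-6),
}

def best_under_slo(configs, max_latency_ms):
    """Minimise energy-per-token subject to a latency SLO."""
    feasible = {k: v for k, v in configs.items() if v[1] <= max_latency_ms}
    return min(feasible, key=lambda k: feasible[k][2])

print(best_under_slo(configs, max_latency_ms=100))  # batch=32 violates the SLO
```

Relaxing the SLO admits the larger batch and a further energy saving, which is the responsiveness-versus-efficiency tradeoff in miniature.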
Device utilisation captures a different set of non-computational inefficiencies. In efficient deployments, utilisation should remain consistently high; persistent underutilisation indicates specific sources of energy waste, as is often the case in deployed systems, where utilisation rates have been reported as low as 23-27%.
From a policy perspective, throughput and device utilisation offer more insightful characterisation of runtime energy dynamics than analytical metrics like FLOPs, and provide a deployment-sensitive basis for benchmarking deployed systems.
Deployment decisions substantially affect inference-time energy outcomes, to an extent under-acknowledged in Sustainable AI discourse. This study shows that variation induced by implementation can well exceed that attributable to model size alone. LLM energy efficiency should be conceptualised not at the level of abstract model architectures, but as a property of implemented systems shaped by concrete deployment choices.
While the precise mechanisms driving inference-time energy consumption are complex, this study identifies key tunable parameters: batching, tensor parallelism, and numerical precision. Even amongst this initial set, interaction effects and performance tradeoffs underscore the need for joint optimisation frameworks:
| Parameter | Energy Impact | Tradeoff |
|---|---|---|
| Batch size | High (10×) | Responsiveness vs efficiency |
| Tensor parallelism | Very high (6×) | Scale vs coordination overhead |
| Precision/quantisation | Moderate (35-40%) | Quality vs efficiency |
| Decoding strategy | Minimal (<5%) | Quality/diversity concerns only |
Aggressive quantisation and deterministic decoding degrade generation quality, while larger batch sizes reduce responsiveness for real-time applications but improve efficiency, more so in highly distributed inference environments.
Prevailing conceptualisations of LLM energy efficiency emphasise analytical over empirical measures, and model attributes over system dynamics. Early benchmarking frameworks neglect the role of implementation-level factors as necessary controls.
To illustrate the extent of possible distortion by motivated actors at test-time: comparing over-optimised configurations (designed to maximise throughput without practical constraints) to production-like deployments suggests unconstrained test-time optimisation can yield energy costs around 10-13% of production values, a substantial risk for misleading efficiency claims.
As argued above, FLOP-counting restricts the scope of intervention to the moment of model selection, overlooking the system-level deployment decisions that shape real-world energy efficiency. Future research should:
This lack of standardised test-time controls lets motivated actors present artificially efficient performance metrics by testing under unrealistic configurations, leaving consumers and policymakers with little insight into the true energy cost of querying an LLM.
Recommendations:
The substantial variability demonstrated here highlights opportunities for energy optimisation through informed deployment choices, not just model selection. Practitioners should:
The measurement tool developed for this research is being actively expanded. See the tool documentation for usage details and feature evolution, or the GitHub repository for the source code.
Current expansion areas:
Future research directions:
Full academic thesis available as PDF.