
# Results schema

Reference for everything llem writes to disk after a measurement. Each study ships three kinds of artefact: a per-experiment `result.json`, a study-level `manifest.json`, and an optional per-experiment `timeseries.parquet` sidecar.

For a guided walkthrough of how to read these files (with worked examples), see How to interpret results. For the methodology behind each metric, see What we measure and Energy measurement.

## Output layout

A study run produces a directory tree like this:

```
results/
└── <study-name>_<UTC-timestamp>/
    ├── manifest.json                      # study-level checkpoint + summary
    ├── 001_c0_<model>-<engine>_<hash>/    # one experiment cell
    │   ├── result.json                    # all metrics + resolved config
    │   ├── effective_config.json          # final config used (post-expansion)
    │   └── timeseries.parquet             # GPU power/thermal/memory samples
    ├── 002_c0_.../
    ├── ...
    └── _study-artefacts/
        ├── equivalence_groups.json        # dedup equivalence groups
        └── baseline_cache_<key>.json      # per-engine baseline cache
```

`<UTC-timestamp>` is an ISO-8601 UTC timestamp with colons replaced by hyphens so it is filesystem-safe (e.g. `2026-05-07T14-32-08`). Cell directory names encode `<NNN>_c<cycle>_<model>-<engine>_<config-hash>` so they sort sensibly and you can tell sibling cycles apart at a glance.
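
If you need to recover those pieces programmatically, the name can be split with a small regex. This is an illustrative sketch based on the naming scheme above, not a helper shipped with llem; `parse_cell_name` is a hypothetical name.

```python
import re

# <NNN>_c<cycle>_<model>-<engine>_<config-hash>, e.g. "001_c0_qwen-transformers_a1b2c3"
CELL_NAME = re.compile(
    r"^(?P<index>\d+)_c(?P<cycle>\d+)_(?P<model>.+)-(?P<engine>[^_]+)_(?P<hash>[0-9a-f]+)$"
)

def parse_cell_name(name: str) -> dict[str, str]:
    """Split a cell directory name into index, cycle, model, engine and config hash."""
    match = CELL_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised cell directory name: {name!r}")
    return match.groupdict()

parse_cell_name("001_c0_qwen-transformers_a1b2c3")
# {'index': '001', 'cycle': '0', 'model': 'qwen', 'engine': 'transformers', 'hash': 'a1b2c3'}
```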

## `result.json` - per-experiment record

The scientific record. One JSON file per experiment cell. Schema version 3.0.

### Identification

| Field | Type | Description |
| --- | --- | --- |
| `schema_version` | `str` | Result schema version (currently `"3.0"`) |
| `experiment_id` | `str` | Unique experiment identifier (`{model}_{YYYYMMDD_HHMMSS}` for single experiments; study-level cells inherit a richer per-cell identifier) |
| `measurement_config_hash` | `str` | SHA-256[:16] of `ExperimentConfig` with environment fields excluded; same hash -> logically identical experiments |
| `llenergymeasure_version` | `str \| null` | Package version that produced this result |
| `engine` | `str` | Inference engine: `transformers` \| `vllm` \| `tensorrt` |
| `engine_version` | `str \| null` | Engine library version (e.g. `4.57.0` for transformers) |
| `model_name` | `str` | Model name or HuggingFace path used |
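
Because the hash is stable across repeat cycles, it is a convenient key for grouping loaded results. The snippet below is a sketch, assuming each entry is a parsed `result.json` dict (see Loading from disk); `group_by_config_hash` is a hypothetical helper, not part of llem.

```python
from collections import defaultdict

def group_by_config_hash(results: list[dict]) -> dict[str, list[dict]]:
    """Bucket parsed result.json dicts so repeat cycles of the same cell land together."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for result in results:
        groups[result["measurement_config_hash"]].append(result)
    return dict(groups)
```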

### Measurement methodology

| Field | Type | Description |
| --- | --- | --- |
| `measurement_methodology` | `"total" \| "steady_state" \| "windowed"` | Which slice of the run produced the headline metrics |
| `warmup_excluded_samples` | `int \| null` | Prompts excluded during warmup; `null` when methodology = `"total"` |
| `reproducibility_notes` | `str` | Free-text caveats (the default mentions NVML accuracy ±5% and thermal drift) |

### Aggregate metrics

These are the totals across all processes / GPUs (post-aggregation, and post-warmup-exclusion when applicable).

| Field | Type | Description |
| --- | --- | --- |
| `total_tokens` | `int` | Total output tokens generated across all prompts |
| `total_energy_j` | `float` | Total GPU energy in joules (raw, no baseline subtraction) |
| `total_inference_time_sec` | `float` | Total wall-clock inference time in seconds |
| `avg_tokens_per_second` | `float` | Throughput: `total_tokens / total_inference_time_sec` |
| `avg_energy_per_token_j` | `float` | Energy per output token in joules |
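
Since the throughput field is defined as a ratio of two other fields, it can be re-derived as a sanity check. A minimal sketch, assuming `result` is a parsed `result.json` dict:

```python
import math

def check_throughput(result: dict, rel_tol: float = 1e-6) -> bool:
    """Re-derive avg_tokens_per_second from the totals, as documented above."""
    derived = result["total_tokens"] / result["total_inference_time_sec"]
    return math.isclose(result["avg_tokens_per_second"], derived, rel_tol=rel_tol)
```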

### Per-token energy (millijoules)

| Field | Type | Description |
| --- | --- | --- |
| `mj_per_tok_total` | `float \| null` | Millijoules per token from raw (unadjusted) energy |
| `mj_per_tok_adjusted` | `float \| null` | Millijoules per token from baseline-adjusted energy; `null` when no baseline was measured. This is the right field for cross-experiment comparisons. |

:::note Why adjusted beats total for comparisons

`mj_per_tok_adjusted` subtracts idle GPU power before dividing by token count. Two experiments running on hardware with different idle power (or at different thermal states) will show a spurious difference in `mj_per_tok_total` even when inference is identical. See Energy measurement for the full reasoning.

:::

### FLOPs

`total_flops` is an estimate (not directly measurable during inference). The derived per-token / per-second fields are `null` when the divisor is zero.

| Field | Type | Description |
| --- | --- | --- |
| `total_flops` | `float` | Total FLOPs estimate for this experiment |
| `flops_per_output_token` | `float \| null` | FLOPs per decode token; `null` if `total_flops = 0` or `output_tokens = 0` |
| `flops_per_input_token` | `float \| null` | FLOPs per prefill token |
| `flops_per_second` | `float \| null` | FLOPs throughput (`total_flops / inference_time_sec`) |
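
The derived values follow the same pattern: a division that is left as `null` when the estimate or the divisor is zero. A hedged sketch of that rule (illustration only, not llem's internal code):

```python
def derived_flops(total_flops: float, divisor: float) -> float | None:
    """Per-token / per-second FLOPs value: None (serialised as null) when estimate or divisor is zero."""
    if total_flops == 0 or divisor == 0:
        return None
    return total_flops / divisor

# flops_per_output_token corresponds to derived_flops(total_flops, output_tokens)
# flops_per_second corresponds to derived_flops(total_flops, inference_time_sec)
```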

### Baseline (idle GPU power)

| Field | Type | Description |
| --- | --- | --- |
| `baseline_power_w` | `float \| null` | Idle GPU power in watts, measured before this experiment |
| `energy_adjusted_j` | `float \| null` | Total energy minus `baseline_power_w * total_inference_time_sec`: the "net inference work" energy figure |
| `energy_per_device_j` | `list[float] \| null` | Per-GPU energy breakdown (length = number of processes) |

For the methodology that motivates baseline subtraction, see Methodology > Baseline power.
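
Both adjusted figures follow directly from the raw totals and the measured idle power, so they can be reproduced from a loaded record. A sketch of the documented arithmetic, assuming `result` is a parsed `result.json` dict with a baseline present:

```python
def recompute_adjusted(result: dict) -> tuple[float, float]:
    """Reproduce energy_adjusted_j and mj_per_tok_adjusted from the raw fields."""
    idle_j = result["baseline_power_w"] * result["total_inference_time_sec"]
    energy_adjusted_j = result["total_energy_j"] - idle_j
    mj_per_tok_adjusted = 1000.0 * energy_adjusted_j / result["total_tokens"]  # J -> mJ per token
    return energy_adjusted_j, mj_per_tok_adjusted
```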

### Sidecar reference

| Field | Type | Description |
| --- | --- | --- |
| `timeseries` | `str \| null` | Relative filename of the timeseries sidecar (e.g. `"timeseries.parquet"`); `null` when `output.save_timeseries: false` |

### Effective config (sibling file)

`effective_config.json` lives next to `result.json` in each experiment directory. It contains the fully resolved `ExperimentConfig`: every parameter value used, including engine defaults that were not explicitly specified. This is the file to use when reproducing an experiment.

## `manifest.json` - study-level checkpoint

Written and updated while a study runs (resume support reads from it). Once the study completes, the manifest's `summary` field is essentially the same as the returned `StudyResult.summary`.

### Top-level

| Field | Type | Description |
| --- | --- | --- |
| `study_name` | `str \| null` | Study name (used in directory naming) |
| `study_design_hash` | `str \| null` | 16-char SHA-256 of the resolved experiment list (execution block excluded). Same YAML -> same hash. |
| `start_time` | `datetime` | Study start (ISO-8601 UTC) |
| `end_time` | `datetime` | Study end (ISO-8601 UTC, populated on completion) |
| `experiments` | `list[dict]` | Per-experiment resolved config + status (`running` \| `completed` \| `failed`) |
| `summary` | `StudySummary` | Aggregate counters (see below) |

### summary block

| Field | Type | Description |
| --- | --- | --- |
| `total_experiments` | `int` | Total experiments planned for this study |
| `completed` | `int` | Number of successfully completed experiments |
| `failed` | `int` | Number of failed experiments |
| `total_wall_time_s` | `float` | Total wall-clock time in seconds |
| `total_energy_j` | `float` | Total energy across all experiments in joules |
| `unique_configurations` | `int \| null` | Distinct experiment configs: `total_experiments / n_cycles` |
| `warnings` | `list[str]` | Runtime warnings emitted during the study |
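
Because the manifest is updated while the study runs, it can be polled for progress without waiting for the final StudyResult. A small sketch assuming the layout described above; `study_progress` is a hypothetical helper, not part of llem:

```python
import json
from pathlib import Path

def study_progress(study_dir: str) -> str:
    """Summarise completion counts from a study's manifest.json."""
    with (Path(study_dir) / "manifest.json").open() as f:
        manifest = json.load(f)
    s = manifest["summary"]
    return f"{s['completed']}/{s['total_experiments']} completed, {s['failed']} failed"
```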

## `timeseries.parquet` - sample-level sidecar

Written when `output.save_timeseries: true` (the default). One Parquet file per experiment, columnar layout, suitable for direct loading into Pandas / Polars / DuckDB.

| Column | Type | Description |
| --- | --- | --- |
| `t` | `float64` | Wall-clock seconds since experiment start |
| `gpu_idx` | `int32` | GPU device index (0, 1, ...) for multi-GPU runs |
| `power_w` | `float64` | Instantaneous GPU power draw in watts |
| `temperature_c` | `float64` | GPU temperature in °C |
| `memory_used_mib` | `float64` | GPU memory used in MiB |
| `sm_clock_mhz` | `float64` | SM clock in MHz (when available) |

LLenergyMeasure polls NVML at 100 ms intervals; thermal-throttle events shorter than the polling interval may be missed - see Methodology > Known limitations.
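
The sidecar carries enough information to re-derive the energy total: integrate `power_w` over `t` per GPU and sum. A rough sketch (rectangle rule; small discrepancies against `total_energy_j` are expected given the 100 ms sampling):

```python
import pandas as pd

def integrate_energy_j(ts: pd.DataFrame) -> float:
    """Approximate total energy in joules by summing watts x seconds for each GPU."""
    total_j = 0.0
    for _, gpu in ts.sort_values("t").groupby("gpu_idx"):
        dt = gpu["t"].diff().fillna(0.0)              # seconds between consecutive samples
        total_j += float((gpu["power_w"] * dt).sum())  # W * s = J
    return total_j
```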

## `StudyResult` - final return value (Python API)

Returned by run_study(...). Distinct from manifest.json: this is the fully-assembled object handed back to the caller after the study completes.

| Field | Type | Description |
| --- | --- | --- |
| `experiments` | `list[ExperimentResult]` | One entry per experiment cell (same fields as the per-experiment `result.json`) |
| `study_name` | `str \| null` | Same as the manifest |
| `study_design_hash` | `str \| null` | Same as the manifest |
| `measurement_protocol` | `dict` | Flat snapshot of `ExecutionConfig`: `n_cycles`, `experiment_order`, `experiment_gap_seconds`, `cycle_gap_seconds`, `shuffle_seed`, `experiment_timeout_seconds` |
| `result_files` | `list[str]` | Paths to per-experiment `result.json` files (paths, not embedded payloads) |
| `summary` | `StudySummary` | Same shape as in the manifest |
| `skipped_experiments` | `list[dict]` | Grid points skipped due to validation errors; each entry: `{raw_config, reason, errors}` |

## Loading from disk

```python
import json
from pathlib import Path

import pandas as pd

study = Path("results/tutorial-multi-engine_2026-05-07T14-32-08")

# Load the study-level manifest
with (study / "manifest.json").open() as f:
    manifest = json.load(f)

# Load every experiment result
results = []
for cell in sorted(study.glob("*/result.json")):
    with cell.open() as f:
        results.append(json.load(f))

# Load one timeseries sidecar (Pandas)
ts = pd.read_parquet(study / "001_c0_qwen-transformers_a1b2c3" / "timeseries.parquet")
```
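
From there it is a short step to a comparison table across experiments. A usage sketch continuing from the block above (field names as documented; `mj_per_tok_adjusted` may be `null` when no baseline was measured):

```python
# Build a quick comparison table from the loaded results
summary_df = pd.DataFrame(
    [
        {
            "experiment_id": r["experiment_id"],
            "engine": r["engine"],
            "tokens_per_s": r["avg_tokens_per_second"],
            "mj_per_tok_adjusted": r["mj_per_tok_adjusted"],
        }
        for r in results
    ]
)
print(summary_df.sort_values("tokens_per_s", ascending=False))
```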

For the Python API equivalent (StudyResult object), see Reference > Library API.

## Schema versioning

`schema_version` in `result.json` follows semantic versioning: minor bumps add fields without breaking existing readers; major bumps signal breaking changes. Pre-1.0, the policy is conservative: new fields land as Optional with a default of `null`, so existing parsers don't break.
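
A reader that wants to guard against future breaking changes can gate on the major component. A minimal sketch; `schema_is_supported` is a hypothetical helper:

```python
def schema_is_supported(result: dict, supported_major: int = 3) -> bool:
    """Accept results whose major schema_version matches the one this reader was written for."""
    major = int(result["schema_version"].split(".")[0])
    return major == supported_major
```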

## See also