Benchmarking¶
The benchmark/ directory ships a curated set of representative
material models together with a sweep harness and a
vectorisation-efficiency fitter. The tooling answers the question
that comes up whenever a NEML2 user sizes a production run:
How does compute time scale with batch size for a given model on a given device?
The same tooling is the substrate for tracking performance changes across NEML2 releases; runtime-touching PRs are expected to keep the headline numbers from regressing.
Scenarios¶
The committed scenarios cover the breadth of the solid-mechanics module, from trivial elasticity to fully-coupled crystal plasticity:
Code name |
Description |
|---|---|
|
Linear isotropic elasticity without model composition |
|
\(J_2\) viscoplasticity with radial-return integration |
|
Viscoplasticity with isotropic hardening |
|
Chaboche viscoplasticity, 2 back stresses |
|
Chaboche viscoplasticity, 4 back stresses |
|
Chaboche viscoplasticity, 6 back stresses |
|
GTN visco-poroplasticity with thermal eigenstrain |
|
Fully-coupled implicit single-crystal plasticity |
|
Coupled SCP with multiplicative \(F = F^e F^p\) decomposition |
|
Decoupled implicit single-crystal plasticity |
|
Decoupled SCP with explicit orientation update |
|
Taylor polycrystal with a single orientation |
|
Taylor polycrystal with pseudo-random FCC orientations |
|
Mixed-control polycrystal: |
Each scenario lives under benchmark/<name>/model.i as an HIT input
file with two HIT-substituted parameters: ${nbatch} (the dynamic
batch size) and ${device} (cpu or cuda). All scenarios except
tcprandom advance 100 prescribed time steps; tcprandom advances
500 to simulate severe plastic deformation from the standard rolling
experiment.
Vectorisation-efficiency model¶
For a fixed device and model, compute time as a function of batch size \(N\) is fitted to ANL-24/43 eq 7.1:
where \(\langle x \rangle = \max(x, 0)\) is the Macaulay bracket, \(t_0\) is the per-call cost when the device’s vectorisation bandwidth is not yet saturated (a constant time across all small batches), and \(N^*\) is the optimal batch size at which the device starts trading constant time for linear-in-\(N\) time.
Two limits of the model are intuitive:
\(N \ll N^*\): \(t \to t_0\). Vectorisation lanes are unsaturated; per-call cost is dominated by launch overhead, fixed setup, and whatever per-step constant work the model carries. Increasing batch costs nothing.
\(N \gg N^*\): \(t \to t_0 \cdot N / N^*\). The device is saturated; per-call cost scales linearly with the batch.
The fitted \((t_0,\, N^*)\) pair characterises the model+device pair in one of two useful ways:
On a fixed device, \(t_0\) sets the floor on per-call cost — every measurement at \(N \le N^*\) sits there, so a smaller \(t_0\) means a cheaper baseline.
For a fixed model, \(N^*\) is the largest batch that still pays only \(t_0\), so a larger \(N^*\) means more useful work per call before the device falls out of the free region.
The peak throughput of a model+device pair is
(material updates per ms), since by construction \(t(N^*) = t_0\). This is the right single-number summary when comparing two runtimes — pick each runtime’s own optimal batch, and the user can run the model at that batch to get the headline rate. Comparing at a fixed batch can flatter or penalise either side depending on whether that batch happens to land in one runtime’s free region.
Running the sweep¶
The sweep tool is invoked through python -m benchmark.sweep:
# Init a fresh result folder (just records sweep config + env snapshot)
python -m benchmark.sweep init --device cuda
# Run every scenario into that folder (or one folder at a time)
python -m benchmark.sweep run --output-dir benchmark/results/<TS>_cuda_float64_aoti
# Rebuild summary.csv from per-scenario CSVs
python -m benchmark.sweep summarize --output-dir <DIR>
# Fit eq 7.1 per scenario into fitting.csv
python -m benchmark.sweep fit --output-dir <DIR>
# End-to-end shortcut (init + run every discovered scenario + summarize)
python -m benchmark.sweep all --device cuda
Per-scenario stop conditions trip the inner adaptive-batch loop
whichever comes first: (a) log-log slope \(\ge 0.85\) for two
consecutive points (the model has entered its linear regime so
more data points add little information), (b) measured median call
time exceeds 30 s (absolute wall-time saturation), © CUDA OOM /
host MemoryError, or (d) batch exceeds the --max-batch cap
(default 65536).
Each batch is compiled separately, with the scenario’s example
batch shape resolved at the measured batch size. Per-batch compile
mirrors what a production user gets from neml2-compile invoked at
the batch they actually intend to run; reusing one compile across
batches would be cheaper but can mis-report scaling behaviour at
batches far from the example.
Output¶
Each run populates four file kinds under <output-dir>:
File |
Contents |
|---|---|
|
Host + GPU snapshot, sweep config, per-scenario batch ranges, last-run wall time |
|
One row per measured batch: median / mean / std / min / max / p10 / p90 / RSS / peak CUDA memory |
|
Raw per-iteration timings (warmup discarded, all measurement runs kept) |
|
Aggregate view across every scenario, regenerated by |
fit adds a fifth file:
File |
Contents |
|---|---|
|
Per-scenario \((t_0, N^*)\), log-space RMS residual, and number of fit points |
The CUDA compile pipeline saturates available CPU cores while running; co-locating a CPU sweep on the same box double-occupies the cores and taints both. Run one device at a time on a quiet machine. The harness itself does not detect contention.