Compiled models¶
You’ll take a model you can already run in Python, compile it ahead of time into a self-contained package, and call the compiled version from another process. The compiled form runs the kernels in pre-compiled C++ instead of the eager PyTorch dispatcher, so it’s the form to use once a model is locked in and you want it fast.
The pipeline¶
input.i ──► neml2-compile ──► elasticity_aoti.i (standalone drop-in stub)
elasticity/
cpu/ elasticity_meta.json (variable layout, dtype, device)
elasticity.pt2 (AOTI-compiled kernels)
elasticity_jvp.pt2 (per-pair Jacobian graph; only with -d)
neml2-compile emits one artifact folder per device (elasticity/<device>/)
plus a single standalone elasticity_aoti.i stub next to it. The stub is a
copy of the original input with the [Models]/elasticity block replaced by an
AOTIModel shim that points at the artifact folder via an absolute
artifact_path — everything else (drivers, settings, tensors) is copied through
verbatim, so the stub is a drop-in replacement anywhere a Driver consumes a
model by name. Compiling for several devices at once (--device cpu cuda) adds
more <device>/ subfolders; the loader picks the one matching the device it
runs on.
The input file¶
# Linear isotropic elasticity model used as the compilation target.
# E = 200 GPa, nu = 0.3 — same model as the "hello world" tutorial,
# repeated here so this page is self-contained.
[Models]
[elasticity]
type = LinearIsotropicElasticity
coefficients = '200e3 0.3'
coefficient_types = 'YOUNGS_MODULUS POISSONS_RATIO'
[]
[]
Compiling from the shell¶
neml2-compile (see CLI utilities) builds the package in one
invocation. Derivative graphs are opt-in: with no -d flag only the
forward graph is compiled (the smallest artifact). Request the
derivatives you need with -d OUT:IN (here stress:strain; omit a side
to select all on it, -d : for every pair) so the artifact also exposes
jvp / jacobian for those pairs:
!neml2-compile input.i --model elasticity -d stress:strain
Compiled 'elasticity' (composed, fully baked, 1 derivative pair(s)) for [cpu] -> aoti/elasticity
stub: aoti/elasticity_aoti.i
cpu: aoti/elasticity/cpu/elasticity_meta.json
Compilation is a one-time cost (Inductor + C++ compile, typically
seconds); the resulting artifact loads quickly on every subsequent
run. With no --output-dir the output lands in aoti/ — the standalone
aoti/<model>_aoti.i stub next to the per-device folder aoti/<model>/<device>/:
!ls aoti aoti/elasticity aoti/elasticity/cpu
aoti:
elasticity elasticity_aoti.i
aoti/elasticity:
cpu
aoti/elasticity/cpu:
elasticity.pt2 elasticity_jvp.pt2 elasticity_meta.json
Loading the compiled model¶
The stub looks like any other HIT input file, so neml2.load_model
loads it the usual way:
import neml2
compiled = neml2.load_model("aoti/elasticity_aoti.i", "elasticity")
compiled
AOTIModel()
The shim behaves like a native model — same inputs, same outputs,
same call convention — but underneath it dispatches to the compiled
.pt2 instead of running Python:
print("inputs :", list(compiled.input_spec))
print("outputs:", list(compiled.output_spec))
inputs : ['strain']
outputs: ['stress']
Round-trip equivalence¶
Evaluate the compiled and eager forms on the same strain and check they agree. The compiled model always returns a tuple (even with a single output), so we unpack it:
import torch
from neml2.types import SR2
eager = neml2.load_model("input.i", "elasticity")
strain = SR2(torch.tensor(
[
[0.01, 0.0, 0.0, 0.0, 0.0, 0.0 ],
[0.005, -0.002, 0.0, 0.0, 0.0, 0.001],
],
dtype=torch.float64,
))
(stress_compiled,) = compiled(strain)
stress_eager = eager(strain)
print("compiled:", stress_compiled.data)
print("eager :", stress_eager.data)
print("match :", torch.allclose(stress_compiled.data, stress_eager.data,
rtol=1e-12, atol=1e-12))
compiled: tensor([[2692.3077, 1153.8462, 1153.8462, 0.0000, 0.0000, 0.0000],
[1115.3846, 38.4615, 346.1538, 0.0000, 0.0000, 153.8462]],
dtype=torch.float64)
eager : tensor([[2692.3077, 1153.8462, 1153.8462, 0.0000, 0.0000, 0.0000],
[1115.3846, 38.4615, 346.1538, 0.0000, 0.0000, 153.8462]],
dtype=torch.float64, grad_fn=<AddBackward0>)
match : True
The compiled artifact matches eager output to machine precision. The leading batch dimension is dynamic, so the same artifact handles any batch size from 1 to roughly a million without recompilation.
Jacobian and Jacobian-vector product¶
Alongside forward, the compiled artifact also exposes jvp (outputs
plus a directional derivative J @ v) and jacobian (outputs plus the
per-variable Jacobian blocks). Both live on the underlying C++ binding
(neml2.aoti.Model). The HIT-loaded shim stores that binding on its
_inner attribute — this attribute is not stable public API and may
be renamed in a future release, but it is the only way to reach the
binding today:
# NOTE: _inner is an implementation detail, not stable public API.
binding = compiled._inner # the bare ``neml2.aoti.Model`` runtime
# Tangent on the input — same shape as the input itself.
strain_t = torch.zeros_like(strain.data)
strain_t[:, 0] = 1.0 # probe d(stress)/d(epsilon_xx)
out, jvp = binding.jvp({"strain": strain.data}, {"strain": strain_t})
print("output[stress] :", out["stress"][:, :3])
print("J @ v [stress] :", jvp["stress"][:, :3])
output[stress] : tensor([[2692.3077, 1153.8462, 1153.8462],
[1115.3846, 38.4615, 346.1538]], dtype=torch.float64)
J @ v [stress] : tensor([[269230.7692, 115384.6154, 115384.6154],
[269230.7692, 115384.6154, 115384.6154]], dtype=torch.float64)
out, J = binding.jacobian({"strain": strain.data})
print("output[stress] :", out["stress"][:, :3])
# J is nested by variable pair: J[output_name][input_name] is the block
# d(output)/d(input), shaped (batch, *output_base, *input_base).
block = J["stress"]["strain"]
print("J[stress][strain] :", tuple(block.shape)) # (batch, 6, 6)
print("first sample (6x6):\n", block[0])
output[stress] : tensor([[2692.3077, 1153.8462, 1153.8462],
[1115.3846, 38.4615, 346.1538]], dtype=torch.float64)
J[stress][strain] : (6, 6)
first sample (6x6):
tensor([269230.7692, 115384.6154, 115384.6154, 0.0000, 0.0000,
0.0000], dtype=torch.float64)
Use jvp when you only need one direction of the gradient; use
jacobian when you need the full per-variable blocks.
Promoting parameters¶
By default every parameter is baked into the compiled graph as a
constant. That’s fastest, but it also means you can’t change them at
runtime. To keep one mutable, promote it at compile time with -p:
!neml2-compile input.i --model elasticity --output-dir aoti_promoted -p E -d stress:strain
Compiled 'elasticity' (composed, 1 promoted parameter(s), 1 derivative pair(s)) for [cpu] -> aoti_promoted/elasticity
stub: aoti_promoted/elasticity_aoti.i
cpu: aoti_promoted/elasticity/cpu/elasticity_meta.json
The promoted artifact loads the same way; the difference is that the named parameter is reachable from Python and editing it in place is reflected on the next forward call:
m = neml2.load_model("aoti_promoted/elasticity_aoti.i", "elasticity")
# Initial value comes from the snapshot taken at compile time.
(stress0,) = m(strain)
# Mutate E in place on the underlying binding. The next call sees it.
# NOTE: _inner is not stable public API; it is the only way to reach
# the parameter surface from a HIT-loaded shim today.
m._inner.named_parameters()["E"].fill_(100e3)
(stress1,) = m(strain)
print("initial E=200e3 stress[0,0]:", stress0.data[0, 0].item())
print("after E=100e3 stress[0,0]:", stress1.data[0, 0].item())
print("ratio (expect 0.5): ", (stress1.data[0, 0] / stress0.data[0, 0]).item())
initial E=200e3 stress[0,0]: 2692.307692307692
after E=100e3 stress[0,0]: 1346.153846153846
ratio (expect 0.5): 0.5
Each promoted name adds a small per-call cost (it becomes a graph
input rather than a baked constant), but the compiled artifact is
typically still much faster than eager. Promotion doesn’t work for
parameters inside an ImplicitUpdate segment — the compiler will
refuse them with a clear error.
Trade-offs vs eager¶
Eager |
AOTI |
|
|---|---|---|
Per-call overhead |
Python + PyTorch dispatcher per op |
Thin Python shim into pre-compiled C++ kernels |
Build cost |
Zero |
Seconds (Inductor + C++ compile, once) |
Parameter mutation |
Free — they’re live attributes |
Only for |
Autograd |
Yes (full eager AD) |
No (forward-only; |
Device/dtype |
Switchable at runtime via |
Pinned at compile time ( |
Portability |
Needs the full Python source |
Needs only |
Rule of thumb: stay in eager mode while you’re still iterating on a model. Compile it once you’re ready to run it at scale — driver runs, large batch sweeps, the inner loop of a finite-element kernel.
Where to go next¶
Implicit models covers
ImplicitUpdatemodels; the same compile flow handles them — each implicit segment produces its own residual and Newton-step artifacts and the runtime orchestrates the Newton solve internally.Transient driver shows how a compiled model plugs straight into a
TransientDriver— the driver doesn’t care whether its model is eager or AOTI-backed.Evaluation device covers the device choice itself; with AOTI the device is baked at compile time, so pick carefully.