Compiled models

You’ll take a model you can already run in Python, compile it ahead of time into a self-contained package, and call the compiled version from another process. The compiled form runs the kernels in pre-compiled C++ instead of the eager PyTorch dispatcher, so it’s the form to use once a model is locked in and you want it fast.

The pipeline

input.i ──► neml2-compile ──► elasticity_aoti.i             (standalone drop-in stub)
                              elasticity/
                                cpu/  elasticity_meta.json   (variable layout, dtype, device)
                                      elasticity.pt2         (AOTI-compiled kernels)
                                      elasticity_jvp.pt2     (per-pair Jacobian graph; only with -d)

neml2-compile emits one artifact folder per device (elasticity/<device>/) plus a single standalone elasticity_aoti.i stub next to it. The stub is a copy of the original input with the [Models]/elasticity block replaced by an AOTIModel shim that points at the artifact folder via an absolute artifact_path — everything else (drivers, settings, tensors) is copied through verbatim, so the stub is a drop-in replacement anywhere a Driver consumes a model by name. Compiling for several devices at once (--device cpu cuda) adds more <device>/ subfolders; the loader picks the one matching the device it runs on.

The input file

Listing 13 input.i
# Linear isotropic elasticity model used as the compilation target.
# E  = 200 GPa, nu = 0.3 — same model as the "hello world" tutorial,
# repeated here so this page is self-contained.
[Models]
  [elasticity]
    type = LinearIsotropicElasticity
    coefficients      = '200e3          0.3'
    coefficient_types = 'YOUNGS_MODULUS POISSONS_RATIO'
  []
[]

Compiling from the shell

neml2-compile (see CLI utilities) builds the package in one invocation. Derivative graphs are opt-in: with no -d flag only the forward graph is compiled (the smallest artifact). Request the derivatives you need with -d OUT:IN (here stress:strain; omit a side to select all on it, -d : for every pair) so the artifact also exposes jvp / jacobian for those pairs:

!neml2-compile input.i --model elasticity -d stress:strain
Compiled 'elasticity' (composed, fully baked, 1 derivative pair(s)) for [cpu] -> aoti/elasticity
  stub: aoti/elasticity_aoti.i
  cpu: aoti/elasticity/cpu/elasticity_meta.json

Compilation is a one-time cost (Inductor + C++ compile, typically seconds); the resulting artifact loads quickly on every subsequent run. With no --output-dir the output lands in aoti/ — the standalone aoti/<model>_aoti.i stub next to the per-device folder aoti/<model>/<device>/:

!ls aoti aoti/elasticity aoti/elasticity/cpu
aoti:
elasticity  elasticity_aoti.i

aoti/elasticity:
cpu

aoti/elasticity/cpu:
elasticity.pt2	elasticity_jvp.pt2  elasticity_meta.json

Loading the compiled model

The stub looks like any other HIT input file, so neml2.load_model loads it the usual way:

import neml2

compiled = neml2.load_model("aoti/elasticity_aoti.i", "elasticity")
compiled
AOTIModel()

The shim behaves like a native model — same inputs, same outputs, same call convention — but underneath it dispatches to the compiled .pt2 instead of running Python:

print("inputs :", list(compiled.input_spec))
print("outputs:", list(compiled.output_spec))
inputs : ['strain']
outputs: ['stress']

Round-trip equivalence

Evaluate the compiled and eager forms on the same strain and check they agree. The compiled model always returns a tuple (even with a single output), so we unpack it:

import torch
from neml2.types import SR2

eager = neml2.load_model("input.i", "elasticity")

strain = SR2(torch.tensor(
    [
        [0.01,  0.0,    0.0, 0.0, 0.0, 0.0  ],
        [0.005, -0.002, 0.0, 0.0, 0.0, 0.001],
    ],
    dtype=torch.float64,
))

(stress_compiled,) = compiled(strain)
stress_eager = eager(strain)

print("compiled:", stress_compiled.data)
print("eager   :", stress_eager.data)
print("match   :", torch.allclose(stress_compiled.data, stress_eager.data,
                                  rtol=1e-12, atol=1e-12))
compiled: tensor([[2692.3077, 1153.8462, 1153.8462,    0.0000,    0.0000,    0.0000],
        [1115.3846,   38.4615,  346.1538,    0.0000,    0.0000,  153.8462]],
       dtype=torch.float64)
eager   : tensor([[2692.3077, 1153.8462, 1153.8462,    0.0000,    0.0000,    0.0000],
        [1115.3846,   38.4615,  346.1538,    0.0000,    0.0000,  153.8462]],
       dtype=torch.float64, grad_fn=<AddBackward0>)
match   : True

The compiled artifact matches eager output to machine precision. The leading batch dimension is dynamic, so the same artifact handles any batch size from 1 to roughly a million without recompilation.

Jacobian and Jacobian-vector product

Alongside forward, the compiled artifact also exposes jvp (outputs plus a directional derivative J @ v) and jacobian (outputs plus the per-variable Jacobian blocks). Both live on the underlying C++ binding (neml2.aoti.Model). The HIT-loaded shim stores that binding on its _inner attribute — this attribute is not stable public API and may be renamed in a future release, but it is the only way to reach the binding today:

# NOTE: _inner is an implementation detail, not stable public API.
binding = compiled._inner          # the bare ``neml2.aoti.Model`` runtime

# Tangent on the input — same shape as the input itself.
strain_t = torch.zeros_like(strain.data)
strain_t[:, 0] = 1.0               # probe d(stress)/d(epsilon_xx)

out, jvp = binding.jvp({"strain": strain.data}, {"strain": strain_t})
print("output[stress] :", out["stress"][:, :3])
print("J @ v [stress] :", jvp["stress"][:, :3])
output[stress] : tensor([[2692.3077, 1153.8462, 1153.8462],
        [1115.3846,   38.4615,  346.1538]], dtype=torch.float64)
J @ v [stress] : tensor([[269230.7692, 115384.6154, 115384.6154],
        [269230.7692, 115384.6154, 115384.6154]], dtype=torch.float64)
out, J = binding.jacobian({"strain": strain.data})
print("output[stress] :", out["stress"][:, :3])

# J is nested by variable pair: J[output_name][input_name] is the block
# d(output)/d(input), shaped (batch, *output_base, *input_base).
block = J["stress"]["strain"]
print("J[stress][strain] :", tuple(block.shape))   # (batch, 6, 6)
print("first sample (6x6):\n", block[0])
output[stress] : tensor([[2692.3077, 1153.8462, 1153.8462],
        [1115.3846,   38.4615,  346.1538]], dtype=torch.float64)
J[stress][strain] : (6, 6)
first sample (6x6):
 tensor([269230.7692, 115384.6154, 115384.6154,      0.0000,      0.0000,
             0.0000], dtype=torch.float64)

Use jvp when you only need one direction of the gradient; use jacobian when you need the full per-variable blocks.

Promoting parameters

By default every parameter is baked into the compiled graph as a constant. That’s fastest, but it also means you can’t change them at runtime. To keep one mutable, promote it at compile time with -p:

!neml2-compile input.i --model elasticity --output-dir aoti_promoted -p E -d stress:strain
Compiled 'elasticity' (composed, 1 promoted parameter(s), 1 derivative pair(s)) for [cpu] -> aoti_promoted/elasticity
  stub: aoti_promoted/elasticity_aoti.i
  cpu: aoti_promoted/elasticity/cpu/elasticity_meta.json

The promoted artifact loads the same way; the difference is that the named parameter is reachable from Python and editing it in place is reflected on the next forward call:

m = neml2.load_model("aoti_promoted/elasticity_aoti.i", "elasticity")

# Initial value comes from the snapshot taken at compile time.
(stress0,) = m(strain)

# Mutate E in place on the underlying binding. The next call sees it.
# NOTE: _inner is not stable public API; it is the only way to reach
# the parameter surface from a HIT-loaded shim today.
m._inner.named_parameters()["E"].fill_(100e3)
(stress1,) = m(strain)

print("initial E=200e3 stress[0,0]:", stress0.data[0, 0].item())
print("after   E=100e3 stress[0,0]:", stress1.data[0, 0].item())
print("ratio (expect 0.5):         ", (stress1.data[0, 0] / stress0.data[0, 0]).item())
initial E=200e3 stress[0,0]: 2692.307692307692
after   E=100e3 stress[0,0]: 1346.153846153846
ratio (expect 0.5):          0.5

Each promoted name adds a small per-call cost (it becomes a graph input rather than a baked constant), but the compiled artifact is typically still much faster than eager. Promotion doesn’t work for parameters inside an ImplicitUpdate segment — the compiler will refuse them with a clear error.

Trade-offs vs eager

Eager

AOTI

Per-call overhead

Python + PyTorch dispatcher per op

Thin Python shim into pre-compiled C++ kernels

Build cost

Zero

Seconds (Inductor + C++ compile, once)

Parameter mutation

Free — they’re live attributes

Only for -p names; rest are baked

Autograd

Yes (full eager AD)

No (forward-only; jvp / jacobian sidecars)

Device/dtype

Switchable at runtime via .to(...)

Pinned at compile time (--device, --dtype)

Portability

Needs the full Python source

Needs only .pt2 + metadata + the C++ runtime

Rule of thumb: stay in eager mode while you’re still iterating on a model. Compile it once you’re ready to run it at scale — driver runs, large batch sweeps, the inner loop of a finite-element kernel.

Where to go next

  • Implicit models covers ImplicitUpdate models; the same compile flow handles them — each implicit segment produces its own residual and Newton-step artifacts and the runtime orchestrates the Newton solve internally.

  • Transient driver shows how a compiled model plugs straight into a TransientDriver — the driver doesn’t care whether its model is eager or AOTI-backed.

  • Evaluation device covers the device choice itself; with AOTI the device is baked at compile time, so pick carefully.