Migrating from v2.x to v3.0
v3.0 is the largest single release since v2 — the runtime is now implemented entirely in Python (running on the PyTorch backend), with AOT-Inductor as the path to a portable compiled artifact. Most input files written for v2.1.2 continue to work unchanged; the changes below are the ones to look at first.
Why migrate?
Measured as peak throughput (material updates per ms at the optimal batch size \(N^*\) at which each runtime hits its vectorisation ceiling; see Benchmarking for the methodology), comparing v3.0 against v2.1.6 on the same node (NVIDIA RTX A5000 + Intel Xeon Gold 6346, 32 hyperthreaded cores):
CUDA, every scenario: v3 is 1.0× to 4.5× faster than v2 — every benchmarked model + driver combination is at least as fast on v3, most are 2-4× faster.
CPU, non-crystal-plasticity: v3 is 1.4× to >10× faster than v2 across elasticity, viscoplastic hardening, Chaboche (2/4/6 back stresses), GTN poroplasticity, and \(J_2\) radial-return.
CPU, crystal-plasticity (known regression): v3 is 3× to 30× slower than v2 on scpcoup, scpdecoup, scpdecoupexp, tcpsingle, and tcprandom. v3’s CP optimisation work targeted the CUDA AOTI path (per-slip-reduction-fused Triton kernels, K-batched JVP with broadcast-state tangents); the equivalent CPU codegen is currently slower than v2’s eager TorchScript-JIT path. Migrate CP workloads to CUDA to capture the 1.3-4.1× win there; CP-on-CPU is a planned follow-up.
The full per-batch sweep, fitted vectorisation-efficiency parameters, and raw CSVs are documented in Benchmarking.
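To make the metric concrete, here is a minimal Python sketch of how a peak-throughput speedup can be read off a batch-size sweep. It is not the project's benchmarking code: the function names are hypothetical, the sweep values are placeholder numbers rather than measurements, and the real methodology (including the fitted vectorisation-efficiency parameters) is the one described in Benchmarking.

```python
def peak_throughput(timings_ms):
    """Return (N*, updates per ms) for a {batch_size: wall_time_ms} sweep.

    Throughput at batch size N is N / time; N* is the batch size that
    maximises it.
    """
    best_n = max(timings_ms, key=lambda n: n / timings_ms[n])
    return best_n, best_n / timings_ms[best_n]


def speedup(v3_timings_ms, v2_timings_ms):
    """Ratio of peak throughputs, each runtime taken at its own N*."""
    _, v3_peak = peak_throughput(v3_timings_ms)
    _, v2_peak = peak_throughput(v2_timings_ms)
    return v3_peak / v2_peak


# Placeholder sweeps for illustration only; these are NOT measured values.
v2_sweep = {1: 1.0, 10: 2.0, 100: 10.0, 1000: 200.0}
v3_sweep = {1: 2.0, 10: 2.5, 100: 5.0, 1000: 40.0}

if __name__ == "__main__":
    print("v2 peak:", peak_throughput(v2_sweep))
    print("v3 peak:", peak_throughput(v3_sweep))
    print("speedup:", speedup(v3_sweep, v2_sweep))
```

Note that the two runtimes are compared at different batch sizes: each is evaluated where it performs best, which is why a single "N×" figure can hide a slower small-batch regime.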
Install / consumption
pip install neml2 is now the only supported install path. The wheel
bundles the runtime shared library, public C++ headers, CMake config
exports, and pkg-config files alongside the Python package, so a
single install serves both Python and C++ consumers — see
Basic installation.
If you previously built from source against a custom LibTorch, the
source-build flow still works through pip install -e . and the
dev / cc CMake presets; see Building from source.
Removed CMake presets
The preset list collapsed to dev and cc. The build flavors that
used to live in CMakePresets.json (release, asan, tsan,
coverage, profiling, …) were removed:
The wheel build that pip install drives uses cmake.build-type = "Release" internally; there’s no developer-facing release preset to invoke directly.
Sanitizer / coverage flavors are no longer wired into presets. Re-add them through CMakeUserPresets.json if you need them locally.
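As one possible way to re-add a sanitizer flavor locally, the Python sketch below prints a CMakeUserPresets.json that defines an asan configure preset inheriting from dev. The preset name asan, the schema version 3, and the compiler flags are assumptions chosen for illustration; match the version to the one used by the repository's CMakePresets.json, and adjust the flags for your toolchain.

```python
import json


def make_user_presets(base="dev"):
    """Build a CMakeUserPresets.json document with one sanitizer preset.

    `base` is the shipped preset to inherit from. The preset name "asan"
    and the flags below are illustrative assumptions, not project defaults.
    """
    return {
        "version": 3,  # assumed; use the schema version of CMakePresets.json
        "configurePresets": [
            {
                "name": "asan",
                "inherits": base,
                "cacheVariables": {
                    # Note: this replaces any CMAKE_CXX_FLAGS set by `base`.
                    "CMAKE_CXX_FLAGS": "-fsanitize=address -fno-omit-frame-pointer",
                },
            }
        ],
    }


if __name__ == "__main__":
    # Redirect the output into CMakeUserPresets.json at the repository root.
    print(json.dumps(make_user_presets(), indent=2))
```

Because the preset inherits everything else from dev, it may also share dev's build directory; setting a separate binaryDir in the user preset avoids mixing instrumented and plain objects.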
Removed CMake options
NEML2_PCH, NEML2_JSON, NEML2_CSV are gone. Precompiled headers
made a measurable difference in v2 when the C++ tower had thousands
of TUs; the v3 codebase has a handful, and the option no longer pays
for itself. JSON and CSV support are always on.
Removed CLI tools
neml2-diagnose and neml2-time were removed. The remaining CLI
surface is:
| Tool | Purpose |
|---|---|
| neml2-run | Drive a model through a load history. |
| | Print the structural summary of a model. |
| | Browse the registered-object catalog. |
| neml2-compile | Export a model to an AOT-Inductor package. |
| | Regenerate |
See CLI utilities for the full reference.
Compiled-model story: TorchScript → AOTI
neml2-jit and the TorchScript path are gone. The portable artifact
is now an AOT-Inductor (AOTI) package built with neml2-compile:
neml2-compile input.i --model elasticity
The output layout is per-device: neml2-compile writes the compiled
kernels and a _meta.json into <output-dir>/<model>/<device>/, and
places a standalone HIT stub <output-dir>/<model>_aoti.i next to the
<model>/ folder. The stub points at the artifact folder via an
absolute path; the loader picks the subfolder for the running device
automatically. Passing --device cpu cuda emits both cpu/ and
cuda/ subfolders under the same <model>/ folder, ready for the C++
multi-device dispatcher. The stub is not relocatable without
recompiling (the artifact path is absolute). The artifact loads in
milliseconds from either Python or C++ without re-parsing the input
file. See Compiled models for the round-trip walkthrough
and AOTI packages for the package-format reference.
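The layout described above can be written down as a small path helper, which is handy when a build script needs to check that neml2-compile produced what it expects. This is a sketch only: expected_layout is a hypothetical helper, not part of the neml2 package, and it encodes nothing beyond the directory structure stated in this section.

```python
from pathlib import Path


def expected_layout(output_dir, model, devices):
    """Return the paths neml2-compile is described as writing.

    Hypothetical helper: it mirrors the documented layout
    <output-dir>/<model>/<device>/_meta.json plus the stub
    <output-dir>/<model>_aoti.i, and does not inspect any files.
    """
    root = Path(output_dir)
    return {
        "stub": root / f"{model}_aoti.i",
        "artifacts": {dev: root / model / dev for dev in devices},
        "meta": {dev: root / model / dev / "_meta.json" for dev in devices},
    }


if __name__ == "__main__":
    layout = expected_layout("build/aoti", "elasticity", ["cpu", "cuda"])
    print(layout["stub"])
    for dev, meta in layout["meta"].items():
        print(dev, meta, "exists" if meta.exists() else "missing")
```

Since the stub stores an absolute path to the artifact folder, a check like this is only meaningful on the machine and directory where the package was compiled.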
Input-file changes
The HIT format is unchanged. Most input files from v2.1.2 work verbatim. The notable differences:
[Tensors] collapses to Python + CSV<Type>
The big input-file change is in the [Tensors] section. v2.1.2
shipped a registered catalog of tensor-constructor types —
UserTensor, FullTensor, LinspaceTensor, LogspaceTensor,
IdentityTensor, GaussianTensor, RandomTensor, FillR2,
FillSR2, FillRot, FillWR2, Orientation,
SymmetryFromOrbifold, FromTorchScript, and the per-primitive
type names (Scalar, R2, SR2, …). v3 keeps only two
built-in flavors:
Python: an inline PyTorch expression evaluated against a namespace pre-populated with torch, math, np, and every typed wrapper from neml2.types (Scalar, SR2, R2, Rot, …). Cross-references to other [Tensors] entries by name resolve lazily, so one block can build on another.
CSV<Type>: CSVScalar, CSVSR2, CSVVec, CSVWR2 for loading reference snapshots from disk. These are the only typed-constructor blocks that survived.
Everything else moves into a type = Python block with an expr
that calls the equivalent torch primitive. The wrapper that the
expression returns can chain .sub_batch.retag(...) to tag a
sub-batch axis — the same source of truth sub_batch_ndim /
sub_batch.expand_at used to encode separately.
# v2.1.2 — registered constructor type per shape / pattern
[Tensors]
[times]
type = LinspaceScalar
start = 0
end = 1
nstep = 5
[]
[strain_template]
type = FillSR2
values = '0.01 0 0 0 0 0'
[]
[]
# v3.0 — same data, one Python block per entry
[Tensors]
[times]
type = Python
expr = 'Scalar.linspace(0, 1, 5)'
[]
[strain_template]
type = Python
expr = 'SR2.fill(0.01, 0, 0, 0, 0, 0)'
[]
[]
Bare numeric / list option values (used directly inside [Models]
or [Drivers] blocks) still parse the same way and don’t need a
[Tensors] entry at all. The migration is only for blocks that
previously named a constructor type.
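For input files with many [Tensors] entries, the rewrite shown above can be partly mechanised. The Python sketch below covers only the two conversions that appear in the example (LinspaceScalar and FillSR2) and raises for everything else; the function names are hypothetical, and each block is represented as a plain dict of option strings rather than parsed with a real HIT parser.

```python
def to_python_expr(block):
    """Translate one v2 [Tensors] block (dict of option strings) to a v3 expr.

    Only the two constructor types from the example above are handled;
    any other type must be migrated by hand.
    """
    kind = block["type"]
    if kind == "LinspaceScalar":
        return f"Scalar.linspace({block['start']}, {block['end']}, {block['nstep']})"
    if kind == "FillSR2":
        # v2 takes a whitespace-separated list; the v3 call takes arguments.
        values = ", ".join(block["values"].split())
        return f"SR2.fill({values})"
    raise NotImplementedError(f"no automatic migration for type = {kind}")


def migrate_block(name, block):
    """Render the v3 replacement block as HIT text."""
    expr = to_python_expr(block)
    return "\n".join([f"[{name}]", "type = Python", f"expr = '{expr}'", "[]"])


if __name__ == "__main__":
    times = {"type": "LinspaceScalar", "start": "0", "end": "1", "nstep": "5"}
    print(migrate_block("times", times))
```

Raising on unknown types is deliberate: a silent pass-through would leave v2 constructor names in the file, and those are no longer registered in v3.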
Scheduler / dispatcher input wiring
The work-dispatcher is back, but C++-only and not input-file-wired.
Drop the scheduler = '<name>' option and any [Schedulers] block
from your input files — the v2 HIT surface is gone. In v3 dispatch is
a feature of the compiled C++ runtime: neml2-compile --device cpu cuda
emits one artifact per device, and neml2::aoti::load_model(stub, name, scheduler) runs a batch across them via SimpleScheduler /
MPISimpleScheduler, configured in C++ source rather than the .i. See
Dispatching across devices.
Python (neml2.load_model, neml2-run) stays eager and single-device;
the per-device sweep that pyzag / your own loop performs remains the
Python multi-device path.
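A quick way to find the leftover v2 wiring is to scan each input file for scheduler options and [Schedulers] blocks. The sketch below is a plain text scan, not a HIT parser: find_v2_scheduler_wiring is a hypothetical helper, it assumes # starts a comment, and it will miss options that are split across lines.

```python
import re

# Patterns for the two v2 constructs that v3 no longer accepts.
_OPTION = re.compile(r"\s*scheduler\s*=")
_BLOCK = re.compile(r"\s*\[Schedulers\]")


def find_v2_scheduler_wiring(text):
    """Return (line_number, stripped_line) for each v2 scheduler construct."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        if _OPTION.match(code) or _BLOCK.match(code):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for lineno, line in find_v2_scheduler_wiring(handle.read()):
                print(f"{path}:{lineno}: {line}")
```

The script only reports locations; removing the lines is left to the author, since a [Schedulers] block spans several lines and should be deleted as a whole.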
Python API surface
The user-facing Python surface is mostly stable:
neml2.load_model(path, name) and neml2.load_input(path) work as in v2.
Typed wrappers (Scalar, Vec, R2, SR2, WR2, SSR4, Rot, MillerIndex) live in neml2.types as before. Their underlying tensor is .data.
Scalar(<number>) now accepts a plain Python number directly (defaults to torch.float64). v2 required Scalar(torch.tensor(<number>, dtype=torch.float64)).
The notable surface changes:
neml2.es and the assembled-vector / assembled-matrix submodule moved to neml2.es.
neml2.tensors (the submodule re-exporting typed wrappers from the C++ bindings) is gone; the wrappers live under neml2.types.
The neml2.postprocessing module (ODF, pole-figure helpers) is not in v3.0. The crystal-plasticity outputs are still produced exactly as before; you’d build the post-processing on top of them yourself.
The neml2.reader module is gone; neml2.load_input covers the parser-facing surface.
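To locate code that still imports the removed modules, a small scan with the standard-library ast module is usually enough. The sketch below is an illustration, not a tool shipped with neml2: find_removed_imports is a hypothetical helper, and its module list is taken directly from the items above.

```python
import ast

# Modules listed above as removed in v3.0.
REMOVED_MODULES = ("neml2.tensors", "neml2.postprocessing", "neml2.reader")


def find_removed_imports(source):
    """Return sorted (line_number, module) pairs for imports of removed modules."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Cover both `from neml2.reader import x` and `from neml2 import reader`.
            names = [node.module]
            names += [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            for removed in REMOVED_MODULES:
                if name == removed or name.startswith(removed + "."):
                    hits.add((node.lineno, removed))
    return sorted(hits)


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for lineno, module in find_removed_imports(handle.read()):
                print(f"{path}:{lineno}: imports removed module {module}")
```

Imports of neml2.tensors can normally be pointed at neml2.types, as noted above; uses of neml2.postprocessing have no drop-in replacement in v3.0 and need to be reimplemented on top of the crystal-plasticity outputs.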
Documentation pipeline
The documentation pipeline moved from Doxygen + custom Python scripts to Sphinx with the shibuya theme and MyST-NB. The Python contract:
pip install ".[dev]" -v
doc/scripts/build.sh
See Documentation for the contributor view of the doc build
(including the --clean / --serve / --port flags exposed by the
wrapper) and Building from source for the source-build view.