Migrating from v2.x to v3.0

v3.0 is the largest single release since v2 — the runtime is now implemented entirely in Python (running on the PyTorch backend), with AOT-Inductor as the path to a portable compiled artifact. Most input files written for v2.1.2 continue to work unchanged; the changes below are the ones to look at first.

Why migrate?

Speed-ups are measured as peak throughput: material updates per ms at the optimal batch size \(N^*\) at which each runtime hits its vectorisation ceiling (see Benchmarking for the methodology). The figures compare v3.0 against v2.1.2 on the same node (NVIDIA RTX A5000 + Intel Xeon Gold 6346, 32 hyperthreaded cores):

  • CUDA, every scenario: v3 is 1.0× to 4.5× faster than v2. Every benchmarked model + driver combination is at least as fast on v3, and most are 2-4× faster.

  • CPU, non-crystal-plasticity: v3 is 1.4× to >10× faster than v2 across elasticity, viscoplastic hardening, Chaboche (2/4/6 back stresses), GTN poroplasticity, and \(J_2\) radial-return.

  • CPU, crystal-plasticity (known regression): v3 is 3× to 30× slower than v2 on scpcoup, scpdecoup, scpdecoupexp, tcpsingle, tcprandom. v3’s CP optimisation work targeted the CUDA AOTI path (per-slip-reduction-fused Triton kernels, K-batched JVP with broadcast-state tangents); the equivalent CPU codegen is currently slower than v2’s eager TorchScript-JIT path. Migrate CP workloads to CUDA to capture the 1.3-4.1× win there; CP-on-CPU is a planned follow-up.

The full per-batch sweep, fitted vectorisation-efficiency parameters, and raw CSVs are documented in Benchmarking.
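The peak-throughput metric used above reduces to simple arithmetic, sketched below. This is only an illustration of the definition, not the project's benchmarking harness: the function names are hypothetical and the timings are placeholders, not measured data.

```python
from typing import Iterable, Tuple


def peak_throughput(sweep: Iterable[Tuple[int, float]]) -> Tuple[int, float]:
    """Return (N*, peak throughput) for a sweep of (batch size, ms per call)."""
    best_n, best_rate = 0, 0.0
    for batch_size, elapsed_ms in sweep:
        rate = batch_size / elapsed_ms  # material updates per ms
        if rate > best_rate:
            best_n, best_rate = batch_size, rate
    return best_n, best_rate


def speedup(v3_sweep, v2_sweep) -> float:
    """Ratio of peak throughputs; each runtime is taken at its own N*."""
    return peak_throughput(v3_sweep)[1] / peak_throughput(v2_sweep)[1]


# Placeholder timings for illustration only; these are NOT benchmark results.
v2_sweep = [(1, 0.5), (10, 0.8), (100, 4.0), (1000, 50.0)]
v3_sweep = [(1, 0.5), (10, 0.8), (100, 2.0), (1000, 10.0)]

print(peak_throughput(v2_sweep), peak_throughput(v3_sweep), speedup(v3_sweep, v2_sweep))
```

Note that the two runtimes are compared at different batch sizes whenever their \(N^*\) differ, which is why the ratio is a statement about peak capability rather than about any single batch size.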

Install / consumption

pip install neml2 is now the only supported install path. The wheel bundles the runtime shared library, public C++ headers, CMake config exports, and pkg-config files alongside the Python package, so a single install serves both Python and C++ consumers — see Basic installation.

If you previously built from source against a custom LibTorch, the source-build flow still works through pip install -e . and the dev / cc CMake presets; see Building from source.

Removed CMake presets

The preset list collapsed to dev and cc. The build flavors that used to live in CMakePresets.json (release, asan, tsan, coverage, profiling, …) were removed:

  • The wheel build that pip install drives uses cmake.build-type = "Release" internally — there’s no developer-facing release preset to invoke directly.

  • Sanitizer / coverage flavors are no longer wired into presets. Re-add them through CMakeUserPresets.json if you need them locally.
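As a rough sketch of re-adding a sanitizer flavor locally, the script below generates a CMakeUserPresets.json with one configure preset that inherits from dev. The preset name asan-local, the build directory, the schema version, and the assumption that the project honours CMAKE_CXX_FLAGS are all illustrative choices, not part of NEML2.

```python
import json


def make_user_presets(base: str = "dev") -> dict:
    """Build a CMakeUserPresets.json payload with one AddressSanitizer preset."""
    return {
        # Assumption: schema version 3 (CMake >= 3.21); match the project's file if it differs.
        "version": 3,
        "configurePresets": [
            {
                "name": "asan-local",  # hypothetical preset name
                "displayName": "AddressSanitizer (local)",
                "inherits": base,
                # Separate build tree so the regular dev build is not overwritten.
                "binaryDir": "${sourceDir}/build/asan-local",
                "cacheVariables": {
                    "CMAKE_BUILD_TYPE": "Debug",
                    "CMAKE_CXX_FLAGS": "-fsanitize=address -fno-omit-frame-pointer",
                    "CMAKE_EXE_LINKER_FLAGS": "-fsanitize=address",
                    "CMAKE_SHARED_LINKER_FLAGS": "-fsanitize=address",
                },
            }
        ],
    }


presets_json = json.dumps(make_user_presets(), indent=2)
print(presets_json)  # redirect into CMakeUserPresets.json at the repository root
```

Keeping the flavor in CMakeUserPresets.json means it stays local to your checkout, and the preset can then be selected with cmake --preset asan-local.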

Removed CMake options

NEML2_PCH, NEML2_JSON, NEML2_CSV are gone. Precompiled headers made a measurable difference in v2 when the C++ tower had thousands of TUs; the v3 codebase has a handful, and the option no longer pays for itself. JSON and CSV support are always on.

Removed CLI tools

neml2-diagnose and neml2-time were removed. The remaining CLI surface is:

Tool             Purpose
neml2-run        Drive a model through a load history.
neml2-inspect    Print the structural summary of a model.
neml2-syntax     Browse the registered-object catalog.
neml2-compile    Export a model to an AOT-Inductor package.
neml2-stub       Regenerate .pyi stubs for the pybind11 extensions.

See CLI utilities for the full reference.

Authoring custom models

Custom models are now written as a Python class deriving from neml2.models.model.Model (which is itself a torch.nn.Module). The authoring surface is documented end-to-end in Extension — in summary:

  • Schema declaration uses the helpers in neml2.schema (input, output, parameter, …) rather than the C++ options.add_input / add_parameter family.

  • Registration uses the @register_neml2_object("TypeName") decorator rather than the register_NEML2_object C++ macro.

  • forward(self, *typed_inputs, v=None, v2=None, vh=None) replaces the C++ set_value virtual; the optional v / v2 / vh kwargs carry the first- and second-order chain-rule channels.

If you were maintaining a C++ subclass of Model against v2, the Declaring inputs, outputs, and parameters, The forward operator, and Connecting to input files tutorials walk through the equivalent Python pattern on a single running example.

Compiled-model story: TorchScript → AOTI

neml2-jit and the TorchScript path are gone. The portable artifact is now an AOT-Inductor (AOTI) package built with neml2-compile:

neml2-compile input.i --model elasticity

The output layout is per-device: neml2-compile writes the compiled kernels and a _meta.json into <output-dir>/<model>/<device>/, and places a standalone HIT stub <output-dir>/<model>_aoti.i next to the <model>/ folder. Passing --device cpu cuda emits both cpu/ and cuda/ subfolders under the same <model>/ folder, ready for the C++ multi-device dispatcher; the loader picks the subfolder for the running device automatically. The stub points at the artifact folder via an absolute path, so it is not relocatable without recompiling. The artifact loads in milliseconds from either Python or C++ without re-parsing the input file. See Compiled models for the round-trip walkthrough and AOTI packages for the package-format reference.
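To make the directory layout concrete, here is a small sketch that computes where the stub and the per-device _meta.json files should land for a given output directory. The helper name expected_layout and the build/aoti directory are hypothetical; only the path pattern itself comes from the description above, and the names of the compiled kernel files are deliberately left out because they are not specified here.

```python
from pathlib import Path
from typing import Dict, Sequence


def expected_layout(output_dir: str, model: str, devices: Sequence[str]) -> Dict[str, object]:
    """Paths implied by the layout <output-dir>/<model>/<device>/ plus the HIT stub."""
    root = Path(output_dir)
    return {
        # Standalone HIT stub placed next to the <model>/ folder.
        "stub": root / f"{model}_aoti.i",
        # One _meta.json per device subfolder.
        "meta": {dev: root / model / dev / "_meta.json" for dev in devices},
    }


layout = expected_layout("build/aoti", "elasticity", ["cpu", "cuda"])
print(layout["stub"])
for device, meta_path in layout["meta"].items():
    print(device, meta_path)
```

A check like `all(p.exists() for p in layout["meta"].values())` after running neml2-compile is one way to confirm that every requested device was actually emitted.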

Input-file changes

The HIT format is unchanged. Most input files from v2.1.2 work verbatim. The notable differences:

[Tensors] collapses to Python + CSV<Type>

The big input-file change is in the [Tensors] section. v2.1.2 shipped a registered catalog of tensor-constructor types — UserTensor, FullTensor, LinspaceTensor, LogspaceTensor, IdentityTensor, GaussianTensor, RandomTensor, FillR2, FillSR2, FillRot, FillWR2, Orientation, SymmetryFromOrbifold, FromTorchScript, and the per-primitive type names (Scalar, R2, SR2, …). v3 keeps only two built-in flavors:

  • Python — an inline PyTorch expression evaluated against a namespace pre-populated with torch, math, np, and every typed wrapper from neml2.types (Scalar, SR2, R2, Rot, …). Cross-references to other [Tensors] entries by name resolve lazily, so one block can build on another.

  • CSV<Type> (CSVScalar, CSVSR2, CSVVec, CSVWR2): loads reference snapshots from disk. These are the only typed-constructor blocks that survived.

Everything else moves into a type = Python block with an expr that calls the equivalent torch primitive. The wrapper that the expression returns can chain .sub_batch.retag(...) to tag a sub-batch axis, giving a single source of truth for what sub_batch_ndim / sub_batch.expand_at used to encode separately.

# v2.1.2 — registered constructor type per shape / pattern
[Tensors]
  [times]
    type = LinspaceScalar
    start = 0
    end = 1
    nstep = 5
  []
  [strain_template]
    type = FillSR2
    values = '0.01 0 0 0 0 0'
  []
[]
# v3.0 — same data, one Python block per entry
[Tensors]
  [times]
    type = Python
    expr = 'Scalar.linspace(0, 1, 5)'
  []
  [strain_template]
    type = Python
    expr = 'SR2.fill(0.01, 0, 0, 0, 0, 0)'
  []
[]

Bare numeric / list option values (used directly inside [Models] or [Drivers] blocks) still parse the same way and don’t need a [Tensors] entry at all. The migration is only for blocks that previously named a constructor type.
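For input files with many [Tensors] entries, the rewrite shown in the example above can be scripted. The sketch below covers only the two constructor types that appear in that example (LinspaceScalar and FillSR2); the helper names are hypothetical, and every other v2 constructor would need its own mapping, checked against the v3 wrapper API.

```python
def to_python_expr(block_type: str, options: dict) -> str:
    """Translate one v2 [Tensors] entry into a v3 `expr` string."""
    if block_type == "LinspaceScalar":
        return f"Scalar.linspace({options['start']}, {options['end']}, {options['nstep']})"
    if block_type == "FillSR2":
        # v2 stored the components as one space-separated string.
        values = ", ".join(options["values"].split())
        return f"SR2.fill({values})"
    raise NotImplementedError(f"no mapping sketched for type {block_type!r}")


def render_block(name: str, expr: str) -> str:
    """Render a v3 `type = Python` entry with the indentation used in the example."""
    return f"  [{name}]\n    type = Python\n    expr = '{expr}'\n  []"


times = to_python_expr("LinspaceScalar", {"start": "0", "end": "1", "nstep": "5"})
strain = to_python_expr("FillSR2", {"values": "0.01 0 0 0 0 0"})
print(render_block("times", times))
print(render_block("strain_template", strain))
```

Raising on unknown types is deliberate: a silent fallback would make it easy to ship an input file that still names a constructor v3 no longer registers.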

Scheduler / dispatcher input wiring

The work-dispatcher is back, but it is C++-only and not wired into input files. Drop the scheduler = '<name>' option and any [Schedulers] block from your input files; the v2 HIT surface is gone. In v3, dispatch is a feature of the compiled C++ runtime: neml2-compile --device cpu cuda emits one artifact per device, and neml2::aoti::load_model(stub, name, scheduler) runs a batch across them via SimpleScheduler / MPISimpleScheduler, configured in C++ source rather than in the .i file. See Dispatching across devices.

Python (neml2.load_model, neml2-run) stays eager and single-device; the per-device sweep that pyzag / your own loop performs remains the Python multi-device path.
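The input-file cleanup described in this section (dropping scheduler options and [Schedulers] blocks) can be approximated with a line-based filter. This is a minimal sketch, assuming one block header or option per line and `[]` (or `[../]`) as the block terminator; it is not a real HIT parser, and the driver type in the sample is a made-up placeholder.

```python
import re


def strip_scheduler_wiring(text: str) -> str:
    """Remove [Schedulers] blocks and `scheduler = ...` option lines."""
    kept = []
    depth = 0  # > 0 while inside a [Schedulers] block (tracks nested sub-blocks)
    for line in text.splitlines():
        stripped = line.strip()
        if depth > 0:
            if stripped in ("[]", "[../]"):
                depth -= 1
            elif stripped.startswith("[") and stripped.endswith("]"):
                depth += 1
            continue  # drop everything inside the block, including its terminator
        if stripped == "[Schedulers]":
            depth = 1
            continue
        if re.match(r"scheduler\s*=", stripped):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


sample = """[Schedulers]
  [sched]
    type = SimpleScheduler
  []
[]
[Drivers]
  [driver]
    type = SomeDriver
    scheduler = 'sched'
    model = 'model'
  []
[]
"""
cleaned = strip_scheduler_wiring(sample)
print(cleaned)
```

Review the diff before committing the result; a filter this simple will not notice scheduler names referenced through includes or unusual formatting.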

Python API surface

The user-facing Python surface is mostly stable:

  • neml2.load_model(path, name) and neml2.load_input(path) work as in v2.

  • Typed wrappers (Scalar, Vec, R2, SR2, WR2, SSR4, Rot, MillerIndex) live in neml2.types as before. Their underlying tensor is .data.

  • Scalar(<number>) now accepts a plain Python number directly (defaults to torch.float64). v2 required Scalar(torch.tensor(<number>, dtype=torch.float64)).

The notable surface changes:

  • The assembled-vector / assembled-matrix submodule now lives in neml2.es.

  • neml2.tensors (the submodule re-exporting typed wrappers from the C++ bindings) is gone; the wrappers live under neml2.types.

  • The neml2.postprocessing module (ODF, pole-figure helpers) is not in v3.0. The crystal-plasticity outputs are still produced exactly as before; you’d build the post-processing on top of them yourself.

  • The neml2.reader module is gone — neml2.load_input covers the parser-facing surface.
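One way to find affected call sites is to scan your Python sources for imports of the modules listed above as removed. The sketch below is a simple regex-based check with hypothetical helper names; it only inspects single-line import statements and will miss dynamic imports or multi-line import lists.

```python
import re

# Removed modules and the replacement suggested in the list above.
REMOVED_MODULES = {
    "neml2.tensors": "import the typed wrappers from neml2.types",
    "neml2.postprocessing": "not in v3.0; build post-processing on the model outputs",
    "neml2.reader": "use neml2.load_input",
}

_IMPORT = re.compile(r"^\s*(?:from|import)\s+([\w.]+)")
_FROM_NEML2 = re.compile(r"^\s*from\s+neml2\s+import\s+(.+)")


def find_removed_imports(source: str):
    """Return (line number, removed module, hint) for each offending import line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _IMPORT.match(line)
        if match is None:
            continue
        candidates = [match.group(1)]
        from_pkg = _FROM_NEML2.match(line)
        if from_pkg:  # handles `from neml2 import tensors, reader as r`
            for name in from_pkg.group(1).strip("() ").split(","):
                candidates.append("neml2." + name.split(" as ")[0].strip())
        for module in candidates:
            for removed, hint in REMOVED_MODULES.items():
                if module == removed or module.startswith(removed + "."):
                    hits.append((lineno, removed, hint))
    return hits


sample = """import torch
from neml2.tensors import Scalar
import neml2.reader
from neml2 import postprocessing, load_model
from neml2.types import SR2
"""
for lineno, module, hint in find_removed_imports(sample):
    print(f"line {lineno}: {module} -> {hint}")
```

The hints are paraphrased from this guide, so treat the output as a checklist of places to look rather than an automatic rewrite.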

Documentation pipeline

The documentation pipeline moved from Doxygen + custom Python scripts to Sphinx with the shibuya theme and MyST-NB. The Python contract:

pip install ".[dev]" -v
doc/scripts/build.sh

See Documentation for the contributor view of the doc build (including the --clean / --serve / --port flags exposed by the wrapper) and Building from source for the source-build view.