Evaluation device¶
You’ll move a model and its inputs onto a target device (CPU here, but the same calls work for CUDA or any other torch device) and run a forward pass.
The runnable cells stay on CPU so the doc build is portable. Swap
"cpu" for "cuda" to run on a GPU — nothing else changes.
Note
A freshly loaded NEML2 model sits on CPU with torch.float64
parameters. There is no separate “CUDA build” of NEML2 — the same
wheel runs on whichever devices your PyTorch install supports; you
opt in at runtime with .to(...).
The input file¶
# Linear isotropic elasticity used by the evaluation-device tutorial.
# Same model as the "running your first model" tutorial -- a small,
# device-agnostic forward operator to demonstrate how parameters and
# inputs are moved between CPU and CUDA.
[Models]
[elasticity]
type = LinearIsotropicElasticity
coefficients = '200e3 0.3'
coefficient_types = 'YOUNGS_MODULUS POISSONS_RATIO'
[]
[]
Loading and inspecting placement¶
Load the model and check where its parameters live:
import torch
import neml2
model = neml2.load_model("input.i", "elasticity")
for name, p in model.named_parameters():
print(f"{name:>3}: device={p.device}, dtype={p.dtype}")
E: device=cpu, dtype=torch.float64
nu: device=cpu, dtype=torch.float64
Moving the model and its inputs¶
Two pieces have to land on the target device before you call
model(x):
The model —
model.to(device=...)moves its parameters and buffers (recursively, for composed models).The inputs — types in
neml2.types(likeSR2) accept adevice=keyword in their constructors.
The cell below targets CPU. To run on a GPU, swap in
torch.device("cuda"):
from neml2.types import SR2
target = torch.device("cpu") # swap for torch.device("cuda") on a CUDA box
# 1. Move model parameters/buffers.
model.to(device=target)
for name, p in model.named_parameters():
print(f"{name:>3}: device={p.device}")
# 2. Allocate the input on the same device.
strain = SR2.fill(0.01, 0.0, 0.0, 0.0, 0.0, 0.0, device=target)
print(f"strain.device = {strain.device}")
E: device=cpu
nu: device=cpu
strain.device = cpu
Forward pass and bringing the result home¶
With the model and input on the same device, the call looks just like
the CPU case. The result lives on that same device, so pull it back
with .to(device="cpu") if you need it for NumPy or Matplotlib:
stress = model(strain)
print(f"stress.device = {stress.device}")
stress_host = stress.to(device=torch.device("cpu"))
print(f"stress (host copy): {stress_host}")
stress.device = cpu
stress (host copy): SR2(data=tensor([2692.3076, 1153.8462, 1153.8462, 0.0000, 0.0000, 0.0000],
grad_fn=<AddBackward0>), sub_batch_ndim=0, sub_batch_state=(), sub_batch_meta=(), k_ndim=0, k_state=(), k_pairing=())
If target had been torch.device("cuda"), stress.device would
read cuda:0 and the .to(device="cpu") call would copy the result
across the host-device boundary.
Detecting CUDA at runtime¶
Production code that wants to opportunistically use CUDA usually
guards on torch.cuda.is_available():
target = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device=target)
is_available() returns True when a CUDA runtime and a visible GPU
are both present. If you want to fall back gracefully when the first
CUDA call fails, wrap the first forward in a try / except and
reload onto CPU in the handler.
Mixed-device errors¶
If the model is on one device and the input is on another, PyTorch raises at the first op that touches both. The fix is the same in both directions: move both ends to the same device.
On a CUDA-equipped machine, this would raise:
model = neml2.load_model("input.i", "elasticity") # CPU
strain = SR2.fill(0.01, 0.0, 0.0, 0.0, 0.0, 0.0, device="cuda")
model(strain) # RuntimeError
Host-device transfer cost¶
.to(device=...) is not free — it copies data between host and
device memory. A few rules of thumb:
Move the model once, up front. The parameters don’t change between calls, so copying them every time wastes bandwidth.
Build inputs on the device. Pass
device=to the constructor (e.g.SR2.fill(..., device=target)) instead of building on CPU and copying.Pull only what you need back to CPU. Keep the integration loop on GPU and only
.to(device="cpu")the final history slice for plotting.
Where to go next¶
The same model also runs on batched inputs — see Vectorization.
To read or mutate the model’s parameters from Python, see Model parameters.