Vectorization
You’ll evaluate the elasticity model from the
"Running your first model" tutorial on a whole batch of
strain states in one call, then time it against a Python for-loop
to see how much the loop costs. The short version: when you have
many states, stack them into one batched input and let the model
evaluate them in a single call.
The input file
We reuse the elasticity model from the first tutorial unchanged:
# Reuses the hello-world elasticity model from the
# "Running your first model" tutorial. Linear isotropic elasticity
# is a trivial map (SR2 -> SR2), which keeps the focus on *how* we
# feed in a batch of strains rather than what the model is doing.
[Models]
[elasticity]
type = LinearIsotropicElasticity
coefficients = '200e3 0.3'
coefficient_types = 'YOUNGS_MODULUS POISSONS_RATIO'
[]
[]
import torch
import neml2
from neml2.types import SR2
torch.set_default_dtype(torch.float64)
model = neml2.load_model("input.i", "elasticity")
model
LinearIsotropicElasticity()
One state at a time
A single (unbatched) strain has base shape (6,) — the six independent
components of an SR2 in Mandel packing. The model returns a stress
of the same shape:
strain1 = SR2.fill(0.01, 0.0, 0.0, 0.0, 0.0, 0.0)
strain1.data.shape, model(strain1).data.shape
(torch.Size([6]), torch.Size([6]))
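For readers unfamiliar with Mandel packing, here is a minimal pure-Python sketch that does not depend on NEML2. The helper names `mandel_pack` and `contract`, and the component order (xx, yy, zz, yz, xz, xy), are assumptions made for illustration; consult the NEML2 documentation for the ordering the library actually uses. The sketch shows the one property that matters here: the shear entries carry a factor of √2, so the dot product of two packed 6-vectors equals the full double contraction of the original 3x3 tensors.

```python
import math

def mandel_pack(t):
    """Pack a symmetric 3x3 tensor (nested lists) into a Mandel 6-vector.

    Assumed component order: xx, yy, zz, yz, xz, xy (illustrative only).
    """
    s = math.sqrt(2.0)
    return [t[0][0], t[1][1], t[2][2], s * t[1][2], s * t[0][2], s * t[0][1]]

def contract(a, b):
    """Full double contraction a_ij * b_ij of two 3x3 tensors."""
    return sum(a[i][j] * b[i][j] for i in range(3) for j in range(3))

# Uniaxial strain, the same state as strain1: only epsilon_xx is nonzero.
uniaxial = [[0.01, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(mandel_pack(uniaxial))

# A symmetric tensor with shear components: the packed dot product
# matches the full tensor contraction thanks to the sqrt(2) factors.
sheared = [[1.0, 0.5, 0.0], [0.5, 2.0, 0.25], [0.0, 0.25, 3.0]]
packed = mandel_pack(sheared)
dot = sum(x * x for x in packed)
print(math.isclose(dot, contract(sheared, sheared)))
```

Because the uniaxial state has no shear, its packed form is just the three normal components followed by zeros, which is why `SR2.fill(0.01, 0.0, 0.0, 0.0, 0.0, 0.0)` reads naturally as "1% strain along x".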
A batch of states in one call
To evaluate the model on N strain states at once, build the input
with a leading batch dimension: shape (N, 6) instead of (6,).
The output comes back with the same leading shape:
N = 5
batch_data = torch.zeros(N, 6)
batch_data[:, 0] = torch.linspace(0.0, 0.01, N) # ramp epsilon_xx from 0 to 1%
strains = SR2(batch_data)
stresses = model(strains)
strains.data.shape, stresses.data.shape
(torch.Size([5, 6]), torch.Size([5, 6]))
stresses.data
tensor([[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 673.0769, 288.4615, 288.4615, 0.0000, 0.0000, 0.0000],
[1346.1538, 576.9231, 576.9231, 0.0000, 0.0000, 0.0000],
[2019.2308, 865.3846, 865.3846, 0.0000, 0.0000, 0.0000],
[2692.3077, 1153.8462, 1153.8462, 0.0000, 0.0000, 0.0000]],
grad_fn=<AddBackward0>)
Each row of the output is the stress for the corresponding strain — no
loop required. The leading dimension is free-form: a 2-D batch like
(n_load_steps, n_samples) returns (n_load_steps, n_samples, 6),
and so on.
multi = torch.zeros(4, 3, 6)
multi[..., 0] = torch.linspace(0.0, 0.01, 4).unsqueeze(-1)
model(SR2(multi)).data.shape
torch.Size([4, 3, 6])
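As a sanity check on the N = 5 stresses printed earlier in this section, the numbers can be reproduced by hand from the two coefficients in the input file. The sketch below is plain Python and does not call NEML2; the helper name `uniaxial_strain_stress` is hypothetical. It uses the standard Lamé relations for linear isotropic elasticity, and because the shear strains are zero the Mandel scaling plays no role.

```python
# Young's modulus and Poisson's ratio, as given in the input file.
E, nu = 200e3, 0.3

# Standard Lame parameters for linear isotropic elasticity.
lam = E * nu / ((1 + nu) * (1 - 2 * nu))
mu = E / (2 * (1 + nu))

def uniaxial_strain_stress(eps_xx):
    """Normal stresses (xx, yy, zz) when epsilon_xx is the only nonzero strain."""
    axial = (lam + 2 * mu) * eps_xx  # sigma_xx
    lateral = lam * eps_xx           # sigma_yy = sigma_zz
    return axial, lateral, lateral

N = 5
for k in range(N):
    eps = 0.01 * k / (N - 1)  # same ramp as torch.linspace(0.0, 0.01, N)
    print([round(s, 4) for s in uniaxial_strain_stress(eps)])
```

Each printed row should agree with the first three columns of the corresponding row of `stresses.data`; the remaining three columns are shear stresses, which stay at zero for this loading.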
Loop vs. batched: a numerical experiment
How much does the loop actually cost? Evaluate the same elastic model on \(N=10{,}000\) strain states two ways, once with a Python loop over single states and once with a single batched call, and time only the evaluations (the loop’s timing also includes wrapping each row in an SR2):
import time
N = 10_000
batch_data = torch.zeros(N, 6)
batch_data[:, 0] = torch.linspace(0.0, 0.01, N)
strains = SR2(batch_data)
# --- Python loop: one model call per state ---
t0 = time.perf_counter()
for i in range(N):
    model(SR2(batch_data[i]))
t_loop = time.perf_counter() - t0
# --- Single batched call: one model invocation, N states ---
t0 = time.perf_counter()
model(strains)
t_batch = time.perf_counter() - t0
print(f"python loop: {t_loop*1e3:8.2f} ms")
print(f"batched: {t_batch*1e3:8.2f} ms")
print(f"speedup: {t_loop/t_batch:8.1f}x")
python loop: 2403.96 ms
batched: 1.50 ms
speedup: 1597.8x
In the run shown above, the batched call is roughly 1,600 times faster. The gap is large even for a model this trivial on CPU because the loop pays a roughly constant per-call overhead N times, whereas the batched call pays it once and amortizes it across the whole batch. On GPU, where each Python-level launch carries extra latency, the gap is usually wider still.
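A single `perf_counter` measurement like the one above can be noisy, particularly for the millisecond-scale batched call. The following is a minimal sketch of a more careful timing helper, under stated assumptions: the name `best_of` and its parameters are hypothetical, and the workload is a stand-in so the block runs on its own. It makes one untimed warm-up call, then reports the fastest of several repeats. The optional `sync` hook is there because GPU work is launched asynchronously; on CUDA you would likely pass `torch.cuda.synchronize` so the clock stops only after the work has finished.

```python
import time

def best_of(fn, repeats=5, warmup=1, sync=None):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    for _ in range(warmup):
        fn()  # untimed call to absorb one-time setup costs
    times = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure no earlier asynchronous work is still pending
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        times.append(time.perf_counter() - t0)
    return min(times)

# Stand-in workload so the sketch is self-contained; in the notebook you
# would pass something like `lambda: model(strains)` instead.
calls = []
elapsed = best_of(lambda: calls.append(sum(range(1000))), repeats=5, warmup=1)
print(f"best of 5: {elapsed * 1e6:.1f} us over {len(calls)} total calls")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since background activity on the machine can only make a run slower, never faster.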
Where to go next
Evaluation device: the same batched call runs on CUDA once you move the model and the input there with .to(device); on GPU, the per-call overhead the batched form avoids is usually even more pronounced.
Cross-referencing and Model composition: once a model is composed of several pieces, the same batched-call semantics propagate through every internal evaluation.
Transient driver: for time-stepping a batched state through a load history, where each batched call is one step.