Vectorization

You’ll evaluate the elasticity model from the "Running your first model" tutorial on a whole batch of strain states in one call, then time it against a Python for-loop to see how much the loop costs. The short version: when you have many states, stack them into one batched input and let the model evaluate them in a single call.

The input file

We reuse the elasticity model from the first tutorial unchanged:

Listing 4 input.i
# Reuses the hello-world elasticity model from the
# "Running your first model" tutorial. Linear isotropic elasticity
# is a trivial map (SR2 -> SR2), which keeps the focus on *how* we
# feed in a batch of strains rather than what the model is doing.
[Models]
  [elasticity]
    type = LinearIsotropicElasticity
    coefficients      = '200e3          0.3'
    coefficient_types = 'YOUNGS_MODULUS POISSONS_RATIO'
  []
[]

Load the model in Python, with double precision as the default dtype:

import torch
import neml2
from neml2.types import SR2

torch.set_default_dtype(torch.float64)

model = neml2.load_model("input.i", "elasticity")
model
LinearIsotropicElasticity()

One state at a time

A single (unbatched) strain has base shape (6,) — the six independent components of an SR2 in Mandel packing. The model returns a stress of the same shape:

strain1 = SR2.fill(0.01, 0.0, 0.0, 0.0, 0.0, 0.0)
strain1.data.shape, model(strain1).data.shape
(torch.Size([6]), torch.Size([6]))
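
To make the six-component layout concrete, here is a minimal pure-Python sketch of Mandel packing for a symmetric 3x3 tensor. The helper `mandel_pack` is hypothetical (it is not part of neml2), and the component ordering xx, yy, zz, yz, xz, xy is an assumption made for illustration; check the SR2 documentation for the ordering the library actually uses.

```python
import math


def mandel_pack(t):
    """Pack a symmetric 3x3 tensor (nested lists) into a 6-vector.

    Assumed ordering: xx, yy, zz, yz, xz, xy, with the three shear
    components scaled by sqrt(2).
    """
    s = math.sqrt(2.0)
    return [t[0][0], t[1][1], t[2][2], s * t[1][2], s * t[0][2], s * t[0][1]]


# A symmetric strain with one normal and one shear component.
eps = [
    [0.01, 0.002, 0.0],
    [0.002, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
v = mandel_pack(eps)

# The sqrt(2) scaling makes the plain dot product of the 6-vector equal
# to the full double contraction eps_ij * eps_ij of the 3x3 tensor.
full = sum(eps[i][j] * eps[i][j] for i in range(3) for j in range(3))
packed = sum(c * c for c in v)
print(len(v), math.isclose(full, packed))  # 6 True
```

The scaling is the reason for this packing: inner products and norms computed on the 6-vector agree with those of the full tensor.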

A batch of states in one call

To evaluate the model on N strain states at once, build the input with a leading batch dimension: shape (N, 6) instead of (6,). The output comes back with the same leading shape:

N = 5
batch_data = torch.zeros(N, 6)
batch_data[:, 0] = torch.linspace(0.0, 0.01, N)  # ramp epsilon_xx from 0 to 1%
strains = SR2(batch_data)

stresses = model(strains)
strains.data.shape, stresses.data.shape
(torch.Size([5, 6]), torch.Size([5, 6]))
stresses.data
tensor([[   0.0000,    0.0000,    0.0000,    0.0000,    0.0000,    0.0000],
        [ 673.0769,  288.4615,  288.4615,    0.0000,    0.0000,    0.0000],
        [1346.1538,  576.9231,  576.9231,    0.0000,    0.0000,    0.0000],
        [2019.2308,  865.3846,  865.3846,    0.0000,    0.0000,    0.0000],
        [2692.3077, 1153.8462, 1153.8462,    0.0000,    0.0000,    0.0000]],
       grad_fn=<AddBackward0>)

Each row of the output is the stress for the corresponding strain — no loop required. The leading dimension is free-form: a 2-D batch like (n_load_steps, n_samples) returns (n_load_steps, n_samples, 6), and so on.

multi = torch.zeros(4, 3, 6)
multi[..., 0] = torch.linspace(0.0, 0.01, 4).unsqueeze(-1)
model(SR2(multi)).data.shape
torch.Size([4, 3, 6])
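
The stresses tabulated above for the five-state batch can be cross-checked by hand. The sketch below assumes the standard isotropic Hooke's law, sigma = lambda tr(eps) I + 2 mu eps, with the Lamé parameters derived from the Young's modulus and Poisson's ratio in input.i. The function name `uniaxial_strain_stress` is hypothetical, and only the three normal components are computed because every strain in the batch has eps_xx as its only nonzero entry.

```python
# Material constants from input.i
E, nu = 200e3, 0.3

# Lame parameters for isotropic linear elasticity
lam = E * nu / ((1 + nu) * (1 - 2 * nu))
mu = E / (2 * (1 + nu))


def uniaxial_strain_stress(eps_xx):
    """Normal stresses (xx, yy, zz) for a strain whose only nonzero entry is eps_xx."""
    trace = eps_xx                        # tr(eps) when only eps_xx is nonzero
    s_xx = lam * trace + 2 * mu * eps_xx  # axial component
    s_lat = lam * trace                   # lateral components (yy and zz)
    return s_xx, s_lat, s_lat


# Same ramp as torch.linspace(0.0, 0.01, 5)
for k in range(5):
    eps_xx = 0.01 * k / 4
    print([round(s, 4) for s in uniaxial_strain_stress(eps_xx)])
```

Each printed row should agree with the first three columns of the corresponding row of stresses.data, for example 673.0769 and 288.4615 at eps_xx = 0.0025.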

Loop vs. batched: a numerical experiment

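How much does the loop actually cost? Evaluate the same elastic model on \(N=10{,}000\) strain states two ways, with a Python loop over single states and with one batched call, and time each. Note that the loop timing also includes slicing out each row and wrapping it in an SR2, which is part of the real per-state cost of looping: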

import time

N = 10_000
batch_data = torch.zeros(N, 6)
batch_data[:, 0] = torch.linspace(0.0, 0.01, N)
strains = SR2(batch_data)

# --- Python loop: one model call per state ---
t0 = time.perf_counter()
for i in range(N):
    model(SR2(batch_data[i]))
t_loop = time.perf_counter() - t0

# --- Single batched call: one model invocation, N states ---
t0 = time.perf_counter()
model(strains)
t_batch = time.perf_counter() - t0

print(f"python loop: {t_loop*1e3:8.2f} ms")
print(f"batched:     {t_batch*1e3:8.2f} ms")
print(f"speedup:     {t_loop/t_batch:8.1f}x")
python loop:  2403.96 ms
batched:         1.50 ms
speedup:       1597.8x

In this run the batched call is roughly 1,600 times faster, even for a model this trivial on CPU: the loop pays a fixed per-call overhead N times, while the batched call pays it once and amortizes it across the whole batch. Exact timings vary from machine to machine. On GPU, where each Python-level launch carries extra latency, the gap is usually wider still.
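
To put the amortization argument in numbers, the following sketch turns the two timings reported above into a cost per state. It is plain arithmetic on the rounded figures from that one run, not a new measurement, so the ratio it prints differs slightly from the speedup printed by the cell.

```python
N = 10_000
t_loop = 2403.96e-3  # seconds, python loop timing reported above
t_batch = 1.50e-3    # seconds, batched timing reported above (rounded)

# In the loop, every state pays the full cost of one model call.
per_call = t_loop / N

# In the batched call, the fixed overhead is shared by all N states.
per_state_batched = t_batch / N

print(f"loop cost per call:     {per_call * 1e6:8.2f} us")
print(f"batched cost per state: {per_state_batched * 1e6:8.2f} us")
print(f"ratio:                  {per_call / per_state_batched:8.0f}x")
```

At about 240 microseconds per call, the loop is dominated by call overhead rather than by the six-component arithmetic itself, which is why batching helps so much for a model this cheap.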

Where to go next

  • Evaluation device — the same batched call runs on CUDA once you move the model and the input there with .to(device); on GPU, the per-call overhead the batched form avoids is usually even more pronounced.

  • Cross-referencing and Model composition — once a model is composed of several pieces, the same batched-call semantics propagate through every internal evaluation.

  • Transient driver — for time-stepping a batched state through a load history, where each batched call is one step.