Vectorization

What is vectorization?

Vectorization refers to array programming. Both SIMD (single instruction, multiple data) and SIMT (single instruction, multiple threads) are implementations of array programming. These terms all refer to techniques that apply the same operation to an entire set of values at once.

While the benefit of vectorization depends on instruction support, instruction throughput, and register size, among other factors, vectorization is generally expected to provide a considerable speedup for many operations. For example, consider an instruction set with 1024-bit registers: each register can hold 16 operands of 64 bits apiece, so for double-precision floating point numbers a single instruction can perform 16 operations at once. Ideally, N operations can then be completed using N/16 instructions, i.e., a 16-fold speedup over execution without vectorization.
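As a rough illustration of the idea (plain PyTorch with made-up data, not NEML2 API), the snippet below contrasts an element-by-element loop with a single array-level operation; the backend is free to map the latter onto wide SIMD/SIMT instructions:

    import torch

    x = torch.rand(10_000, dtype=torch.float64)

    # Scalar approach: one Python-level operation per element
    y_loop = torch.empty_like(x)
    for i in range(x.numel()):
        y_loop[i] = 2.0 * x[i] + 1.0

    # Vectorized approach: one array-level operation over all elements
    y_vec = 2.0 * x + 1.0

    assert torch.allclose(y_loop, y_vec)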

Although the implementations differ, both CPUs and CUDA devices support vectorization. NEML2 supports vectorized model evaluation on both CPU and CUDA. All NEML2 models ensure that the data being processed (e.g., variables, parameters, and buffers) are allocated contiguously, and that the same set of operations is applied to those contiguous chunks of data.
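The notion of contiguous allocation can be sketched in plain torch (an illustration of the storage concept, not NEML2 internals):

    import torch

    x = torch.rand(4, 6, dtype=torch.float64)
    assert x.is_contiguous()       # one dense allocation, unit-stride access

    xt = x.t()                     # a transposed view is no longer contiguous
    assert not xt.is_contiguous()
    xt = xt.contiguous()           # copy into a fresh contiguous buffer
    assert xt.is_contiguous()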

For loop versus vectorization: A numerical experiment

Given the material model we have been working with in the previous tutorials, how much (wall) time would it take to perform 100,000 forward evaluations?

There are two ways of completing this task:

  • using a for loop, or
  • using vectorization

For loop

The following code snippet shows how to use a for loop to perform the N evaluations. For convenience, the N strains are evenly spaced.

  • #include <torch/cuda.h>

    #include "neml2/models/Model.h"
    #include "neml2/tensors/SR2.h"

    int
    main()
    {
      using namespace neml2;

      // Preparation
      Size N = 10;
      auto device = kCPU;
      auto & model = load_model("input.i", "my_model");
      model.to(device);

      // Create the strain on the device
      auto strain_name = VariableName("forces", "E");
      auto strain_min = SR2::fill(0.1, 0.05, -0.03, 0.02, 0.06, 0.03, device);
      auto strain_max = SR2::fill(0.5, 0.4, -0.2, 0.2, 0.3, 0.1, device);
      auto strain = SR2::linspace(strain_min, strain_max, N);

      // Evaluate the model N times
      for (Size i = 0; i < N; i++)
      {
        auto output = model.value({{strain_name, strain.batch_index({i})}});
      }
    }
  • import torch

    import neml2
    from neml2.tensors import SR2

    torch.set_default_dtype(torch.double)

    # Preparation
    N = 10
    device = torch.device("cpu")
    model = neml2.load_model("input.i", "my_model")
    model.to(device=device)

    # Create the strain on the device
    strain_min = SR2.fill(0.1, 0.05, -0.03, 0.02, 0.06, 0.03, device=device)
    strain_max = SR2.fill(0.5, 0.4, -0.2, 0.2, 0.3, 0.1, device=device)
    strain = SR2.linspace(strain_min, strain_max, N)

    # Evaluate the model N times
    for i in range(N):
        output = model.value({"forces/E": strain.batch[i]})
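In both snippets the loop extracts one batch entry per iteration, via strain.batch_index({i}) in C++ and strain.batch[i] in Python. As a rough torch analogy (an illustration with made-up data, not the NEML2 implementation), this amounts to slicing the leading batch dimension:

    import torch

    strain = torch.rand(10, 6, dtype=torch.float64)  # shape (N; 6) with N = 10

    # Rough analogue of strain.batch[3]: select the 4th entry along the
    # leading (batch) dimension, leaving the 6 base components intact
    strain_3 = strain[3]
    assert strain_3.shape == (6,)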

Vectorization

The following code snippet shows how to rely on NEML2 internal vectorization to perform the N evaluations.

  • #include <torch/cuda.h>

    #include "neml2/models/Model.h"
    #include "neml2/tensors/SR2.h"

    int
    main()
    {
      using namespace neml2;

      // Preparation
      Size N = 10;
      auto device = kCPU;
      auto & model = load_model("input.i", "my_model");
      model.to(device);

      // Create the strain on the device
      auto strain_name = VariableName("forces", "E");
      auto strain_min = SR2::fill(0.1, 0.05, -0.03, 0.02, 0.06, 0.03, device);
      auto strain_max = SR2::fill(0.5, 0.4, -0.2, 0.2, 0.3, 0.1, device);
      auto strain = SR2::linspace(strain_min, strain_max, N);

      // Evaluate the model once on all N strains
      auto output = model.value({{strain_name, strain}});
    }
  • import torch

    import neml2
    from neml2.tensors import SR2

    torch.set_default_dtype(torch.double)

    # Preparation
    N = 10
    device = torch.device("cpu")
    model = neml2.load_model("input.i", "my_model")
    model.to(device=device)

    # Create the strain on the device
    strain_min = SR2.fill(0.1, 0.05, -0.03, 0.02, 0.06, 0.03, device=device)
    strain_max = SR2.fill(0.5, 0.4, -0.2, 0.2, 0.3, 0.1, device=device)
    strain = SR2.linspace(strain_min, strain_max, N)

    # Evaluate the model once on all N strains
    output = model.value({"forces/E": strain})

For a fair comparison, the input variables are allocated prior to model evaluation, and only the model evaluation itself is timed. The timing study was performed with N = {10, 1000, 100000} on both CPU and CUDA. Results (wall time in seconds) are summarized in the following table.

Device   N        For loop   Vectorization
CPU      10       0.061      0.003
CPU      1000     0.232      0.004
CPU      100000   16.782     0.041
CUDA     10       0.145      0.004
CUDA     1000     0.337      0.004
CUDA     100000   26.176     0.005
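As an aside, wall times like these could be collected with a harness along the lines sketched below; the explicit synchronization on CUDA is our assumption about how to time asynchronous kernel launches fairly, not code taken from this tutorial:

    import time
    import torch

    def time_it(fn, device):
        """Wall-clock fn(), synchronizing around asynchronous CUDA work."""
        if device.type == "cuda":
            torch.cuda.synchronize()  # drain pending kernels before starting the clock
        start = time.perf_counter()
        fn()
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for the evaluation to actually finish
        return time.perf_counter() - start

    # e.g. elapsed = time_it(lambda: model.value({"forces/E": strain}), device)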

Remarks

While many other factors contribute to the performance difference between the for-loop approach and the vectorized approach, we hope this simple experiment serves as a reasonable introduction to the benefits of vectorization.

It is worth noting that, in the above code snippets,

  • The pre-allocated strain tensor has a shape of (N; 6). The leading dimension with size N is referred to as a batch dimension in NEML2.
  • The strain tensor with shape (N; 6) can operate together with the scalar-valued model parameters K and G. This behavior relies on the concept of broadcasting, sketched in the snippet after this list.
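The broadcasting behavior can be sketched in plain torch (the parameter value below is made up for illustration; the same concept underlies the NEML2 behavior described above):

    import torch

    strain = torch.rand(100_000, 6, dtype=torch.float64)  # batched strains, shape (N; 6)
    K = torch.tensor(1.4e5, dtype=torch.float64)          # a scalar-valued parameter

    # The 0-dimensional scalar broadcasts against every batch entry,
    # so K need not be replicated N times
    scaled = K * strain
    assert scaled.shape == strain.shape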

Batching and broadcasting are the fundamental mechanisms that allow NEML2 to efficiently vectorize model evaluation while providing users with a unified API for both CPU and CUDA. More details on these mechanisms are discussed in the tutorials on NEML2 tensors.