Boosting Machine Learning Performance With Rust
The Forward Pass
Error Calculation
The Backward Pass
The Training Loop
Final Helper Functions
Results and Opinions

Photo by Chris Liverani on Unsplash

In this article, I want to share my experience of building a small Machine Learning (ML) framework from scratch in Rust.

For my experiment, I had the following objectives in mind:

  1. I wanted to analyze whether, rather than using Python + PyTorch, shifting to Rust + LibTorch (the C++ backend library of PyTorch) would translate into tangible speed improvements, especially during the model training process. As we know, ML models are getting larger and hence require increasing (sometimes unfeasible for the average person) computational power to train. One way to mitigate the growing hardware requirements is to find ways to make the algorithms more computationally efficient. Knowing that within PyTorch, Python mostly acts as a layer on top of LibTorch, my big question was whether replacing that top Python layer with Rust is worth the effort. The plan was to use the Tch-rs Rust crate only to expose the Tensors and Autograd functionality of the LibTorch DLL, hence acting as our “gradients calculator”, and then develop the rest from scratch in Rust.
  2. I wanted to keep the code simple enough to allow a clear understanding of all the linear algebra being performed, and to let me easily extend it if required.
  3. As much as possible, my framework had to allow me to define ML models following a structure similar to standard Python/PyTorch.
  4. Ummm … and for “rusty” fun and learning 🙂

The post is not intended to teach Rust per se, but rather to provide an appreciation of how Rust can be applied to ML and of the benefits it brings.

Jumping straight to the end result, my little framework allows me to create Neural Network models as per below:

Listing 1 — Defining my Neural Network model.

struct MyModel {
    l1: Linear,
    l2: Linear,
}

impl MyModel {
    fn new(mem: &mut Memory) -> MyModel {
        let l1 = Linear::new(mem, 784, 128);
        let l2 = Linear::new(mem, 128, 10);
        Self { l1, l2 }
    }
}

impl Compute for MyModel {
    fn forward(&self, mem: &Memory, input: &Tensor) -> Tensor {
        // Two linear layers with a ReLU in between; raw logits are returned.
        let mut o = self.l1.forward(mem, input);
        o = o.relu();
        o = self.l2.forward(mem, &o);
        o
    }
}

… and then instantiate and train the model like this:

Listing 2 — Instantiating and training my Neural Network model.

fn main() {
    let (x, y) = load_mnist();

    let mut m = Memory::new();
    let mymodel = MyModel::new(&mut m);
    train(&mut m, &x, &y, &mymodel, 100, 128, cross_entropy, 0.3);
    let out = mymodel.forward(&m, &x);
    println!("Training Accuracy: {}", accuracy(&y, &out));
}

For PyTorch users, the above bears quite an intuitive similarity to how one would define and train a Neural Network in Python. The example shows a Neural Network model which is then used for classification (the model is applied to the MNIST dataset, which I will be using as my benchmark dataset to compare the Rust and Python versions).

In the first block of the code, a MyModel struct is created, which holds two layers of type Linear.

The second block is the MyModel struct implementation, which defines an associated function new. This function initializes the two layers and returns a new instance of the struct.

Finally, the third block implements the Compute trait for MyModel, which defines the forward method. In the main function I then load the MNIST dataset, initialize the memory, instantiate MyModel, and then train it using 100 epochs, a batch size of 128, Cross Entropy Loss, and a learning rate of 0.3.

Pretty intuitive, huh? That is all that is required to create and train new models in Rust using my little framework. Now, let's look a bit under the hood to see what makes the above possible.

Looking at the above code, an obvious question might pop up if you are used to building ML models in PyTorch: what is the Memory reference doing? I explain below.

The Forward Pass

From the ML literature, we know that Neural Network training works by iteratively going through two steps for a number of epochs (and usually also for a number of batches): a forward pass and a backward pass (backpropagation).

In the forward pass, we push the inputs and the subsequent calculations through all the layers of the network, where for each layer we have:

$\mathbf{a} = \sigma(\mathbf{X}\mathbf{W} + \mathbf{b})$

Equation 1 — Linear and activation functions taking place in each Neural Network layer (Goodfellow et al., 2016)

where $\mathbf{W}$ provides the weights for the linear function, $\mathbf{b}$ the biases, and the result is then passed through an activation function $\sigma$, such as Sigmoid, providing the non-linearity.

With that information, we can now create our Linear layer (Listing 3 below). As you can notice, the structure for defining a layer follows the same structure as for defining our model (Listing 1 above) and implements the same functions and traits.

In the case of the Linear layer, the struct contains a field named params. The params field is a collection of type HashMap, where the key is of type String and stores a parameter name, and the value is of type usize and holds the location of that parameter (which is a PyTorch tensor) in our Memory, which in turn acts as the store for all our parameters.

Listing 3 — Defining a Neural Network Layer, in this case, a Linear Layer.

trait Compute {
    fn forward(&self, mem: &Memory, input: &Tensor) -> Tensor;
}

struct Linear {
    params: HashMap<String, usize>,
}

impl Linear {
    fn new(mem: &mut Memory, ninputs: i64, noutputs: i64) -> Self {
        let mut p = HashMap::new();
        // Create the weight and bias tensors in the store; both require gradients.
        p.insert("W".to_string(), mem.new_push(&[ninputs, noutputs], true));
        p.insert("b".to_string(), mem.new_push(&[1, noutputs], true));

        Self { params: p }
    }
}

impl Compute for Linear {
    fn forward(&self, mem: &Memory, input: &Tensor) -> Tensor {
        // Fetch the weight and bias tensors from the store by their addresses.
        let w = mem.get(self.params.get(&"W".to_string()).unwrap());
        let b = mem.get(self.params.get(&"b".to_string()).unwrap());
        input.matmul(w) + b
    }
}

In line with Equation 1, in our associated function new, we insert into our HashMap the two parameters “W” and “b” which are required for the Linear Layer.

The mem.new_push() method, presented later, creates the respective tensors in the required sizes, pushes them to the memory store, and returns their location. The boolean parameter passed to new_push specifies that we need to calculate the gradient for these parameters. In this way, each layer contains the parameter names and their respective tensor store locations in our Memory structure.

Similar to the MyModel definition, we then implement the Compute trait for our Linear Layer. This requires defining the function forward, which is called during the forward pass of the training process.

In this function, we first obtain a reference to the two tensor parameters from our tensor store using the get method and then calculate our linear function (Equation 1). As in PyTorch, our Neural Network outputs the unnormalized predictions (logits), and the normalization (in this case Softmax) is performed later, during the error calculation.

One might ask: why take this approach to represent a Linear Layer rather than hard-coding Equation 1 directly in one or two lines of code?

This approach was taken so that if additional Neural Network layer types need to be defined, e.g. a CNN or an LSTM layer, it is just a matter of copying the Linear Layer structure above, injecting the additional parameters and computations into the associated function new and the forward method, and the new layer immediately becomes available to include in your models (as per Listing 1), as sketched below.
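As an illustration, here is a hedged sketch of a hypothetical LinearRelu layer (my own example, not part of the original code) that bakes a ReLU activation into the linear computation. It reuses the Memory, Compute, and new_push pieces defined in this article; only the extra .relu() call differs from the Linear layer in Listing 3.

struct LinearRelu {
    params: HashMap<String, usize>,
}

impl LinearRelu {
    fn new(mem: &mut Memory, ninputs: i64, noutputs: i64) -> Self {
        let mut p = HashMap::new();
        // Same two parameters as Linear; a richer layer type would push
        // additional tensors here.
        p.insert("W".to_string(), mem.new_push(&[ninputs, noutputs], true));
        p.insert("b".to_string(), mem.new_push(&[1, noutputs], true));
        Self { params: p }
    }
}

impl Compute for LinearRelu {
    fn forward(&self, mem: &Memory, input: &Tensor) -> Tensor {
        let w = mem.get(self.params.get(&"W".to_string()).unwrap());
        let b = mem.get(self.params.get(&"b".to_string()).unwrap());
        // The injected extra computation: apply ReLU to the linear output.
        (input.matmul(w) + b).relu()
    }
}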

In addition, this approach of pushing all tensors into a central store will come in handy in the backpropagation step, as I discuss below.

Error Calculation

At the end of the forward pass, we need to calculate the error between our predictions and the targets.

Below is the code for the mean squared error, which is typically used for regression, and the cross-entropy loss, which is typically used for classification.

Listing 4 — Mean Squared Error and Cross Entropy Loss Functions.

fn mse(target: &Tensor, pred: &Tensor) -> Tensor {
    (target - pred).square().mean(Kind::Float)
}

fn cross_entropy(target: &Tensor, pred: &Tensor) -> Tensor {
    // Softmax normalization of the logits followed by negative log-likelihood.
    let loss = pred.log_softmax(-1, Kind::Float).nll_loss(target);
    loss
}

And that completes the forward pass … now we kick off with the backward pass.

The Backward Pass

In the backward pass, we need to update the parameters of the model using the gradients, where each gradient is the derivative of the loss function with respect to the respective parameter. In the first step, we obtain the gradients:

$\hat{\mathbf{g}} = \frac{1}{m'} \nabla_{\boldsymbol{\theta}} \sum_{i=1}^{m'} L\left(f(\mathbf{x}^{(i)}; \boldsymbol{\theta}), \mathbf{y}^{(i)}\right)$

Equation 2 — Derivative of the loss function with respect to our model parameters (Goodfellow et al., 2016)

where $m'$ represents the size of the minibatch. For each minibatch, the parameters are then updated as follows:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \hat{\mathbf{g}}$

Equation 3 — Parameter update rule using the gradient (Goodfellow et al., 2016)

where $\epsilon$ is the learning rate.

This is where I use the Autograd functionality of LibTorch to obtain the gradients. In PyTorch, we normally call the backward method on the loss to calculate the derivatives, followed by calling the optimizer's step function to apply the gradients to the model parameters. The same process happens here, with the difference that we cannot call a step function directly to apply the gradients, because we are not extending our models from the nn.Module class and are not using PyTorch optimizers as we normally do in Python. Hence, we need to cater for the step part ourselves, as the sketch below illustrates.
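To make this division of labour concrete, here is a minimal standalone sketch (my own illustration, not part of the framework) showing what LibTorch's Autograd gives us for free and which part we must handle ourselves:

use tch::{Device, Kind, Tensor};

fn main() {
    // A single trainable parameter tensor.
    let mut w = Tensor::rand(&[3, 1], (Kind::Float, Device::Cpu)).requires_grad_(true);
    let x = Tensor::rand(&[4, 3], (Kind::Float, Device::Cpu));
    let y = Tensor::rand(&[4, 1], (Kind::Float, Device::Cpu));

    // Forward pass and loss.
    let loss = (x.matmul(&w) - y).square().mean(Kind::Float);

    // What LibTorch does for us: populate the gradient of w.
    loss.backward();

    // What we must do ourselves (the "step"): the update rule of Equation 3.
    let learning_rate: f32 = 0.3;
    let g = w.grad();
    w.set_data(&(w.data() - learning_rate * &g));
    w.zero_grad();
}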

In the snippet below (Listing 5) we show our tensor Memory implementation, which also caters for the gradient step functionality. The tensor store is implemented as a struct with two fields: size, which holds the current number of tensors stored, and values, which is a vector of tensors. In the implementation block, the new method handles the store initialization, and the push, new_push, and get methods handle the passing back and forth of the tensors (the latter two we used in the Linear Layer above).

Listing 5 — The tensor store — Memory.

struct Memory {
    size: usize,
    values: Vec<Tensor>,
}

impl Memory {

    fn new() -> Self {
        let v = Vec::new();
        Self { size: 0, values: v }
    }

    fn push(&mut self, value: Tensor) -> usize {
        // Store the tensor and return its address (index) in the store.
        self.values.push(value);
        self.size += 1;
        self.size - 1
    }

    fn new_push(&mut self, size: &[i64], requires_grad: bool) -> usize {
        let t = Tensor::rand(size, (Kind::Float, Device::Cpu)).requires_grad_(requires_grad);
        self.push(t)
    }

    fn get(&self, addr: &usize) -> &Tensor {
        &self.values[*addr]
    }

    fn apply_grads_sgd(&mut self, learning_rate: f32) {
        let mut g = Tensor::new();
        self.values
            .iter_mut()
            .for_each(|t| {
                g = t.grad();
                t.set_data(&(t.data() - learning_rate * &g));
                t.zero_grad();
            });
    }

    fn apply_grads_sgd_momentum(&mut self, learning_rate: f32) {
        let mut g: Tensor = Tensor::new();
        // Note: velocity is re-created on every call, so it does not persist
        // across batches.
        let mut velocity: Vec<Tensor> = Vec::new();
        (0..self.size).for_each(|_| velocity.push(Tensor::from(0.0)));
        let mut vcounter = 0;
        const BETA: f32 = 0.9;

        self.values
            .iter_mut()
            .for_each(|t| {
                g = t.grad();
                velocity[vcounter] = BETA * &velocity[vcounter] + (1.0 - BETA) * &g;
                t.set_data(&(t.data() - learning_rate * &velocity[vcounter]));
                t.zero_grad();
                vcounter += 1;
            });
    }

}

The last two methods in the code above implement the basic gradient descent and the gradient descent with momentum algorithms. These methods assume that the backward step, which generates the gradients, has already been called; here we handle what in PyTorch would be the step function call.

The process involves looping through each tensor in the store, obtaining the calculated gradient using the grad method, and then applying the parameter update rule via the set_data method. One can easily introduce further methods implementing other algorithms such as RMSProp and Adam, as sketched below.
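For example, an RMSProp-style update could be added as another method on Memory, following exactly the same pattern (my own sketch; note that, like the momentum method above, the cache is re-initialized on every call):

// Hypothetical RMSProp-style variant, mirroring apply_grads_sgd_momentum.
fn apply_grads_rmsprop(&mut self, learning_rate: f32) {
    let mut cache: Vec<Tensor> = Vec::new();
    (0..self.size).for_each(|_| cache.push(Tensor::from(0.0)));
    let mut ccounter = 0;
    const DECAY: f32 = 0.9;
    const EPS: f32 = 1e-8;

    self.values
        .iter_mut()
        .for_each(|t| {
            let g = t.grad();
            // cache = decay * cache + (1 - decay) * g^2
            let g2 = &g * &g;
            cache[ccounter] = DECAY * &cache[ccounter] + (1.0 - DECAY) * &g2;
            // theta <- theta - lr * g / (sqrt(cache) + eps)
            let denom = cache[ccounter].sqrt() + EPS;
            t.set_data(&(t.data() - learning_rate * &g / denom));
            t.zero_grad();
            ccounter += 1;
        });
}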

The Training Loop

In the training loop, we bring together everything discussed earlier for our learning process. As usual, we loop over the epochs; within each epoch we loop over the minibatches, and for each minibatch we do a forward pass, calculate the error, call the backward method on the error to generate the gradients, and then apply the gradients (Listing 6).

Listing 6 — The training loop.

fn train<F>(mem: &mut Memory, x: &Tensor, y: &Tensor, model: &dyn Compute, epochs: i64, batch_size: i64, errfunc: F, learning_rate: f32)
    where F: Fn(&Tensor, &Tensor) -> Tensor
{
    let mut error = Tensor::from(0.0);
    let mut batch_error = Tensor::from(0.0);
    let mut pred = Tensor::from(0.0);
    for epoch in 0..epochs {
        batch_error = Tensor::from(0.0);
        for (batchx, batchy) in get_batches(&x, &y, batch_size, true) {
            pred = model.forward(mem, &batchx);
            error = errfunc(&batchy, &pred);
            batch_error += error.detach();
            error.backward();
            mem.apply_grads_sgd_momentum(learning_rate);
        }
        println!("Epoch: {:?} Error: {:?}", epoch, batch_error / batch_size);
    }
}

While in PyTorch we have the Dataset and DataLoader classes, which handle the data mini-batching mechanism, in my case I built my own batching mechanism.

The Rust function below (Listing 7) accepts a reference to the full dataset and returns an iterator, which allows the training function (Listing 6) to iterate over the mini-batches.

Listing 7 — Mini-batching.

fn get_batches(x: &Tensor, y: &Tensor, batch_size: i64, shuffle: bool) -> impl Iterator<Item = (Tensor, Tensor)> {
    let num_rows = x.size()[0];
    let num_batches = (num_rows + batch_size - 1) / batch_size;

    // Shuffle by generating a random permutation of the row indices.
    let indices = if shuffle {
        Tensor::randperm(num_rows as i64, (Kind::Int64, Device::Cpu))
    } else {
        let rng = (0..num_rows).collect::<Vec<i64>>();
        Tensor::of_slice(&rng)
    };
    let x = x.index_select(0, &indices);
    let y = y.index_select(0, &indices);

    (0..num_batches).map(move |i| {
        let start = i * batch_size;
        let end = (start + batch_size).min(num_rows);
        // narrow takes (dim, start, length); the last batch may be shorter.
        let batchx: Tensor = x.narrow(0, start, end - start);
        let batchy: Tensor = y.narrow(0, start, end - start);
        (batchx, batchy)
    })
}
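A quick sanity check of the iterator on toy tensors (a sketch); note that the last batch is shorter when the dataset size is not a multiple of batch_size:

let x = Tensor::rand(&[10, 4], (Kind::Float, Device::Cpu));
let y = Tensor::rand(&[10, 1], (Kind::Float, Device::Cpu));
for (batchx, batchy) in get_batches(&x, &y, 4, false) {
    // Prints [4, 4] [4, 1] twice, then [2, 4] [2, 1] for the final batch.
    println!("{:?} {:?}", batchx.size(), batchy.size());
}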

Final Helper Functions

The last two functions that you need to run the full code are two simple helper functions (Listing 8).

The first function loads the dataset from a directory that I named data (you will have to download the MNIST dataset first).

The second function calculates the accuracy of the model, accepting as parameters a reference to the targets and to the predictions.

Listing 8 — Last two helper functions.


fn load_mnist() -> (Tensor, Tensor) {
    let m = vision::mnist::load_dir("data").unwrap();
    let x = m.train_images;
    let y = m.train_labels;
    (x, y)
}

fn accuracy(target: &Tensor, pred: &Tensor) -> f64 {
    // Predicted class = index of the largest logit per row.
    let yhat = pred.argmax(1, true).squeeze();
    let eq = target.eq_tensor(&yhat);
    // Fraction of predictions that match the targets.
    let accuracy: f64 = (eq.sum(Kind::Int64) / target.size()[0]).double_value(&[]).into();
    accuracy
}

The only imports that you need are:

Listing 9 — Required imports.

use std::{collections::HashMap};
use tch::{Tensor, Kind, Device, vision, Scalar};

Before running the code, you also need to add the tch crate as a dependency in your Cargo.toml and download the LibTorch C++ library from the PyTorch website.

Results and Opinions

To compare the above code with a Python-PyTorch equivalent, I tried to be as faithful as possible to get a fair comparison, mainly ensuring that I applied the same Neural Network hyper-parameters, training parameters, and training algorithms.

For my tests, I used the MNIST dataset, which consists of 60K training examples with 28×28 features. I ran the tests on my laptop, a Surface Pro 8, i7, with 16 GB of RAM, hence no GPU. After running the tests multiple times, on average the Rust training was 5.5 times faster than the Python equivalent. Unfortunately, at this point I did not pinpoint which areas of the training process generated the biggest gains in performance over a standard PyTorch approach (is it in the training loops? in the error calculations? in the gradient step?), hence the gain I mention is an overall gain over the whole process. A simple way to start investigating would be to time the individual steps, as sketched below.
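As a starting point for such profiling, one could wrap individual steps of the training loop with a small helper like the hypothetical timed function below (my own sketch; it only needs std::time::Instant):

use std::time::Instant;

// Runs a closure, prints the elapsed time with a label, and returns the
// closure's result, so that individual training steps can be wrapped.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{}: {:?}", label, start.elapsed());
    result
}

// Example usage inside the training loop of Listing 6:
// let pred = timed("forward", || model.forward(mem, &batchx));
// let error = timed("loss", || errfunc(&batchy, &pred));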

As a concluding thought, developing the above in Rust definitely takes more time initially, especially for somebody like me who still considers himself a newbie in Rust. However, once you build all your library components and pipeline code and just need to test or create new models (like in Listing 1), then in my opinion it becomes as easy as working in Python.

The improvements in training speed that I experienced are, in my opinion, not to be ignored: they could literally save long hours, if not days, of training, especially with the increasing complexity of ML models, larger datasets, or huge iterative learning processes like in Reinforcement Learning.

Hope you found the article worth the read!

References

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org
