Hi all, here’s the fourth in my series on neural networks / machine learning / AI from scratch. In the previous articles (please read them first!), I explained how a single neuron works, then how to calculate the gradient of its weights and bias, and how you can use that gradient to train the neuron. In this article, I’ll explain how to determine the gradients when you have many layers of many neurons, and how to use those gradients to train the neural net.
In my previous articles in this series, I used spreadsheets to make the maths easier to follow. Unfortunately I don’t think I’ll be able to demonstrate this topic in a spreadsheet; it’d get out of hand, so I’ll keep it in code. I hope you can still follow along!
Pardon my pseudocode:
class Net {
layers: [Layer]
}
class Layer {
neurons: [Neuron]
}
class Neuron {
value: float // The neuron's output, filled in by the forward pass.
bias: float
weights: [float] // One per input, or one per neuron in the previous layer.
activation_gradient: float // Filled in by the backward pass.
}
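To make those shapes concrete, here’s a rough Rust sketch of how you might build such a net given a list of layer sizes. The build_net helper and its random initialisation are my own illustration (they’re not part of the demo further down), and it assumes the rand crate with a 0.8-era API:

use rand::Rng; // Brings gen_range into scope.

// Same structs as the pseudocode above, in Rust.
struct Net { layers: Vec<Layer> }
struct Layer { neurons: Vec<Neuron> }
struct Neuron {
    value: f64,
    bias: f64,
    weights: Vec<f64>,
    activation_gradient: f64,
}

// Hypothetical helper: layer_sizes[0] is the number of inputs,
// and the remaining entries are the neuron counts of each layer.
fn build_net(layer_sizes: &[usize]) -> Net {
    let mut rng = rand::thread_rng();
    let mut layers = Vec::new();
    for pair in layer_sizes.windows(2) {
        let (input_count, neuron_count) = (pair[0], pair[1]);
        let mut neurons = Vec::new();
        for _ in 0..neuron_count {
            let mut weights = Vec::new();
            for _ in 0..input_count {
                weights.push(rng.gen_range(-1. .. 1.));
            }
            neurons.push(Neuron {
                value: 0.,
                bias: rng.gen_range(-1. .. 1.),
                weights,
                activation_gradient: 0.,
            });
        }
        layers.push(Layer { neurons });
    }
    Net { layers }
}

fn main() {
    let net = build_net(&[2, 3, 3, 2]); // The same 3,3,2 shape (with 2 inputs) as the demo below.
    println!("Layers: {}", net.layers.len());
}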
Explanation:
What we’re trying to achieve here is to use calculus to determine the ‘gradient’ of every bias and every weight in this neural net. In order to do this, we have to ‘back propagate’ these gradients from the back to the front of the ‘layers’ array.
Concretely - if, say, we had 3 layers: we’d figure out the gradients of the activation functions of layers[2], then use those values to calculate the gradients of layers[1], and then layers[0].
Once we have the gradients of the activation functions for each neuron in each layer, it’s easy to figure out the gradient of the weights and bias for each neuron: each weight’s gradient is the neuron’s activation gradient multiplied by the value feeding into that weight, and the bias’s gradient is simply the activation gradient itself.
And, as demonstrated in my previous article, once we have the gradients, we can ‘nudge’ the weights and biases in the direction their gradients indicate, thus training the neural net.
Training and determining the gradients go hand-in-hand, as you need the inputs to calculate the values of each neuron in the net, and you need the targets (aka desired outputs) to determine the gradients. Thus it’s a three-step process: a forward pass, a backward pass, and a training pass.
The forward pass fills in the ‘value’ fields.
Forward pass pseudocode:
for layer in layers, first to last {
if this is the first layer {
for neuron in layer.neurons {
total = neuron.bias
for weight in neuron.weights {
total += weight * inputs[weight_index]
}
neuron.value = tanh(total)
}
} else {
previous_layer = layers[layer_index - 1]
for neuron in layer.neurons {
total = neuron.bias
for weight in neuron.weights {
total += weight * previous_layer.neurons[weight_index].value
}
neuron.value = tanh(total)
}
}
}
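As a quick sanity check of the forward pass, here’s a tiny self-contained example for a single neuron with two weights; the bias, weights and inputs are made-up numbers of my own, not values from the demo below:

fn main() {
    let bias = 0.1;
    let weights = [0.2, 0.3];
    let inputs = [0.5, 0.4];

    // total = bias + the sum of (weight * input), exactly as in the pseudocode above.
    let mut total = bias;
    for (weight, input) in weights.iter().zip(inputs.iter()) {
        total += weight * input;
    }
    let value: f64 = total.tanh();

    // total is 0.1 + 0.2*0.5 + 0.3*0.4 = 0.32, so value is tanh(0.32), roughly 0.31.
    println!("total = {:.3}, value = {:.3}", total, value);
}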
The backward pass fills in the ‘activation_gradient’ fields. The bits that look like (1 - value^2) * ... are the calculus for determining the gradients: each neuron’s value is tanh(total), and the derivative of tanh is 1 - tanh^2, so (1 - value^2) is the gradient of the activation function with respect to its total.
Backward pass pseudocode:
for layer in reversed layers, last to first {
if this is the last layer {
for neuron in layer.neurons {
neuron.activation_gradient =
(1 - neuron.value^2) *
(neuron.value - targets[neuron_index])
}
} else {
next_layer = layers[layer_index + 1]
for this_layer_neuron in layer.neurons {
next_layer_gradient_sum = 0
for next_layer_neuron in next_layer.neurons {
next_layer_gradient_sum +=
next_layer_neuron.activation_gradient *
next_layer_neuron.weights[this_layer_neuron_index]
}
this_layer_neuron.activation_gradient =
(1 - this_layer_neuron.value^2) *
next_layer_gradient_sum
}
}
}
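If you’d like to convince yourself that the calculus is right, you can compare the analytic gradient against a numeric one: nudge a weight by a tiny amount, see how much the error changes, and divide. Here’s a sketch of that check for a single output neuron with a half-squared-error loss; the numbers and the loss helper are my own illustration, not part of the demo below:

fn main() {
    // A single tanh neuron with made-up numbers, and a target we'd like it to output.
    let bias = 0.1;
    let weights = [0.2, 0.3];
    let inputs = [0.5, 0.4];
    let target = 0.8;

    // Forward pass plus half-squared-error loss, for any given pair of weights.
    let loss = |w: &[f64; 2]| -> f64 {
        let value = (bias + w[0] * inputs[0] + w[1] * inputs[1]).tanh();
        0.5 * (value - target) * (value - target)
    };

    // Analytic gradient of the first weight, as per the backward pass above:
    // (value - target) from the error, (1 - value^2) from the tanh derivative,
    // and inputs[0] because that's what the weight gets multiplied by.
    let value = (bias + weights[0] * inputs[0] + weights[1] * inputs[1]).tanh();
    let activation_gradient = (1. - value * value) * (value - target);
    let analytic = activation_gradient * inputs[0];

    // Numeric gradient: (loss(w + eps) - loss(w - eps)) / (2 * eps).
    let eps = 1e-6;
    let mut plus = weights;
    plus[0] += eps;
    let mut minus = weights;
    minus[0] -= eps;
    let numeric = (loss(&plus) - loss(&minus)) / (2. * eps);

    // The two numbers should agree to several decimal places.
    println!("analytic: {}, numeric: {}", analytic, numeric);
}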
Now that you have the gradients, you can adjust the biases and weights to train the net to perform better.
I’ll skim over this as it’s covered in my earlier articles in this series. The gist of it is that, for each neuron, the gradient is calculated for the bias and every weight, and the bias/weights are adjusted a little to ‘descend the gradient’. Perhaps my pseudocode might make more sense:
learning_rate = 0.01 // Aka 1%
for layer in layers {
if this is the first layer {
for neuron in layer.neurons {
neuron.bias -= neuron.activation_gradient * learning_rate
for weight in neuron.weights {
gradient_for_this_weight = inputs[weight_index] *
neuron.activation_gradient
weight -= gradient_for_this_weight * learning_rate
}
}
} else {
previous_layer = layers[layer_index - 1]
for neuron in layer.neurons {
neuron.bias -= neuron.activation_gradient * learning_rate
for weight in neuron.weights {
gradient_for_this_weight =
previous_layer.neurons[weight_index].value *
neuron.activation_gradient
weight -= gradient_for_this_weight * learning_rate
}
}
}
}
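Before the full demo, here’s the whole three-pass loop on the smallest possible ‘net’: a single tanh neuron, trained to output a fixed target. The inputs, target and learning rate are made-up values for illustration; you should see the value creep toward the target as it runs:

fn main() {
    let mut bias = 0.1;
    let mut weights = [0.2, 0.3];
    let inputs = [0.5, 0.4];
    let target = 0.8;
    let learning_rate = 0.1;

    for iteration in 0..=1000 {
        // Forward pass: fill in the neuron's value.
        let mut total = bias;
        for (weight, input) in weights.iter().zip(inputs.iter()) {
            total += weight * input;
        }
        let value: f64 = total.tanh();

        // Backward pass: the activation gradient, as in the last-layer case above.
        let activation_gradient = (1. - value * value) * (value - target);

        // Training pass: nudge the bias and weights against their gradients.
        bias -= activation_gradient * learning_rate;
        for (weight, input) in weights.iter_mut().zip(inputs.iter()) {
            *weight -= input * activation_gradient * learning_rate;
        }

        if iteration % 200 == 0 {
            println!("iteration {}: value {:.4} (target {})", iteration, value, target);
        }
    }
}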
Because I’m a Rust tragic, here’s a demo. It’s kinda long, sorry, not sorry. It was fun to write :)
This trains a neural network to calculate the area and circumference of a rectangle, given the width and height as inputs.
🦀🦀🦀
use rand::Rng;
struct Net {
layers: Vec<Layer>,
}
struct Layer {
neurons: Vec<Neuron>,
}
struct Neuron {
value: f64,
bias: f64,
weights: Vec<f64>,
activation_gradient: f64
}
const LEARNING_RATE: f64 = 0.001;
fn main() {
let mut rng = rand::thread_rng();
// Make a 3,3,2 neural net that inputs the width and height of a rectangle,
// and outputs the area and circumference.
let mut net = Net {
layers: vec![
Layer { // First layer has 2 weights to suit the 2 inputs.
neurons: vec![
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
],
},
Layer { // Second layer neurons have the same number of weights as the previous layer has neurons.
neurons: vec![
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
],
},
Layer { // Last layer has 2 neurons to suit 2 outputs.
neurons: vec![
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
Neuron {
value: 0.,
bias: rng.gen_range(-1. .. 1.),
weights: vec![
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
rng.gen_range(-1. .. 1.),
],
activation_gradient: 0.,
},
],
},
],
};
// Train.
let mut cumulative_error_counter: i64 = 0; // These vars are for averaging the errors.
let mut area_error_percent_sum: f64 = 0.;
let mut circumference_error_percent_sum: f64 = 0.;
for training_iteration in 0..100_000_000 {
// Inputs:
let width: f64 = rng.gen_range(0.1 .. 1.);
let height: f64 = rng.gen_range(0.1 .. 1.);
let inputs: Vec<f64> = vec![width, height];
// Targets (aka desired outputs):
let area = width * height;
let circumference_scaled = (height * 2. + width * 2.) * 0.25; // Scaled by 0.25 so it'll always be in range 0..1.
let targets: Vec<f64> = vec![area, circumference_scaled];
// Forward pass!
for layer_index in 0..net.layers.len() {
if layer_index == 0 {
let layer = &mut net.layers[layer_index];
for neuron in &mut layer.neurons {
let mut total = neuron.bias;
for (weight_index, weight) in neuron.weights.iter().enumerate() {
total += weight * inputs[weight_index];
}
neuron.value = total.tanh();
}
} else {
// Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
let previous_layer: &Layer;
unsafe { previous_layer = & *net.layers.as_ptr().add(layer_index - 1) }
let layer = &mut net.layers[layer_index];
for neuron in &mut layer.neurons {
let mut total = neuron.bias;
for (weight_index, weight) in neuron.weights.iter().enumerate() {
total += weight * previous_layer.neurons[weight_index].value;
}
neuron.value = total.tanh();
}
}
}
// Let's check the results!
let outputs: Vec<f64> = net.layers.last().unwrap().neurons
.iter().map(|n| n.value).collect();
let area_error_percent = (targets[0] - outputs[0]).abs() / targets[0] * 100.;
let circumference_error_percent = (targets[1] - outputs[1]).abs() / targets[1] * 100.;
area_error_percent_sum += area_error_percent;
circumference_error_percent_sum += circumference_error_percent;
cumulative_error_counter += 1;
if training_iteration % 10_000_000 == 0 {
println!("Iteration {} errors: area {:.3}%, circumference: {:.3}% (smaller = better)",
training_iteration,
area_error_percent_sum / cumulative_error_counter as f64,
circumference_error_percent_sum / cumulative_error_counter as f64);
area_error_percent_sum = 0.;
circumference_error_percent_sum = 0.;
cumulative_error_counter = 0;
}
// Backward pass! (aka backpropagation)
let layers_len = net.layers.len();
for layer_index in (0..layers_len).rev() { // Reverse the order.
if layer_index == layers_len - 1 { // Last layer.
let layer = &mut net.layers[layer_index];
for (neuron_index, neuron) in layer.neurons.iter_mut().enumerate() {
neuron.activation_gradient =
(1. - neuron.value * neuron.value) *
(neuron.value - targets[neuron_index]);
}
} else {
// Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
let next_layer: &Layer;
unsafe { next_layer = & *net.layers.as_ptr().add(layer_index + 1) }
let layer = &mut net.layers[layer_index];
for (this_layer_neuron_index, this_layer_neuron) in layer.neurons.iter_mut().enumerate() {
let mut next_layer_gradient_sum: f64 = 0.;
for next_layer_neuron in &next_layer.neurons {
next_layer_gradient_sum +=
next_layer_neuron.activation_gradient *
next_layer_neuron.weights[this_layer_neuron_index];
}
this_layer_neuron.activation_gradient =
(1. - this_layer_neuron.value * this_layer_neuron.value) *
next_layer_gradient_sum;
}
}
}
// Training pass!
for layer_index in 0..net.layers.len() {
if layer_index == 0 {
let layer = &mut net.layers[layer_index];
for neuron in &mut layer.neurons {
neuron.bias -= neuron.activation_gradient * LEARNING_RATE;
for (weight_index, weight) in neuron.weights.iter_mut().enumerate() {
let gradient_for_this_weight =
inputs[weight_index] *
neuron.activation_gradient;
*weight -= gradient_for_this_weight * LEARNING_RATE
}
}
} else {
// Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
let previous_layer: &Layer;
unsafe { previous_layer = & *net.layers.as_ptr().add(layer_index - 1) }
let layer = &mut net.layers[layer_index];
for neuron in &mut layer.neurons {
neuron.bias -= neuron.activation_gradient * LEARNING_RATE;
for (weight_index, weight) in neuron.weights.iter_mut().enumerate() {
let gradient_for_this_weight =
previous_layer.neurons[weight_index].value *
neuron.activation_gradient;
*weight -= gradient_for_this_weight * LEARNING_RATE;
}
}
}
}
}
}
Which outputs:
Iteration 0 errors: area 223.106%, circumference: 13.175% (smaller = better)
Iteration 10000000 errors: area 17.861%, circumference: 1.123% (smaller = better)
Iteration 20000000 errors: area 14.656%, circumference: 0.790% (smaller = better)
Iteration 30000000 errors: area 14.516%, circumference: 0.698% (smaller = better)
Iteration 40000000 errors: area 6.359%, circumference: 0.882% (smaller = better)
Iteration 50000000 errors: area 2.966%, circumference: 0.875% (smaller = better)
Iteration 60000000 errors: area 2.769%, circumference: 0.807% (smaller = better)
Iteration 70000000 errors: area 2.600%, circumference: 0.698% (smaller = better)
Iteration 80000000 errors: area 2.401%, circumference: 0.573% (smaller = better)
Iteration 90000000 errors: area 2.166%, circumference: 0.468% (smaller = better)
You can see the error percentages drop as the net ‘learns’ to calculate the area and circumference of a rectangle. Magic!
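A couple of closing notes on the demo. To run it yourself, the only dependency is the rand crate; I’m assuming a 0.8-era rand (which is where thread_rng and gen_range, as used above, come from), so adding rand = "0.8" under [dependencies] in Cargo.toml should be all you need. Also, the unsafe pointer workaround for borrowing two layers at once can be avoided with the standard library’s split_at_mut if you prefer; here’s a small illustration of the idea, using a Vec of floats as a stand-in for the layers:

fn main() {
    // split_at_mut divides a Vec's contents into two non-overlapping mutable slices,
    // so we can read one element while mutating another without the borrow checker objecting.
    let mut layers = vec![1.0, 2.0, 3.0]; // Stand-ins for the Layer structs.
    let layer_index = 1;
    let (earlier, current_onwards) = layers.split_at_mut(layer_index);
    let previous_layer = &earlier[layer_index - 1]; // Read-only view of layers[0].
    let current_layer = &mut current_onwards[0];    // Mutable view of layers[1].
    *current_layer += *previous_layer;
    println!("{:?}", layers); // Prints [1.0, 3.0, 3.0].
}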
Thanks for reading, hope you found this helpful, at least a tiny bit, God bless!