Hi all, here’s the second in my series on neural networks / machine learning / AI from scratch. In the previous article (please read it first!), I explained how a single neuron works. In this article, I’ll explain how you can determine the ‘gradients’ of that neuron, in other words how much effect the weight and bias have on the final ‘loss’, using some high-school calculus. This is a prerequisite for training, which I’ll cover later.
I recommend opening this spreadsheet in a separate tab and referring to it as you read this post, which explains the maths: Single neuron gradients.
In case the linked spreadsheet is lost to posterity, here it is in slightly less well-formatted form (note: for brevity’s sake, I’ve shortened references such as B2 to simply ‘B’ when referring to a column in the same row):
| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | Input | Weight | Bias | Net | Output | Target | Error | Loss | |
| 2 | Neuron maths: | 0.4 | 0.5 | 0.6 | 0.8 (B*C+D) | 0.664 (tanh(E)) | 0.7 | -0.035963 (F-G) | 0.0006467 (H^2 / 2) | |
| 3 | Real local gradients: | 0.5 (C2) | 0.4 (B2) | 1 | 0.5591 (1-F2^2) | -0.036 (H2) | | | | |
| 4 | Real global gradients: | -0.0101 (B3*E) | -0.0080 (C3*E) | -0.0201 (E) | -0.0201 (E3*F) | -0.036 (F3) | | | | |
| 5 | Faux gradient | | | | | | | | | |
| 6 | Faux gradient of ‘output’: | | | | | 0.66414 (F2+Tiny) | 0.7 | -0.035863 (F-G) | 0.0006431 (H^2 / 2) | -0.0359 ((I - I2)/Tiny) |
| 7 | Faux gradient of ‘net’: | | | | 0.8001 (E2+Tiny) | 0.66409 (tanh(E)) | 0.7 | -0.035907 (F-G) | 0.0006447 (H^2 / 2) | -0.0201 ((I - I2)/Tiny) |
| 8 | Faux gradient of ‘bias’: | 0.4 | 0.5 | 0.6001 (D2+Tiny) | 0.8001 (B*C+D) | 0.66409 (tanh(E)) | 0.7 | -0.035907 (F-G) | 0.0006447 (H^2 / 2) | -0.0201 ((I - I2)/Tiny) |
| 9 | Faux gradient of ‘weight’: | 0.4 | 0.5001 (C2+Tiny) | 0.6 | 0.80004 (B*C+D) | 0.66406 (tanh(E)) | 0.7 | -0.035941 (F-G) | 0.0006459 (H^2 / 2) | -0.0080 ((I - I2)/Tiny) |
| 10 | Faux gradient of ‘input’: | 0.4001 (B2+Tiny) | 0.5 | 0.6 | 0.80005 (B*C+D) | 0.66406 (tanh(E)) | 0.7 | -0.035935 (F-G) | 0.0006457 (H^2 / 2) | -0.0100 ((I - I2)/Tiny) |
| | Tiny | 0.0001 | Moved down here to help with readability | | | | | | | |
Firstly: what is a gradient? It is also known as the slope or derivative of a function, or, loosely, the velocity of a value.
For a simple example, consider tides in a river mouth:
In this analogy, the height of the water is the position (like the values for the weights, bias, net, output, or loss), and the velocity of the water is the gradient (or derivative, or slope). Figuring out that gradient is what this article is all about.
For a more thorough explanation of gradients, check out Wikipedia.
The reason we want the gradients of a neuron’s weight(s) and bias is that, during training, we can use them to figure out whether we need to nudge their values up or down a bit, or leave them as-is, in order to get an output that’s closer to the target.
You can fake a gradient by comparing the result of an equation vs the result when adding a tiny amount to the input. These faux gradients are helpful for verifying our calculus later.
Here’s the general way to fake a gradient:
Faux gradient of f(x) = ( f(x + tiny) - f(x) ) / tiny
To make it more specific to our neuron:
Faux gradient of how weight affects output = (
tanh(input * (weight + tiny) + bias) -
tanh(input * weight + bias)
) / tiny
Or the big kahuna, all the way through the loss function:
Faux gradient of how bias affects loss = (
(tanh(input * weight + (bias + tiny)) - target)^2 / 2
-
(tanh(input * weight + bias) - target)^2 / 2
) / tiny
Please note that the loss function has changed vs the previous article: it now has a / 2 in it, which makes the calculus simpler (the derivative of error^2 / 2 is just the error, since the 2 from the power rule cancels the / 2).
You can look at rows 6 through 10 in the spreadsheet to see how these faux gradients are calculated. In columns B to I, various values have the tiny amount added to them, to see how this affects the final ‘loss’. For instance, on row 6, you can see I’m adding the tiny value to the output, then feeding that through to the loss function, and doing the (loss with tiny - loss without tiny) / tiny calculation to get the faux gradient. The rest of the faux gradients are similar.
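If you’d rather see the faux gradients as code than as spreadsheet formulas, here’s a minimal sketch in Rust of the same nudge-and-compare trick. The function and variable names are my own (they’re not from the spreadsheet), and I’ve used f64 here just to keep the tiny subtraction precise:

// The loss for our single neuron, as a plain function:
// tanh(input * weight + bias), compared against the target, squared, halved.
fn loss(input: f64, weight: f64, bias: f64, target: f64) -> f64 {
    let output = (input * weight + bias).tanh();
    let error = output - target;
    error * error / 2.
}

fn main() {
    let (input, weight, bias, target) = (0.4, 0.5, 0.6, 0.7);
    let tiny = 0.0001; // The 'Tiny' value from the spreadsheet.
    let base = loss(input, weight, bias, target);

    // Nudge the weight by a tiny amount and see how much the loss moves (row 9).
    let faux_weight_gradient = (loss(input, weight + tiny, bias, target) - base) / tiny;
    // Same idea for the bias (row 8).
    let faux_bias_gradient = (loss(input, weight, bias + tiny, target) - base) / tiny;

    println!("Faux weight gradient: {:.4}", faux_weight_gradient); // roughly -0.0080
    println!("Faux bias gradient: {:.4}", faux_bias_gradient); // roughly -0.0201
}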
Let’s use calculus to calculate the real gradients. Firstly we need to calculate the ‘local’ gradients. See row 3 in the spreadsheet as you follow along.
What is a local gradient? Since all our calculations are performed in stages (e.g. net > output > error > loss), a local gradient is how much impact changes in one stage have on the next stage.
A better maths teacher than I would be able to explain how we arrive at the following, but here are the formulas:
(Note: when I say ‘the gradient of Y with respect to X’, it means that X is the input/earlier stage, Y is the output/later stage, and it roughly means ‘if you nudge X, what impact will that have on Y?’.)
- The gradient of net with respect to input is the weight (see B3)
- The gradient of net with respect to weight is the input (see C3)
- The gradient of net with respect to bias is 1 (see D3)
- The gradient of output with respect to net is 1 - output^2, which is the derivative of tanh (see E3)
- The gradient of loss with respect to output is simply the error: the derivative of error^2 / 2 is the error, and nudging the output nudges the error by the same amount (this is where the / 2 in our loss helps) (see H3)
Next we need to combine these local gradients using the calculus ‘chain rule’, so that we can get the impact of each variable on the loss.
These global gradients are calculated in reverse order, from the loss back towards the input, because most of them rely on the gradient of the stage after them (this is why it is called _back_propagation). See row 4 in the spreadsheet.
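To make the chain rule concrete, here’s the whole chain written out with the numbers from row 2 of the spreadsheet (it should reproduce the global gradients in row 4):

gradient of loss with respect to output = error = -0.036
gradient of loss with respect to net = (1 - output^2) * output gradient = 0.5591 * -0.036 = -0.0201
gradient of loss with respect to bias = 1 * net gradient = -0.0201
gradient of loss with respect to weight = input * net gradient = 0.4 * -0.0201 = -0.0080
gradient of loss with respect to input = weight * net gradient = 0.5 * -0.0201 = -0.0101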
You may like to compare these with the respective faux gradients and see that they are (roughly) the same.
And there you have it: the gradients for a single neuron. Next I’ll explain how to use these gradients for training!
Just for the hell of it, here’s an implementation in Rust:
// A single neuron with one input, plus the training target, matching the spreadsheet.
struct Neuron {
    input: f32,
    weight: f32,
    bias: f32,
    target: f32,
}

impl Neuron {
    // Net: input * weight + bias (column E in the spreadsheet).
    fn net(&self) -> f32 {
        self.input * self.weight + self.bias
    }
    // Output: tanh applied to the net (column F).
    fn output(&self) -> f32 {
        self.net().tanh()
    }
    // Error: how far the output is from the target (column H).
    fn error(&self) -> f32 {
        self.output() - self.target
    }
    // Loss: half the squared error (column I).
    fn loss(&self) -> f32 {
        let e = self.error();
        e * e / 2.
    }
    // Gradient of the loss with respect to the output: simply the error.
    fn output_gradient(&self) -> f32 {
        self.error()
    }
    // Gradient of the loss with respect to the net: tanh's local derivative
    // (1 - output^2) chained with the output gradient.
    fn net_gradient(&self) -> f32 {
        let o = self.output();
        let net_local_derivative = 1. - o * o;
        net_local_derivative * self.output_gradient()
    }
    // Gradient of the loss with respect to the bias: the local gradient is 1,
    // so it's just the net gradient.
    fn bias_gradient(&self) -> f32 {
        self.net_gradient()
    }
    // Gradient of the loss with respect to the weight: the local gradient is the
    // input, chained with the net gradient.
    fn weight_gradient(&self) -> f32 {
        self.input * self.net_gradient()
    }
}

fn main() {
    let neuron = Neuron {
        input: 0.4,
        weight: 0.5,
        bias: 0.6,
        target: 0.7,
    };
    println!("Weight gradient: {:.4}", neuron.weight_gradient());
    println!("Bias gradient: {:.4}", neuron.bias_gradient());
}
Which outputs:
Weight gradient: -0.0080
Bias gradient: -0.0201
Which matches the spreadsheet nicely!
Thanks for reading, hope you found this helpful, at least a tiny bit, God bless!