It’s been a while since I’ve worked on anything machine learning related, and a project is coming up soon, so I decided it would be a good idea to refresh my ML knowledge by implementing the basics from scratch. I read a similar post by Andrej Karpathy years ago, so if you are looking for a better explanation, you might want to look there. I am deliberately not following a guide here and implementing from memory, so this may not be the best way to do things.
## Automatic Differentiation
At the heart of modern machine learning is the idea of modeling a process as a parameterized differentiable function and tuning its parameters until they match the target process as closely as possible. To do this, you take a set of input-output pairs from the target system, define a function that quantifies the difference between the real output and the output of your model (the loss function), and compute the partial derivatives of that function with respect to its parameters (the gradient). With the gradient in hand, you will know in which direction to adjust the parameters of the model to more accurately predict the given output.
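Concretely, if $L(\theta)$ denotes the loss as a function of the parameters $\theta$ and $\eta$ is a small learning rate, gradient descent repeatedly applies the update

$$\theta \leftarrow \theta - \eta \cdot \nabla_\theta L(\theta)$$

which is exactly what the optimizer at the end of this post implements.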
Perhaps the most important prerequisite for all of this is to specify your model in such a way that its derivatives are easy to compute. This is where automatic differentiation comes in. The basic idea is to keep track of the elementary operations that make up the calculation of outputs from inputs, and to apply the rules of differentiation step by step to compute the gradient.
## Inputs and Parameters
Since the combinations of inputs, parameters, and operations form a graph, we start by creating a `Node` class. A `Node` generally supports two operations: computing its value and computing its gradient. Both take the parameters and the inputs as arguments. (I use the `__call__` operator for the value computation for some syntactic sugar.)
```python
class Node:
    def __call__(self, params, values):
        raise NotImplementedError

    def grad(self, params, values):
        raise NotImplementedError
```
At this point we need to decide how to represent `params`, `values`, and the gradient. We need a data structure that stores which `Node` has which value. You could do this using arrays with complicated indexing, or you could force each `Node` to have a unique name and use a name-to-value dictionary, and there are probably other ways. I decided that the simplest way is a dictionary where the keys are the `Node`s themselves. So with the `Node` subclasses `Input` and `Parameter` we are about to define, `params` is a `dict[Parameter, float]`, `values` is a `dict[Input, float]`, and the return type of `grad` is also `dict[Parameter, float]`.
Let’s start with the simple case: a constant. A constant’s value is assigned when it is created, evaluating it simply returns that value, and its derivative is always zero (so `grad` returns an empty dictionary).
```python
class Constant(Node):
    def __init__(self, value):
        self.value = value

    def __call__(self, params, values):
        return self.value

    def grad(self, params, values):
        return {}
```
Next, we define an `Input`. These are the values we put into the model to evaluate it; they come from the training data. They are not the `Parameter`s that we change during training, and we do not calculate derivatives with respect to them. Therefore, the `grad` function still returns nothing, and the evaluation simply looks up the corresponding entry in the input `values`.
```python
class Input(Node):
    def __call__(self, params, values):
        return values[self]

    def grad(self, params, values):
        return {}
```
Moving on to the `Parameter`. In the context of a neural network, this could be a weight or a bias. These are the values for which we compute the gradient and which we want to change during training. As with `Input`, evaluating a parameter simply returns its current value, but now we see a first nonzero gradient entry: if we calculate the derivative $\frac{\partial f}{\partial x}$ of $f(x) = x$, we get one. The derivative with respect to any other variable $\frac{\partial f}{\partial y}$ is implicitly zero (and therefore not stored), since $f$ is constant with respect to any other variable.
```python
class Parameter(Node):
    def __call__(self, params, values):
        return params[self]

    def grad(self, params, values):
        return {self: 1}
```
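As a quick sanity check (not from the original derivation, just illustrative), we can already exercise the three leaf node types:

```python
c = Constant(2.0)
x = Input()
m = Parameter()

c({}, {})             # 2.0: a constant ignores params and values
x({}, {x: 5.0})       # 5.0: an input is looked up in values
m({m: 3.0}, {})       # 3.0: a parameter is looked up in params
m.grad({m: 3.0}, {})  # {m: 1}: the derivative of f(m) = m is one
```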
## Binary Operations
Implementing operations on multiple values gets more interesting; let’s start with multiplication. A `Multiply` node gets a left-hand side `lhs` and a right-hand side `rhs`. Evaluating the multiplication simply means evaluating both operands and multiplying the results. For the gradient, we differentiate both operands individually and use the product rule (see the `deriv` helper function) to combine the results.
```python
class Multiply(Node):
    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs

    def __call__(self, params, values):
        return self.lhs(params, values) * self.rhs(params, values)

    def grad(self, params, values):
        dlhs = self.lhs.grad(params, values)
        drhs = self.rhs.grad(params, values)
        return {
            key: self.deriv(
                self.lhs(params, values),
                dlhs.get(key, 0),
                self.rhs(params, values),
                drhs.get(key, 0)
            ) for key in (dlhs | drhs).keys()
        }

    def deriv(self, lhs: float, dlhs: float, rhs: float, drhs: float) -> float:
        # Product rule: (l * r)' = l' * r + l * r'
        return dlhs * rhs + drhs * lhs
```
With this in hand, we can define our first model: a linear function $f(x) = m \cdot x$.
```python
m = Parameter()
x = Input()
f = Multiply(m, x)
```
We can evaluate the function with `f({m: 3}, {x: 5})`, which gives 15, and we can calculate its gradient with respect to the parameter `m` with `f.grad({m: 3}, {x: 5})`. This will print something like `{<__main__.Parameter at 0x...>: 5}`, which means that the derivative of `f` with respect to the parameter at that memory address (in this case `m`) is five. This may seem counterintuitive at first because we are used to differentiating with respect to $x$ and getting the slope $m$ (so three in this case) as the result, but it is correct since we are differentiating with respect to $m$ here.
We can make specifying a model a bit nicer by using operator overloading to define the following methods on the `Node` class. This allows us to write `f = m * x` instead of `f = Multiply(m, x)`.
```python
# Overloads added to the Node class:
def __mul__(self, other):
    return Multiply(self, other)

def __rmul__(self, other):
    return self * other
```
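(One small Python detail: these methods can reference `Multiply` even though it is defined after `Node`, because the name is only resolved when `__mul__` is actually called, not when the class is defined.)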
Before moving on to other operations, we can simplify the code by introducing a `BinaryOp` class, which will become the common parent of `Multiply` and other binary operations. `BinaryOp` takes care of merging the gradients, and child classes only need to implement their specific differentiation rule in the `deriv` function. It is also convenient to add some instance checking to the constructor, which wraps values that aren’t `Node`s in a `Constant`. This way we can write something like `f = 2 * m * x` instead of `f = Constant(2) * m * x`.
```python
class BinaryOp(Node):
    def __init__(self, lhs, rhs):
        if not isinstance(lhs, Node):
            self.lhs = Constant(lhs)
        else:
            self.lhs = lhs
        if not isinstance(rhs, Node):
            self.rhs = Constant(rhs)
        else:
            self.rhs = rhs

    def deriv(self, lhs, dlhs, rhs, drhs):
        raise NotImplementedError

    def grad(self, params, values):
        dlhs = self.lhs.grad(params, values)
        drhs = self.rhs.grad(params, values)
        return {
            key: self.deriv(
                self.lhs(params, values),
                dlhs.get(key, 0),
                self.rhs(params, values),
                drhs.get(key, 0)
            ) for key in (dlhs | drhs).keys()
        }
```
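(Note that the `dlhs | drhs` dict merge requires Python 3.9 or newer; on older versions, `{**dlhs, **drhs}` does the same thing.)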
With this, the implementation of `Multiply` boils down to just `__call__` and `deriv`.
```python
class Multiply(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) * self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        return dlhs * rhs + drhs * lhs
```
It is now very easy to add other binary operations. Here is `Add` as an example. Don’t forget to add the operator overloads `__add__` and `__radd__` to `Node` for syntactic sugar.
```python
class Add(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) + self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        # Sum rule: (l + r)' = l' + r'
        return dlhs + drhs
```
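The loss function we build later also uses subtraction and unary negation (`1.0 - label`, `-(…)`), which I don’t show explicitly above. A minimal sketch following the same pattern (the name `Subtract` is my own choice):

```python
class Subtract(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) - self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        # Difference rule: (l - r)' = l' - r'
        return dlhs - drhs

# Overloads added to Node, analogous to __add__/__radd__:
def __sub__(self, other):
    return Subtract(self, other)

def __rsub__(self, other):
    return Subtract(other, self)

def __neg__(self):
    return Subtract(0.0, self)  # -x expressed as 0 - x
```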
## Modelling a Perceptron
As a simple example, we will try to train the parameters of a perceptron to mimic a binary logic gate, starting with an AND gate. The perceptron receives two inputs, each with a weight, and the result is offset by a bias.
```python
x1 = Input()
x2 = Input()
w1 = Parameter()
w2 = Parameter()
b1 = Parameter()
```
Mathematically, the output of the perceptron is defined as
$$f(x_1, x_2) = \sigma(w_1 \cdot x_1 + w_2 \cdot x_2 + b_1)$$
The only thing missing to model this is the sigmoid function $\sigma$. To add this operation, we will first introduce another superclass, `UnaryOp`. This takes care of applying the chain rule $(g \circ h)'(x) = g'(h(x)) \cdot h'(x)$, for which the child class only needs to specify $g'$ by implementing the `deriv` function.
```python
class UnaryOp(Node):
    def __init__(self, child):
        if not isinstance(child, Node):
            self.child = Constant(child)
        else:
            self.child = child

    def deriv(self, val):
        raise NotImplementedError

    def grad(self, params, values):
        child_val = self.child(params, values)
        return {
            param: self.deriv(child_val) * gradient
            for param, gradient
            in self.child.grad(params, values).items()
        }
```
With this helper, the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$ with its derivative $\sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2}$ can be defined as follows:
```python
import numpy as np

class Sigmoid(UnaryOp):
    def __call__(self, params, values):
        return 1.0 / (1.0 + np.exp(-self.child(params, values)))

    def deriv(self, val):
        exp_minus_x = np.exp(-val)
        return exp_minus_x / ((1 + exp_minus_x) ** 2)
```
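As an aside, the derivative can equivalently be written as $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which would let `deriv` reuse the sigmoid computation; the code above sticks to the formula as stated.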
Now we can define our `perceptron` as `Sigmoid(w1*x1 + w2*x2 + b1)`. For some given `parameters = {w1: 1, w2: 1, b1: 0}` and `inputs = {x1: 0, x2: 1}`, we can evaluate it with `perceptron(parameters, inputs)` and compute its gradient with `perceptron.grad(parameters, inputs)`.
## Model Training
Looking at the truth table of the perceptron, we can see that it is quite far away from the AND gate that we want it to mirror.
| $x_1$ | $x_2$ | Perceptron | AND |
|---|---|---|---|
| 0 | 0 | 0.5 | 0 |
| 0 | 1 | 0.7310585786300049 | 0 |
| 1 | 0 | 0.7310585786300049 | 0 |
| 1 | 1 | 0.8807970779778823 | 1 |
We can even plot the outputs for intermediate values in a heatmap to get a more visual idea of what the model is doing. This is the current perceptron output compared to a perfect AND gate.
The next step is to use automatic differentiation to build a simple optimizer that fits the parameters of our model to some training data. In this case, the training data is simply the truth table of the AND gate we want to mimic. The optimizer will adjust the parameters of a given model so that its output value is minimized. To phrase our goal of matching the training data as minimizing a function value, we need to extend the model so that minimizing its output means matching a given target output: we need a loss function. Since our model produces binary outputs, we will use the cross-entropy loss
$$L(y, y') = -(y' \log(y) + (1 - y') \log(1 - y))$$

where $y$ is the model output and $y'$ is the label (the correct output).
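The loss below uses a `Log` node that we haven't defined yet. Using the `UnaryOp` helper and $\frac{d}{dx} \log(x) = \frac{1}{x}$, a minimal version (my sketch, not shown above) looks like this:

```python
class Log(UnaryOp):
    def __call__(self, params, values):
        return np.log(self.child(params, values))

    def deriv(self, val):
        # d/dx log(x) = 1/x, evaluated at the child's value
        return 1.0 / val
```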
```python
label = Input()
loss = -((label * Log(perceptron))
         + ((1.0 - label) * Log(1.0 - perceptron)))
```
Now we can specify our training data with these input dictionaries:
```python
training_data = [
    {x1: 0, x2: 0, label: 0},
    {x1: 0, x2: 1, label: 0},
    {x1: 1, x2: 0, label: 0},
    {x1: 1, x2: 1, label: 1},
]
```
## Optimization
Now we come to the heart of the whole model learning process: the optimizer. The basic idea is pretty simple: we loop over all the samples in the training data and compute the average of their gradients, which gives us the direction in which to change the parameters to improve the results. Then we update the parameters by subtracting the average gradient (scaled by the learning rate) to take a step in that “right” direction. This is repeated for many epochs until we find a minimum.
```python
def optimize(
    loss_fun: Node,
    initial_params: dict[Parameter, float],
    training_data: list[dict[Input, float]]
) -> dict[Parameter, float]:
    epochs = 2000
    learning_rate = 0.3
    params = initial_params
    for _ in range(epochs):
        # The gradients for all entries in training_data
        gradients = [loss_fun.grad(params, inputs)
                     for inputs in training_data]
        # The average gradient across all training samples
        # (.get guards against a parameter missing from a sample's gradient)
        gradient = {
            parameter: np.average([grad.get(parameter, 0) for grad in gradients])
            for parameter in params.keys()
        }
        # Take a gradient-descent step against the average gradient
        params = {
            parameter: prev - gradient[parameter] * learning_rate
            for parameter, prev in params.items()
        }
    return params
```
If we now run `optimize(loss, parameters, training_data)`, we get back a set of optimized parameters:

```python
{
    w1: 6.403725189897986,
    w2: 6.403725189897986,
    b1: -9.781184135186036
}
```
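To reproduce the “Optimized Perceptron” column below, we can simply evaluate the model with the returned parameters (illustrative usage; `trained` is my name for the result of `optimize`):

```python
trained = optimize(loss, parameters, training_data)
for sample in training_data:
    print(sample[x1], sample[x2], perceptron(trained, sample))
```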
Looking at the truth table, we can see that with these parameters the perceptron comes quite close to the target AND gate, which is also clearly visible in the heat map.
| $x_1$ | $x_2$ | Perceptron | Optimized Perceptron | AND |
|---|---|---|---|---|
| 0 | 0 | 0.5 | 0.000057 | 0 |
| 0 | 1 | 0.73105 | 0.033007 | 0 |
| 1 | 0 | 0.73105 | 0.033007 | 0 |
| 1 | 1 | 0.88079 | 0.953746 | 1 |
In conclusion, implementing basic machine learning through automatic differentiation from scratch is not as hard as one might think. I would like to extend this by building layers and convolutions to eventually tackle a more interesting problem like MNIST.