It’s been a while since I’ve worked on anything machine learning related, and a project is coming up soon, so I decided it would be a good idea to refresh my ML knowledge by implementing the basics from scratch. I read a similar post by Andrej Karpathy years ago, so if you are looking for a better explanation, you might want to look there. I am deliberately not following a guide here and implementing from memory, so this may not be the best way to do things.
## Automatic Differentiation
At the heart of modern machine learning is the idea of modeling a process as a parameterized differentiable function and tuning its parameters until they match the target process as closely as possible. To do this, you take a set of input-output pairs from the target system, define a function that quantifies the difference between the real output and the output of your model (the loss function), and compute the partial derivatives of that function with respect to its parameters (the gradient). With the gradient in hand, you will know in which direction to adjust the parameters of the model to more accurately predict the given output.
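Concretely, if $L(\theta)$ denotes the loss as a function of the parameters $\theta$ and $\eta$ is a small learning rate, gradient descent repeatedly applies the update

$$\theta \leftarrow \theta - \eta \cdot \nabla_\theta L(\theta)$$

which is exactly what the optimizer at the end of this post implements.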
Perhaps the most important prerequisite for all of this is to specify your model in such a way that its derivatives are easy to compute. This is where automatic differentiation comes in. The basic idea is to keep track of the elementary operations that make up the calculation of outputs from inputs, and to apply the rules of differentiation step by step to compute the gradient.
## Inputs and Parameters
Since the combinations of inputs, parameters, and operations form a graph, we start by creating a `Node` class. A `Node` generally supports two operations: computing its value and computing its gradient. Both take the parameters and the inputs as arguments. (I use the `__call__` operator for the value computation for some syntactic sugar.)
```python
class Node:
    def __call__(self, params, values):
        raise NotImplementedError

    def grad(self, params, values):
        raise NotImplementedError
```
At this point we need to decide how to represent `params`, `values`, and the gradient. We need a data structure that stores which `Node` has which value. You could do this using arrays with complicated indexing, or you could force each `Node` to have a unique name and use a name-to-value dictionary, and there are probably other ways. I decided that the simplest way is a dictionary where the keys are the `Node`s themselves. So with the `Node` subclasses `Input` and `Parameter` we are about to define, `params` is a `dict[Parameter, float]`, `values` is a `dict[Input, float]`, and the return type of `grad` is also `dict[Parameter, float]`.
Let’s start with the simple case: a constant. A constant’s value is assigned when it is created, evaluating it simply returns that value, and its derivative is always zero (so `grad` returns an empty dictionary).
```python
class Constant(Node):
    def __init__(self, value):
        self.value = value

    def __call__(self, params, values):
        return self.value

    def grad(self, params, values):
        return {}
```
Next, we define an `Input`. These are the values we put into the model to evaluate it; they come from the training data. They are not the `Parameter`s that we change during training, and we do not calculate derivatives with respect to them. Therefore, the `grad` function still returns nothing, and the evaluation simply looks up the corresponding entry in the input `values`.
```python
class Input(Node):
    def __call__(self, params, values):
        return values[self]

    def grad(self, params, values):
        return {}
```
Moving on to the `Parameter`. In the context of a neural network, this could be a weight or a bias. These are the values for which we compute the gradient and which we want to change during training. As with `Input`, evaluating a parameter simply returns its current value, but now we see a first nonzero gradient entry: if we calculate the derivative $\frac{\partial f}{\partial x}$ of $f(x) = x$, we get one. The derivative with respect to any other variable $\frac{\partial f}{\partial y}$ is implicitly zero (and therefore not stored), since $f$ is constant with respect to any other variable.
```python
class Parameter(Node):
    def __call__(self, params, values):
        return params[self]

    def grad(self, params, values):
        return {self: 1}
```
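As a quick sanity check (not from the original derivation, just illustrative), we can already exercise the three leaf node types:

```python
c = Constant(2.0)
x = Input()
m = Parameter()

c({}, {})             # 2.0: a constant ignores params and values
x({}, {x: 5.0})       # 5.0: an input is looked up in values
m({m: 3.0}, {})       # 3.0: a parameter is looked up in params
m.grad({m: 3.0}, {})  # {m: 1}: the derivative of f(m) = m is one
```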
## Binary Operations
Implementing operations on multiple values gets more interesting; let’s start with multiplication. A `Multiply` node gets a left-hand side `lhs` and a right-hand side `rhs`. Evaluating the multiplication simply means evaluating both operands and multiplying the results. For the gradient, we differentiate both operands individually and use the product rule (see the `deriv` helper function) to combine the results.
```python
class Multiply(Node):
    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs

    def __call__(self, params, values):
        return self.lhs(params, values) * self.rhs(params, values)

    def grad(self, params, values):
        dlhs = self.lhs.grad(params, values)
        drhs = self.rhs.grad(params, values)
        return {
            key: self.deriv(
                self.lhs(params, values),
                dlhs.get(key, 0),
                self.rhs(params, values),
                drhs.get(key, 0)
            ) for key in (dlhs | drhs).keys()
        }

    def deriv(self, lhs: float, dlhs: float, rhs: float, drhs: float) -> float:
        # Product rule: (l * r)' = l' * r + l * r'
        return dlhs * rhs + drhs * lhs
```
With this in hand, we can define our first model: a linear function $f(x) = m \cdot x$.
```python
m = Parameter()
x = Input()
f = Multiply(m, x)
```
We can evaluate the function with `f({m: 3}, {x: 5})`, which gives 15, and we can calculate its gradient with respect to the parameter `m` with `f.grad({m: 3}, {x: 5})`. This will print something like `{<__main__.Parameter at 0x...>: 5}`, which means that the derivative of `f` with respect to the parameter at that memory address (in this case `m`) is five. This may seem counterintuitive at first because we are used to differentiating with respect to $x$ and getting the slope $m$ (so three in this case) as the result, but it is correct since we are differentiating with respect to $m$ here.
We can make specifying a model a bit nicer by using operator overloading to define the following methods on the `Node` class. This allows us to write `f = m * x` instead of `f = Multiply(m, x)`.
```python
# Overloads added to the Node class:
def __mul__(self, other):
    return Multiply(self, other)

def __rmul__(self, other):
    return self * other
```
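(One small Python detail: these methods can reference `Multiply` even though it is defined after `Node`, because the name is only resolved when `__mul__` is actually called, not when the class is defined.)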
Before moving on to other operations, we can simplify the code by introducing a `BinaryOp` class, which will become the common parent of `Multiply` and other binary operations. `BinaryOp` takes care of merging the gradients, and child classes only need to implement their specific differentiation rule in the `deriv` function. It is also convenient to add some instance checking to the constructor, which wraps values that aren’t `Node`s in a `Constant`. This way we can write something like `f = 2 * m * x` instead of `f = Constant(2) * m * x`.
```python
class BinaryOp(Node):
    def __init__(self, lhs, rhs):
        if not isinstance(lhs, Node):
            self.lhs = Constant(lhs)
        else:
            self.lhs = lhs
        if not isinstance(rhs, Node):
            self.rhs = Constant(rhs)
        else:
            self.rhs = rhs

    def deriv(self, lhs, dlhs, rhs, drhs):
        raise NotImplementedError

    def grad(self, params, values):
        dlhs = self.lhs.grad(params, values)
        drhs = self.rhs.grad(params, values)
        return {
            key: self.deriv(
                self.lhs(params, values),
                dlhs.get(key, 0),
                self.rhs(params, values),
                drhs.get(key, 0)
            ) for key in (dlhs | drhs).keys()
        }
```
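(Note that the `dlhs | drhs` dict merge requires Python 3.9 or newer; on older versions, `{**dlhs, **drhs}` does the same thing.)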
With this, the implementation of `Multiply` boils down to just `__call__` and `deriv`.
```python
class Multiply(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) * self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        return dlhs * rhs + drhs * lhs
```
It is now very easy to add other binary operations. Here is `Add` as an example. Don’t forget to add the operator overloads `__add__` and `__radd__` to `Node` for syntactic sugar.
```python
class Add(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) + self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        # Sum rule: (l + r)' = l' + r'
        return dlhs + drhs
```
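The loss function we build later also uses subtraction and unary negation (`1.0 - label`, `-(…)`), which I don’t show explicitly above. A minimal sketch following the same pattern (the name `Subtract` is my own choice):

```python
class Subtract(BinaryOp):
    def __call__(self, params, values):
        return self.lhs(params, values) - self.rhs(params, values)

    def deriv(self, lhs, dlhs, rhs, drhs):
        # Difference rule: (l - r)' = l' - r'
        return dlhs - drhs

# Overloads added to Node, analogous to __add__/__radd__:
def __sub__(self, other):
    return Subtract(self, other)

def __rsub__(self, other):
    return Subtract(other, self)

def __neg__(self):
    return Subtract(0.0, self)  # -x expressed as 0 - x
```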
## Modelling a Perceptron
As a simple example, we will try to train the parameters of a perceptron to mimic a binary logic gate, starting with an AND gate. The perceptron receives two inputs, each with a weight, and the result is offset by a bias.
```python
x1 = Input()
x2 = Input()
w1 = Parameter()
w2 = Parameter()
b1 = Parameter()
```
Mathematically, the output of the perceptron is defined as
$$f(x_1, x_2) = \sigma(w_1 \cdot x_1 + w_2 \cdot x_2 + b_1)$$
The only thing missing to model this is the sigmoid function $\sigma$. To add this operation, we will first introduce another superclass, `UnaryOp`. This takes care of applying the chain rule $(g \circ h)'(x) = g'(h(x)) \cdot h'(x)$, for which the child class only needs to specify $g'$ by implementing the `deriv` function.
```python
class UnaryOp(Node):
    def __init__(self, child):
        if not isinstance(child, Node):
            self.child = Constant(child)
        else:
            self.child = child

    def deriv(self, val):
        raise NotImplementedError

    def grad(self, params, values):
        child_val = self.child(params, values)
        return {
            param: self.deriv(child_val) * gradient
            for param, gradient
            in self.child.grad(params, values).items()
        }
```
With this helper, the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$ with its derivative $\sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2}$ can be defined as follows:
```python
import numpy as np

class Sigmoid(UnaryOp):
    def __call__(self, params, values):
        return 1.0 / (1.0 + np.exp(-self.child(params, values)))

    def deriv(self, val):
        exp_minus_x = np.exp(-val)
        return exp_minus_x / ((1 + exp_minus_x) ** 2)
```
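As an aside, the derivative can equivalently be written as $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which would let `deriv` reuse the sigmoid computation; the code above sticks to the formula as stated.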
Now we can define our `perceptron` as `Sigmoid(w1*x1 + w2*x2 + b1)`. For some given `parameters = {w1: 1, w2: 1, b1: 0}` and `inputs = {x1: 0, x2: 1}`, we can evaluate it with `perceptron(parameters, inputs)` and compute its gradient with `perceptron.grad(parameters, inputs)`.
## Model Training
Looking at the truth table of the perceptron, we can see that it is quite far away from the AND gate that we want it to mirror.
| $x_1$ | $x_2$ | Perceptron | AND |
|---|---|---|---|
| 0 | 0 | 0.5 | 0 |
| 0 | 1 | 0.7310585786300049 | 0 |
| 1 | 0 | 0.7310585786300049 | 0 |
| 1 | 1 | 0.8807970779778823 | 1 |
We can even plot the outputs for intermediate values in a heatmap to get a more visual idea of what the model is doing. This is the current perceptron output compared to a perfect AND gate.
The next step is to use automatic differentiation to build a simple optimizer that fits the parameters of our model to some training data. In this case, the training data is simply the truth table of the AND gate we want to mimic. The optimizer will adjust the parameters of a given model so that its output value is minimized. To phrase our goal of matching the training data as minimizing a function value, we need to extend the model so that minimizing its output means matching a given target output: we need a loss function. Since our model produces binary outputs, we will use the cross-entropy loss
$$L(y, y') = -(y' \log(y) + (1 - y') \log(1 - y))$$

where $y$ is the model output and $y'$ is the label (the correct output).
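The loss below uses a `Log` node that we haven't defined yet. Using the `UnaryOp` helper and $\frac{d}{dx} \log(x) = \frac{1}{x}$, a minimal version (my sketch, not shown above) looks like this:

```python
class Log(UnaryOp):
    def __call__(self, params, values):
        return np.log(self.child(params, values))

    def deriv(self, val):
        # d/dx log(x) = 1/x, evaluated at the child's value
        return 1.0 / val
```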
```python
label = Input()
loss = -((label * Log(perceptron))
         + ((1.0 - label) * Log(1.0 - perceptron)))
```
Now we can specify our training data with these input dictionaries:
```python
training_data = [
    {x1: 0, x2: 0, label: 0},
    {x1: 0, x2: 1, label: 0},
    {x1: 1, x2: 0, label: 0},
    {x1: 1, x2: 1, label: 1},
]
```
## Optimization
Now we come to the heart of the whole model learning process: the optimizer. The basic idea is pretty simple: we loop over all the samples in the training data and compute the average of their gradients, which gives us the direction in which to change the parameters to improve the results. Then we update the parameters by subtracting the average gradient (scaled by the learning rate) to take a step in that “right” direction. This is repeated for many epochs until we find a minimum.
```python
def optimize(
    loss_fun: Node,
    initial_params: dict[Parameter, float],
    training_data: list[dict[Input, float]]
) -> dict[Parameter, float]:
    epochs = 2000
    learning_rate = 0.3
    params = initial_params
    for _ in range(epochs):
        # The gradients for all entries in training_data
        gradients = [loss_fun.grad(params, inputs)
                     for inputs in training_data]
        # The average gradient across all training samples
        # (.get guards against a parameter missing from a sample's gradient)
        gradient = {
            parameter: np.average([grad.get(parameter, 0) for grad in gradients])
            for parameter in params.keys()
        }
        # Take a gradient-descent step against the average gradient
        params = {
            parameter: prev - gradient[parameter] * learning_rate
            for parameter, prev in params.items()
        }
    return params
```
If we now run `optimize(loss, parameters, training_data)`, we get back a set of optimized parameters:

```python
{
    w1: 6.403725189897986,
    w2: 6.403725189897986,
    b1: -9.781184135186036
}
```
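To reproduce the “Optimized Perceptron” column below, we can simply evaluate the model with the returned parameters (illustrative usage; `trained` is my name for the result of `optimize`):

```python
trained = optimize(loss, parameters, training_data)
for sample in training_data:
    print(sample[x1], sample[x2], perceptron(trained, sample))
```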
Looking at the truth table, we can see that with these parameters the perceptron comes quite close to the target AND gate, which is also clearly visible in the heat map.
| $x_1$ | $x_2$ | Perceptron | Optimized Perceptron | AND |
|---|---|---|---|---|
| 0 | 0 | 0.5 | 0.000057 | 0 |
| 0 | 1 | 0.73105 | 0.033007 | 0 |
| 1 | 0 | 0.73105 | 0.033007 | 0 |
| 1 | 1 | 0.88079 | 0.953746 | 1 |
In conclusion, implementing basic machine learning through automatic differentiation from scratch is not as hard as one might think. I would like to extend this by building layers and convolutions to eventually tackle a more interesting problem like MNIST.