The miniaturist's neural network
Preface
We will implement a fully-connected feed-forward neural network[1], in other words, a
Multilayer perceptron
Essentially, a parametrized function \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\), fitted by solving an optimization problem, with exceptionally good properties for approximating other continuous functions on compact subsets of \(\mathbb{R}^n\). A multilayer perceptron (MLP) of \(L\) layers, features \(x_i\), and targets \(y_i\) has the following recursive definition:
\begin{equation*} f = \begin{cases} a_i^{(0)} = x_i & \\ a_i^{(l)} = \sigma\left( \sum_{j=1}^{N_{l-1}} w_{ij}^{(l)}\, a_j^{(l-1)} + b_i^{(l)} \right) = \sigma\left( z_i^{(l)} \right) & l \in [1, L] \end{cases} \end{equation*}where \(a_i^{(l)}\) is the activation of the \(i\)-th neuron in layer \(l\), \(w_{ij}^{(l)}\) is the weight connecting the \(j\)-th neuron in layer \(l-1\) to the \(i\)-th neuron in layer \(l\), \(b_i^{(l)}\) is the bias for the \(i\)-th neuron in layer \(l\), \(N_l\) is the number of neurons in layer \(l\), and \(\sigma\) is the activation function (the logistic function in our case).
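For instance, with a single hidden layer (\(L = 2\)) and writing \(W^{(l)}\) for the matrix with entries \(w_{ij}^{(l)}\), the recursion unrolls to:
\begin{equation*} a^{(1)} = \sigma\left( W^{(1)} x + b^{(1)} \right), \qquad f(x) = a^{(2)} = \sigma\left( W^{(2)} a^{(1)} + b^{(2)} \right) \end{equation*}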
As a reference implementation, we will use Tinn, which is an MLP with a single hidden layer, written in pure C with no dependencies[2]. As usual, we will set the stage by importing and defining some utility functions, namely plotting, random number generation, matrix product, and root mean square error:
Setplot‿Plot ← •Import "../bqn-utils/plots.bqn"  # plotting helpers
U ← 0.5-˜•rand.Range⟜0    # uniform random array in [¯0.5, 0.5), shape 𝕩
M ← +˝∘×⎉1‿∞              # matrix product
RMS ← √≠∘⊢÷˜·+´×˜∘-       # root mean square error between 𝕨 and 𝕩
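As a quick sanity check of these helpers, with a few hand-picked toy values of my own (not part of the post's session):

U 2‿3                  # a 2×3 array of uniform values in [¯0.5, 0.5)
(2‿2⥊1‿0‿0‿1) M 3‿4    # identity matrix times a vector: ⟨ 3 4 ⟩
0‿0 RMS 3‿4            # √((9+16)÷2) ≈ 3.5355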
Tinn's diminution
The original C implementation has 175 lines, excluding the optimisation loop. The BQN version has only 10, and is more general, as it supports an arbitrary number of hidden layers. The biases are left unoptimised, as in Tinn: a fair price to pay to avoid adding an ugly extra column to the weight matrices, all zeros except the first element. For toy problems like the ones I am interested in, we have enough variational freedom with the weights alone.
Minn ← {rt‿ly𝕊dat:
  A‿DA ← ⟨1÷1+⋆∘-, ⊢×1⊸-⟩   # logistic activation and its derivative, a×(1-a)
  BP ← {fs‿ts‿we𝕊𝕩:         # backward pass: layer errors, then weight gradients
    do ← (<-⟜ts×DA)⊑⊑hx‿tx ← ¯1(↑⋈↓)𝕩
    (fs<⊸∾tx)×⌜˜¨do∾˜{d𝕊w‿z: z DA⊸×d M˜⍉w}`˜⟜do⌾⌽tx⋈¨˜1↓we }
  FP ← {z𝕊bi‿we: A bi+we M z}`⟜(⋈¨´)   # forward pass: scan σ(b + W·a) through the layers
  nn ← dat{𝕨𝕊bi‿we: fs‿ts ← 𝕨(↑⋈↓)˜⊑ly   # one gradient-descent step per training row
    bi⋈we-rt× fs‿ts‿we BP fs<⊸FP𝕩}˝˜(U⚇1-⟜1∘≠⋈·<∘⌽˘2⊸↕)ly   # random weights, fixed scalar bias per layer
  E ⇐ ⊢´<⊸FP⟜nn }   # E: evaluate the trained network on a list of features
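To sketch the calling convention with a hypothetical toy problem (my own, not from the post): dat is a rank-2 array whose rows are features followed by targets, the left argument couples the learning rate with the layer sizes, and the exported E evaluates the trained network on a list of features.

xs ← 0.01×↕100                  # inputs in [0, 1)
dat ← ∾20⥊<(xs ≍˘ ×˜xs)         # rows ⟨x, x²⟩, repeated to mimic a few epochs
sq ← 0.1‿⟨1, 8, 1⟩ Minn dat     # one input, a hidden layer of 8 neurons, one output
sq.E ⟨0.5⟩                      # ideally close to ⟨ 0.25 ⟩, though a toy run will be rough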
Training the MLP involves a two-stage process for each input: forward propagation followed by backpropagation, during which the neural network's weight matrices are adjusted to minimize a cost function. The second step is often shrouded in mystery, despite being nothing more than
An application of the chain rule
Before introducing a vectorized representation of the backpropagation algorithm, it is important to note that we use a quadratic loss function \( C = \frac{1}{2} \| a^{(L)} - y \|^2 \), and optimize the network using gradient descent. Using the MLP definition in the first collapsible and the chain rule, we can compute the error at the output layer \(L\) with the following Hadamard product:
\begin{equation*} \delta^{(L)} = \left( a^{(L)} - y \right) \odot \sigma'\left( z^{(L)} \right) \end{equation*}The sigmoid is the solution to the logistic differential equation; can you work out what its derivative is? Then, the total derivative and the chain rule come to the rescue once again to express the error of the hidden layers \(l\in [1,L)\):
\begin{equation*} \delta^{(l)} = \left({W^{(l+1)}}^\top \delta^{(l+1)}\right) \odot \sigma'\left( z^{(l)} \right) \end{equation*}where we have introduced the matrix form of the weights \(W^{(l)}\). The gradient of the cost function is:
\begin{equation*} \nabla C = \left\{ \frac{\partial C}{\partial W^{(l)}} = \delta^{(l)} {a^{(l-1)}}^\top, \quad \frac{\partial C}{\partial b^{(l)}} = \delta^{(l)} \right\}_{l=1}^{L} \end{equation*}Finally, we can do a gradient descent step with a learning rate \(\eta\), which may be annealed:
\begin{equation*} \Delta\left\{W^{(l)}, b^{(l)}\right\}_{l=1}^{L} \gets -\eta\nabla C \end{equation*}For a straightforward derivation, refer to the dedicated section in Nielsen's book. For a rigorous presentation, see arXiv:2107.09384.
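Incidentally, the derivative asked about above need not be computed from \(z^{(l)}\) at all: because the logistic function satisfies its own differential equation, \(\sigma'\) can be evaluated from the activations alone, which is exactly what the pair A‿DA in Minn exploits:
\begin{equation*} \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'\left(z^{(l)}\right) = \sigma\left(z^{(l)}\right)\left(1 - \sigma\left(z^{(l)}\right)\right) = a^{(l)}\left(1 - a^{(l)}\right). \end{equation*}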
Learning the logistic map
Minn should handle digit recognition just fine[3]. However, I would like to switch clichés for the demonstration. Instead, we will use it to learn the logistic map[4], \(x_{n+1} = r\,x_n\left(1 - x_n\right)\). This is a quintessential example of how chaos can emerge from simple systems. Moreover, it is not so trivial to approximate: the recurrence lacks a closed-form solution, and it has been a subject of study in the context of neural networks[5]. First, let's generate some (non-overlapping) training and test data, shuffling after each epoch to reduce correlation:
neq‿ntr‿nte‿e ← 600‿100‿50‿100                     # equilibration steps, train/test lengths per r, epochs
I ← {↕∘⌈⌾((2.8+𝕩×⊢)⁼)4}                            # grid of r values from 2.8 up to 4 in steps of 𝕩
L ← {𝕨(⊣×1⊸-×⊢)⍟((neq-𝕩)+↕𝕩)•rand.Range 0}         # 𝕩 iterates of the map (r=𝕨) after equilibration, random start
te ← ∾{𝕩∾˘⊏⍉2↕𝕩L nte}¨I (0.004-˜10⊸⋆÷˜√)2          # test rows ⟨r, xₙ⟩ (features only) on a finer r grid
tr ← •rand.Deal∘≠⊸⊏⊸∾⍟(e-1)˜∾{𝕩∾˘2↕𝕩L ntr}¨I 0.1   # training rows ⟨r, xₙ, xₙ₊₁⟩, replicated and shuffled per epoch
≠¨tr‿te
⟨ 128700 5831 ⟩
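Reading the code above: each training row is ⟨r, xₙ, xₙ₊₁⟩ (two features and one target), while the test rows keep only the features ⟨r, xₙ⟩ on a finer grid of r. The tacit step ⊣×1⊸-×⊢ inside L is one application of the recurrence; a minimal check with hand-picked values (not from the post):

3.9 (⊣×1⊸-×⊢) 0.2         # one step: 3.9 × 0.2 × 0.8 = 0.624
3.9 (⊣×1⊸-×⊢)⍟(↕4) 0.2    # the first iterates: ⟨ 0.2 0.624 0.9150336 0.3032… ⟩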
The network can then be trained. The values of hyper-parameters like the learning rate rt, the number and size of the layers ly, and the number of epochs e are system-dependent and subject to change. You have to experiment with them until you get a satisfactory error.
lm ← 0.001‿⟨2, 300, 1⟩ Minn tr
(⊢RMS⟜∾·lm.E¨⊣)˝⍉tr
0.044822952594953065
Let’s see if we’ve gotten the numbers right after learning. But then again, what is a number, that a man may know it[6]…
)r Setplot "scatter" ⋄ •Out¨ Plot˝⍉(⊑∾lm.E)˘te
Footnotes:
1. This post is not intended to be an introduction to the topic. There are excellent videos, lecture notes, papers, and books that do this better than I could. I will provide only the essential context to ensure the reading is self-contained.
2. Programming by poking is the antithesis of this blog's ethos.
3. You can try using UCI's Semeion Handwritten Digit dataset, like Tinn does.
4. Isn't it fascinating how closely related and yet so different the logistic map and the logistic function are? The former can be thought of as a discrete version of \(\dot{f} = f(1 - f)\), but whereas this ODE has a boring sigmoid solution, the logistic map yields beautiful bifurcation diagrams.
5. See, for instance, arXiv:2409.07468.