The miniaturist's neural network (WIP)
Preface
We will implement a fully-connected feed-forward neural network[1], in other words, a
Multilayer perceptron
Training one is essentially an optimization problem over a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) that has exceptionally good properties for approximating continuous functions on compact subsets of \(\mathbb{R}^n\). A multilayer perceptron (MLP) of \(L\) layers, features \(x_i\), and targets \(y_i\) has the following recursive definition:
\begin{equation*} f = \begin{cases} a_i^{(0)} = x_i & \\ a_i^{(l)} = \sigma\left( \sum_{j=1}^{N_{l-1}} w_{ij}^{(l)}\, a_j^{(l-1)} + b_i^{(l)} \right) = \sigma\left( z_i^{(l)} \right) & l \in [1, L] \end{cases} \end{equation*}where \(a_i^{(l)}\) is the activation of the layer \(l\), \(w_{ij}^{(l)}\) is the weight connecting the \(j\)-th neuron in layer \(l-1\) to the \(i\)-th neuron in layer \(l\), \(b_i^{(l)}\) is the bias for the \(i\)-th neuron in layer \(l\), \(N_l\) is the number of neurons in layer \(l\), and \(\sigma\) is the activation function (the logistic function in our case).
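For instance, writing the weights of each layer as a matrix \(W^{(l)} = [w_{ij}^{(l)}]\), the single-hidden-layer case used by our reference implementation below (\(L = 2\)) unrolls to
\begin{equation*} a^{(1)} = \sigma\left( W^{(1)} x + b^{(1)} \right), \qquad a^{(2)} = \sigma\left( W^{(2)} a^{(1)} + b^{(2)} \right), \end{equation*}so the network's output is simply \(f(x) = a^{(2)}\).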
As a reference implementation, we will use Tinn, which is an MLP with a single hidden layer, written in pure C with no dependencies[2]. As usual, we will set the stage by importing and defining some utility functions, namely for plotting, random number generation, and the matrix product:
Setplot‿Plot ← •Import "../bqn-utils/plots.bqn"  # Plotting helpers
U ← 0.5-˜•rand.Range⟜0  # Uniform random numbers in [-0.5, 0.5)
M ← +˝∘×⎉1‿∞  # Matrix product
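As a quick illustration of these helpers (the small arrays below are made up for the example):
w0 ← U 2‿3  # A random 2×3 array with entries in [-0.5, 0.5)
p ← 2‿3⥊↕6  # ⟨⟨0,1,2⟩ ⟨3,4,5⟩⟩
q ← 3‿2⥊↕6  # ⟨⟨0,1⟩ ⟨2,3⟩ ⟨4,5⟩⟩
p M q  # Matrix product: 2‿2⥊10‿13‿28‿40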
Tinn's diminution
The original C implementation has 175 lines, excluding the optimization loop. The BQN version needs only the following:
Minn ← {r‿l𝕊𝕩:
  C ⇐ +`2÷˜·×˜-  # Quadratic cost
  A‿DA ← ⟨1÷1+⋆∘-, ⊢×1⊸-⟩  # Logistic activation and its derivative
  F ⇐ {𝕨𝕊b‿w: A¨b+w⊸M𝕨}`  # Forward pass: scan the input through the layers
  B ← {  # Backward pass
    do ← (DA∘⊣×𝕗-⊢)´⊢´¨𝕨‿𝕩  # Error at the output layer
  }
  l(U⚇1-⟜1∘≠⋈·<˘2⊸↕)⊸{fs‿ts𝕊b‿w:  # Training loop
    B fs F b⋈¨w
  }´𝕩
}
Training the MLP involves a two-stage process for each input: forward propagation followed by backpropagation, during which the neural network's weight matrices are adjusted to minimize a cost function. The second step is often shrouded in mystery, despite being nothing more than
An application of the chain rule
Before introducing a vectorized representation of the backpropagation algorithm, it is important to note that we use a quadratic loss function \( C = \frac{1}{2} \| a^{(L)} - y \|^2 \), and optimize the network using gradient descent. Using the MLP definition in the first collapsible and the chain rule, we can compute the error at the output layer \(L\) with the following Hadamard product:
\begin{equation*} \delta^{(L)} = \left( a^{(L)} - y \right) \odot \sigma'\left( z^{(L)} \right) \end{equation*}The sigmoid is the solution to the logistic differential equation; can you work out what its derivative is? Then, the total derivative and the chain rule come to the rescue once again to express the error of the hidden layers \(l\in [1,L)\):
\begin{equation*} \delta^{(l)} = \left({W^{(l+1)}}^\top \delta^{(l+1)}\right) \odot \sigma'\left( z^{(l)} \right) \end{equation*}where we have introduced the matrix form of the weights \(W^{(l)}\). The gradient of the cost function is:
\begin{equation*} \nabla C = \left\{ \frac{\partial C}{\partial W^{(l)}} = \delta^{(l)} {a^{(l-1)}}^\top, \quad \frac{\partial C}{\partial b^{(l)}} = \delta^{(l)} \right\}_{l=1}^{L} \end{equation*}Finally, we can take a gradient-descent step with a learning rate \(\eta\), which may optionally be annealed:
\begin{equation*} \Delta\left\{W^{(l)}, b^{(l)}\right\}_{l=1}^{L} \gets -\eta\nabla C \end{equation*}For a straightforward derivation, refer to the dedicated section in Nielsen's book. For a rigorous presentation, see arXiv:2107.09384.
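Answering the teaser above: since the logistic function solves \(\dot{\sigma} = \sigma(1 - \sigma)\), its derivative can be read off the activations themselves,
\begin{equation*} \sigma'\left(z^{(l)}\right) = \sigma\left(z^{(l)}\right)\left(1 - \sigma\left(z^{(l)}\right)\right) = a^{(l)} \odot \left(1 - a^{(l)}\right), \end{equation*}which is presumably what DA ← ⊢×1⊸- encodes in Minn. To make the update rule concrete, here is a standalone toy sketch of a single gradient-descent step for the output layer, with made-up numbers and an arbitrary learning rate; it follows the equations above and is not part of Minn:
DA ← ⊢×1⊸-  # σ' expressed through the activation a
eta ← 0.5  # Learning rate η (arbitrary)
a1 ← 0.55‿0.62  # Hidden activations a^(L-1) (made up)
a2 ← 0.7‿0.4  # Output activations a^(L) (made up)
y ← 1‿0  # Targets
d ← (a2-y)×DA a2  # δ^(L) = (a^(L) - y) ⊙ σ'(z^(L))
dw ← d×⌜a1  # ∂C/∂W^(L) = δ^(L) (a^(L-1))ᵀ, an outer product
w2 ← 2‿2⥊0.1‿0.2‿0.3‿0.4  # Current W^(L) (made up)
b2 ← 0‿0  # Current b^(L)
nw‿nb ← ⟨w2-eta×dw, b2-eta×d⟩  # Gradient-descent update of W^(L) and b^(L)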
Learning the logistic map
Minn should handle digit recognition just fine[3]. However, I would like to switch clichés for the demonstration. Instead, we will use it to learn the logistic map[4], a quintessential example of how chaos can emerge from simple systems. Moreover, it is not so trivial to approximate: the recurrence lacks a closed-form solution and has been a subject of study in the context of neural networks[5].
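For reference, the recurrence in question is
\begin{equation*} x_{n+1} = r\, x_n \left(1 - x_n\right), \qquad x_n \in [0, 1],\; r \in [0, 4], \end{equation*}which is exactly what the tacit function L below computes, with the parameter \(r\) as its left argument.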
L ← ⊣×1⊸-×⊢  # Logistic map: r×x×(1-x)
np‿ns‿ri‿rf‿dr ← 600‿50‿2.8‿4‿0.01  # Iterations, samples kept per r, and the r sweep
r ← ↕∘⌈⌾((ri+dr×⊢)⁼)rf  # Values of the parameter r, from ri to rf in steps of dr
@ ⊣ m ← r L⍟((np-ns)+↕ns)¨ 0 •rand.Range˜ ≠r  # Last ns of np iterates for each r, from random starts (@ ⊣ suppresses the output)
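A quick sanity check of L: for \(1 < r < 3\) the map settles on the fixed point \(1 - 1/r\), so at, say, \(r = 2.5\) a long iteration should land on roughly 0.6:
2.5 L⍟1000 0.3  # ≈ 0.6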
Let’s see if we’ve gotten the numbers right after learning. But then again, what is a number, that a man may know it[6]…
)r Setplot "scatter" ⋄ •Out¨ Plot´ m {ns←≠⊑𝕨 ⋄ (>𝕨)⋈˜∘‿ns⥊ns/𝕩} r
Footnotes:
[1] This post is not intended to be an introduction to the topic. There are excellent videos, lecture notes, papers, and books that do this better than I could. I will provide only the essential context to ensure the reading is self-contained.
[2] Programming by poking is the antithesis of this blog's ethos.
[3] You can try using UCI's Semeion Handwritten Digit dataset, like Tinn does.
[4] Isn't it fascinating how closely related and yet so different the logistic map and the logistic function are? The former can be thought of as a discrete version of \(\dot{f} = f(1 - f)\), but whereas this ODE has a boring sigmoid solution, the logistic map yields beautiful bifurcation diagrams.
[5] See, for instance, arXiv:2409.07468.