Blazing matrix products

Why not use BLAS?

Because I am interested in Brutalist array programming, and the absence of a high-performance native matrix product in BQN was a compelling opportunity for exploration1. Of course wrapping dgemm is always an option:

blasFFI ← (⊣•FFI·(1-˜+`׬)∘=⟜⊏⊸⊔⊢)´ ⟨
  "/lib/libcblas.so"∾˜•BQN 1⊑•Sh "nix-instantiate"‿"--eval-only"‿"--expr"‿"(import <nixpkgs> {}).blas.outPath"
  " & cblas_dgemm u32 u32 u32 i32 i32 i32 f64 *f64 i32 *f64 i32 f64 &f64 i32"
⟩
Dgemm ← {BlasFFI 101‿111‿111‿m‿n‿k‿1‿𝕨‿k‿𝕩‿n‿0‿(m‿n⥊0)∾⊢´m‿k‿·‿n←𝕨∾○≢𝕩}

In case you're wondering, this function has roughly the same overhead as NumPy's dot. For fun, let's challenge the idea that you should never write your own2 GEMM, but rather wrap BLAS.

Taming the cache

The first step towards higher performance is blocking, which optimizes cache access patterns. Using a straightforward square partitioning of the input matrices (without resorting to specialized assembly kernels, relying instead on the native BQN idiom), speed-ups of approximately sixfold are achievable for matrices that exceed the machine's cache size:

mat‿mbt ← ⟨⋈˜2⥊500, ⋈˜5⥊600⟩ /¨⊸⊔¨ ma‿mb ← •rand.Range⟜0¨1e3×⟨1‿1, 3‿3⟩
>⟨ma‿ma‿mat, mb‿mb‿mbt⟩ {𝕎˜•_timed𝕩}¨¨˜ <⟨Dgemm, +˝∘×⎉1‿∞, ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞⟩
┌─                                                            
╵         0.008988871        0.646108393 0.37081367400000004  
  0.16528436400000002 45.110128999000004   7.460860705000001  
                                                             ┘
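
The workhorse here is the split idiom /¨⊸⊔, which cuts a matrix into a nested array of blocks that ∾ later reassembles. Here is a toy sketch (the 4×4 matrix, the 2‿2 block sizes, and the names m and blocks are arbitrary illustration choices):

m ← 4‿4⥊↕16
blocks ← ⟨2‿2, 2‿2⟩ /¨⊸⊔ m   # shape 2‿2 array whose cells are the 2‿2 blocks of m
m ≡ ∾ blocks                 # 1: ∾ merges the blocks back into the original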

This performance gain requires only a modest 10-character leap in the code, from +˝∘×⎉1‿∞ to ∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞. Let's abstract this logic into reusable code. For instance, the function below3 squares a matrix 𝕩 (and, applied repeatedly, computes its powers) using blocks of size 𝕨, padding with zeros as needed. This operation is particularly useful in domains like graph theory or analyzing Markov chains:

MPB ← {𝕩≢⊸↑∾(+˝+˝∘×⎉1‿∞¨)⎉1‿∞˜𝕩(⥊⟜𝕨¨∘⊢/¨⊸⊔𝕨⊸×↑⊣)⌈𝕨÷˜≢𝕩}
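
As a quick sanity check, here is a small example; the 7‿7 shape, the block size 3, and the name a are arbitrary, and the entries are integers so that the blocked summation order cannot perturb the result:

a ← 7‿7 •rand.Range 10    # small integer test matrix
(3 MPB a) ≡ +˝∘×⎉1‿∞˜ a   # 1: blocked square matches the plain idiom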

An empirical (naïve, really) search for the optimal block size yields:

(300+50×↕8) {𝕨⊸MPB•_timed𝕩}¨ <3e3‿3e3 •rand.Range 0
⟨ 8.30279774 10.112563361000001 9.781014477000001 9.670085717000001 7.556631647000001 10.970897867000001 7.570657628 10.231164773000001 ⟩

One might hypothesize that further performance could be gained by applying this blocking principle recursively to accommodate multiple levels of cache. This technique, known as nested tiling, can also be implemented easily, though experimentation shows it yields no improvement:

MPB2 ← {∾∾×_p¨_p¨(_p←{+˝∘𝔽⎉1‿∞})˜𝕩{𝕩⊔˜/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩}´𝕨}
⟨10‿60, 4‿250, 3‿500⟩ {𝕨⊸MPB2•_timed𝕩}¨ <3e3‿3e3•rand.Range 0
⟨ 14.096323785000001 9.16644102 7.668334754000001 ⟩

Having seemingly reached the limits of performance gains from optimizing memory access patterns, the next logical step is to attack the problem along a different axis: reducing the algorithm's asymptotic complexity. Here is Strassen's algorithm, a little divide-and-conquer (and cache-oblivious) method, in its classic radix-2 form. It works for any square matrix, regardless of dimension: if the dimension is odd, we pad with an extra row and column and take back the original at the end.
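
For reference, the seven half-size products p1…p7 formed below are the classic Strassen combinations of the blocks a‿b‿c‿d of 𝕨 and e‿f‿g‿h of 𝕩:

  p1 = (a+d)(e+h)   p2 = (c+d)e       p3 = a(f-h)   p4 = d(g-e)
  p5 = (a+b)h       p6 = (c-a)(e+f)   p7 = (b-d)(g+h)

and the result is reassembled as ⟨p1+p4+p7-p5, p3+p5⟩≍⟨p2+p4, p1+p3+p6-p2⟩, trading eight recursive multiplications for seven and giving O(n^log₂7) ≈ O(n^2.81) arithmetic complexity.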

_strassen_ ← {𝕘≥≠𝕩 ? 𝕨𝔽𝕩;
  [a‿b,c‿d]‿[e‿f,g‿h] ← (2⊸⥊¨∘⊢/¨⊸⊔2⊸×↑⊣)¨⟜(⌈2÷˜≢¨)𝕨‿𝕩
  p1‿p2‿p3‿p4‿p5‿p6‿p7 ← 𝕊´¨⟨a+d,e+h⟩‿⟨c+d,e⟩‿⟨a,f-h⟩‿⟨d,g-e⟩‿⟨a+b,h⟩‿⟨c-a,e+f⟩‿⟨b-d,g+h⟩
  𝕩≢⊸↑∾⟨p1+p4+p7-p5, p3+p5⟩≍⟨p2+p4, p1+p3+p6-p2⟩
}

Let's go somewhat big for a solid 9x speed-up over the naive implementation:

⟨+˝∘×⎉1‿∞, 600⊸MPB, +˝∘×⎉1‿∞ _strassen_ 256, Dgemm _strassen_ 256, Dgemm⟩ {𝕎˜•_timed𝕩}¨ <4096‿4096•rand.Range 0
⟨ 121.21441014300001 23.299975492 13.688074838 2.1399266160000003 0.400549596 ⟩

To the best of my ability, this marks the limit of what can be achieved with a pure, single-threaded BQN implementation4.

Parallelism via MPI

To approach true bare-metal performance on par with BLAS/BLIS, we must leverage multiple cores. As BQN lacked native support for SPMD programming, I developed bindings for a small (but useful IMHO) subset of the Message Passing Interface (MPI), which are available on Codeberg.

With these bindings, I implemented a variant of Cannon's algorithm. In this version, each process generates its initial local matrices, though scattering and gathering could be added as needed. The implementation assumes a perfect square number of tasks (and errors out otherwise), forming a processor grid of shape ⋈˜√p, and pads matrices whose dimensions are not divisible by √p.
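
In textbook terms: on a q×q grid (q = √p), the initial skew aligns the blocks so that over the q multiply-and-shift steps the process at coordinates (x,y) sees every pair A[x,j], B[j,y] exactly once, in cyclic order. Summing its local products therefore yields

  C[x,y] = Σ_j A[x,j]·B[j,y],   j = 0 … q-1

which is exactly the (x,y) block of the global product, using only nearest-neighbour communication on the torus.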

⟨mpi⟩ ⇐ •Import "mpi.bqn"

mpi.Init@ ⋄ r‿s ← mpi{𝕗.Rank⋈𝕗.Size}⋈cw ← mpi.comm_world

# Processor element coordinates in 2D grid (r≡y+q×x)
!⌊⊸=q←√s ⋄ b ← q÷˜n ← 2⋆12 ⋄ x‿y ← q(|⋈˜·⌊÷˜)r

# Local matrices
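# (the (x‿y) blocks of the global test matrices +⌜˜↕n and -⌜˜↕n used in the final check)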
aml‿bml ← {(b×x𝕏y)+𝕏⌜˜↕b}¨+‿-

# Toroidal topology with periodic boundary conditions: L shifts aml one block left (aml←), U shifts bml one block up (bml↑)
L‿U ← {(cw⊸mpi.Sendrecv⊢<⊸∾𝕩˙)⌾⥊}¨⟨(q×x)+q|y(-⋈+)1 ⋄ y+q×q|x(-⋈+)1⟩

# Strassen algorithm with blocking for cache efficiency
_strassen_ ← {𝕘≥≠𝕩 ? 𝕨𝔽𝕩;
  [a‿b,c‿d]‿[e‿f,g‿h] ← (2⊸⥊¨∘⊢/¨⊸⊔2⊸×↑⊣)¨⟜(⌈2÷˜≢¨)𝕨‿𝕩
  p1‿p2‿p3‿p4‿p5‿p6‿p7 ← 𝕊´¨⟨a+d,e+h⟩‿⟨c+d,e⟩‿⟨a,f-h⟩‿⟨d,g-e⟩‿⟨a+b,h⟩‿⟨c-a,e+f⟩‿⟨b-d,g+h⟩
  𝕩≢⊸↑∾⟨p1+p4+p7-p5, p3+p5⟩≍⟨p2+p4, p1+p3+p6-p2⟩
}
MP ← +˝∘×⎉1‿∞ _strassen_ 256

# Skewing
aml L⍟x↩ ⋄ bml U⍟y↩

# Multiply and shift
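# (+´ accumulates one local block product per step, cyclically shifting aml and bml each time; after q steps they return to their skewed alignment)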
cml ← +´{𝕊: aml⊸MP⟜bml⟨aml L↩ ⋄ bml U↩⟩}¨↕q

# Test (not included in benchmark)
cmf ← (+⌜˜+˝∘×⎉1‿∞-⌜˜)↕n
!cml≡r⊑⥊cmf/¨⊸⊔˜⋈˜q⥊b

mpi.Finalize@

Which yields a speed-up of 31x:

hyperfine --runs 4 'bqn -e "+˝∘×⎉1‿∞˜ ⟨2⋆12,2⋆12⟩•rand.Range 1e5"' 'mpirun --mca btl self,sm -n 4 bqn -f cannon.bqn'
Benchmark 1: bqn -e "+˝∘×⎉1‿∞˜ ⟨2⋆12,2⋆12⟩•rand.Range 1e5"
  Time (mean ± σ):     108.965 s ±  1.897 s    [User: 107.824 s, System: 0.169 s]
  Range (min … max):   106.771 s … 110.747 s    4 runs
 
Benchmark 2: mpirun --mca btl self,sm -n 4 bqn -f cannon.bqn
  Time (mean ± σ):      3.510 s ±  0.012 s    [User: 11.990 s, System: 0.701 s]
  Range (min … max):    3.493 s …  3.521 s    4 runs
 
Summary
  mpirun --mca btl self,sm -n 4 bqn -f cannon.bqn ran
   31.04 ± 0.55 times faster than bqn -e "+˝∘×⎉1‿∞˜ ⟨2⋆12,2⋆12⟩•rand.Range 1e5"

This result is only possible5 thanks to a combination of SPMD parallelism and a cache-efficient matrix multiplication algorithm. We have improved significantly, going from +˝∘×⎉1‿∞ being 300 times slower than OpenBLAS's dgemm to only eight times slower. The obvious limitation of Cannon's algorithm is the need for a perfect square number of tasks. But if your computer supports SMT, you can push the problem size further with the option --use-hwthread-cpus. Careful with the memory usage, though, as it might bring your system to a crawl if you push it too far.

Footnotes:

1

While the current idiom guarantees numerical accuracy, it is hundreds of times slower than BLAS for large matrices.

2

See this post or this other post for surprisingly accessible ways to replicate what OpenBLAS does without spending your life in assembly.

3

Here I could have used a fancier but slower formulation with Under, 𝔽˜⌾((/¨⥊⟜𝕨¨⌈𝕨÷˜≢𝕩)⊸⊔). Or even the memory-hungry outer-product formulation +˝⍉∘⊢(+˝∘×⎉1‿∞¨)⌜˘⊢, which is only marginally slower.

4

For deeper insight into blocked matrix multiplication algorithms, I recommend this JAX post, the SaC paper on rank polymorphic blocking, and arXiv:1605.01078 for the high-performance Strassen implementation.

5

The approach to data locality and parallelism used here is rooted in the principles from Golub and Van Loan's Matrix Computations, an essential reference in the field of numerical linear algebra. Particularly relevant for this post is section 1.6.