Sieving primes at the speed of C

Motivation

When solving Project Euler problems, having an efficient prime sieve becomes indispensable sooner or later. Naturally, I wanted to code a vectorized sieve in BQN for my solutions, but my early attempts were relatively slow. After getting some help, I arrived at the following approaches1:

S1 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1){β†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)βˆ˜β‰ βŠΈ{0¨⌾(π•¨βŠΈβŠ)𝕩}𝕩}βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S2 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1){𝕩>(≠𝕩)↑/βΌβ†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)≠𝕩}βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S3 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1)(⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1)βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S4 ← 2βŠΈβ†“{
  L ← {β†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)βˆ˜β‰ βŠΈ{0¨⌾(π•¨βŠΈβŠ)𝕩}𝕩}
  M ← ⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1
  𝔽/(𝕩β₯Š1)β‰€βŸœ80β—ΆLβ€ΏMβŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©
}
S1β€ΏS2β€ΏS3β€ΏS4 (1=Β·β‰ βˆ˜β·{π•Žπ•©}¨⟜⊏)β—ΆβŸ¨"Not comparable!", {π•Žβ€’_timed𝕩}⌜⟩ 10⋆3β€Ώ5β€Ώ7β€Ώ8β€Ώ9
β”Œβ”€                                                                                               
β•΅             7.9856eΒ―5 0.001037675  0.06758383200000001         0.700785714         8.09195083  
              9.6947eΒ―5 0.000645739          0.065917548  2.9671749550000004  93.39382853900001  
              1.7913eΒ―5   6.1652eΒ―5          0.031183234         2.819513972 100.19372699300001  
  1.7245000000000002eΒ―5  0.00031342 0.008511399000000001 0.17044747400000002 4.7954312230000005  
                                                                                                β”˜

The best-performing sieve, S4, employs a heuristic to determine how multiples are marked: either by directly zeroing the corresponding indices for large arguments (L), or by generating a mask and multiplying it element-wise with the current array during the fold (M). In the table above, rows correspond to S1 through S4 and columns to the arguments 10Β³, 10⁡, 10⁷, 10⁸ and 10⁹; the timings are in seconds. The heuristic's logic is straightforward: for small 𝕨 there are many multiples to mark, so the index list handled by 0¨⌾(m⊸⊏) grows long and its scattered writes become relatively expensive, while building a full-length mask and multiplying it element-wise stays SIMD-friendly.
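
To make the two strategies concrete outside of BQN, here is a rough NumPy sketch; the helper names mark_by_index and mark_by_mask are mine, and the functions only illustrate the idea rather than translate L and M:

import numpy as np

def mark_by_index(sieve, p):
    # The L idea: zero the multiples of p (from p*p on) directly,
    # a scattered write into the sieve, much like 0¨⌾(m⊸⊏).
    sieve[p * p :: p] = 0
    return sieve

def mark_by_mask(sieve, p):
    # The M idea: build a full-length 0/1 mask that is 0 at the multiples
    # of p (from p*p on) and combine it element-wise with the sieve.
    mask = np.ones_like(sieve)
    mask[p * p :: p] = 0
    return sieve * mask

# Both strategies mark the same composites:
s = np.ones(30, dtype=np.uint8)
assert np.array_equal(mark_by_index(s.copy(), 3), mark_by_mask(s.copy(), 3))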

Out of curiosity, I decided to compare S4's performance with an equivalent NumPy program that used only the direct indexing approach (analogous to L), without the heuristic. At that time1, the BQN version was about 2.5 times slower. This sparked a productive discussion in the Matrix room, which eventually led dzaima to speed up the CBQN implementation of the under select 0¨⌾(m⊸⊏). Combined with algorithmic improvements suggested by Marshall2, based on a publication by Roger Hui3, this resulted in a sieve that computes primes up to one billion in just 1.2 seconds on the hardware detailed below:

CPU specs
inxi -C -c
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64 type: MT MCP cache:
    L2: 8 MiB
  Speed (MHz): avg: 1100 min/max: 400/5134 cores: 1: 1100 2: 1100 3: 1100 4: 1100 5: 1100 6: 1100
    7: 1100 8: 1100 9: 1100 10: 1100 11: 1100 12: 1100 13: 1100 14: 1100 15: 1100 16: 1100

A microbenchmark

Ultimately, the optimized BQN code proved to be about six times as fast as a C reference implementation for large arguments! While the algorithms differ, I argue that the C version's simple nested loops provide a relevant baseline4. Below you will find the corresponding programs; the C and Python versions are the standard Sieve of Eratosthenes without the more elaborate optimizations of the final BQN sieve5:

NumPy

Straightforward, delivering quite good performance, comparable to C!

import numpy as np

def S(n):
    s = np.ones(n + 1, bool)            # s[i]: is i still considered prime?
    s[:2] = 0                           # 0 and 1 are not prime
    for i in range(2, int(n**.5) + 1):
        if s[i]:
            s[i*i::i] = 0               # cross off multiples of i, starting at iΒ²
    return np.flatnonzero(s)            # the indices left standing are the primes
%timeit S(1_000_000_000)
7.28 s Β± 77.4 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

Python

After discussing this with other BQNators, I was motivated to implement a pure Python version as well. It adds a minor optimization over a naive list-based approach: an odd-only sieve stored in a bytearray. I often find myself defending Python against claims that it is terribly slow, and I was curious how it would perform in this case:

def S(n):
  sieve = bytearray([1]) * ((n + 1) // 2)   # index i stands for the odd number 2*i + 1
  sieve[0] = 0                              # 1 is not prime
  for i in range(1, int(n**0.5) // 2 + 1):
    if sieve[i]:
      p = 2 * i + 1
      start = (p * p) // 2                  # index of pΒ², the first multiple to cross off
      sieve[start::p] = bytearray(len(sieve[start::p]))   # zero every p-th odd number from pΒ² on
  return [2] + [2 * i + 1 for i, is_prime in enumerate(sieve) if is_prime and i > 0]
%timeit S(1_000_000_000)
23.5 s Β± 440 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

So yes, it performed as one might expect from pure Python on such a task: considerably slower than the C/NumPy versions, and also slower than my initial BQN attempts1.

C

This program closely follows the NumPy one:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void s(long long n) {
  if (n < 2) return;
  bool *is_prime = malloc((n + 1) * sizeof *is_prime);
  if (!is_prime) return;                          /* allocation failed */
  memset(is_prime, true, (n + 1) * sizeof *is_prime);
  is_prime[0] = is_prime[1] = false;
  for (long long p = 2; p * p <= n; p++)          /* candidate primes up to √n */
    if (is_prime[p])
      for (long long i = p * p; i <= n; i += p)   /* cross off multiples of p */
        is_prime[i] = false;
  free(is_prime);
}

int main(int argc, char *argv[]) {
  if (argc < 2) return EXIT_FAILURE;              /* expects the limit as its only argument */
  char *endptr;
  long long n = strtoll(argv[1], &endptr, 10);
  s(n);
  return EXIT_SUCCESS;
}

Compiled with optimization:

gcc sieve.c -O3 -o sieve && hyperfine --runs 7 './sieve 1000000000'
Benchmark 1: ./sieve 1000000000
  Time (mean Β± Οƒ):      7.358 s Β±  0.160 s    [User: 6.975 s, System: 0.343 s]
  Range (min … max):    7.195 s …  7.629 s    7 runs

BQN can be faster than C!

Yeah, right, as if that were possible! The function below again employs a heuristic, this time switching the core sieving strategy based on whether the current candidate 𝕨 is less than or equal to 20. Furthermore, the implemented algorithm, particularly the L function, isn't an exact one-to-one match for the straightforward two-loop structure of the C code. Nevertheless, we achieve this wonderful result:

S ← 2βŠΈβ†“{
  L ← 𝔽{0¨⌾((𝕨×𝔽/(βŒˆπ•¨Γ·Λœβ‰ )βŠΈβ†‘π•©)⊸⊏)𝕩}
  M ← ⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1
  𝔽/(𝕩β₯Š1)β‰€βŸœ20β—ΆLβ€ΏMβŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©
}
7 Sβ€’_timed 1_000_000_000
1.2084570474285714

The key to this speed-up lies in the L function, which zeroes elements of the sieve mask. For a candidate p and a mask of length n, it computes a count k β† ⌈n÷p and inspects only the first k mask elements: the indices j (with 2 ≀ j < k) that are still marked are multiplied by p, and exactly those positions are zeroed, so every write pΓ—j stays within the mask. As a result, each 0¨⌾(m⊸⊏) handles fewer indices, and the filtering pass with / scans only a prefix of length k rather than the whole sieve, skipping multiples of indices that earlier candidates have already crossed off. Oh, and rest assured, it doesn't yield Bertelsen's Number:

β‰ S 1_000_000_000
50847534
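
To spell out the trick inside L in more familiar terms, here is a hedged NumPy sketch; the helper name zero_multiples_like_L is mine, and it approximates the idea rather than translating the BQN:

import numpy as np

def zero_multiples_like_L(sieve, p):
    # Only indices j < ⌈n÷p⌉ can produce an in-range multiple p*j, so inspect
    # just that prefix of the mask; multiples of the still-marked j are then
    # zeroed in one scattered write, skipping composites that earlier
    # candidates have already crossed off.
    n = len(sieve)
    k = -(-n // p)                   # ⌈n÷p⌉
    j = np.flatnonzero(sieve[:k])    # surviving indices in the prefix
    j = j[j >= 2]                    # drop positions 0 and 1 (mirrors the 𝔽, i.e. 2βŠΈβ†“, inside L)
    sieve[p * j] = 0                 # every p*j is < n by construction
    return sieve

# Tiny usage example: applied for p = 2, 3, ... up to √n, it reproduces the primes.
n = 100
s = np.ones(n, dtype=np.uint8)
for p in range(2, int(n ** 0.5) + 1):
    if s[p]:                         # the usual guard from the C/NumPy versions
        zero_multiples_like_L(s, p)
print(np.flatnonzero(s)[2:])         # drop 0 and 1, like the outer 2βŠΈβ†“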

Footnotes:

1

This benchmark has been updated to the latest CBQN. The original timing for a billion was 17.11 seconds for the best case (the heuristic-based function S4).

2

For an exceptionally performant sieve implementation, see Marshall's bqn-libs, which incorporates further elegant algorithmic optimizations. The one presented here aims instead for greater conciseness.

3

A History of APL in 50 Functions, Roger K.W. Hui (2016). From jsoftware's papers collection.

4

Again, the C version shown provides a clear performance baseline. Achieving superior speed in C is undoubtedly possible, but usually involves a steep increase in code complexity.

5

Sorry for the number of collapsibles; my focus for this blog is primarily on displaying BQN code.