Sieving primes at the speed of C

Motivation

When solving Project Euler problems, having an efficient prime sieve becomes indispensable sooner or later. Naturally, I wanted to code a vectorized sieve in BQN for my solutions, but my early attempts were relatively slow. After getting some help, I arrived at the following approaches1:

S1 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1){β†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)βˆ˜β‰ βŠΈ{0¨⌾(π•¨βŠΈβŠ)𝕩}𝕩}βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S2 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1){𝕩>(≠𝕩)↑/βΌβ†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)≠𝕩}βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S3 ← 2βŠΈβ†“{𝔽/(𝕩β₯Š1)(⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1)βŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©}
S4 ← 2βŠΈβ†“{
  L ← {β†•βˆ˜βŒˆβŒΎ((π•¨Γ—ΛœβŠΈ+π•¨Γ—βŠ’)⁼)βˆ˜β‰ βŠΈ{0¨⌾(π•¨βŠΈβŠ)𝕩}𝕩}
  M ← ⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1
  𝔽/(𝕩β₯Š1)β‰€βŸœ80β—ΆLβ€ΏMβŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©
}
S1β€ΏS2β€ΏS3β€ΏS4 (1=Β·β‰ βˆ˜β·{π•Žπ•©}¨⟜⊏)β—ΆβŸ¨"Not comparable!", {π•Žβ€’_timed𝕩}⌜⟩ 10⋆3β€Ώ5β€Ώ7β€Ώ8β€Ώ9
β”Œβ”€                                                                                               
β•΅             7.9856eΒ―5 0.001037675  0.06758383200000001         0.700785714         8.09195083  
              9.6947eΒ―5 0.000645739          0.065917548  2.9671749550000004  93.39382853900001  
              1.7913eΒ―5   6.1652eΒ―5          0.031183234         2.819513972 100.19372699300001  
  1.7245000000000002eΒ―5  0.00031342 0.008511399000000001 0.17044747400000002 4.7954312230000005  
                                                                                                β”˜

The best-performing sieve, S4, employs a heuristic to determine how multiples are marked: either by directly zeroing the corresponding indices for large arguments (L), or by generating a mask and multiplying it element-wise with the current array during the fold (M). In the table above, rows correspond to S1 through S4 and columns to the arguments 10Β³, 10⁡, 10⁷, 10⁸ and 10⁹; the timings are in seconds. The heuristic's logic is straightforward: for small 𝕨 there are many multiples to mark, so the index list handled by 0¨⌾(m⊸⊏) grows long and its scattered writes become relatively expensive, while building a full-length mask and multiplying it element-wise stays SIMD-friendly.
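
To make the two strategies concrete outside of BQN, here is a rough NumPy sketch; the helper names mark_by_index and mark_by_mask are mine, and the functions only illustrate the idea rather than translate L and M:

import numpy as np

def mark_by_index(sieve, p):
    # The L idea: zero the multiples of p (from p*p on) directly,
    # a scattered write into the sieve, much like 0¨⌾(m⊸⊏).
    sieve[p * p :: p] = 0
    return sieve

def mark_by_mask(sieve, p):
    # The M idea: build a full-length 0/1 mask that is 0 at the multiples
    # of p (from p*p on) and combine it element-wise with the sieve.
    mask = np.ones_like(sieve)
    mask[p * p :: p] = 0
    return sieve * mask

# Both strategies mark the same composites:
s = np.ones(30, dtype=np.uint8)
assert np.array_equal(mark_by_index(s.copy(), 3), mark_by_mask(s.copy(), 3))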

Out of curiosity, I decided to compare S4's performance with an equivalent NumPy program that used only the direct indexing approach (analogous to L), without the heuristic. At that time1, the BQN version was about 2.5 times slower. This sparked a productive discussion in the Matrix room, which eventually led dzaima to speed up the CBQN implementation of the under select 0¨⌾(m⊸⊏). Combined with algorithmic improvements suggested by Marshall2, based on a publication by Roger Hui3, this resulted in a sieve that computes primes up to one billion in just 1.2 seconds on the hardware detailed below:

CPU specs
inxi -C -c
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64 type: MT MCP cache:
    L2: 8 MiB
  Speed (MHz): avg: 1100 min/max: 400/5134 cores: 1: 1100 2: 1100 3: 1100 4: 1100 5: 1100 6: 1100
    7: 1100 8: 1100 9: 1100 10: 1100 11: 1100 12: 1100 13: 1100 14: 1100 15: 1100 16: 1100

A microbenchmark

Ultimately, the optimized BQN code proved to be about six times as fast as a C reference implementation for large arguments! While the algorithms differ, I argue that the C version's simple nested loops provide a relevant baseline4. Below you will find the corresponding programs; the C and Python versions are the standard Sieve of Eratosthenes without the more elaborate optimizations of the final BQN sieve5:

NumPy

Straightforward, delivering quite good performance, comparable to C!

import numpy as np

def S(n):
    s = np.ones(n + 1, bool)            # s[i]: is i still considered prime?
    s[:2] = 0                           # 0 and 1 are not prime
    for i in range(2, int(n**.5) + 1):
        if s[i]:
            s[i*i::i] = 0               # cross off multiples of i, starting at iΒ²
    return np.flatnonzero(s)            # the indices left standing are the primes
%timeit S(1_000_000_000)
7.28 s Β± 77.4 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

Python

After discussing this with other BQNators, I was motivated to implement a pure Python version as well. It adds a minor optimization over a naive list-based approach: an odd-only sieve stored in a bytearray. I often find myself defending Python against claims that it is terribly slow, and I was curious how it would perform in this case:

def S(n):
  sieve = bytearray([1]) * ((n + 1) // 2)   # index i stands for the odd number 2*i + 1
  sieve[0] = 0                              # 1 is not prime
  for i in range(1, int(n**0.5) // 2 + 1):
    if sieve[i]:
      p = 2 * i + 1
      start = (p * p) // 2                  # index of pΒ², the first multiple to cross off
      sieve[start::p] = bytearray(len(sieve[start::p]))   # zero every p-th odd number from pΒ² on
  return [2] + [2 * i + 1 for i, is_prime in enumerate(sieve) if is_prime and i > 0]
%timeit S(1_000_000_000)
23.5 s Β± 440 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

So yes, it performed as one might expect from pure Python on such a task: considerably slower than the C/NumPy versions, and also slower than my initial BQN attempts1.

C

This program closely follows the NumPy one:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void s(long long n) {
  if (n < 2) return;
  bool *is_prime = malloc((n + 1) * sizeof *is_prime);
  if (!is_prime) return;                          /* allocation failed */
  memset(is_prime, true, (n + 1) * sizeof *is_prime);
  is_prime[0] = is_prime[1] = false;
  for (long long p = 2; p * p <= n; p++)          /* candidate primes up to √n */
    if (is_prime[p])
      for (long long i = p * p; i <= n; i += p)   /* cross off multiples of p */
        is_prime[i] = false;
  free(is_prime);
}

int main(int argc, char *argv[]) {
  if (argc < 2) return EXIT_FAILURE;              /* expects the limit as its only argument */
  char *endptr;
  long long n = strtoll(argv[1], &endptr, 10);
  s(n);
  return EXIT_SUCCESS;
}

Compiled with optimization:

gcc sieve.c -O3 -o sieve && hyperfine --runs 7 './sieve 1000000000'
Benchmark 1: ./sieve 1000000000
  Time (mean Β± Οƒ):      7.358 s Β±  0.160 s    [User: 6.975 s, System: 0.343 s]
  Range (min … max):    7.195 s …  7.629 s    7 runs

BQN can be faster than C!

Yeah, right, as if that were possible! The function below again employs a heuristic, this time switching the core sieving strategy based on whether the current candidate 𝕨 is less than or equal to 20. Furthermore, the implemented algorithm, particularly the L function, isn't an exact one-to-one match for the straightforward two-loop structure of the C code. Nevertheless, we achieve this wonderful result:

S ← 2βŠΈβ†“{
  L ← 𝔽{0¨⌾((𝕨×𝔽/(βŒˆπ•¨Γ·Λœβ‰ )βŠΈβ†‘π•©)⊸⊏)𝕩}
  M ← ⊒>Γ—Λœβˆ˜βŠ£βŠΈβ₯ŠβŸœ0Β»β‰ βˆ˜βŠ’β₯Šβ†‘βŸœ1
  𝔽/(𝕩β₯Š1)β‰€βŸœ20β—ΆLβ€ΏMβŸβŠ‘Β΄βŒ½π”½β†•βŒˆβˆšπ•©
}
7 Sβ€’_timed 1_000_000_000
1.2084570474285714

The key to this speed-up lies in the L function, which zeroes elements of the sieve mask. For a candidate p and a mask of length n, it computes a count k β† ⌈n÷p and inspects only the first k mask elements: the indices j (with 2 ≀ j < k) that are still marked are multiplied by p, and exactly those positions are zeroed, so every write pΓ—j stays within the mask. As a result, each 0¨⌾(m⊸⊏) handles fewer indices, and the filtering pass with / scans only a prefix of length k rather than the whole sieve, skipping multiples of indices that earlier candidates have already crossed off. Oh, and rest assured, it doesn't yield Bertelsen's Number:

β‰ S 1_000_000_000
50847534
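
To spell out the trick inside L in more familiar terms, here is a hedged NumPy sketch; the helper name zero_multiples_like_L is mine, and it approximates the idea rather than translating the BQN:

import numpy as np

def zero_multiples_like_L(sieve, p):
    # Only indices j < ⌈n÷p⌉ can produce an in-range multiple p*j, so inspect
    # just that prefix of the mask; multiples of the still-marked j are then
    # zeroed in one scattered write, skipping composites that earlier
    # candidates have already crossed off.
    n = len(sieve)
    k = -(-n // p)                   # ⌈n÷p⌉
    j = np.flatnonzero(sieve[:k])    # surviving indices in the prefix
    j = j[j >= 2]                    # drop positions 0 and 1 (mirrors the 𝔽, i.e. 2βŠΈβ†“, inside L)
    sieve[p * j] = 0                 # every p*j is < n by construction
    return sieve

# Tiny usage example: applied for p = 2, 3, ... up to √n, it reproduces the primes.
n = 100
s = np.ones(n, dtype=np.uint8)
for p in range(2, int(n ** 0.5) + 1):
    if s[p]:                         # the usual guard from the C/NumPy versions
        zero_multiples_like_L(s, p)
print(np.flatnonzero(s)[2:])         # drop 0 and 1, like the outer 2βŠΈβ†“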

Footnotes:

1

This benchmark has been updated to the latest CBQN. The original timing for a billion was 17.11 seconds for the best case (the heuristic-based function S4).

2

For an exceptionally performant sieve implementation, see Marshall's bqn-libs, which incorporates further elegant algorithmic optimizations. The one presented here aims instead for greater conciseness.

3

A History of APL in 50 Functions, Roger K.W. Hui (2016). From jsoftware's papers collection.

4

Again, the C version shown provides a clear performance baseline. Achieving superior speed in C is undoubtedly possible, but usually involves a steep increase in code complexity.

5

Sorry for the number of collapsibles; my focus for this blog is primarily on displaying BQN code.