Sieving primes at the speed of C
Motivation
When solving Project Euler problems, an efficient prime sieve sooner or later becomes indispensable. Naturally, I wanted to write a vectorized sieve in BQN for my solutions, but my early attempts were relatively slow. After getting some help, I arrived at the following approaches[1]:
```bqn
S1 ← 2⊸↓{𝔽/(𝕩⥊1){↕∘⌈⌾((𝕨×˜⊸+𝕨×⊢)⁼)∘≠⊸{0¨⌾(𝕨⊸⊏)𝕩}𝕩}⍟⊑´⌽𝔽↕⌈√𝕩}
S2 ← 2⊸↓{𝔽/(𝕩⥊1){𝕩>(≠𝕩)↑/⁼↕∘⌈⌾((𝕨×˜⊸+𝕨×⊢)⁼)≠𝕩}⍟⊑´⌽𝔽↕⌈√𝕩}
S3 ← 2⊸↓{𝔽/(𝕩⥊1)(⊢>×˜∘⊣⊸⥊⟜0»≠∘⊢⥊↑⟜1)⍟⊑´⌽𝔽↕⌈√𝕩}
S4 ← 2⊸↓{
  L ← {↕∘⌈⌾((𝕨×˜⊸+𝕨×⊢)⁼)∘≠⊸{0¨⌾(𝕨⊸⊏)𝕩}𝕩}
  M ← ⊢>×˜∘⊣⊸⥊⟜0»≠∘⊢⥊↑⟜1
  𝔽/(𝕩⥊1)≤⟜80◶L‿M⍟⊑´⌽𝔽↕⌈√𝕩
}
S1‿S2‿S3‿S4 (1=·≠∘⍷{𝕎𝕩}¨⟜⊑)◶⟨"Not comparable!", {𝕎•_timed𝕩}⌜⟩ 10⋆3‿5‿7‿8‿9
```
Each row gives the •_timed results (in seconds) for one sieve (S1 through S4, top to bottom) at the arguments 10³, 10⁵, 10⁷, 10⁸ and 10⁹:

```
┌─
╵ 7.9856e¯5             0.001037675 0.06758383200000001  0.700785714         8.09195083
  9.6947e¯5             0.000645739 0.065917548          2.9671749550000004  93.39382853900001
  1.7913e¯5             6.1652e¯5   0.031183234          2.819513972         100.19372699300001
  1.7245000000000002e¯5 0.00031342  0.008511399000000001 0.17044747400000002 4.7954312230000005
                                                                                               ┘
```
The best-performing sieve employs a heuristic to determine how multiples are marked: either by directly zeroing the corresponding indices for large arguments (L) or by generating a mask and multiplying it element-wise with the current array during folding (M). The heuristic logic is straightforward: for small 𝕨, the arrays handled by 0¨⌾(m⊸⊏) become longer, making multiplication with the mask more SIMD-friendly.
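To make the trade-off concrete outside of BQN, here is a rough NumPy analogue of the two marking strategies for a single divisor p. This is my own sketch, not the post's code, and mark_indices and mark_mask are hypothetical names:

```python
import numpy as np

def mark_indices(s, p):
    # L-style: zero the multiples of p (from p*p on) directly by index
    s[p * p::p] = False
    return s

def mark_mask(s, p):
    # M-style: build a full-length boolean mask of those multiples
    # and fold it into the sieve with one element-wise operation
    idx = np.arange(len(s))
    multiples = (idx >= p * p) & (idx % p == 0)
    return s & ~multiples

n = 100
a = np.ones(n, dtype=bool)
b = np.ones(n, dtype=bool)
for p in (2, 3, 5, 7):
    a = mark_indices(a, p)
    b = mark_mask(b, p)
assert (a == b).all()  # both strategies leave the same sieve
```

For small p the list of multiples is long, so one vectorized mask operation tends to beat many scattered writes; for large p the full-length mask is mostly wasted work. That is the trade-off the heuristic encodes.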
Out of curiosity, I decided to compare S4's performance with an equivalent NumPy program that used only the direct indexing approach (analogous to L), without the heuristic. At that time[1], the BQN version was about 2.5 times slower. This sparked a productive discussion in the Matrix room, which eventually led dzaima to speed up the CBQN implementation of the under-select 0¨⌾(m⊸⊏).
Combined with algorithmic improvements suggested by Marshall[2], based on a publication by Roger Hui[3], this resulted in a sieve that computes the primes up to one billion in just 1.2 seconds on the hardware detailed below:
CPU specs
inxi -C -c

```
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics bits: 64
    type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 1100 min/max: 400/5134 cores: 1: 1100 2: 1100 3: 1100
    4: 1100 5: 1100 6: 1100 7: 1100 8: 1100 9: 1100 10: 1100 11: 1100
    12: 1100 13: 1100 14: 1100 15: 1100 16: 1100
```
A microbenchmark
Ultimately, the optimized BQN code proved to be about six times as fast as a C reference implementation for large arguments! While the algorithms differ, I argue that the C version's simple nested loops provide a relevant baseline[4]. Below you will find the corresponding programs; the C and Python versions are the standard Sieve of Eratosthenes without the more elaborate optimizations of the final BQN sieve[5]:
NumPy
Straightforward, delivering quite good performance, comparable to C!
```python
import numpy as np

def S(n):
    s = np.ones(n + 1, bool)
    s[:2] = 0
    for i in range(2, int(n**.5) + 1):
        if s[i]: s[i*i::i] = 0
    return np.flatnonzero(s)

%timeit S(1_000_000_000)
```
7.28 s ± 77.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Python
After discussing this with other BQNators, I was motivated to implement a pure Python version of the same algorithm. I introduced a minor optimization over a naive list-based approach by using a bytearray that tracks odd numbers only. I often find myself defending Python against claims that it is terribly slow, and I was curious how it would perform in this case:
```python
def S(n):
    sieve = bytearray([1]) * ((n + 1) // 2)
    sieve[0] = 0
    for i in range(1, (int(n**0.5) // 2) + 1):
        if sieve[i]:
            p = 2 * i + 1
            start = (p * p) // 2
            sieve[start::p] = bytearray(len(sieve[start::p]))
    return [2] + [2 * i + 1 for i, is_prime in enumerate(sieve) if is_prime and i > 0]

%timeit S(1_000_000_000)
```
23.5 s ± 440 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
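As a quick sanity check on the indexing scheme (my own sketch, not part of the benchmark): slot i of the half-size sieve stands for the odd number 2i+1, and consecutive odd multiples of p sit exactly p slots apart, which is what makes the sieve[start::p] slice assignment correct:

```python
p = 7                    # an odd prime
start = (p * p) // 2     # slot of p², as computed in S above
# odd multiples of p are p², p²+2p, p²+4p, …; each step of 2p in value
# moves exactly p slots in the half-size sieve
slots = [(p * p + 2 * k * p) // 2 for k in range(4)]
print(slots)  # → [24, 31, 38, 45]; consecutive slots differ by p = 7
```

The same arithmetic justifies start = (p * p) // 2 in the listing above.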
So yes, it performed as one might expect from pure Python on such a task: considerably slower than the C/NumPy versions, and also slower than my initial BQN attempts[1].
C
This program closely follows the NumPy one:
```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void s(long long n) {
  if (n < 2) return;
  bool *is_prime = malloc((n + 1) * sizeof *is_prime);
  if (!is_prime) return;  /* allocation can fail for large n */
  memset(is_prime, true, (n + 1) * sizeof *is_prime);
  is_prime[0] = is_prime[1] = false;
  for (long long p = 2; p * p <= n; p++)
    if (is_prime[p])
      for (long long i = p * p; i <= n; i += p)
        is_prime[i] = false;
  free(is_prime);
}

int main(int argc, char *argv[]) {
  if (argc < 2) return EXIT_FAILURE;  /* guard against a missing argument */
  char *endptr;
  long long n = strtoll(argv[1], &endptr, 10);
  s(n);
  return EXIT_SUCCESS;
}
```
Compiled with optimization:
```shell
gcc sieve.c -O3 -o sieve && hyperfine --runs 7 './sieve 1000000000'
```
```
Benchmark 1: ./sieve 1000000000
  Time (mean ± σ):      7.358 s ±  0.160 s    [User: 6.975 s, System: 0.343 s]
  Range (min … max):    7.195 s …  7.629 s    7 runs
```
BQN can be faster than C!
Yeah, right, as if that were possible! The function below again employs a heuristic, this time switching the core sieving strategy based on whether the prime 𝕨 is less than or equal to 20. Furthermore, the implemented algorithm, particularly the L function, isn't an exact one-to-one match with the straightforward two-loop structure of the C code. Nevertheless, we achieve this wonderful result:
```bqn
S ← 2⊸↓{
  L ← 𝔽{0¨⌾((𝕨×𝔽/(⌈𝕨÷˜≠)⊸↑𝕩)⊸⊏)𝕩}
  M ← ⊢>×˜∘⊣⊸⥊⟜0»≠∘⊢⥊↑⟜1
  𝔽/(𝕩⥊1)≤⟜20◶L‿M⍟⊑´⌽𝔽↕⌈√𝕩
}
7 S•_timed 1_000_000_000
```
1.2084570474285714
The key to this speed-up lies in the L function, which zeros elements in the sieve mask. This optimized version determines a count k ← ⌈n÷p and then acts on at most k mask elements, so that every zeroed index p×m stays below n. Fewer zeroing operations 0¨⌾(m⊸⊏) are needed, as subsequent filtering passes with / process a reduced number of indices. Oh, and rest assured, it doesn't yield Bertelsen's Number:
```bqn
≠ S 1_000_000_000
```
50847534
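Translated into plain Python (a sketch of the idea only, not of BQN's vectorized evaluation; sieve_cofactor is a hypothetical name), L's trick for a divisor p is to snapshot the surviving cofactors among the first k = ⌈n÷p⌉ mask entries and zero only the products p×m:

```python
import math

def sieve_cofactor(n):
    # mask[i] == 1 means i is still a prime candidate
    mask = [1] * n
    for p in range(2, math.isqrt(n - 1) + 1):
        if not mask[p]:        # composite divisor: skip the pass entirely
            continue
        k = math.ceil(n / p)   # k = ⌈n÷p⌉, so p*m < n for every m < k
        # snapshot the survivors among the first k entries before zeroing,
        # then zero only their multiples by p
        cofactors = [m for m in range(2, k) if mask[m]]
        for m in cofactors:
            mask[p * m] = 0
    return [i for i in range(2, n) if mask[i]]

print(len(sieve_cofactor(100)))  # → 25 primes below 100
```

Because only surviving cofactors are kept, each pass touches far fewer indices than the plain two-loop sieve does.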
Footnotes:
This benchmark has been updated to the latest CBQN. The original timing for a billion was 17.11 seconds for the best case (the heuristic-based function S4).
For an exceptionally performant sieve implementation, see Marshall's bqn-libs, which incorporates further elegant algorithmic optimizations. The one presented here aims instead for greater conciseness.
Again, the C version shown provides a clear performance baseline. Achieving superior speed in C is undoubtedly possible, but usually involves a steep increase in code complexity.