
Tinkering with fizz buzz and concurrency

Rust, Performance, Concurrency · 2 min read

Let's get a baseline of the performance we could expect with a naive implementation of fizz buzz in Python.

```python
i = 1
while True:
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
    i += 1
```

Using `pv` and redirecting to `/dev/null`, we reach a whopping 30 MiB/s.

For the upper bound, we'll use the GNU utility `yes`, known for its high throughput: it just prints lines of `y`.

```shell
>> yes | pv --average-rate >/dev/null
[10.6GiB/s]
>> cat /dev/zero | pv --average-rate >/dev/null
[9.11GiB/s]
```

Nice.

Let's try to reproduce this command using Rust.

```rust
use std::io::{self, stdout, Write};

fn main() -> Result<(), io::Error> {
    let mut writer = stdout().lock();
    loop {
        writer.write(b"y\n")?;
    }
}
```

And we reach a throughput of ... 10 MiB/s, 3 orders of magnitude slower.

How come Rust is slower than Python? Did I just waste my time learni-

No, calm down. Running this code under strace shows that we invoke the write syscall on every iteration of the loop. Crossing into the kernel is considerable overhead and is our bottleneck so far.

The solution to this problem is buffering, which is what Python does by default.
We can achieve the same in Rust by wrapping our stdout lock in a BufWriter.
It writes to an in-memory buffer (a Vec<u8>) first and flushes to the underlying writer only when necessary.

```diff
- let mut writer = stdout().lock();
+ let mut writer = BufWriter::new(stdout().lock());
```

1.67 GiB/s: great, but still far from the 10 GiB/s of the GNU utility.
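To see why buffering helps, we can count how many calls actually reach the underlying writer, a rough stand-in for counting write syscalls with strace. This is a sketch using a hypothetical `CountingWriter` sink rather than the real stdout:

```rust
use std::io::{BufWriter, Write};

// Stand-in for stdout that counts how many times `write` is
// actually called, the way strace counts write syscalls.
struct CountingWriter {
    calls: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.calls += 1;
        Ok(buf.len())
    }
    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

// Push `n` two-byte payloads through a BufWriter and report how
// many calls reached the underlying writer.
fn count_underlying_writes(n: usize) -> std::io::Result<usize> {
    let mut writer = BufWriter::with_capacity(8192, CountingWriter { calls: 0 });
    for _ in 0..n {
        writer.write_all(b"y\n")?;
    }
    // into_inner flushes whatever is left in the buffer
    let inner = writer.into_inner().map_err(|e| e.into_error())?;
    Ok(inner.calls)
}

fn main() -> std::io::Result<()> {
    // 100_000 writes of 2 bytes coalesce into roughly 25 flushes of 8 KiB,
    // instead of 100_000 calls without the BufWriter.
    println!("underlying writes: {}", count_underlying_writes(100_000)?);
    Ok(())
}
```

Without the BufWriter, every 2-byte `write_all` would hit the sink (and, on real stdout, the kernel) directly.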

The bottleneck now is the payload. Each call writes 2 bytes of data which -- uhh isn't great.
The CPU moves memory around a cache line at a time, typically 64 bytes on x86/x64 CPUs.
Let's write the payload 32 times per call and see if we improve.

```rust
use std::io::{self, stdout, BufWriter};
use std::io::prelude::*;

const PAGE_SIZE: usize = 4096;

fn main() -> Result<(), io::Error> {
    let mut writer = BufWriter::with_capacity(PAGE_SIZE * 4, stdout().lock());
    loop {
        writer.write_all(b"y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n")?;
    }
}
```

```shell
>> cargo run --release | pv --average-rate >/dev/null
[12.2GiB/s]
```

Nice! This is better than what I expected.

Let's test our theory and write 66 bytes instead of 64.

```shell
>> cargo run --release | pv --average-rate >/dev/null
[9.6GiB/s]
```

A 27% difference.
It's also beneficial to keep the buffer a multiple of the page size, as that is friendlier to the Linux pipe, which is implemented as a ring buffer of pages.

Let's try to transfer what we learned so far to write a performant fizz buzz implementation.
As a first optimization, we can notice that the problem is cyclic and can be made branchless.
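A quick sketch to sanity-check the cyclic claim: whether a line is a number, Fizz, Buzz, or FizzBuzz depends only on i mod 15, so one fixed 15-line template covers every chunk.

```rust
// Classify a line according to the fizz buzz rules.
fn label(i: u64) -> &'static str {
    match (i % 3, i % 5) {
        (0, 0) => "FizzBuzz",
        (0, _) => "Fizz",
        (_, 0) => "Buzz",
        _ => "number",
    }
}

fn main() {
    // The sequence of labels repeats with period 15, so we can hardcode
    // one 15-line template and only fill in the number slots.
    for i in 1..=1_000u64 {
        assert_eq!(label(i), label(i + 15));
    }
    println!("the pattern repeats every 15 lines");
}
```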

```rust
use std::io::prelude::*;
use std::io::{self, stdout, BufWriter};

const PAGE_SIZE: usize = 4096;

fn main() -> Result<(), io::Error> {
    let mut writer = BufWriter::with_capacity(PAGE_SIZE * 4, stdout().lock());
    for i in (0..).step_by(15) {
        write!(
            writer,
            "{}\n{}\nFizz\n{}\nBuzz\nFizz\n{}\n{}\nFizz\nBuzz\n{}\nFizz\n{}\n{}\nFizzBuzz\n",
            i + 1,
            i + 2,
            i + 4,
            i + 7,
            i + 8,
            i + 11,
            i + 13,
            i + 14,
        )?;
    }
    Ok(())
}
```

```shell
>> cargo run --release | pv --average-rate >/dev/null
[ 825MiB/s]
```

We are an order of magnitude off.

Running this code under perf and looking at the flame graph, we notice that two thirds of the time is spent on integer-to-string conversion, or itoa.

We are clearly CPU bound now.
This can be solved by either multi-core parallelism or low-level optimization of the "hot" loop.
We'll start with the latter.

We'll use the itoap crate, as it's one of the fastest SIMD itoa implementations, based on SSE2.
It also writes numbers most significant digit first, making the whole operation sequential.

```rust
use std::io::prelude::*;
use std::io::{self, stdout};

use itoap::write_to_vec;

const PAGE_SIZE: usize = 4096;
const CHUNK_SIZE: usize = 15; // write 15 elements at a time
const CHUNK_MAX_BYTES: usize = 359; // 39 * 8 u64 + 47 bytes of text

// 128 KiB fits in the L2 cache of most processors
const BUFFER_SIZE: usize = PAGE_SIZE * 32;

const BATCH_PER_PAGE: usize = PAGE_SIZE / CHUNK_MAX_BYTES;
const BATCH_SIZE: usize = BATCH_PER_PAGE * 32;

fn main() -> Result<(), io::Error> {
    let mut writer = stdout().lock();
    let mut i = 0;
    loop {
        let mut buf = Vec::<u8>::with_capacity(BUFFER_SIZE);
        (0..BATCH_SIZE).for_each(|_| {
            write_chunk(&mut buf, i);
            i += CHUNK_SIZE;
        });
        writer.write_all(&buf)?;
    }
}

#[inline(always)]
fn write_chunk(buf: &mut Vec<u8>, i: usize) {
    write_to_vec(buf, i + 1);
    buf.push(b'\n');
    write_to_vec(buf, i + 2);
    buf.extend(b"\nFizz\n");
    write_to_vec(buf, i + 4);
    buf.extend(b"\nBuzz\nFizz\n");
    write_to_vec(buf, i + 7);
    buf.push(b'\n');
    write_to_vec(buf, i + 8);
    buf.extend(b"\nFizz\nBuzz\n");
    write_to_vec(buf, i + 11);
    buf.extend(b"\nFizz\n");
    write_to_vec(buf, i + 13);
    buf.push(b'\n');
    write_to_vec(buf, i + 14);
    buf.extend(b"\nFizzBuzz\n");
}
```

```shell
>> cargo run --release | pv --average-rate >/dev/null
[3.56GiB/s]
```

Not too bad for safe Rust that is still readable.
Amusingly, reusing buffers didn't have any impact on the throughput.
I'm guessing the allocator is smart enough to reuse buffers with the same capacity.

Now that we have reasonable single-threaded performance, let's use all the cores of our processor.

Since each loop iteration computes CHUNK_SIZE * BATCH_SIZE = x elements, we can make each thread compute x elements at an offset, i.e.

  • thread 1: from 0 to x
  • thread 2: from x to 2x
  • ...

We end up with as many buffers as threads, each producing x elements in parallel!
We could then write all the buffers in a single syscall thanks to vectored I/O.
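`write_all_vectored` is nightly-only, but the stable `write_vectored` conveys the idea: several buffers are handed over in one `writev(2)`-style call. A minimal sketch, using a `Vec<u8>` sink as a stand-in for stdout and two hypothetical per-thread buffers:

```rust
use std::io::{IoSlice, Write};

// Gather several per-thread buffers into one vectored write.
// On a real file descriptor this becomes a single writev(2) syscall.
fn gather(buffers: &[Vec<u8>]) -> std::io::Result<Vec<u8>> {
    let slices: Vec<IoSlice> = buffers.iter().map(|b| IoSlice::new(b)).collect();
    let mut sink: Vec<u8> = Vec::new();
    // Vec's Write impl consumes every slice in one call; a real file may
    // write fewer, which is why the full program uses the nightly
    // write_all_vectored instead.
    sink.write_vectored(&slices)?;
    Ok(sink)
}

fn main() -> std::io::Result<()> {
    let buffers = vec![b"1\n2\nFizz\n".to_vec(), b"4\nBuzz\nFizz\n".to_vec()];
    let out = gather(&buffers)?;
    assert_eq!(out, b"1\n2\nFizz\n4\nBuzz\nFizz\n");
    println!("{}", String::from_utf8_lossy(&out));
    Ok(())
}
```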

The last thing we need to deal with is inter-thread communication.
We'll use bounded channels to communicate buffers as we want to block when a worker is outpacing I/O.
We will also make sure to manually thread::yield_now() to help the OS scheduler distribute the load.
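Here is a minimal sketch of that pipeline with a single worker and dummy buffers; the capacity-3 channel means the worker can run at most three buffers ahead before `send` blocks:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// One worker sends buffers through a bounded channel. With capacity 3,
// `send` blocks once three buffers are queued, so a fast worker can
// never outpace the consumer by much.
fn run_pipeline(buffers: u8) -> usize {
    let (sender, receiver) = sync_channel::<Vec<u8>>(3);

    let worker = thread::spawn(move || {
        for i in 0..buffers {
            let buf = vec![i; 4]; // stand-in for a fizz buzz buffer
            sender.send(buf).expect("receiver hung up");
            thread::yield_now(); // nudge the scheduler, as in the full program
        }
    });

    // The consumer drains buffers in order.
    let mut received = 0;
    for i in 0..buffers {
        let buf = receiver.recv().expect("worker died");
        assert_eq!(buf, vec![i; 4]); // channel preserves send order
        received += 1;
    }
    worker.join().unwrap();
    received
}

fn main() {
    println!("received {} buffers", run_pipeline(10));
}
```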

The following program reaches a throughput of 6GiB/s on my 5950x CPU and 3733 MHz DDR4 RAM.

```rust
#![feature(write_all_vectored)]
use std::io;
use std::io::prelude::*;
use std::sync::mpsc::sync_channel;
use std::thread::{self, available_parallelism};

use itoap::write_to_vec;

const PAGE_SIZE: usize = 4096;
const CHUNK_SIZE: usize = 15;
const CHUNK_MAX_BYTES: usize = 359;

// limit the progression of a single worker
const CHANNEL_CAPACITY: usize = 3;

const BUFFER_SIZE: usize = PAGE_SIZE * 128; // < L2 cache on 5950x
const BATCH_PER_PAGE: usize = PAGE_SIZE / CHUNK_MAX_BYTES;
const BATCH_SIZE: usize = BATCH_PER_PAGE * 128;

fn main() -> Result<(), io::Error> {
    let workers_count = available_parallelism()?.get() - 1;

    let receivers = (0..workers_count)
        .map(|thread_id| {
            let (sender, receiver) = sync_channel(CHANNEL_CAPACITY);
            thread::spawn(move || {
                // offset for this thread
                let mut i = thread_id * (CHUNK_SIZE * BATCH_SIZE);
                loop {
                    let mut buf = Vec::<u8>::with_capacity(BUFFER_SIZE);
                    (0..BATCH_SIZE).for_each(|_| {
                        write_chunk(&mut buf, i);
                        i += CHUNK_SIZE;
                    });

                    sender
                        .send(buf)
                        .unwrap_or_else(|_|
                            panic!("thread {} couldn't send", thread_id)
                        );

                    // skip work done by other workers
                    i += (workers_count - 1) * (CHUNK_SIZE * BATCH_SIZE);

                    thread::yield_now();
                }
            });
            receiver
        })
        .collect::<Vec<_>>();

    let mut writer = io::stdout().lock();
    loop {
        let mut buffers = receivers
            .iter()
            .enumerate()
            .map(|(i, r)| {
                r.recv()
                    .unwrap_or_else(|_| panic!("worker thread {} died", i))
            })
            .collect::<Vec<_>>();

        writer.write_all_vectored(
            &mut buffers
                .iter_mut()
                .map(|b| io::IoSlice::new(b))
                .collect::<Vec<_>>(),
        )?;
    }
}
```

```shell
>> cargo run --release | pv --average-rate >/dev/null
[6.07GiB/s]
```

It's possible to go even further, but that requires either obscure zero-copy system calls or a more complex algorithm that exploits the ever-increasing property of fizz buzz: itoa could then be specialized to serialize monotonically increasing numbers, which, for most chunks, only recomputes the two least significant digits.
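A sketch of that last idea, with a hypothetical `add_ascii` helper that keeps the number as in-place ASCII digits: adding 15 rewrites the two least significant digits and only touches more when a carry ripples further.

```rust
// Add a small delta to a number stored as ASCII decimal digits,
// propagating the carry up from the least significant digit.
// For most +15 steps the loop stops after the last two digits.
fn add_ascii(digits: &mut Vec<u8>, mut delta: u32) {
    let mut idx = digits.len();
    while delta > 0 {
        if idx == 0 {
            digits.insert(0, b'0'); // the number gained a digit
            idx = 1;
        }
        idx -= 1;
        let total = (digits[idx] - b'0') as u32 + delta % 10;
        digits[idx] = b'0' + (total % 10) as u8;
        delta = delta / 10 + total / 10; // carry into the next digit
    }
}

fn main() {
    let mut n = b"123".to_vec();
    add_ascii(&mut n, 15); // only the last two digits change
    assert_eq!(n, b"138");
    add_ascii(&mut n, 15);
    assert_eq!(n, b"153");
    let mut m = b"199".to_vec();
    add_ascii(&mut m, 15); // here the carry ripples all the way up
    assert_eq!(m, b"214");
    println!("incremental itoa works");
}
```

This is only an illustration of the principle, not the tuned implementation such a solution would use.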