
The costs of laziness

Rust, Performance, Async, Distributed Systems · 4 min read

Lazy evaluation is a concept that appears in multiple branches of computer science. The idea is to pre-emptively do less work and defer the computation until it is needed, as opposed to eager evaluation. The problem with such a technique is that it can have noticeable overheads, such as bookkeeping. I won't talk about laziness in the context of Haskell as I'm not qualified and the internet is already filled with that.

It is often introduced as a functional programming concept, but it also exists in the C-family programming languages. Functions are lazy by nature: function pointers (and lambdas or closures) by themselves do nothing -- until called.
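
A minimal sketch of that point, with names of my own:

fn main() {
    // building the closure does no work: it is only a description
    // of a computation, captured for later
    let expensive = || (0..1_000_000u64).sum::<u64>();

    // the computation runs only here, if and when it is called
    println!("{}", expensive());
}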

Let's set the stage with an example: suppose we need to add tracing to some application that processes a list of objects. One way of implementing this is the following:

fn trace(condition: bool, message: String) {
    if condition {
        println!("{}", message);
    }
}

fn consume(objects: Vec<Object>) {
    for obj in objects {
        trace(obj.id % N == 0, format!("consumed {} objects", obj.id));
        trace(obj.is_defective(), format!("{} is defective", obj));
    }
}

The above code may appear sensible as it is DRY and encapsulates the tracing logic inside a function, but it has a major flaw that won't be caught by the compiler.

--

For each iteration of the loop, the program constructs the message even when the condition is false: format! has to allocate and materialize the String before it can be passed as an argument. This has dramatic performance consequences.

One way of fixing this is to use higher-order functions: by wrapping the message in a closure, the message won't be eagerly evaluated at the call site but only when required.

fn trace(condition: bool, message: impl FnOnce() -> String) {
    if condition {
        println!("{}", message());
    }
}

fn consume(objects: Vec<Object>) {
    for obj in objects {
        trace(obj.id % N == 0, || format!("consumed {} objects", obj.id));
        trace(obj.is_defective(), || format!("{} is defective", obj));
    }
}

This works, but it would be preferable to have a type that encapsulates the formatting logic and only materializes the message when needed, and that is exactly what std::fmt::Arguments is.

It is a lazy structure that represents the formatted message. It is also faster than our string-producing closure since, by virtue of laziness, it avoids the intermediate string allocation and writes directly to the writer.

fn trace(condition: bool, message: std::fmt::Arguments) {
    if condition {
        println!("{}", message);
    }
}

fn consume(objects: Vec<Object>) {
    for obj in objects {
        trace(obj.id % N == 0, format_args!("consumed {} objects", obj.id));
        trace(obj.is_defective(), format_args!("{} is defective", obj));
    }
}

Actually, this is not the only lazy construct used in the code above; let's desugar the for loop.

fn consume(objects: Vec<Object>) {
    match IntoIterator::into_iter(objects) {
        mut iterator => loop {
            match iterator.next() {
                None => break,
                Some(obj) => {
                    trace(obj.id % N == 0, format_args!("consumed {} objects", obj.id));
                    trace(obj.is_defective(), format_args!("{} is defective", obj));
                }
            }
        },
    }
}

The code might seem cryptic, but notice that each iteration of the loop requires a function call: next(); iteration is also lazy! The advantage of lazy iteration is that it allows partial consumption (e.g. break), which can save unnecessary work and allows the use of infinite data structures like Range.
It relates to the iterator design pattern, an increasingly popular feature in programming languages.

pub trait Iterator {
    type Item;

    fn next(&mut self) -> Option<Self::Item>;
}
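
To make the laziness concrete, here is a minimal hand-rolled iterator (a toy Counter of my own): no element exists until next() is called, and the caller may stop at any time.

struct Counter {
    count: usize,
}

impl Iterator for Counter {
    type Item = usize;

    // each element is produced on demand, one next() call at a time
    fn next(&mut self) -> Option<usize> {
        self.count += 1;
        Some(self.count)
    }
}

fn main() {
    // partial consumption: only three elements are ever produced,
    // even though the iterator is conceptually infinite
    for n in (Counter { count: 0 }).take(3) {
        println!("{}", n);
    }
}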

This kind of abstraction seems to trade off performance for readability, but it is actually a zero-runtime-cost abstraction. As everything is statically dispatched, the compiler is allowed to aggressively inline and optimize the code. This is not guaranteed, but it has rarely generated bad assembly in my experience.
This abstraction enables the composition of simple transformations on iterators which would be clunkier in an imperative style, for example:

/// finds the first number expressible as a + b = c + d
/// where a, b, c, d are cubes,
/// in constant memory
fn taxicab_two() -> Option<usize> {
    let cube = |i: usize| i * i * i;

    (0..)
        .map(|i| {
            (
                (cube(i), (0..i).map(cube)),
                (i + 1..).map(|j| (cube(j), (0..j).map(cube))),
            )
        })
        .find_map(|(targets, candidates)| {
            let (a, mut bs) = targets;
            bs.find_map(|b| {
                let target = a + b;
                candidates
                    .clone() // at worst a memcpy of a small structure
                    .take_while(|&(c, _)| c <= target)
                    .find_map(|(c, ds)| {
                        ds.map(|d| c + d)
                            .skip_while(|&candidate| candidate < target)
                            .next()
                            .filter(|&candidate| candidate == target)
                    })
            })
        })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        assert_eq!(taxicab_two(), Some(1729));
    }
}

It can get kind of messy with nested iterators, but in the common case they're a pleasure to work with!

Another benefit of this API is that it guarantees (in the common case of slice::Iter) sequential access. This forces the user to write code that is friendlier to the hardware thanks to cache locality and memory prefetching, while also making it easier for the compiler to autovectorize and remove bounds checks.
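
A small sketch of that point (both functions are mine, assuming a simple reduction over a slice):

// with explicit indexing, every access must be proven in range;
// the optimizer usually removes the check here, but it is not guaranteed
fn sum_indexed(values: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..values.len() {
        total += values[i];
    }
    total
}

// the iterator walks the slice strictly sequentially, with no index
// to check, which is the form that typically autovectorizes cleanly
fn sum_iter(values: &[u64]) -> u64 {
    values.iter().sum()
}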

--

One problem I haven't mentioned yet about lazy evaluation is debugging.
As the computation starts only when required, it is possible to end up with a very deep computation DAG, making it tough to debug errors and extremely hard to track down performance regressions.
Nonetheless, laziness leads to better code generation, as the compiler has complete information about the computation; this is very apparent in processing engines like Spark.
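
The same effect shows up at a small scale in Rust (a toy pipeline of mine): the adapters only build a lazy description, so when sum() finally forces it the compiler sees the whole computation at once and typically fuses it into a single loop with no intermediate collections.

fn sum_of_even_squares(n: u64) -> u64 {
    // map and filter compute nothing on their own; the entire pipeline
    // runs inside the single loop that sum() drives
    (0..n).map(|i| i * i).filter(|sq| sq % 2 == 0).sum()
}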

Laziness can also appear in data structures: a common implementation of a priority queue is the binary heap.
Let's look at the following code:

use std::collections::BinaryHeap;

// note: into_iter_sorted() is nightly-only at the time of writing
// (#![feature(binary_heap_into_iter_sorted)])
fn top_k<T: Ord>(seq: impl IntoIterator<Item = T>, k: usize) -> impl Iterator<Item = T> {
    seq.into_iter()
        .collect::<BinaryHeap<_>>()
        .into_iter_sorted()
        .take(k)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let array = [7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1];
        assert!(top_k(array, 3).eq([52, 40, 34]));
    }
}

The top_k function builds a binary heap and yields an iterator that lazily pops the top element at most k times. So the function takes a linear amount of time even if the output iterator is never consumed (the heap is built eagerly), and costs an extra O(log n) for each iteration (i.e. next()) of the output iterator. Most of that work is spent maintaining the heap invariant.

When the biggest element is popped, the data structure has to move things around to keep the invariant valid; but if the heap is popping an element for the last time, that work is wasted!

This could be solved by having a function on the binary heap that consumes (i.e. takes ownership of) the data structure and returns the topmost element in constant time... but algorithms rarely know when that would happen.
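
A sketch of what such an operation could look like: a hypothetical into_top, not in the standard library, and leaning on an implementation detail.

use std::collections::BinaryHeap;

// because the heap is consumed, no sift-down is needed afterwards, so
// the maximum can be handed out in O(1); this relies on the array-backed
// layout keeping the maximum at index 0 (into_vec documents "arbitrary
// order"), so a real version would have to live inside BinaryHeap itself
fn into_top<T: Ord>(heap: BinaryHeap<T>) -> Option<T> {
    let mut items = heap.into_vec();
    if items.is_empty() {
        None
    } else {
        Some(items.swap_remove(0))
    }
}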

One solution is to use a lazy variant of binary heaps: Fibonacci heaps and the like!
Instead of resolving the invariant immediately, they defer the work until necessary.
This allows for better amortized time complexity for common operations.
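
A toy sketch of the deferral idea, deliberately simplistic (a real Fibonacci heap consolidates trees instead of scanning):

struct LazyHeap<T: Ord> {
    // unordered on purpose: the invariant is not maintained on insert
    items: Vec<T>,
}

impl<T: Ord> LazyHeap<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }

    // O(1): just remember the element and defer all comparisons
    fn push(&mut self, item: T) {
        self.items.push(item);
    }

    // the deferred work is paid only when an answer is actually needed
    fn pop(&mut self) -> Option<T> {
        let (idx, _) = self
            .items
            .iter()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.cmp(b))?;
        Some(self.items.swap_remove(idx))
    }
}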

In practice, it is very hard to beat the vector-backed binary heap due to its low overhead; nevertheless, a lazy heap might be a good fit for specific workloads like graph algorithms.

--

The generalization of iterators to distributed systems is pull-based systems. Just like with iterators, the consumer must synchronize with the producer in order to receive data; in our case it was the next() function call, but it can also be an RPC call. The consumer is said to be polling for data.

But at a network boundary, the latency to resolve a function call is on the order of milliseconds, which is lifetimes in CPU time; it would be a waste to stay idle in the meantime.
This problem is what made asynchronous programming so prevalent in modern programming.

Async functions can be written like regular functions, but they must use await for each blocking call.
By doing so, the task can yield back to the executor, which uses the CPU time to advance another ready task!
To know when previously awaited tasks can make progress, the executor polls the operating system for readiness through system calls like poll. Internally, each async function is compiled into a lazy structure: a state machine storing just enough information to continue the execution later. This is a form of lightweight, cooperative multitasking; the alternative would be full-blown threads.
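
A rough sketch of the shape of that structure: a hand-written Future (a toy Countdown of mine) mimicking what the compiler generates, with one variant per suspension point holding just the live state, and hand-waving wakers and pinning.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

enum Countdown {
    // the only state that survives across suspensions is the counter
    Running { remaining: u32 },
    Done,
}

impl Future for Countdown {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut(); // fine: Countdown is Unpin
        match this {
            Countdown::Running { remaining } => {
                if *remaining == 0 {
                    *this = Countdown::Done;
                    Poll::Ready(())
                } else {
                    *remaining -= 1;
                    // a real future would register the waker with an event
                    // source; here we just ask to be polled again
                    cx.waker().wake_by_ref();
                    Poll::Pending
                }
            }
            Countdown::Done => panic!("polled after completion"),
        }
    }
}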

Streams unify both concepts: a Stream is a trait for objects that can produce a sequence of elements asynchronously. One apparent problem is that the consumer still has to continuously poll to receive elements. For synchronous iterators, this polling latency was eliminated by inlining the next() function call; for asynchronous iterators, that cost can't be optimized away!
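
Its shape (simplified from the futures crate) is exactly an Iterator whose next step is asynchronous:

use std::pin::Pin;
use std::task::{Context, Poll};

pub trait Stream {
    type Item;

    // like Iterator::next, but may report Pending instead of blocking;
    // Ready(None) still signals the end of the sequence
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}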

This leads us to the dual: push systems! Here the producer and consumer(s) no longer synchronize; the producer keeps creating elements and sends them to the consumer without the round-trip cost of pull systems. It is easy to see that a producer outpacing a consumer could be problematic, as the consumer might drop elements at best or crash at worst. This explains why Kafka (pull) has a higher latency than RabbitMQ (push); however, RabbitMQ latencies degrade significantly at throughputs higher than 30 MB/s.
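
A miniature version of that failure mode, using a bounded channel (the setup is mine, purely illustrative): when the producer outpaces the consumer, something has to give, here by dropping elements.

use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;
use std::time::Duration;

fn main() {
    // a bounded buffer between a fast producer and a slow consumer
    let (tx, rx) = sync_channel::<u64>(4);

    let producer = thread::spawn(move || {
        let mut dropped = 0;
        for i in 0..100 {
            // the producer never waits: when the buffer is full,
            // the element is simply lost
            if let Err(TrySendError::Full(_)) = tx.try_send(i) {
                dropped += 1;
            }
        }
        dropped
    });

    thread::sleep(Duration::from_millis(10)); // the consumer lags behind
    let consumed = rx.try_iter().count();
    let dropped = producer.join().unwrap();
    println!("consumed {}, dropped {}", consumed, dropped);
}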