During August 2023 my company developed a need to replicate a large amount of data cross-country: hundreds of datasets, multiple PB, over a link with more than 50 ms of latency. zrepl was the obvious choice for how to do it. So we tried. But speed was poor, and the transfer is still not complete. This is what we've learned about zrepl's performance.
tldr;
Concurrency matters
There is too much overhead per-snapshot, especially at startup
zrepl prunes snapshots too infrequently
zrepl uses kqueue inefficiently
Go has too much overhead on fork
Concurrency
We quickly increased zrepl's concurrency to 50. But we don't know if this is the best number, because it takes so long for throughput to reach steady state after starting zrepl. Plus, there are so many other factors at play that it's hard to experiment with the concurrency setting.
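For reference, here is roughly where that knob lives in a zrepl job definition. This is a hedged sketch: recent zrepl releases have a replication concurrency option, but the exact key names and defaults may differ by version, and the job and dataset names here are made up.

```yaml
jobs:
  - name: pull_tank            # illustrative job name
    type: pull
    # connect, root_fs, interval, pruning, etc. elided
    replication:
      concurrency:
        steps: 50              # parallel replication steps; we settled on 50
        size_estimates: 50     # parallel size estimation (key name from memory)
```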
Snapshot overhead
When first starting up zrepl, it would spend more than 1 hour in the "PLANNING" phase. During this time its CPU usage would be very high, yet it wouldn't be transmitting anything. ps showed a lot of zfs list jobs. My belief was that zrepl had an asymptotic complexity problem during its planning phase. To mitigate this, we reduced the snapshot frequency, lengthening the interval from 1h to 4h and later to 8h. That seemed to help: now zrepl's startup time is more like 10-15 minutes. I thought that this might improve throughput too, because there would be fewer txg syncs on the destination pools. But it's hard to tell whether it's made a difference.
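If you want to see the same symptom on your own system, a crude way is to sample the process table while a job is planning. This is only a sketch; it assumes FreeBSD's ps(1) and that zrepl is what's spawning the zfs processes.

```sh
#!/bin/sh
# Sample the process table once per second for a minute and count how many
# `zfs list` child processes are alive at each tick. High, sustained counts
# during the "PLANNING" phase match what we observed.
i=0
while [ $i -lt 60 ]; do
    printf '%s ' "$(date +%T)"
    ps -axo command | grep -c '^zfs list'
    sleep 1
    i=$((i + 1))
done
```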
Snapshot pruning
In a nutshell, zrepl's algorithm is, for each job:
1. Devise a transfer plan that will get all datasets up to date on the receiver. Each dataset's plan may include multiple incremental steps.
2. Execute that plan.
3. Prune snapshots on both source and destination.
At this point the job is "complete". However, as time has passed, some datasets may be out of date. No further action will be attempted until the job restarts according to its schedule.
The problem with this algorithm is that a lot of time may pass between steps 1 and 3. During that time, many source snapshots age to the point where they ought to be pruned, according to the pruning rules. In fact, it's possible that some source snapshots ought to be pruned even before step 1 starts. But zrepl will never prune anything until step 3. That means that it wastes precious bandwidth transferring data that it intends to prune anyway. During December, I determined that our zrepl jobs were wasting at least 10% of their bandwidth on such snapshots.
Ideally zrepl would prune such snapshots before transferring them. But that would require altering the plan during step 2, which may require significant refactoring of zrepl. As I am not a Go programmer, I did not attempt it. Instead, I wrote a shell script, later rewritten in Python, to prune such snapshots from outside of zrepl (a sketch of the idea is below). Doing so causes errors whenever zrepl can't find a snapshot that was part of its plan. But it eventually returns to step 1 (usually because we either restart the process or restart the job) and devises a new plan.
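For the curious, the out-of-band pruning idea boils down to something like the following. It's a hedged sketch, not our production script: the dataset name, snapshot prefix (zrepl's default is `zrepl_`, but check your config), and retention window are all illustrative, and the destroy is left commented out.

```sh
#!/bin/sh
# Destroy zrepl-managed source snapshots older than a cutoff, outside zrepl.
# Dry-run by default; review the output before uncommenting `zfs destroy`.
CUTOFF=$(( $(date +%s) - 7 * 86400 ))   # illustrative 7-day retention

zfs list -Hp -t snapshot -o name,creation -r tank/data |
while read -r snap created; do
    case "$snap" in
        *@zrepl_*) ;;        # only touch snapshots with the zrepl prefix
        *) continue ;;
    esac
    if [ "$created" -lt "$CUTOFF" ]; then
        echo "would destroy: $snap"
        # zfs destroy "$snap"
    fi
done
```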
Kqueue
For months, I noticed that our destination's bandwidth tends to jump between ~500 Mbps and ~5 Gbps, with little in between. The transitions are very frequent; it typically stays in either state for no more than a minute, though sometimes it gets stuck in the low state for days at a time. On one recent occasion when it was in the "low" state, I noticed that its CPU load had paradoxically increased from 25% to 35%. That's suspicious. top showed that the zrepl process was using the most CPU, and that it was spending more time in system mode than in user mode. So I recorded a flamegraph.
What I saw was that zrepl spent most of its time in lock_delay, indicating lock contention. I've seen that before with another Go program, and I believe it's caused by calling kevent on the same kqueue from different threads simultaneously. dtrace confirmed that was happening (see the sketch below). It's probably a bug in some common Go library, but I don't have the skills to fix it. An hour later, zrepl fortuitously "fixed" itself: bandwidth jumped back up to the 5 Gbps range, and I was lucky enough to snag another flamegraph while bandwidth remained high. This one showed a lot more time in syscalls like kevent, read, and write, so it was definitely getting more useful work done. But it was still mostly lock_delay. My guess is that the difference between the "low" and "high" bandwidth periods is due to some quirk of the scheduler.
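For anyone who wants to check for the same pattern, this is roughly the dtrace one-liner I mean. It's a sketch: it assumes FreeBSD's syscall provider and that the process is literally named zrepl; arg0 of kevent(2) is the kqueue descriptor.

```sh
# Count kevent() calls per (kqueue fd, thread id) for zrepl. If the same fd
# shows up under many different tids, multiple threads are polling one kqueue.
dtrace -n 'syscall::kevent:entry /execname == "zrepl"/ { @[arg0, tid] = count(); }'
```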
Fork
As a quick experiment to fix the lock contention, I tried using cpuset to pin zrepl to a single core, without restarting the process. I reasoned that this should eliminate most if not all causes of lock contention in kevent. However, since I didn't restart the process, Go might have some wrongly-sized thread pools or something. What I found was that bandwidth fell to about 600-700 Mbps and CPU usage fell to about 10%. That's the lowest CPU usage I've seen while zrepl is actually doing work. So I took another flamegraph. This one showed much reduced time in lock_delay. Instead, it was dominated by pmap_try_insert_pv_entry. That function is called during fork, as part of copying the parent's virtual memory mappings into the child. The problem wasn't that zrepl forked too often; dtrace showed that it only forked once every several seconds. Rather, the problem seemed to be that it was using too much memory (11 GiB), and possibly that that memory was too fragmented.
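Here is a hedged sketch of how one can put numbers on that: time each fork() with dtrace. It assumes FreeBSD's syscall provider and that Go enters fork via the plain fork syscall, which is what the flamegraph suggested.

```sh
# Distribution of fork() wall-clock latency for zrepl, in nanoseconds.
# With a large, fragmented address space, expect this to skew high.
dtrace -n '
syscall::fork:entry /execname == "zrepl"/ { self->ts = timestamp; }
syscall::fork:return /self->ts/ {
    @["fork latency (ns)"] = quantize(timestamp - self->ts);
    self->ts = 0;
}'
```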
The obvious solution would be to use posix_spawn (or vfork/exec) instead of fork. This is exactly what posix_spawn is meant for, because it avoids copying the process's address space. But dtrace shows that zrepl doesn't use posix_spawn. A quick look at Go's GitHub repo revealed the answer. Go's philosophy of bypassing libc and making syscalls directly makes everything harder than it needs to be. Supporting vfork/exec on Linux/amd64 actually required freaking assembly: golang/go@9e6b79a#diff-1587342f077ea1dbe9673e4847da2423919de3899715421374649ab8004cef43 . No wonder it isn't implemented for other architectures and OSes (we're using FreeBSD/amd64). Instead, on FreeBSD Go uses plain fork/exec, with about 200 LOC in between.
Patching Go to use posix_spawn would probably require an expert Go programmer, not just a passable one, and certainly not a novice like me. Alternatively, perhaps zrepl could be patched to use posix_spawn bindings from outside the standard library, if any exist. Or, failing that, even reducing its memory consumption would help.
Bypassing zrepl entirely
We hadn't tried this before, because for various reasons it was technically difficult. But yesterday I wrote a short Rust program to mimic what zrepl does. It connected from the pull server to the source server, started a pipeline of zfs send, openssl enc, and mbuffer, and redirected its stdout to a socket. On the pull side, it redirected that socket to the stdin of an mbuffer, openssl enc -d, and zfs recv pipeline. So none of the data actually flowed through the Rust process itself; Rust was just responsible for connecting sockets, starting processes, and maintaining the desired concurrency. It screamed. Throughput hit a peak of 10.8 Gbps and a 30-minute average of 8 Gbps. Compare that to zrepl, which usually peaks at 5-6 Gbps and averages 3.5 Gbps, on a good day.
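In plain shell, each stream in the PoC was morally equivalent to the pipeline below. This is a sketch, not the actual program: the Rust code connected the sockets itself, whereas here mbuffer's built-in TCP mode stands in for that role, and the hosts, dataset, cipher, key path, and port are all illustrative.

```sh
# On the source host: serialize, encrypt, buffer, and push to the puller.
zfs send -i tank/data@snap1 tank/data@snap2 \
  | openssl enc -aes-256-ctr -pass file:/etc/replication.key \
  | mbuffer -s 128k -m 1G -O puller.example.com:9000

# On the pull host: listen, buffer, decrypt, and receive.
mbuffer -s 128k -m 1G -I 9000 \
  | openssl enc -d -aes-256-ctr -pass file:/etc/replication.key \
  | zfs recv -F tank/replica/data
```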
Note that I didn't completely reimplement zrepl. This Rust program is just a PoC. Its purpose is to measure what our hardware is capable of and to identify where the bottlenecks lie in zrepl.
Stuff that didn't make a difference
TCP congestion control settings. We tried various tweaks to the TCP stack, like switching the congestion control algorithm. None of that seemed to make a significant difference.
mbuffer. We merged "Add piping into zfs send|recv" (#761) into our build and started using it. It didn't seem to make a difference in average throughput. It did, however, make second-by-second throughput more consistent, so the progress bar shown in zrepl status is now more useful. For that reason, we've continued to use it.
Action Items
zrepl is definitely the best tool for what we're doing. But to improve its performance during the initial sync, it would be great if somebody who knows Go could:
Modify zrepl to prune source snapshots before transferring them.
Improve the performance of the Planning phase in the presence of lots of snapshots.
Make Go's standard library use posix_spawn instead of fork/exec on FreeBSD.
Make zrepl only poll each kqueue from a single thread.