
Add PutObject Ring Buffer #19605

Merged: 5 commits into minio:master on May 15, 2024

Conversation

klauspost
Contributor

Description

Replace the `io.Pipe` from `streamingBitrotWriter` -> `CreateFile` with a fixed size ring buffer.

This will add an output buffer for encoded shards to be written to disk - potentially via RPC.

This will remove blocking when `(*streamingBitrotWriter).Write` is called and it writes hashes and data.

With current settings the write looks like this:

```
Outbound
┌───────────────────┐             ┌────────────────┐               ┌───────────────┐                      ┌────────────────┐
│                   │   Parr.     │                │  (http body)  │               │                      │                │
│ Bitrot Hash       │     Write   │      Pipe      │      Read     │  HTTP buffer  │    Write (syscall)   │  TCP Buffer    │
│ Erasure Shard     │ ──────────► │  (unbuffered)  │ ────────────► │   (64K Max)   │ ───────────────────► │    (4MB)       │
│                   │             │                │               │  (io.Copy)    │                      │                │
└───────────────────┘             └────────────────┘               └───────────────┘                      └────────────────┘
```

We write a Hash (32 bytes). Since the pipe is unbuffered, it will block until the 32 bytes have been delivered to the TCP buffer and the next Read hits the Pipe. Then we write the shard data. This will typically be bigger than 64KB, so it will block until 2 blocks have been read from the pipe.
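
To make the blocking concrete, here is a minimal, self-contained Go sketch (illustrative only, not MinIO code) of the `io.Pipe` behavior: each Write only returns once readers have consumed all of its bytes.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	pr, pw := io.Pipe()

	go func() {
		defer pw.Close()

		// Write a 32-byte "hash": blocks until a Read consumes it.
		start := time.Now()
		pw.Write(make([]byte, 32))
		fmt.Println("hash write returned after", time.Since(start))

		// Write a 128KB "shard": blocks until readers have drained all of it.
		start = time.Now()
		pw.Write(make([]byte, 128<<10))
		fmt.Println("shard write returned after", time.Since(start))
	}()

	// Reader side: drain 64KB at a time, slowly, as a congested
	// connection would.
	buf := make([]byte, 64<<10)
	for {
		time.Sleep(100 * time.Millisecond)
		if _, err := pr.Read(buf); err != nil {
			break
		}
	}
}
```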

When we insert a ring buffer:

```
Outbound
┌───────────────────┐             ┌────────────────┐               ┌───────────────┐                      ┌────────────────┐
│                   │             │                │  (http body)  │               │                      │                │
│ Bitrot Hash       │     Write   │  Ring Buffer   │      Read     │  HTTP buffer  │    Write (syscall)   │  TCP Buffer    │
│ Erasure Shard     │ ──────────► │    (2MB)       │ ────────────► │   (64K Max)   │ ───────────────────► │    (4MB)       │
│                   │             │                │               │  (io.Copy)    │                      │                │
└───────────────────┘             └────────────────┘               └───────────────┘                      └────────────────┘
```

The hash+shard will fit within the ring buffer, so writes will not block - but will complete after a memcopy. Reads will be able to fill the 64KB buffer if there is data for it.
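
As a rough sketch of the difference (using the vendored github.com/smallnest/ringbuffer package directly; the actual wiring inside `streamingBitrotWriter`/`CreateFile` is more involved), writes that fit in the 2MB buffer return right after the memcopy, and the reader can then drain 64KB at a time:

```go
package main

import (
	"fmt"

	"github.com/smallnest/ringbuffer"
)

func main() {
	rb := ringbuffer.New(2 << 20) // 2MB ring buffer

	// Writer side: hash + shard fit within the buffer, so these return
	// right after the memcopy instead of waiting for a reader.
	if _, err := rb.Write(make([]byte, 32)); err != nil {
		panic(err)
	}
	if _, err := rb.Write(make([]byte, 128<<10)); err != nil {
		panic(err)
	}
	fmt.Println("buffered bytes:", rb.Length())

	// Reader side: drain in 64KB chunks, as io.Copy into the HTTP
	// buffer would.
	buf := make([]byte, 64<<10)
	for rb.Length() > 0 {
		n, _ := rb.Read(buf)
		fmt.Println("read", n, "bytes")
	}
}
```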

If the network is congested, the ring buffer will fill up, and all syscalls will operate on full buffers. Only when the ring buffer is full will erasure coding start blocking.

Since there is always "space" to write output data, we remove the parallel writing: we are always writing to memory now, and the goroutine synchronization overhead probably isn't worth taking. If output was blocked in the existing code, we would still wait for it to unblock in `parallelWriter`, so it would make no difference there, except that the ring buffer now smooths out the load.
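
Conceptually, the change to the write path looks like the sketch below (function and variable names are illustrative, not the actual `parallelWriter` code, and error handling is simplified):

```go
package sketch

import (
	"io"
	"sync"
)

// writeParallel sketches the old approach: one goroutine per shard writer,
// because each Write could block on an unbuffered pipe.
func writeParallel(writers []io.Writer, shards [][]byte) {
	var wg sync.WaitGroup
	for i := range writers {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			writers[i].Write(shards[i])
		}(i)
	}
	wg.Wait()
}

// writeSequential sketches the new approach: each Write is only a memcopy
// into a ring buffer, so a plain loop avoids the goroutine synchronization.
func writeSequential(writers []io.Writer, shards [][]byte) error {
	for i := range writers {
		if _, err := writers[i].Write(shards[i]); err != nil {
			return err
		}
	}
	return nil
}
```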

There are some micro-optimizations we could look at later. The biggest is probably that in most cases we could encode directly into the ring buffer, if we are not at a boundary. Also, "force filling" the Read requests (i.e. blocking until a full read can be completed) could be investigated and might allow concurrent memcopy on read and write. But if this change isn't better by itself, I don't think it is worth it overall.

The 2MB is a bit big for my liking, but we don't have 128K-256K buffers, which I would probably go for.

Pending: Actual testing.
For now, also vendor https://github.com/smallnest/ringbuffer.

How to test this PR?

Test PutObject/Multipart with bigger files.
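
One possible way to exercise this, sketched with the minio-go v7 SDK (the endpoint, credentials, and bucket below are placeholders for a local test deployment, and the bucket is assumed to exist):

```go
package main

import (
	"context"
	"crypto/rand"
	"io"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials for a local test server.
	client, err := minio.New("localhost:9000", &minio.Options{
		Creds: credentials.NewStaticV4("minioadmin", "minioadmin", ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Stream 1GiB of random data; a large object exercises the
	// multipart and erasure-coded write path this PR changes.
	const size = 1 << 30
	_, err = client.PutObject(context.Background(), "testbucket", "bigobject",
		io.LimitReader(rand.Reader, size), size, minio.PutObjectOptions{})
	if err != nil {
		log.Fatal(err)
	}
}
```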

Types of changes

  • Optimization (provides speedup with no functional changes)

@klauspost
Contributor Author

klauspost commented Apr 24, 2024

(fixed...)

@klauspost klauspost marked this pull request as ready for review April 25, 2024 12:17
@harshavardhana
Member

I am backlogged on testing this, will have to take internal help on this.

@klauspost
Contributor Author

@harshavardhana No worries. It is not going anywhere.

@harshavardhana harshavardhana merged commit d4b391d into minio:master May 15, 2024
20 checks passed
@klauspost klauspost deleted the putobject-ringbuffer branch May 15, 2024 06:39