
[Feature]: Allow writes during a node outage #3357

Open · Zorlin opened this issue Apr 26, 2024 · 3 comments
Labels: enhancement (New feature or request)
Zorlin commented Apr 26, 2024

Contact Details

No response

Is there an existing issue for this?

  • I have searched all the existing issues

Is your feature request related to a problem? Please describe.

I would like CubeFS to survive taking datanodes and metanodes offline.

Right now, if I stop a datanode or suffer a physical hardware failure, my cluster can no longer receive writes, even on a 3-replica volume with 5 available datanodes.

Read workloads continue to work fine, but the loss of writes blocks certain kinds of work while the cluster is effectively read-only.

Describe the solution you'd like.

We should implement a file/chunk versioning system combined with CRDTs (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type) so that writes can continue while a data partition (DP) shard is offline, with the shard being brought back up to date when it returns.
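To sketch what I mean (purely illustrative, not CubeFS code; all names here are made up): a last-writer-wins register is one of the simplest CRDTs, and per-chunk versions could merge along these lines, so replicas that diverged during an outage converge deterministically on re-sync.

```go
// Hypothetical sketch: a last-writer-wins (LWW) register applied to
// per-chunk versioning. Not CubeFS code; the types are illustrative.
package main

import "fmt"

// ChunkVersion tags a chunk's contents with a logical timestamp and a
// writer ID so any two replicas can merge deterministically.
type ChunkVersion struct {
	Timestamp uint64 // logical clock, bumped on every local write
	NodeID    string // tie-breaker when timestamps collide
	Data      []byte
}

// newer reports whether a should win a merge against b.
func newer(a, b ChunkVersion) bool {
	if a.Timestamp != b.Timestamp {
		return a.Timestamp > b.Timestamp
	}
	return a.NodeID > b.NodeID // deterministic tie-break
}

// Merge is commutative, associative, and idempotent -- the CRDT property
// that lets a returning shard converge regardless of update order.
func Merge(a, b ChunkVersion) ChunkVersion {
	if newer(a, b) {
		return a
	}
	return b
}

func main() {
	// Two replicas accept writes independently while a third shard is offline...
	r1 := ChunkVersion{Timestamp: 7, NodeID: "dn-2", Data: []byte("v7")}
	r2 := ChunkVersion{Timestamp: 9, NodeID: "dn-4", Data: []byte("v9")}
	// ...and converge to the same state when they re-sync.
	fmt.Printf("merged: %s\n", Merge(r1, r2).Data) // merged: v9
}
```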

Describe an alternate solution.

No response

Anything else? (Additional Context)

Other filesystems, such as Ceph and MooseFS, do things like this to allow writes to continue even during major node outages.

Zorlin added the enhancement label on Apr 26, 2024
leonrayang (Member) commented
@Zorlin I don't know if you have any practical experience with this, but the problem should not exist. If one out of five nodes is faulty, writes can still continue without affecting users: new data is appended to a new node, and modifications go to the remaining two replicas, which still have a leader, so everything is okay.
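To make the expectation concrete, here is a simplified sketch of leader-based majority replication (not the actual CubeFS write path; the types and names are illustrative): a 3-replica data partition should keep committing writes with one node down, because two of three acknowledgements still form a quorum.

```go
// Simplified sketch of majority-quorum replication -- not CubeFS's real
// write path; Replica and the error text here are illustrative only.
package main

import (
	"errors"
	"fmt"
)

type Replica struct {
	ID    string
	Alive bool
}

// quorumWrite succeeds when a strict majority of replicas acknowledge the
// write, so a 3-replica partition tolerates one failed node.
func quorumWrite(replicas []Replica, data []byte) error {
	acks := 0
	for _, r := range replicas {
		if r.Alive { // in reality, an RPC to the replica
			acks++
		}
	}
	if acks*2 <= len(replicas) {
		return errors.New("no quorum: write rejected")
	}
	return nil
}

func main() {
	partition := []Replica{
		{ID: "dn-1", Alive: true},
		{ID: "dn-2", Alive: true},
		{ID: "dn-3", Alive: false}, // one node offline
	}
	if err := quorumWrite(partition, []byte("hello")); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("write committed with 2/3 acks")
}
```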

Zorlin (Author) commented Apr 29, 2024

> @Zorlin I don't know if you have any practical experience with this, but the problem should not exist. If one out of five nodes is faulty, writes can still continue without affecting users: new data is appended to a new node, and modifications go to the remaining two replicas, which still have a leader, so everything is okay.

Hi, agreed, it shouldn't be a problem, but in my cluster with 5 datanodes and 5 metanodes, when I am writing to the filesystem and take a node offline, everything "goes haywire" and I cannot write again until I bring that node back online.

NaturalSelect (Collaborator) commented
> > @Zorlin I don't know if you have any practical experience with this, but the problem should not exist. If one out of five nodes is faulty, writes can still continue without affecting users: new data is appended to a new node, and modifications go to the remaining two replicas, which still have a leader, so everything is okay.
>
> Hi, agreed, it shouldn't be a problem, but in my cluster with 5 datanodes and 5 metanodes, when I am writing to the filesystem and take a node offline, everything "goes haywire" and I cannot write again until I bring that node back online.

How many data partitions does your volume have?
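(If I remember the tooling correctly, `cfs-cli volume info <volName>` should report the volume's data partition count; treat the exact invocation as approximate.)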
