
chunkserver: High speed rebalance blocks deletions? #544

Open
onlyjob opened this issue Jun 23, 2023 · 7 comments

onlyjob (Contributor) commented Jun 23, 2023

A chunkserver re-balancing its disks in high-speed mode (HDD_HIGH_SPEED_REBALANCE_LIMIT > 0) is not deleting any chunks. This seems counterproductive: if deletions were processed first, there would be fewer chunks to move.
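
For reference, a minimal sketch of the chunkserver setting in question (the value is just an example, not necessarily the exact config):

```
# mfschunkserver.cfg (sketch)
# Any value > 0 enables high-speed internal rebalance; as I understand it,
# the number limits how many rebalance operations may run in parallel.
HDD_HIGH_SPEED_REBALANCE_LIMIT = 3
```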

chogata (Member) commented Jun 26, 2023

MooseFS should be deleting chunks even when in high speed rebalance. Possible reasons for not deleting chunks in the system include:

  • a disconnected chunk server in maintenance mode
  • a connecting/disconnecting chunk server that is still in the chunk registration/de-registration phase (be aware that de-registration is done by the master, so even if the chunk server process is already shut down, it may take the master a couple of minutes to finish if the chunk server had a lot of chunks to begin with)
  • an operation limit is reached: either the deletion limit is set very low, so you don't see the effect of deletions because new chunks ready for deletion keep appearing, or the overall number of other jobs is so high that there are not enough resources left for deletions (deletions are quite low on the priority scale) (sketch below)
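
To illustrate the deletion-limit point, a rough sketch of the relevant mfsmaster.cfg entries (the values shown are the usual defaults, but check the man page for your version):

```
# mfsmaster.cfg (sketch; defaults from memory, verify against your version)
# soft maximum number of chunk deletions per second on one chunkserver
CHUNKS_SOFT_DEL_LIMIT = 10
# hard maximum number of chunk deletions per second on one chunkserver
CHUNKS_HARD_DEL_LIMIT = 25
```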

Are you sure none of the above applies in your case? If you are, please tell us the version of MooseFS you are using and any config settings that differ from the defaults (you can omit custom instance name and custom pathnames).

onlyjob (Contributor, Author) commented Jun 28, 2023

Yes, absolutely sure. Latest MooseFS release (3.0.117). One disk is marked for rebalance with < in preparation for its removal, to move its chunks to the other disks.
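
To illustrate (paths are placeholders, not the real ones), the setup on that chunkserver is essentially:

```
# mfshdd.cfg (sketch)
/mnt/hdd1
/mnt/hdd2
# '<' = move all chunks from this disk to this chunkserver's other disks
</mnt/hdd3
```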

chogata (Member) commented Jun 30, 2023

Is the one disk marked with < in the same chunk server that has high speed rebalance on, or in another one? You wrote "a chunkserver ... is not deleting any chunks", but just to make sure: do you have only one chunk server in high speed rebalance, or more? If more, how many? And how many chunk servers are there in total in this instance?

I want to re-create your setup in our lab and run tests.

onlyjob (Contributor, Author) commented Jul 1, 2023

The same chunkserver, obviously. Only one chunkserver is in high-speed rebalance -- the very chunkserver that is not deleting chunks (until rebalance is finished).

A dozen chunkservers total, but only one is in active high-speed rebalance mode: one of its HDDs is marked < to empty it by relocating its chunks to its other HDDs, together with HDD_HIGH_SPEED_REBALANCE_LIMIT = 3.

The total number of chunkservers hardly matters, as long as there is more than one...

The chunkserver in question is busy: its load is highlighted as <N> in the "Servers" view, and the corresponding "Server Chart" indicates internal high-speed rebalance.

inkdot7 commented Aug 15, 2023

Hi, a barely related question:

If one HDD (or more) is marked < to empty it, why does that particular chunkserver carry more traffic than the others? I would have assumed that MooseFS would copy the chunks from replicas on other chunkservers, to parallelise the rebalance operation and finish it sooner?

chogata (Member) commented Aug 17, 2023

@inkdot7 I'm not 100% sure I understand your question, but maybe this will explain: the master is not aware of a chunk server's internal goings-on, and that includes internal disk rebalance. To the master it's just a normal chunk server. If a chunk server is in high speed rebalance mode and the high speed "tempo" is high, the chunk server may tell the master that it is overloaded, and the master will try not to send it any tasks for a while.

The only thing the master is aware of is disks marked for removal, but that is the * designation in mfshdd.cfg, not < or >. Marked for removal is different in that the disk is considered damaged in some way, so replications try not to use it at all: if a chunk needs to be replicated because it is on an MFR disk, but a copy of this chunk also exists elsewhere, that other copy will be used as the source of the replication.

MFR (*) is for replicating data off an endangered disk; the whole instance takes part in that, and the disk itself is spared I/O whenever possible. Internal rebalance, whether "organic" or forced using < and/or >, is done 100% internally; the rest of the system doesn't know and doesn't care about it.
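
A sketch of the difference in mfshdd.cfg terms (hypothetical paths):

```
# mfshdd.cfg (sketch)
# '*' = marked for removal: the master re-replicates the chunks across the
#       whole instance, preferring other copies as replication sources
*/mnt/hdd1
# '<' = internal rebalance source: this chunkserver moves chunks from this
#       disk onto its own other disks; the master is not involved
</mnt/hdd2
# '>' = internal rebalance destination: this chunkserver preferentially
#       moves chunks onto this disk
>/mnt/hdd3
```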

chogata (Member) commented Aug 17, 2023

@onlyjob I tried to replicate your issue, but I can't: my instance deleted all the unnecessary chunks while one of the chunk servers was in high speed rebalance mode. I want to test again on a larger scale (the instance I used for testing was small, and the internal rebalance on the one chunk server completed in minutes), but I need to wait for the completion of another test we are currently running before I can do that.
