[Bug]: CubeFS doesn't seem to clean up datapartition issues quickly #3101

Open
1 task done
Zorlin opened this issue Feb 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@Zorlin

Zorlin commented Feb 19, 2024

Contact Details

[email protected]

Is there an existing issue for this?

  • I have searched all the existing issues

Priority

low (Default)

Environment

- CubeFS version: v3.3.1 w/Authnode patches
- Deployment mode(docker or standalone or cluster): multiple bare metal nodes
- Dependent components: All
- OS kernel version(Ubuntu or CentOS): Debian 12
- CPU/Memory: At least 6 cores, 12GiB RAM per node
- Others:

Current Behavior

After using the cluster for a while, with datanodes coming and going, I have a bunch of unavailable replicas and other issues:

wings:~/ $ cfs-cli datapartition check                                                                                                      [11:53:07]
[Inactive Data nodes]:
ID        ZONE      ADDRESS                                                              USED      TOTAL     STATUS    REPORT TIME

[Corrupt data partitions](no leader):
ID          VOLUME      REPLICAS       STATUS          MEMBERS           

[Partition lack replicas]:
ID          VOLUME      REPLICAS       STATUS          MEMBERS           

[Bad data partitions(decommission not completed)]:
PATH        PARTITION ID

[Partition has unavailable replica]:
DP_ID       VOLUME      REPLICAS    DP_STATUS    MEMBERS                     UNAVAILABLE_REPLICAS    
615         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
616         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
617         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
618         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
619         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.16:17310]      
620         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.16:17310]      
621         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
622         pbs.per     3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.16:17310]      
623         pbs.per     3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.16:17310]      
624         pbs.per     3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.16:17310]      
625         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.16:17310]      
626         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.16:17310]      
627         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
628         media       3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]      
629         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.16:17310]      
630         media       3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.16:17310]      
633         media       3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
634         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
635         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
636         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
637         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
638         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
639         media       3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
640         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
641         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
642         media       3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
643         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
644         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
645         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
646         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
647         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
648         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
649         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
650         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
651         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
652         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
653         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
654         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
655         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
656         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
657         media       3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
658         media       3           Read only    [10.0.20.14:17310, 10.0.20.16:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
659         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
660         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
661         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
662         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
663         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
664         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
665         media       3           Read only    [10.0.20.15:17310, 10.0.20.16:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
666         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
667         media       3           Read only    [10.0.20.15:17310, 10.0.20.14:17310, 10.0.20.16:17310]    [10.0.20.15:17310]      
668         media       3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      
669         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]      
670         media       3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.15:17310]      

[Partition with replica file count differ significantly]:
DP_ID       VOLUME      REPLICAS    DP_STATUS    MEMBERS(fileCount)      
113         sia.per     3           Writable    [10.0.20.14:17310(546),10.0.20.18:17310(544),10.0.20.15:17310(546 isLeader)]
121         sia.per     3           Writable    [10.0.20.15:17310(506 isLeader),10.0.20.18:17310(506),10.0.20.14:17310(506)]
122         sia.per     3           Writable    [10.0.20.15:17310(505 isLeader),10.0.20.18:17310(503),10.0.20.14:17310(505)]
125         sia.per     3           Writable    [10.0.20.15:17310(517 isLeader),10.0.20.18:17310(516),10.0.20.14:17310(517)]
133         sia.per     3           Writable    [10.0.20.14:17310(466),10.0.20.18:17310(465),10.0.20.15:17310(466 isLeader)]
134         sia.per     3           Writable    [10.0.20.14:17310(476),10.0.20.15:17310(476 isLeader),10.0.20.18:17310(475)]
136         sia.per     3           Writable    [10.0.20.14:17310(430),10.0.20.18:17310(429),10.0.20.15:17310(430 isLeader)]
141         sia.per     3           Writable    [10.0.20.16:17310(349 isLeader),10.0.20.14:17310(349),10.0.20.18:17310(348)]
157         sia.per     3           Writable    [10.0.20.15:17310(491 isLeader),10.0.20.18:17310(490),10.0.20.14:17310(491)]
161         sia.per     3           Writable    [10.0.20.16:17310(378 isLeader),10.0.20.18:17310(376),10.0.20.14:17310(378)]
162         sia.per     3           Writable    [10.0.20.14:17310(473),10.0.20.18:17310(472),10.0.20.15:17310(473 isLeader)]
164         sia.per     3           Writable    [10.0.20.15:17310(454 isLeader),10.0.20.18:17310(451),10.0.20.14:17310(454)]
174         sia.per     3           Writable    [10.0.20.14:17310(537),10.0.20.18:17310(536),10.0.20.15:17310(537 isLeader)]
180         sia.per     3           Writable    [10.0.20.15:17310(530 isLeader),10.0.20.14:17310(530),10.0.20.18:17310(529)]
181         sia.per     3           Writable    [10.0.20.15:17310(499 isLeader),10.0.20.14:17310(499),10.0.20.18:17310(498)]
184         sia.per     3           Writable    [10.0.20.15:17310(551 isLeader),10.0.20.14:17310(551),10.0.20.18:17310(549)]
187         sia.per     3           Writable    [10.0.20.15:17310(514 isLeader),10.0.20.18:17310(513),10.0.20.14:17310(514)]
188         sia.per     3           Writable    [10.0.20.15:17310(480),10.0.20.14:17310(479),10.0.20.16:17310(480 isLeader)]
189         sia.per     3           Writable    [10.0.20.15:17310(500),10.0.20.18:17310(500 isLeader),10.0.20.14:17310(499)]
192         sia.per     3           Writable    [10.0.20.15:17310(540 isLeader),10.0.20.14:17310(540),10.0.20.18:17310(539)]
194         sia.per     3           Writable    [10.0.20.16:17310(407 isLeader),10.0.20.18:17310(406),10.0.20.14:17310(406)]
199         sia.per     3           Writable    [10.0.20.15:17310(508 isLeader),10.0.20.18:17310(508),10.0.20.14:17310(507)]
201         sia.per     3           Writable    [10.0.20.14:17310(500),10.0.20.15:17310(500 isLeader),10.0.20.18:17310(499)]
203         sia.per     3           Writable    [10.0.20.18:17310(372),10.0.20.16:17310(373 isLeader),10.0.20.14:17310(372)]
204         sia.per     3           Writable    [10.0.20.15:17310(471 isLeader),10.0.20.18:17310(470),10.0.20.14:17310(470)]
210         sia.per     3           Writable    [10.0.20.14:17310(443),10.0.20.18:17310(441),10.0.20.15:17310(443 isLeader)]
211         sia.per     3           Writable    [10.0.20.14:17310(506),10.0.20.15:17310(506 isLeader),10.0.20.18:17310(505)]
272         sia.per     3           Writable    [10.0.20.16:17310(389 isLeader),10.0.20.18:17310(388),10.0.20.15:17310(389)]
276         sia.per     3           Writable    [10.0.20.15:17310(371),10.0.20.18:17310(370),10.0.20.16:17310(371 isLeader)]
277         sia.per     3           Writable    [10.0.20.15:17310(381),10.0.20.16:17310(381 isLeader),10.0.20.18:17310(380)]
278         sia.per     3           Writable    [10.0.20.16:17310(445 isLeader),10.0.20.18:17310(444),10.0.20.15:17310(445)]
279         sia.per     3           Writable    [10.0.20.15:17310(344),10.0.20.16:17310(344 isLeader),10.0.20.18:17310(341)]
283         sia.per     3           Writable    [10.0.20.15:17310(353),10.0.20.18:17310(351),10.0.20.16:17310(353 isLeader)]
284         sia.per     3           Writable    [10.0.20.16:17310(334 isLeader),10.0.20.18:17310(332),10.0.20.15:17310(334)]
286         sia.per     3           Writable    [10.0.20.15:17310(369),10.0.20.18:17310(368),10.0.20.16:17310(369 isLeader)]
290         sia.per     3           Writable    [10.0.20.15:17310(381),10.0.20.16:17310(381 isLeader),10.0.20.18:17310(379)]
294         sia.per     3           Writable    [10.0.20.15:17310(372),10.0.20.16:17310(372 isLeader),10.0.20.18:17310(370)]
297         sia.per     3           Writable    [10.0.20.15:17310(331),10.0.20.16:17310(331 isLeader),10.0.20.18:17310(330)]
304         sia.per     3           Writable    [10.0.20.15:17310(436),10.0.20.18:17310(435),10.0.20.16:17310(436 isLeader)]
305         sia.per     3           Writable    [10.0.20.15:17310(396),10.0.20.18:17310(396),10.0.20.16:17310(396 isLeader)]
306         sia.per     3           Writable    [10.0.20.15:17310(421),10.0.20.18:17310(419),10.0.20.16:17310(421 isLeader)]
307         sia.per     3           Writable    [10.0.20.15:17310(411),10.0.20.16:17310(411 isLeader),10.0.20.18:17310(409)]
308         sia.per     3           Writable    [10.0.20.15:17310(413),10.0.20.18:17310(411),10.0.20.16:17310(413 isLeader)]
309         sia.per     3           Writable    [10.0.20.15:17310(406),10.0.20.18:17310(405),10.0.20.16:17310(406 isLeader)]
321         sia.per     3           Writable    [10.0.20.16:17310(428 isLeader),10.0.20.18:17310(426),10.0.20.15:17310(428)]
322         sia.per     3           Writable    [10.0.20.16:17310(436 isLeader),10.0.20.18:17310(434),10.0.20.15:17310(436)]
331         sia.per     3           Writable    [10.0.20.15:17310(344),10.0.20.16:17310(344 isLeader),10.0.20.18:17310(343)]
335         sia.per     3           Writable    [10.0.20.16:17310(332 isLeader),10.0.20.18:17310(331),10.0.20.15:17310(332)]
336         sia.per     3           Writable    [10.0.20.16:17310(345 isLeader),10.0.20.15:17310(345),10.0.20.18:17310(343)]
340         sia.per     3           Writable    [10.0.20.16:17310(422 isLeader),10.0.20.18:17310(420),10.0.20.15:17310(422)]
345         sia.per     3           Writable    [10.0.20.15:17310(362),10.0.20.18:17310(361),10.0.20.16:17310(362 isLeader)]
348         sia.per     3           Writable    [10.0.20.15:17310(374),10.0.20.18:17310(373),10.0.20.16:17310(374 isLeader)]
351         sia.per     3           Writable    [10.0.20.15:17310(400),10.0.20.18:17310(399),10.0.20.16:17310(400 isLeader)]
352         sia.per     3           Writable    [10.0.20.15:17310(356),10.0.20.18:17310(354),10.0.20.16:17310(356 isLeader)]
354         sia.per     3           Writable    [10.0.20.15:17310(381),10.0.20.18:17310(380),10.0.20.16:17310(381 isLeader)]
355         sia.per     3           Writable    [10.0.20.15:17310(348),10.0.20.18:17310(346),10.0.20.16:17310(348 isLeader)]
357         sia.per     3           Writable    [10.0.20.15:17310(402),10.0.20.18:17310(400),10.0.20.16:17310(402 isLeader)]
359         sia.per     3           Writable    [10.0.20.15:17310(413),10.0.20.18:17310(413),10.0.20.16:17310(413 isLeader)]
361         sia.per     3           Writable    [10.0.20.15:17310(423),10.0.20.16:17310(423 isLeader),10.0.20.18:17310(422)]
362         sia.per     3           Writable    [10.0.20.15:17310(380),10.0.20.18:17310(379),10.0.20.16:17310(380 isLeader)]
365         sia.per     3           Writable    [10.0.20.15:17310(369),10.0.20.18:17310(367),10.0.20.16:17310(369 isLeader)]
488         pbs.per     3           Writable    [10.0.20.15:17310(160 isLeader),10.0.20.18:17310(158),10.0.20.14:17310(160)]
547         sia.per     3           Writable    [10.0.20.16:17310(255 isLeader),10.0.20.14:17310(255),10.0.20.18:17310(254)]
554         sia.per     3           Writable    [10.0.20.14:17310(255),10.0.20.18:17310(253),10.0.20.16:17310(255 isLeader)]
563         sia.per     3           Writable    [10.0.20.14:17310(276),10.0.20.18:17310(274),10.0.20.16:17310(276 isLeader)]

[Partition with replica used size differ significantly]:
DP_ID       VOLUME      REPLICAS    DP_STATUS    MEMBERS(usedSize)       

[Partition with excessive replicas]:
ID          VOLUME      REPLICAS       STATUS          MEMBERS           

I was wondering what the fastest way to clean these up is, and why CubeFS doesn't seem to clean them up on its own.
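To see which datanodes account for the unavailable replicas, the check output can be tallied per address. The following is only a sketch: it assumes the row layout shown in the "[Partition has unavailable replica]" table above, and the names `ROW`, `parse_unavailable`, `tally_by_node`, and `SAMPLE` are hypothetical helpers, not part of CubeFS.

```python
import re
from collections import Counter

# Matches one row of the "[Partition has unavailable replica]" table:
# DP_ID, VOLUME, REPLICAS, DP_STATUS, [MEMBERS], [UNAVAILABLE_REPLICAS].
# The layout is assumed from the output pasted in this issue.
ROW = re.compile(
    r"^(?P<dp_id>\d+)\s+(?P<volume>\S+)\s+(?P<replicas>\d+)\s+"
    r"(?P<status>.+?)\s+\[(?P<members>[^\]]*)\]\s+\[(?P<unavailable>[^\]]*)\]\s*$"
)


def parse_unavailable(text):
    """Return (dp_id, volume, [unavailable addresses]) for each matching row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # headers, blank lines, and rows from other tables
        addrs = [a.strip() for a in m.group("unavailable").split(",") if a.strip()]
        rows.append((int(m.group("dp_id")), m.group("volume"), addrs))
    return rows


def tally_by_node(rows):
    """Count how many partitions list each address as unavailable."""
    counts = Counter()
    for _, _, addrs in rows:
        counts.update(addrs)
    return counts


# Three rows copied from the output above, used as sample input.
SAMPLE = """\
[Partition has unavailable replica]:
DP_ID       VOLUME      REPLICAS    DP_STATUS    MEMBERS                     UNAVAILABLE_REPLICAS
615         pbs.per     3           Read only    [10.0.20.16:17310, 10.0.20.14:17310, 10.0.20.15:17310]    [10.0.20.16:17310]
633         media       3           Read only    [10.0.20.14:17310, 10.0.20.15:17310, 10.0.20.16:17310]    [10.0.20.15:17310]
634         media       3           Read only    [10.0.20.16:17310, 10.0.20.15:17310, 10.0.20.14:17310]    [10.0.20.15:17310]
"""

if __name__ == "__main__":
    for addr, n in tally_by_node(parse_unavailable(SAMPLE)).most_common():
        print(f"{addr}: {n} partition(s) with an unavailable replica")
```

Counting the rows in the full listing above by hand gives 16 partitions with an unavailable replica on 10.0.20.16:17310 (IDs 615 to 630) and 38 on 10.0.20.15:17310 (IDs 633 to 670), so the problem appears to be concentrated on two datanodes.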

Expected Behavior

CubeFS should manage recovery and rebalancing automatically.

Steps To Reproduce

Not sure, sorry

CubeFS Log

No response

Anything else? (Additional Context)

No response

@Zorlin added the bug label on Feb 19, 2024
@Victor1319
Member

CubeFS currently does not handle these abnormal partitions automatically; you need to take them offline manually. Please refer to the user documentation for instructions, e.g.: cfs-cli datapartition decommission xxx xx
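Since the table in the report lists dozens of partitions, it may help to generate the decommission commands rather than type them one by one. The sketch below only prints commands for review and does not execute anything. It assumes the two placeholders in the reply stand for the replica address followed by the data partition ID; the reply does not say so, so check the order with `cfs-cli datapartition decommission --help` first. The function name `decommission_commands` and the sample pairs are illustrative.

```python
import shlex


def decommission_commands(targets, cli="cfs-cli"):
    """Build one decommission command per (address, partition_id) pair.

    ASSUMPTION: the argument order is <address> <partition id>. The reply
    in this thread only shows placeholders ("xxx xx"), so confirm the
    order against the CLI help before running anything.
    """
    cmds = []
    for address, dp_id in targets:
        cmds.append(
            shlex.join([cli, "datapartition", "decommission", address, str(dp_id)])
        )
    return cmds


if __name__ == "__main__":
    # Pairs taken from the "unavailable replica" table in the report.
    targets = [("10.0.20.16:17310", 615), ("10.0.20.15:17310", 633)]
    for cmd in decommission_commands(targets):
        print(cmd)  # dry run: review the list before executing it
```

Printing instead of executing is deliberate: decommissioning moves a replica to another node, so it seems safer to review the list and run the commands in small batches while watching `cfs-cli datapartition check`.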
