Shut down Volume Server due to duplicate volume directories #5439

Open
dyeldandi opened this issue Mar 30, 2024 · 8 comments

@dyeldandi
Contributor

Describe the bug
Occasionally (about once every couple of weeks) the volume server shuts down with the following log messages:

I0330 07:05:18.266249 volume_grpc_client_to_master.go:71 heartbeat to MASTER_IP_ADDRESS:9333 error: rpc error: code = Unavailable desc = error reading from server: EOF
I0330 07:05:23.582099 volume_grpc_client_to_master.go:109 Heartbeat to: MASTER_IP_ADDRESS:9333
E0330 07:05:25.166441 volume_grpc_client_to_master.go:130 Shut down Volume Server due to duplicate volume directories: [/VOLUME/DIRECTORY]

At the same time, the master server log shows:

I0329 14:45:04.328856 master_grpc_server.go:138 added volume server 0: VOLUMESERVER1_IP_ADDRESS:8071 [9322c3e6-f8e4-4a53-9129-902e5b24bdb3]
I0329 14:45:04.329442 master_grpc_server.go:49 found new uuid:VOLUMESERVER1_IP_ADDRESS:8071 [9322c3e6-f8e4-4a53-9129-902e5b24bdb3] , map[VOLUMESERVER2_IP_ADDRESS:8071:[a9ced06f-a886-40d9-942e-73bf40cf5d74] VOLUMESERVER1_IP_ADDRESS:8071:[9322c3e6-f8e4-4a53-9129-902e5b24bdb3]]
I0329 14:45:04.332164 volume_layout.go:396 Volume 773 becomes writable

...
I0329 14:45:04.338761 master_grpc_server.go:199 master see new volume 820 from VOLUMESERVER1_IP_ADDRESS:8071
I0330 07:05:25.135336 master_grpc_server.go:138 added volume server 1: VOLUMESERVER1_IP_ADDRESS:8071 [9322c3e6-f8e4-4a53-9129-902e5b24bdb3]
E0330 07:05:25.136146 master_grpc_server.go:40 directory of 9322c3e6-f8e4-4a53-9129-902e5b24bdb3 on VOLUMESERVER1_IP_ADDRESS:8071 has been loaded
I0330 07:05:25.152003 node.go:262 topo:DATACENTER:rack1 removes VOLUMESERVER1_IP_ADDRESS:8071
I0330 07:05:25.152260 master_grpc_server.go:87 unregister disconnected volume server VOLUMESERVER1_IP_ADDRESS:8071
I0330 07:05:25.152269 master_grpc_server.go:58 remove volume server VOLUMESERVER1_IP_ADDRESS:8071, online volume server: map[VOLUMESERVER2_IP_ADDRESS:8071:[a9ced06f-a886-40d9-942e-73bf40cf5d74]]
W0330 07:05:26.588397 master_grpc_server.go:100 SendHeartbeat.Recv server VOLUMESERVER1_IP_ADDRESS:8071 : rpc error: code = Canceled desc = context canceled
I0330 07:05:26.588485 master_grpc_server.go:87 unregister disconnected volume server VOLUMESERVER1_IP_ADDRESS:8071
I0330 07:05:26.588501 master_grpc_server.go:58 remove volume server VOLUMESERVER1_IP_ADDRESS:8071, online volume server: map[VOLUMESERVER2_IP_ADDRESS:8071:[a9ced06f-a886-40d9-942e-73bf40cf5d74]]


System Setup

  • List the command line to start "weed master", "weed volume", "weed filer", "weed s3", "weed mount".
/opt/weed-3.58/bin/weed -logdir=/var/log/weed-volume-3.58/ volume -dataCenter=DATACENTER -rack rack1 -mserver=MASTER_IP_ADDRESS:9333 -max=2400 -ip=VOLUMESERVER1_IP_ADDRESS -dir=/VOLUME/DIRECTORY -port=8071 -publicUrl=PUBLIC_URL
/opt/weed-3.58/bin/weed -logdir=/var/log/weed-3.58/ master -defaultReplication=001 -mdir=/MASTER/DIRECTORY -ip=MASTER_IP_ADDRESS -port=9333 -ip.bind=0.0.0.0
  • OS version
    CentOS 7

  • output of weed version
    version 8000GB 3.58 d1e83a3 linux amd64

  • if using filer, show the content of filer.toml
    not using filer

Expected behavior
The volume server should not shut down.

Additional context
The network connection between the master server and the volume servers is sometimes unstable.

@kmlebedev
Contributor

kmlebedev commented Mar 30, 2024

See #3059

@dyeldandi
Contributor Author

@kmlebedev thanks, but AFAIU #3630 has already been merged into the version I'm running, so this is probably a different issue?

@kmlebedev
Contributor

> @kmlebedev thanks, but AFAIU #3630 has already been merged into the version I'm running, so this is probably a different issue?

You need to make sure that the contents of the vol_dir.uuid files are different and that you have not manually copied these files from one directory to another.
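
The check described above can be scripted. The following is a minimal sketch, assuming each volume directory holds a single `vol_dir.uuid` file with the UUID as plain text (the file name comes from this thread; the function name and the example path are hypothetical). It only compares directories visible on one host, so for servers on different machines it would have to be run per host and the results compared by hand.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_uuids(volume_dirs, uuid_filename="vol_dir.uuid"):
    """Return {uuid: [dirs...]} for every UUID found in more than one directory."""
    dirs_by_uuid = defaultdict(list)
    for directory in volume_dirs:
        uuid_file = Path(directory) / uuid_filename
        if not uuid_file.is_file():
            # No UUID file in this directory, so there is nothing to compare.
            continue
        uuid = uuid_file.read_text().strip()
        dirs_by_uuid[uuid].append(str(directory))
    # Keep only UUIDs that appear in two or more directories.
    return {u: d for u, d in dirs_by_uuid.items() if len(d) > 1}


# Hypothetical usage (paths are placeholders, not taken from a real setup):
# print(find_duplicate_uuids(["/VOLUME/DIRECTORY", "/OTHER/VOLUME/DIRECTORY"]))
```

An empty result means no two of the listed directories share a UUID.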

@dyeldandi
Contributor Author

Yeah, they are definitely different.
1st is 9322c3e6-f8e4-4a53-9129-902e5b24bdb3
2nd is a9ced06f-a886-40d9-942e-73bf40cf5d74
The volume servers run fine for a week or two, then one of them suddenly shuts down.

@kmlebedev
Contributor

It may be that a second instance of weed volume is accidentally launched at that time, using the same folder that contains the vol_dir.uuid file.

@dyeldandi
Contributor Author

I really doubt it. Nobody had logged in to that server for a few months before it crashed, and nothing in the logs indicates that this happened. Besides, every time it crashes like this, there is a failed heartbeat just a few seconds before, e.g.:

I0330 07:05:18.266249 volume_grpc_client_to_master.go:71 heartbeat to MASTER_IP_ADDRESS:9333 error: rpc error: code = Unavailable desc = error reading from server: EOF
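
To check whether every shutdown really follows a failed heartbeat, the gap can be measured from the volume server log. This is a rough sketch, assuming the log lines have the exact shape quoted in this thread (level letter, `MMDD`, time with microseconds, `file:line`, message). The function names are hypothetical, and the year must be supplied because the log lines do not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
# "E0330 07:05:25.166441 volume_grpc_client_to_master.go:130 Shut down ..."
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<file>\S+):(?P<line>\d+) (?P<msg>.*)$"
)


def parse_line(line, year=2024):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(
        f"{year}-{m['month']}-{m['day']} {m['time']}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return ts, m["level"], m["msg"]


def heartbeat_error_to_shutdown_gaps(lines, year=2024):
    """Seconds between each shutdown and the last heartbeat error before it."""
    last_error = None
    gaps = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if msg.startswith("heartbeat to") and "error" in msg:
            last_error = ts
        elif msg.startswith("Shut down Volume Server") and last_error is not None:
            gaps.append((ts - last_error).total_seconds())
            last_error = None
    return gaps
```

Shutdowns with no preceding heartbeat error are skipped, so comparing the number of gaps with the number of shutdown lines shows whether the pattern holds every time.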

@kmlebedev
Contributor

kmlebedev commented Mar 30, 2024

The heartbeat fails when the volume server is already in the process of stopping.
Look in the master server logs for one of the messages below:

volume: Duplicated volume directories were loaded
directory of %s on %s has been loaded

@dyeldandi
Contributor Author

There is no such message in the logs (I checked both the volume and master servers).
Only this one, from the volume server, mentions duplicates:

E0330 07:05:25.166441 volume_grpc_client_to_master.go:130 Shut down Volume Server due to duplicate volume directories: [/VOLUME/DIRECTORY]

And the master reports that the directory has been loaded, without mentioning duplicated volumes:

I0330 07:05:25.135336 master_grpc_server.go:138 added volume server 1: VOLUMESERVER1_IP_ADDRESS:8071 [9322c3e6-f8e4-4a53-9129-902e5b24bdb3]
E0330 07:05:25.136146 master_grpc_server.go:40 directory of 9322c3e6-f8e4-4a53-9129-902e5b24bdb3 on VOLUMESERVER1_IP_ADDRESS:8071 has been loaded
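
One possible reading of these two lines, offered only as an assumption and not confirmed anywhere in this thread, is that the volume server reconnected while the master still held its earlier registration, so the same UUID looked like a duplicate. The toy model below is not SeaweedFS code. The class and method names are hypothetical, and only the error text reuses the `directory of %s on %s has been loaded` format quoted above. It shows how a registry that rejects already-loaded UUIDs behaves when a re-registration arrives before the stale entry is removed.

```python
class DuplicateUuidError(Exception):
    """Raised when a directory UUID is already registered."""


class UuidRegistry:
    """Toy stand-in for a master that tracks directory UUIDs per volume server."""

    def __init__(self):
        self._uuids_by_addr = {}

    def register(self, addr, uuids):
        # Collect every UUID currently known, including any stale entry
        # left behind by the same address.
        loaded = {u for known in self._uuids_by_addr.values() for u in known}
        for u in uuids:
            if u in loaded:
                raise DuplicateUuidError(
                    f"directory of {u} on {addr} has been loaded"
                )
        self._uuids_by_addr[addr] = list(uuids)

    def unregister(self, addr):
        # Drop the address and its UUIDs; a later register() then succeeds.
        self._uuids_by_addr.pop(addr, None)


registry = UuidRegistry()
registry.register("VOLUMESERVER1:8071", ["9322c3e6-f8e4-4a53-9129-902e5b24bdb3"])
try:
    # Reconnect arrives before the old registration was removed.
    registry.register("VOLUMESERVER1:8071", ["9322c3e6-f8e4-4a53-9129-902e5b24bdb3"])
except DuplicateUuidError as err:
    print(err)
```

In this model the same re-registration succeeds once `unregister` has run for that address, which is why the order of the two events matters.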
