My master server stopped running due to memory issues and reported "can't find metadata.mfs" during startup. After restarting with a backup copy of metadata.mfs, it started successfully and clients could mount. However, after some time, I observed the following errors on the master server:
Jan 31 16:50:59 mfsmaster-10 mfsmaster[7436]: connection with 10.0.0.1:9422 timed out
Jan 31 16:50:59 mfsmaster-10 mfsmaster[7436]: chunkserver disconnected - ip: 10.0.0.1 / port: 9422, usedspace: (1020.59 GiB), totalspace: 1003072294912 (1500.18 GiB)
Jan 31 16:51:00 mfsmaster-10 mfsmaster[7436]: connection with 10.0.0.1:9422 timed out
and the chunkserver logs show:
connection failed, error: ECONNREFUSED (Connection refused)
Jan 31 16:51:28 file-10 mfschunkserver[975]: connecting
Jan 31 16:51:28 file-10 mfschunkserver[975]: connected to Master
There are two issues:
1. Are the above timeouts caused by the timeout parameter in the configuration file, or are they an error due to too many files?
2. Occasionally, the master node reports: "chunk 0000000012A232CD_00000001: there are no copies." Are there corrupt chunks, and how can they be recovered?
Connection timed out errors have no direct relation to the number of files stored on your MooseFS instance. However, if you have a lot of files and your master starts using swap because it has run out of physical RAM, this can have an impact and in some cases cause timeouts. The first thing to check would be the swap usage of your server. Also, if the server running your master module hosts other processes that use RAM, those processes may be causing your issues by forcing the master into swap.
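As a quick check, something like this shows whether the master process itself has been pushed into swap (a minimal sketch; the pgrep pattern is an assumption, match it to your binary name):

```sh
# Overall memory/swap picture on the master host
free -h

# RSS and swap used by the mfsmaster process itself
MASTER_PID=$(pgrep -x mfsmaster | head -n1)   # adjust pattern if needed
grep -E 'VmRSS|VmSwap' /proc/"$MASTER_PID"/status
```

If VmSwap is non-zero while the timeouts occur, swap pressure is the likely culprit.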
This message means that a chunk (a part of a file) is completely missing from the system. There can be two reasons for that. First, you had some kind of hardware failure and the chunk really went missing, but it would be suspicious if all copies went missing at the same time, unless all your chunk servers run on the same physical machine or you use the (not recommended) goal 1 to store your data. Second, you wrote that after the memory-related crash you were forced to use backup metadata. How old was the backup? If some files were deleted from MooseFS between the time the backup was saved and the crash, the restored metadata would still contain those files, but their chunks were already deleted from the chunk servers at the time the files themselves were deleted, so the system now considers them missing. Do you see any missing files when you check your CGI (or CLI)?
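If you prefer the command line to the CGI, the CLI can report missing files; a minimal sketch, assuming the mfscli section flags of MooseFS 3 (verify -SMF against mfscli -h on your version):

```sh
# Ask the master for its "missing files" report
# -H: master host; -SMF: missing-files section (flag name assumed)
mfscli -H mfsmaster-10 -SMF
```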
Regarding the first issue, I noticed that swap is currently not in use. When the master had just been restarted, memory usage wasn't high, but there were still continuous reconnection issues. However, memory on this master is indeed somewhat limited, and we plan to upgrade it in the future.
Concerning the second issue, there was a master crash leading to the use of backup metadata, with an approximately 8-hour gap during the incident. Since this occurred in the early morning, it's uncertain whether there were any data insertion or deletion activities. However, based on the monitoring, it seems that the master backup node did not take over successfully after the crash. The architecture is VIP + keepalived + 1 master + 1 master backup + 3 chunk servers + 1 metalogger.
I observed the following logs from the backup node at that time:
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: csdb: found cs using ip:port and csid (10.88.0.30:9422,1)
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.88.0.30 / port: 9422, usedspace: 1118095060992 (1041.31 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: chunkserver (10.88.0.30) has nonexistent chunk (00000000050008D1_00000001), so create it for future deletion
...
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: there are more nonexistent chunks to create - stop logging
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: write to ML(10.88.0.31) error: EPIPE (Broken pipe)
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: created new sessionid:2
Jan 31 00:45:20 mfsmaster-10 mfsmaster[2362]: created new sessionid:3
...
Jan 31 00:45:21 mfsmaster-10 mfsmaster[2362]: csdb: server not found (10.88.0.219:9422,2), add it to database
Jan 31 00:45:21 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.88.0.219 / port: 9422, usedspace: 1118092087296 (1041.30 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:21 mfsmaster-10 mfsmaster[2362]: created new sessionid:18
...
Jan 31 00:45:22 mfsmaster-10 mfsmaster[2362]: csdb: server not found (10.5.39.25:9422,3), add it to database
Jan 31 00:45:22 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.5.39.25 / port: 9422, usedspace: 558988587008 (520.60 GiB), totalspace: 1003072294912 (934.18 GiB)
Jan 31 00:45:24 mfsmaster-10 mfsmaster[2362]: chunkserver disconnected - ip: 10.5.39.25 / port: 9422, usedspace: 558988587008 (520.60 GiB), totalspace: 1003072294912 (934.18 GiB)
Jan 31 00:45:24 mfsmaster-10 mfsmaster[2362]: created new sessionid:31
...
Jan 31 00:45:24 mfsmaster-10 mfsmaster[2362]: server ip: 10.5.39.25 / port: 9422 has been fully removed from data structures
Jan 31 00:45:25 mfsmaster-10 mfsmaster[2362]: created new sessionid:44
...
Jan 31 00:45:25 mfsmaster-10 mfsmaster[2362]: chunkserver disconnected - ip: 10.88.0.30 / port: 9422, usedspace: 1118095060992 (1041.31 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:25 mfsmaster-10 mfsmaster[2362]: chunkserver disconnected - ip: 10.88.0.219 / port: 9422, usedspace: 1118092087296 (1041.30 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: csdb: found cs using ip:port and csid (10.88.0.30:9422,1)
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.88.0.30 / port: 9422, usedspace: 1118095060992 (1041.31 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: csdb: found cs using ip:port and csid (10.88.0.219:9422,2)
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.88.0.219 / port: 9422, usedspace: 1118092087296 (1041.30 GiB), totalspace: 2006278963200 (1868.49 GiB)
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: created new sessionid:54
...
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: server ip: 10.88.0.30 / port: 9422 has been fully removed from data structures
Jan 31 00:45:26 mfsmaster-10 mfsmaster[2362]: created new sessionid:67
...
Jan 31 00:45:27 mfsmaster-10 mfsmaster[2362]: csdb: found cs using ip:port and csid (10.5.39.25:9422,3)
Jan 31 00:45:27 mfsmaster-10 mfsmaster[2362]: chunkserver register begin (packet version: 6) - ip: 10.5.39.25 / port: 9422, usedspace: 558988587008 (520.60 GiB), totalspace: 1003072294912 (934.18 GiB)
Jan 31 00:45:27 mfsmaster-10 mfsmaster[2362]: server ip: 10.88.0.219 / port: 9422 has been fully removed from data structures
Jan 31 00:45:27 mfsmaster-10 mfsmaster[2362]: chunkserver register end (packet version: 6) - ip: 10.88.0.30 / port: 9422
Jan 31 00:45:28 mfsmaster-10 mfsmaster[2362]: created new sessionid:69
...
Jan 31 00:45:28 mfsmaster-10 mfsmaster[2362]: chunkserver register end (packet version: 6) - ip: 10.88.0.219 / port: 9422
Jan 31 00:45:28 mfsmaster-10 mfsmaster[2362]: chunkserver register end (packet version: 6) - ip: 10.5.39.25 / port: 9422
Jan 31 00:45:30 mfsmaster-10 mfsmaster[2362]: created new sessionid:76
...
Jan 31 00:45:31 mfsmaster-10 mfsmaster[2362]: chunk 000000000000116E_00000001: there are no copies
Jan 31 00:45:31 mfsmaster-10 mfsmaster[2362]: chunk 0000000000001546_00000001: there are no copies
The log of the metalogger server at that time is as follows:
Jan 31 00:45:08 metalogger-10 mfsmetalogger[2324]: connection was reset by Master
Jan 31 00:45:10 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:10 metalogger-10 mfsmetalogger[2324]: connection failed, error: ECONNREFUSED (Connection refused)
Jan 31 00:45:15 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:15 metalogger-10 mfsmetalogger[2324]: connection failed, error: ECONNREFUSED (Connection refused)
Jan 31 00:45:20 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:20 metalogger-10 mfsmetalogger[2324]: connected to Master
Jan 31 00:45:20 metalogger-10 mfsmetalogger[2324]: some changes lost: [6829163859-33011], download metadata again
Jan 31 00:45:25 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:26 metalogger-10 mfsmetalogger[2324]: connected to Master
Jan 31 00:45:26 metalogger-10 mfsmetalogger[2324]: metadata downloaded 412354B/0.010055s (41.010 MB/s)
Jan 31 00:45:26 metalogger-10 mfsmetalogger[2324]: meta data version: 23012, meta data id: 0x5A3D042DAFCAB780
Jan 31 00:45:28 metalogger-10 mfsmetalogger[2324]: changelog_0 downloaded 60513348B/2.062308s (29.343 MB/s)
Jan 31 00:45:28 metalogger-10 mfsmetalogger[2324]: changelog_1 downloaded 0B/0.000023s (0.000 MB/s)
Jan 31 00:45:35 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:35 metalogger-10 mfsmetalogger[2324]: connection failed, error: ECONNREFUSED (Connection refused)
Jan 31 00:45:40 metalogger-10 mfsmetalogger[2324]: connecting ...
Jan 31 00:45:40 metalogger-10 mfsmetalogger[2324]: connected to Master
Now, on the chunkserver, I can find files reported as "there are no copies." How can I determine whether such a file or chunk is genuinely lost, and are there any recovery solutions?
Additionally, there is another issue: I occasionally observe the following error message on the master. Does this indicate that the metaloggers' copy of the metadata has not been updated?
Feb 1 07:00:00 mfsmaster-10 mfsmaster[7436]: no metaloggers connected !!!
This message: "Feb 1 07:00:00 mfsmaster-10 mfsmaster[7436]: no metaloggers connected !!!" means that, yes, there are no metaloggers connected, so your metadata is not being backed up by a metalogger server. Do you have a metalogger active? What does it say in its logs?
If a file says "there are no copies", you can try the mfsfileinfo tool to check whether there are wrong-version or invalid chunks that could be manually repaired. But if mfsfileinfo shows that a chunk has no copies, then there are no copies, period. Again, I strongly suspect that any files with no copies of their chunks left are due to the 8-hour metadata rollback: any files deleted during those 8 hours would "reappear", but only in the metadata; their data on the chunk servers would have already been deleted.
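A sketch of that check on a mounted client (paths are placeholders, and the grep pattern reflects my recollection of mfsfileinfo output, so verify it on your version):

```sh
# Show chunk copies for one suspect file; chunks with no valid
# copies are flagged in the output
mfsfileinfo /mnt/mfs/path/to/suspect_file

# Brute-force scan to enumerate all affected files (slow on large trees)
find /mnt/mfs -type f -exec mfsfileinfo {} + | grep -B2 'no valid copies'

# mfsfilerepair can fix wrong-version/invalid copies, but chunks with
# truly no copies are replaced with zeros - only run it if you accept
# losing that data
mfsfilerepair /mnt/mfs/path/to/suspect_file
```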
To avoid metadata rollbacks, you have to make sure your metalogger is always online, connected, and gathering metadata. Use more than one if necessary. Or, if you use MooseFS in an enterprise setting and need your metadata safe and your MooseFS available all the time, contact us about the Pro version.
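To confirm a metalogger is alive and connected, assuming a systemd-based install (the unit name moosefs-metalogger is an assumption; package layouts differ):

```sh
# Is the metalogger process running?
systemctl status moosefs-metalogger

# Did it connect to the master recently?
journalctl -u moosefs-metalogger --since "1 hour ago" | grep -Ei 'connect'
```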