
[BUG] Slowdown after 160k files written #518

Open
Nexus2k opened this issue Dec 23, 2022 · 4 comments


Nexus2k commented Dec 23, 2022

Have you read through available documentation, open Github issues and Github Q&A Discussions?

Yes

System information

Your moosefs version and its origin (moosefs.com, packaged by distro, built from source, ...).

apt install moosefs-* 3.0.116

Operating system (distribution) and kernel version.

Ubuntu 22.04 LTS

Hardware / network configuration, and underlying filesystems on master, chunkservers, and clients.

Mostly 1GbE connections between 22 servers;
20 chunkservers, each with 2x512GB SSD (in software RAID1), ext4 filesystem (non-dedicated disks)

How much data is tracked by moosefs master (order of magnitude)?

4.1 TB in ~160k files

  • All fs objects: 159180
  • Total space: 8.7 TiB
  • Free space: 4.6 TiB
  • RAM used: 503 MiB
  • last metadata save duration: ~0.2s

Describe the problem you observed.

After I had written about 1.2TB of data to my mfs path at /opt/mfs/pub1/, I started writing similar data from another host to /opt/mfs/pub2/. I noticed that the writes from the second host were much slower: the copy of its 1.2TB (the same kind of application data, as mentioned) was still running the next morning. Also, the first server (which happens to be the mfsmaster too) seemingly experienced some data corruption.
Is there chunk deduplication or something similar when there are two similar-looking files?

Can you reproduce it? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, it is still slow for some reason (a quick way to quantify this is sketched after the troubleshooting steps below).

Troubleshooting steps:

  • Restarted master, metalogger
  • Rebooted mfsmaster
  • Restarted chunkservers
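
A minimal sketch of one way to quantify the slowdown from a client, assuming nothing beyond a writable MooseFS mount: time a fixed amount of sequential, non-zero writes and compare the throughput before and after the problem appears. The mount path reuses the one mentioned above; the file name and sizes are arbitrary.

```python
# Hypothetical quick throughput check on a MooseFS mount.
# Write 512 MiB of non-zero data, fsync, and report MiB/s.
import os
import time

MOUNT = "/opt/mfs/pub1"                                    # mount path from this report
TEST_FILE = os.path.join(MOUNT, "write_speed_test.bin")    # arbitrary test file name
BLOCK = b"\xa5" * (1 << 20)                                # 1 MiB of non-zero bytes
BLOCKS = 512                                               # total: 512 MiB

start = time.monotonic()
with open(TEST_FILE, "wb") as f:
    for _ in range(BLOCKS):
        f.write(BLOCK)
    f.flush()
    os.fsync(f.fileno())                                   # make sure data left the page cache
elapsed = time.monotonic() - start

print(f"wrote {BLOCKS} MiB in {elapsed:.1f}s -> {BLOCKS / elapsed:.1f} MiB/s")
os.remove(TEST_FILE)                                       # clean up the test file
```

Running the same script from both writing hosts, before and after the slowdown, would show whether the regression is per-client or cluster-wide.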

Include any warning/errors/backtraces from the system logs.

[screenshot: master server charts]
Something seemingly happened around 08:30 which made the system slow down significantly.

Any help appreciated.

chogata (Member) commented Jan 11, 2023

Do I understand correctly: you first wrote data from one physical machine (machine A) to one path on MooseFS (path X) and then from another machine (machine B) to another path on MooseFS (path Y)? And writing from A to X was much faster than writing from B to Y? If yes, the obvious answer would be that there is some problem with B's connection to your MooseFS instance. Do you have any messages in the logs (all of them: master, chunk servers, client B) about timeouts, disconnections, or a "long loop" message?
The other possibility is, of course, that something happened to your hardware exactly at the time you switched from A to B. Less probable, but still possible. If you look at "Server Charts" -> "time of data write operations", do you see any change?
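
A minimal sketch of one way to scan the logs for the messages mentioned above. It assumes a default Ubuntu syslog location; the daemon and keyword lists are illustrative, not an exhaustive set of MooseFS messages.

```python
# Hypothetical log scan: print syslog lines from MooseFS daemons that mention
# timeouts, disconnections or the "long loop" message.
KEYWORDS = ("timeout", "disconnect", "long loop", "connection lost")
DAEMONS = ("mfsmaster", "mfschunkserver", "mfsmount", "mfsmetalogger")

with open("/var/log/syslog", errors="replace") as log:   # assumed log path
    for line in log:
        low = line.lower()
        if any(d in low for d in DAEMONS) and any(k in low for k in KEYWORDS):
            print(line.rstrip())
```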

Nexus2k (Author) commented Feb 6, 2023

Do I understand correctly: you first wrote data from one physical machine (machine A) to one path on MooseFS (path X) and then from another machine (machine B) to another path on MooseFS (path Y)?

Correct

And writing from A to X was much faster than writing from B to Y? If yes, the obvious answer would be that there is some problem with B's connection to your MooseFS instance.

Both A and B have direct connections (WireGuard VPN) to the chunkservers (if I understand correctly, writes go directly from the writing machine to the chunkservers, right?). Weirdly enough, after B wrote to the MooseFS filesystem, all subsequent writes from A were also rather slow.

The other possibility is, of course, that something happened to your hardware exactly at the time you switched from A to B. Less probable, but still possible. If you look at "Server Charts" -> "time of data write operations", do you see any change?

Yeah, after B finished writing, the time at least doubled on all future writes, including the ones from A.
What I was more concerned about is that the written data seemingly wasn't consistent: I couldn't bring up the application on A from the MooseFS data anymore after B finished writing. Is there some built-in deduplication in the free version? (The data was two copies of substrate blockchain node data, kusama/khala in case it matters.)
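
As a side note on the suspected corruption, one hedged way to check it would be to compare checksums between the original data and the copy on the MooseFS mount, file by file. Both paths below are placeholders, not taken from this report.

```python
# Hypothetical integrity check: walk the source tree and compare SHA-256
# digests with the corresponding files on the MooseFS mount.
import hashlib
import os

SRC = "/data/khala-node"      # placeholder: original data on machine A
DST = "/opt/mfs/pub1"         # placeholder: copy on the MooseFS mount

def digest(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

for root, _dirs, files in os.walk(SRC):
    for name in files:
        src_file = os.path.join(root, name)
        dst_file = os.path.join(DST, os.path.relpath(src_file, SRC))
        if not os.path.exists(dst_file):
            print("MISSING ", dst_file)
        elif digest(src_file) != digest(dst_file):
            print("MISMATCH", dst_file)
```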

borkd (Collaborator) commented Feb 6, 2023

Can you describe the network side of your cluster in more detail, in addition to all relevant storage class definitions and chunkserver labels? Topology, bandwidth (iperf) and typical RTTs between all nodes with and without WireGuard, WireGuard in full mesh or something else, known bottlenecks, etc.?

"Some kind of corruption" on your mfsmaster node sounds too vague. Please be more specific.

chogata (Member) commented Feb 14, 2023

There is no deduplication in MooseFS (i.e. the system does not in any way analyse the content of the data it writes; the only exception is trailing zeros, which are not physically written to disks).
To help you I would need more info. What @borkd requested would be a good start.
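
One way to see this in practice, assuming the MooseFS client tools (in particular mfsfileinfo) are installed on the mount host: copy an existing file to a second path and inspect both copies. They are reported with distinct chunk IDs, i.e. identical content is stored twice rather than deduplicated. The paths below are placeholders.

```python
# Hypothetical no-deduplication check: duplicate a file on the mount and
# compare the chunk listings printed by mfsfileinfo.
import shutil
import subprocess

a = "/opt/mfs/pub1/sample.bin"   # placeholder: an existing file on the mount
b = "/opt/mfs/pub2/sample.bin"   # placeholder: where to put the copy

shutil.copyfile(a, b)
for path in (a, b):
    print(f"== {path} ==")
    subprocess.run(["mfsfileinfo", path], check=False)
```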
