Force FSYNC #115

Open
guestisp opened this issue May 20, 2018 · 25 comments
@guestisp

Is it possible to force MooseFS to issue an fsync on every operation before returning an ACK to the client?

What if I want to be 100% sure that data is properly written to disk, even if the client is not asking for fsync?

@oszafraniec

https://moosefs.com/Content/Downloads/moosefs-2-0-users-manual.pdf
6.3.1 mfschunkserver.cfg
[...]
• HDD_FSYNC_BEFORE_CLOSE – enables/disables fsync before chunk closing; default is 0 (off)

I think this is what you want :)
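
A minimal sketch of the corresponding mfschunkserver.cfg entry (the option name is taken from the manual quoted above; the exact file layout may differ between versions):

    # call fsync() before every chunk close; default is 0 (off)
    HDD_FSYNC_BEFORE_CLOSE = 1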

@guestisp
Author

So, setting that to 1 means an fsync is called on every chunk close?

What if the client asks for fsync on its own and HDD_FSYNC_BEFORE_CLOSE is set to 0? Is the fsync honored, or ignored because of the config parameter?

@oszafraniec

@OXide94 can help here...

@pkonopelko
Member

pkonopelko commented May 20, 2018

Hi @guestisp and @oszafraniec,

The parameter @oszafraniec mentioned is about fsync before chunk closing, so let's not mix up two completely different things. What @guestisp wants to achieve (I believe) is to get an ACK after every write operation (so at the "Client <--> Chunkserver" level, not at the level of chunk writing).

In MooseFS a write is a transaction. Let's assume somebody is writing 2 MiB. These 2 mebibytes are divided into 64 KiB blocks anyway. In the current implementation the Client connects to the CS, sends 64 KiB, does not wait for an ACK, and sends the next 64 KiB.

We can consider adding such a parameter. The question is: @guestisp, would you like to have an fsync after the whole group of blocks (in this example 2 MiB) or after every 64 KiB? As stated above, when data is being sent, the Client does not wait for an ACK for every 64 KiB block, so ACKs are a bit delayed (milliseconds). Please keep in mind that it would probably slow down the whole transaction (maybe not that much, because the ACKs in this case would probably just reach the client a bit later because of the fsync (some more milliseconds)).

This is theory; we would probably need to make some comparison tests (just add this fsync in the code and see the performance differences with and without it).
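
To illustrate the trade-off being discussed, here is a minimal, hypothetical C sketch of a chunkserver-side write loop; the function name, constant, and structure are invented for illustration and are not the actual MooseFS code:

    #include <unistd.h>

    #define BLOCK_SIZE (64 * 1024)  /* MooseFS sends data in 64 KiB blocks */

    /* sync_each != 0: fsync after every 64 KiB block (variant A).
     * sync_each == 0: a single fsync after the whole group (variant B). */
    ssize_t write_blocks(int fd, const char *buf, size_t len, int sync_each)
    {
        size_t off = 0;
        while (off < len) {
            size_t n = (len - off > BLOCK_SIZE) ? BLOCK_SIZE : len - off;
            ssize_t w = write(fd, buf + off, n);
            if (w < 0)
                return -1;
            off += (size_t)w;
            if (sync_each && fsync(fd) < 0)  /* variant A: per-block flush */
                return -1;
        }
        if (!sync_each && fsync(fd) < 0)     /* variant B: one flush at the end */
            return -1;
        return (ssize_t)off;
    }

Variant B is what "fsync after the whole group" would amount to: the same durability point at the end of the transaction, but one flush instead of dozens.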

Thanks,
Peter / MooseFS Team

@pkonopelko pkonopelko added the feature and question labels May 20, 2018
@guestisp
Author

@OXide94 Preface: I'm not an expert.

There are, AFAIK, two ways to be sure that a write operation has really reached the disk and not just a cache buffer: opening the file with O_SYNC, or issuing fsync on a file handle.

Now, what this means in MooseFS, I don't know.

My questions are (obviously, I'm talking about files stored on MooseFS):

  1. What happens when opening a file with O_SYNC set?
  2. What happens when issuing an fsync after an fwrite and before fclose?
  3. What happens with neither of the two above, but with HDD_FSYNC_BEFORE_CLOSE set to 1?
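
For reference, the two userspace mechanisms from the first paragraph look roughly like this in plain POSIX C (a generic sketch, not MooseFS-specific):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *buf = "hello";
        size_t len = strlen(buf);

        /* Way 1: open with O_SYNC; each write() returns only after
         * the data and metadata have reached stable storage. */
        int fd1 = open("file1", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd1 >= 0) {
            (void)write(fd1, buf, len);
            close(fd1);
        }

        /* Way 2: write normally, then flush explicitly. */
        int fd2 = open("file2", O_WRONLY | O_CREAT, 0644);
        if (fd2 >= 0) {
            (void)write(fd2, buf, len);
            fsync(fd2);  /* blocks until the data is on disk */
            close(fd2);
        }
        return 0;
    }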

"We can consider adding such a parameter"

From my point of view, anything aimed at improving data consistency should be added, so that any sysop is able to choose based on their requirements. If adding a flag is not an issue, yes, add it. If adding 2 flags (one for fsync after every 64 KiB, one for fsync after the whole group of blocks) is not an issue, then please add both.

Anyway, I don't think adding an fsync after every 64 KiB is useful as long as you force clients to wait for the final fsync. If the whole file can't be properly flushed to disk, you should block the write operation and notify the client (which is still there waiting for a write ACK). Why should you send an fsync after each 64 KiB?

For example, if you run dd if=/dev/zero of=test bs=100M count=5 conv=fsync, a single fsync is issued after all 5 blocks have been written.

If you run dd if=/dev/zero of=test bs=100M count=5 oflag=direct, no fsync is issued because the file is opened with O_DIRECT, which bypasses the page cache entirely.

Will MooseFS honor these two cases? Can we force one of these cases (or both) by setting a configuration parameter, even if the client is not asking for fsync or O_SYNC?

@guestisp
Author

Trying to figure out how this works.
Even with HDD_FSYNC_BEFORE_CLOSE set to 1, I don't see any fsync call when stracing the mfschunkserver.

Even when writing a file with the O_SYNC flag set, the flag seems to be ignored by MooseFS, as the chunk file is opened without O_SYNC:

    open("/mnt/moosefs//00/chunk_000000000000B072_00000001.mfs", O_RDWR|O_CREAT|O_TRUNC, 0666)

Did I miss something?

@guestisp
Author

Small correction: when HDD_FSYNC_BEFORE_CLOSE is set to 1, fsync is called, but the flags set by the client are still ignored. Thus, the client can't ask for a sync write.

So either we make all writes sync with HDD_FSYNC_BEFORE_CLOSE, or all writes are made async, regardless of what the client asks for. If a client asks for O_SYNC or similar, it should be honored, because that kind of data could be very valuable.

@acid-maker
Member

OK, this is my fault. I didn't know that FUSE passes flags such as O_SYNC to userspace, but I've just checked and it does. It passes O_SYNC, O_ASYNC, O_NONBLOCK and O_NOATIME. Now I need to think about how to take them into account. The most important one is probably O_SYNC. We have several options here:

  1. Do an internal fsync after each write - likely the worst choice (the safest one, but very slow: the expected write speed would be less than 1 MB/s).
  2. Pass the O_SYNC flag to the CS and open chunks with that flag (or do an fsync on the CS after each write - probably the same result). In this case the client's write will return immediately without syncing data, but each ACK will be sent back to the mfsmount only after a successful write of each portion of data.
  3. After sending all data to the CS, send a new "perform fsync" packet and wait for the ACK. Similar to the previous option - the main difference is that it would sync the whole stream from mfsmount to the CS once, not after each write. Likely the same result, but much more efficient.

In 2 and 3, a successful fsync/close (but not write) done by the client on a descriptor opened with O_SYNC will mean that your data is synced to the disks on the CS.

What do you think? In my opinion option 3 is the best (a rough sketch follows below). Is it safe enough?
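
For what it's worth, a minimal sketch of how option 3 might look on the chunkserver side; the packet handler, status codes and send_ack() helper are hypothetical, not the actual MooseFS wire protocol:

    #include <stdint.h>
    #include <unistd.h>

    #define STATUS_OK    0   /* hypothetical status codes */
    #define STATUS_IOERR 1

    void send_ack(int conn_fd, uint8_t status);  /* assumed reply primitive */

    /* Invoked when the CS receives a "perform fsync" packet after the
     * last data block of a write stream. */
    void handle_fsync_packet(int conn_fd, int chunk_fd)
    {
        /* One fsync for the whole stream, not one per 64 KiB block. */
        if (fsync(chunk_fd) < 0)
            send_ack(conn_fd, STATUS_IOERR);  /* client learns the flush failed */
        else
            send_ack(conn_fd, STATUS_OK);     /* data is on stable storage */
    }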

@zcalusic
Contributor

O_SYNC means "Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion", so if you do the first part of option 2, "Pass this O_SYNC flag to CS and open chunks with such flag", it should be enough to cover the semantics, and no additional fsync() should be needed.

@zcalusic
Contributor

While at it, see if O_DSYNC can also be implemented; it's similar:

   O_SYNC provides synchronized I/O file integrity completion, meaning
   write operations will flush data and all associated metadata to the
   underlying hardware.  O_DSYNC provides synchronized I/O data
   integrity completion, meaning write operations will flush data to the
   underlying hardware, but will only flush metadata updates that are
   required to allow a subsequent read operation to complete
   successfully.  Data integrity completion can reduce the number of
   disk operations that are required for applications that don't need
   the guarantees of file integrity completion.

http://man7.org/linux/man-pages/man2/open.2.html
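
For comparison, the only difference at the call site is the open flag; a plain POSIX sketch, nothing MooseFS-specific:

    #include <fcntl.h>

    void open_examples(void)
    {
        /* O_SYNC: file integrity - data and all metadata are flushed
         * to the hardware before each write() returns. */
        int fd_sync = open("file", O_WRONLY | O_CREAT | O_SYNC, 0644);

        /* O_DSYNC: data integrity - data is flushed, metadata only when
         * needed for a later read (often fewer disk operations). */
        int fd_dsync = open("file", O_WRONLY | O_CREAT | O_DSYNC, 0644);

        (void)fd_sync;
        (void)fd_dsync;
    }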

@guestisp
Author

@zcalusic the problem is that MooseFS totally ignores this flag, and it is not passed along when opening the chunk file for writing...

@zcalusic
Contributor

@guestisp, please read the comments before replying; you have missed at least one from @acid-maker.

@guestisp
Author

Sorry, my fault.
I only received your email notification and not the first one.

@guestisp
Author

Anyway, O_SYNC should return to the client only when the write is properly stored on disk, not immediately as @acid-maker said, or the client will be unaware of any failures.

@zcalusic
Contributor

Yes, of course. But as the write() call is synchronous, it's just a matter of passing its return status to the caller. Hopefully that can be easily integrated with the current workflow; I don't know MooseFS internals well, but @acid-maker will. 😄

@guestisp
Author

@acid-maker wrote differently: "In such case client's write will return immediately without sync'ing data".

So writes won't be synchronous, but still in writeback.

If, as a client, I ask for O_SYNC, it is because I want to be 100% sure that data is really flushed to disk, so the write must return only after the real flush, even if it is much slower.
(Not all writes need to be sync, so the write penalty shouldn't be an issue.)

@zcalusic
Contributor

I see. I mostly ignored that part, thinking that opening the chunk with O_SYNC should be enough, and that write() already propagates its return code back upstream. I base my understanding on the following figure:

https://www.researchgate.net/profile/Weigang_Wu/publication/271464202/figure/fig1/AS:295235751038978@1447401096555/The-read-write-process-of-MooseFS-9.png

So, O_SYNC and similar flags would be passed via points 4/5/6, and write() would return its status code via point 7. Of course, that figure is much simplified, and the real world is certainly more complicated. :)

In any case, supporting these flags would bring MooseFS closer to POSIX compliance, so it would be great if they could be added and properly supported.

@acid-maker acid-maker self-assigned this May 25, 2018
@borkd
Collaborator

borkd commented Oct 4, 2018

@acid-maker: to follow up on our conversation - one idea was to keep the current fsync behavior as-is, but use an extended attribute to mark files or directory trees where FSYNC or DIRECT compliance is required, and honor that flag all the way down to the chunk writes.
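
A rough sketch of how such a check could look on Linux; the user.mfs.sync attribute name and the policy around it are invented for illustration, not an existing MooseFS feature:

    #include <string.h>
    #include <sys/xattr.h>

    /* Returns 1 if the file (or the tree it inherits from) was marked
     * as requiring synchronous writes via the hypothetical xattr. */
    int requires_sync(const char *path)
    {
        char val[2] = {0};
        ssize_t n = getxattr(path, "user.mfs.sync", val, sizeof(val) - 1);
        return n > 0 && strncmp(val, "1", 1) == 0;
    }

An admin could then mark a tree with something like setfattr -n user.mfs.sync -v 1 /mnt/mfs/important, and writes under it would be treated as if O_SYNC had been requested.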

@borkd borkd pinned this issue Dec 19, 2018
@borkd borkd added the data safety and performance labels Jan 30, 2019
@dumblob

dumblob commented Oct 7, 2021

Any news on this rather fundamental behavior?

@chogata
Member

chogata commented Oct 11, 2021

This is still on our roadmap.

@Motophan

Motophan commented Jun 3, 2022

Hi, please add core filesystem functionality for basic operation, thank you.

@guestisp
Author

guestisp commented Jun 3, 2022

They will never add anything useful. They promised, years ago, a v4 free for everyone; I even had binaries to test and use, with an awesome HA mode, but v4 is still closed source and unavailable. They promise a lot of things...

@guestisp
Author

guestisp commented Jun 3, 2022

It's a shame, because MFS is by far the best distributed storage available.

@Motophan

Motophan commented Feb 3, 2023

Hi, how is roadmap doing these days?

@guestisp
Author

guestisp commented Feb 8, 2023

"Hi, how is roadmap doing these days?"

They do nothing except bug fixes.
They promised a lot of things, like an open-source v4, years ago, but still nothing.

They talk, talk, talk, talk...
