Snapshots that differ only by renamed or duplicated files can't be committed. #20
Just to add: it is not necessary to create a duplicate file to see this bug; simply renaming an existing file has the same effect. The commit will not go through.
Thanks for the bug report, I will look into it. Rolling hash deduplication is not supported at the moment; here is a pointer to a package (https://github.com/restic/chunker) that should not be too difficult to integrate in order to define dynamic boundaries. See also the next comment down, which shows an experiment. So for now only appends at the end will deduplicate optimally.
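For reference, the kind of content-defined chunking that restic/chunker provides can be sketched with a toy rolling hash. The sketch below is an illustration only, not s3git's or restic's actual algorithm: it substitutes a sliding byte-sum for the Rabin fingerprint, with a tiny window and mask so the effect shows up on small inputs. The point is that cut points depend only on local content, so prepended data only disturbs the first chunk or two before the boundaries resynchronize:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// splitCDC cuts data at content-defined boundaries using a toy rolling
// hash (a sliding sum over the last `window` bytes). It stands in for
// the Rabin fingerprint used by github.com/restic/chunker.
func splitCDC(data []byte, window int, mask uint32) [][]byte {
	var chunks [][]byte
	var sum uint32
	start := 0
	for i, b := range data {
		sum += uint32(b)
		if i >= window {
			sum -= uint32(data[i-window]) // slide the window forward
		}
		// Cut whenever the low bits of the rolling hash match the mask
		// (and the chunk has reached a minimum size of one window).
		if i-start+1 >= window && sum&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// chunkHashes collects the set of distinct chunk digests.
func chunkHashes(chunks [][]byte) map[[32]byte]bool {
	m := make(map[[32]byte]bool)
	for _, c := range chunks {
		m[sha256.Sum256(c)] = true
	}
	return m
}

func main() {
	// Deterministic pseudo-random payload (simple LCG).
	data := make([]byte, 4096)
	x := uint32(12345)
	for i := range data {
		x = x*1664525 + 1013904223
		data[i] = byte(x >> 24)
	}
	prepended := append([]byte("HEAD"), data...)

	before := chunkHashes(splitCDC(data, 8, 0x3f))
	after := chunkHashes(splitCDC(prepended, 8, 0x3f))
	shared := 0
	for h := range after {
		if before[h] {
			shared++
		}
	}
	// Nearly all chunk digests survive the prepend; only the leading
	// chunk(s) change, which is the property static boundaries lack.
	fmt.Printf("chunks before: %d, after prepend: %d, shared: %d\n",
		len(before), len(after), shared)
}
```

With static boundaries, the same prepend would shift every chunk and invalidate every digest; here the window "forgets" the prefix after a few bytes, so later cuts land on the same content.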
Below are the results of a mock-up test that prepends data at the beginning in order to see the effects on the BLAKE2 Tree mode:
As expected, just the first chunk is updated, and obviously the root hash as well. Note that the NodeOffset that is an input into BLAKE2 may have an undesired effect: imagine the first chunk being split into two chunks; then all subsequent hashes would differ purely due to a different NodeOffset, although the underlying streams would be the same. Maybe this effect rarely happens in practice and would be an unimportant side effect (alternatively, always set the NodeOffset to 0, but that goes a bit against the BLAKE2 tree mode).
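To make the NodeOffset concern concrete, here is a minimal sketch. It uses sha256 from the Go standard library as a stand-in for BLAKE2, purely so the snippet is self-contained (the real BLAKE2 parameter-block encoding differs): once a node offset is mixed into each leaf hash, identical chunk data at different offsets produces different digests, which is why a split of the first chunk would ripple through every later leaf.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// leafHash mimics how a tree-mode hash mixes a NodeOffset into each
// leaf digest. sha256 stands in for BLAKE2 here; this is not the real
// BLAKE2 tree-mode encoding, only an illustration of the dependency.
func leafHash(nodeOffset uint64, chunk []byte) [32]byte {
	var off [8]byte
	binary.LittleEndian.PutUint64(off[:], nodeOffset)
	return sha256.Sum256(append(off[:], chunk...))
}

func main() {
	chunk := []byte("identical chunk data")
	// Same bytes, different offsets: the digests differ, so shifting
	// all later chunks by one position changes every subsequent leaf.
	fmt.Println("equal at offsets 0 and 1?", leafHash(0, chunk) == leafHash(1, chunk))
}
```

Running this prints `equal at offsets 0 and 1? false`, illustrating the trade-off described above: the offset binds each leaf to its position, which defeats deduplication of shifted-but-identical chunks.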
Hi, I just wanted to take a second to thank you for your responses here and your work on s3git. Your reference to the chunking lib above introduced me to the Restic project, which I hadn't previously encountered. After investigating it, that package seems to satisfy our immediate requirements and, perhaps more importantly, seems stable and production ready. Longer term, I'm still interested in a more "git like" workflow for data, be it through s3git, noms, etc. But for now we've decided to go with Restic for this project. Thanks again.
Hi thanks for this package.
First, the bug: in snapshot mode, if the only difference between the current state and the previous snapshot is the addition of a duplicate file, the snapshot will fail to complete, even though the directory state has been updated (by the addition of the duplicated file).
Repro:
I also have a couple of quick questions:
How do you enable the rolling hash deduplication? It does not appear to be on by default. If I continue the example above by modifying the end and then the beginning of the file:
It appears that appending to files will be deduplicated, but prepending to (or otherwise modifying) the file will not be. That doesn't fit my definition of "rolling hash" (e.g. how rsync or Rabin file chunking work). Is this implemented? If so, how do I enable it?

Finally, a general question that may be answered automatically by your response, but I'm curious about the status of this package. Is it being maintained? Are there plans to move forward beyond the "pre-release" and "use at your own peril (for now)" stage? It looks like a tremendously useful package that is currently more fully baked than the newer Noms or Dat projects, which have somewhat overlapping goals and approaches.
Thanks in advance for your timely response!