
Snapshots that differ only by renamed or duplicated files can't be committed. #20

Open
vsivsi opened this issue Feb 28, 2017 · 5 comments


vsivsi commented Feb 28, 2017

Hi, thanks for this package.

First, the bug: in snapshot mode, if the only difference between the current directory state and the previous snapshot is the addition of a duplicate of an existing file, the snapshot fails to complete, even though the directory state has changed (through the addition of the duplicated file).

Repro:

mkdir dup-test
cd dup-test
s3git init
# Initialized empty s3git repository in <directory>
head -n 100000 /dev/urandom > file1.bin
s3git snapshot create . -m 'Initial version'
# [commit <long hash>]
cp file1.bin file1.bin.bak
s3git snapshot create . -m 'Added backup file'
# No changes to snapshot
s3git log -p
# <long hash> Initial version

I also have a couple of quick questions:

How do you enable the rolling hash deduplication? It does not appear to be on by default. If I continue the example above by modifying the end and then the beginning of the file:

du -sh .s3git
# 24M	.s3git
echo 'woot' | cat file1.bin - > file1.bin.bak
s3git snapshot create . -m 'Added post-wooted backup file'
# [commit <long hash>]
du -sh .s3git
# 29M	.s3git         # The last chunk changed, as expected
echo 'woot' | cat - file1.bin > file1.bin.bak2
s3git snapshot create . -m 'Added pre-wooted backup file'
# [commit <long hash>]
du -sh .s3git
# 53M	.s3git         # Every chunk changed, NOT as expected

It appears that appending to a file is deduplicated, but prepending to (or otherwise modifying) a file is not. That doesn't fit my definition of "rolling hash" (e.g. how rsync or Rabin file chunking work). Is this implemented? If so, how do I enable it?
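
For reference, the observed behavior matches what fixed-boundary chunking would produce. Below is a minimal Go sketch of that failure mode; the 5 MB chunk size and SHA-256 are illustrative stand-ins, not s3git's actual internals:

package main

import (
	"crypto/sha256"
	"fmt"
)

// fixedChunkHashes splits data at fixed offsets and hashes each chunk.
// With fixed boundaries, prepending even a few bytes shifts the content
// of every chunk, so every hash changes; appending only affects the
// final chunk. (Chunk size and hash are hypothetical stand-ins, not
// s3git's actual parameters.)
func fixedChunkHashes(data []byte, size int) [][32]byte {
	var hashes [][32]byte
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[off:end]))
	}
	return hashes
}

func main() {
	const size = 5 * 1024 * 1024
	orig := make([]byte, 24*1024*1024)
	for i := range orig {
		orig[i] = byte(i) // deterministic pattern; a 5-byte shift changes every chunk
	}
	prepended := append([]byte("woot\n"), orig...)

	a := fixedChunkHashes(orig, size)
	b := fixedChunkHashes(prepended, size)
	unchanged := 0
	for i := range a {
		if a[i] == b[i] {
			unchanged++
		}
	}
	// Prints "0 of 5": after a 5-byte prepend, no chunk hash survives.
	fmt.Printf("%d of %d chunk hashes unchanged after prepend\n", unchanged, len(a))
}

A content-defined (rolling-hash) chunker would instead re-synchronize at the first content-determined boundary after the insert, so only the leading chunk would change.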

Finally, a general question that may be answered automatically by your response, but I'm curious about the status of this package. Is it being maintained? Are there plans to move beyond the "pre-release" and "use at your own peril (for now)" stage? It looks like a tremendously useful package that is currently more fully baked than the newer Noms or Dat projects, which have somewhat overlapping goals and approaches...

Thanks in advance for your timely response!

vsivsi changed the title from "Snapshots that differ only by duplicate files can't be committed" to "Snapshots that differ only by duplicate files can't be committed. Also, is this project still active/alive?" on Feb 28, 2017

vsivsi commented Mar 1, 2017

Just to add: it is not necessary to create a duplicate file to see this bug; simply renaming an existing file has the same effect. The commit will not go through: No changes to snapshot

vsivsi changed the title to "Snapshots that differ only by renamed or duplicated files can't be committed. Also, is this project still active/alive?" on Mar 1, 2017
fwessels (Collaborator) commented

Thanks for the bug report, I will look into it.

Rolling hash deduplication is not supported at the moment. Here is a pointer to a package (https://github.com/restic/chunker) that should not be too difficult to integrate in order to define dynamic, content-dependent chunk boundaries; see also the next comment down, which shows an experiment. So for now only appends at the end of a file will deduplicate optimally.
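
As a rough sketch of what an integration could look like, the chunker package can be driven as documented below; the wiring into s3git itself is hypothetical:

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	// A Rabin-style rolling hash picks chunk boundaries from the content
	// itself, so an insert at the front of a stream only disturbs the
	// chunks up to the first boundary; later chunks re-synchronize and
	// keep their hashes. A real repository would use one fixed
	// polynomial rather than a fresh random one per run.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(os.Stdin, pol)
	buf := make([]byte, chunker.MaxSize) // reusable buffer for chunk data

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk at offset %d, length %d kb\n", chunk.Start, chunk.Length/1024)
	}
}

Running this over movie.mp4 and a prepended movie-v2.mp4 should show identical boundaries and hashes for everything after the first cut point.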

fwessels (Collaborator) commented

Below are the results of a mock-up test that prepends data at the beginning of a file in order to see the effect on the BLAKE2 tree mode:

franks-mbp:rolling-hash frankw$ ./rolling-hash < movie.mp4 > output.txt && more output.txt 
  0:0  391d2a67da42ff23abb6985906a485d107def98cb21a10b3d29f5c3ef0b1512e57cafc80fac294d2d37987f454929dd7f1003979f12a196e527638d4f9c7bfb7 (3052 kb)
  1:0  009017341617da903604452377e3baa0f2071c86bd02ae9e92930e7c555aa4d98737dc4f78a5dffc5f55c0f9bb012c70378aa422e5f1bb813dd701e77985b92b (1236 kb)
  2:0  872ad4546565d74bb75677a77cccbec54faddb317f5ce0468a57f57d3ab8485a38a178ffbedb5ea7c56fcc67331d220ef12802d18fe64a874322a0d9d0ddfdd6 (1058 kb)
  3:0  14e9f28133efb345b6e753553e4253984fd24290a6a0825d4d7eef9350a645986466c48252b8299744556fc7ad0c9bc883d1b0fdc3c08fcea3515560fcb184c2 (3193 kb)
  4:0  6bd50073d0fd2e295de83ee673afb4abda9d11da0734dff813be462ec5a86e6aaf8d31ae0f92dfc8824e910c72edb8fa9e6f06ddea092999b323497b88edd6df ( 808 kb)
  5:0  0f796f244b416cf451db5b4bdf809c84146d4c3ea849cc94399811a411062e36ed86e080424384814393986c54a47c2f85d4679687dccbb07dbbb13d68de1a77 (2007 kb)
  6:0  2fc58d2fb02d1d0f541042c4b2a0f12eaaa15ce0256c6ce1ff23f50d3bba52588119a66ae27dab3325f3bd676ee62fa586cc7c407b5f88b843eef8918325a17e (2805 kb)
  7:0* 5b25eca44b4dbd32fb32f519e8152a0af841936671f6e8fa90319a3bbde4659e7b297b30a744b0b9e3775c581d86035d1812905a425a64a0c07d0b89a8cc672f (1068 kb)
       ================================================================================================================================
  0:1* 35274c1b2e7428b55f22f5d4b2422e6128b077ef085b23dfc2d22a0818b967c42b2bc0f9adf402bf0d1dbbe22e5c899f40297eb3519aae3b447bc001e77ce8b0 ( 808 ≤ 2023 ≤ 3193 kb)
franks-mbp:rolling-hash frankw$ (echo "will it work?"; cat movie.mp4; ) > movie-v2.mp4
franks-mbp:rolling-hash frankw$ ./rolling-hash < movie-v2.mp4 > output-v2.txt && more output-v2.txt 
  0:0  0e7f199e93d449a6056a98dc2d75282d2744b40debbaae9b95676ed6e6ea8fc86ce83abf1f75a8e1065455f3d749021c53c00fda3ae6a6f7047fde3253cdd5c0 (3052 kb)
  1:0  009017341617da903604452377e3baa0f2071c86bd02ae9e92930e7c555aa4d98737dc4f78a5dffc5f55c0f9bb012c70378aa422e5f1bb813dd701e77985b92b (1236 kb)
  2:0  872ad4546565d74bb75677a77cccbec54faddb317f5ce0468a57f57d3ab8485a38a178ffbedb5ea7c56fcc67331d220ef12802d18fe64a874322a0d9d0ddfdd6 (1058 kb)
  3:0  14e9f28133efb345b6e753553e4253984fd24290a6a0825d4d7eef9350a645986466c48252b8299744556fc7ad0c9bc883d1b0fdc3c08fcea3515560fcb184c2 (3193 kb)
  4:0  6bd50073d0fd2e295de83ee673afb4abda9d11da0734dff813be462ec5a86e6aaf8d31ae0f92dfc8824e910c72edb8fa9e6f06ddea092999b323497b88edd6df ( 808 kb)
  5:0  0f796f244b416cf451db5b4bdf809c84146d4c3ea849cc94399811a411062e36ed86e080424384814393986c54a47c2f85d4679687dccbb07dbbb13d68de1a77 (2007 kb)
  6:0  2fc58d2fb02d1d0f541042c4b2a0f12eaaa15ce0256c6ce1ff23f50d3bba52588119a66ae27dab3325f3bd676ee62fa586cc7c407b5f88b843eef8918325a17e (2805 kb)
  7:0* 5b25eca44b4dbd32fb32f519e8152a0af841936671f6e8fa90319a3bbde4659e7b297b30a744b0b9e3775c581d86035d1812905a425a64a0c07d0b89a8cc672f (1068 kb)
       ================================================================================================================================
  0:1* 9ddf00eb2900eb401fe05109f980e4a8e70b36bfb65f6936b2bd335f02348da8fb3ed4e97b69472ea05dc2eeb68724173eded51edd19483020d17bfc4342a6db ( 808 ≤ 2023 ≤ 3193 kb)
franks-mbp:rolling-hash frankw$ diff output.txt output-v2.txt 
1c1
<   0:0  391d2a67da42ff23abb6985906a485d107def98cb21a10b3d29f5c3ef0b1512e57cafc80fac294d2d37987f454929dd7f1003979f12a196e527638d4f9c7bfb7 (3052 kb)
---
>   0:0  0e7f199e93d449a6056a98dc2d75282d2744b40debbaae9b95676ed6e6ea8fc86ce83abf1f75a8e1065455f3d749021c53c00fda3ae6a6f7047fde3253cdd5c0 (3052 kb)
10c10
<   0:1* 35274c1b2e7428b55f22f5d4b2422e6128b077ef085b23dfc2d22a0818b967c42b2bc0f9adf402bf0d1dbbe22e5c899f40297eb3519aae3b447bc001e77ce8b0 ( 808 ≤ 2023 ≤ 3193 kb)
---
>   0:1* 9ddf00eb2900eb401fe05109f980e4a8e70b36bfb65f6936b2bd335f02348da8fb3ed4e97b69472ea05dc2eeb68724173eded51edd19483020d17bfc4342a6db ( 808 ≤ 2023 ≤ 3193 kb)

As expected, just the first chunk is updated, and obviously the root hash as well.

Note that the NodeOffset that is an input into BLAKE2 may have an undesired effect: imagine the first chunk being split into two chunks; then all subsequent hashes would differ purely because of a different NodeOffset, even though the underlying streams are the same. Maybe this happens rarely enough in practice to be an unimportant side effect. Alternatively, the NodeOffset could always be set to 0, but that goes a bit against the BLAKE2 tree mode.
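
To make the concern concrete, here is a small sketch. It approximates the NodeOffset by mixing the leaf position into the hashed input; real BLAKE2 tree mode carries the offset in the parameter block instead, but the consequence is the same:

package main

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/crypto/blake2b"
)

// leafHash approximates the tree-mode NodeOffset by mixing the leaf's
// position into the hashed input. Real BLAKE2 tree mode carries the
// offset in the parameter block, but the consequence is the same: the
// digest depends on where the leaf sits, not just on its bytes.
func leafHash(offset uint64, chunk []byte) [blake2b.Size]byte {
	var pre [8]byte
	binary.LittleEndian.PutUint64(pre[:], offset)
	return blake2b.Sum512(append(pre[:], chunk...))
}

func main() {
	chunk := []byte("identical chunk data")

	// Same bytes at different leaf positions hash differently. So if an
	// edit splits an early chunk in two, every later leaf shifts by one
	// offset and all downstream hashes change although the data did not.
	h3 := leafHash(3, chunk)
	h4 := leafHash(4, chunk)
	fmt.Printf("offset 3: %x...\n", h3[:8])
	fmt.Printf("offset 4: %x...\n", h4[:8])
}

Pinning the offset to a constant would make leaf hashes position-independent again, at the cost of deviating from the tree-mode spec, which is exactly the trade-off mentioned above.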

vsivsi changed the title back to "Snapshots that differ only by renamed or duplicated files can't be committed" on Mar 10, 2017

vsivsi commented Mar 29, 2017

Hi, I just wanted to take a second to thank you for your responses here and your work on s3git. Your reference to the chunking lib above introduced me to the Restic project, which I hadn't previously encountered. After investigating it, that package seems to satisfy our immediate requirements and, perhaps more importantly, seems stable and production-ready. Longer term, I'm still interested in a more "git-like" workflow for data, be it through s3git, Noms, etc. But for now we've decided to go with Restic for this project. Thanks again.

fwessels (Collaborator) commented

@vsivsi Great to hear that you found something that fits your needs. Restic is a nice project that is being actively developed; please give our regards to @fd0.
