
Snapshots that differ only by renamed or duplicated files can't be committed. #20

Open
vsivsi opened this issue Feb 28, 2017 · 5 comments


vsivsi commented Feb 28, 2017

Hi, thanks for this package.

First, the bug: in snapshot mode, if the only difference between the current directory state and the previous snapshot is the addition of a duplicate of an existing file, the snapshot fails to complete, even though the directory state has changed (through the addition of the duplicated file).

Repro:

mkdir dup-test
cd dup-test
s3git init
# Initialized empty s3git repository in <directory>
head -n 100000 /dev/urandom > file1.bin
s3git snapshot create . -m 'Initial version'
# [commit <long hash>]
cp file1.bin file1.bin.bak
s3git snapshot create . -m 'Added backup file'
# No changes to snapshot
s3git log -p
# <long hash> Initial version

I also have a couple of quick questions:

How do you enable the rolling hash deduplication? It does not appear to be on by default. If I continue the example above by modifying the end and then the beginning of the file:

du -sh .s3git
# 24M	.s3git
echo 'woot' | cat file1.bin - > file1.bin.bak
s3git snapshot create . -m 'Added post-wooted backup file'
# [commit <long hash>]
du -sh .s3git
# 29M	.s3git         # The last chunk changed, as expected
echo 'woot' | cat - file1.bin > file1.bin.bak2
s3git snapshot create . -m 'Added pre-wooted backup file'
# [commit <long hash>]
du -sh .s3git
# 53M	.s3git         # Every chunk changed, NOT as expected

It appears that appending to a file is deduplicated, but prepending to (or otherwise modifying) a file is not. That doesn't fit my definition of "rolling hash" (e.g. how rsync or Rabin file chunking work). Is this implemented? If so, how do I enable it?
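
For reference, the observed behavior matches what fixed-boundary chunking would produce. Below is a minimal Go sketch of that failure mode; the 5 MB chunk size and SHA-256 are illustrative stand-ins, not s3git's actual internals:

package main

import (
	"crypto/sha256"
	"fmt"
)

// fixedChunkHashes splits data at fixed offsets and hashes each chunk.
// With fixed boundaries, prepending even a few bytes shifts the content
// of every chunk, so every hash changes; appending only affects the
// final chunk. (Chunk size and hash are hypothetical stand-ins, not
// s3git's actual parameters.)
func fixedChunkHashes(data []byte, size int) [][32]byte {
	var hashes [][32]byte
	for off := 0; off < len(data); off += size {
		end := off + size
		if end > len(data) {
			end = len(data)
		}
		hashes = append(hashes, sha256.Sum256(data[off:end]))
	}
	return hashes
}

func main() {
	const size = 5 * 1024 * 1024
	orig := make([]byte, 24*1024*1024)
	for i := range orig {
		orig[i] = byte(i) // deterministic pattern; a 5-byte shift changes every chunk
	}
	prepended := append([]byte("woot\n"), orig...)

	a := fixedChunkHashes(orig, size)
	b := fixedChunkHashes(prepended, size)
	unchanged := 0
	for i := range a {
		if a[i] == b[i] {
			unchanged++
		}
	}
	// Prints "0 of 5": after a 5-byte prepend, no chunk hash survives.
	fmt.Printf("%d of %d chunk hashes unchanged after prepend\n", unchanged, len(a))
}

A content-defined (rolling-hash) chunker would instead re-synchronize at the first content-determined boundary after the insert, so only the leading chunk would change.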

Finally, a general question that may be answered automatically by your response, but I'm curious about the status of this package. Is it being maintained? Are there plans to move beyond the "pre-release" and "use at your own peril (for now)" stage? It looks like a tremendously useful package that is currently more fully baked than the newer Noms or Dat projects, which have somewhat overlapping goals and approaches...

Thanks in advance for your timely response!

vsivsi changed the title from "Snapshots that differ only by duplicate files can't be committed" to "Snapshots that differ only by duplicate files can't be committed. Also, is this project still active/alive?" on Feb 28, 2017

vsivsi commented Mar 1, 2017

Just to add: it is not necessary to create a duplicate file to see this bug; simply renaming an existing file has the same effect. The commit will not go through: No changes to snapshot

vsivsi changed the title to "Snapshots that differ only by renamed or duplicated files can't be committed. Also, is this project still active/alive?" on Mar 1, 2017
fwessels (Collaborator) commented

Thanks for the bug report, I will look into it.

Rolling hash deduplication is not supported at the moment. Here is a pointer to a package (https://github.com/restic/chunker) that should not be too difficult to integrate in order to define dynamic, content-dependent chunk boundaries; see also the next comment down, which shows an experiment. So for now only appends at the end of a file will deduplicate optimally.
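
As a rough sketch of what an integration could look like, the chunker package can be driven as documented below; the wiring into s3git itself is hypothetical:

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	// A Rabin-style rolling hash picks chunk boundaries from the content
	// itself, so an insert at the front of a stream only disturbs the
	// chunks up to the first boundary; later chunks re-synchronize and
	// keep their hashes. A real repository would use one fixed
	// polynomial rather than a fresh random one per run.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(os.Stdin, pol)
	buf := make([]byte, chunker.MaxSize) // reusable buffer for chunk data

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk at offset %d, length %d kb\n", chunk.Start, chunk.Length/1024)
	}
}

Running this over movie.mp4 and a prepended movie-v2.mp4 should show identical boundaries and hashes for everything after the first cut point.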

fwessels (Collaborator) commented

Below are the results of a mock-up test that prepends data at the beginning of a file in order to see the effect on the BLAKE2 tree mode:

franks-mbp:rolling-hash frankw$ ./rolling-hash < movie.mp4 > output.txt && more output.txt 
  0:0  391d2a67da42ff23abb6985906a485d107def98cb21a10b3d29f5c3ef0b1512e57cafc80fac294d2d37987f454929dd7f1003979f12a196e527638d4f9c7bfb7 (3052 kb)
  1:0  009017341617da903604452377e3baa0f2071c86bd02ae9e92930e7c555aa4d98737dc4f78a5dffc5f55c0f9bb012c70378aa422e5f1bb813dd701e77985b92b (1236 kb)
  2:0  872ad4546565d74bb75677a77cccbec54faddb317f5ce0468a57f57d3ab8485a38a178ffbedb5ea7c56fcc67331d220ef12802d18fe64a874322a0d9d0ddfdd6 (1058 kb)
  3:0  14e9f28133efb345b6e753553e4253984fd24290a6a0825d4d7eef9350a645986466c48252b8299744556fc7ad0c9bc883d1b0fdc3c08fcea3515560fcb184c2 (3193 kb)
  4:0  6bd50073d0fd2e295de83ee673afb4abda9d11da0734dff813be462ec5a86e6aaf8d31ae0f92dfc8824e910c72edb8fa9e6f06ddea092999b323497b88edd6df ( 808 kb)
  5:0  0f796f244b416cf451db5b4bdf809c84146d4c3ea849cc94399811a411062e36ed86e080424384814393986c54a47c2f85d4679687dccbb07dbbb13d68de1a77 (2007 kb)
  6:0  2fc58d2fb02d1d0f541042c4b2a0f12eaaa15ce0256c6ce1ff23f50d3bba52588119a66ae27dab3325f3bd676ee62fa586cc7c407b5f88b843eef8918325a17e (2805 kb)
  7:0* 5b25eca44b4dbd32fb32f519e8152a0af841936671f6e8fa90319a3bbde4659e7b297b30a744b0b9e3775c581d86035d1812905a425a64a0c07d0b89a8cc672f (1068 kb)
       ================================================================================================================================
  0:1* 35274c1b2e7428b55f22f5d4b2422e6128b077ef085b23dfc2d22a0818b967c42b2bc0f9adf402bf0d1dbbe22e5c899f40297eb3519aae3b447bc001e77ce8b0 ( 808 ≤ 2023 ≤ 3193 kb)
franks-mbp:rolling-hash frankw$ (echo "will it work?"; cat movie.mp4; ) > movie-v2.mp4
franks-mbp:rolling-hash frankw$ ./rolling-hash < movie-v2.mp4 > output-v2.txt && more output-v2.txt 
  0:0  0e7f199e93d449a6056a98dc2d75282d2744b40debbaae9b95676ed6e6ea8fc86ce83abf1f75a8e1065455f3d749021c53c00fda3ae6a6f7047fde3253cdd5c0 (3052 kb)
  1:0  009017341617da903604452377e3baa0f2071c86bd02ae9e92930e7c555aa4d98737dc4f78a5dffc5f55c0f9bb012c70378aa422e5f1bb813dd701e77985b92b (1236 kb)
  2:0  872ad4546565d74bb75677a77cccbec54faddb317f5ce0468a57f57d3ab8485a38a178ffbedb5ea7c56fcc67331d220ef12802d18fe64a874322a0d9d0ddfdd6 (1058 kb)
  3:0  14e9f28133efb345b6e753553e4253984fd24290a6a0825d4d7eef9350a645986466c48252b8299744556fc7ad0c9bc883d1b0fdc3c08fcea3515560fcb184c2 (3193 kb)
  4:0  6bd50073d0fd2e295de83ee673afb4abda9d11da0734dff813be462ec5a86e6aaf8d31ae0f92dfc8824e910c72edb8fa9e6f06ddea092999b323497b88edd6df ( 808 kb)
  5:0  0f796f244b416cf451db5b4bdf809c84146d4c3ea849cc94399811a411062e36ed86e080424384814393986c54a47c2f85d4679687dccbb07dbbb13d68de1a77 (2007 kb)
  6:0  2fc58d2fb02d1d0f541042c4b2a0f12eaaa15ce0256c6ce1ff23f50d3bba52588119a66ae27dab3325f3bd676ee62fa586cc7c407b5f88b843eef8918325a17e (2805 kb)
  7:0* 5b25eca44b4dbd32fb32f519e8152a0af841936671f6e8fa90319a3bbde4659e7b297b30a744b0b9e3775c581d86035d1812905a425a64a0c07d0b89a8cc672f (1068 kb)
       ================================================================================================================================
  0:1* 9ddf00eb2900eb401fe05109f980e4a8e70b36bfb65f6936b2bd335f02348da8fb3ed4e97b69472ea05dc2eeb68724173eded51edd19483020d17bfc4342a6db ( 808 ≤ 2023 ≤ 3193 kb)
franks-mbp:rolling-hash frankw$ diff output.txt output-v2.txt 
1c1
<   0:0  391d2a67da42ff23abb6985906a485d107def98cb21a10b3d29f5c3ef0b1512e57cafc80fac294d2d37987f454929dd7f1003979f12a196e527638d4f9c7bfb7 (3052 kb)
---
>   0:0  0e7f199e93d449a6056a98dc2d75282d2744b40debbaae9b95676ed6e6ea8fc86ce83abf1f75a8e1065455f3d749021c53c00fda3ae6a6f7047fde3253cdd5c0 (3052 kb)
10c10
<   0:1* 35274c1b2e7428b55f22f5d4b2422e6128b077ef085b23dfc2d22a0818b967c42b2bc0f9adf402bf0d1dbbe22e5c899f40297eb3519aae3b447bc001e77ce8b0 ( 808 ≤ 2023 ≤ 3193 kb)
---
>   0:1* 9ddf00eb2900eb401fe05109f980e4a8e70b36bfb65f6936b2bd335f02348da8fb3ed4e97b69472ea05dc2eeb68724173eded51edd19483020d17bfc4342a6db ( 808 ≤ 2023 ≤ 3193 kb)

As expected, just the first chunk is updated, and obviously the root hash as well.

Note that the NodeOffset that is an input into BLAKE2 may have an undesired effect: imagine the first chunk being split into two chunks; then all subsequent hashes would differ purely because of a different NodeOffset, even though the underlying streams are the same. Maybe this happens rarely enough in practice to be an unimportant side effect. Alternatively, the NodeOffset could always be set to 0, but that goes a bit against the BLAKE2 tree mode.
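
To make the concern concrete, here is a small sketch. It approximates the NodeOffset by mixing the leaf position into the hashed input; real BLAKE2 tree mode carries the offset in the parameter block instead, but the consequence is the same:

package main

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/crypto/blake2b"
)

// leafHash approximates the tree-mode NodeOffset by mixing the leaf's
// position into the hashed input. Real BLAKE2 tree mode carries the
// offset in the parameter block, but the consequence is the same: the
// digest depends on where the leaf sits, not just on its bytes.
func leafHash(offset uint64, chunk []byte) [blake2b.Size]byte {
	var pre [8]byte
	binary.LittleEndian.PutUint64(pre[:], offset)
	return blake2b.Sum512(append(pre[:], chunk...))
}

func main() {
	chunk := []byte("identical chunk data")

	// Same bytes at different leaf positions hash differently. So if an
	// edit splits an early chunk in two, every later leaf shifts by one
	// offset and all downstream hashes change although the data did not.
	h3 := leafHash(3, chunk)
	h4 := leafHash(4, chunk)
	fmt.Printf("offset 3: %x...\n", h3[:8])
	fmt.Printf("offset 4: %x...\n", h4[:8])
}

Pinning the offset to a constant would make leaf hashes position-independent again, at the cost of deviating from the tree-mode spec, which is exactly the trade-off mentioned above.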

vsivsi changed the title back to "Snapshots that differ only by renamed or duplicated files can't be committed" on Mar 10, 2017

vsivsi commented Mar 29, 2017

Hi, I just wanted to take a second to thank you for your responses here and your work on s3git. Your reference to the chunking lib above introduced me to the Restic project, which I hadn't previously encountered. After investigating it, that package seems to satisfy our immediate requirements and, perhaps more importantly, seems stable and production-ready. Longer term, I'm still interested in a more "git-like" workflow for data, be it through s3git, Noms, etc. But for now we've decided to go with Restic for this project. Thanks again.

fwessels (Collaborator) commented

@vsivsi Great to hear that you found something that fits your needs. Restic is a nice project that is being actively developed; please give our regards to @fd0.
