De-duplicate shared bytes as git does for texty files #293

Open
danimesq opened this issue Nov 13, 2022 · 7 comments

Comments

@danimesq

No description provided.

@danimesq
Author

@sebastianrath I was expecting SnowFS to already have this.

@sebastianrath
Contributor

sebastianrath commented Dec 26, 2022

SnowFS supports copy-on-write on certain file systems such as APFS, but deduplication is not yet implemented in the application layer. The main reason is performance: fragmentation in binaries can have a higher impact on CPU and I/O. For the first implementation of SnowFS, speed had higher priority than disk space. However, we are considering adding this as an opt-in option, since these impacts may not be relevant for every project.

@danimesq
Author

I'm here cheering for this to become an opt-in feature (personally ASAP but for y'all no pressure)

@sebastianrath
Contributor

Could you share some background info? What type of projects would that be beneficial to? How many files, and what are the overall file sizes? Thanks!

@danimesq
Author

danimesq commented Dec 27, 2022

@sebastianrath

> What type of projects would that be beneficial to? How many files, and what are the overall file sizes? Thanks!

To give you an idea, I have many GB of screenshots, both on mobile and on desktop.
And it is sad to know that most of those GB consist of shared bytes that could be deduplicated.

Imagine a screenshot of a notepad, where most of the pixels are white; all of that could be deduplicated (for example, the Windows Start menu icon in these screenshots wouldn't be stored repeatedly).
I imagine GIFs and video file formats use a similar approach for overlapping frames.
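
To put a rough number on the idea above, here is a minimal sketch that counts how many fixed-size blocks are unique across a set of buffers. It assumes uncompressed pixel data; compressed formats such as PNG will generally not expose identical pixels as identical bytes. The function name `dedup_ratio`, the 4096-byte block size, and the two synthetic "screenshots" are all made up for illustration and are not part of SnowFS.

```python
import hashlib

def dedup_ratio(buffers, block_size=4096):
    """Return (total_blocks, unique_blocks) across all buffers."""
    seen = set()
    total = 0
    for buf in buffers:
        for off in range(0, len(buf), block_size):
            block = buf[off:off + block_size]
            seen.add(hashlib.sha256(block).digest())
            total += 1
    return total, len(seen)

# Two hypothetical 64 KiB buffers of raw pixels: mostly white,
# each with one small region that differs.
WHITE = b"\xff" * 65536
shot_a = bytearray(WHITE)
shot_a[100:110] = b"\x00" * 10
shot_b = bytearray(WHITE)
shot_b[50000:50010] = b"\x10" * 10

total, unique = dedup_ratio([bytes(shot_a), bytes(shot_b)])
print(total, unique)  # 32 3
```

Of the 32 blocks, only three are distinct: the all-white block and the two blocks that contain the modified regions.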

@danimesq
Author

BTW I'm working on a new symlink daemon that will support assembling a single file from shared objects.
It's here: https://github.com/Floflis/witchlink

@danimesq
Author

@sebastianrath do you know of any libraries that find duplicate bytes in files and move these duplicates into separate files?
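
One common technique for finding shared byte runs is content-defined chunking: cut points are chosen from the data itself, so a chunk can still match after bytes are inserted earlier in the file. The sketch below is only an illustration under simplifying assumptions, not a recommendation of a specific library. The names `GEAR` and `cdc_chunks` and the mask value are hypothetical, and real implementations usually add minimum and maximum chunk sizes.

```python
import random

# Hypothetical per-byte random table for a gear-style rolling hash.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def cdc_chunks(data, mask=0x3F):
    """Split data wherever the low 6 bits of a rolling hash are zero.

    Those bits depend only on the last 6 bytes seen, so cut points
    follow the content rather than absolute offsets.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks

original = bytes(_rng.getrandbits(8) for _ in range(8192))
shifted = b"X" + original  # one inserted byte shifts every later offset

a, b = cdc_chunks(original), cdc_chunks(shifted)
shared = set(a) & set(b)
print(len(a), len(b), len(shared))
```

Chunks that appear in `shared` would only need to be stored once, even though every byte of `shifted` sits at a different offset than in `original`.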

I would love it if Git natively supported more than one object per file, so that there wouldn't be "foo", "bar" and "foobar" objects but only "foo" and "bar".
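
The "foo"/"bar"/"foobar" idea can be sketched as a toy content-addressed store in which a file is an ordered list of chunk hashes rather than a single object. This is an illustration only and is neither how Git nor how SnowFS stores data. `ChunkStore`, its methods, the file names, and the 3-byte chunk size are hypothetical and chosen so the example lines up with the three strings.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept once."""

    def __init__(self, chunk_size=3):
        self.chunk_size = chunk_size
        self.objects = {}    # chunk hash -> chunk bytes
        self.manifests = {}  # file name -> ordered list of chunk hashes

    def put(self, name, data):
        hashes = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.objects.setdefault(digest, chunk)  # store only if new
            hashes.append(digest)
        self.manifests[name] = hashes

    def get(self, name):
        # Reassemble the file by concatenating its chunks in order.
        return b"".join(self.objects[h] for h in self.manifests[name])

store = ChunkStore()
store.put("a.txt", b"foo")
store.put("b.txt", b"bar")
store.put("c.txt", b"foobar")

print(sorted(store.objects.values()))  # [b'bar', b'foo']
print(store.get("c.txt"))              # b'foobar'
```

Three files are stored, but only two chunk objects exist; "foobar" is just a manifest pointing at the "foo" and "bar" chunks. The cost is that every read has to look up and join several objects.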
