Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiment: Sparse file support #185

Draft
wants to merge 27 commits into
base: master
Choose a base branch
from
Draft

Conversation

sharifhsn
Copy link
Contributor

@sharifhsn sharifhsn commented Dec 8, 2022

I decided to take a stab at #89 just to see how feasible it is. In order to facilitate this, I have temporarily changed the API of print_all to take Input with a new generic Box <dyn Read> variant as a fallback.

So far, this implementation is only able to use sparseness when there is non-sparse content at the end of the file, and the rest of the file, including the start, is sparse. Otherwise, it will cause strange errors.

It also only works for Linux, because this was the simplest OS to implement sparse support for.

How to test this:

truncate -s 16G startzero
echo "Lorem ipsum" >> startzero
cargo run --release -- startzero

The position will be slightly wrong but it will finish extremely quickly.

Tasks:

  • correct position indexing
  • handle fully sparse files
  • handle when the file starts with non-sparse data
  • handle non-sparse data interspersed in files (this might work already to some extent)
  • fully support Linux (caveat: does not work with BufReader)
  • fully support Windows
  • fully support MacOS

@sharkdp
Copy link
Owner

sharkdp commented Dec 8, 2022

I decided to take a stab at #89 just to see how feasible it is. In order to facilitate this, I have temporarily changed the API of print_all to take Input with a new generic Box <dyn Read> variant as a fallback.

Just a few comments for now:

  • I said from the beginning that I'm only going to maintain hexyl-as-a-library on a best-effort basis. I have no problems with breaking changes to the API. Especially if it can advance hexyl-as-an-application. It's also not like we have hundreds of dependents.
  • I really appreciate your effort here. If you are interested in the problem itself - great! But I just want to warn you that this might very well turn out to be too complex (or too risky, especially concerning cross-platform testability) for an actual merge. I'm not saying it definitely is. Just want to mention that this is a possibility.
  • Maybe it would be useful to use sth like nix instead of libc directly?

@sharifhsn
Copy link
Contributor Author

If you are interested in the problem itself - great! But I just want to warn you that this might very well turn out to be too complex (or too risky in terms of bugs) for an actual merge.

Yeah, I'm pretty aware of this, at least at my progress so far. That's why I'm labeling this PR as an experiment, I just want to see how possible it is. At least so far, it seems to work fine on Linux (with a major caveat that I'm trying to work around at the moment).

I hadn't looked into nix before, only rustix. It seems that nix supports SEEK_DATA while rustix doesn't, so I think I'll replace this use of libc with nix (and also maybe raise an issue for rustix).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants