
Handle sparse files #89

Open
clarfonthey opened this issue May 17, 2020 · 9 comments

@clarfonthey

Because hexyl truncates repeating sections, it would be nice to be able to have hexyl quickly skip over these sections instead of scanning them byte-by-byte.

@sharkdp
Owner

sharkdp commented May 19, 2020

So this is a request for hexyl to be faster?

@sharkdp
Owner

sharkdp commented May 26, 2020

Closing for now, as it's unclear to me what is being requested.

@sharkdp sharkdp closed this as completed May 26, 2020
@clarfonthey
Author

clarfonthey commented May 28, 2020

Sorry I never got to this. And, well, yes.

All the major filesystems expose the concept of sparse files, where the file is fragmented on disk and entire sections are marked as being zeroes without actually being stored.

For example, take an 8GiB disk image that is freshly formatted from a zeroed-out disk, with one or two files written to it that are both a few KiB. The actual image will only take up a few KiB on disk because the OS simply marks the file as being all zero except for a few sections.

While hexyl going through the file naively would take several minutes of reading the zero bytes, ultimately it would truncate these sections in the actual output. Reading these files would become feasible if hexyl used the OS subroutines to query the file for its zero ranges, then used this information to skip reading those unnecessarily.

I should also add that on filesystems without sparse file support, the system calls that query for zeroed-out sections will simply report no holes, so an implementation would not have to special-case them.

@sharkdp sharkdp reopened this May 29, 2020
@sharifhsn
Contributor

Unfortunately, due to the type signature of the print_all function, this isn't really possible. In order to access sparse file queries on Windows and Linux (I haven't checked macOS), the data type must be a File. However, the print_all function accepts any type that implements Read, which makes it impossible to tell whether it is reading a file or not. One way to change this would be to make print_all accept the Input enum, but that would restrict generic usage of hexyl to reading only from a File or Stdin.
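One conceivable way around the type-signature problem (a hypothetical design sketch, not hexyl's actual code; all names here are invented) is an extension trait whose default implementation reports "no hole-skipping support", so generic Read sources keep working unchanged while File can override it:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical extension trait: readers that may know how to jump
/// over holes. The default says "cannot skip", so pipes, stdin, and
/// in-memory buffers behave exactly as before.
trait HoleSkippingRead: Read {
    /// Try to seek forward to the next data byte at or after `offset`.
    /// Returns Ok(None) when the source cannot skip holes.
    fn seek_to_data(&mut self, _offset: u64) -> io::Result<Option<u64>> {
        Ok(None)
    }
}

// Non-file readers just take the default...
impl HoleSkippingRead for io::Stdin {}
impl HoleSkippingRead for &[u8] {}

// ...while File could override it. A real implementation would call
// lseek(2) with whence = SEEK_DATA here; a plain seek keeps this
// sketch portable and compilable.
impl HoleSkippingRead for File {
    fn seek_to_data(&mut self, offset: u64) -> io::Result<Option<u64>> {
        self.seek(SeekFrom::Start(offset)).map(Some)
    }
}

fn main() {
    // In-memory readers fall back to the default: no hole skipping.
    let mut buf: &[u8] = b"hello";
    assert!(buf.seek_to_data(0).unwrap().is_none());
    println!("slice reader reports no hole support, as expected");
}
```

A function like print_all could then take `impl HoleSkippingRead` instead of `impl Read`, keeping the generic interface while letting files opt in.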

@sharkdp
Owner

sharkdp commented Dec 7, 2022

While hexyl going through the file naively would take several minutes of reading the zero bytes, ultimately it would truncate these sections in the actual output. Reading these files would become feasible if hexyl used the OS subroutines to query the file for its zero ranges, then used this information to skip reading those unnecessarily.

@clarfonthey Is this really the case? Can you provide any benchmark/timing results? Ideally, in a reproducible setup.

@clarfonthey
Author

I mean, it's pretty easy to just truncate -s 2G test and then run hexyl test to see that while the file takes up no space on disk, it will loop for quite a while.

@sharkdp
Owner

sharkdp commented Dec 7, 2022

Sure. I was questioning if it is really taking several minutes or not. In fact, for an 8G sparse file, hexyl takes 5.329 s ± 0.042 s (benchmark below).

But okay, the problem is real. If not for 8G, then for 800G. And I trust you that this is a valid use case.

So the next question is: how could we potentially implement this (ignoring for now what @sharifhsn said above)? Would the idea be to do something like the following?

If we are in squeezing mode (i.e. we already detected a "full line" of zeros), and haven't read any data for X bytes: switch to a special "fast-forward" mode which would call lseek(2) with the current position as the offset parameter and whence = SEEK_DATA to jump over the hole?

(since that call could 'fail' silently and simply return the current position, we would have to make sure to only switch into fast-forward mode once until we continue to find non-zero bytes)

@clarfonthey Are you aware of any tools that do implement something similar?

Benchmark:

▶ truncate -s 3G test; echo "needle" >> test; truncate -s 8G test

▶ hyperfine --warmup=1 --runs=5 'hexyl test' 'hexdump test'
Benchmark 1: hexyl test
  Time (mean ± σ):      5.329 s ±  0.042 s    [User: 4.099 s, System: 1.220 s]
  Range (min … max):    5.287 s …  5.388 s    5 runs
 
Benchmark 2: hexdump test
  Time (mean ± σ):     16.443 s ±  0.071 s    [User: 14.715 s, System: 1.713 s]
  Range (min … max):   16.346 s … 16.544 s    5 runs
 
Summary
  'hexyl test' ran
    3.09 ± 0.03 times faster than 'hexdump test'

(xxd had to be excluded since it does not seem to have a squeezing mode... and is therefore extremely slow)

@wasamasa

xxd actually has a squeezing mode, it calls it autoskip and exposes it with -a and -autoskip.

@sharkdp
Owner

sharkdp commented Jan 11, 2023

Thank you. I did not know that. xxd -a test takes almost 3 minutes (2:57 min), i.e. it's still an order of magnitude slower than both hexdump and hexyl.
