Efficiently read part of a file by seek offsets #107

Open
kimo-k opened this issue Jul 6, 2023 · 3 comments

@kimo-k

kimo-k commented Jul 6, 2023

It would be nice to read a slice of a file, based on the start & end char indices.

ChatGPT attempt, seems to work:

Yes, there is a more efficient way to handle large files by reading only the necessary parts of the file instead of slurping the entire content. You can accomplish this using Java's java.nio.file API. Here is a way to do this using interop:

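(defn lazy-substring-of-file [filename start end]
  (let [path (java.nio.file.Paths/get filename (into-array String []))
        options (into-array java.nio.file.OpenOption [java.nio.file.StandardOpenOption/READ])
        bb (java.nio.ByteBuffer/allocate (- end start))]
    ;; newByteChannel is a static method on Files; with-open closes the channel even if a read throws
    (with-open [fc (java.nio.file.Files/newByteChannel path options)]
      (.position fc (long start))
      ;; a single .read may return fewer bytes than requested, so loop until the buffer is full or EOF
      (while (and (.hasRemaining bb) (not= -1 (.read fc bb)))))
    ;; start and end are byte offsets; decode only the bytes that were actually read
    (String. (.array bb) 0 (.position bb) "UTF-8")))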

In this function:

  • java.nio.file.Paths/get is used to get a java.nio.file.Path object from the filename.
  • java.nio.file.Files/newByteChannel is used to create a new java.nio.channels.SeekableByteChannel to the file.
  • java.nio.ByteBuffer/allocate is used to create a ByteBuffer of the right size.
  • .position is used to set the read position of the byte channel.
  • .read is used to read bytes from the file into the ByteBuffer; a single call may return fewer bytes than requested.
  • String. (.array bb) "UTF-8" is used to create a new string from the ByteBuffer.

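This function avoids reading the whole file into memory by only reading the necessary bytes. It works best when the requested range (end minus start) is small compared to the size of the file. Note that start and end are byte offsets, not character indices, so a range that splits a multi-byte UTF-8 character will not decode cleanly.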
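
One caveat that may be worth checking: the function seeks by byte offset, while the request above talks about char indices. A minimal sketch of the difference, using an in-memory string instead of a file (the sample string and the var names are only illustrative):

```clojure
;; "h\u00e9llo" is five characters, but \u00e9 needs two bytes in UTF-8.
(def sample "h\u00e9llo")
(def ^bytes sample-bytes (.getBytes ^String sample "UTF-8"))

(def char-count (count sample))          ; number of characters
(def byte-count (alength sample-bytes))  ; number of bytes, one more than char-count

;; Character index 1 starts at byte offset 1 and spans two bytes.
(def second-char (String. sample-bytes 1 2 "UTF-8"))
```

For pure ASCII content the two kinds of offset coincide; for anything else, a caller would probably need to decide which one the API means.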

@borkdude
Contributor

borkdude commented Jul 6, 2023

Can you tell more about the use case (rather than the implementation) aside from "would be nice"?

@kimo-k
Author

kimo-k commented Jul 6, 2023

Sure, I'm imagining a program that has to read many large files at predictable locations. For instance, media & archive headers. AFAIK, using slurp or fs/read-all-bytes could incur a high performance cost per operation, due to loading each entire file into memory. For a developer, it would be nice to have a readymade cross-platform function, and not have to delve into the host API.

That said, I've only done this sort of thing in C, so apologies if it's inaccurate or out of scope.
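
For the header-reading use case described above, raw bytes are probably more useful than a decoded string. A rough sketch, assuming a hypothetical helper named read-byte-range built on java.io.RandomAccessFile (this is not an existing function in the library, just one possible shape for it):

```clojure
(import '(java.io RandomAccessFile))

(defn read-byte-range
  "Hypothetical helper: returns a byte array holding up to n bytes of the
  file, starting at the given byte offset. Returns fewer bytes (possibly
  none) when the range runs past the end of the file."
  [^String filename offset n]
  (with-open [raf (RandomAccessFile. filename "r")]
    (let [available (max 0 (- (.length raf) (long offset)))
          buf       (byte-array (min (long n) available))]
      (.seek raf (long offset))   ; jump straight to the offset
      (.readFully raf buf)        ; fill buf completely, or throw
      buf)))
```

Usage might look like (read-byte-range "some-file.png" 0 8) to fetch the first eight bytes of a file, where the file name is a placeholder. Only the requested bytes are allocated, regardless of the file's size.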

@borkdude
Contributor

borkdude commented Jul 7, 2023

I'll keep this issue open to see if more people are interested. "lazily" might not be an accurate description: you want to read some specific segment from a file, without reading all of the file into memory, right?

@kimo-k kimo-k changed the title Read part of a file lazily Efficiently read part of a file by seek offsets Jul 9, 2023