Efficiently read part of a file by seek offsets #107

kimo-k · 2023-07-06T17:50:46Z

It would be nice to read a slice of a file, based on the start & end char indices.

ChatGPT attempt, seems to work:

Yes, there is a more efficient way to handle large files by reading only the necessary parts of the file instead of slurping the entire content. You can accomplish this using Java's java.nio.file API. Here is a way to do this using interop:

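(defn lazy-substring-of-file [filename start end]
  ;; start and end are byte offsets into the file, not character indices
  (let [path (java.nio.file.Paths/get filename (into-array String []))
        options (into-array java.nio.file.OpenOption [java.nio.file.StandardOpenOption/READ])
        bb (java.nio.ByteBuffer/allocate (- end start))]
    ;; newByteChannel is a static method on Files; with-open closes the channel even if a read throws
    (with-open [fc (java.nio.file.Files/newByteChannel path options)]
      (.position fc (long start))
      ;; a single .read may return fewer bytes than requested, so loop until the buffer is full or EOF (-1)
      (while (and (.hasRemaining bb) (not= -1 (.read fc bb)))))
    ;; decode only the bytes actually read, in case the file ended before `end`
    (String. (.array bb) 0 (.position bb) "UTF-8")))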

In this function:

java.nio.file.Paths/get is used to get a java.nio.file.Path object from the filename.

java.nio.file.Files/newByteChannel is used to create a new java.nio.channels.SeekableByteChannel to the file.

java.nio.ByteBuffer/allocate is used to create a ByteBuffer of the right size.

.position is used to set the read position of the byte channel.

.read is used to read the right amount of bytes from the file into the ByteBuffer.

String. is used to create a new string by decoding the ByteBuffer's backing array (.array bb) as UTF-8.

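This function avoids reading the whole file into memory by only reading the necessary bytes. It works best when the requested range (end minus start) is small compared to the size of the file. Note that start and end are byte offsets, so they line up with character indices only when every character before end is a single byte (for example, plain ASCII).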
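
Since the function seeks by bytes while the request talks about char indices, here is a small sketch of how the two can diverge. The sample string and the var names (text, utf8, cut) are made up for illustration and are not from the thread:

```clojure
;; Byte offsets and character indices diverge once a multi-byte character appears.
(def text "héllo")
(def utf8 (.getBytes text "UTF-8"))

(count text)   ; 5 characters
(alength utf8) ; 6 bytes, because é takes two bytes in UTF-8

;; Decoding only the first two bytes cuts é in half,
;; so the decoder substitutes the replacement character U+FFFD.
(def cut (String. utf8 0 2 "UTF-8"))
```

So a slice that starts or ends in the middle of a multi-byte character will decode with replacement characters at the edges. For binary formats this does not matter, as long as the caller works with the raw bytes instead of a string.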

borkdude · 2023-07-06T20:21:32Z

Can you tell more about the use case (rather than the implementation) aside from "would be nice"?

kimo-k · 2023-07-06T23:32:40Z

Sure, I'm imagining a program that has to read many large files at predictable locations. For instance, media & archive headers. AFAIK, using slurp or fs/read-all-bytes could incur a high performance cost per operation, due to loading each entire file into memory. For a developer, it would be nice to have a readymade cross-platform function, and not have to delve into the host API.
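
For the header use case, a rough sketch of what such a helper might look like, returning raw bytes rather than a string. The name read-bytes-at and its argument order are hypothetical, not an existing fs function, and it assumes plain JVM interop via java.io.RandomAccessFile:

```clojure
(import '(java.io RandomAccessFile))

(defn read-bytes-at
  "Hypothetical helper: returns `len` bytes starting at byte `offset` of `path`."
  [path offset len]
  (with-open [raf (RandomAccessFile. (str path) "r")]
    (let [buf (byte-array len)]
      ;; jump straight to the offset without reading what comes before it
      (.seek raf (long offset))
      ;; readFully throws EOFException if fewer than `len` bytes remain
      (.readFully raf buf)
      buf)))

(comment
  ;; e.g. peek at the first 4 bytes of a (hypothetical) archive;
  ;; ZIP files typically start with the bytes 0x50 0x4B 0x03 0x04
  (seq (read-bytes-at "archive.zip" 0 4)))
```

Only len bytes are allocated per call, regardless of how large the file is.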

That said, I've only done this sort of thing in C, so apologies if it's inaccurate or out of scope.

borkdude · 2023-07-07T09:22:00Z

I'll keep this issue open to see if more people are interested. "lazily" might not be an accurate description: you want to read some specific segment from a file, without reading all of the file into memory, right?

kimo-k changed the title ~~Read part of a file lazily~~ Efficiently read part of a file by seek offsets Jul 9, 2023
