Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File.reading a file bigger than 2GB results in an OverflowError #14485

Open
koute opened this issue Apr 14, 2024 · 7 comments
Open

File.reading a file bigger than 2GB results in an OverflowError #14485

koute opened this issue Apr 14, 2024 · 7 comments

Comments

@koute
Copy link

koute commented Apr 14, 2024

Unlike in Ruby it seems like File.reading a file that's at least 2GB in size doesn't work.

Consider the following:

$ dd if=/dev/zero of=/tmp/dummy bs=1M count=2048
$ echo 'File.read("/tmp/dummy")' > test.cr
$ crystal ./test.cr
Unhandled exception: Arithmetic overflow (OverflowError)
  from /usr/lib/crystal/file.cr:748:26 in 'read'
  from /usr/lib/crystal/file.cr:739:3 in 'read'
  from test.cr:1:1 in '__crystal_main'
  from /usr/lib/crystal/crystal/main.cr:129:5 in 'main_user_code'
  from /usr/lib/crystal/crystal/main.cr:115:7 in 'main'
  from /usr/lib/crystal/crystal/main.cr:141:3 in 'main'
  from /usr/lib/libc.so.6 in '??'
  from /usr/lib/libc.so.6 in '__libc_start_main'

If you change the count=2048 to count=2047 then it works.

Related: #3209

Crystal 1.11.2 (2024-03-02)

LLVM: 17.0.6
Default target: x86_64-pc-linux-gnu
@straight-shoota
Copy link
Member

straight-shoota commented Apr 15, 2024

The limitation is based on Int32 being the type to represent the size of a collection (String in this case), so it can only have Int32::MAX elements.

Now I'm wondering if it really makes sense to actually have such a large String instance. It might be quite inconvenient because char indices are non-linear. So that specific detail might be up for debate.

But other collections, particularly Slice, should certainly be able to hold more than 2GB.

@koute Do you have a real use case where you need this or is it just a theoretical discussion?
As a workaround, you can open the file and read it in chunks, performing operations on each individual chunk of data.

Related: #8111 (comment), https://forum.crystal-lang.org/t/is-increasing-array-index-size-in-the-future/6610/2

@straight-shoota
Copy link
Member

Also related is #4011. It was originally about getting no error when accessing an out-of-range array index (which is fixed). Now it's only about improving the error message to point out the reason for overflow. I suppose this would be an improvement for this use case as well.

@koute
Copy link
Author

koute commented Apr 15, 2024

@koute Do you have a real use case where you need this or is it just a theoretical discussion?

Yes, this is a real use case.

I was looking to switch from Ruby to Crystal when writing my scripts to speed them up, and this was the very first issue I've encountered, which really surprised me (sure, I expected Crystal to not have automatic bigint promotion like Ruby has, but I didn't expect it to use 32-bit integers for something like this, which seems baffling to me considering how tiny 2GB is nowadays).

I use File.read in Ruby on big files all the time to process them, especially in my quick & dirty scripts. I know this can be worked around by e.g. reading the file in chunks, and if this was not a quick & dirty script I would certainly do that, but for quick one-off scripts I just want to minimize the friction while writing them.

@HertzDevil
Copy link
Contributor

HertzDevil commented Apr 16, 2024

Not sure about Ruby's situation, but contiguous allocations in that size range will perform poorly with the Boehm GC, especially on Windows (contrast with #14395), so I don't think the standard library is going to accommodate them in the near future.

Memory-mapped I/O is a notable exception where you could have huge contiguous memory ranges without any allocation. To create a read-only view on Windows:

lib LibC
  PAGE_READONLY = 0x02

  FILE_MAP_READ = 0x0004

  fun CreateFileMappingA(hFile : HANDLE, lpFileMappingAttributes : SECURITY_ATTRIBUTES*, flProtect : DWORD, dwMaximumSizeHigh : DWORD, dwMaximumSizeLow : DWORD, lpName : LPSTR) : HANDLE
  fun MapViewOfFile(hFileMappingObject : HANDLE, dwDesiredAccess : DWORD, dwFileOffsetHigh : DWORD, dwFileOffsetLow : DWORD, dwNumberOfBytesToMap : SizeT) : Void*
  fun UnmapViewOfFile(lpBaseAddress : Void*) : BOOL
end

File.open(...) do |file|
  handle = Crystal::System::FileDescriptor.windows_handle(file.fd)
  size = file.size
  mapping = LibC.CreateFileMappingA(handle, nil, LibC::PAGE_READONLY,
    LibC::DWORD.new!(size >> 32), LibC::DWORD.new!(size), nil)
  view = LibC.MapViewOfFile(mapping, LibC::FILE_MAP_READ, 0, 0, 0).as(UInt8*)

  # this should be okay even if `size > Int32::MAX`
  # bytes = Bytes.new(view, size, read_only: true)
  # io = IO::Memory.new(bytes, writeable: false)

  LibC.UnmapViewOfFile(view)
  LibC.CloseHandle(mapping)
end

or on Unix-like systems:

File.open(...) do |file|
  size = file.size
  view = LibC.mmap(nil, size, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0).as(UInt8*)

  # this should be okay even if `size > Int32::MAX`
  # bytes = Bytes.new(view, size, read_only: true)
  # io = IO::Memory.new(bytes, writeable: false)
  
  LibC.munmap(view, size)
end

If Slice does support 64-bit sizes, then bytes could probably act as a drop-in replacement for File.read or, more precisely, File.open(..., &.getb_to_end). io would also work as long as you don't need any IO::FileDescriptor-specific functionality.

@ysbaddaden
Copy link
Contributor

@HertzDevil Nice! I've been wondering about mmap for this use case, and this looks exciting.

@crysbot
Copy link

crysbot commented Apr 18, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/1

@crysbot
Copy link

crysbot commented Apr 18, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants