Seems like if we are really going to take advantage of object stores, we are going to need to make concurrent calls to get HTTP range subsets? Noting that RandomAccessFile.java is called out as not thread-safe -- does that basically mean a hard no, it's not going to happen... or?
There is hope, even if RandomAccessFile and friends are not themselves thread-safe. When reading a "remote" random access file, we use a loading cache, where the cache entries are chunks of data that are the size of the remote read buffer (configurable). If a read is requested that requires multiple buffers to be filled first (for example, reading data from a variable), there is code that fills the loading cache with those individual buffers, and then RandomAccessFile returns the data to a higher level. Currently, the cache is loaded serially, but this could be done in multiple threads, and that's where you would see the speedup you are looking for. The code that would need to be changed to support parallel reads is:
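To make the serial-vs-parallel idea concrete, here is a minimal sketch of loading several cache chunks concurrently with a thread pool. This is illustrative only, not the actual netcdf-java code: the names `fetchChunk`, `loadChunks`, and `ParallelChunkLoad` are hypothetical, and `fetchChunk` just fabricates bytes where the real code would issue an HTTP range request.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch only -- these names are hypothetical, not the
 *  actual netcdf-java API. */
public class ParallelChunkLoad {

  /** Stand-in for one HTTP range request against the object store. */
  static byte[] fetchChunk(long offset, int size) {
    byte[] buf = new byte[size];
    for (int i = 0; i < size; i++) {
      buf[i] = (byte) ((offset + i) & 0xFF); // fake data keyed to the offset
    }
    return buf;
  }

  /** Fill the cache for all needed chunk offsets concurrently instead of serially. */
  static Map<Long, byte[]> loadChunks(List<Long> offsets, int chunkSize, int nThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    try {
      // One future per range request; the requests run in parallel.
      Map<Long, Future<byte[]>> futures = new ConcurrentHashMap<>();
      for (long off : offsets) {
        futures.put(off, pool.submit(() -> fetchChunk(off, chunkSize)));
      }
      Map<Long, byte[]> cache = new ConcurrentHashMap<>();
      for (Map.Entry<Long, Future<byte[]>> e : futures.entrySet()) {
        cache.put(e.getKey(), e.getValue().get()); // block until each chunk arrives
      }
      return cache;
    } finally {
      pool.shutdown();
    }
  }
}
```

The point is that a multi-buffer read waits for the slowest request rather than the sum of all of them, which is exactly where the latency savings over the serial fill would come from.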
Ok -- wow, this would be a huge help for me if it would make cdms3 work for larger read chunks. Right now, the latency of this serial pattern means traversing largish chunks of data stored in S3 takes a really long time. If we could bring that down with some parallel read capability, I think it would have a really significant effect on overall performance.
Trying to wrap my head around what this would do -- we have all our data chunked for fast per-time-step reads of county- to state-scale spatial domains.
If I increased the s3.bufferSize would it reduce the total number of S3 requests but potentially increase the latency for single requests? Is there any way to see a log of these S3 calls so I can play around with it?
If I increased the s3.bufferSize would it reduce the total number of S3 requests but potentially increase the latency for single requests?
Correct. The default is 256 KiB, which means there could be a large number of requests for even a modest slice of data, and each one would have the overhead of making an HTTP call. Increasing the buffer size should reduce the overall time spent reading the data. It won't be as dramatic as filling the buffers with parallel HTTP calls, but it should be noticeable, for sure.
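A quick back-of-envelope sketch of that tradeoff (the slice sizes here are made up for illustration; `RequestCount` is a hypothetical name):

```java
/** Illustrative only: how many range requests a contiguous read of
 *  readBytes costs at a given buffer size (ceiling division). */
public class RequestCount {
  static long requestsFor(long readBytes, long bufferBytes) {
    // Round up: a partial final buffer still costs one request.
    return (readBytes + bufferBytes - 1) / bufferBytes;
  }
}
```

For example, a contiguous 64 MiB read is 256 requests at the default 256 KiB buffer, but only 8 requests at an 8 MiB buffer -- fewer, larger calls, each paying the per-request HTTP overhead once.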
Is there any way to see a log of these S3 calls so I can play around with it?
Right now, the best way to play around with it would be to time running the ncdump utility on a slice of data, but using the toolsUI jar instead of netcdfAll (as netcdfAll does not include the cdm-s3 project).
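A timing run might look roughly like the fragment below. Everything in it is an assumption to adapt: the jar name, the ncdump entry-point class, the dataset URL, and whether the buffer size is settable as a system property -- check the toolsUI version you have for the exact details.

```shell
# Hypothetical invocation -- jar name, main class, URL, and the buffer-size
# property are placeholders; verify each against your toolsUI build.
time java -Ds3.bufferSize=2097152 \
  -cp toolsUI.jar ucar.nc2.NCdumpW \
  "cdms3:..." -v some_variable
```

Re-running with different buffer sizes and comparing wall-clock times would show the request-count vs. per-request-latency tradeoff directly.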
I was snooping around in netcdf-java/cdm/core/src/main/java/ucar/unidata/io/RandomAccessFile.java (line 51 at commit b24beca).
Seems like if we are really going to take advantage of object stores, we are going to need to make concurrent calls to get HTTP range subsets? Noting that RandomAccessFile.java is called out as not thread-safe -- does that basically mean a hard no, it's not going to happen... or?