Seems like if we are really going to take advantage of object stores, we are going to need to make concurrent calls to get HTTP range subsets? Noting that RandomAccessFile.java is called out as not thread-safe -- does that basically mean a hard no, it's not going to happen... or?
There is hope, even if RandomAccessFile and friends are not themselves thread-safe. When reading a "remote" random access file, we use a loading cache, where the cache entries are chunks of data that are the size of the remote read buffer (configurable). If a read is requested that requires multiple buffers to be filled first (for example, reading data from a variable), there is code that fills the loading cache with those individual buffers, and then RandomAccessFile returns the data to a higher level. Currently, the cache is loaded serially, but this could be done in multiple threads, and that's where you would see the speedup you are looking for. The code that would need to be changed to support parallel reads is:
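To make the serial-vs-parallel idea concrete, here is a minimal sketch of loading several cache chunks concurrently with a thread pool. This is illustrative only, not the actual netcdf-java code: the names `fetchChunk`, `loadChunks`, and `ParallelChunkLoad` are hypothetical, and `fetchChunk` just fabricates bytes where the real code would issue an HTTP range request.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch only -- these names are hypothetical, not the
 *  actual netcdf-java API. */
public class ParallelChunkLoad {

  /** Stand-in for one HTTP range request against the object store. */
  static byte[] fetchChunk(long offset, int size) {
    byte[] buf = new byte[size];
    for (int i = 0; i < size; i++) {
      buf[i] = (byte) ((offset + i) & 0xFF); // fake data keyed to the offset
    }
    return buf;
  }

  /** Fill the cache for all needed chunk offsets concurrently instead of serially. */
  static Map<Long, byte[]> loadChunks(List<Long> offsets, int chunkSize, int nThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    try {
      // One future per range request; the requests run in parallel.
      Map<Long, Future<byte[]>> futures = new ConcurrentHashMap<>();
      for (long off : offsets) {
        futures.put(off, pool.submit(() -> fetchChunk(off, chunkSize)));
      }
      Map<Long, byte[]> cache = new ConcurrentHashMap<>();
      for (Map.Entry<Long, Future<byte[]>> e : futures.entrySet()) {
        cache.put(e.getKey(), e.getValue().get()); // block until each chunk arrives
      }
      return cache;
    } finally {
      pool.shutdown();
    }
  }
}
```

The point is that a multi-buffer read waits for the slowest request rather than the sum of all of them, which is exactly where the latency savings over the serial fill would come from.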
Ok -- wow, this would be a huge help for me if it would make cdms3 work for larger read chunks. Right now, the latency of this serial pattern means traversing largish chunks of data stored in S3 takes a really long time. If we could bring that down with some parallel read capability, I think it would have a really significant effect on overall performance.
Trying to wrap my head around what this would do -- we have all our data chunked for fast per-time-step reads of county- to state-scale spatial domains.
If I increased the s3.bufferSize would it reduce the total number of S3 requests but potentially increase the latency for single requests? Is there any way to see a log of these S3 calls so I can play around with it?
If I increased the s3.bufferSize would it reduce the total number of S3 requests but potentially increase the latency for single requests?
Correct. The default is 256 KiB, which means there could be a large number of requests for even a modest slice of data, and each one would have the overhead of making an HTTP call. Increasing the buffer size should reduce the overall time spent reading the data. It won't be as dramatic as filling the buffers with parallel HTTP calls, but it should be noticeable, for sure.
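A quick back-of-envelope sketch of that tradeoff (the slice sizes here are made up for illustration; `RequestCount` is a hypothetical name):

```java
/** Illustrative only: how many range requests a contiguous read of
 *  readBytes costs at a given buffer size (ceiling division). */
public class RequestCount {
  static long requestsFor(long readBytes, long bufferBytes) {
    // Round up: a partial final buffer still costs one request.
    return (readBytes + bufferBytes - 1) / bufferBytes;
  }
}
```

For example, a contiguous 64 MiB read is 256 requests at the default 256 KiB buffer, but only 8 requests at an 8 MiB buffer -- fewer, larger calls, each paying the per-request HTTP overhead once.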
Is there any way to see a log of these S3 calls so I can play around with it?
Right now, the best way to play around with it would be to time running the ncdump utility on a slice of data, but using the toolsUI jar instead of netcdfAll (as netcdfAll does not include the cdm-s3 project).
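A timing run might look roughly like the fragment below. Everything in it is an assumption to adapt: the jar name, the ncdump entry-point class, the dataset URL, and whether the buffer size is settable as a system property -- check the toolsUI version you have for the exact details.

```shell
# Hypothetical invocation -- jar name, main class, URL, and the buffer-size
# property are placeholders; verify each against your toolsUI build.
time java -Ds3.bufferSize=2097152 \
  -cp toolsUI.jar ucar.nc2.NCdumpW \
  "cdms3:..." -v some_variable
```

Re-running with different buffer sizes and comparing wall-clock times would show the request-count vs. per-request-latency tradeoff directly.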
I was snooping around in netcdf-java/cdm/core/src/main/java/ucar/unidata/io/RandomAccessFile.java (line 51 at commit b24beca).
Seems like if we are really going to take advantage of object stores, we are going to need to make concurrent calls to get HTTP range subsets? Noting that RandomAccessFile.java is called out as not thread-safe -- does that basically mean a hard no, it's not going to happen... or?