Performance issues with small chunks #601

Open
cofinoa opened this issue May 6, 2020 · 18 comments

@cofinoa
Contributor

cofinoa commented May 6, 2020

We are facing performance issues when accessing metadata, i.e. the values of the time variable, because of the number of I/O read operations required to access all of its chunks.

In particular, the time coordinate variable is created with a chunk size of 1, requiring one chunk per time value. Therefore, if the netCDF-4 file contains many time steps (for 6-hourly or 3-hourly data, more than 10k), the netCDF-4 library has to look up and read each chunk individually (i.e. 8 bytes per chunk).
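
To put the scale in perspective (a back-of-the-envelope illustration, using the 8-byte double values mentioned above): a 10,000-step time axis stored with a chunk size of 1 forces 10,000 chunk lookups and 10,000 separate reads just to retrieve 10,000 × 8 bytes ≈ 80 KB of coordinate values.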

A better explanation of this pitfall can be found in [1]:

Chunks are too small
There is a certain amount of overhead associated with finding chunks. When chunks are made smaller, there are more of them in the dataset. When performing I/O on a dataset, if there are many chunks in the selection, it will take extra time to look up each chunk. In addition, since the chunks are stored independently, more chunks results in more I/O operations, further compounding the issue. The extra metadata needed to locate the chunks also causes the file size to increase as chunks are made smaller. Making chunks larger results in fewer chunk lookups, smaller file size, and fewer I/O operations in most cases.

This relates to: #99, #100, #164

[1] https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/

@taylor13
Collaborator

taylor13 commented May 6, 2020

@cofinoa You indicated in one of the related postings that in netCDF3 making larger chunks for the time coordinate means that it can't be declared "unlimited". In netCDF4, is that also true, or can it be declared "unlimited" and still be given bigger chunks?
Thanks.

@durack1
Contributor

durack1 commented May 6, 2020

@mauzey1 is there a preset chunking value set in the code somewhere? I recall going over this in some detail many years ago, but a quick search of the repo for "chunk" doesn't appear to show any defaults, at least from my viewing.

@cofinoa
Contributor Author

cofinoa commented May 6, 2020

@taylor13 to mitigate the problem in netCDF-3, the only solution is to not make the time dimension unlimited.

In netCDF-4/HDF5 you can select different chunk sizes: a bigger one for the time coordinate variable and a chunk size of 1 for the principal variable.

@cofinoa
Contributor Author

cofinoa commented May 6, 2020

@durack1 and @mauzey1 PR #100 merged a change that imposes a chunk size of 1 on the time coordinate.

fc738df

@taylor13
Collaborator

taylor13 commented May 6, 2020

@cofinoa - In netCDF4/HDF5, if you want a chunk size larger than 1 for an unlimited time dimension, do you have to pass multiple time slices (equal to or more than the chunk size) to be written in a single call to the netCDF library? If so, then I would say we shouldn't change the default from 1, because many people write their files one time slice at a time (i.e., they write a single time coordinate value and a corresponding data field that applies to that single time slice).

@durack1
Contributor

durack1 commented May 6, 2020

@cofinoa we tried to optimize the deflation, shuffling and chunking settings for the best balance of performance vs. file size. It is a difficult balancing act, as the only way to squeeze out the best performance for output formats is to know both 1) the data that you're writing and 2) the use of this data once written, before the file is created. We focused more on deflation (to minimize file sizes) rather than chunking (reading written data), as no default for chunking was defined in Balaji et al., 2018.

Some of the history about this can be found in #135 (comment), #164, #403. Long story short, we opted to prioritize file size first, while selecting a chunking default that provided reasonable read performance for most use cases we anticipated.

If you have a better suggestion as to how these should be set, e.g. by deploying an algorithm to assess the data being written, that would be a useful update.

I note there are some comments about the version of the netcdf library playing a role in slow read speeds, see Unidata/netcdf-c#489

This ref was also an interesting find https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_4_chunking_performance_results

@cofinoa
Contributor Author

cofinoa commented May 7, 2020

@taylor13 with respect to:

In netCDF4/HDF5, if you want a chunk size larger than 1 for an unlimited time dimension, do you have pass multiple time-slices (equal or larger than the chunk size) to be written in a single call to the netCDF library?

No. The logical size of the unlimited dimension will increase independently of the chunk size.

@durack1, about:

shuffling and chunking settings for the best performance vs file sizes. It is a difficult balancing act, as the only way to squeeze the best performance for output formats is to know both the 1) data that you're writing and 2) the use of this data once written before the file is created. We focused more on deflation (to minimize file sizes) rather than chunking (reading written data) as no default for chunking was defined in Balaji et al., 2018

I agree, and I'm not proposing to modify the chunking properties (size, deflate, shuffle, ...) of the principal netCDF variable (e.g. tas). Those performance and size optimization analyses focus on accessing (reading/writing) the actual data (the principal variable). The performance problem I'm raising is about exploring the netCDF metadata and coordinates, which is affected by the chunking/storage strategy used for them, and that strategy is independent of the one used for the principal variable. Issue #164 just mentions setting the chunk size to 1, but its performance impact was not considered; that is what I'm proposing to fix.

To support my point, I have defined a netCDF-4/HDF5 file with just one unlimited dimension and 2 variables with 2 different chunk sizes:

netcdf example {
    dimensions:
        time = UNLIMITED ; // (2 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 10 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2 ;
        par = 1, 4 ;
}

The par variable is the principal variable and time is the coordinate variable; both use time as the unlimited dimension, currently of size 2, but they use different chunk sizes: 1 and 10, respectively.

You can generate the actual netcdf file with the above CDL:

$ ncgen -7 example.cdl

and compile this simplistic (no error checking, ...) program, which adds a value to each variable along the unlimited dimension every time it is executed:

#include <netcdf.h>

int main() {
    int  ncid, time_dimid, time_varid, par_varid;
    size_t time_len, pos[1];
    double value;

    /* open the existing file for writing */
    nc_open("example.nc", NC_WRITE, &ncid);

    /* current length of the unlimited dimension */
    nc_inq_dimid(ncid, "time", &time_dimid);
    nc_inq_dimlen(ncid, time_dimid, &time_len);

    /* index of the new record, one past the current end */
    pos[0] = time_len;

    /* append one value to the time coordinate variable */
    value = (double) time_len * 2;
    nc_inq_varid(ncid, "time", &time_varid);
    nc_put_var1_double(ncid, time_varid, pos, &value);

    /* append one value to the principal variable */
    value = value * 2;
    nc_inq_varid(ncid, "par", &par_varid);
    nc_put_var1_double(ncid, par_varid, pos, &value);

    nc_close(ncid);
    return 0;
}
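
Assuming the program above is saved as addOneValue.c (name only illustrative), it could be compiled against the netCDF-C library with something like:

$ cc addOneValue.c -o addOneValue -lnetcdf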

If you execute it:

$ ./addOneValue

The content of the netCDF file will then be:

netcdf example {
    dimensions:
        time = UNLIMITED ; // (3 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 10 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2, 4 ;
        par = 1, 4, 8 ;
}

With respect to the Unidata/netcdf-c#489 issue: it mentions performance issues with metadata, but it relates to the number of netCDF entities themselves (variables, attributes, dimensions) and the library's strategy for caching them when a netCDF file is opened.

Hope this helps. Let me know if you need more info.

@durack1
Contributor

durack1 commented May 7, 2020

@cofinoa in the #601 (comment) above there was no obvious next step regarding chunking coordinate variables. Have I missed something? As noted in #164, this is currently set to 1; what is your proposal (and what is the performance improvement with it)?

@taylor13
Collaborator

taylor13 commented May 7, 2020

thank you @cofinoa for providing all this good background and information, and for bringing to our attention the performance issue in reading just the time coordinates.

If we can write individual time slices and their associated time-coordinate values one at a time to a file (i.e., in separate calls to the nc "write" function), then I agree that a vector of coordinate values should probably never be "chunked", i.e., the entire vector of coordinate values should be written as a single chunk. I wouldn't think changing the default for chunking of coordinates would be that difficult, and it would apply to the "unlimited" time coordinate as well as other "limited" coordinates.

It appears no changes would be needed for the chunking of the data array itself.

Please let us know if this would be satisfactory.

@cofinoa
Contributor Author

cofinoa commented May 7, 2020

@taylor13, yes, the data array (principal variable) is not affected. Its chunking strategy is a different discussion.

@durack1 my proposal is to define a chunk size that balances file-size concerns (#164) and performance. The performance issue is explained in the issue description above, with an excerpt from the HDF5 documentation on the cost of small chunk sizes.

Currently, the netcdf-c library defines a DEFAULT_CHUNK_SIZE of 4 MB for the general case, but for unlimited 1D variables it uses a DEFAULT_1D_UNLIM_SIZE of 4 KB. See [1].

Then, for the time coordinate variable, the chunk size can be 512 (i.e. 4 KB of 8-byte doubles, matching that default), with _DeflateLevel = 1 to mitigate the wasted space (4 KB at most) of unfilled chunks (issues #164 and #99):

netcdf example {
    dimensions:
        time = UNLIMITED ; // (2 currently)
    variables:
        double time(time) ;
            time:_ChunkSizes = 512 ;
            time:_DeflateLevel = 1 ;
        double par(time) ;
            par:_ChunkSizes = 1 ;
    data:
        time = 1, 2 ;
        par = 1, 4 ;
}

This will reduce chunk lookups and I/O operations by a factor of up to 512 (see [2]).

[1] https://github.com/Unidata/netcdf-c/blob/15e1bbbd43e5deede72c34ad0674083c7805b6bd/libhdf5/hdf5var.c#L191-L227
[2] https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/
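
For illustration only (a minimal sketch, not a proposed patch: the file name is hypothetical, the principal variable is omitted, and error checking is skipped), the same settings could be applied through the netCDF-C API when the time coordinate variable is defined:

#include <netcdf.h>

int main() {
    int ncid, time_dimid, time_varid;
    size_t time_chunk[1] = {512};   /* 512 doubles = 4 KB per chunk */

    /* hypothetical file with an unlimited time dimension and its coordinate variable */
    nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dimid);
    nc_def_var(ncid, "time", NC_DOUBLE, 1, &time_dimid, &time_varid);

    /* larger chunks for the time coordinate ... */
    nc_def_var_chunking(ncid, time_varid, NC_CHUNKED, time_chunk);
    /* ... plus deflate level 1 (no shuffle) so mostly-empty chunks cost little space */
    nc_def_var_deflate(ncid, time_varid, 0, 1, 1);

    nc_enddef(ncid);
    nc_close(ncid);
    return 0;
}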

@durack1
Contributor

durack1 commented Apr 7, 2024

@cofinoa this issue has been stale for ~4 years, so I will close it. If there are additional tweaks that make sense, please comment and reopen.

@durack1 durack1 closed this as completed Apr 7, 2024
@taylor13
Collaborator

taylor13 commented Apr 8, 2024

Perhaps the suggested changes should be implemented prior to closing?

@cofinoa
Contributor Author

cofinoa commented Apr 8, 2024

@durack1 as you pointed out, it has been stale for a long period, but I don't know whether it has been considered for the next "release" of the archiving specifications for data producers, or what its implementation status is (as @taylor13 suggests).

@durack1
Contributor

durack1 commented Apr 8, 2024

@cofinoa to be honest, your suggestions are probably better directed at updating the defaults of the netcdf-c library, as CMOR is a downstream user of it.

If there are some obvious defaults that could be updated in CMOR which optimize file sizes and file/variable access, then these would be useful to incorporate.

Reading the above, it is not obvious to me what is required to fully address the issue - if you wanted to submit a PR for consideration, this would be the fastest path to a solution.

As I noted, feel free to reopen if you want to submit a PR.

@cofinoa
Contributor Author

cofinoa commented Apr 9, 2024

@durack1 I have opened PR #733, where I guess the fix for CMOR should be applied.

The issue is not with the netCDF-C library; the issue is with CMOR itself, where the assumption of having unlimited dimensions enforces chunking A) with size 1 on the unlimited dimension and B) with the same chunk size for all netCDF variables which share the unlimited dimension in the same file. This assumption is right for the netCDF-3 data and storage model, but no longer for the netCDF-4 data and storage model.

@taylor13 and @durack1 I would also like to suggest introducing a recommendation on this issue for data producers when they start to encode data for the next CMIP7, but I don't know what the appropriate forum is: https://pcmdi.llnl.gov/CMIP6/Guide/modelers.html#7-archivingpublishing-output

@cofinoa
Contributor Author

cofinoa commented Apr 9, 2024

@durack1 I can't re-open this issue, can you re-open it for me?

@durack1 durack1 reopened this Apr 9, 2024
@durack1 durack1 added this to the 3.9.0 milestone Apr 9, 2024
@durack1
Contributor

durack1 commented Apr 9, 2024

@cofinoa thanks for PR #733; we'll pull that in and see if there are any impacts across the test suite and on typical usage file sizes, and merge it into the planned 3.9.0 release next month if everything checks out.

@durack1
Contributor

durack1 commented Apr 29, 2024

#733 merges the changes, but we need to add a test to ensure that we're a) not breaking anything, and b) not causing performance issues for "standard" datasets - for 3.9
