Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CMOR_MAX_STRING #530

Open
wachsylon opened this issue Aug 9, 2019 · 13 comments
Open

CMOR_MAX_STRING #530

wachsylon opened this issue Aug 9, 2019 · 13 comments
Milestone

Comments

@wachsylon
Copy link
Collaborator

Hi,
is there a reason why CMOR_MAX_STRING is 1024 or could we increase it?

#define CMOR_MAX_STRING 1024

Best regards,
Fabi

@taylor13
Copy link
Collaborator

taylor13 commented Aug 9, 2019

Originally, we had different string lengths for different attributes, but that go simplified, as I recall, so that all have the same length 1024 (internally in CMOR). I think there was some issue with these strings occupying lots of storage, so we didn't make it huge. I don't think there is a fundamental reason for limiting the attributes to 1024 characters, but it would be better to not have a bunch of unused space reserved for such things as source_id and experiment_id. Which attribute needs more space?

@wachsylon
Copy link
Collaborator Author

We discussed to specify a paper for references .

@taylor13
Copy link
Collaborator

Would it be difficult to double the length of the references attribute, but not the others? Would this cause any problems? Would doubling be sufficient? Note that if any single attribute is very long it will make the ncdump -h result less easy to digest.

@mauzey1
Copy link
Collaborator

mauzey1 commented Sep 16, 2019

There have been changes made in CMIP6_CV.json in cmip6-cmor-tables that add strings with lengths that exceed CMOR's current max length of 1024. This has caused tests in that repo to fail.

I think we should consider increasing CMOR_MAX_LENGTH to 2048 or 4096 to allow longer strings for attributes. I'm not sure how much that would impact file size.

@durack1 @doutriaux1 Do you know why 1024 was selected for the maximum string length? Do you think it could be increased now?

@taylor13
Copy link
Collaborator

Trouble is CMOR_MAX_LENGTH is for almost all strings created by CMOR, and the internal table is already huge with wasted empty space. I don't think we want to double it. A better solution would be to define different groups of string variables with different max. length.

@durack1
Copy link
Contributor

durack1 commented Sep 16, 2019

@mauzey1 I tend to agree with @taylor13 that we should think a little about memory usage when the source attribute is the only string that needs changing, other fields (such as activity_participation etc) are controlled, well under the 1024 limit and will not change. Only the institution_id and source in the source_id are likely to need expansion.

@doutriaux1
Copy link
Collaborator

@taylor13 is right the tables size is now huge we should create groups:

CMOR_TINY_LENGTH=16
CMOR_SHORT_LNGTH=32
CMOR_MESSAGE_LENGTH=128
CMOR_PARAGRAPH_LENGTH=1024
CMOR_TEXT_LENGTH=2048
CMOR_BLOG_LENGTH=4096

Or someting like that.

@doutriaux1
Copy link
Collaborator

most can probably fit in the first 3 and it would save memory and speed up things

@durack1
Copy link
Contributor

durack1 commented Sep 19, 2019

@mauzey1 as part of the fix for this, we'll need to implement a test so that the nightly changes to CMIP6_CV.json are tested and not merged in the case that such an error occurs

@taylor13
Copy link
Collaborator

taylor13 commented May 6, 2024

I support the @doutriaux1 suggestion above. As I recall the original FORTRAN version was implemented that way. We would need to document the max length of each string under the control of users (or in the CV's read by CMOR).

@durack1
Copy link
Contributor

durack1 commented May 7, 2024

@taylor13 with a pure python implementation string lengths would not be fixed at compilation time, so the technical challenge would be removed (python can dynamically size any string as required), but we'd need to implement some guidance so ridiculous entries are flagged/warning at a minimum, and potentially error is particularly egregious cases

@taylor13
Copy link
Collaborator

taylor13 commented May 7, 2024

Good point. I understood that we were dropping the FORTRAN option, but didn't realize that nothing would be written in C. Will pure python be fast enough? Can it be parallelized, or aren't we going to pursue that?

@durack1
Copy link
Contributor

durack1 commented May 7, 2024

@taylor13 @mauzey1 had planned to test the performance by leveraging the latest libraries associated with xarray, so DASK, etc. From my understanding, the current performance bottlenecks relate to the netcdf library, but this will be important to benchmark as we progress. I know that @matthew-mizielinski has been monitoring the CMOR performance on their new systems, so it will be important to keep track of this as we go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants