Support for years with 5 digits #648

Open
wachsylon opened this issue Feb 11, 2022 · 23 comments

@wachsylon
Collaborator

Hi,
for paleo simulations, we have runs that go beyond 10k years. CMOR only writes 4 digits for the year, which may lead to parsing problems when the simulation time goes beyond 10000 years.

Maybe, CMOR could support a parameter 'DIGITS_YEARS'?

Best,
Fabi

@wachsylon
Collaborator Author

OK, that might be more complicated than I first thought. With 6 digits it is unclear whether the string is a 6-digit year or a 4-digit year followed by a month.
But 5 should be OK, or am I missing something?

@durack1
Contributor

durack1 commented Feb 11, 2022

@wachsylon interesting suggestion. I wonder if @taylor13 has some insights into how this has been dealt with within PMIP (and the ISMIP6 experiment offshoots noted below), which considers very long timescales that are not well represented in modern calendars.

I just took a peek, and ism-lig127k-std is the only CMIP6 experiment that requests more than 9999 years:

        "ism-lig127k-std":{
            "activity_id":[
                "ISMIP6"
            ],
            "additional_allowed_model_components":[
                ""
            ],
            "description":"Last interglacial simulation of ice sheet evolution driven by PMIP lig127k",
            "end_year":"",
            "experiment":"offline ice sheet forced by ISMIP6-specified AGCM last interglacial output",
            "experiment_id":"ism-lig127k-std",
            "min_number_yrs_per_sim":"20000",
            "parent_activity_id":[
                "no parent"
            ],
            "parent_experiment_id":[
                "no parent"
            ],
            "required_model_components":[
                "ISM"
            ],
            "start_year":"",
            "sub_experiment_id":[
                "none"
            ],
            "tier":"3"
        },

It's worth pulling @jypeter into this discussion too

@taylor13
Collaborator

The CMIP6 specifications for the "time_range" appearing in the filenames are:

The <time_range> is a string generated consistent with the following:
If frequency = “fx” then
                  <time_range>=””
else
                <time_range> = N1-N2 where N1 and N2 are integers of the form
                                      ‘yyyy[MM[dd[hh[mm[ss]]]]][<suffix>]’ (expressed as a string, 
                                      where ‘yyyy’, ‘MM’, ‘dd’, ‘hh’ ‘mm’ and ‘ss’ are 
                                      integer year, month, day, hour, minute, and second, 
                                      respectively)
endif
 
where <suffix> is defined as follows:
if the variable identified by variable_id has a time dimension with a “climatology” 
          attribute then
                   suffix = “-clim”
else
                   suffix = “”
endif
 
and where the precision of the time_range strings is determined by the “frequency” 
global attribute as specified in Table 2.

see https://goo.gl/v1drZl

So as @wachsylon has noted, if we allow 6 digits for year, unambiguous interpretation of the date is impossible without also determining the frequency. Since all current options have an even number of digits for the dates, we could allow year to be either 4 or 5 digits without knowledge of the frequency. The template would become [Y]YYYY[MM[dd[...

Is that a good idea? I don't think modifying CMOR would be a problem, but folks trying to parse the date with a 5-digit year might have problems. Does anyone (@durack1 @mauzey1 @matthew-mizielinski @jypeter @mjuckes @davidhassell @martinjuckes) know of any CMIP infrastructure software that parses the dates in the CMIP6 file names?
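
As a rough illustration of the even/odd argument above, here is a minimal sketch of a parser for the proposed `[Y]YYYY[MM[dd[...` template. It assumes the year has exactly 4 or 5 digits and every later field has exactly 2, so the length of the string alone decides the year width. The function names `split_date` and `parse_time_range` are hypothetical and are not part of CMOR or any CMIP tool.

```python
def split_date(token: str) -> dict:
    """Split one date string from a time_range into integer fields.

    Assumes a 4- or 5-digit year followed by 2-digit fields, so an even
    length means a 4-digit year and an odd length a 5-digit year.
    """
    if not (token.isascii() and token.isdigit()):
        raise ValueError(f"not a digit string: {token!r}")
    year_width = 4 if len(token) % 2 == 0 else 5
    if not year_width <= len(token) <= year_width + 10:
        raise ValueError(f"unexpected length: {token!r}")
    fields = {"year": int(token[:year_width])}
    rest = token[year_width:]
    names = ("month", "day", "hour", "minute", "second")
    # Walk the remaining digits two at a time, naming each pair in order.
    for name, start in zip(names, range(0, len(rest), 2)):
        fields[name] = int(rest[start:start + 2])
    return fields


def parse_time_range(time_range: str) -> tuple:
    """Parse 'N1-N2' or 'N1-N2-clim' into two dictionaries of fields."""
    if time_range.endswith("-clim"):
        time_range = time_range[: -len("-clim")]
    first, sep, last = time_range.partition("-")
    if not sep:
        raise ValueError(f"no '-' separator in {time_range!r}")
    return split_date(first), split_date(last)
```

With this rule `parse_time_range("1190001-1190012")` yields year 11900 with months 1 and 12, while a 6-digit year such as `"100000"` is read as year 1000, month 0, which is exactly the ambiguity noted above.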

@durack1
Contributor

durack1 commented Feb 15, 2022

@MartinaSt pinging you here

@wachsylon
Collaborator Author

If we allow [Y]YYYY, that would include allowing different numbers of digits within atomic datasets. E.g., starting from 0001 and running up to 99999 would look awkward; however, I cannot think of an issue any software would have with it. As another example, variant_label also begins with r1 rather than r01/r001 when there are more than 9/99 realizations.

For ism-lig127k-std, it could be that the request only includes yearly frequencies so that there will be no ambiguities for that experiment. I learned that in our paleo project PalMod2, we have experiments going beyond 100 k AND monthly frequency output to be published.

A solution might be to use sub_experiment_id. The experiment then can be split up into parts registered and published as sub experiments.

@wachsylon
Collaborator Author

> For ism-lig127k-std, it could be that the request only includes yearly frequencies so that there will be no ambiguities for that experiment.

Never mind! Even daily output is requested :)

@matthew-mizielinski

matthew-mizielinski commented Feb 16, 2022

For this edge case I don't have a big problem with extending the format to allow for one extra digit to cover years 10k-99k, but as Karl notes going to a 6 digit year will make interpretation of the date numbering with the current naming scheme tricky. We need to have a think about whether there are some sensible tweaks to the naming convention we use for the future to explicitly include frequency, without introducing too much in the way of disruption for users.

I wouldn't be surprised if some downstream tools will struggle to interpret the new date strings as and when they come across data formatted in this way, but as noted above this is the only experiment within CMIP6 that has this extent.

As an experiment I've just run a test and have managed to produce a file for an existing CMIP6 simulation with a 5 digit year; tas_Amon_HadGEM3-GC31-MM_amip_r1i1p1f3_gn_1190001-1190012.nc. No changes to CMOR were required here (although I had to adjust my tools slightly), and PrePARE passed this fine.

The next question I would pose would be whether the ESGF publisher and associated systems will be happy with this (@sashakames -- any thoughts).

@taylor13
Collaborator

One clarification. I wrote the template as [Y]YYYY because we want to make it to be generally backward compatible. For runs that might be expected to have values larger than 9999, we might recommend or insist that all 5 digits be included for all years, so, for example, "02022", not "2022" would designate this year in such runs.
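
A minimal sketch of that zero-padding idea, assuming the writer knows up front how many year digits a run needs. The function name `format_date` and its `year_digits` parameter are hypothetical; nothing here reflects how CMOR actually builds file names.

```python
def format_date(year, month=None, day=None, year_digits=4):
    """Build a date string with a fixed-width, zero-padded year.

    Month and day, when given, are appended as 2-digit fields.
    """
    if not 0 <= year < 10 ** year_digits:
        raise ValueError(f"year {year} does not fit in {year_digits} digits")
    text = f"{year:0{year_digits}d}"
    for value in (month, day):
        if value is None:
            break  # stop at the first missing field
        text += f"{value:02d}"
    return text
```

For a run expected to pass year 9999, `format_date(2022, year_digits=5)` gives `"02022"`, and `format_date(11900, 1, year_digits=5)` gives `"1190001"`.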

@durack1
Contributor

durack1 commented Feb 16, 2022

Haven't thought this through, but the time format could be tweaked from 20220215-20220216 to 2022-02-15-2022-02-16; this would then naturally allow any number of years, e.g. 100000 in the case of PMIP. Of course, we are adding 6 characters (-), but that does provide flexibility. I haven't thought about extending this to sub-daily (including hour info).

@taylor13
Collaborator

Yes, for a future DRS version, we could alter it as you suggest (although the hyphen separating the two dates would be more difficult to identify; I guess you could require the year to be at least 3 digits and search for the hyphen that precedes a string segment with more than 2 characters and no hyphen, but that is a bit complicated). The new template would not be backward compatible with the current DRS, so probably not a good option for immediate adoption.
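
The separator search described above could be sketched as follows. This assumes years always have at least 3 digits and all other fields have exactly 2, so the first hyphen-delimited segment after the start that is longer than 2 characters must be the second year. The function name is hypothetical.

```python
def split_single_hyphen_range(time_range: str) -> tuple:
    """Split e.g. '2022-02-15-2022-02-16' into its start and end dates.

    Relies on the year being the only segment longer than 2 characters.
    """
    segments = time_range.split("-")
    # Skip segment 0 (the first year) and look for the second year.
    for i in range(1, len(segments)):
        if len(segments[i]) > 2:
            return "-".join(segments[:i]), "-".join(segments[i:])
    raise ValueError(f"cannot find the second year in {time_range!r}")
```

It works for any year width of 3 or more, for example `"100000-01-100999-12"`, but it shows why the rule is more fragile than a dedicated separator.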

@matthew-mizielinski

matthew-mizielinski commented Feb 16, 2022

> Haven't thought this through, but the time format could be tweaked from 20220215-20220216 to 2022-02-15-2022-02-16; this would then naturally allow any number of years, e.g. 100000 in the case of PMIP. Of course, we are adding 6 characters (-), but that does provide flexibility. I haven't thought about extending this to sub-daily (including hour info).

If we take this route we could go with a double dash as the separator; e.g. 2022-02-15--2022-02-16 and 2022-02--2052-03, but as Karl notes this is one for the future. There is a whole ISO standard on date times that we could use; for sub daily frequencies we could use, 2022-02-15T0000--2022-02-16T0000 for example. ISO 8601 appears to use / to separate the start and end dates of a period, but I think that would just be too confusing here.
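
A minimal sketch of parsing the double-dash form, to show that the year width stops mattering once fields are delimited. It handles date-only strings; the `T0000` sub-daily form and negative years are deliberately left out. The function name is hypothetical.

```python
def parse_double_dash_range(time_range: str) -> tuple:
    """Parse 'YYYY-MM-DD--YYYY-MM-DD' (any year width) into integer tuples."""
    start, sep, end = time_range.partition("--")
    if not sep or not start or not end:
        raise ValueError(f"expected 'start--end', got {time_range!r}")

    def to_fields(date: str) -> tuple:
        # Each hyphen-delimited part is one field: year, month, day.
        return tuple(int(part) for part in date.split("-"))

    return to_fields(start), to_fields(end)
```

For example, `parse_double_dash_range("100000-01--100999-12")` returns `((100000, 1), (100999, 12))` with no need to know the frequency.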

@matthew-mizielinski

> One clarification. I wrote the template as [Y]YYYY because we want to make it to be generally backward compatible. For runs that might be expected to have values larger than 9999, we might recommend or insist that all 5 digits be included for all years, so, for example, "02022", not "2022" would designate this year in such runs.

Just thinking aloud; to have 5 digits used for years within an experiment we'd need to have the start / end dates or number of years included in the CMIP6_CV.json file, and then alter the behaviour of CMOR based on that value. However, there are also some experiments with a minimum number of years, but no maximum (e.g. piControl), which could (in theory) cross the 10k year boundary*. Trying to consistently handle this could get messy.

*A suitably fast model and commitment from the scientists running it would be required.

@durack1
Contributor

durack1 commented Feb 16, 2022

> Just thinking aloud; to have 5 digits used for years within an experiment we'd need to have the start / end dates or number of years included in the CMIP6_CV.json file, and then alter the behaviour of CMOR based on that value. However, there are also some experiments with a minimum number of years, but no maximum (e.g. piControl), which could (in theory) cross the 10k year boundary*. Trying to consistently handle this could get messy.
>
> *A suitably fast model and commitment from the scientists running it would be required.

Exactly, I don't see a path forward that doesn't break the existing YYYYMMDD DRS-defined format that is expected by CMIP6, but maybe I am missing something?

@sashakames
Collaborator

sashakames commented Feb 16, 2022

As far as publishing goes, the first concern is ensuring that Python can parse the "days since YYYY[Y]-MM-DD" units string. We have been tripped up by several atypically formatted years with preceding 0's. The second is whether Python's timedelta supports such long intervals, so that the full range can be given. I'm not sure to what extent those are tested.

To clarify, publishing is unaffected by the file naming scheme.
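
For reference, a small check of what the Python standard library itself allows. This only probes the built-in `datetime` module; whether the publisher relies on it or on another date library is not stated in this thread, and the helper name `stdlib_can_represent` is hypothetical.

```python
import datetime


def stdlib_can_represent(year: int) -> bool:
    """Return True if datetime.date accepts 1 January of the given year."""
    try:
        datetime.date(year, 1, 1)
    except ValueError:
        return False
    return True


# The supported year range of datetime.date and datetime.datetime.
print(datetime.MINYEAR, datetime.MAXYEAR)  # 1 9999

print(stdlib_can_represent(9999))   # True
print(stdlib_can_represent(10000))  # False

# timedelta itself is limited by its day count, not by any year value.
span = datetime.timedelta(days=36_523_917)
print(span.days)
print(datetime.timedelta.max.days)  # 999999999
```

So a `timedelta` can hold an interval of tens of millions of days, but the built-in date types cannot represent a 5-digit year as an absolute date.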

@taylor13
Collaborator

You raise a good (different) point. If the usual Python codes can't handle the "units" attribute when the year exceeds "9999", or if they can't calculate elapsed time for those units, we're in trouble. Does anyone know of limitations of cdtime and similar modules?

@durack1
Contributor

durack1 commented Feb 16, 2022

@sashakames that was where my mind had started to wander too. Within CF there are no examples that deviate from the "days since YYYY-MM-DD HH:MM:SS.x -x.xx" form or their example "seconds since 1992-10-8 15:15:42.5 -6:00".

They also include a paleoclimate calendar, which is:

double time(time) ;
  time:long_name = "time" ;
  time:units = "days since 1-1-1 0:0:0" ;
  time:calendar = "126 kyr B.P." ;
  time:month_lengths = 34, 31, 32, 30, 29, 27, 28, 28, 28, 32, 32, 34 ;

Details are from https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch04s04.html.

I agree that testing whatever we work toward through software packages is a key test.

@durack1
Contributor

durack1 commented Feb 16, 2022

Ok and that answers that:

In [5]: import cdtime

In [6]: cdtime.relativetime(31,"".join(['days since 10000-01-01 0:0:0.0']))
Out[6]: 31.000000 days since 10000-01-01 0:0:0.0

In [7]: cdtime.relativetime(31,"".join(['days since 100000-01-01 0:0:0.0']))
Out[7]: 31.000000 days since 100000-01-01 0:0:0.0

In [8]: a = cdtime.relativetime(31,"".join(['days since 100000-01-01 0:0:0.0']))

In [9]: a
Out[9]: 31.000000 days since 100000-01-01 0:0:0.0

In [10]: a.torel('days since 1-1-1')
Out[10]: 36523917.000000 days since 1-1-1

In [11]: a.torel('days since 1-1-1 12:12:12.5 -8.0')
Out[11]: 36523916.491522 days since 1-1-1 12:12:12.5 -8.0

Looks like cdtime can deal with arbitrary stuff easily, I wonder how other packages work?

@sashakames
Collaborator

@durack1 Good to know cdtime appears rather flexible, so it is a potential solution if there are problems with timedelta.

@durack1
Contributor

durack1 commented Aug 17, 2022

It would be useful to pick up this thread with the experience that @tomvothecoder and @pochedls have been generating using xcdat with cftime

@davidhassell

davidhassell commented Aug 17, 2022 via email

@taylor13
Collaborator

taylor13 commented Aug 17, 2022

For paleoclimate simulations (or simulations initiated in very early historic time, sometime Before the Common Era), a negative year might appear (although this would rule out use of both the "standard" and "julian" calendars). Perhaps we should think about how that would be handled too.

Perhaps insert a special character before the year? (e.g., "B" for BCE, or "M" for minus, or "N" for negative)
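
Purely as an illustration of the marker idea, here is a sketch that encodes a negative year with a leading "N". The marker letter, the default width and the function names `encode_year` and `decode_year` are all hypothetical choices, not an agreed convention.

```python
def encode_year(year: int, digits: int = 5, marker: str = "N") -> str:
    """Zero-pad the absolute year and prefix the marker for negative years."""
    text = f"{abs(year):0{digits}d}"
    return marker + text if year < 0 else text


def decode_year(text: str, marker: str = "N") -> int:
    """Invert encode_year: a leading marker means the year is negative."""
    if text.startswith(marker):
        return -int(text[len(marker):])
    return int(text)
```

For example, `encode_year(-500)` gives `"N00500"` and `decode_year("N00500")` gives `-500`. A letter prefix keeps the hyphen free for use as the range separator.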

@durack1
Contributor

durack1 commented Aug 17, 2022

I've just marked this as a CMOR 4.0 item, as it would be great to catch this and other tweaks as we spec out a next-gen roadmap

@durack1 durack1 added this to the 4.0/Future milestone Nov 28, 2022
@taylor13
Collaborator

taylor13 commented May 6, 2024

As I read the above, we haven't really come to a consensus on how to proceed with this.
