zeros in ocean post grib2 files on hera #2615

Open
JessicaMeixner-NOAA opened this issue May 22, 2024 · 95 comments · May be fixed by #2681
Labels: bug (Something isn't working)

Comments

@JessicaMeixner-NOAA
Contributor

What is wrong?

When running with the sea-ice PR that was just merged (so essentially develop as of today), @SulagnaRay-NOAA noticed that all of the ocean grib2 files contain constant values (mostly zeros). The native model output is not zeros, and the ice grib2 files also appear to be okay.

We are still investigating what is going on and why.

What should have happened?

We should have grib2 output files that match the native model output (i.e., with values that are not all zero or constant).

What machines are impacted?

Hera

Steps to reproduce

This was discovered running a C384 test case of C384mx025_3DVarAOWCDA. However, I suspect other test cases would expose this issue as well.

Some example output can be found here:
/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/gfs.20210703/06/products/ocean/grib2/0p25

Log files can be found here: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/logs/2021070306

Additional information

@GwenChen-NOAA @jiandewang @SulagnaRay-NOAA @LydiaStefanova-NOAA @guillaumevernieres @CatherineThomas-NOAA FYI - any additional information or help is appreciated!

Do you have a proposed solution?

Not yet...

JessicaMeixner-NOAA added the bug and triage labels May 22, 2024
@jiandewang
Contributor

@JessicaMeixner-NOAA we need to check the regular grid ocean nc files (which are used as input for converting to grib2), but they were erased in the g-w runs. For example, the following doesn't exist anymore:
/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.2629439

@JessicaMeixner-NOAA
Contributor Author

@jiandewang I'll rewind and re-run one of them and save the rundir. I'll post back here when I have that.

@JessicaMeixner-NOAA
Contributor Author

Here's the saved output, @jiandewang:

TMP: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.4064953
LOG: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/logs/2021070306/gfsocean_prod_f234-f240.log
COM: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/cold03/COMROOT/cold03/gfs.20210703/06/products/ocean

@jiandewang
Contributor

@JessicaMeixner-NOAA quick check for these three files:
ocean.nc: ocean native grid master file, looks good
ocean.0p25.nc: regular grid, all zero
ocean.1p00.nc: regular grid, all zero

So the problem happens in the tripolar-to-regular interpolation step. Let me go through the log file to see if there is any clue.
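
(For reference, this kind of min/max spot check can be reproduced outside the workflow; a minimal sketch, assuming the netCDF4 Python module is available and using "temp" as a stand-in for whichever variable you want to check — the actual name may differ:)

import netCDF4 as nc
import numpy as np

# Print the min/max of one variable per file, ignoring masked (land) points.
for fname in ["ocean.nc", "ocean.0p25.nc", "ocean.1p00.nc"]:
    with nc.Dataset(fname) as ds:
        data = np.ma.compressed(ds.variables["temp"][:])  # drops fill/land values
        print(f"{fname}: min={data.min():.4f} max={data.max():.4f}")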

@jiandewang
Contributor

@JessicaMeixner-NOAA can you re-run it but set debug to true?
See the last line of /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.4064953/ocnicepost.nml
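
(If it's useful, the flag can be flipped programmatically; a minimal sketch that assumes the namelist line is literally "debug = .false.", so check the file first:)

from pathlib import Path

# Toggle the ocnicepost debug flag in ocnicepost.nml before re-running.
nml = Path("ocnicepost.nml")
nml.write_text(nml.read_text().replace("debug = .false.", "debug = .true."))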

@JessicaMeixner-NOAA
Contributor Author

@jiandewang here's the output with debug=true:
/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181

@aerorahul
Contributor

The output with debug = .true. traces the code execution.
I did an ncview and ncdump on the intermediate files, e.g. ocean.0p25.rdbilin3d.nc, but I was unable to get any clues from them.
I wondered if there had been a change in the interpolation weights, so I looked at /scratch1/NCEPDEV/global/glopara/fix/mom6/20240416/post/mx025/
and the timestamp on these files is 20240403, which seems reasonable.

If needed, I can dig deeper into the interpolation code.

@JessicaMeixner-NOAA
Contributor Author

@GwenChen-NOAA do you have any idea what is going on? We'd appreciate your help in tracking down the issue here.

@jiandewang
Contributor

I am trying to understand the run sequence for this post job: the fcst step generates the native ocean netCDF file, which is copied as ocean.nc; key variables are then cut out and saved as ocean_subset.nc. Which one is used as input for post? ocean.nc or ocean_subset.nc?

ls -l /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181/ocean*nc

-rw-r--r-- 1 Jessica.Meixner climate 1328960900 May 22 10:46 ocean.0p25.nc
-rw-r--r-- 1 Jessica.Meixner climate 83412020 May 22 10:45 ocean.1p00.nc
-rw-r--r-- 1 Jessica.Meixner climate 2090477767 May 21 13:06 ocean.nc
-rw-r--r-- 1 Jessica.Meixner climate 1959785283 May 22 10:46 ocean_subset.nc

ocean.1p00.nc is generated 1 minute before ocean_subset.nc

I looked at line 74 of /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181/ocean.post.log.
It shows the min/max before and after the interpolation, and the numbers there are totally fine. But somehow, when we look at the final products, they are all zero. Really puzzled here.

@GwenChen-NOAA
Contributor

@JessicaMeixner-NOAA, can you provide the number of the sea-ice PR that was just merged? It will be helpful for looking at the code changes.

@GwenChen-NOAA
Contributor

I am trying to understand the run sequence for this post job: the fcst step generates the native ocean netCDF file, which is copied as ocean.nc; key variables are then cut out and saved as ocean_subset.nc. Which one is used as input for post? ocean.nc or ocean_subset.nc?

@jiandewang, the ocean.nc files are used to generate grib2 files. The ocean_subset.nc files are moved to the /products directory as the netCDF products to be distributed through NOMADS.

@JessicaMeixner-NOAA
Contributor Author

@jiandewang I think ocean.nc is used to create ocean_subset.nc - I could be wrong... let me look into that more.

@GwenChen-NOAA - The PR is #2584. I did just confirm that output on Hera from before this PR was merged also had the issue where the grib files were all zeros, so the sea-ice analysis PR is not the cause of this problem. I'm not sure how long this issue has been in the develop branch, or whether it's just a Hera issue or something else.

@GwenChen-NOAA
Contributor

GwenChen-NOAA commented May 22, 2024

@GwenChen-NOAA - The PR is #2584. I did just confirm that output on Hera from before this PR was merged also had the issue where the grib files were all zeros, so the sea-ice analysis PR is not the cause of this problem. I'm not sure how long this issue has been in the develop branch, or whether it's just a Hera issue or something else.

@JessicaMeixner-NOAA, can you run it on WCOSS2? I know the downstream package can only run on WCOSS2.

@JessicaMeixner-NOAA
Contributor Author

JessicaMeixner-NOAA commented May 22, 2024

@GwenChen-NOAA The ocean post products should be able to be generated on RDHPCS machines, not just WCOSS2. I don't have a workflow set up there right now, so it would be great if you could try that out to see if it works.

I did find an old run from when I was trying to update the ufs-weather-model to a more recent version, and it has non-zero fields: /scratch1/NCEPDEV/climate/Jessica.Meixner/testgw2505/test02/COMROOT/test02/gfs.20191203/00/products/ocean/grib2/1p00/gfs.ocean.t00z.1p00.f072.grib2, for example, has non-zero fields. That g-w version was an update from an April 17th commit. We could also look into whether the Hera modules were updated within the ufs-weather-model between those updates, as I do think this job uses the ufs-weather-model modules.

@JessicaMeixner-NOAA
Contributor Author

Okay, I did confirm that the ufs-weather-model modules have not changed on Hera, so it's not just that.

@JessicaMeixner-NOAA
Contributor Author

@EricSinsky-NOAA I see that you've been running some ocean/ice post recently. Thought I'd ping you on this to see if you've noticed ocean grib files that were all zeros or constant in any of your testing.

WalterKolczynski-NOAA removed the triage label May 22, 2024
@EricSinsky-NOAA
Contributor

EricSinsky-NOAA commented May 22, 2024

@JessicaMeixner-NOAA I just ran the C48_S2SWA_gefs CI test case today using the most recent hash (7d2c539). I also see all zeroes in the gridded (5 degree) ocean data. The gridded NetCDF data is all zeroes as well (not just the gridded grib2 data).

@JessicaMeixner-NOAA
Contributor Author

@JessicaMeixner-NOAA I just ran the C48_S2SWA_gefs CI test case today using the most recent hash (7d2c539). I also see all zeroes in the gridded (5 degree) ocean data. The gridded NetCDF data is all zeroes as well (not just the gridded grib2 data).

@EricSinsky-NOAA thanks for the info! What machine was that on?

@EricSinsky-NOAA
Contributor

@EricSinsky-NOAA thanks for the info! What machine was that on?

@JessicaMeixner-NOAA This test was on Cactus.

@JessicaMeixner-NOAA
Contributor Author

Thanks @EricSinsky-NOAA, seems like this is not just a Hera issue then.

I'm re-running my case on Hera where I went back and found that I had the output I expected. I'm then going to merge in develop and see how that goes as well. Hopefully I'll have an update on that this afternoon.

@JessicaMeixner-NOAA
Contributor Author

Okay, my re-run of a case where I thought I had previously gotten non-zero grib2 output did not give me non-zeros this time... I believe that should rule out the model version, but I'm not sure what to look at now...

@JessicaMeixner-NOAA
Contributor Author

@GwenChen-NOAA when you tested #2611, did you get non-zero grib2 output files?

@GwenChen-NOAA
Contributor

@GwenChen-NOAA when you tested #2611, did you get non-zero grib2 output files?

@JessicaMeixner-NOAA, my test used an old version of the ocean.0p25.nc file (i.e., the latlon netCDF file output from ocnicepost) and worked fine. I saw that the ocean.0p25.nc file under /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181 also contains all zeros. I found a recently closed issue (#2483) that updated fix files for CICE and MOM6/post. Perhaps @DeniseWorthen can provide some clues here.

@aerorahul
Contributor

@GwenChen-NOAA when you tested #2611, did you get non-zero grib2 output files?

@JessicaMeixner-NOAA, my test used an old version of the ocean.0p25.nc file (i.e., the latlon netCDF file output from ocnicepost) and worked fine. I saw that the ocean.0p25.nc file under /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181 also contains all zeros. I found a recently closed issue (#2483) that updated fix files for CICE and MOM6/post. Perhaps @DeniseWorthen can provide some clues here.

The issue #2483 only added/corrected the 5-degree fix file. It did not alter the 0.25-degree or 1.0-degree fix files.

@JessicaMeixner-NOAA
Contributor Author

Thanks @aerorahul for that information!

@EricSinsky-NOAA
Contributor

I just ran the C48_S2SW CI test case on Cactus using the 5/13/2024 commit hash (6ca106e). The gridded ocean data still consists of all zeroes as of the 5/13/2024 gw version. I will keep going back to earlier commit hashes to get a better idea of when and why this issue started.

@JessicaMeixner-NOAA
Contributor Author

I updated to the latest version of the ufs-weather-model on Hera and ran another test, and still got all zeros in the gribs. @EricSinsky-NOAA, we know at least the HR3 tag 6f9afff from Feb 21st has non-zero gribs on WCOSS2. On Hera, the furthest back we can go in g-w would be the Rocky 8 transition commit.

@EricSinsky-NOAA
Contributor

@jiandewang After replacing the fix files with /scratch2/NCEPDEV/ensemble/noscrub/Eric.Sinsky/ocnpost_bugfix/oceanice_products.3448181/fixed-file-wcoss2 and rerunning, I am still getting all zeroes.

@JessicaMeixner-NOAA
Contributor Author

My test run of C48 on wcoss2 did not do well: /lfs/h2/emc/couple/noscrub/jessica.meixner/testoceanpost/hr3/test01/COMROOT/c48t01/gfs.20210323/12/products/ocean/grib2/5p00

@EricSinsky-NOAA
Contributor

Thank you, @JessicaMeixner-NOAA. It sounds like this might be an issue with the build of ocnicepost.x on WCOSS2 and Hera. @jiandewang When you ran your HR3 test and you got reasonable interpolated ocean output, did you rebuild ocnicepost.x (as well as the other executables related to HR3) during your test?

@jiandewang
Contributor

Thank you, @JessicaMeixner-NOAA. It sounds like this might be an issue with the build of ocnicepost.x on WCOSS2 and Hera. @jiandewang When you ran your HR3 test and you got reasonable interpolated ocean output, did you rebuild ocnicepost.x (as well as the other executables related to HR3) during your test?

No, I just used my original *.x built several months ago.

@JessicaMeixner-NOAA
Contributor Author

I did a new build, but I did have an old build too... I'll try the 0.25 case w/the new build and I'll also try using my old build on a C48 case and see what happens.

@JessicaMeixner-NOAA
Contributor Author

Update:

  • With the HR3 g-w tag 6f9afff from Feb 21, 2024 on WCOSS2, using both old and new builds, I get zeros for the C48mx500 test cases, and I get actual non-zero answers for the C768mx025 test case I ran.

Therefore, I think there are likely issues with all of the 5-degree cases, so we should not use those to judge whether things are working.

@EricSinsky-NOAA
Contributor

EricSinsky-NOAA commented May 24, 2024

@JessicaMeixner-NOAA Glad to see you are getting non-zeroes for C768mx025. Were the C768mx025 test cases also based on the HR3 tag (not just the C48mx500 test case)? Also did you run the C768mx025 test case using both your old build and new build too?

Also, I ran an old version of ocnicepost offline. I got non-zeroes in the interpolated NetCDF output. In this test, however, the resolution of the NetCDF input (MOM6) data was mx025.

@JessicaMeixner-NOAA
Contributor Author

@EricSinsky-NOAA It is nice to see some non-zero values, for sure!!

For the tests I ran with the HR3 tag, I used both the old build and the new build, and both had non-zeros.

@EricSinsky-NOAA
Contributor

This is my understanding of what we know so far:

  • The C768mx025 case with the HR3 tag results in non-zero values in the interpolated ocean output for both the new build and the old build on WCOSS2. This means that building ocnicepost in the present-day WCOSS2 environment should be OK.
  • The C48mx500 case with the HR3 tag results in all-zero values in the interpolated ocean output for both the new build and the old build on WCOSS2.
  • The C384mx025 case using the most recent gw hash results in all-zero values in the interpolated ocean output (based on Jessica's tests on Hera). This means that we still get zero values for mx025 cases from hashes newer than the HR3 tag.

@JessicaMeixner-NOAA
Contributor Author

@EricSinsky-NOAA I'd say that we get zeros with the newest hashes. Where the mx025 issues come in between the HR3 tag and now is still an open question, I think; since most of our previous testing was based on mx500, I'm not sure we have a lot of information about the in-between commits. I'm going to run a few tests on WCOSS2 to see if we can narrow down the issue there.

@aerorahul
Contributor

Thank you @EricSinsky-NOAA for the summary and @JessicaMeixner-NOAA for the additional information.

A few questions:

  • For the C768mx025 case with the HR3 tag, can you post the date of the fix files (interpolation weights) being used?
  • If we re-run the executable ocnicepost.x with these weights replaced by the develop version, do we get non-zero results?

I'd say we need to find a baseline that works first; I think we have that in the C768mx025 case with the HR3 tag. Unfortunately, C48mx500 with the HR3 tag resulted in zeros.

@JessicaMeixner-NOAA
Contributor Author

For the HR3 tag on WCOSS2 the mom6 fix files are:
mom6 -> /lfs/h2/emc/global/noscrub/emc.global/FIX/fix/mom6/20231219

I'm currently trying to test the commit before the fix-file change on WCOSS2 with mx025 to see if that works. I did find an experiment on Hera where a case using the old fix files and mx025 still gave me zeros...

@JessicaMeixner-NOAA
Contributor Author

I ran with mx025 on WCOSS2 for commit hashes 6ca106e and d5366c6 (the one that changed the mom6 fix), and they both give me non-zero output for the grib2 files...

I can share paths if that's helpful. Has anyone tried an mx025 case on Orion?

@JessicaMeixner-NOAA
Contributor Author

So some random thoughts before the weekend:

  • Is it possible that a module mismatch is causing problems on Hera, where gfs_utils uses spack-stack 1.6.1 but the ocean post job loads the ufs-weather-model module files, which use spack-stack 1.5.1?
  • Did we ever confirm whether the diffs between WCOSS2 and Hera that @jiandewang saw were because of version numbers, or whether there were actual differences?

@EricSinsky-NOAA
Copy link
Contributor

EricSinsky-NOAA commented May 24, 2024

  • Did we ever confirm whether the diffs between WCOSS2 and Hera that @jiandewang saw were because of version numbers, or whether there were actual differences?

@JessicaMeixner-NOAA The diffs between WCOSS2 and Hera are because the comparisons were between two different versions of the fix files. The fix files being compared from WCOSS2 are the 20231219 version, while the fix files being compared from Hera are the 20240416 version. Both fix file versions exist on both WCOSS2 and Hera. When the fix files of the same version are compared between WCOSS2 and Hera, the file sizes are identical.
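
(Identical file sizes are suggestive but not conclusive; comparing checksums is a stronger check. A minimal sketch, where the path is a placeholder for the same fix file on each machine:)

import hashlib

def md5sum(path):
    # Hash the file in chunks so large fix files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder path: run on each machine and compare the printed digests.
print(md5sum("fix/mom6/20240416/post/mx025/some_weights_file.nc"))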

@JessicaMeixner-NOAA
Contributor Author

@EricSinsky-NOAA thanks for confirming that!

@jiandewang
Contributor

Some further testing results:
(1) Fix files, 20231219 version vs. 20240416 version: there is a 360-degree offset in longitude between them. The results generated by them are not identical, but the differences are at roundoff level (~1e-8). So this is not the reason for the zero values in the regular grid files.

(2) In the HR3 run on WCOSS2, which gave us correct results, the ocean master files are on 40 levels. However, in Jessica's Hera run (/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_06/C384iaucold03/TMP/RUNDIRS/cold03/oceanice_products.3448181) and in Eric's run, ocean.nc is on 75 levels because the run is configured for DA;
see https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L454-L459
I used Jessica's run dir as a template but replaced ocean.nc with the one from the HR3 run (40 levels), and it then generated a correct regular grid file.
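
(Both properties are easy to confirm on a given ocean.nc; a minimal sketch assuming netCDF4, with "temp" standing in for whichever 3-D variable you want to inspect — names may differ in the actual file:)

import netCDF4 as nc

with nc.Dataset("ocean.nc") as ds:
    # Vertical level counts; MOM6 history files typically name these z_l/z_i.
    print({name: len(dim) for name, dim in ds.dimensions.items() if name.startswith("z")})
    # The missing_value attribute the post code will see for this variable.
    var = ds.variables["temp"]
    print("missing_value:", getattr(var, "missing_value", "not set"))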

@jiandewang
Contributor

More testing results:
It is the missing value that messed up the results. In the HR3 run it is -1e34, while in DA it is set to 0.
After I re-set the missing value to -1e34 in ocean.nc from Jessica's run dir, the interpolated results are correct. I think this missing value is embedded in the fix files, since they were generated using output from one of the previous HRx runs, where it is -1e34.
I did my test on WCOSS2; somehow I had trouble running it on Hera due to module loading.

@EricSinsky-NOAA: you may repeat your run but use my modified input file at /scratch1/NCEPDEV/climate/Jiande.Wang/working/scratch/ocean-zero-value/oceanice_products.3448181-JM/NCO2/ocean.nc-JM-75L-E34, or you can simply repeat your C48mx500 run but set https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L456C9-L456C31 to -1e34
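
(To illustrate why the sentinel matters: a land/sea test built around -1e34 silently treats 0-filled land points as valid ocean when the file was written with a missing value of 0. The exact failure path inside ocnicepost isn't spelled out in this thread, so the values below are purely illustrative:)

import numpy as np

sentinel = -1e34  # what the post/fix files expect for land

# HR3-style history: land flagged with -1e34.
hr3 = np.array([25.0, 18.0, 0.0, -1e34])
print(hr3 != sentinel)  # [ True  True  True False] -> land correctly excluded

# DA-style history: land filled with 0 (missing_value attribute also 0).
da = np.array([25.0, 18.0, 0.0, 0.0])
print(da != sentinel)   # [ True  True  True  True] -> land treated as ocean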

@EricSinsky-NOAA
Contributor

EricSinsky-NOAA commented May 28, 2024

@jiandewang Thank you very much for finding the issue! I just ran the C48_S2SWA_gefs CI test case (MOM6 is set to mx500) using the most recent hash. I have set MOM6_DIAG_MISVAL to -1e34 in parm/config/gefs/config.ufs and this fixed the issue (non-zeroes in the interpolated ocean output).

EDIT: My test was on WCOSS2.

@JessicaMeixner-NOAA
Contributor Author

The exception value will need to be resolved with @guillaumevernieres and others, as DA might need the missing value to be set to 0.

@jiandewang what module issues did you have on Hera? On Friday I was wondering whether module mismatches might be part of the problem.

@jiandewang
Contributor

@JessicaMeixner-NOAA I followed Walter's method (the g-w I used is the cycled one you asked me to run). No errors popped up after I did source ush/........., but when I ran ocnicepost.x it crashed while writing the 3D mask file.

@jiandewang
Contributor

A quick and dirty solution: apply this command in the script to the DA ocean file (e.g. ocean.nc) after it is generated:
ncatted -a missing_value,,m,f,-1E34 ocean.nc
That will make the ocean post happy.
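
(If NCO isn't handy, the same attribute-only edit can be done with netCDF4; a minimal sketch, with the path as a placeholder. Note that, like the ncatted one-liner, this changes only the missing_value attribute, not the data values themselves:)

import netCDF4 as nc
import numpy as np

with nc.Dataset("ocean.nc", "r+") as ds:
    for var in ds.variables.values():
        if "missing_value" in var.ncattrs():
            # Mirror ncatted's ',m,f' flags: modify the attribute as a float.
            var.setncattr("missing_value", np.float32(-1e34))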

@DeniseWorthen

DeniseWorthen commented May 29, 2024

Apologies for being late to the party. Am I understanding that the missing value is defined as 0.0 in the history file? A missing value of 0.0 makes no sense to me, since it is also a valid value. How do you distinguish where Temp=0 because it really is 0.0C and where it is 0 because it is a land point?

@jiandewang
Contributor

Apologies for being late to the party. Am I understanding that the missing value is defined as 0.0 in the history file? A missing value of 0.0 makes no sense to me, since it is also a valid value. How do you distinguish where Temp=0 because it really is 0.0C and where it is 0 because it is a land point?

@DeniseWorthen see https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L456C9-L456C31

@DeniseWorthen

@jiandewang Thanks, but that doesn't answer my question really. How is a missing value of 0.0 being distinguished from a physical value of 0.0?

@guillaumevernieres
Contributor

@jiandewang Thanks, but that doesn't answer my question really. How is a missing value of 0.0 being distinguished from a physical value of 0.0?

@DeniseWorthen, you just don't construct your mask based on the fill value.

@DeniseWorthen

DeniseWorthen commented May 29, 2024

@guillaumevernieres Thanks. So where does your mask come from?

edit: I mean, which file? Are you retrieving it from the model output or are you using something else?

@guillaumevernieres
Contributor

@guillaumevernieres Thanks. So where does your mask come from?

edit: I mean, which file? Are you retrieving it from the model output or are you using something else?

We use the mom6 grid generation functionality but this is overkill for this issue. The mask could simply be constructed using the layer thicknesses.
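
(A minimal sketch of that idea, assuming netCDF4 and that layer thickness is stored under the common MOM6 name "h"; both assumptions should be checked against the actual history file:)

import netCDF4 as nc
import numpy as np

with nc.Dataset("ocean.nc") as ds:
    # Layer thickness at the first time level: shape (nz, ny, nx).
    h = np.ma.filled(ds.variables["h"][0], 0.0)

# Wet where the layer has non-negligible thickness; land and vanished layers -> 0.
mask3d = (h > 1.0e-6).astype(np.int8)
print("wet fraction:", mask3d.mean())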

@JessicaMeixner-NOAA
Contributor Author

A PR has been created so that GFS/GEFS and GDAS/ENKF use different exception values and numbers of layers for MOM6. This should resolve the problem, although in the future it might still be worth exploring how the mask is defined in the ocean post.
