GSI convergence problems in scout runs in 2020 #755

jderber-NOAA opened this issue Jun 10, 2024 · 20 comments

@jderber-NOAA
Contributor

Jeff Whitaker's group is reporting convergence issues in their 3dvar C96L127 atm-only scout run starting in 2020.

"Initial cost function = 5.330609621864649467E+06
Initial gradient norm = 6.047187774679628015E+07
cost,grad,step,b,step? = 1 0 5.330609621864649467E+06 6.047187774679628015E+07 1.325505088896680807E-09 0.000000000000000000E+00 SMALL
cost,grad,step,b,step? = 1 1 5.337598182749103755E+06 4.183988366714026779E+07 1.268375413166407207E-09 4.731939262212094266E-01 SMALL
PCGSOI: WARNING **** Stopping inner iteration ***
Penalty increase or constant 1 1 0.100131102470077527E+01 0.100000000000000000E+01

I've tried various things to get around this:

  1. different initial times (from ops and/or replay) in 2020 and 2021 - no impact
  2. zero initial bias correction or bias correction from ops - no impact
  3. leaving out various observing systems (no radiances, no sat winds, no gps etc) - no impact"

Examining runs to determine the source of the issues.

@jderber-NOAA
Contributor Author

The script I was provided did not work properly on Hera. The issue appeared to be in loading the modules. I replaced those lines with what I normally use for running the GSI:
. /apps/lmod/lmod/init/ksh
module purge
module use /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop/modulefiles
module load gsi_hera.intel

module list

This appeared to make it run.

The second issue was that the output files gsitest_hera.err and gsitest_hera.out were not being deleted, so the output from the latest run was being appended to them. This created some confusion, especially when the job did not run properly. So now I am deleting these files before running the test scripts.

@jderber-NOAA
Contributor Author

Examining the stepsizes predicted by each term within stpcalc indicates that the problem is coming from the winds and the radiances. There is also a short stepsize from the background term. This indicates that there is a problem with the gradient being calculated from the winds and radiances. Will try turning off these two observation types to see if it minimizes properly. If it does, it will be necessary to look at the gradients generated from this data more closely to see why it is creating large values.

@jderber-NOAA
Contributor Author

Looking more closely at the output suggests that airs_aqua, metop-a iasi, metop-b iasi, npp atms, n20 atms, npp cris-fsr, n20 cris-fsr, and metop-c amsua are the suspicious obs. For the winds, I am not seeing anything particularly suspicious, and the wind signal may be coming from the radiances. So will turn off these radiances first.

@jderber-NOAA
Contributor Author

Didn't help much. Trying to turn off all amsu-a instruments.

@jderber-NOAA
Contributor Author

Notes.

If you start with a smaller stepsize (1.e-6), the minimization runs the full number of iterations. However, the stepsizes are very small and there is not a lot of reduction in the total penalty. This indicates that the minimization algorithm is probably OK; the problem is probably just very poorly conditioned. Need to determine the reason for the poor conditioning.
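For context, the standard conjugate-gradient error bound shows why this behaviour points at conditioning. With $\kappa = \lambda_{\max}/\lambda_{\min}$ the condition number of the (preconditioned) Hessian, the error in the Hessian norm satisfies

$$ \| x_k - x_\ast \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \| x_0 - x_\ast \|_A , $$

so eigenvalues spread far from each other (large $\kappa$) give a contraction factor close to 1: many iterations with tiny steps and little reduction in the penalty, which matches the behavior with the small initial stepsize. The experiments below try to isolate what is inflating $\kappa$.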

  1. Turn off all bias correction - no significant change.
  2. Turn off satellite error covariances - no significant change.
  3. Use observation variances from the input file rather than prepbufr - no significant change.
  4. Remove moisture constraint - no significant change.
  5. Remove all sat. obs (except gps bending) - no significant change.
  6. Remove gps bending + above - no significant change.
  7. Remove all winds + above - seems to minimize properly.
  8. Item 5 plus removing sat winds and profiler winds - as in 5.
  9. All data except all winds removed - as in 5.

@jderber-NOAA
Contributor Author

Seeing some strange things in the search direction for winds. Attempting to print out intermediate values as the search direction is being calculated to see where the strange values appear.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jun 14, 2024

It looks to me like there is an inconsistency between the background errors and the analysis resolution. JCAP=188 - I have never seen that resolution run before - maybe you run that all the time. Does NLAT=194, NLON=384 work for this JCAP? I would suggest trying to run the analysis at the operational resolution with the operational input files. I think that may converge properly, further indicating an issue with the resolution of the analysis or the input stats files.

@jswhit
Contributor

jswhit commented Jun 16, 2024

We're using global_berror.l127y194.f77. I just checked the global workflow and it uses JCAP=190 for C96 with that berror file. Don't know why we have it set to 188 - but I will try 190 and see what happens.
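For reference, a minimal standalone sketch of the grid/truncation relation being assumed here. This is a recollection of the usual GSI global analysis grid convention (nlat = jcap + 4 and nlon = 2*(jcap + 2)), not taken from the code, so the exact rule may differ:

! Sketch only: infer the truncation implied by a berror grid.
! Assumed convention: nlat = jcap + 4, nlon = 2*(jcap + 2).
program infer_jcap
  implicit none
  integer :: nlat, nlon, jcap
  nlat = 194                      ! from global_berror.l127y194.f77
  nlon = 384
  jcap = nlat - 4                 ! implied spectral truncation
  if (nlon == 2*(jcap+2)) then
     write(*,*) 'nlat/nlon consistent, implied JCAP =', jcap
  else
     write(*,*) 'nlat and nlon are not mutually consistent'
  end if
end program infer_jcap

Under that convention the y194 file implies JCAP=190, consistent with what the global workflow sets for C96.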

@jswhit
Contributor

jswhit commented Jun 16, 2024

Same problem with JCAP=190. I wonder if we need to regenerate the berror file for C96 using the backgrounds and analyses we have already generated for the scout run. The current berror file is simply interpolated from the operational C384 file.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jun 16, 2024 via email

@jswhit
Contributor

jswhit commented Jun 18, 2024

no vertical interp, just horizontal

@jderber-NOAA
Contributor Author

It looks like the problem is just very poorly conditioned (i.e., the eigenvalues of the Hessian are far from each other and from 1). This can happen if the background errors are strange, there are very small obs errors for a few obs, or there are many similar observations very close together. The first two of these do not appear to be true. Making modifications to the duplicate checking for wind obs to see if this helps (all radiances turned off). The first try (with an error in the code - I forgot an abs) seems to be better.
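To illustrate the kind of check being modified (a hypothetical standalone sketch, not the actual GSI dupcheck code; the thresholds and names are invented), note how dropping abs() would let any pair whose differences happen to be negative pass, no matter how far apart the obs are:

! Hypothetical sketch of a proximity/duplicate test for two wind obs.
program dup_check_sketch
  implicit none
  write(*,*) near_duplicate(30.0, 120.0, 0.0, 30.005, 120.002, 0.05)  ! close pair   -> T
  write(*,*) near_duplicate(30.0, 120.0, 0.0, 45.0,   150.0,   3.0 )  ! distant pair -> F
contains
  logical function near_duplicate(lat1, lon1, t1, lat2, lon2, t2)
    real, intent(in) :: lat1, lon1, t1, lat2, lon2, t2
    real, parameter :: dlat_tol = 0.01, dlon_tol = 0.01, dt_tol = 0.1
    ! abs() matters: with plain differences, a negative difference of any
    ! size would "pass" the threshold test regardless of separation.
    near_duplicate = abs(lat1 - lat2) < dlat_tol .and. &
                     abs(lon1 - lon2) < dlon_tol .and. &
                     abs(t1   - t2  ) < dt_tol
  end function near_duplicate
end program dup_check_sketch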

@jswhit
Contributor

jswhit commented Jun 21, 2024

I've found that the solution for this case is sensitive to the number of MPI tasks used. On Hercules, using 8 nodes and 10 MPI tasks per node, the error occurs. Changing the layout to 5 MPI tasks per node allows the minimization to converge (although I then get a segfault when trying to write the analyses, presumably running out of memory).

@jderber-NOAA
Contributor Author

Sounds like it might be a threading issue. I am back to the drawing board, trying to print out a bunch of stuff to see what is happening.

@jswhit
Contributor

jswhit commented Jun 24, 2024

I've got the 2020 stream running again by allocating 32 80-core Hercules nodes (with 4 MPI tasks per node) to the GSI. Reducing the node count to 20 or below results in the convergence error.

@jderber-NOAA
Contributor Author

Jeff,

I think you are onto the issue. The original script you gave me used 16 nodes and 40 tasks/node with 8 threads.

I think the number of tasks/node * number of threads should be less than or equal to the total number of processors on a node. I don't think the nodes have 320 processors (I think it is more like 40 per node). With the binding and the oversubscription of the processors, I think this is causing the issues.

I have a test in the queue using fewer tasks per node (5 tasks/node * 8 threads = 40 processors on a node), but it doesn't seem to be running. Will let you know my results.

John

@jderber-NOAA
Contributor Author

Still not working right for me. Will continue to look for the issue with grid2sub and sub2grid for the u,v and sf,vp transforms.

@jderber-NOAA
Contributor Author

Looks like the s2guv%rdispls_s array is being corrupted somewhere. Have to find where the corruption occurs.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jul 3, 2024

I think I have solved the problem! A test is waiting to run. It looks like one of the radiance covariance files (I think AIRS sea - correction: cris-fsr_nppsea) is inconsistent, with more active channels (coun=100) than nch_chan (92), around line 463 of correlated_obsmod.F90. Because of this, the indxRf array (dimensioned nch_chan) goes out of bounds and messes up some of the all-to-all communication arrays. Everything goes downhill from there.

The best solution is to remove the inconsistency between the definition of the nch_active input variable and the number of active channels (iuse_rad > 0). We should also put a check in the correlated_obsmod routine for this case and print out a warning message (and stop?).
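Something along these lines is what I have in mind for the check (a hedged standalone sketch; the variable names are borrowed from the discussion above and the real correlated_obsmod.F90 interface may differ):

! Sketch of the proposed sanity check: if more channels are flagged
! active (iuse_rad > 0) than the covariance file declares, the index
! array dimensioned by the declared count would go out of bounds,
! so warn and stop rather than continue.
program channel_count_check
  implicit none
  integer, parameter :: nchanl = 100        ! pretend 100 channels are flagged active
  integer :: iuse_rad(nchanl)
  iuse_rad = 1
  call check_active_channels(nchanl, iuse_rad, 92)   ! 92 = count declared by the R covariance file
contains
  subroutine check_active_channels(nch, iuse_rad, nch_active)
    integer, intent(in) :: nch, nch_active
    integer, intent(in) :: iuse_rad(nch)
    integer :: coun
    coun = count(iuse_rad > 0)
    if (coun > nch_active) then
       write(*,*) 'CORRELATED_OBSMOD WARNING: active channels =', coun, &
                  ' exceeds nch_active =', nch_active
       error stop 'inconsistent channel counts for correlated obs errors'
    end if
  end subroutine check_active_channels
end program channel_count_check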

It is late and I will be busy most of tomorrow. So later tomorrow I will give more details.

@jderber-NOAA
Contributor Author

That should be cris-fsr_npp sea, not AIRS sea, above. My run failed. I suspect my quick fix for getting around the issue is the cause. Will do more later.
