GSI convergence problems in scout runs in 2020 #755

jderber-NOAA opened this issue Jun 10, 2024 · 20 comments

@jderber-NOAA
Contributor

Jeff Whitaker's group is reporting convergence issues in their 3dvar C96L127 atm-only scout run starting in 2020.

"Initial cost function = 5.330609621864649467E+06
Initial gradient norm = 6.047187774679628015E+07
cost,grad,step,b,step? = 1 0 5.330609621864649467E+06 6.047187774679628015E+07 1.325505088896680807E-09 0.000000000000000000E+00 SMALL
cost,grad,step,b,step? = 1 1 5.337598182749103755E+06 4.183988366714026779E+07 1.268375413166407207E-09 4.731939262212094266E-01 SMALL
PCGSOI: WARNING **** Stopping inner iteration ***
Penalty increase or constant 1 1 0.100131102470077527E+01 0.100000000000000000E+01

I've tried various things to get around this:

  1. different initial times (from ops and/or replay) in 2020 and 2021 - no impact
  2. zero initial bias correction or bias correction from ops - no impact
  3. leaving out various observing systems (no radiances, no sat winds, no gps etc) - no impact"

Examining runs to determine the source of the issues.

@jderber-NOAA
Contributor Author

The script I was provided did not work properly on Hera. The issue appeared to be in loading the modules. I replaced those lines with what I normally use for running the GSI:
. /apps/lmod/lmod/init/ksh
module purge
module use /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop/modulefiles
module load gsi_hera.intel

module list

This appeared to make it run.

The second issue was that the output files gsitest_hera.err and gsitest_hera.out were not being deleted, so the output from the latest run was being appended to them. This created some confusion, especially when the job did not run properly. So now I am deleting these files before running the test scripts.

@jderber-NOAA
Contributor Author

Examining the stepsizes predicted by each term within stpcalc indicates that the problem is coming from the winds and the radiances. There is also a short stepsize from the background term. This indicates that there is a problem with the gradient being calculated from the winds and radiances. Will try turning off these two observation types to see if it minimizes properly. If it does, it will be necessary to look at the gradients generated from this data more closely to see why it is creating large values.

@jderber-NOAA
Contributor Author

Looking more closely at the output suggests that airs_aqua, metop-a iasi, metop-b iasi, npp atms, n20 atms, npp cris-fsr, n20 cris-fsr, and metop-c amsua are the suspicious obs. For the winds, I am not seeing anything particularly suspicious, and the wind signal may be coming from the radiances. So will turn off these radiances first.

@jderber-NOAA
Contributor Author

Didn't help much. Trying to turn off all amsu-a instruments.

@jderber-NOAA
Contributor Author

Notes.

If you start with a smaller stepsize (1.e-6), the minimization runs the full number of iterations. However, the stepsizes are very small and there is not a lot of reduction in the total penalty. This indicates that the minimization algorithm is probably OK; the problem is probably just very poorly conditioned. Need to determine the reason for the poor conditioning.
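For context, the standard conjugate-gradient error bound shows why this behaviour points at conditioning. With $\kappa = \lambda_{\max}/\lambda_{\min}$ the condition number of the (preconditioned) Hessian, the error in the Hessian norm satisfies

$$ \| x_k - x_\ast \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \| x_0 - x_\ast \|_A , $$

so eigenvalues spread far from each other (large $\kappa$) give a contraction factor close to 1: many iterations with tiny steps and little reduction in the penalty, which matches the behavior with the small initial stepsize. The experiments below try to isolate what is inflating $\kappa$.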

  1. Turn off all bias correction - no significant change.
  2. Turn off satellite error covariances - no significant change.
  3. Use observation variances from the input file rather than prepbufr - no significant change.
  4. Remove moisture constraint - no significant change.
  5. Remove all sat. obs (except gps bending) - no significant change.
  6. Remove gps bending + above - no significant change.
  7. Remove all winds + above - seems to minimize properly.
  8. Item 5 plus removing sat winds and profiler winds - as in 5.
  9. All data except all winds removed - as in 5.

@jderber-NOAA
Contributor Author

Seeing some strange things in the search direction for winds. Attempting to print out intermediate values as the search direction is being calculated to see where the strange values appear.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jun 14, 2024

It looks to me like there is an inconsistency between the background errors and the analysis resolution. JCAP=188 - I have never seen that resolution run before - maybe you run that all the time. Does NLAT=194, NLON=384 work for this JCAP? I would suggest trying to run the analysis at the operational resolution with the operational input files. I think that may converge properly, further indicating an issue with the resolution of the analysis or the input stats files.

@jswhit
Contributor

jswhit commented Jun 16, 2024

We're using global_berror.l127y194.f77. I just checked the global workflow and it uses JCAP=190 for C96 with that berror file. Don't know why we have it set to 188 - but I will try 190 and see what happens.
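For reference, a minimal standalone sketch of the grid/truncation relation being assumed here. This is a recollection of the usual GSI global analysis grid convention (nlat = jcap + 4 and nlon = 2*(jcap + 2)), not taken from the code, so the exact rule may differ:

! Sketch only: infer the truncation implied by a berror grid.
! Assumed convention: nlat = jcap + 4, nlon = 2*(jcap + 2).
program infer_jcap
  implicit none
  integer :: nlat, nlon, jcap
  nlat = 194                      ! from global_berror.l127y194.f77
  nlon = 384
  jcap = nlat - 4                 ! implied spectral truncation
  if (nlon == 2*(jcap+2)) then
     write(*,*) 'nlat/nlon consistent, implied JCAP =', jcap
  else
     write(*,*) 'nlat and nlon are not mutually consistent'
  end if
end program infer_jcap

Under that convention the y194 file implies JCAP=190, consistent with what the global workflow sets for C96.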

@jswhit
Contributor

jswhit commented Jun 16, 2024

Same problem with JCAP=190. I wonder if we need to regenerate the berror file for C96 using the backgrounds and analyses we have already generated for the scout run. The current berror file is simply interpolated from the operational C384 file.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jun 16, 2024 via email

@jswhit
Contributor

jswhit commented Jun 18, 2024

no vertical interp, just horizontal

@jderber-NOAA
Contributor Author

It looks like the problem is just very poorly conditioned (i.e., the eigenvalues of the Hessian are far from each other and from 1). This can happen if the background errors are strange, there are very small obs errors for a few obs, or there are many similar observations very close together. The first two of these do not appear to be true. Making modifications to the duplicate checking for wind obs to see if this helps (all radiances turned off). The first try (with an error in the code - I forgot an abs) seems to be better.
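To illustrate the kind of check being modified (a hypothetical standalone sketch, not the actual GSI dupcheck code; the thresholds and names are invented), note how dropping abs() would let any pair whose differences happen to be negative pass, no matter how far apart the obs are:

! Hypothetical sketch of a proximity/duplicate test for two wind obs.
program dup_check_sketch
  implicit none
  write(*,*) near_duplicate(30.0, 120.0, 0.0, 30.005, 120.002, 0.05)  ! close pair   -> T
  write(*,*) near_duplicate(30.0, 120.0, 0.0, 45.0,   150.0,   3.0 )  ! distant pair -> F
contains
  logical function near_duplicate(lat1, lon1, t1, lat2, lon2, t2)
    real, intent(in) :: lat1, lon1, t1, lat2, lon2, t2
    real, parameter :: dlat_tol = 0.01, dlon_tol = 0.01, dt_tol = 0.1
    ! abs() matters: with plain differences, a negative difference of any
    ! size would "pass" the threshold test regardless of separation.
    near_duplicate = abs(lat1 - lat2) < dlat_tol .and. &
                     abs(lon1 - lon2) < dlon_tol .and. &
                     abs(t1   - t2  ) < dt_tol
  end function near_duplicate
end program dup_check_sketch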

@jswhit
Contributor

jswhit commented Jun 21, 2024

I've found that the solution for this case is sensitive to the number of MPI tasks used. On Hercules, using 8 nodes and 10 MPI tasks per node, the error occurs. Changing the layout to 5 MPI tasks per node allows the minimization to converge (although I then get a segfault when trying to write the analyses, presumably running out of memory).

@jderber-NOAA
Contributor Author

Sounds like it might be a threading issue. I am back to the drawing board, trying to print out a bunch of stuff to see what is happening.

@jswhit
Contributor

jswhit commented Jun 24, 2024

I've got the 2020 stream running again by allocating 32 80-core Hercules nodes (with 4 MPI tasks per node) to the GSI. Reducing the node count to 20 or below results in the convergence error.

@jderber-NOAA
Contributor Author

Jeff,

I think you are onto the issue. The original script you gave me used 16 nodes and 40 tasks/node with 8 threads.

I think the number of tasks/node * number of threads should be less than or equal to the total number of processors on a node. I don't think the nodes have 320 processors (I think it is more like 40 per node). With the binding and the oversubscription of the processors, I think this is causing the issues.

I have a test in the queue using fewer tasks per node (5 tasks/node * 8 threads = 40 processors on a node), but it doesn't seem to be running. Will let you know my results.

John

@jderber-NOAA
Contributor Author

Still not working right for me. Will continue to look for the issue with grid2sub and sub2grid for the u,v and sf,vp transforms.

@jderber-NOAA
Contributor Author

Looks like the s2guv%rdispls_s array is being corrupted somewhere. Have to find where the corruption occurs.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Jul 3, 2024

I think I have solved the problem! A test is waiting to run. It looks like one of the radiance covariance files (I think AIRS sea - correction: cris-fsr_nppsea) is inconsistent, with more active channels (coun=100) than nch_chan (92), around line 463 of correlated_obsmod.F90. Because of this, the indxRf array (dimensioned nch_chan) goes out of bounds and messes up some of the all-to-all communication arrays. Everything goes downhill from there.

The best solution is to remove the inconsistency between the definition of the nch_active input variable and the number of active channels (iuse_rad > 0). We should also put a check in the correlated_obsmod routine for this case and print out a warning message (and stop?).
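Something along these lines is what I have in mind for the check (a hedged standalone sketch; the variable names are borrowed from the discussion above and the real correlated_obsmod.F90 interface may differ):

! Sketch of the proposed sanity check: if more channels are flagged
! active (iuse_rad > 0) than the covariance file declares, the index
! array dimensioned by the declared count would go out of bounds,
! so warn and stop rather than continue.
program channel_count_check
  implicit none
  integer, parameter :: nchanl = 100        ! pretend 100 channels are flagged active
  integer :: iuse_rad(nchanl)
  iuse_rad = 1
  call check_active_channels(nchanl, iuse_rad, 92)   ! 92 = count declared by the R covariance file
contains
  subroutine check_active_channels(nch, iuse_rad, nch_active)
    integer, intent(in) :: nch, nch_active
    integer, intent(in) :: iuse_rad(nch)
    integer :: coun
    coun = count(iuse_rad > 0)
    if (coun > nch_active) then
       write(*,*) 'CORRELATED_OBSMOD WARNING: active channels =', coun, &
                  ' exceeds nch_active =', nch_active
       error stop 'inconsistent channel counts for correlated obs errors'
    end if
  end subroutine check_active_channels
end program channel_count_check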

It is late and I will be busy most of tomorrow. So later tomorrow I will give more details.

@jderber-NOAA
Contributor Author

That should be cris-fsr_npp sea, not AIRS sea, above. My run failed. I suspect my quick fix for getting around the issue is the cause. Will do more later.
