NCO Bugzilla tickets to be addressed in GFS v17 DA #356

Open
1 task
RussTreadon-NOAA opened this issue Apr 6, 2022 · 12 comments

@RussTreadon-NOAA
Contributor

NCO opened numerous GFS v16-related bugzilla tickets that must be addressed in GFS v17 or beyond. GSI issue #137 documents the GFS v16 DA bugzillas. New GFS DA bugzillas have been opened since #137:

  • 1301: gfs - write to files in working directory instead of links pointing to COMOUT

NCO requests that these bugzillas and the remaining GFS v16 DA bugzillas be addressed in GFS v17 DA.

Note: The above list will likely grow as GFS v17 progresses, GFS v16 issues are discovered, and NCO provides feedback.

@RussTreadon-NOAA
Contributor Author

FYI: global-workflow issue #712 has been opened to track bugzilla 1301 from the g-w side.

@StevenEarle-NCO

StevenEarle-NCO commented Apr 8, 2022

I ran a proof-of-concept test, similar to what Russ did.
An initial test that simply changed ln to cp nearly doubled the runtime: about 53 minutes, where normal is 29 minutes. It took about 24 minutes to get to the gsi executable.
I ran another test where I wrote all the copy commands to a file, then ran them as an mpmd process (all the copy commands run at the same time on different cores). This echo+cp step took only 30 seconds and the runtime dropped to 28 minutes.
The analysis already allocates over 7000 cores, so I recommend making use of them whenever possible to run the cp commands in parallel.
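
The echo+cp approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used in the test: `xargs -P` stands in for the MPMD launcher available on the target machine, and the directory and file names (`src_dir`, `data_dir`, `cp_cmdfile`, `input_*.dat`) are hypothetical.

```shell
#!/bin/sh
# Sketch: collect copy commands in a file, then run them concurrently.
# ASSUMPTION: xargs -P is a stand-in for the site's MPMD launcher.
# ASSUMPTION: all directory and file names below are hypothetical.
set -eu

src_dir=$(mktemp -d)     # stands in for the location of the input files
data_dir=$(mktemp -d)    # stands in for the job's working directory
cmdfile="${data_dir}/cp_cmdfile"

# Create a few small dummy inputs so the sketch can run anywhere.
for i in 1 2 3 4; do
  echo "obs ${i}" > "${src_dir}/input_${i}.dat"
done

# Step 1: echo one cp command per line instead of running cp serially.
: > "${cmdfile}"
for f in "${src_dir}"/*.dat; do
  echo "cp ${f} ${data_dir}/$(basename "${f}")" >> "${cmdfile}"
done

# Step 2: execute the command file with up to 4 commands in flight.
# xargs exits non-zero if any copy fails, so set -e stops the job.
xargs -P 4 -I{} sh -c '{}' < "${cmdfile}"

ls "${data_dir}"
```

In a production job the degree of parallelism would come from the cores already allocated to the analysis rather than a fixed value of 4.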

@RussTreadon-NOAA
Contributor Author

Smart use of mpmd, @StevenEarle-NCO! Scripts can be examined and refactored, where possible, to wrap multiple in/out copies within mpmd. We need to ensure this works on WCOSS2 and RDHPCS machines.

@RussTreadon-NOAA
Contributor Author

FYI: General discussion of replacing ln with cp/mv is taking place in g-w issue #712.

@RussTreadon-NOAA
Contributor Author

@dtkleist , @CoryMartin-NOAA , @CatherineThomas-NOAA - for your awareness.

GFS v17 can NOT use links in working directories. In case you do not have access to bugzilla, here is the content of bugzilla 1301:

gfs - write to files in working directory instead of links pointing to COMOUT

Wei Wei 2022-04-06 13:27:21 UTC
In the current version of GFS, some jobs write to COMOUT directly through links in working directories. 

This is risky because downstream jobs can potentially get partial files and fail, as happened in wave_post job. 

Please write to working directories, then cpfs (or cp/mv, depending on the file sizes) to COMOUT once the files are completed.

Wei Wei 2022-04-07 13:16:17 UTC
Updates from Steven:

"
I asked Wei to submit this ticket because we need to get back to all of production in a self contained DATA per process/model. We can't have direct writes to COMOUT.
We allowed this to happen on WCOSS because we didn't have the IO/storage bandwidth to support self contained DATA.  It's time to go back to where we were several years ago to improve:
-- Portability
-- Contained IO, making management of the system/storage possible
-- Pristine place to save after failures for debug/troubleshooting later

We've designed WCOSS2 to have an all flash/ssd filesystem (f1/f2), which has superior performance... 10x the aggregate bandwidth when compared to the fastest filesystem on WCOSS1. GFS/GDAS cannot currently take advantage of that because COMOUT is on h1, which is designed for long term storage. 
As we design and procure future systems, we can ensure adequate bandwidth to support local, self contained working spaces. We can't do that when models use external links.

Please give it a try on WCOSS2 and let us know how much delay there is and/or how many more resources you need to support this requirement.

"


This bugzilla ticket is for the next major upgrade, GFSv17.

We need to address this for both the GSI- and JEDI- based pieces of GFS v17 DA.
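
The pattern requested in the ticket (write in the working directory, then copy to COMOUT once the file is complete) could look roughly like the sketch below. The `DATA` and `COMOUT` directories here are temporary stand-ins and `analysis.out` is a hypothetical file name. The copy-to-temporary-name-then-rename step is an assumption added for illustration; the ticket itself only names cpfs, cp, or mv.

```shell
#!/bin/sh
# Sketch: produce output in the working directory, then publish to COMOUT.
# ASSUMPTION: DATA and COMOUT are temporary stand-ins for the real paths.
# ASSUMPTION: the output file name is hypothetical.
set -eu

DATA=$(mktemp -d)
COMOUT=$(mktemp -d)
outfile="analysis.out"

# 1. The job writes its output locally; there is no link into COMOUT.
printf 'record 1\n' >  "${DATA}/${outfile}"
printf 'record 2\n' >> "${DATA}/${outfile}"

# 2. Only after the file is complete, copy it under a temporary name
#    and rename it. A rename within one filesystem is atomic, so a
#    downstream job sees either no file or the whole file.
cp "${DATA}/${outfile}" "${COMOUT}/${outfile}.tmp.$$"
mv "${COMOUT}/${outfile}.tmp.$$" "${COMOUT}/${outfile}"

ls -l "${COMOUT}"
```

The original copy stays in the working directory, which also preserves it for debugging if a later step fails.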

@RussTreadon-NOAA
Contributor Author

@CatherineThomas-NOAA , the GSI Handling Review team is going through GSI issues to see which, if any, we can close. Since this issue mentions GFS v17, I assume that we need to keep it open. Do you agree?

@CatherineThomas-NOAA
Collaborator

@RussTreadon-NOAA
I agree, we need to keep this issue open. Thanks for checking.

Tagging @JessicaMeixner-NOAA for awareness.

@RussTreadon-NOAA
Contributor Author

Thank you @CatherineThomas-NOAA for the confirmation. We will keep this issue open.

@RussTreadon-NOAA
Contributor Author

@CatherineThomas-NOAA : do we have anyone (EIB or DA) working on this issue?

@CatherineThomas-NOAA
Collaborator

@RussTreadon-NOAA:
This issue was mentioned by @aerorahul last week, though I'm not sure if any work has started yet. @aerorahul, does the workflow team need anything from DA on this?

@aerorahul
Contributor

The cp/ln issue is in many scripts of the workflow, and we are not equipped to resolve all of them in one go. Some of the biggest issues are in the forecast and analysis jobs where the volume of data and time are crucial.
We are tackling them as we go.
Any help is appreciated.

@RussTreadon-NOAA
Contributor Author

Thank you @aerorahul for the update. I'll add you as an assignee but feel free to reassign to other EIB staff.
