
Parallel netCDF I/O failures on Hercules with I_MPI_EXTRA_FILESYSTEM=1 #694

Open
DavidHuber-NOAA opened this issue Feb 9, 2024 · 19 comments

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Feb 9, 2024

Hercules is unable to handle parallel netCDF I/O when the GSI is compiled with spack-stack v1.6.0. The only obvious difference between v1.6.0 and v1.5.1 is netcdf-fortran, which was upgraded from v4.6.0 to v4.6.1. When attempting parallel reads/writes, netCDF/HDF5 errors are encountered. The cause of the failure appears to be the I_MPI_EXTRA_FILESYSTEM=1 flag, which enables native filesystem support for parallel I/O. Turning on netCDF debugging options reveals the following HDF5 traceback:

```
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 14:
  #000: /work/noaa/global/dhuber/LIBS/hdf5/src/H5D.c line 1069 in H5Dread(): can't synchronously read data
    major: Dataset
    minor: Read failed
  #001: /work/noaa/global/dhuber/LIBS/hdf5/src/H5D.c line 1013 in H5D__read_api_common(): can't read data
    major: Dataset
    minor: Read failed
  #002: /work/noaa/global/dhuber/LIBS/hdf5/src/H5VLcallback.c line 2092 in H5VL_dataset_read_direct(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #003: /work/noaa/global/dhuber/LIBS/hdf5/src/H5VLcallback.c line 2048 in H5VL__dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #004: /work/noaa/global/dhuber/LIBS/hdf5/src/H5VLnative_dataset.c line 361 in H5VL__native_dataset_read(): can't read data
    major: Dataset
    minor: Read failed
  #005: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dio.c line 370 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #006: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dchunk.c line 2889 in H5D__chunk_read(): chunked read failed
    major: Dataset
    minor: Read failed
  #007: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dselect.c line 466 in H5D__select_read(): read error
    major: Dataspace
    minor: Read failed
  #008: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dselect.c line 223 in H5D__select_io(): read error
    major: Dataspace
    minor: Read failed
  #009: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dcontig.c line 1225 in H5D__contig_readvv(): can't perform vectorized read
    major: Dataset
    minor: Can't operate on object
  #010: /work/noaa/global/dhuber/LIBS/hdf5/src/H5VM.c line 1400 in H5VM_opvv(): can't perform operation
    major: Internal error (too specific to document in detail)
    minor: Can't operate on object
  #011: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Dcontig.c line 1154 in H5D__contig_readvv_cb(): block write failed
    major: Dataset
    minor: Write failed
  #012: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Fio.c line 104 in H5F_shared_block_read(): read through page buffer failed
    major: Low-level I/O
    minor: Read failed
  #013: /work/noaa/global/dhuber/LIBS/hdf5/src/H5PB.c line 717 in H5PB_read(): read through metadata accumulator failed
    major: Page Buffering
    minor: Read failed
  #014: /work/noaa/global/dhuber/LIBS/hdf5/src/H5Faccum.c line 252 in H5F__accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #015: /work/noaa/global/dhuber/LIBS/hdf5/src/H5FDint.c line 255 in H5FD_read(): driver read request failed
    major: Virtual File Layer
    minor: Read failed
  #016: /work/noaa/global/dhuber/LIBS/hdf5/src/H5FDmpio.c line 1432 in H5FD__mpio_read(): MPI_File_read_at failed: MPI error string is 'Other I/O error , error stack:
ADIOI_LUSTRE_IOCONTIG(228): Other I/O error Input/output error'
    major: Internal error (too specific to document in detail)
    minor: Some MPI function failed
 ncdf error NetCDF: HDF error
```

This may be a Lustre issue on that system, but if that's the case, it is perplexing that it only appeared after the netcdf-fortran upgrade.

A large number of HDF5 MPI ctests fail (for both v1.14.3 and v1.14.0) on both Hercules and Orion, so it's not clear whether this could be a lower-level library issue that only Hercules is sensitive to. On closer examination, though, these 'failures' are mostly caused by warning messages about certain I_MPI* flags being ignored.
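
For reference, here is a minimal sketch of the kind of parallel open/read path that triggers the traceback above. This is not the GSI code; the file name, variable name, and decomposition are placeholders.

```fortran
program parallel_read_sketch
  ! Minimal sketch of a parallel netCDF-Fortran read. Everything here is
  ! illustrative: 'fv3_dynvars.nc' and 'ugrd' are placeholder names, and the
  ! decomposition assumes the variable is at least 10 x 10 x nprocs.
  use mpi
  use netcdf
  implicit none
  integer :: ierr, rank, nprocs, ncid, varid
  integer :: start(3), count(3)
  real, allocatable :: buf(:,:,:)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! All ranks open the same file with an MPI communicator. This is the path
  ! that goes through MPI-IO (and hence Intel MPI's native Lustre driver when
  ! I_MPI_EXTRA_FILESYSTEM=1 is set in the job environment).
  call check( nf90_open('fv3_dynvars.nc', ior(nf90_nowrite, nf90_mpiio), ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_inq_varid(ncid, 'ugrd', varid) )

  ! Each rank reads a different slab of the same variable.
  start = (/ 1, 1, rank + 1 /)
  count = (/ 10, 10, 1 /)
  allocate(buf(count(1), count(2), count(3)))
  call check( nf90_get_var(ncid, varid, buf, start=start, count=count) )

  call check( nf90_close(ncid) )
  call mpi_finalize(ierr)

contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      write(*,*) 'ncdf error ', trim(nf90_strerror(status))
      call mpi_abort(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check
end program parallel_read_sketch
```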

@TingLei-NOAA
Contributor

@DavidHuber-NOAA Thanks a lot for all your efforts on this! Is there a branch now that the system experts and other netCDF/HDF experts could use to reproduce and investigate this issue?

@DavidHuber-NOAA
Collaborator Author

@TingLei-NOAA I will create one, thanks!

@DavidHuber-NOAA
Collaborator Author

@TingLei-NOAA
Contributor

@DavidHuber-NOAA Thanks a lot!

@TingLei-NOAA
Contributor

An update on debugging using Dave's hercules/netcdff_461 branch on Hercules.
My current focus is to find any possible issues in the fv3reg GSI I/O code.
So far the changes include the fix Ed proposed, changing "check( nf90_open(filenamein,nf90_write,gfile_loc,comm=mpi_comm_read,info=MPI_INFO_NULL) )" to "check( nf90_open(filenamein,ior(nf90_write,nf90_mpiio),gfile_loc,comm=mpi_comm_read,info=MPI_INFO_NULL) )", and some other changes. These have not resolved the issue.
A new finding: when larger MPI process counts such as 20 or 130 are used, the job succeeds, which might indicate/confirm that the "hdf error" comes from more intense parallel I/O activity when fewer MPI processes are used.
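
For readability, the open-mode change quoted above is, in essence:

```fortran
! Before: parallel access implied only by the comm/info arguments
call check( nf90_open(filenamein, nf90_write, gfile_loc, &
                      comm=mpi_comm_read, info=MPI_INFO_NULL) )

! After: MPI-IO access requested explicitly by OR-ing in nf90_mpiio
call check( nf90_open(filenamein, ior(nf90_write, nf90_mpiio), gfile_loc, &
                      comm=mpi_comm_read, info=MPI_INFO_NULL) )
```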

@DavidHuber-NOAA changed the title from "Parallel netCDF I/O failures on Hercules with netcdf-fortran v4.6.1" to "Parallel netCDF I/O failures on Hercules with I_MPI_EXTRA_FILESYSTEM=1" on Feb 13, 2024
@DavidHuber-NOAA
Collaborator Author

I updated the description and title of this issue, as the apparent cause is now not the upgrade of netCDF-Fortran to v4.6.1 but rather the introduction of the I_MPI_EXTRA_FILESYSTEM=1 flag.

@DavidHuber-NOAA
Collaborator Author

DavidHuber-NOAA commented Feb 13, 2024

@TingLei-NOAA The HDF5 failed tests were mostly false positives. They were largely the result of warning messages being printed into the log files, which the HDF5 ctests then compared against expected logs. The warning messages were all about unused I_MPI* flags. There were a couple of out-of-memory failures as well, but I don't think those had anything to do with the I_MPI_EXTRA_FILESYSTEM flag.

Second, no, this is not required on the other systems. I_MPI_EXTRA_FILESYSTEM is a new flag implemented by Intel that does not exist for versions 18 through 2021.5.x (Hercules is running 2021.9.0). Instead, native filesystem support is automatic and cannot be disabled. Interestingly, this flag used to exist for older versions of Intel (version 15 and earlier).

@TingLei-NOAA
Contributor

@DavidHuber-NOAA Thanks a lot! Will you report your findings in the Hercules help ticket? I will follow up with some code details (the issue always occurred in my 4-MPI-process cases) and see if the system administrators have any clues.

@DavidHuber-NOAA
Collaborator Author

Yes, I will do that.

@edwardhartnett

Firstly, great work @DavidHuber-NOAA, this was a lot to figure out.

If there is to be a refactor of the netCDF code, may I suggest that you start with some unit testing, which can then be used to verify correct behavior on new platforms? That is, start by writing unit tests which, when run on any platform, will indicate whether the parallel I/O code is working. This will allow debugging of I/O problems without involving the rest of the code.

I'm happy to help if this route is taken.
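
For illustration, a standalone test of the sort suggested here might look like the sketch below (hypothetical file and variable names, not part of any existing suite): each rank writes its own slab of a 2D variable, then reads it back and verifies the values, so any parallel I/O failure shows up without involving the rest of the GSI.

```fortran
program test_par_io
  ! Hypothetical unit-test sketch: parallel write followed by parallel read-back.
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx = 8
  integer :: ierr, rank, nprocs, ncid, dimids(2), varid
  integer :: start(2), count(2)
  real :: wbuf(nx,1), rbuf(nx,1)

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Create a netCDF-4 file for parallel access and define one 2D variable.
  call check( nf90_create('test_par_io.nc', ior(nf90_netcdf4, nf90_clobber), ncid, &
                          comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_def_dim(ncid, 'x', nx, dimids(1)) )
  call check( nf90_def_dim(ncid, 'rank', nprocs, dimids(2)) )
  call check( nf90_def_var(ncid, 'data', nf90_float, dimids, varid) )
  call check( nf90_enddef(ncid) )
  call check( nf90_var_par_access(ncid, varid, nf90_collective) )

  ! Every rank writes its own row of the variable.
  wbuf = real(rank)
  start = (/ 1, rank + 1 /)
  count = (/ nx, 1 /)
  call check( nf90_put_var(ncid, varid, wbuf, start=start, count=count) )
  call check( nf90_close(ncid) )

  ! Re-open in parallel and read the same slab back.
  call check( nf90_open('test_par_io.nc', nf90_nowrite, ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_inq_varid(ncid, 'data', varid) )
  call check( nf90_var_par_access(ncid, varid, nf90_collective) )
  call check( nf90_get_var(ncid, varid, rbuf, start=start, count=count) )
  call check( nf90_close(ncid) )

  if (any(rbuf /= real(rank))) then
    write(*,*) 'FAIL on rank ', rank
    call mpi_abort(MPI_COMM_WORLD, 1, ierr)
  else if (rank == 0) then
    write(*,*) 'PASS'
  end if
  call mpi_finalize(ierr)

contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      write(*,*) trim(nf90_strerror(status))
      call mpi_abort(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check
end program test_par_io
```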

Also, if a refactor is considered, you may want to consider switching to PIO. It offers a lot of great features for parallel I/O. Using netCDF parallel I/O directly is much more work than letting PIO do the heavy lifting. Let me know if you would like a presentation on PIO and how to use it.

@TingLei-NOAA
Contributor

@edwardhartnett Do you have any comments/suggestions on my question in the Hercules ticket, following @DavidHuber-NOAA's update on his findings?
I attached my question below:

In my 4-MPI-process run, when "export I_MPI_EXTRA_FILESYSTEM=1" is set, it always fails after Line XXX and on Line YYY at a certain iteration of the loop:

```fortran
do ...
  ...
  call check( nf90_get_var(gfile_loc,ugrd_VarId,work_bu,start=u_startloc,count=u_countloc) ) !XXX
  call check( nf90_get_var(gfile_loc,vgrd_VarId,work_bv,start=v_startloc,count=v_countloc) ) !YYY
  ...
end do
```

From Dave's findings, it seems the MPI library does some optimization for these two lines and causes the HDF error. Thanks.

@TingLei-NOAA
Contributor

An update:
I now have a version of the code that moves the u/v I/O outside of the do loop, and it seems to work with I_MPI_EXTRA_FILESYSTEM=1; it has succeeded in all 4 runs so far.
I will prepare a clean and verified branch incorporating all the recent changes, including the dimension change for the start and count parameters that @edwardhartnett proposed.

@TingLei-NOAA
Contributor

An update: it is now believed that, with PR #698 and appropriately tuned parameters in the job script (to give enough memory to the low-level parallel netCDF I/O with MPI optimization), the issue can be avoided.
More details:

It seems that, for the current code, memory could still play a role in causing this issue or similar issues.
First, there is a relevant update on the netCDF output issue on Hera: see the latest update in #697.
Second, on Hercules, I found that for hafs_3denvar_hybens_hiproc_updat, with ppn=20 on 2 nodes the original HDF error popped up again, while with ppn=10 on 4 nodes the GSI ran smoothly.
So it seems the refactoring of the code, including the use of nf90_collective (see the sketch below), helps avoid problems in the low-level parallel I/O processes under the MPI I/O optimization, while memory usage also plays an important role and needs to be taken care of in addition to the code refactoring.
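
For reference, the nf90_collective setting mentioned above looks roughly like the fragment below (reusing identifier names from the earlier snippets; this is a sketch, not the exact PR #698 code):

```fortran
! After the parallel open, mark the variable for collective access so that all
! ranks enter the underlying MPI-IO read together (sketch only).
call check( nf90_open(filenamein, ior(nf90_write, nf90_mpiio), gfile_loc, &
                      comm=mpi_comm_read, info=MPI_INFO_NULL) )
call check( nf90_inq_varid(gfile_loc, 'ugrd', ugrd_VarId) )
call check( nf90_var_par_access(gfile_loc, ugrd_VarId, nf90_collective) )
call check( nf90_get_var(gfile_loc, ugrd_VarId, work_bu, &
                         start=u_startloc, count=u_countloc) )
```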

@TingLei-NOAA
Contributor

TingLei-NOAA commented Mar 7, 2024

A summary of what we have learned on this issue. This is an investigation by "us", including @DavidHuber-NOAA and @edwardhartnett, with help from Peter Johnson through the Hercules help desk and from @RussTreadon-NOAA. It's important to note that the insights presented below represent my current perspective on the matter.
Feedback from collaborators will refine these findings further, and I hope a finalized summary reflecting our collective consensus can be shared subsequently.
Summary:

I_MPI_EXTRA_FILESYSTEM enables/disables "native support for parallel file systems".
1) Issue overview: The problem, identified while native support for parallel I/O was enabled, is believed to stem from issues within the low-level NetCDF/HDF parallel I/O operations that interact with this "native support" feature. The recently refactored GSI fv3reg code (PR #698) has significantly reduced the frequency of these failures, though it has not entirely eliminated the possibility of their occurrence.
2) Alternative solutions: While alternative approaches, such as using a different MPI library, were considered as potential solutions, it was decided to revert to having this feature disabled for several reasons:
The issue might be specific to the Hercules system, suggesting a platform-dependent problem.
It is probable that future software updates on Hercules will inherently resolve this issue.
Should this issue manifest on other systems, indicating a more generic problem in the interaction between parallel NetCDF operations and MPI's native support for parallel I/O, a recommended and more comprehensive solution would be to adopt the Parallel I/O (PIO) library, as suggested by @edwardhartnett.

@RussTreadon-NOAA
Contributor

Thank you @TingLei-NOAA for the summary.

One clarification:

I am not an investigator on this issue. My silence should not be interpreted as agreement or disagreement. My silence reflects the fact that I am not actively working on this issue.

Two comments:

  1. Hera fully migrates to Rocky-8 on 4/2/2024. With Rocky-8 come new modules. The behavior observed on Hercules may appear on Hera. Hence interest in pursuing this on Hercules.
  2. Interestingly the global_4denvar and global_enkf ctests do not fail on Hercules. These configurations read & write netcdf files in parallel. Why do the global ctests pass while the regional tests fail?

@TingLei-NOAA
Contributor

@RussTreadon-NOAA Thanks for your clarification. I will update the summary accordingly.
For your point 1, I agree, and as I described in that version of the summary, I plan to wait and see whether that actually happens. If it does, I'd prefer to use PIO if it is still difficult to sort out what happens at that level of the parallel I/O.
For your point 2, a possible reason is that in global parallel netCDF I/O, each I/O call always operates on the same variable (e.g., a 3D field), with different MPI processes accessing different parts of that variable. In regional parallel I/O, in addition to accessing different parts of a variable, the GSI fv3reg code also accesses different variables across processes. The latter makes use of more "capabilities" of the system, as sketched below.
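
Schematically (placeholder logic, not the actual GSI code), the difference between the two access patterns is something like:

```fortran
! Global pattern: every rank reads a different slab of the SAME variable.
call check( nf90_get_var(gfile_loc, t_VarId, work_t, start=my_start, count=my_count) )

! Regional (fv3reg) pattern: in addition to reading different slabs, different
! ranks may be reading DIFFERENT variables at the same time, which exercises
! more of the parallel I/O machinery.
if (mod(mype, 2) == 0) then
  call check( nf90_get_var(gfile_loc, ugrd_VarId, work_bu, start=u_startloc, count=u_countloc) )
else
  call check( nf90_get_var(gfile_loc, vgrd_VarId, work_bv, start=v_startloc, count=v_countloc) )
end if
```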

@DavidHuber-NOAA self-assigned this May 6, 2024
@RussTreadon-NOAA
Contributor

@TingLei-NOAA and @DavidHuber-NOAA : shall we keep this issue open or close it?

@DavidHuber-NOAA
Collaborator Author

We can leave this open. I am working on building the GSI on Hercules with Intel and OpenMPI to provide @TingLei-NOAA with an alternative MPI provider to see if the issue lies in the GSI code or Intel MPI. I successfully compiled the GSI with this combination today, but need to make a couple tweaks before handing it over to Ting.

@RussTreadon-NOAA
Contributor

Thank you @DavidHuber-NOAA for the update.
