Commit 210cd3ed authored by Marty Kandes

Test newer versions of OpenMPI 4.X.X series

A bug in older releases of the OpenMPI 4.0.X series may be behind the
failure of a single-node HPL calculation on Expanse with the
Singularity.hpl-2.3-ubuntu-18.04-openmpi-4.0.4-openblas-0.3.14
container. The single-node job fails at startup with the set of PMIX
errors shown in [1]. The issue appears to have been reported previously
[2] [3] [4]. Unfortunately, the suggested temporary workarounds of
setting PMIX_MCA_gds=^ds21 or PMIX_MCA_gds=hash do not work here.
However, it seems the underlying bug should be fixed in the latest
releases of the OpenMPI 4.X.X series. Hence the new Ubuntu 18.04 +
OpenMPI 4.0.5 and Ubuntu 18.04 + OpenMPI 4.1.0 definition files.
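
For reference, the workaround that was tried can be sketched roughly as
follows. This is only an illustration of how the two PMIX_MCA_gds settings
are usually applied before launching the job, not the exact commands used
on Expanse. The SINGULARITYENV_ forwarding step is an assumption about how
the variable reaches the container, and the launch line is a hypothetical
placeholder (image name and binary are not from the original).

```shell
#!/bin/sh
# Steer PMIx away from the ds21 shared-memory data store.
# Pick ONE of the two settings suggested in the upstream issues.
export PMIX_MCA_gds=hash          # select the hash component only
# export PMIX_MCA_gds='^ds21'     # or: exclude ds21, keep everything else

# Assumption: make the setting explicit inside the container as well.
# Singularity injects variables carrying the SINGULARITYENV_ prefix.
export SINGULARITYENV_PMIX_MCA_gds="${PMIX_MCA_gds}"

# Hypothetical launch line; image name and executable are placeholders.
# mpirun -n 1 singularity exec hpl.sif xhpl

# Show what the job would see.
echo "PMIX_MCA_gds=${PMIX_MCA_gds}"
```

With this container the PMIX errors in [1] persisted with either setting,
which is what motivates building against newer OpenMPI releases instead.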

[1]

[exp-8-32:06710] PMIX ERROR: NOT-FOUND in file dstore_base.c at line 2866
[exp-8-32:06710] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 3408
[exp-8-32:06742] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
[exp-8-32:06742] OPAL ERROR: Error in file pmix3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[exp-8-32:06742] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[43048,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[exp-8-32:06710] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
[exp-8-32:06710] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99

[2]

https://github.com/open-mpi/ompi/issues/6761

[3]

https://github.com/open-mpi/ompi/issues/6981

[4]

https://github.com/open-mpi/ompi/issues/7516
parent 6de984ff