
ubuntu

    Marty Kandes authored
    May have observed the effects of a bug in older releases of the OpenMPI
    4.0.X series when attempting to run a single-node HPL calculation on
    Expanse with the Singularity.hpl-2.3-ubuntu-18.04-openmpi-4.0.4-openblas-0.3.14
    container. The single-node job fails at startup with the set of PMIX
    errors shown in [1]. This issue appears to have been reported previously
    [2] [3] [4]. Unfortunately, the suggested temporary workarounds of setting
    PMIX_MCA_gds=^ds21 or PMIX_MCA_gds=hash do not work here. However, it
    seems that the bug causing the problem should be fixed in the latest
    releases of the OpenMPI 4.X.X series. Hence the new Ubuntu 18.04 +
    OpenMPI 4.0.5 and Ubuntu 18.04 + OpenMPI 4.1.0 definition files.
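
For reference, a minimal sketch of how the two workarounds named above could be tried. The commit message does not say how the variable was actually set, so the helper name with_pmix_gds, the process count, the image file name, and the xhpl invocation in the usage comment are all hypothetical. The sketch only assumes that PMIX_MCA_gds is read from the environment of the launched command.

```shell
# Hypothetical helper: run a command with PMIX_MCA_gds set to the given value
# for that command only, leaving the calling shell's environment untouched.
with_pmix_gds() {
    gds_value="$1"   # e.g. "hash" or "^ds21"
    shift
    env PMIX_MCA_gds="$gds_value" "$@"
}

# Example usage (assumed launch line, not taken from the commit):
#   with_pmix_gds hash    mpirun -n 2 singularity exec hpl.sif xhpl
#   with_pmix_gds '^ds21' mpirun -n 2 singularity exec hpl.sif xhpl
```

Quoting '^ds21' keeps the caret literal in shells that treat it specially. As noted above, neither setting resolved the failure in this case, which is what motivated the move to newer OpenMPI releases.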
    
    [1]
    
    [exp-8-32:06710] PMIX ERROR: NOT-FOUND in file dstore_base.c at line 2866
    [exp-8-32:06710] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 3408
    [exp-8-32:06742] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 231
    [exp-8-32:06742] OPAL ERROR: Error in file pmix3x_client.c at line 112
    *** An error occurred in MPI_Init
    *** on a NULL communicator
    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    ***    and potentially your MPI job)
    [exp-8-32:06742] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[43048,1],0]
      Exit code:    1
    --------------------------------------------------------------------------
    [exp-8-32:06710] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
    [exp-8-32:06710] PMIX ERROR: ERROR in file gds_ds21_lock_pthread.c at line 99
    
    [2]
    
    https://github.com/open-mpi/ompi/issues/6761
    
    [3]
    
    https://github.com/open-mpi/ompi/issues/6981
    
    [4]
    
    https://github.com/open-mpi/ompi/issues/7516
    210cd3ed