View Issue Details

ID: 0017417
Project: CentOS-8
Category: openmpi
View Status: public
Last Update: 2020-06-19 12:36
Reporter: nqhaas
Priority: normal
Severity: crash
Reproducibility: always
Status: new
Resolution: open
Platform: x86-64
OS: centos
OS Version: 8
Product Version: 8.1.1911
Target Version:
Fixed in Version:
Summary: 0017417: openmpi programs seg fault
Description: Trivial MPI 'hello world' programs crash with segmentation faults when executed with the version of OpenMPI that ships with CentOS 8.1. When recompiled and rerun with MPICH, the program executes normally. See the attached terminal log for details.
Steps To Reproduce:
1. Install gcc, openmpi, and openmpi-devel
2. module load mpi/openmpi-x86_64
3. Create a simple 'hello world' MPI program in C called 'mpi.c' (a minimal example is sketched below)
4. mpicc mpi.c
5. mpirun ./a.out
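
A minimal 'mpi.c' is shown below for reference; it matches the one in the attached terminal log, and a bare MPI_Init/MPI_Finalize pair is enough to trigger the crash (the backtrace shows the fault inside MPI_Init):

#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  /* The crash occurs while MPI_Init loads Open MPI's MCA components. */
  MPI_Init (NULL,NULL);
  MPI_Finalize();

  return 0;
}
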
Tags: No tags attached.

Activities

nqhaas

2020-05-30 15:26

reporter  

centos8_openmpi.txt (4,626 bytes)
$ rpm -q openmpi
openmpi-4.0.1-3.el8.x86_64
$ cat /etc/redhat-release
CentOS Linux release 8.1.1911 (Core)
$ cat mpi.c
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  MPI_Init (NULL,NULL);
  MPI_Finalize();

  return 0;
}

$ module load mpi/openmpi-x86_64
$ mpicc mpi.c
$ ./a.out
[centoscli8:10319:0:10319] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f5e1084c768)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x18bb0) [0x7f5e101dfbb0]
    1  /lib64/libucs.so.0(+0x18d8a) [0x7f5e101dfd8a]
    2  /lib64/libuct.so.0(+0x1655b) [0x7f5e11b2b55b]
    3  /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x7f5e1f3dbd0a]
    4  /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x7f5e1f3dbe0a]
    5  /lib64/ld-linux-x86-64.so.2(+0x13def) [0x7f5e1f3dfdef]
    6  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f5e1ebf6ab7]
    7  /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x7f5e1f3df65e]
    8  /lib64/libdl.so.2(+0x11ba) [0x7f5e1e3501ba]
    9  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f5e1ebf6ab7]
   10  /lib64/libc.so.6(_dl_catch_error+0x33) [0x7f5e1ebf6b53]
   11  /lib64/libdl.so.2(+0x1939) [0x7f5e1e350939]
   12  /lib64/libdl.so.2(dlopen+0x4a) [0x7f5e1e35025a]
   13  /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7f5e1e5c0f05]
   14  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x206) [0x7f5e1e59eb16]
   15  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7f5e1e59da5a]
   16  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7f5e1e5a93ce]
   17  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7f5e1e5a98b2]
   18  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7f5e1e5a9915]
   19  /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7f5e1f0f2494]
   20  /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0xa9) [0x7f5e1f1226e9]
   21  ./a.out() [0x4006a4]
   22  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f5e1eae1873]
   23  ./a.out() [0x4005ce]
===================
Segmentation fault (core dumped)
$ mpirun ./a.out
[centoscli8:10339:0:10339] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7f71f3dee768)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x18bb0) [0x7f71f3781bb0]
    1  /lib64/libucs.so.0(+0x18d8a) [0x7f71f3781d8a]
    2  /lib64/libuct.so.0(+0x1655b) [0x7f71f912755b]
    3  /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x7f7206980d0a]
    4  /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x7f7206980e0a]
    5  /lib64/ld-linux-x86-64.so.2(+0x13def) [0x7f7206984def]
    6  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f720619bab7]
    7  /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x7f720698465e]
    8  /lib64/libdl.so.2(+0x11ba) [0x7f72058f51ba]
    9  /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7f720619bab7]
   10  /lib64/libc.so.6(_dl_catch_error+0x33) [0x7f720619bb53]
   11  /lib64/libdl.so.2(+0x1939) [0x7f72058f5939]
   12  /lib64/libdl.so.2(dlopen+0x4a) [0x7f72058f525a]
   13  /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7f7205b65f05]
   14  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x206) [0x7f7205b43b16]
   15  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7f7205b42a5a]
   16  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7f7205b4e3ce]
   17  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7f7205b4e8b2]
   18  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7f7205b4e915]
   19  /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7f7206697494]
   20  /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0xa9) [0x7f72066c76e9]
   21  ./a.out() [0x4006a4]
   22  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f7206086873]
   23  ./a.out() [0x4005ce]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node centoscli8 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ module unload mpi
$ module load mpi/mpich-x86_64
$ mpicc mpi.c
$ ./a.out
$ mpirun ./a.out
sterni1971

2020-06-02 10:50

reporter   ~0037021

I have the same problem after updating to CentOS 7.8. In our cluster, not all nodes are affected. The cluster is quite heterogeneous, so I thought the CPU type was relevant, but that does not seem to be the only factor. For example, all AMD EPYC nodes are failing, but some nodes with Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz are OK and some are not.
sterni1971

2020-06-02 16:55

reporter   ~0037026

The problem seems to be related to the type of InfiniBand HCA.

All nodes with these HCAs have the OpenMPI problem (MPICH works, and the 7.7 kernel works):

Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
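
(Side note: the controller lines above look like lspci output, so on a node under test something along these lines should show which HCA is present, assuming the pciutils package is installed:)

$ lspci | grep -i infiniband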
nqhaas

2020-06-15 23:51

reporter   ~0037115

@sterni1971, I'm unable to reproduce this issue on a CentOS 7.8 minimal install (see attached log) the way I can reproduce the CentOS 8.1 issue. I think we are encountering different problems. Have you experimented with different calls to mpirun?

https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
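
For example, forcing the ob1 PML or excluding the openib BTL on the mpirun command line is a common way to take the OpenFabrics path out of the equation (illustrative invocations, not taken from the attached logs):

$ mpirun --mca pml ob1 ./a.out        # force the ob1 point-to-point messaging layer
$ mpirun --mca btl ^openib ./a.out    # exclude the openib byte-transfer layer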

centos7_openmpi.txt (353 bytes)
$ rpm -q openmpi
openmpi-1.10.7-5.el7.x86_64
$ cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
$ cat mpi.c
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  MPI_Init (NULL,NULL);
  MPI_Finalize();

  return 0;
}
$ module load mpi/openmpi-x86_64
$ mpicc mpi.c
$ ./a.out
$ mpirun a.out
$ echo $?
0
sterni1971

2020-06-16 08:17

reporter   ~0037122

@nqhaas I'm of course not sure that it is related, but it looks similar to me. The attached log shows my session. It only fails on nodes with one of the HCAs mentioned above, and, as I found, it works if I explicitly set the pml parameter (see the log file).

log (866 bytes)
[sternber@max-wgse002]~% rpm -q openmpi
openmpi-1.10.7-5.el7.x86_64
[sternber@max-wgse002]~% cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (Core)
[sternber@max-wgse002]~% cat mpi2.c 
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  MPI_Init (NULL,NULL);
  MPI_Finalize();

  return 0;
}
[sternber@max-wgse002]~% module load mpi/openmpi-x86_64
[sternber@max-wgse002]~% mpicc mpi2.c 
[sternber@max-wgse002]~% ./a.out 
zsh: segmentation fault  ./a.out
[sternber@max-wgse002]~% mpirun ./a.out
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 67693 on node max-wgse002 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[sternber@max-wgse002]~% mpirun --mca pml ob1 ./a.out
[sternber@max-wgse002]~% 
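
Side note on the pml workaround above: the same selection can also be made persistent through Open MPI's standard MCA parameter mechanisms instead of on every mpirun invocation (a sketch; the variable name and config path are the stock Open MPI conventions, not taken from this log):

$ export OMPI_MCA_pml=ob1                           # per-shell: environment form of --mca pml ob1
$ echo "pml = ob1" >> ~/.openmpi/mca-params.conf    # per-user default read at startup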

nqhaas

2020-06-19 12:36

reporter   ~0037170

It does look similar... I recall having InfiniBand issues with OpenMPI back in the CentOS 7.5(?) days on one of our HPCs, and we ended up having to disable InfiniBand support using arguments to mpirun and/or Slurm and fall back to Ethernet only (roughly along the lines sketched after this note). That isn't ideal, but our tasks were not comms-heavy and we needed to clear the issue quickly. I don't think we ever got around to revisiting it. Hope you guys are able to clear it and report back on the fix for the community; you might want to create a separate issue for better exposure.

An update on my EL8 OpenMPI issue: it looks like an upgrade to CentOS 8.2 and/or openmpi-4.0.2-2 cleared it (log attached). So, feel free to close this particular issue.

I found the following OpenMPI discussion on the upstream bug tracker enlightening; a user said that the packagers aren't necessarily users of what they package, which makes sense and explains a lot: https://bugzilla.redhat.com/show_bug.cgi?id=1770184#c1
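
For illustration, steering Open MPI away from the InfiniBand transports and onto TCP with mpirun arguments looks roughly like this (a sketch of the general approach, not the exact flags we used back then):

$ mpirun --mca pml ob1 --mca btl tcp,self,vader ./a.out    # restrict Open MPI to TCP plus local shared memory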

centos82_openmpi.txt (366 bytes)
$ rpm -q openmpi
openmpi-4.0.2-2.el8.x86_64
$ cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)
$ cat mpi.c
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  MPI_Init (NULL,NULL);
  MPI_Finalize();

  return 0;
}
$ module load mpi/openmpi-x86_64
$ mpicc mpi.c
$ ./a.out
$ echo $?
0
$ mpirun a.out
$ echo $?
0

Issue History

Date Modified Username Field Change
2020-05-30 15:26 nqhaas New Issue
2020-05-30 15:26 nqhaas File Added: centos8_openmpi.txt
2020-06-02 10:50 sterni1971 Note Added: 0037021
2020-06-02 16:55 sterni1971 Note Added: 0037026
2020-06-15 23:51 nqhaas File Added: centos7_openmpi.txt
2020-06-15 23:51 nqhaas Note Added: 0037115
2020-06-16 08:17 sterni1971 File Added: log
2020-06-16 08:17 sterni1971 Note Added: 0037122
2020-06-19 12:36 nqhaas File Added: centos82_openmpi.txt
2020-06-19 12:36 nqhaas Note Added: 0037170