Hello all.
I am unable to run LAM 7.1.1 or the latest beta, LAM 7.1.2b28, using the
ib RPI module (-ssi rpi ib).
Description of the system: 8 nodes, each with 4 Opteron processors,
running SUSE 9.3, IBGD, and VAPI version 4.1.
Configure is run with:
./configure --with-threads=posix --prefix=/common/opt/lammpi/lam-7.1.2b \
    --with-rsh="ssh -x" --with-rpi-ib=/usr/local/ibgd/driver/infinihost/ \
    --enable-shared --with-modules=ib,sysv,usysv,tcp
Description of the failure:
mpirun -ssi rpi ib C full_path_to_application
This hangs right at the beginning (probably at MPI_INIT).
mpirun -ssi rpi ib N full_path_to_application
This runs without problems.
That means that I can use only one processor per node. If I try to use more
than one, it hangs.
Other modules, such as tcp, work fine across nodes and using all
processors, i.e.
mpirun -ssi rpi tcp C full_path_to_application
I read the documentation (user and installation guides) and I don't know
whether the problem has to do with the memory that needs to be pinned, an
IB port problem, an HCA problem, the memory manager, etc. I am a bit lost.
Then I found a file, LAM_installation_path/etc/lam-ssi-rpi-ib-helpfile,
whose contents I am dumping here:
------------------------------------------------------------
-*-rpi-ib:malloc-fail-*-
# Invoked when malloc fails.
# %1 = data struct for which malloc failed
LAM encountered an error when invoking the library call "malloc".
This happenned when trying to create "%1". Aborting!
-*-rpi-ib:hca-hndl-fail-*-
# Called when hca_get_hndl fails in IB
LAM was not able to get an HCA handle for Infiniband hardware. Aborting!
-*-rpi-ib:this-port-not-free-*-
It seems that the Infiniband port "%1" is busy or not
active. Aborting!
-*-rpi-ib:no-free-port-*-
LAM could not get a free port on Infiniband. All the ports seem to be
in use now. Aborting!
-*-rpi-ib:interval-init-fail-*-
LAM failed while initialization of the data structures required for
management of registered memory intervals. This may be because of
malloc failing due to lack of memory space. Aborting!
-*-rpi-ib:register-mem-fail-*-
An error occurred while registering/pinning OS memory for
Infiniband module. It is possible that you are running on an OS which does
not support memory pining. Aborting!
-*-rpi-ib:qp-create-fail-*-
An error occurred while creating a Infiniband Queue Pair.
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:cq-create-fail-*-
An error occurred while creating an Infiniband Completion Queue.
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:change-q-state-*-
An error occurred while changing the Infiniband queue state to "%2".
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:post-req-fail-*-
An error occurred while posting a request to the Infiniband work queue.
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:poll-fail-*-
An erroneous completion was generated while polling for the Infiniband
completion queue
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:inval-cq-opcode-*-
An invalid completion opcode was obtained while polling the Infiniband
completion queue. Aborting!
-*-rpi-ib:fc-exchange-fail-*-
An error occurred while exchanging the flow control arguments
through the "lam daemon" channel. The communication between the "lam
daemons" failed. Make sure the "lam daemons" are active on all the
required nodes.
-*-rpi-ib:no-pool-buffer-*-
It seems that no pool buffers are left for the pre-posted
receives for the Infiniband work queues. Aborting!
-*-rpi-ib:no-memory-to-register-*-
It seems registration of memory for Infiniband operation failed. We
can not register any more memory. Aborting ..
-*-rpi-ib:invalid-pd-*-
Infiniband did not return a valid protection domain. Aborting!
------------------------------------------------------------
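For anyone reading the dump above: the file appears to follow a simple layout, with a topic marker of the form -*-name-*-, optional comment lines starting with "#", and a message body containing placeholders such as %1 and %2. The following is a minimal Python sketch for looking up one of these messages, assuming only what the dump shows. The function names parse_helpfile and render are hypothetical and are not part of LAM.

```python
import re

# Topic marker lines look like "-*-rpi-ib:malloc-fail-*-" in the dump above.
TOPIC_RE = re.compile(r"^-\*-(.+)-\*-$")


def parse_helpfile(text):
    """Return {topic: message}; the format is inferred from the dump only."""
    topics = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = TOPIC_RE.match(stripped)
        if match:
            # Start collecting body lines for a new topic.
            current = match.group(1)
            topics[current] = []
        elif not stripped or set(stripped) == {"-"}:
            # Skip blank lines and separator lines made only of dashes.
            continue
        elif current is not None and not stripped.startswith("#"):
            # Lines starting with "#" are treated as comments and dropped.
            topics[current].append(stripped)
    return {name: " ".join(lines) for name, lines in topics.items()}


def render(message, *args):
    """Replace %1, %2, ... with the corresponding positional argument."""
    return re.sub(r"%(\d+)", lambda m: str(args[int(m.group(1)) - 1]), message)
```

A call such as render(parse_helpfile(text)["rpi-ib:this-port-not-free"], "1") would then give the port-busy message with the port number filled in. Whether LAM itself reads the file this way is an assumption on my part.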
Any help would be much appreciated.
Best regards,
Joshua.