
LAM/MPI General User's Mailing List Archives


From: Joshua Mora (joshua_at_[hidden])
Date: 2005-11-09 03:16:41


Hello all.

I am unable to run LAM 7.1.1 or the latest beta, LAM 7.1.2b28, using the
ib RPI module (ssi rpi ib).

Description of the system: 8 nodes, each with 4 Opteron processors,
running SUSE 9.3 with IBGD and VAPI version 4.1.

Configure is run with

./configure --with-threads=posix --prefix=/common/opt/lammpi/lam-7.1.2b \
    --with-rsh="ssh -x" --with-rpi-ib=/usr/local/ibgd/driver/infinihost/ \
    --enable-shared --with-modules=ib,sysv,usysv,tcp
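A side note on the --with-rsh value: because it contains a space, it has to be quoted on the command line, otherwise the shell hands -x to configure as a separate argument. A minimal sketch of that word splitting (count_args is a made-up helper for illustration, not part of LAM):

```shell
#!/bin/sh
# Report how many separate arguments a command would receive.
count_args() { echo "$#"; }

count_args --with-rsh=ssh -x      # unquoted: prints 2
count_args --with-rsh="ssh -x"    # quoted: prints 1
```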

 

Description of the failure:

mpirun -ssi rpi ib C full_path_to_application
    hangs right at the beginning (probably in MPI_INIT)

mpirun -ssi rpi ib N full_path_to_application
    runs without problems

 

That means that I can use only one processor per node. If I try to use more
than one it hangs.
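As I understand the mpirun notation, N starts one process on each booted node and C starts one process on each CPU declared in the boot schema. A small sketch of what that means for this cluster, assuming all 8 nodes were booted with 4 CPUs each (the variable and function names are only illustrative):

```shell
#!/bin/sh
# Illustrative arithmetic only; assumes every node is booted with 4 CPUs.
nodes=8
cpus_per_node=4

# "N" starts one process on each node.
count_n() { echo "$nodes"; }

# "C" starts one process on each CPU, so several ranks share a node.
count_c() { echo $((nodes * cpus_per_node)); }

echo "mpirun N -> $(count_n) processes, 1 per node"
echo "mpirun C -> $(count_c) processes, $cpus_per_node per node"
```

So the failing case is the one where several processes on the same node use the ib module at the same time.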

Other modules, such as tcp, work fine across nodes and using all
processors, e.g.

mpirun -ssi rpi tcp C full_path_to_application

 

I have read the documentation (user and installation guides), and I don't
know whether the problem has to do with the memory that needs to be pinned,
the IB port, the HCA, or the memory manager. I am a bit lost.

Then I found the file LAM_installation_path/etc/lam-ssi-rpi-ib-helpfile,
whose contents I am dumping here:

 

------------------------------------------------------------
-*-rpi-ib:malloc-fail-*-
# Invoked when malloc fails.
# %1 = data struct for which malloc failed
LAM encountered an error when invoking the library call "malloc".
This happenned when trying to create "%1". Aborting!
-*-rpi-ib:hca-hndl-fail-*-
# Called when hca_get_hndl fails in IB
LAM was not able to get an HCA handle for Infiniband hardware. Aborting!
-*-rpi-ib:this-port-not-free-*-
It seems that the Infiniband port "%1" is busy or not
active. Aborting!
-*-rpi-ib:no-free-port-*-
LAM could not get a free port on Infiniband. All the ports seem to be
in use now. Aborting!
-*-rpi-ib:interval-init-fail-*-
LAM failed while initialization of the data structures required for
management of registered memory intervals. This may be because of
malloc failing due to lack of memory space. Aborting!
-*-rpi-ib:register-mem-fail-*-
An error occurred while registering/pinning OS memory for
Infiniband module. It is possible that you are running on an OS which does
not support memory pining. Aborting!
-*-rpi-ib:qp-create-fail-*-
An error occurred while creating a Infiniband Queue Pair.

The exact error string returned by Infiniband API is as follows:

"%1"
-*-rpi-ib:cq-create-fail-*-
An error occurred while creating an Infiniband Completion Queue.
The exact error string returned by Infiniband API is as follows:
"%1"
-*-rpi-ib:change-q-state-*-
An error occurred while changing the Infiniband queue state to "%2".

The exact error string returned by Infiniband API is as follows:

"%1"
-*-rpi-ib:post-req-fail-*-
An error occurred while posting a request to the Infiniband work queue.

The exact error string returned by Infiniband API is as follows:

"%1"
-*-rpi-ib:poll-fail-*-
An erroneous completion was generated while polling for the Infiniband
completion queue

The exact error string returned by Infiniband API is as follows:

"%1"
-*-rpi-ib:inval-cq-opcode-*-
An invalid completion opcode was obtained while polling the Infiniband
completion queue. Aborting!
-*-rpi-ib:fc-exchange-fail-*-
An error occurred while exchanging the flow control arguments
through the "lam daemon" channel. The communication between the "lam
daemons" failed. Make sure the "lam daemons" are active on all the
required nodes.
-*-rpi-ib:no-pool-buffer-*-
It seems that no pool buffers are left for the pre-posted
receives for the Infiniband work queues. Aborting!
-*-rpi-ib:no-memory-to-register-*-
It seems registration of memory for Infiniband operation failed. We
can not register any more memory. Aborting ..
-*-rpi-ib:invalid-pd-*-
Infiniband did not return a valid protection domain. Aborting!
------------------------------------------------------------
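The topic names in this file are the lines of the form -*-rpi-ib:NAME-*-. A small sketch for listing just those names, assuming that layout holds for the whole file (list_rpi_ib_topics is a made-up helper name, and the sample input written here is only a short excerpt of the dump):

```shell
#!/bin/sh
# Print the topic names from a LAM help file, one per line.
# Assumes each topic header looks like: -*-rpi-ib:NAME-*-
list_rpi_ib_topics() {
    sed -n 's/^-\*-rpi-ib:\(.*\)-\*-$/\1/p' "$1"
}

# Short excerpt of the help file, used here only as sample input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-*-rpi-ib:hca-hndl-fail-*-
# Called when hca_get_hndl fails in IB
LAM was not able to get an HCA handle for Infiniband hardware. Aborting!
-*-rpi-ib:no-free-port-*-
LAM could not get a free port on Infiniband. All the ports seem to be
in use now. Aborting!
EOF

list_rpi_ib_topics "$sample"
rm -f "$sample"
```

On a real installation the argument would be the helpfile path given earlier. None of these messages is actually printed in my case; the run simply hangs.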

 

Any help would be much appreciated.

Best regards,

Joshua.