LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-03-22 07:17:05


This list is for support of the LAM implementation of MPI, not MPICH.

You'll need to contact the MPICH developers for support.

On Mar 22, 2005, at 12:59 AM, ags_at_[hidden] wrote:

> Hi all,
>
> My application uses MPICH-1.2.5 and runs on 21 IBM blade servers.
>
> The application was working fine initially.
>
> Then the network cards on 20 of these 21 IBM blade servers were teamed
> for network redundancy. (Teaming/bonding is done as follows: two
> interfaces, eth0 and eth1, are created on each server, and a bond0
> interface is configured as the single virtual interface used for
> communication.)
>
> My application has one server acting as the control server, which
> invokes executables on the 20 blade servers. We are using MPICH-1.2.5
> for this purpose.
>
> After this change, when I try to run my application, it waits for the
> response from the executables that run on the blade servers. It waits
> for 8 minutes and then gives a "timeout error".
>
> However, connectivity from the control server to all the blade servers
> is fine, and connectivity from the blade servers to the control server
> also works: I can copy files between machines in both directions (from
> the control server to a blade server and from a blade server to the
> control server) using scp.
>
> Also, the control server can invoke the executables on the blade
> servers using ssh. The problem occurs when the executables on the
> blade servers (on which teaming is done) have to communicate back to
> the control server (they report their rank in the MPI environment as a
> return message to the control server). At this stage, the control
> server waits for the rank messages from the executables running on the
> blade servers, and after about 8 minutes it gives an error saying the
> connection timed out.
>
> Can anyone shed some light on this problem as soon as possible?
>
> Thanks
> A G Srinivas
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/