This list is for support of the LAM implementation of MPI, not MPICH.
You'll need to contact the MPICH developers for support.
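
One generic check, not specific to MPICH or LAM, may help narrow this down before contacting them. scp and ssh only show that port 22 is reachable, whereas the report below says the blade processes have to communicate back to the control server themselves. The sketch below is a minimal illustration under that assumption. The helper names (`listen_once`, `accept_peer`, `connect_back`) are hypothetical, and the port is arbitrary rather than anything MPICH uses. It opens a listener on the control server, connects back from a blade, and reports which source address each side sees, which can show whether traffic is leaving through the bond0 address.

```python
import socket


def listen_once(host="0.0.0.0", port=0):
    """Run on the control server: open a listener, return (socket, port).

    port=0 lets the OS pick a free port; this is an arbitrary test port,
    not a port that MPICH itself uses.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def accept_peer(srv, timeout=10.0):
    """Accept one connection, acknowledge it, and return the peer's
    source IP address as seen by the control server."""
    srv.settimeout(timeout)
    conn, addr = srv.accept()
    with conn:
        conn.sendall(b"ok")
    return addr[0]


def connect_back(host, port, timeout=10.0):
    """Run on a blade: connect to the control server and return
    (local source IP chosen by the kernel, reply bytes)."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        local_ip = s.getsockname()[0]
        reply = b""
        while len(reply) < 2:
            chunk = s.recv(2 - len(reply))
            if not chunk:
                break
            reply += chunk
    return local_ip, reply
```

If `connect_back` times out from a bonded blade but works from the one blade without bonding, or if the address returned by `accept_peer` is not the bond0 address, the problem is likely in the network or name-resolution setup rather than in the application.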
On Mar 22, 2005, at 12:59 AM, ags_at_[hidden] wrote:
> Hi all,
>
> My application uses MPICH-1.2.5 and runs on 21 IBM blade servers.
>
> The application was working fine initially.
>
> Then, network card teaming was set up on 20 of these 21 IBM blade
> servers for network redundancy. (Teaming/bonding was done as follows:
> the two interfaces eth0 and eth1 on each server are combined into a
> bond0 interface, which is the single virtual interface used for
> communication.)
>
> My application has one control server, which invokes executables on
> the 20 blade servers. We are using MPICH-1.2.5 for this purpose.
>
> Since this change, when I run my application it waits for a response
> from the executables running on the blade servers; after 8 minutes it
> gives a "timeout error".
>
> However, connectivity from the control server to all the blade servers
> is fine, and connectivity from the blade servers to the control server
> also works: I can copy files across machines with scp in both
> directions (from the control server to a blade server and from a blade
> server to the control server).
>
> Also, the control server can invoke the executables on the blade
> servers using ssh. The problem occurs when the executables on the
> blade servers (the ones with teaming) have to communicate back to the
> control server (they report their rank in the MPI environment as a
> return message to the control server). At this stage, the control
> server waits for the rank messages from the executables running on the
> blade servers, and after about 8 minutes it gives an error saying the
> connection timed out.
>
> Can anyone shed some light on this problem asap?
>
> Thanks
> A G Srinivas
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
--
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/