Hi all,
My application uses MPICH-1.2.5 and runs on 21 IBM blade servers.
The application was working fine initially.
Then the network cards on 20 of these 21 IBM blade servers were teamed
for network redundancy. (Teaming/bonding was done as follows: each server
has two interfaces, eth0 and eth1, and a bond0 interface is configured on
top of them as the single virtual interface used for communication.)
My application has one server acting as the control server, which invokes
executables on the 20 blade servers. We use MPICH-1.2.5 for this purpose.
Since this change, when I run the application, the control server waits
for a response from the executables running on the blade servers, and
after 8 minutes it gives a "timeout error".
However, connectivity from the control server to all the blade servers is
fine, and so is connectivity from the blade servers to the control server:
I can copy files between machines in both directions (from the control
server to a blade server and from a blade server to the control server)
using scp.
Also, the control server can invoke the executables on the blade servers
using ssh. The problem arises when the executables on the blade servers
(on which teaming is done) have to communicate back to the control server
(they report their rank in the MPI environment as a return message to the
control server). At this stage, the control server waits for the rank
messages from the executables running on the blade servers, and after
about 8 minutes it gives an error saying the connection timed out.
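The scp and ssh tests above only exercise the SSH port, so they do not show whether a blade server can open a TCP connection back to the control server on some other port. A minimal sketch of such a check follows, assuming Python 3 is available on the machines. The helper names open_listener and can_connect are hypothetical and are not part of MPICH, and it is only an assumption that the rank messages travel over a separate TCP connection of this kind.

```python
import socket


def open_listener(bind_addr="0.0.0.0", port=0):
    """Open a listening TCP socket on the control server.

    port=0 asks the OS for any free port. Returns (socket, port_number).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv, srv.getsockname()[1]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Intended to be run on a blade server, pointing at the control server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

The idea would be to call open_listener() on the control server, note the port it returns, and then call can_connect() with the control server's name and that port from one of the teamed blade servers and from the one blade server without teaming. If only the teamed blades fail, the problem is more likely in the bond0 setup or name resolution than in the application.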
Can anyone shed some light on this problem as soon as possible?
Thanks
A G Srinivas