Hi,
I am new to LAM and have so far run only a few basic programs of the master/slave type, where process 0 acts as the master and distributes the work to the other processes. I am programming LAM on a 2-node Beowulf cluster. The problem is that when I invoke my program with mpirun like this:
mpirun -np 10 myprog
and any one of the processes dies, LAM exits with a message like this:
one of the many processes started by mpirun has failed...
process ... on node ... has terminated ...
I understand that this is how LAM is supposed to behave: if any process in a communicator (MPI_COMM_WORLD in this case) dies, LAM kills all the processes in that communicator. Is that right, or am I missing something?
Moreover, what if I want a program that spawns multiple slave processes such that, should any slave process fail, the master immediately finds out and redistributes the job?
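To make the second question concrete, here is a rough sketch of the bookkeeping I imagine on the master side. It deliberately contains no MPI calls, because how the master would actually detect a dead slave under LAM is exactly the part I do not know. All names here (owner, alive, init_table, assign_jobs, worker_failed) and the sizes NJOBS and NWORKERS are made up for illustration, and the worker indices are just positions in a table, not MPI ranks.

```c
#include <assert.h>

#define NJOBS      6    /* hypothetical number of work units   */
#define NWORKERS   3    /* hypothetical number of slave slots  */
#define UNASSIGNED (-1)

int owner[NJOBS];       /* owner[j] = worker holding job j, or UNASSIGNED */
int alive[NWORKERS];    /* alive[w] = 1 while worker w is believed alive  */

/* Start with every worker alive and every job unassigned. */
void init_table(void)
{
    for (int w = 0; w < NWORKERS; w++) alive[w] = 1;
    for (int j = 0; j < NJOBS; j++) owner[j] = UNASSIGNED;
}

/* Return the first live worker at or after 'start' (wrapping), or -1. */
int next_live_worker(int start)
{
    for (int i = 0; i < NWORKERS; i++) {
        int w = (start + i) % NWORKERS;
        if (alive[w]) return w;
    }
    return -1;
}

/* Hand every unassigned job to a live worker, round-robin.
 * Returns the number of jobs handed out, or -1 if no worker is alive. */
int assign_jobs(void)
{
    int assigned = 0;
    int cursor = 0;
    for (int j = 0; j < NJOBS; j++) {
        if (owner[j] != UNASSIGNED) continue;
        int w = next_live_worker(cursor);
        if (w < 0) return -1;
        owner[j] = w;       /* the real master would send job j to w here */
        cursor = w + 1;
        assigned++;
    }
    return assigned;
}

/* To be called when the master learns that worker w has died:
 * mark it dead and release its jobs so assign_jobs() can hand them out again.
 * Returns the number of jobs released. */
int worker_failed(int w)
{
    assert(w >= 0 && w < NWORKERS);
    int released = 0;
    alive[w] = 0;
    for (int j = 0; j < NJOBS; j++) {
        if (owner[j] == w) {
            owner[j] = UNASSIGNED;
            released++;
        }
    }
    return released;
}
```

The idea is that the master calls assign_jobs() once at startup and again after every worker_failed(). What I am missing is the MPI side: how the master gets to call worker_failed() at all instead of being killed along with the dead slave.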
Are there any simple code examples?