Hi All,
I have a question about a problem for which I have not been
able to find any clue.
The problem is as follows.
I started the LAM daemons on 4 nodes with the lamboot command.
Then I submitted a job to all of those nodes. The master sends
a value to the other nodes by broadcast, and every node runs a
routine that depends on that value. After some time I want
to reduce the number of nodes from 4 to 3, so I use the
lamshrink command to remove node 4. After that, my program
hangs and then crashes with errors saying that a receive
failed. My question is: what am I doing wrong? Shouldn't
the lamshrink command let me reduce the number of computing
nodes without any problem?
I even tried issuing the command from within the program
itself, but it made no difference.
Am I doing something wrong?
Thanks for your attention.
Shihab
U. of Windsor