using lam 6.6b1, a strange (a.k.a. bizarre) problem has
appeared in the past 2 days, after several months of
everything working fine.
currently i can boot all the nodes i want, and
checking "top" tells me that everything is fine.
then i try to run "mpirun" and node 0 (the first node
in the host file) suddenly has its lamd daemon killed.
node 0 also happens to NOT be the node that i am
booting from.
because of this, mpirun hangs (obviously) and lamhalt
gets stuck in a loop trying to communicate with node 0
(the switch lights turn into xmas tree lights).
the only thing left to do is to log in to each node
and kill each lamd through "top".
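
the manual cleanup above could probably be scripted. what follows is only a rough sketch under assumptions that are not part of my setup description: the host file has one node per line with "#" comments, the nodes accept ssh without a password prompt, and pkill is installed on them. parse_hosts, kill_commands and run are made-up names. by default nothing is executed, the commands are only printed.

```python
import subprocess


def parse_hosts(text):
    """Return the host names found in host-file text.

    Assumed format: one node per line, host name first, "#" starts a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            hosts.append(line.split()[0])  # ignore anything after the name
    return hosts


def kill_commands(hosts, daemon="lamd"):
    """Build one ssh command per node that kills the daemon by exact name."""
    return [["ssh", host, "pkill", "-x", daemon] for host in hosts]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=False: pkill exits non-zero when no lamd is left to kill
            subprocess.run(cmd, check=False)
```

with a host file called "hostfile" (a placeholder name), run(kill_commands(parse_hosts(open("hostfile").read()))) would just list what it intends to do, and adding dry_run=False would actually do it.
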
the events that occurred in the past 2 days:
-6.6b1 working
-changed path to use lam 6.5.6
-disabled ipchains and iptables
-6.5.6 ran 1140 "mpirun" commands
-changed path back to use lam 6.6b1
-6.6b1 not working
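
since the path was switched twice in that list, one thing that can be checked on each node is whether all the lam tools resolve from the same directory. this is a minimal sketch, not a diagnosis: the tool list and the names tool_dirs and mixed_install are my own choices for illustration, and it only looks at the PATH of the shell it is run from.

```python
import os
import shutil

# tools to look up; adjust the tuple as needed
LAM_TOOLS = ("lamboot", "lamd", "mpirun", "lamhalt")


def tool_dirs(tools=LAM_TOOLS):
    """Map each tool name to the directory it resolves from on PATH.

    A tool that is not found on PATH maps to None.
    """
    dirs = {}
    for tool in tools:
        found = shutil.which(tool)
        dirs[tool] = os.path.dirname(found) if found else None
    return dirs


def mixed_install(dirs):
    """Return True when the found tools come from more than one directory."""
    found = {d for d in dirs.values() if d is not None}
    return len(found) > 1
```

printing tool_dirs() on every node and comparing the output would show whether any node still picks up binaries from the other install.
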
any help, as always, is welcome.
i am considering just reinstalling lam because there
are "only" 4 nodes, but this won't be an option later
with 32 nodes...
and for those who know, this is not the version of
LAM that i am patching ("hacking") ;)
-j
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/