Hi again,
As per advice, I'm now running LAM-7.1.2b26.
>> 3) I can't seem to get $LAMHOME to "stick" after rebooting the master or
>> the nodes, and the non-interactive sh shell invoked by rsh doesn't seem
>> to search the usual places for things to add to my path. So I went for a
>> quick dirty fix - I wrote a script to put links to all the files in
>> usr/local/lam-mpi/bin into /usr/local/bin on each node. I don't think
>> this should be causing the problem, but thought I'd mention it as it's a
>> bit "non-standard".
>This should not be necessary.
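A minimal sketch of such a link script, for illustration only: the function name `link_bins` is hypothetical, and the two directories are passed as arguments rather than hard-coded. The source directory should be an absolute path so the links resolve from anywhere.

```shell
#!/bin/sh
# link_bins SRC DST: symlink every regular file in SRC into DST.
# Hypothetical helper; SRC should be an absolute path.
link_bins() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        # Skip anything that is not a regular file. This also covers an
        # empty SRC, where the glob is left unexpanded.
        [ -f "$f" ] || continue
        # -f replaces a stale link of the same name if one exists.
        ln -sf "$f" "$dst/$(basename "$f")"
    done
}
```

On each node this would be called as something like `link_bins /usr/local/lam-mpi/bin /usr/local/bin`, assuming that is where LAM was installed.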
I noted that the user manual (v7.1.1 page 22, second table) said LAM would
call ~/.profile when a non-interactive sh shell was called. This doesn't
appear to be happening (when using 7.1.2b26). I put an echo foo in
~/.profile, and it (foo) wasn't appearing when LAM tried to use a
non-interactive remote shell. It *was* appearing when I started an
interactive remote shell. It seems something is broken on my installation
at least. So, back to the quick dirty fix as above with the links...
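The check described above can be sketched as a small helper. The marker string and the function name `profile_marker_seen` are assumptions for illustration; the idea is just to run a command through whatever launcher is under test and see whether the line echoed from ~/.profile shows up in its output.

```shell
#!/bin/sh
# Hypothetical marker; ~/.profile on the remote node would contain:
#   echo LAM_PROFILE_MARKER
MARKER=LAM_PROFILE_MARKER

# profile_marker_seen CMD [ARGS...]
# Runs CMD and succeeds only if the marker appears in its combined
# stdout/stderr output.
profile_marker_seen() {
    out=$("$@" 2>&1)
    case $out in
        *"$MARKER"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `profile_marker_seen rsh euler -n 'echo done'` would report whether a non-interactive remote shell on euler read ~/.profile.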
> I *think* that many of your problems are because LAM 7.1.1 is putting the ]
> in the wrong place -- it's missing a space, causing the parsing on the
> remote node to go badly.
> The latest beta of 7.1.2 fixes this issue -- could you give that a
> whirl? It might also fix your $LAMHOME/PATH issues (if the shell
> parsing is wrong right off the bat, other things can go wrong).
Same issue as with 7.1.1. The close bracket is at the end of the line, LAM
barfs on stderr output, and suggests the same test command. Which still
won't work until I swap the order of the last two non-white characters -
the ' and the ) in:
rsh euler -n '( ! [ -e ./.profile] || . ./.profile;' hboot -t -c
lam-conf.lamd -d -v -s -I '"-H 192.168.0.10 -P 60547 -n 1 -o 0"' )
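To illustrate the quoting point, here is a small local experiment with `echo` standing in for `rsh euler -n` (an assumption, so that it runs without a remote node) and a shortened command text. With the closing parenthesis left outside the quotes, the local shell sees a stray `)` when the line is typed by hand; with it inside the quotes, it is just part of an argument.

```shell
#!/bin/sh
# echo stands in for "rsh euler -n"; the command text is shortened
# for illustration.

# Closing paren outside the quotes: the local shell sees a stray ")".
paren_outside="echo '( :;' echo ok )"
# Closing paren inside the quotes: it is passed along as an argument.
paren_inside="echo '( :;' echo ok ')'"

# Record whether the local shell accepts each form.
if sh -c "$paren_outside" >/dev/null 2>&1; then outside_ok=yes; else outside_ok=no; fi
if sh -c "$paren_inside" >/dev/null 2>&1; then inside_ok=yes; else inside_ok=no; fi

echo "paren outside quotes accepted: $outside_ok"
echo "paren inside quotes accepted:  $inside_ok"
```

This only concerns running the suggested test command by hand from a local shell; it says nothing about how LAM itself passes the arguments to rsh.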
Here's the error message. Much less detail this time, as the issue seems
to be the same as in my last mail:
<Preceded by lots of nice looking happy messages>
n-1<1028> ssi:boot:rsh: attempting to execute: rsh euler -n '( ! [ -e
./.profile] || . ./.profile;' hboot -t -c lam-conf.lamd -d -v -s -I '"-H
192.168.0.10 -P 55089 -n 1 -o 0"' )
ERROR: LAM/MPI unexpectedly received the following on stderr:
[: missing ]
<Followed by an explanatory message and then the same rigmarole all over
again for tkill>
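As a side check on the `[: missing ]` text itself: that is what sh's test builtin typically prints when the closing `]` is not a separate word, which matches the missing space mentioned in the quoted reply above. A minimal local sketch, with variable names chosen just for illustration:

```shell
#!/bin/sh
# No space before "]": the last argument is "./.profile]", so "[" never
# finds its closing bracket and complains on stderr.
no_space=$(sh -c '! [ -e ./.profile] || echo sourced' 2>&1)

# With the space, "]" is its own word and the test parses cleanly.
with_space=$(sh -c '! [ -e ./.profile ] || echo sourced' 2>&1)

echo "without space: $no_space"
echo "with space:    $with_space"
```

The exact wording of the complaint varies between sh implementations, so treat the message text as indicative rather than exact.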
So, er, any more suggestions? Is there a way to force LAM to get the order
of these last two characters right? I'm stumped right now. And thanks
again to Jeff for his extremely prompt reply to the last cry for help...
;-)
Cheers,
James
--
******************************************
Dr. J.R. Dorsey
CNR - ESPM - Ecosystem Science
105 Hilgard Hall
University of California, Berkeley
Berkeley, CA 94720 - 3110
E: j.dorsey_at_[hidden]
T: +1 510 642 9048
M: +1 510 499 4398
W: http://nature.berkeley.edu/biometlab/