Jeff,
I was explicitly invoking the lamboot command with a fully qualified
pathname, but LAM's bin directory was not on the PATH. I've added it to
the PATH in the PBS environment, and now it all seems to be working.
Thanks for your help!
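
For anyone who finds this in the archive, here is a minimal sketch of the
fix. It assumes the /usr/local/lam-7.1.1 prefix from the configure lines
quoted below; prepend_path is a hypothetical helper, not part of LAM or PBS.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=2

# Hypothetical helper (not part of LAM or PBS): put a directory at the
# front of PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already on PATH, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: LAM was installed with --prefix=/usr/local/lam-7.1.1.
LAM_PREFIX=${LAM_PREFIX:-/usr/local/lam-7.1.1}
prepend_path "$LAM_PREFIX/bin"

# Boot only inside a PBS job where lamboot is actually reachable, so the
# sketch is harmless when run elsewhere.
if [ -n "${PBS_NODEFILE:-}" ] && command -v lamboot >/dev/null 2>&1; then
    lamboot -d -v "$PBS_NODEFILE"
fi
```

With LAM's bin directory on the PATH, lamboot can find helper programs such
as tkill by name, which calling lamboot by its full pathname alone did not
achieve.
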
- Beth
Jeff Squyres wrote:
> This output confirms that tm built properly and is integrated into
> your LAM/MPI installation.
>
> The problem appears to be here in the output from lamboot:
>
> n-1<794> ssi:boot:tm: starting wipe on (x.grid.umich.edu)
> Can't find executable for tkill
>
> "tkill" is one of the LAM executables. If it can't be found, lamboot
> is going to abort (and it did).
>
> However, I can't figure out how lamboot would be found but tkill would
> not (they should be in the same directory). Is LAM's installation
> directory in your PATH (specifically, $prefix/bin)? This is going to
> sound dumb, but can you verify that both tkill and lamboot exist in
> the same directory and are both executable by you?
>
> Specifically, put a "which lamboot" and "which tkill" at the top of
> your PBS script -- let's triple check that you're getting all the
> "right" executables. What the heck -- put a "laminfo" in there, too
> -- we can verify that you're finding the right laminfo, etc.
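
A minimal sketch of that check for the top of the PBS script. report_exe is
a hypothetical helper name; it uses the POSIX "command -v" in place of
"which", and the executable names are the ones mentioned above.

```shell
# Hypothetical helper: print where an executable resolves on the current
# PATH, or say that it cannot be found. Uses POSIX "command -v" rather
# than "which".
report_exe() {
    if exe_path=$(command -v "$1" 2>/dev/null); then
        echo "$1 -> $exe_path"
    else
        echo "$1 -> NOT FOUND on PATH"
    fi
}

# Show the PATH the job actually sees, then each LAM executable.
echo "PATH=$PATH"
for exe in lamboot tkill laminfo; do
    report_exe "$exe"
done
```

If lamboot and tkill resolve to different directories, or tkill is not
found at all, the PATH seen by the job is the first thing to look at.
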
>
>
> On May 27, 2005, at 5:01 PM, Beth Kirschner wrote:
>
>> Thanks for the quick reply. Regarding your last comment (that the
>> rsh module was chosen and not the tm module), I had sent you the
>> output from my build _without_ the tm module -- so this is to be
>> expected.
>>
>> I've attached a compressed tarball with the following files:
>>
>> config.log -- lam-7.1.1/config.log
>> configure.out -- output of configure command (with tm module)
>> laminfo.txt -- output of laminfo
>> mpi.err -- output from trying to run lamboot
>> tm.config.log -- lam-7.1.1/share/ssi/boot/tm/config.log
>>
>> Thanks in advance for any help,
>> - Beth
>>
>> Jeff Squyres wrote:
>>
>>> On May 27, 2005, at 9:29 AM, Beth Kirschner wrote:
>>>
>>>> I'm having trouble getting 'lamboot' to execute from within a PBS
>>>> script on a Mac OSX box. It runs fine without PBS. Has anyone else
>>>> had success with this?
>>>
>>> I *think* that we have tested this (PBS/Torque on OSX), but I can't
>>> swear to it. Hypothetically, it *should* be the same as it is on
>>> Linux -- there really shouldn't be any difficulties with this. If
>>> there are, it's a bug that we should fix.
>>>
>>>> I've tried building LAM 7.1.1 in two configurations:
>>>>
>>>> # configure --prefix=/usr/local/lam-7.1.1 -with-rsh="ssh -x"
>>>> --without-fc
>>>> # configure --prefix=/usr/local/lam-7.1.1 -with-rsh="ssh -x"
>>>> --without-fc --with-boot=tm --with-boot-tm=/usr/local/pbs
>>>
>>> Can you send the output of the latter? I'd like to see the full
>>> output of the configure including the --with-boot... switches
>>> (please compress). Also send the corresponding config.log file, and
>>> share/ssi/boot/tm/config.log.
>>>
>>> You can also check to ensure that the TM support built properly by
>>> running the laminfo command. It will show you all the modules that
>>> were built into LAM. If the "tm" boot module is not listed, then
>>> the PBS/Torque support did not build properly.
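
A sketch of that check. It assumes laminfo lists each boot module on a line
of the form "SSI boot: <name> (...)", which should be confirmed against
your own laminfo output; has_boot_module is a hypothetical helper that
reads that output on standard input.

```shell
# Hypothetical helper: succeed if the named boot module appears in
# laminfo-style output read from standard input.
# Assumption: boot modules are listed as "SSI boot: <name> ...".
has_boot_module() {
    grep -Eq "SSI boot: $1( |\$)"
}

# Only run the check where laminfo is actually installed.
if command -v laminfo >/dev/null 2>&1; then
    if laminfo | has_boot_module tm; then
        echo "tm boot module is present"
    else
        echo "tm boot module is missing; check the configure output"
    fi
fi
```
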
>>>
>>> If it did not build properly, the output from configure should shed
>>> light on the reason why (the determination of whether to build a
>>> given module or not is made during configure).
>>>
>>>> Here's the script I've been running:
>>>>
>>>> #PBS -l nodes=1:ppn=2
>>>> /usr/local/lam-7.1.1/bin/lamboot -d -v ${PBS_NODEFILE}
>>>>
>>>> Here's some of the output:
>>>>
>>>> n-1<7964> ssi:boot:base:server: opened port 55040
>>>> n-1<7964> ssi:boot:base:linear: booting n0 (x.grid.umich.edu)
>>>> n-1<7964> ssi:boot:rsh: starting lamd on (x.grid.umich.edu)
>>>> n-1<7964> ssi:boot:rsh: starting on n0 (x.grid.umich.edu): hboot
>>>> -t -c lam-conf.lamd -d -v -sessionsuffix pbs-3497.x.grid.umich.edu
>>>> -I -H 141.211.23.234 -P 55040 -n 0 -o 0
>>>> n-1<7964> ssi:boot:rsh: launching locally
>>>> n-1<7964> ssi:boot:base:linear: Failed to boot n0
>>>> (x.grid.umich.edu)
>>>> n-1<7964> ssi:boot:base:server: closing server socket
>>>> n-1<7964> ssi:boot:base:linear: aborted!
>>>> lamboot did NOT complete successfully
>>>
>>> Note that the rsh module was chosen instead of the tm module -- this
>>> seems to imply that the tm support was not built and included in
>>> your LAM installation. Can't say this for sure without the other
>>> data (see above), but it's one possible explanation.
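
One way to see which boot module was picked is to pull the module name out
of the "ssi:boot:<module>:" prefixes in the lamboot -d output. A rough
sketch, assuming that output was saved to a file such as mpi.err;
boot_modules_used is a hypothetical helper, and "base" is filtered out on
the assumption that those lines come from shared code rather than from a
specific boot module.

```shell
# Hypothetical helper: read "lamboot -d" output on standard input and
# print the distinct boot module names found in "ssi:boot:<module>:" tags.
# Assumption: "base" marks shared boot code, not a selectable module.
boot_modules_used() {
    sed -n 's/.*ssi:boot:\([a-z0-9_]*\):.*/\1/p' | grep -v '^base$' | sort -u
}

# Example use on a saved log, if one exists in the current directory.
if [ -r mpi.err ]; then
    boot_modules_used < mpi.err
fi
```

For the output quoted above this prints "rsh"; a run that went through the
PBS/Torque support would be expected to print "tm" instead.
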
>>>
>
>