Run "top" and see how much memory you have left.
I have the same problem and found that I am running
out of memory because, for some reason, very few of my
memory pages are returned to the memory pool. After
about 400 runs of my code I have used 499 MB of memory
and MPI just stops. I need to reboot everything in
order to get LAM to run again. (I am almost positive
the leak has to do with Linux, because if I just boot
the computer and run "top", I lose 8K every 3 seconds
or so...)
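
If you want numbers instead of eyeballing "top", something like the
sketch below could log free memory over time. It assumes a Linux box
where /proc/meminfo has a "MemFree:" line; the function names
mem_free_kb and mem_lost_kb are made up for this example. Keep in mind
that MemFree also shrinks as the kernel fills its buffer and page
cache, so a falling number by itself does not prove a leak.

```shell
#!/bin/sh
# Print the MemFree value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo, which is an assumption: Linux only.
mem_free_kb() {
    awk '/^MemFree:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Print how many kB were lost between two samples.
# A negative result means memory came back.
mem_lost_kb() {
    echo $(( $1 - $2 ))
}
```

To compare against the "8K every 3 seconds" figure, take one sample
with mem_free_kb, sleep 3, take a second sample, and pass both values
to mem_lost_kb.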
-j
--- Bill Bruno <billb_at_[hidden]> wrote:
>
> There is no /usr/bin/lamd. I installed lam in my home directory,
> and set LAMHOME in .bashrc. /tmp is writable.
>
> The random sockets could be a problem; I'm not sure how to test
> that, but if I do, say,
>
> $ cat > /dev/udp/localhost/17
>
> there is no error, whereas
>
> $ cat > /dev/tcp/localhost/17
> bash: connect: Connection refused
> bash: /dev/tcp/localhost/17: Connection refused
>
> Is it tcp or udp that is needed?
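
A note on the test above: UDP is connectionless, so opening
/dev/udp/localhost/17 in bash succeeds whether or not anything is
listening; only the TCP result says something about the port. A
minimal sketch of a TCP check, assuming a bash built with /dev/tcp
redirection support (the function name tcp_port_open is hypothetical,
not part of LAM or bash):

```shell
#!/bin/bash
# Return 0 if a TCP connection to host ($1) on port ($2) succeeds,
# non-zero otherwise. Relies on bash's /dev/tcp pseudo-files.
tcp_port_open() {
    # Open fd 3 on the socket inside a subshell so the descriptor
    # is closed again as soon as the subshell exits.
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
```

For example, "tcp_port_open localhost 17 && echo open" prints "open"
only when something accepts connections on port 17.
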
>
> I was hoping to get lam up without needing to get ahold of the su.
>
> On Wed, Sep 04, 2002 at 01:31:29AM -0500, Vishal Sahay wrote:
> > It looks like the fork is failing, somehow.
> > Check for the following things:
> >
> > - /usr/bin/lamd is the same version of LAM as lamboot. See if
> >   lamboot is in /usr/bin, and that they're both 6.5.6.
> >
> > - /tmp is writable?
> >
> > - Firewall software is installed such that opening random sockets to
> >   localhost will fail.
> >
> >
> > -Vishal Sahay
> >
> > ===================================================================
> > (Graduate Student, CS Dept.       Make Today A LAM/MPI Day :)
> > Indiana University, Bloomington)  http://www.lam-mpi.org
> > http://cs.indiana.edu/~vsahay
> > ===================================================================
> >
> > On Sat, 31 Aug 2002, David Shattuck wrote:
> >
> > # Hi -
> > #
> > # I am trying to boot a lam cluster with two machines. One of these cannot
> > # lamboot itself. When I try, I get an error message with no description of
> > # the error. Any idea what could be causing this? I have included the
> > # output of both "lamboot" and "lamboot -d -v" below. SSH to the machine
> > # works fine, and I have LAMRSH set to "ssh -x".
> > #
> > # thanks,
> > # David Shattuck
> > # UCLA Laboratory of Neuro Imaging
> > #
> > #
> > #
> > #
> > #
> > #
> > # [glitch_at_wulfpet3 glitch]$ lamboot
> > #
> > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > #
> > # -----------------------------------------------------------------------------
> > # lamboot encountered some error (see above) during the boot process,
> > # and will now attempt to kill all nodes that it was previously able to
> > # boot (if any).
> > #
> > # Please wait for LAM to finish; if you interrupt this process, you may
> > # have LAM daemons still running on remote nodes.
> > # -----------------------------------------------------------------------------
> > #
> > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > #
> > # [glitch_at_wulfpet3 glitch]$ lamboot -d -v
> > #
> > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > #
> > # lamboot: boot schema file: /etc/lam/lam-bhost.def
> > # lamboot: opening hostfile /etc/lam/lam-bhost.def
> > # lamboot: found the following hosts:
> > # lamboot: n0 localhost
> > # lamboot: resolved hosts:
> > # lamboot: n0 localhost --> 127.0.0.1
> > # lamboot: found 1 host node(s)
> > # lamboot: origin node is 0 (localhost)
> > # Executing hboot on n0 (localhost - 1 CPU)...
> > # lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> > # 127.0.0.1 -P 32835 -n 0 -o 0 ""
> > # hboot: process schema = "/etc/lam/lam-conf.lam"
> > # hboot: found /usr/bin/lamd
> > # hboot: performing tkill
> > # hboot: tkill
> > # hboot: booting...
> > # hboot: fork /usr/bin/lamd
> > # [1] 10980 lamd -H 127.0.0.1 -P 32835 -n 0 -o 0 -d
> > # hboot: attempting to execute
> > # -----------------------------------------------------------------------------
> > # lamboot encountered some error (see above) during the boot process,
> > # and will now attempt to kill all nodes that it was previously able to
> > # boot (if any).
> > #
> > # Please wait for LAM to finish; if you interrupt this process, you may
> > # have LAM daemons still running on remote nodes.
> > # -----------------------------------------------------------------------------
> > # wipe ...
> > #
> > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > #
> > # Executing tkill on n0 (localhost)...
> > # lamboot did NOT complete successfully
> > # [glitch_at_wulfpet3 glitch]$
> > #
> > #
> > # _______________________________________________
> > # This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > #
> >
>
> --
> _ _ _ _ _ _ _ _
> -_- -_- - -_- -_- - -_- -_- - -_- -_- -