LAM/MPI General User's Mailing List Archives

From: Bill Bruno (billb_at_[hidden])
Date: 2002-09-04 12:10:15


Well, top says I have 10MB free. But perhaps it's
interesting that ipcs shows no numbers. Could this
be related to the socket issue? What can I do if
the inability to open TCP sockets is the problem?

Bill
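
A note on testing the "random sockets to localhost" theory without root. "Connection refused" on port 17 most likely just means nothing is listening there, which is the normal TCP answer for a closed port, so by itself it does not show that a firewall is in the way. UDP is connectionless, so the lack of an error on /dev/udp says little either way. A more direct check is to listen on a port the kernel picks and then connect to it over 127.0.0.1. The sketch below does that in Python rather than bash; the function name is made up for illustration and is not part of LAM, and whether lamd needs exactly this kind of connection is an assumption.

```python
import socket


def loopback_tcp_works(host="127.0.0.1", timeout=5.0):
    """Return True if a TCP connection to an ephemeral port on `host` succeeds.

    Hypothetical helper, not part of LAM: it only shows whether this user
    can listen on and connect to a kernel-assigned localhost port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((host, 0))          # port 0: let the kernel pick a free port
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]

        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            client.connect((host, port))
            conn, _ = server.accept()
            try:
                # Round-trip a few bytes to confirm data actually flows.
                client.sendall(b"ping")
                return conn.recv(4) == b"ping"
            finally:
                conn.close()
        finally:
            client.close()
    except OSError as exc:
        # Covers refused connections, timeouts, and bind failures alike.
        print("loopback TCP test failed:", exc)
        return False
    finally:
        server.close()


if __name__ == "__main__":
    print("loopback TCP OK" if loopback_tcp_works() else "loopback TCP blocked")
```

If this reports that loopback TCP is OK, a firewall blocking localhost sockets becomes a less likely explanation, and the other items on the checklist are worth a second look.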

On Wed, Sep 04, 2002 at 08:53:42AM -0700, jeremy archuleta wrote:
> run "top" and see how much memory you have left.
> i have the same problem and found that i am running
> out of memory because somehow very few of my memory
> pages are returning to the memory pool. so, after
> about 400 runs of my code i have used 499 Mb of memory
> and mpi just stops. i need to reboot everything in
> order to get LAM to run again (i am almost positive
> the leak has to do with Linux because if i just boot
> the comp and run "top" i lose 8K every 3 seconds or so...)
>
> -j
>
>
>
> --- Bill Bruno <billb_at_[hidden]> wrote:
> >
> > There is no /usr/bin/lamd. I installed lam in my home directory,
> > and set LAMHOME in .bashrc. /tmp is writable.
> >
> > The random sockets could be a problem; I'm not sure how to test
> > that but if I do say
> >
> > $ cat > /dev/udp/localhost/17
> > there is no error, whereas
> > $ cat > /dev/tcp/localhost/17
> >
> > bash: connect: Connection refused
> > bash: /dev/tcp/localhost/17: Connection refused
> >
> > Is it tcp or udp that is needed?
> >
> > I was hoping to get lam up without needing to get ahold
> > of the su.
> >
> > On Wed, Sep 04, 2002 at 01:31:29AM -0500, Vishal Sahay wrote:
> > > It looks like the fork is failing, somehow.
> > > Check for the following things:
> > >
> > > - /usr/bin/lamd is the same version of LAM as lamboot. See if
> > > lamboot is in /usr/bin, and that they're both 6.5.6.
> > >
> > > - /tmp is writable?
> > >
> > > - Firewall software is installed such that opening random sockets to
> > > localhost will fail.
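
The first two items on the checklist above can be roughly scripted. This is a minimal sketch, assuming the LAM programs (lamboot, hboot, lamd) are meant to be found on PATH; the helper name is hypothetical. It only checks that all three resolve to the same directory, as a proxy for coming from the same install, and that the temp directory accepts a new file. It does not ask the programs for their version numbers.

```python
import os
import shutil
import tempfile


def check_lam_install(names=("lamboot", "hboot", "lamd"), tmpdir="/tmp"):
    """Report where each LAM program resolves on PATH and whether tmpdir is writable.

    Hypothetical helper: it compares install directories only and does not
    query the programs for their version numbers.
    """
    found = {name: shutil.which(name) for name in names}
    dirs = {os.path.dirname(os.path.realpath(p)) for p in found.values() if p}

    try:
        # Creating and removing a scratch file is a more reliable test than
        # os.access(), which can disagree with what open() actually allows.
        with tempfile.NamedTemporaryFile(dir=tmpdir):
            tmp_ok = True
    except OSError:
        tmp_ok = False

    return {
        "found": found,
        "same_dir": len(dirs) == 1 and all(found.values()),
        "tmp_writable": tmp_ok,
    }


if __name__ == "__main__":
    report = check_lam_install()
    for name, path in report["found"].items():
        print(f"{name}: {path or 'not on PATH'}")
    print("all from one directory:", report["same_dir"])
    print("/tmp writable:", report["tmp_writable"])
```

A mix of directories in the output (say, one program under /usr/bin and another under a home-directory install) would point at the version mismatch the first item warns about.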
> > >
> > >
> > > -Vishal Sahay
> > >
> >
> ===================================================================
> > > (Graduate Student, CS Dept. Make Today A LAM/MPI
> > Day :)
> > > Indiana University, Bloomington)
> > http://www.lam-mpi.org
> > > http://cs.indiana.edu/~vsahay
> > >
> >
> ===================================================================
> > >
> > > On Sat, 31 Aug 2002, David Shattuck wrote:
> > >
> > > # Hi -
> > > #
> > > # I am trying to boot a lam cluster with two machines. One of these cannot
> > > # lamboot itself. When I try, I get an error message with no description of
> > > # the error. Any idea what could be causing this? I have included the
> > > # output of both "lamboot" and "lamboot -d -v" below. SSH to the machine
> > > # works fine, and I have LAMRSH set to "ssh -x".
> > > #
> > > # thanks,
> > > # David Shattuck
> > > # UCLA Laboratory of Neuro Imaging
> > > #
> > > #
> > > #
> > > #
> > > #
> > > #
> > > # [glitch_at_wulfpet3 glitch]$ lamboot
> > > #
> > > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > > #
> > > # -----------------------------------------------------------------------------
> > > # lamboot encountered some error (see above) during the boot process,
> > > # and will now attempt to kill all nodes that it was previously able to
> > > # boot (if any).
> > > #
> > > # Please wait for LAM to finish; if you interrupt this process, you may
> > > # have LAM daemons still running on remote nodes.
> > > # -----------------------------------------------------------------------------
> > > #
> > > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > > #
> > > # [glitch_at_wulfpet3 glitch]$ lamboot -d -v
> > > #
> > > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > > #
> > > # lamboot: boot schema file: /etc/lam/lam-bhost.def
> > > # lamboot: opening hostfile /etc/lam/lam-bhost.def
> > > # lamboot: found the following hosts:
> > > # lamboot: n0 localhost
> > > # lamboot: resolved hosts:
> > > # lamboot: n0 localhost --> 127.0.0.1
> > > # lamboot: found 1 host node(s)
> > > # lamboot: origin node is 0 (localhost)
> > > # Executing hboot on n0 (localhost - 1 CPU)...
> > > # lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H
> > > # 127.0.0.1 -P 32835 -n 0 -o 0 ""
> > > # hboot: process schema = "/etc/lam/lam-conf.lam"
> > > # hboot: found /usr/bin/lamd
> > > # hboot: performing tkill
> > > # hboot: tkill
> > > # hboot: booting...
> > > # hboot: fork /usr/bin/lamd
> > > # [1] 10980 lamd -H 127.0.0.1 -P 32835 -n 0 -o 0 -d
> > > # hboot: attempting to execute
> > > # -----------------------------------------------------------------------------
> > > # lamboot encountered some error (see above) during the boot process,
> > > # and will now attempt to kill all nodes that it was previously able to
> > > # boot (if any).
> > > #
> > > # Please wait for LAM to finish; if you interrupt this process, you may
> > > # have LAM daemons still running on remote nodes.
> > > # -----------------------------------------------------------------------------
> > > # wipe ...
> > > #
> > > # LAM 6.5.6/MPI 2 C++/ROMIO - University of Notre Dame
> > > #
> > > # Executing tkill on n0 (localhost)...
> > > # lamboot did NOT complete successfully
> > > # [glitch_at_wulfpet3 glitch]$
> > > #
> > > #
> > > # _______________________________________________
> > > # This list is archived at http://www.lam-mpi.org/MailArchives/lam/
> > > #
> > >
> >
> > --
> > _ _ _ _ _ _ _ _
> > -_- -_- - -_- -_- - -_- -_- - -_- -_- -
>
>

-- 
   _   _     _   _     _   _     _   _ 
-_- -_- - -_- -_- - -_- -_- - -_- -_- - 