
LAM/MPI General User's Mailing List Archives


From: DeJiang Jin (dxj7602_at_[hidden])
Date: 2004-07-19 09:32:19


Hi, everyone,
  I have encountered the same problem described in previous mails: 1)
tping completes only 3 pings. 2) Even then, lamd is still alive on each
node. 3) When I run MPI programs, they hang before MPI_INIT completes,
and only one remote process has been forked. The local processes are all
forked if they come before all remote processes in the order specified
on the mpirun command line. 4) Monitoring the network packets shows that
these processes send a lot of UDP packets to each other, but there is no
TCP communication at all.
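
  Since point 4 reports UDP traffic but no TCP traffic, one way to narrow
this down outside of LAM is to check whether a plain TCP connection can be
opened from one node to another over the NIC in question. The following is
only a minimal sketch in Python, not part of LAM: the function name
tcp_can_connect, the host "node2", and the port 22 are hypothetical
placeholders, and it assumes some TCP service is already listening on the
target port of the remote node.

```python
import socket


def tcp_can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration; it is unrelated to LAM's own
    tools. A refused connection, an unreachable host, a failed name
    lookup, and a timeout all count as failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "node2" and port 22 are assumptions; substitute a node and a port
    # where a TCP service is known to be listening.
    print(tcp_can_connect("node2", 22))
```

If this returns False with the 3Com NICs active and True with the tg3 NICs
active, that would suggest the problem is in TCP over that NIC and driver
rather than in LAM itself.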
  I guess it is a compatibility problem, because I worked around it as
follows:
  Each computer in the system has a built-in NIC, and I added one more
NIC (3Com) to each. The former uses the tg3 driver that is installed with
the Red Hat Fedora OS. The latter uses the 3Com driver that comes with the
NIC. Both work fine for general use.
  When I installed LAM 7.0.6, the 3Com NICs were the ones configured for
use. LAM can be booted but fails to run MPI programs, as described above.
But when I switch to the tg3 NICs, everything is fine (I can run the
sample MPI programs). Switching between the tg3 and 3Com NICs means
activating one, deactivating the other, and assigning the active one
exactly the same IP address. The firewall is disabled on all nodes in the
private system.
  More interesting still, if only one node in the system keeps using the
3Com NIC, LAM also works. But if more than one node uses the 3Com NIC, LAM
fails. It seems that all the parts (LAM software, Linux, and NIC) work on
their own, but LAM fails with some combinations of them.
  I hope someone can explain the real cause of this problem and suggest
a good solution.

  Best Regards,

 Dejiang

Jeff Squyres wrote:

>
>On Fri, 16 Jul 2004, Gkikas Magiorkinis wrote:
>
>> I have checked the security settings and it is at the "no firewall"
>> setting. Is there any specific test to check the firewall?
>
>Bogdan answered this.
>
>> All the nodes are running the lam. When the tping hangs the only way
>> to bring down the lam at the tpinging nodes is to use wipe. Lamhalt
>> does not work for these specific nodes but it works for the rest of
>> the nodes (I mean the nodes I did not choose to tping).
>
>When tping hangs, can you check to see if the lamd is still running on
>all the nodes? One of the reasons that tping (and lamhalt) may hang is
>if a lamd fails/aborts.
>
>If this is what is happening, it is quite possible that the RPM you
>installed is not compatible with your system (there's a million reasons
>this could be happening). It may be advisable to either build from
>source or download the SRPM and rebuild it for your system (see the
>thread that just wrapped up about your installed version of Libtool!
>http://www.lam-mpi.org/MailArchives/lam/msg08359.php).
>
>> One additional info is that i have installed MPICH also and it seems
>> to work for some applications. The MPICH is installed in directory
>> that is commonly shared by all the nodes.
>
>Note that LAM can do this as well; if you uninstall the RPM and build
>LAM from source in a directory that is accessible on all nodes, it can
>be an easier software management solution in many cases. See the LAM
>FAQ for more details here ("Typical setup of LAM").
>
>Hope this helps.
>
>--
>{+} Jeff Squyres
>{+} jsquyres_at_[hidden]
>{+} http://www.lam-mpi.org/
>_______________________________________________
>This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>