LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2004-07-20 10:07:04


In addition to Bogdan's replies, let me ask some general questions:

- Since you have 2 NICs for each host, do you have 2 IP addresses for
each host?
- If so, are both NICs on each node plugged into the same switch, or do
you have a second separate network for the second set of NICs?
- Are you absolutely sure that you have all the TCP networking stuff
set up correctly (regardless of whether you are using one or two
networks)? Setting up multiple NICs on each node can be quite
confusing the first time you do it (i.e., ensuring IP addresses,
netmasks, and routes are all correct).
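
As a rough illustration of the address/netmask part of that check, here is a minimal Python sketch. The addresses and netmask are made up for the example (they are not from your setup), and `same_subnet` is just a hypothetical helper name. It only tells you whether two addresses land in the same network under a given netmask; it does not look at routes.

```python
import ipaddress

def same_subnet(addr_a: str, addr_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same network under netmask."""
    # strict=False lets us pass a host address and have the host bits masked off.
    net_a = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    net_b = ipaddress.ip_network(f"{addr_b}/{netmask}", strict=False)
    return net_a == net_b

# Hypothetical addresses for two nodes; substitute your own.
print(same_subnet("192.168.1.10", "192.168.1.20", "255.255.255.0"))  # True
print(same_subnet("192.168.1.10", "192.168.2.20", "255.255.255.0"))  # False
```

If two NICs that are supposed to be on separate networks come out as "same subnet" (or the reverse), that is usually worth sorting out before looking at LAM itself.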

It's a little confusing that you are able to successfully boot LAM but
then have tping fail after *some* successful pings. That seems to
indicate some network-level problems:

1) lamboot uses mostly TCP communication to set up the LAM daemons
2) tping uses UDP communication (i.e., the lamds send UDP packets to
each other)

So if you can lamboot, then TCP communications are working. If you can
tping a few times before it fails, then UDP communications work, but
then somehow stop working. This is most suspicious -- all LAM is doing
is sendto() from one LAM daemon to another.
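
To take LAM out of the picture, you can exercise plain UDP yourself. The following is only a sketch in Python, not what lamd or tping actually do internally: `udp_echo_server`, `udp_ping`, and `loopback_demo` are hypothetical names, and the demo runs over 127.0.0.1 purely so that it is self-contained. For a real test you would run the echo side on one node and the ping side on another, using the addresses of the NICs in question and a port of your choosing.

```python
import socket
import threading

def udp_echo_server(sock: socket.socket, n: int) -> None:
    """Echo back n datagrams received on an already-bound UDP socket."""
    for _ in range(n):
        data, peer = sock.recvfrom(2048)
        sock.sendto(data, peer)

def udp_ping(host: str, port: int, count: int = 10, timeout: float = 1.0) -> int:
    """Send count datagrams with sendto() and return how many echoes came back."""
    replies = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for i in range(count):
            payload = f"ping {i}".encode()
            s.sendto(payload, (host, port))
            try:
                data, _ = s.recvfrom(2048)
            except socket.timeout:
                continue  # count this one as lost and keep going
            if data == payload:
                replies += 1
    return replies

def loopback_demo(count: int = 5) -> int:
    """Run the echo server and the pinger in one process over loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    t = threading.Thread(target=udp_echo_server, args=(server, count), daemon=True)
    t.start()
    try:
        return udp_ping("127.0.0.1", port, count=count)
    finally:
        t.join(timeout=2.0)
        server.close()

print(loopback_demo(5))
```

If a test like this between two nodes gets a few replies and then starts timing out, the problem is below LAM (driver, switch, or IP configuration).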

Some random things to check:

- ensure that you have no firewalling software running anywhere
- double and triple check that all your network setup is correct
- ensure that you're using IP addresses that are reachable from each
other (e.g., if you're using multiple different networks)
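
For the last item, it can help to see which address each hostname actually resolves to on every node, since with two NICs a name may map to the interface you did not intend. A minimal Python sketch follows; `resolve_ipv4` is a hypothetical helper, and the host list is a placeholder to be replaced with the names from your own boot schema.

```python
import socket
from typing import List

def resolve_ipv4(host: str) -> List[str]:
    """Return the sorted IPv4 addresses that host resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(host, None,
                                   family=socket.AF_INET,
                                   type=socket.SOCK_DGRAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})

# Placeholder list; replace with the hostnames from your boot schema.
for host in ["127.0.0.1"]:
    print(host, resolve_ipv4(host))
```

Running this on each node and comparing the output shows quickly whether all nodes agree on which address belongs to which name.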

On Jul 19, 2004, at 10:32 AM, DeJiang Jin wrote:

> Hi, everyone
> I have encountered the same problem described in previous mails: 1)
> tping works only 3 times; 2) even then, lamd is still alive on each
> node; 3) if I run MPI programs, they hang before MPI_INIT has
> completed, and only one remote process has been forked (local
> processes are all forked if they come before all remote processes in
> the order specified on the mpirun command line); and 4) if we monitor
> network packets, we find that these processes send a lot of UDP
> packets to each other, but there is no TCP communication.
> I guess it is a compatibility problem, because I worked around it as
> follows:
> My computers in the system have a built-in NIC, and I added one more
> NIC (3com). The former uses the tg3 driver that is installed with the
> Redhat Fedora OS. The latter uses the 3com driver that comes with
> these NICs. Both work fine for general use.
> When I installed LAM7.06, the 3com NICs were configured for use. LAM
> can be booted but fails to run MPI programs as described above. But
> when I switch to the tg3 NICs, everything is fine (I can run sample
> MPI programs). Switching between the tg3 and 3com NICs means
> activating one, deactivating the other, and assigning it exactly the
> same IP. The firewall is disabled on all nodes in the private system.
> More interestingly, if only one node in the system keeps using the
> 3com NIC, LAM also works. But if more than one node uses the 3com NIC,
> LAM fails. It seems all the parts (LAM software, Linux, and NIC) work,
> but LAM fails with some combinations.
> I hope someone can explain the real cause of this problem and suggest
> a good solution.
>
> Best Regards,
>
> Dejiang
>
> Jeff Squyres wrote:
>
>
>>
>> On Fri, 16 Jul 2004, Gkikas Magiorkinis wrote:
>>
>>> I have checked the security settings and it is at the "no firewall"
>>> setting. Is there any specific test to check the firewall?
>>
>> Bogdan answered this.
>>
>>> All the nodes are running the lam. When the tping hangs the only way
>>> to bring down the lam at the tpinging nodes is to use wipe. Lamhalt
>>> does not work for these specific nodes but it works for the rest of
>>> the nodes (I mean the nodes I did not choose to tping).
>>
>> When tping hangs, can you check to see if the lamd is still running on
>> all the nodes? One of the reasons that tping (and lamhalt) may hang is
>> if a lamd fails/aborts.
>>
>> If this is what is happening, it is quite possible that the RPM you
>> installed is not compatible with your system (there's a million
>> reasons this could be happening). It may be advisable to either build
>> from source or download the SRPM and rebuild it for your system (see
>> the thread that just wrapped up about your installed version of Libtool!
>> http://www.lam-mpi.org/MailArchives/lam/msg08359.php).
>>
>>> One additional piece of info is that I have installed MPICH also and
>>> it seems to work for some applications. MPICH is installed in a
>>> directory that is commonly shared by all the nodes.
>>
>> Note that LAM can do this as well; if you uninstall the RPM and build
>> LAM from source in a directory that is accessible on all nodes, it can
>> be an easier software management solution in many cases. See the LAM
>> FAQ for more details here ("Typical setup of LAM").
>>
>> Hope this helps.
>>
>> --
>> {+} Jeff Squyres
>> {+} jsquyres_at_[hidden]
>> {+} http://www.lam-mpi.org/
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/