
LAM/MPI General User's Mailing List Archives


From: Robert Fiske (rfiske_at_[hidden])
Date: 2006-02-16 17:18:35


I managed to get the program to launch; however, after starting, the
following error is produced (this particular run showed no output from
the program, but the error remains the same). Is this still a LAM issue,
or should I move over to the nwchem list for help?

Thank you for your time and assistance

Robert Fiske

ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
trying to connect to host=Mercury, port=49366
0:armci_CreateSocketAndConnect: gethostbyname failed: 0
0:armci_CreateSocketAndConnect: gethostbyname failed: 0
Last System Error Message from Task 0:: Invalid argument
-10000(s):armci_data_serv: unknown format code: (0,0)
-10000(s):armci_data_serv: unknown format code: (0,0)
Last System Error Message from Task -10000:: Invalid argument
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 424 failed on node n1 (192.168.5.14) with exit status 1.
-----------------------------------------------------------------------------
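The repeated "gethostbyname failed" lines suggest that the node reporting the error cannot resolve the host name "Mercury" from the log above. A minimal sketch of the same lookup ARMCI performs, which can be run on each node to check name resolution (Python is used here purely for illustration; `check_resolvable` is a hypothetical helper, not part of ARMCI):

```python
import socket

def check_resolvable(hostname):
    """Perform the same kind of lookup as C's gethostbyname().
    Returns the resolved address, or None if the lookup fails
    (the 'gethostbyname failed' case in the log above)."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Run this on the node that reported the failure. 'Mercury' is the
# host named in the log; None here means /etc/hosts or DNS on this
# node lacks an entry for it.
print(check_resolvable("localhost"))  # typically an address like 127.0.0.1
print(check_resolvable("Mercury"))
```

If "Mercury" comes back unresolvable on one node, adding it to that node's /etc/hosts is the usual fix on a small cluster.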

>From: Brian Barrett <brbarret_at_[hidden]>
>Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
>To: General LAM/MPI mailing list <lam_at_[hidden]>
>Subject: Re: LAM: Mac problem with some apps
>Date: Thu, 9 Feb 2006 13:19:34 -0500
>
>On Feb 9, 2006, at 11:11 AM, rob fiske wrote:
>
> > laminfo returns the same information both locally and on the remote
> > machine through ssh. The code is the same version but was compiled
> > separately (for both lam and nwchem, due to another error I was
> > getting when I just copied the binaries). Is this an issue as you
> > were saying? If so, would compiling either code with shared libraries
> > enabled fix the problem, or should I go back to the nwchem list to
> > try and get that other error resolved?
>
>I don't think this is a nwchem problem or a problem that will be
>solved with shared libraries. For whatever reason, it appears that
>the nwchem in /usr/local/NWChem/bin on the host Cobalt was built
>against LAM 7.1.x while the build on the other hosts was with LAM
>7.0.x. That's the only way the error message below could have happened.
>
>You might want to try rebuilding LAM on the cobalt machine (since
>it's a different binary), double checking that it's finding the LAM/
>MPI that you expect to be using.
>
>Brian
>
>
> >> From: Brian Barrett <brbarret_at_[hidden]>
> >> Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
> >> To: General LAM/MPI mailing list <lam_at_[hidden]>
> >> Subject: Re: LAM: Mac problem with some apps
> >> Date: Wed, 8 Feb 2006 20:32:52 -0500
> >>
> >> On Feb 8, 2006, at 12:27 PM, rob fiske wrote:
> >>
> >>> ==============================================
> >>> Palladium:~/tests/QM/BH4_N fiske$ mpirun C /usr/local/NWChem/bin/nwchem tests/QM/BH4_N/test.nw
> >>> -----------------------------------------------------------------------------
> >>> It seems that [at least] one of the processes that was started with
> >>> mpirun chose a different RPI than its peers. For example, at least
> >>> the following two processes mismatched in their RPI selections:
> >>>
> >>> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
> >>> MPI_COMM_WORLD rank 2: usysv (v7.1.0)
> >>>
> >>> All MPI processes must choose the same RPI module and version when
> >>> they start. Check your SSI settings and/or the local environment
> >>> variables on each node.
> >>> -----------------------------------------------------------------------------
> >>> -----------------------------------------------------------------------------
> >>> The selected RPI failed to initialize during MPI_INIT. This is a
> >>> fatal error; I must abort.
> >>>
> >>> This occurred on host Cobalt (n1).
> >>> The PID of failed process was 15412 (MPI_COMM_WORLD rank: 2)
> >>> ==============================================
> >>>
> >>> Both machines have LAM-7.0.6 installed, both run Mac OS X 10.3.9,
> >>> and both have G4 CPUs.
> >>>
> >>> Has anyone encountered a problem such as this before (I have tried
> >>> giving the -ssi option to mpirun as found on this list)?
> >>
> >> The error message really does seem to indicate that at least one
> >> process is using LAM 7.1. Since the error message is on rank 2, that
> >> suggests that it might be on the remote node, so you might be running
> >> into path search issues. One quick way to find out is to run "ssh
> >> <node> laminfo" to make sure you are getting the right one. Since
> >> LAM is generally compiled statically by default, make sure you are
> >> running the same version of your code on both nodes - otherwise, the
> >> component lists could be different based on what is compiled into the
> >> LAM library.
> >>
> >> Brian
> >>
> >>
> >> --
> >> Brian Barrett
> >> LAM/MPI developer and all around nice guy
> >> Have a LAM/MPI day: http://www.lam-mpi.org/
> >>
> >>
> >> _______________________________________________
> >> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
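The two checks suggested in the thread above (comparing laminfo output across nodes, and forcing one RPI module) can be sketched as shell commands. This is a sketch under assumptions: the node name "cobalt" is taken from the thread, and the laminfo `-parsable` flag with its `version:lam:` output field is assumed from LAM 7.x and should be confirmed with `laminfo -h`:

```shell
#!/bin/sh
# Sketch: compare the LAM version seen locally with the one a remote
# node reports. compare_versions is a hypothetical helper for this
# sketch; the real capture commands are shown commented out below.

compare_versions() {
    # Print a verdict for two captured version strings.
    if [ "$1" = "$2" ]; then
        echo "versions match"
    else
        echo "version mismatch: '$1' vs '$2'"
    fi
}

# On a real cluster (node name 'cobalt' from the thread):
#   local_ver=$(laminfo -parsable | grep '^version:lam:')
#   remote_ver=$(ssh cobalt laminfo -parsable | grep '^version:lam:')
# Example values mirroring the RPI mismatch reported in the thread:
local_ver="version:lam:7.0.6"
remote_ver="version:lam:7.1.0"
compare_versions "$local_ver" "$remote_ver"

# If the versions do match, forcing every process onto one RPI module
# (the -ssi option mentioned in the thread) looks like:
#   mpirun -ssi rpi tcp C /usr/local/NWChem/bin/nwchem test.nw
```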