On Feb 9, 2006, at 11:11 AM, rob fiske wrote:
> laminfo returns the same information both locally and on the remote
> machine through ssh. The code is the same version but was compiled
> separately (for both lam and nwchem, due to another error I was
> getting when I just copied the binaries). Is this an issue as you
> were saying? If so, would compiling either code with shared libraries
> enabled fix the problem, or should I go back to the nwchem list to
> try and get that other error resolved?
I don't think this is a nwchem problem or a problem that will be
solved with shared libraries. For whatever reason, it appears that
the nwchem in /usr/local/NWChem/bin on the host Cobalt was built
against LAM 7.1.x, while the build on the other hosts was against
LAM 7.0.x. That's the only way the error message below could have
happened.

You might want to try rebuilding LAM on the cobalt machine (since
it's a different binary), double-checking that it's finding the
LAM/MPI that you expect to be using.
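
For example (assuming "cobalt" is simply the name you use to reach
that node over ssh; substitute whatever you actually use), comparing
these outputs side by side should show whether the remote side is
picking up a 7.1.x install:

  laminfo
  ssh cobalt laminfo
  ssh cobalt 'which mpirun lamd'

If the LAM versions (or the lists of rpi modules) differ, rebuild so
that every node uses the same LAM installation, and check that its
bin directory comes first in the PATH that non-interactive ssh shells
see.
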
Brian
>> From: Brian Barrett <brbarret_at_[hidden]>
>> Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
>> To: General LAM/MPI mailing list <lam_at_[hidden]>
>> Subject: Re: LAM: Mac problem with some apps
>> Date: Wed, 8 Feb 2006 20:32:52 -0500
>>
>> On Feb 8, 2006, at 12:27 PM, rob fiske wrote:
>>
>>> ==============================================
>>> Palladium:~/tests/QM/BH4_N fiske$ mpirun C /usr/local/NWChem/bin/nwchem tests/QM/BH4_N/test.nw
>>> ---------------------------------------------------------------------------
>>> It seems that [at least] one of the processes that was started with
>>> mpirun chose a different RPI than its peers. For example, at least
>>> the following two processes mismatched in their RPI selections:
>>>
>>> MPI_COMM_WORLD rank 0: tcp (v7.0.0)
>>> MPI_COMM_WORLD rank 2: usysv (v7.1.0)
>>>
>>> All MPI processes must choose the same RPI module and version when
>>> they start. Check your SSI settings and/or the local environment
>>> variables on each node.
>>> ---------------------------------------------------------------------------
>>> ---------------------------------------------------------------------------
>>> The selected RPI failed to initialize during MPI_INIT. This is a
>>> fatal error; I must abort.
>>>
>>> This occurred on host Cobalt (n1).
>>> The PID of failed process was 15412 (MPI_COMM_WORLD rank: 2)
>>> ==============================================
>>>
>>> Both machines have LAM-7.0.6 installed, both run Mac OS X 10.3.9,
>>> and both have G4 CPUs.
>>>
>>> Has anyone encountered a problem such as this before (I have tried
>>> giving the -ssi option to mpirun as found on this list)?
>>
>> The error message really does seem to indicate that at least one
>> process is using LAM 7.1. Since the error message is on rank 2, that
>> suggests that it might be on the remote node, so you might be running
>> into path search issues. One quick way to find out is to run "ssh
>> <node> laminfo" to make sure you are getting the right one. Since
>> LAM is generally compiled statically by default, make sure you are
>> running the same version of your code on both nodes - otherwise, the
>> component lists could be different based on what is compiled into the
>> LAM library.
>>
>> Brian
>>
>>
>> --
>> Brian Barrett
>> LAM/MPI developer and all around nice guy
>> Have a LAM/MPI day: http://www.lam-mpi.org/
>>
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/