LAM/MPI General User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2003-09-29 01:08:06


On Sep 28, 2003, at 11:05 PM, Jeremy Archuleta wrote:

> Just a thought. Whenever I have received those errors, it usually
> turns out to be 1) different executables than what I thought (forgot
> to copy manually), 2) my fault and a special case with the code
> exiting early, or 3) something crashed on that node and wiped out the
> executable (like a segfault)
>
> Hope that helps.
> If it is indeed something else, with LAM perhaps, ... uh... that could
> be a big problem.

LAM just calls fork()/exec() out on the remote nodes. We used to have
the problems you describe when all the LAM development workstations
used AFS, which did heavy client-side caching. Of course, by the time
you logged into the node to figure out what was going wrong, the cache
was invalidated and everything worked as expected.

If you are having repeated problems and are on a shared filesystem, you
might want to talk to your systems administrator. It sounds like you
may be having some problems on your machine. If you aren't using a
common filesystem, you might want to try using the -s option to mpirun.
Having mpirun push the binary out may be slightly less error-prone
than doing it by hand.
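
If you do copy binaries by hand, one way to confirm that a node has the
same executable you just built is to compare checksums. The following is
a minimal sketch in Python and is not part of LAM; the function names
file_digest and same_binary are made up for illustration, and it assumes
you can run the script on each node (or on two copies of the file).

```python
import hashlib
import sys


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large executables are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_binary(path_a, path_b):
    """Return True if the two files have identical contents."""
    return file_digest(path_a) == file_digest(path_b)


if __name__ == "__main__":
    # Print one "digest  path" line per argument; run this on each node
    # and compare the digests by eye or with diff.
    for p in sys.argv[1:]:
        print(file_digest(p), p)
```

Matching digests rule out a forgotten or partial copy. They say nothing
about a stale client-side cache, since reading the file for the checksum
may itself refresh that cache.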

Either way, this isn't a LAM problem, but just some of the pain of
working on clusters...

Brian

> On Sunday, Sep 28, 2003, at 21:53 US/Pacific, Andras Balogh wrote:
>
>>
>> I had the following strange problem.
>> I don't know if it is due to redhat or lam or ssh.
>> Looking through the archive I have the feeling that some other people
>> had the same problem before me and maybe they did not realize what
>> happened.
>>
>> I compile my code on a dual-processor redhat system
>> and upload it to a redhat cluster in order to run it.
>>
>> I got the error message
>> ``...mpirun did not invoke MPI_INIT before quitting...''
>> due to a programming error.
>>
>> This is no big news, but the message stayed even after recompiling and
>> uploading a previously working version.
>>
>> Only renaming the executable solved the problem.
>>
>> It looks like the OS (or lam) remembers the name of the
>> incorrect executable and does not want to accept it anymore as
>> correct.
>> This is freaky.
>> I renamed the file back and forth with the same result.
>>
>> --
>> Andras Balogh
>> ---------------------------------------------------------------------
>> Department of Mathematics | phone: (956) 381-2119
>> University of Texas - Pan American | phone: (956) 381-3452
>> Edinburg, TX 78541-2999 | fax: (956) 384-5091
>> http://www.math.panam.edu/abalogh | abalogh_at_[hidden]
>> ---------------------------------------------------------------------
>>
>>
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>>