I have had similar problems, and one thought comes to mind...
You are running on a cluster, so you are getting files to the proper
nodes either via NFS (or some other network file system) or by
manually copying the code to all nodes. If you are manually copying,
make sure you copy after every compile so that the executables are
the same on every node. If you are using NFS, I believe there is a
refresh interval at which files are synchronized. So if there is an
error, you fix it, and you try mpirun again all within that interval,
the remote executables aren't actually the current version... yet. By
changing the executable name you guarantee that if the code runs, it
is the most current code.
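
If renaming by hand gets tedious, the renaming trick could be
automated along these lines. This is only a rough sketch in Python,
not anything LAM provides: file_digest and stamped_copy are names I
made up for illustration, and the idea is simply to embed a hash of
the binary in its file name so that every rebuild gets a new name.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large executables are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def stamped_copy(executable):
    """Copy `executable` to a sibling file whose name embeds its digest.

    Hypothetical helper: ./solver becomes ./solver.<first 8 hex digits>.
    """
    src = Path(executable)
    dst = src.with_name(f"{src.name}.{file_digest(src)[:8]}")
    shutil.copy2(src, dst)  # copy2 also carries over the permission bits
    return dst
```

You would then hand the stamped name to mpirun instead of the plain
one. Running file_digest on each node and comparing the results is
also a quick way to confirm that every node sees the same binary.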
Just a thought. Whenever I have received those errors, it has usually
turned out to be 1) different executables than I thought (I forgot to
copy manually), 2) my own fault, a special case where the code exits
early, or 3) something crashed on that node (like a segfault) and
wiped out the executable.
Hope that helps.
If it is indeed something else, with LAM perhaps, ... uh... that could
be a big problem.
-J
On Sunday, Sep 28, 2003, at 21:53 US/Pacific, Andras Balogh wrote:
>
> I had the following strange problem.
> I don't know if it is due to redhat or lam or ssh.
> Looking through the archive I have the feeling that some other people
> had the same problem before me and maybe they did not realize what
> happened.
>
> I compile my code on a dual-processor redhat system
> and upload it to a redhat cluster in order to run it.
>
> I got the error message
> ``...mpirun did not invoke MPI_INIT before quitting...''
> due to a programming error.
>
> This is no big news, but the message stayed even after recompiling and
> uploading a previously working version.
>
> Only renaming the executable solved the problem.
>
> It looks like the OS (or lam) remembers the name of the
> incorrect executable and does not want to accept it anymore as correct.
> This is freaky.
> I renamed the file back and forth with the same result.
>
> --
> Andras Balogh
> ---------------------------------------------------------------------
> Department of Mathematics | phone: (956) 381-2119
> University of Texas - Pan American | phone: (956) 381-3452
> Edinburg, TX 78541-2999 | fax: (956) 384-5091
> http://www.math.panam.edu/abalogh | abalogh_at_[hidden]
> ---------------------------------------------------------------------
>
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>