On Mon, 2 Feb 2004, Shi Jin wrote:
> Hi, there
>
> We upgraded our ia32 Linux cluster from LAM/MPI
> version 6.5.9 to 7.0.4 a while ago, and two problems
> have been troubling us since. We have 16 nodes with
> 2 CPUs (4 logical CPUs with hyperthreading) per node.
>
> 1. My code ran well with the old version, but with
> the latest version I sometimes get NaN exceptions,
> depending on the compilation: adding a print
> statement can make it work, and changing an
> optimization switch can make it crash. At first I
> thought the problem was in my own code, but then I
> found that the crashing code runs fine with only
> one process per node. So I began to suspect the new
> SMP feature introduced in version 7.0.x. I dug into
> the manual and found a way to test it:
> I run my code with
> mpirun -ssi coll lam_basic C <mypro>
> Then everything is fine, no matter how many
> processes run on each node.
> I think this disables the newly introduced SMP
> collective communication. Note that we use a lot of
> collective communication, such as broadcast and
> gather, in our code.
> Does this mean there is a problem in LAM?
I would suggest that you run the program through the valgrind
memory debugger and see if it catches any memory errors. If
you don't find anything there, then we will look into it more.
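For example, something along these lines should start each
process under its own copy of valgrind (this assumes valgrind
is installed on every node, "./myprog" stands in for your
executable, and you may need to check the mpirun(1) man page
for exactly how arguments are passed through to valgrind):

  mpirun C valgrind ./myprog

Any memory errors then show up in the valgrind output of the
corresponding rank.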
>
> 2. This one sounds even stranger to me. My colleague
> has a code that uses MPI + FFTW and makes heavy use
> of all-to-all calls. After the upgrade he found that
> his results differ from one run to the next, which
> should never happen.
Here too, I would suggest running the program with lam_basic
and seeing whether you get a correct solution.
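If it helps to isolate the problem, below is a minimal sketch
(my own test, not your colleague's code) that calls
MPI_Alltoall repeatedly with fixed input and checks that every
repetition returns exactly the same data. If it reports
differences with the default collectives but is clean with
"-ssi coll lam_basic", that points at the SMP collective
module rather than at the application.

/* alltoall_check.c: repeat MPI_Alltoall with deterministic
 * input and verify every repetition matches the first one. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define N    1024   /* doubles sent to each rank */
#define REPS 100    /* number of repetitions     */

int main(int argc, char **argv)
{
    int rank, size, i, r;
    double *sendbuf, *recvbuf, *reference;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf   = malloc(N * size * sizeof(double));
    recvbuf   = malloc(N * size * sizeof(double));
    reference = malloc(N * size * sizeof(double));

    /* deterministic input: depends only on rank and index */
    for (i = 0; i < N * size; i++)
        sendbuf[i] = rank * 1000.0 + i;

    for (r = 0; r < REPS; r++) {
        MPI_Alltoall(sendbuf, N, MPI_DOUBLE,
                     recvbuf, N, MPI_DOUBLE, MPI_COMM_WORLD);
        if (r == 0)
            memcpy(reference, recvbuf, N * size * sizeof(double));
        else if (memcmp(reference, recvbuf,
                        N * size * sizeof(double)) != 0)
            printf("rank %d: repetition %d differs\n", rank, r);
    }

    free(sendbuf);
    free(recvbuf);
    free(reference);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc and run it both with and without
"-ssi coll lam_basic" on the same set of nodes.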
> I don't know whether these two problems are connected.
> I tried my previous workaround (lam_basic) on his code,
> and it still does not work.
>
> Could somebody help us out?
> Thank you very much.
>
> Shi Jin
>