
LAM/MPI General User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-01-13 13:17:12


It's hard to say without more detail about your application; this could
simply be its communication pattern, which may cause processes to block
while they wait for message passing to complete.

Which RPI were you using in 6.5.9? I ask because LAM could only have
one RPI compiled into it back in the 6.x series; only in the 7.x series
did we debut the ability to choose your RPI at run-time.

I'm guessing that you are defaulting to the usysv RPI in 7.0.6. Since
it uses shared memory for messages on the same node, that *may* account
for speed differences between your 6.x and 7.x runs (e.g., if you were
using the tcp RPI in the 6.x series) and therefore expose timing
problems in your code.
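
One way to test this guess would be to rerun the same job under each RPI
and compare the behavior. The sketch below is illustrative only: it
builds mpirun command lines using the LAM 7.x run-time selection switch
`-ssi rpi <name>`. The helper name `lam_mpirun_command` is hypothetical,
the remaining arguments are copied from the command in the quoted
message, and it assumes both the tcp and usysv modules were compiled
into the installation.

```python
import shlex


def lam_mpirun_command(rpi, where, program, extra_args=()):
    """Build an mpirun argument list that requests one RPI at run time.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return ["mpirun", "-ssi", "rpi", rpi, *extra_args, where, program]


# Arguments taken from the original command in the quoted message.
WHERE = "c0,1,2,3,4,5,6,7,8,9,10,11,12"
EXTRA = ("-O", "-x", "CYANALIB")

for rpi in ("tcp", "usysv"):
    cmd = lam_mpirun_command(rpi, WHERE, "My_Program", EXTRA)
    # Print a copy-and-paste-ready command line for each RPI.
    print(shlex.join(cmd))
```

If the job completes under tcp but hangs under usysv, that would point
toward a timing or ordering assumption in the application rather than a
problem with the installation.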

The usysv RPI uses spin locks for on-node communication, so it should
spin (and consume all the CPU) when it's waiting for on-node
communication. But if you're blocking waiting for off-node
communication, you won't see this spinning behavior.
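
The difference between spinning and blocking can be seen in a small
stand-alone sketch. This is not LAM's implementation: it uses Python
threads and a timer as a stand-in for a peer's message arriving, and all
names in it are made up for illustration. It measures how much CPU time
a waiter burns when it busy-polls a flag versus when it sleeps until
woken.

```python
import threading
import time


def cpu_used_while_waiting(wait, delay=0.2):
    """Return the CPU seconds this process uses while `wait` runs.

    A timer sets the flag after `delay` seconds, standing in for the
    moment a message from another process arrives.
    """
    flag = threading.Event()
    timer = threading.Timer(delay, flag.set)
    start = time.process_time()
    timer.start()
    wait(flag)
    used = time.process_time() - start
    timer.join()
    return used


def spin_wait(flag):
    # Busy-poll: the waiter never sleeps, so it keeps a CPU busy.
    while not flag.is_set():
        pass


def blocking_wait(flag):
    # Sleep until woken: the waiter uses almost no CPU in the meantime.
    flag.wait()


spin_cpu = cpu_used_while_waiting(spin_wait)
block_cpu = cpu_used_while_waiting(blocking_wait)
print(f"spin: {spin_cpu:.3f}s CPU, blocking: {block_cpu:.3f}s CPU")
```

By analogy, a process stuck waiting on an on-node peer under a spinning
transport would be expected to show up as busy in a CPU monitor, while
one blocked on off-node traffic would look idle.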

Can you attach a debugger to any of the processes and see what they are
doing?

On Jan 13, 2005, at 11:36 AM, Yu Chen wrote:

> Hello,
>
> After installing OSCAR 4 on an RH-EL-AS-3 cluster, one of my major
> MPI programs is not running correctly. Here are the details; thanks
> in advance for any help:
>
> In short, the program just sits there, waiting and waiting but doing
> nothing, even though it normally produces a lot of output.
>
> In detail, we have a 28-node cluster, including the master node; each
> node has 2 CPUs.
>
> Originally, I was running LAM-6.5.9 on Redhat 7.2, using PGI FORTRAN
> compiler and GNU C compiler. The command used to run is:
> "mpirun -O -x CYANALIB c0,1,2,3,4,5,6,7,8,9,10,11,12 My_Program"
> It ran fine; when I ran "gstat -a -1", I would see 6 nodes running at
> about 100% CPU time, since each had two copies running.
>
> Now I am using OSCAR 4 (LAM-7.0.6) on RH-EL-AS-3 with all GNU
> compilers (C and FORTRAN), and I recompiled my program. With the same
> command, it starts, then just sits there doing nothing. The output of
> "gstat -a -1" shows only 6 nodes running at about 50% CPU time, which
> suggests that only one copy is running on each node. "mpitask" shows
> everything running.
>
> Anyone's got any idea?
>
> Regards
> Chen
>
> ===========================================
> Yu Chen
> Howard Hughes Medical Institute
> Chemistry Building, Rm 182
> University of Maryland at Baltimore County
> 1000 Hilltop Circle
> Baltimore, MD 21250
>
> phone: (410)455-6347 (primary)
> (410)455-2718 (secondary)
> fax: (410)455-1174
> email: chen_at_[hidden]
> ===========================================
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>

-- 
{+} Jeff Squyres
{+} jsquyres_at_[hidden]
{+} http://www.lam-mpi.org/