LAM/MPI General User's Mailing List Archives

From: Anthony J. Ciani (aciani1_at_[hidden])
Date: 2004-11-29 14:27:53


On Nov 20, 2004, at 8:49 AM, Ahmad Faraj wrote:

> I have LAM 6.5.9 installed on a 32 node ethernet cluster. I am an MPI
....
> So, after running the application for many iterations, the
> program hangs on large sizes and sometimes even for medium sizes. The
> feeling that I am getting is that somehow, after the network is saturated,
> LAM does not deliver packets and hangs. Does anyone have a clue? Is there a
> way around this? In the application, every X runs, I call lamclean
> to free some resources. That did not help!

I have seen some interesting behavior from programs running on NFS shares
in clusters. For example, one program would intermittently hang just after
startup. The cause: a file containing startup options was changed between
two separate executions. Some images saw the changed file, but for some
reason the other images read an old, cached version. The solution was to
delete the old file before writing the new one, rather than overwriting it
in place.
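
As a rough illustration of the delete-then-replace approach, here is a
minimal sketch in C. The function name replace_options_file and its
arguments are made up for this example, and it assumes the options file is
small enough to write in one call. It is not a guaranteed cure for NFS
caching, just the pattern described above.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: delete the old options file, then write a fresh
   one, instead of overwriting the existing file in place.
   Returns 0 on success, -1 on failure. */
int replace_options_file(const char *path, const char *contents)
{
    /* Remove the old file first; a file that does not exist yet is fine. */
    if (remove(path) != 0 && errno != ENOENT) {
        return -1;
    }

    /* Create the replacement as a brand-new file. */
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        return -1;
    }

    int ok = (fputs(contents, fp) != EOF);
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Calling replace_options_file("run.opts", "size=1024\n") before each launch
(the file name and contents are placeholders) would replace the file
without ever overwriting the old copy in place.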

This is not to say that your problem is with some NFS funkiness, but that
really odd things can happen. If all of your other MPI programs work
without difficulty, then the problem is most likely not with LAM. I would
suggest compiling a more verbose version of your program to see where it
is hanging (if it really is hanging and not looping or racing), and then
work from there.
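
By "more verbose" I mean something along these lines: print a line before
and after each communication step and flush it right away, so the last
line you see tells you where each process stopped. This is only a sketch.
The helper name trace_step is made up, and it assumes you pass in the rank
yourself (in an MPI program you would get it from MPI_Comm_rank).

```c
#include <stdio.h>

/* Hypothetical helper: write one progress line and flush it immediately,
   so the line is not lost in a buffer if the process hangs afterwards.
   Returns the number of characters written, or a negative value on error. */
int trace_step(FILE *out, int rank, const char *step)
{
    int n = fprintf(out, "[rank %d] %s\n", rank, step);
    fflush(out);
    return n;
}
```

Bracketing each send and receive with calls such as
trace_step(stderr, rank, "before send") and
trace_step(stderr, rank, "after send") should show whether a process is
stuck inside one call or is looping around it.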

------------------------------------------------------------
               Anthony Ciani (aciani1_at_[hidden])
            Computational Condensed Matter Physics
    Department of Physics, University of Illinois, Chicago
               http://ciani.phy.uic.edu/~tony
------------------------------------------------------------