On Sat, 29 Dec 2001, Brian W. Barrett wrote:
> It is possible that a kernel bug caused the problems you were seeing.
> Running in lamd mode can be, well, abusive to the OS and Linux isn't known
> for its super-stable networking implementations.
All right; when I can put any of a few hundred different NICs
in a sunblade, and have it work perfectly, I guess I'll join you in
your complaint.
> Let us know if you are still having problems and I can investigate
> further.
I don't see an answer to this yet, so I'll add our suffering
to the list. The following stuff applies equally to lamd and c2c
mode, though lamd mode runs at about 20% of the speed.
I've finally broken down and let us try RedHat 7.2 (eventually,
new hardware leaves no choice, eh?), and we're seeing problems like
a lot of others are. What I've seen so far leads me to believe it
has nothing to do with LAM, and probably nothing to do with the
kernel, either.
We have a 6x-style cluster running 2.4.9-13xxx from SGI, and
it doesn't seem to have this problem. So far, we've been working
with the stock RedHat 7.2 (2.4.7) kernel on the rh72 systems, but
we'll try some other things today.
By the way, I have seen really bizarre memory problems with some
earlier 2.4 series (open a bunch of files, do a lot of seeks, watch
memory spin out of control; close the files, and it clears up; upgrade
the kernel, and it behaves) but I don't see anything like that here.
The kind of sequence that gets us into trouble is a daisy-chain
setup where data go down the line from one worker to the next. This
is of course susceptible to buffers growing out of control along the
way, but I am seeing no evidence of that (and I never have with LAM).
So far, it seems that a job that just sends data down and does nothing
with it works fine, but if it writes the data to disk as it arrives, it
bogs down. I haven't found any hint that a worker is hanging on the data
write, but that could of course be the problem.
I fear glibc. It has bitten before, and it will again. Anyway....
here is a summary of the problem in code.
// This will still be running when the cows are dead on either system:
while (WHATEVER)
{
    /*
     * Get data from above, if we're not the master:
     */
    if (commRank > 0)
        MPI_Recv (data, ntfftp, MPI_FLOAT, commRank-1, commRank-1, commWorld, &mpistatus);
    /*
     * Send it on, if we're not the caboose (the last rank is commSize-1):
     */
    if (commRank < commSize - 1)
        status = MPI_Send (data, ntfftp, MPI_FLOAT, commRank+1, commRank, commWorld);
}
// This will bog down (and possibly hang) on rh7x, but works fine on rh6x:
while (WHATEVER)
{
    /*
     * Get data from above, if we're not the master:
     */
    if (commRank > 0)
        MPI_Recv (data, ntfftp, MPI_FLOAT, commRank-1, commRank-1, commWorld, &mpistatus);
    /*
     * Send it on, if we're not the caboose (the last rank is commSize-1):
     */
    if (commRank < commSize - 1)
        status = MPI_Send (data, ntfftp, MPI_FLOAT, commRank+1, commRank, commWorld);
    /*
     * Do something; write to disk, for example.
     */
    DoSomething (data, ntfftp);
}
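
One way to check whether a worker really is stalling on the data write
would be to time the write itself. What follows is only a rough sketch:
timed_write is a hypothetical helper (it is not what DoSomething does
today), it assumes the data go out through stdio, and it assumes a C
library that provides C11 timespec_get.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: write n floats to fp and return the wall-clock
 * seconds the write took, or -1.0 if the clock or the write failed.
 */
double timed_write (FILE *fp, const float *data, size_t n)
{
    struct timespec t0, t1;
    double elapsed;

    /* timespec_get returns its base argument on success, 0 on failure. */
    if (timespec_get (&t0, TIME_UTC) != TIME_UTC)
        return -1.0;

    /*
     * fflush hands the stdio buffer to the kernel; it does not force
     * the data onto the disk itself.
     */
    if (fwrite (data, sizeof (float), n, fp) != n || fflush (fp) != 0)
        return -1.0;

    if (timespec_get (&t1, TIME_UTC) != TIME_UTC)
        return -1.0;

    elapsed = (double) (t1.tv_sec - t0.tv_sec)
            + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* TIME_UTC is not monotonic; clamp if the clock stepped backwards. */
    return elapsed < 0.0 ? 0.0 : elapsed;
}
```

Calling this where the worker writes its data, and logging commRank
next to any unusually large return value, should show whether the time
goes into the write or into MPI_Recv/MPI_Send.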
---- ``Windows: It does that sometimes.'' -mattdm
Chris J. Bednar
Director, Distributed Computing Product Group
http://AdvancedDataSolutions.com/
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/