On Tue, 8 Feb 2005, Michael Arndt wrote:
> my "commercial application" is just stopping any output and
> the job "hangs" forever.
>
> and here is a strace i did inside a shellwrapper from within
> the mpirun command:
>
> uname({sys="Linux", node="cae1", ...}) = 0
> stat64(0x40ed8c00, 0xffffc104) = 0
> getuid32() = 1000
> getcwd("/scratch/lsf.micha.422", 2048) = 23
> chdir("/scratch/lsf.micha.422/lam-micha_at_cae1-lsf-422-0") = 0
> socketcall(0x1, 0xffffc208) = 3
> socketcall(0x3, 0xffffc208) = 0
> chdir("/scratch/lsf.micha.422") = 0
> socketcall(0xf, 0xffffc278) = 0
> socketcall(0xf, 0xffffc278) = 0
> getppid() = 9238
> rt_sigaction(SIGUSR2, {0x1400000040ecc488, [], SA_NOMASK|0x555eb8}, {SIG_DFL}, 4294950972) = 0
> rt_sigprocmask(SIG_BLOCK, [USR2], [], 4294951176) = 0
> write(3, "\5\0\0\0\377\377\377\377\27$\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 96) = 96
>
> can anyone make out where / why this "hangs"?
Without knowing what your program does (i.e. its internals), it is almost
impossible to tell from strace alone. Everything above "looks" proper. The
raw socketcall() entries are not a problem in themselves: on 32-bit x86
Linux the C library issues socket(), connect(), etc. through that one
system call, and strace is simply not decoding it here. There are
definitely system calls missing from what you provide. The last call,
write(), returned 96, the full length requested, so the process should
have moved on to the next step from there.
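The raw socketcall() lines in the trace can be decoded by hand. Here is a minimal sketch in Python, assuming the sub-call numbers from <linux/net.h> on 32-bit x86 Linux; the names SOCKETCALL_NAMES and decode_socketcall are made up for this illustration and are not part of strace or LAM:

```python
import re

# Sub-call numbers for socketcall(2), as defined in <linux/net.h>.
SOCKETCALL_NAMES = {
    1: "socket", 2: "bind", 3: "connect", 4: "listen", 5: "accept",
    6: "getsockname", 7: "getpeername", 8: "socketpair", 9: "send",
    10: "recv", 11: "sendto", 12: "recvfrom", 13: "shutdown",
    14: "setsockopt", 15: "getsockopt", 16: "sendmsg", 17: "recvmsg",
}

# Matches undecoded strace lines such as: socketcall(0x1, 0xffffc208) = 3
LINE_RE = re.compile(r"socketcall\((0x[0-9a-fA-F]+|\d+),\s*\S+\)\s*=\s*(-?\d+)")


def decode_socketcall(line):
    """Return (sub-call name, return value) for a raw socketcall() line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1), 0)  # base 0 accepts both 0x... and decimal
    name = SOCKETCALL_NAMES.get(number, "unknown(%d)" % number)
    return name, int(match.group(2))


if __name__ == "__main__":
    # The four socketcall() lines from the trace quoted above.
    trace = [
        "socketcall(0x1, 0xffffc208) = 3",
        "socketcall(0x3, 0xffffc208) = 0",
        "socketcall(0xf, 0xffffc278) = 0",
        "socketcall(0xf, 0xffffc278) = 0",
    ]
    for line in trace:
        name, result = decode_socketcall(line)
        print("%-35s -> %s() = %d" % (line, name, result))
```

Read that way, the trace shows socket() returning descriptor 3, a successful connect(), and two getsockopt() calls, so the final write(3, ...) is presumably 96 bytes sent on that socket. What the program waits for afterwards is not visible in the excerpt.
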
My guess would be that strace won't help solve your problem. You should
really try running the program interactively in a debugger. But even
before that, do simple things like running the program on one
processor, or trying a simpler calculation. Since the package is
"commercial", you may want to contact the authors or look at its mailing
lists. Debugging a program's behavior can be tough, but if "lamnodes"
shows the correct nodes in the LAM, then it is a problem with the program,
not LAM.
------------------------------------------------------------
Anthony Ciani (aciani1_at_[hidden])
Computational Condensed Matter Physics
Department of Physics, University of Illinois, Chicago
http://ciani.phy.uic.edu/~tony
------------------------------------------------------------