
LAM/MPI General User's Mailing List Archives


From: YoungHui Amend (yamend_at_[hidden])
Date: 2006-05-25 15:30:51


More info to help debug:
 
 
This is the master log file found in
/tmp/lam-yamend/@end-superman/lam-debug-log.txt
==================================
started (7.1.2), uid 24017, gid 1001
kernel: initialized
Link 0: node: 0, cpus: 1, type: 0, ip: 158.140.147.7, port 41548
Link 1: node: 1, cpus: 1, type: 384, ip: 158.140.147.91, port 35759
kio_req: new client on fd=14
kouter: attached process pid=18210, pri=1095, fd=14
flatd: flqload - successfully created file
/tmp/lam-yamend_at_end-cheetah/lam-flatd0
flatd: flqload - file descriptor 15
flatd: flqload - successfully appended 2059 bytes to
/tmp/lam-yamend_at_end-cheetah/lam-flatd0
kenyad: pqcreating with rtf 0x1b310
kenyad: looking for executable
"/sbox/yamend/r33/amd64_linux24/64/bin/TWTgen" in directory
"/afs/tda.cadence.com/project/tg/12/regression/btv2"
kenyad: found "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgen"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory
/afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: fork/exec succeeded, pid 18211, index 11, rtf 0x1b312
kenyad: create succeeded, process running
died: caught child death; trying to detach
died: detaching table entry 10
kouter: kqdetach detached process pid=18210
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke

==================================
This is the slave log file found in
/tmp/lam-yamend/@end-leopard/lam-debug-log.txt
==================================
started (7.1.2), uid 24017, gid 1001
kernel: initialized
Link 0: node: 0, cpus: 1, type: 0, ip: 158.140.147.7, port 41548
Link 1: node: 1, cpus: 1, type: 384, ip: 158.140.147.91, port 35759
flatd: flqload - successfully created file
/tmp/lam-yamend_at_end-leopard/lam-flatd0
flatd: flqload - file descriptor 16
flatd: flqload - successfully appended 2061 bytes to
/tmp/lam-yamend_at_end-leopard/lam-flatd0
kenyad: pqcreating with rtf 0x40b310
kenyad: checking for directory
/afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: looking for executable
"/sbox/yamend/r33/amd64_linux24/64/bin/TWTgenfm" in directory
"/afs/tda.cadence.com/project/tg/12/regression/btv2"
kenyad: found "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgenfm"
kenyad: creating new user process...
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory
/afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: fork/exec succeeded, pid 11690, index 11, rtf 0x40b312
kenyad: create succeeded, process running
died: caught child death; trying to detach
died: detaching table entry 10

Can you tell me why the slave is dying during MPI_Init?
 
Thank you,
YoungHui
 

________________________________

From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]] On Behalf
Of YoungHui Amend
Sent: Thursday, May 25, 2006 10:41 AM
To: General LAM/MPI mailing list
Subject: Re: LAM: LAM error: Unknown error 471

I've narrowed this problem down to mpirun.c in the otb/mpirun directory.
In this file there's a get_mpi_world function. After it does nrecv(msg),
it performs the following check:
    if (msg.nh_type == 1) {
      char node[32];
      if (fl_very_verbose)
        printf("mpirun: someone died before MPI_INIT -- rank %d\n",
               msg.nh_node);
      snprintf(node, sizeof(node), "%d", msg.nh_node);
      show_help("mpirun", "no-init", node, NULL);
      errno = EMPINOINIT;
      return LAMERROR;
    }

When is nh_type set to 1 when the nsend command is issued?

One of the differences I noticed is that in 6.3, PTY_IS_DEFAULT is 0,
but in 7.1 it's 1. What does the PTY support do?
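Since PTY support became the default between 6.3 and 7.1, one way to rule it out is to disable ptys for a single run. This is only a suggestion to try, reusing the same command and app schema quoted below; the -npty option is the mpirun flag that turns pty support off:

```shell
# Re-run the same job with pty support disabled, to see whether the
# new PTY default in 7.1 is what kills the slave before MPI_Init:
mpirun -npty -t -c2c -O -w -x $LAM_EXPORT myapp
```

If the slave survives with -npty, the regression is in the pty handling rather than in the application itself.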
 
I would appreciate any help you can give me.
Thank you for your prompt attention,
YoungHui Amend

________________________________

        From: lam-bounces_at_[hidden] [mailto:lam-bounces_at_[hidden]]
On Behalf Of YoungHui Amend
        Sent: Wednesday, May 24, 2006 9:51 AM
        To: lam_at_[hidden]
        Subject: LAM: LAM error: Unknown error 471
        
        
        Hi,
         
        I'm in the process of upgrading from version 6.3 to 7.1.
         
        I've got lam daemons running on my master and slave machines.
Then I'm executing mpirun with an application schema and am getting
        MPI_Init: LAM error: Unknown error 471
        
        ------------------------------------------------------------------------
        It seems that [at least] one of the processes that was started with
        mpirun did not invoke MPI_INIT before quitting (it is possible that
        more than one process did not invoke MPI_INIT -- mpirun was only
        notified of the first one, which was on node n0).

        mpirun can *only* be used with MPI programs (i.e., programs that
        invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
        to run non-MPI programs over the lambooted nodes.
        ------------------------------------------------------------------------
        
        ////////////////////////////////////////////////////////////////////////
        My master and slave processes do call MPI_Init. I think the 471
error is coming out of the slave process, which is quitting before my
master process gets a chance to call MPI_Init; that is what generates
the message about not invoking MPI_INIT before quitting.
         
        This part of the code works fine with version 6.3. Are there
some changes between releases that I'm not aware of?
        I've seen some conflicting documentation: one source says
MPI_Init needs to be called by all processes, while another help file
says the master or one of the slave machines needs to call MPI_Init. In
either case, what is the "Unknown error 471", and which LAM/MPI source
file is it coming from?
         
        Here is my command:
        mpirun -t -c2c -O -w -x $LAM_EXPORT myapp
        where LAM_EXPORT=PATH,LD_LIBRARY_PATH,DISPLAY,LAMHOME
        The myapp file contains:
        n0 /afs/tda/sti/r33/prod/linux24_64/tools/tb/bin-64/TWTgen parallelprocess=yes experiment=ya lbist=yes
        n1 /afs/tda/sti/r33/prod/linux24_64/tools/tb/bin-64/TWTgenfm experiment=ya lbist=yes parallelprocess=yes
        
        I'm running TWTgen on the master and TWTgenfm on the slave.
They are the same program with different entry points.
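One way to isolate whether the failure is in the LAM 7.1 setup or in TWTgenfm itself is to run a minimal MPI program through the same app schema on n0 and n1. This is a sketch of such a smoke test using only standard MPI calls; the file name minit.c and the build/run commands in the comments are illustrative, not taken from the thread:

```c
/* minit.c -- minimal MPI_Init smoke test.
 * Build: mpicc minit.c -o minit
 * Run it on both nodes via an app schema analogous to myapp
 * (n0 .../minit on one line, n1 .../minit on the next).
 */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char host[64];
    int rank, size;

    /* The call the slave never seems to reach in the failing run. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));
    printf("rank %d of %d initialized on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```

If both ranks print their line, the lamboot/mpirun setup under 7.1 is sound and the problem is something TWTgenfm does before reaching MPI_Init; if the test itself reproduces "Unknown error 471", the problem is in the environment or the upgrade.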