Be very sure of which "mpirun" is actually being executed. It is possible to
end up with an MPICH mpirun in /usr/bin as part of a Fedora 8 installation
and a LAM installation in some other directory from your "yum install lam".
You can boot the LAM daemons correctly, but if your job is launched by the
wrong mpirun, things will not behave as expected.
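A quick way to check is to ask the shell which mpirun is first on the PATH
and which package owns it (rpm -qf simply reports the owning package; the
output will of course depend on what is installed):

which mpirun
rpm -qf $(which mpirun)

If that binary does not belong to the LAM package, the job is being launched
by a different MPI than the one the program was built against.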
John Bussoletti
-----Original Message-----
From: Krzysztof Bandurski [mailto:kb_at_[hidden]]
Sent: Monday, May 19, 2008 11:30 AM
To: lam_at_[hidden]
Subject: LAM: problem with mpirun - processes don't 'see' one another
Hi All,
I used LAM before, but I upgraded my system and installed Fedora 8 from
scratch. I have a dual-core Athlon 64 on an nForce chipset. I wanted to
install an MPI environment quickly to test my parallel programs on my
machine at home before submitting them to the cluster that I use, so I
just "yummed" LAM onto it. lamboot seems to work fine, but I have
a strange problem with mpirun/mpiexec.
When I run a program using mpirun, for example:
mpirun -np 4 testpopmpi_release <followed by the command line
arguments...>
I do get 4 processes running, but each of them sees only itself in
MPI_COMM_WORLD. When I run it with --display-map, I get something like
this at the beginning of the output:
[kris_at_nothing nnworkshop]$ mpirun --display-map -np 4 testpopmpi_release -packley -d300 -T0f -v1 -Dcgpr -P256 -Mdesa-best2bin
[nothing:05733] Map for job: 1 Generated by mapping mode: byslot
    Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
    Data for app_context: index 0 app: testpopmpi_release
        Num procs: 4
        Argv[0]: testpopmpi_release
        Argv[1]: -packley
        Argv[2]: -d300
        Argv[3]: -T0f
        Argv[4]: -v1
        Argv[5]: -Dcgpr
        Argv[6]: -P256
        Argv[7]: -Mdesa-best2bin
        Env[0]: OMPI_MCA_rmaps_base_display_map=1
        Env[1]: OMPI_MCA_orte_precondition_transports=444a2d3c430e64ba-6534b32b337c12e7
        Env[2]: OMPI_MCA_rds=proxy
        Env[3]: OMPI_MCA_ras=proxy
        Env[4]: OMPI_MCA_rmaps=proxy
        Env[5]: OMPI_MCA_pls=proxy
        Env[6]: OMPI_MCA_rmgr=proxy
        Working dir: /home/kris/nnworkshop (user: 0)
        Num maps: 0
    Num elements in nodes list: 1
    Mapped node:
        Cell: 0 Nodename: nothing Launch id: -1 Username: NULL
        Daemon name:
            Data type: ORTE_PROCESS_NAME Data Value: NULL
        Oversubscribed: True Num elements in procs list: 4
        Mapped proc:
            Proc Name:
            Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
            Proc Rank: 0 Proc PID: 0 App_context index: 0
        Mapped proc:
            Proc Name:
            Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
            Proc Rank: 1 Proc PID: 0 App_context index: 0
        Mapped proc:
            Proc Name:
            Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
            Proc Rank: 2 Proc PID: 0 App_context index: 0
        Mapped proc:
            Proc Name:
            Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
            Proc Rank: 3 Proc PID: 0 App_context index: 0
and then the output of my program follows. As you can see, lam thinks
that all the processes are in the same communicator (they all have
different ranks), but when I call MPI_Comm_rank and MPI_Comm_size in my
program, I always get rank == 0 and size == 1 in every single process.
Needless to say, the processes can't communicate and I just have 4
independent copies of my program running (and printing exactly the same
output to the terminal...). Does anyone have any idea what might be
going on? This is really driving me nuts; I will appreciate any hints.
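In case it helps, this is the kind of bare-bones check I have in mind
(just a minimal sketch, not my actual program):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank in MPI_COMM_WORLD */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes in MPI_COMM_WORLD */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and started with "mpirun -np 4", it should print ranks
0 through 3 out of 4; what I get with my application is effectively four
independent copies of "rank 0 of 1".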
best regards,
kris.
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/