1. The config.log file from the top-level LAM directory, if available (please compress!).

LAM 7.1.3 was created with an rpm file by our system administrator; the config.log file is not available.

2. The output of the laminfo command.

    (bash) robocomp.pts/5% laminfo
              LAM/MPI: 7.1.3
               Prefix: /usr/local
         Architecture: i686-pc-linux-gnu
        Configured by: root
        Configured on: Wed Mar 7 16:05:18 MST 2007
       Configure host: robocomp.cr.usgs.gov
       Memory manager: ptmalloc2
           C bindings: yes
         C++ bindings: yes
     Fortran bindings: yes
           C compiler: gcc
         C++ compiler: g++
     Fortran compiler: gfortran
      Fortran symbols: underscore
          C profiling: yes
        C++ profiling: yes
    Fortran profiling: yes
       C++ exceptions: no
       Thread support: yes
        ROMIO support: yes
         IMPI support: no
        Debug support: no
         Purify clean: no
             SSI boot: globus (API v1.1, Module v0.6)
             SSI boot: rsh (API v1.1, Module v1.1)
             SSI boot: slurm (API v1.1, Module v1.0)
             SSI coll: lam_basic (API v1.1, Module v7.1)
             SSI coll: shmem (API v1.1, Module v1.0)
             SSI coll: smp (API v1.1, Module v1.2)
              SSI rpi: crtcp (API v1.1, Module v1.1)
              SSI rpi: lamd (API v1.0, Module v7.1)
              SSI rpi: sysv (API v1.0, Module v7.1)
              SSI rpi: tcp (API v1.0, Module v7.1)
              SSI rpi: usysv (API v1.0, Module v7.1)
               SSI cr: self (API v1.0, Module v1.0)

3. A detailed description of what is failing.

A detailed description of the problem is given in the body of the email message; here I will present some run results so that a user may appreciate what is going on.

The program itself compares the right-hand-side (RHS) vector of two programs that do vector/matrix multiplies. The matrices themselves and the input vectors X (A*X=RHS) are generated in module driver_matmul.f90; a serial vector/matrix multiply program generates a RHS there, which is stored in the driver.
The matrix A and the vector X are passed to subroutine matmul_block (module MS_matmul), which performs a parallel vector/matrix multiply; the resulting RHS is returned to module driver_matmul and compared with the serial result, and the largest absolute deviation between the two is reported. Generally, a result on the order of 10^-12 indicates that the two RHS vectors are comparable.

Here is the result of a run under OpenMPI:

    (bash) stoch.pts/5% mpirun -np 5 --hostfile hosts_stoch wrapper_matmul_ompi_lf95.ex
    Problem size currently 8x8x8; do you wish to change the problem size? (y/n)
    y
    Input new values of nx, ny and nz.
    3 6 6
    Input max_part, max_cyc
    4 2
    ABSOLUTE MAXIMUM RESIDUAL; 6.82121E-13
    ABSOLUTE MAXIMUM RESIDUAL; 4.54747E-13
    process 0 at barrier in wrapper
    process 1 at barrier in wrapper
    process 2 at barrier in wrapper
    process 3 at barrier in wrapper
    process 4 at barrier in wrapper
    wrapper elapsed time= 36.42136216163635

With "y" I indicated that I wanted to change the size of the problem; the new problem size is "3 6 6". "max_part" is the maximum number of partitions (blocks) for the matrix problem (4), and "max_cyc" is the maximum number of random matrices to generate and test (2).

For a LAM 7.1.3 run, the following result is obtained:

    (bash) robocomp.pts/5% mpirun -np 5 wrapper_matmul.ex
    Problem size currently 8x8x8; do you wish to change the problem size? (y/n)
    y
    Input new values of nx, ny and nz.
    3 6 6
    Input max_part, max_cyc
    4 2
    ABSOLUTE MAXIMUM RESIDUAL; 9.8485
    ABSOLUTE MAXIMUM RESIDUAL; 6.9640
    process 0 at barrier in wrapper
    process 2 at barrier in wrapper
    process 3 at barrier in wrapper
    process 1 at barrier in wrapper
    process 4 at barrier in wrapper
    wrapper elapsed time= 29.51910710334778

The absolute maximum residuals for the two randomly generated matrices are far from zero; this results from incorrect vectors being exchanged between processes, as explained in the body of the email message.
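
To make the meaning of the reported number concrete, here is a minimal serial sketch in Python of the comparison described above. It is not the Fortran code from driver_matmul.f90 or MS_matmul: the row-block partitioning, the small deterministic test matrix, the size n = 12, and every function name are assumptions made for illustration only. The last case imitates, very crudely, a block that works from a wrong input vector.

```python
def matvec(a, x):
    """Serial matrix/vector product: returns RHS = A*X."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def matvec_blocks(a, x, n_part, bad_block=None):
    """Row-block product; each block stands in for one partition.

    If bad_block is set, that block is handed an all-zero vector instead
    of X (a hypothetical stand-in for a wrongly exchanged vector).
    """
    n = len(a)
    rhs = []
    for p in range(n_part):
        lo, hi = p * n // n_part, (p + 1) * n // n_part
        x_seen = [0.0] * len(x) if p == bad_block else x
        rhs.extend(matvec(a[lo:hi], x_seen))
    return rhs


def max_abs_residual(rhs_a, rhs_b):
    """Largest absolute deviation between two RHS vectors."""
    return max(abs(u - v) for u, v in zip(rhs_a, rhs_b))


# Hypothetical test data: a small deterministic matrix, not the random
# matrices generated by the real driver.
n = 12
a = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
x = [float(j + 1) for j in range(n)]

rhs_serial = matvec(a, x)
residual_ok = max_abs_residual(rhs_serial, matvec_blocks(a, x, 4))
residual_bad = max_abs_residual(rhs_serial, matvec_blocks(a, x, 4, bad_block=2))

print("residual, correct vectors:", residual_ok)
print("residual, one wrong vector:", residual_bad)
```

When every block sees the right vector, the residual stays at round-off level; a single block working from the wrong vector is enough to push it to order one or larger, which is the pattern seen in the LAM 7.1.3 output.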

Larger problems, with more partitions, can also be run; here is a LAM 7.1.3 run:

    (bash) robocomp.pts/5% mpirun -np 9 wrapper_matmul.ex
    Problem size currently 8x8x8; do you wish to change the problem size? (y/n)
    n
    Input max_part, max_cyc
    8 2
    ABSOLUTE MAXIMUM RESIDUAL; 1189.3
    ABSOLUTE MAXIMUM RESIDUAL; 793.38
    process 8 at barrier in wrapper
    process 0 at barrier in wrapper
    process 2 at barrier in wrapper
    process 3 at barrier in wrapper
    process 1 at barrier in wrapper
    process 4 at barrier in wrapper
    process 5 at barrier in wrapper
    process 6 at barrier in wrapper
    process 7 at barrier in wrapper
    wrapper elapsed time= 13.13265895843506

and here is the same run with OpenMPI:

    (bash) stoch.pts/5% mpirun -np 9 --hostfile hosts_stoch wrapper_matmul_ompi_lf95.ex
    Problem size currently 8x8x8; do you wish to change the problem size? (y/n)
    n
    Input max_part, max_cyc
    8 2
    ABSOLUTE MAXIMUM RESIDUAL; 1.13687E-12
    ABSOLUTE MAXIMUM RESIDUAL; 9.09495E-13
    process 0 at barrier in wrapper
    process 2 at barrier in wrapper
    process 4 at barrier in wrapper
    process 6 at barrier in wrapper
    process 1 at barrier in wrapper
    process 5 at barrier in wrapper
    process 8 at barrier in wrapper
    process 3 at barrier in wrapper
    process 7 at barrier in wrapper
    wrapper elapsed time= 11.90904712677002

Generally speaking, the number of processes requested in the mpirun command should be one more than the number of partitions (blocks) requested, as the parallel vector/matrix multiply requires one master process in addition to a slave process corresponding to every partition.
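
The process-count rule can be written down as a short Python sketch. The function names are hypothetical, and placing the master on rank 0 is an assumption; the text above only says that one master is needed in addition to one process per partition.

```python
def required_processes(max_part):
    """Processes to request with mpirun -np: one master plus one per partition."""
    return max_part + 1


def assign_roles(n_proc):
    """Map each rank to a role (assumption: rank 0 acts as the master)."""
    roles = {0: "master"}
    for rank in range(1, n_proc):
        roles[rank] = "partition %d" % rank
    return roles


# The two configurations used in the runs above: max_part = 4 and max_part = 8.
for max_part in (4, 8):
    n_proc = required_processes(max_part)
    print("max_part =", max_part, "-> mpirun -np", n_proc)

roles = assign_roles(required_processes(4))
```

This matches the commands shown: -np 5 for max_part 4 and -np 9 for max_part 8.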