Also make sure that the lockd daemon is running on the machines.
Rajeev
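
For anyone who wants to script the lockd check, here is a minimal sketch. It assumes lockd registers with the portmapper under the service name "nlockmgr", which is the usual name in `rpcinfo -p` output. The helper names and the nlockmgr sample row (including its port) are made up for illustration.

```python
import subprocess


def registered_versions(rpcinfo_text, service):
    """Return the set of versions listed for `service` in `rpcinfo -p` output."""
    versions = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Data rows look like: program vers proto port service
        if len(fields) >= 5 and fields[0].isdigit() and fields[4] == service:
            versions.add(int(fields[1]))
    return versions


def query_rpcinfo(host):
    """Run `rpcinfo -p host` and return its text (requires rpcinfo on PATH)."""
    result = subprocess.run(["rpcinfo", "-p", host],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample: the nfs rows mirror the README excerpt quoted in this
# thread; the nlockmgr row and its port number are invented for illustration.
SAMPLE = """\
   program vers proto   port  service
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32768  nlockmgr
"""

lockd_running = bool(registered_versions(SAMPLE, "nlockmgr"))
nfs_v3_available = 3 in registered_versions(SAMPLE, "nfs")
```

On a real cluster you would pass `query_rpcinfo(hostname)` instead of `SAMPLE` and run the check against every node that touches the shared files.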
> Message: 2
> Date: Wed, 18 Jan 2006 22:05:32 -0500
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: LAM: File locking failed in ADIOI_Set_lock
> To: General LAM/MPI mailing list <lam_at_[hidden]>
> Message-ID: <7BCE105C-B81E-4EA9-9131-122EBC005296_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> LAM uses the ROMIO package to implement its MPI IO functionality.
> There is a note in the romio/README file about the use of NFS:
>
> Using ROMIO on NFS
> ------------------
>
> To use ROMIO on NFS, file locking with fcntl must work correctly on
> the NFS installation. On some installations, fcntl locks don't work.
> To get them to work, you need to use Version 3 of NFS and
> have the system administrator mount the NFS file system with the
> "noac" option (no attribute caching). Turning off attribute caching
> may reduce performance, but it is necessary for correct behavior.
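
A quick way to see whether fcntl locks work at all in a given directory is to probe it directly. The sketch below is an illustration, not part of ROMIO. The function name and scratch-file name are hypothetical, and it relies on Python's `fcntl.lockf`, which is implemented on top of fcntl() locking. A successful probe on one node only rules out the outright failure; it does not prove that locking and caching behave correctly across nodes.

```python
import fcntl
import os


def fcntl_lock_works(directory):
    """Try to take and release an exclusive fcntl lock on a scratch file."""
    # Hypothetical scratch-file name; the PID keeps concurrent probes apart.
    path = os.path.join(directory, ".fcntl_lock_probe.%d" % os.getpid())
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Non-blocking exclusive lock over the whole file, then unlock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # For example, the lock request is refused when no lock daemon answers.
        return False
    finally:
        os.close(fd)
        os.unlink(path)
```

Call it with the NFS-mounted directory your MPI job writes to, on each node in turn.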
>
> The following are some instructions we received from Ian Wells of HP
> for setting the noac option on NFS. We have not tried them
> ourselves. We are including them here because you may find
> them useful. Note that some of the steps may be specific to HP
> systems, and you may need root permission to execute some of the
> commands.
>
> >1. first confirm you are running nfs version 3
> >
> >rpcinfo -p `hostname` | grep nfs
> >
> >ie
> > goedel >rpcinfo -p goedel | grep nfs
> > 100003 2 udp 2049 nfs
> > 100003 3 udp 2049 nfs
> >
> >
> >2. then edit /etc/fstab for each nfs directory read/written by MPIO
> > on each machine used for multihost MPIO.
> >
> > Here is an example of a correct fstab entry for /epm1:
> >
> > ie grep epm1 /etc/fstab
> >
> > ROOOOT 11>grep epm1 /etc/fstab
> > gershwin:/epm1 /rmt/gershwin/epm1 nfs bg,intr,noac 0 0
> >
> > if the noac option is not present, add it
> > and then remount this directory
> > on each of the machines that will be used to share MPIO files
> >
> >ie
> >
> >ROOOOT >umount /rmt/gershwin/epm1
> >ROOOOT >mount /rmt/gershwin/epm1
> >
> >3. Confirm that the directory is mounted noac:
> >
> >ROOOOT >grep gershwin /etc/mnttab
> >gershwin:/epm1 /rmt/gershwin/epm1 nfs noac,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 0 0 899911504
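
Step 3 above can also be checked from a script. This is a minimal sketch that assumes each mount-table entry is one whitespace-separated line of the form "device mount_point fstype options ...", as in the mnttab entry above. The function names are hypothetical.

```python
def mount_options(mount_table_text, mount_point):
    """Return the option list for `mount_point`, or None if it is not listed."""
    for line in mount_table_text.splitlines():
        fields = line.split()
        # Expected layout: device mount_point fstype options ...
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None


def is_mounted_noac(mount_table_text, mount_point):
    """True only if the mount point is listed and carries the noac option."""
    options = mount_options(mount_table_text, mount_point)
    return options is not None and "noac" in options


# Sample entry taken from the instructions above, written as a single line.
SAMPLE = ("gershwin:/epm1 /rmt/gershwin/epm1 nfs "
          "noac,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 0 0 899911504\n")

noac_ok = is_mounted_noac(SAMPLE, "/rmt/gershwin/epm1")
```

On Linux, the contents of /proc/mounts should fit the same layout, though that is an assumption to verify on your own system.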
>
>
> On Jan 10, 2006, at 12:59 PM, zkis_at_[hidden] wrote:
>
> > Hi,
> >
> > I have been fighting with a problem for a while and couldn't find a
> > solution so far. The problem is that my program regularly exits with
> > the error message pasted at the end of this message. My system is
> > rather new; it consists of AMD Athlon K7 and Intel Xeon processors
> > with 100Mbit Ethernet connections, and runs Linux kernel 2.6.10-14
> > (Debian distribution). I have lam-7.1.1 installed from a Debian
> > package. Besides the LAM MPI libraries I also use parallel HDF5 in my
> > program, installed from the libhdf5-lam-1.6.2-0 package plus the
> > necessary header files. The strange thing is that sometimes my
> > program ends correctly, but most of the time it exits with an error.
> > I have tested the connection between the machines; there is no
> > problem. There is no error message in the log files either! No other
> > application complains, only my MPI programs. The program seems
> > correct; under mpich no such error occurred.
> >
> > I would very much appreciate any suggestions.
> >
> > Best wishes,
> >
> > Zsolt Kis
> >
> > PS: Sorry for double posting!!
> >
> >
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > %%%%%%%%
> >
> >
> > zsolt_at_sas:/bird/pool/zsolt$ mpirun -np 15 ppu alma
> > File locking failed in ADIOI_Set_lock. If the file system is NFS, you need
> > to use NFS version 3 and mount the directory with the 'noac' option (no
> > attribute caching).
> > File locking failed in ADIOI_Set_lock. If the file system is NFS, you need
> > to use NFS version 3 and mount the directory with the 'noac' option (no
> > attribute caching).
> >
> > -----------------------------------------------------------------------------
> > One of the processes started by mpirun has exited with a nonzero exit
> > code. This typically indicates that the process finished in error.
> > If your process did not finish in error, be sure to include a "return
> > 0" or "exit(0)" in your C code before exiting the application.
> >
> > PID 23175 failed on node n3 (192.168.1.33) with exit status 1.
> > -----------------------------------------------------------------------------
> >
> > _______________________________________________
> > This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/