On Mar 20, 2006, at 11:27 AM, Scott Campbell wrote:
> I am seeing that permissions on files created by MPI programs change
> depending on which boot module I have selected.
<snip>
> My umask is set to 0022.
<snip>
> PBS script:
>
> #!/bin/bash
> echo `date`
>
> lamboot -v -ssi boot rsh $PBS_NODEFILE
> mpirun -np 2 /tmp/a.out
>
>
> Resulting files:
>
> -rw-r--r-- 1 user1 g1 0 2006-03-20 12:16 node_1_of_2
> -rw-r--r-- 1 user1 g1 0 2006-03-20 12:16 node_0_of_2
>
> If I change the script to this:
>
> #!/bin/bash
> echo `date`
>
> lamboot -v -ssi boot tm $PBS_NODEFILE
> mpirun -np 2 /tmp/a.out
>
>
> Resulting files:
>
> -rw------- 1 user1 g1 0 2006-03-20 12:15 node_1_of_2
> -rw------- 1 user1 g1 0 2006-03-20 12:15 node_0_of_2
>
> Is this by design? I need the files created when using the tm boot
> module to have the -rw-r--r-- permissions. Can this be configured?
> If not, any pointers on where in the source code I need to tweak?
This is not completely by design, but is not LAM's doing ;). LAM
depends on the process starting the lamd to set its umask as
appropriate. When you use the rsh/ssh starter, the remote shell sets the
umask according to its default rules, or whatever your
initialization files set it to. The lamd inherits this umask, and
that is the umask set when the lamd launches your application. When
the tm starter is used, the lamd inherits the umask from the pbs mom,
and that is what is used as the umask when the lamd launches your
application.
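
To see why the umask alone accounts for the two listings above, here is
a minimal sketch in bash. It is illustrative only: the function name
perms_under_umask is made up, and nothing in it touches LAM or PBS. It
creates an empty file under a given umask and prints the resulting
permission string.

```shell
#!/bin/bash
# Create an empty file under the given umask and print its permission
# string. perms_under_umask is a hypothetical helper for illustration.
perms_under_umask() {
    local mask=$1 dir
    dir=$(mktemp -d) || return 1
    (
        # The subshell keeps the umask change from leaking into the caller.
        umask "$mask"
        # touch is a child process, so it inherits this umask and creates
        # the file with mode 0666 & ~umask.
        touch "$dir/node_0_of_2"
    )
    ls -l "$dir/node_0_of_2" | cut -c1-10
    rm -rf "$dir"
}

perms_under_umask 022   # prints -rw-r--r--
perms_under_umask 077   # prints -rw-------
```

The first call matches the files seen with the rsh boot module, the
second the files seen with the tm boot module.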
As Troy pointed out, there are a bunch of places in the LAM/MPI
source code where we set the umask to 077. These are in files under
<topdir>/otb/sys/, and are all during the early part of lamd
initialization. However, we store the original umask of the process
and after the fork() to start the user's process reset the umask to
that original umask. We feel that, for security reasons, the lamd
should not create files that other users can access. But this
shouldn't be what is affecting you, since user applications will have
the original umask.
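
The save-and-restore pattern described above can be sketched in shell.
This is not the LAM source, which is C and works with umask() around
fork(); the names original_umask and launch_user_command are
hypothetical, and a subshell stands in for the forked child.

```shell
#!/bin/bash
# Hypothetical sketch of a daemon that keeps its own files private but
# hands the inherited umask back to the programs it launches.

original_umask=$(umask)   # remember what the starter handed us
umask 077                 # the daemon's own files stay owner-only

launch_user_command() {
    (
        # Restore the inherited umask in the child only; the daemon
        # itself keeps running with 077.
        umask "$original_umask"
        "$@"
    )
}

# The launched command sees the original umask, not 077.
launch_user_command sh -c umask
```

The point is that the restored value is only as good as the umask the
daemon inherited in the first place.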
One thing I noticed on our small PBS test setup (I believe our setup
is Torque, but based on what you are seeing I'm willing to bet the
behavior is the same in PBS Pro) is that the pbs mom appears to
always start processes with a umask of 077. I wrote a small
application that just prints the return value of umask() as an octal,
and running it through pbsdsh in a pbs job gives:
[20:41] brbarret_at_vogon:pts/2 ~> umask
22
[20:41] brbarret_at_vogon:pts/2 ~> cat $PBS_NODEFILE
vogon.osl.iu.edu
eddie.osl.iu.edu
[20:41] brbarret_at_vogon:pts/2 ~> /opt/pbs/bin/pbsdsh $HOME/my_umask
umask: 77
umask: 77
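
A rough shell stand-in for that small test program is below. It is a
sketch, not the program used above (which prints the return value of
umask() directly); print_umask is a made-up name, and it assumes the
shell started on the node does not change the umask through its own
startup files.

```shell
#!/bin/bash
# Print the current umask as an unpadded octal number.
print_umask() {
    # The umask builtin prints e.g. 0077; the leading 0 makes printf
    # read the value as octal, and %o prints it back without padding.
    printf 'umask: %o\n' "$(umask)"
}

print_umask
```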
Some well-laid printfs in the lamd source code seem to indicate that
the mom does the same thing there. So it looks like our assumption
that the starter would always do something sane with things like the
umask isn't quite right. It works well for rsh/ssh and for SLURM
(processes started on the allocated nodes have the same umask as the
process that called srun), but not for PBS. It's possible that we
could work around this bug and have lamboot propagate the umask for
PBS, but it would be much easier if PBS could just have the moms
start processes with an environment similar to that of the process
calling tm_spawn ;).
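
In the meantime, one possible workaround on the user side is to set the
umask just before the application starts. The wrapper below is a
minimal sketch under stated assumptions: the name with_umask is
hypothetical, and whether mpirun will happily start a wrapper script
that then execs the MPI program is an assumption that has not been
tested here.

```shell
#!/bin/bash
# Hypothetical wrapper: set a umask, then run the given command.
# Usage: with_umask MASK COMMAND [ARGS...]
with_umask() {
    local mask=$1
    shift
    (
        umask "$mask" || exit 1
        # Replace the subshell with the real program so it starts with
        # the requested umask.
        exec "$@"
    )
}

# When saved as a script and called with arguments, run them.
if [ "$#" -ge 2 ]; then
    with_umask "$@"
fi
```

Saved as a script on a path visible to all nodes, it would be invoked
as the program given to mpirun, with 022 and /tmp/a.out as its
arguments. Calling umask(022) at the top of the application itself
should have the same effect on the files it creates.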
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/