PSR: How does it work?

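This page is fairly long, so it is divided into sections: the short version, the longer version, consequences of this design, use of PSR with multiple systems, and what happens when users change their password.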


Short version

The user encrypts their password with the public key for the system that they want to run on. The encrypted password is stored in the user's AFS space. When a job is submitted, the user's password is automatically (and securely) decrypted to obtain the corresponding plain-text password, which is used to obtain an AFS token. The token-obtaining process then goes to sleep, and wakes up one hour before the AFS token expires. It then re-decrypts the password and refreshes the token before returning to sleep. This process repeats until the job completes.


Longer version

The system administrators for the batch system run the mkpsrauthrkeys program to generate public and private keys. The public key is made available for users to encrypt their AFS passwords. The private key is kept secure and installed on the compute nodes in the batch system.

Each user runs the pwstore command (or a custom shell script, described below) and types in their AFS password. pwstore encrypts the password with the public key and saves the resulting file in the designated directory. System administrators typically wrap the pwstore command in a shell script (a sample script is included in the distribution). The shell script can perform site-specific tasks, such as creating a subdirectory under the user's $HOME to store the encrypted password and setting the appropriate AFS permissions on that directory.

When the user submits a PBS job, the mother superior of the user's job forks off a "shepherd" process that decrypts the user's password using the private key, and then obtains a new process authentication group and an AFS token. The PBS prologue is then run, followed by the user's job. The shepherd process wakes itself up one hour before the AFS token expires to renew the token. This continues until the job completes. The epilogue is then run, the shepherd process is killed, and the AFS token is discarded.
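
The renewal cycle described above can be sketched in a few lines of Python. This is only an illustration of the timing logic, not the code that ships with PSR: the names `shepherd`, `seconds_until_refresh`, `obtain_token`, and `job_finished` are hypothetical, and the sketch assumes that `obtain_token` decrypts the stored password, obtains an AFS token, and returns the token's expiry time in seconds since the epoch. The real shepherd is killed when the job ends; the sketch polls a callback instead so that it can terminate on its own.

```python
import time
from typing import Callable

# Wake up one hour before the token expires, as described in the text.
REFRESH_MARGIN_SECONDS = 60 * 60


def seconds_until_refresh(expires_at: float, now: float,
                          margin: float = REFRESH_MARGIN_SECONDS) -> float:
    """Return how long to sleep so that we wake `margin` seconds before expiry."""
    return max(0.0, expires_at - margin - now)


def shepherd(obtain_token: Callable[[], float],
             job_finished: Callable[[], bool],
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.time) -> int:
    """Keep a token fresh until the job finishes.

    `obtain_token` is a hypothetical callback: it stands in for "decrypt the
    password and obtain an AFS token" and returns the token's expiry time.
    Returns the number of tokens obtained over the life of the job.
    """
    tokens = 0
    while True:
        expires_at = obtain_token()   # decrypt password, get (or renew) token
        tokens += 1
        sleep(seconds_until_refresh(expires_at, clock()))
        if job_finished():            # hypothetical stand-in for job completion
            return tokens
```

Passing `sleep` and `clock` as parameters makes the loop easy to exercise with a fake clock. Note that the sketch assumes the token lifetime is longer than one hour; with a shorter lifetime it would renew the token continuously without sleeping.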

Note that PBS is not the only system that PSR can be used with; this scheme is general enough to work in most situations. PBS just happens to be the only one that we provide a patch for.


Consequences of this design

Although this scheme is relatively safe from external attack (nothing is completely safe), it requires users to trust the system administrators, who will be able to retrieve each user's plain-text AFS password. Good system administrators will not abuse this ability, but the potential for abuse exists and is important enough to be mentioned explicitly here. Also, some sites have policies that explicitly prohibit anyone (even system administrators) from knowing or obtaining other users' clear-text passwords.

As such, this scheme may or may not be compatible with your site's local policies. Consult with your system administrators and/or management before installing and using this system.


Use of PSR with multiple systems

PSR can easily be used with multiple systems. This can be done by running pwstore with multiple public keys and storing the encrypted result in different files. For example, a user can run pwstore with public keys A, B, and C. The encrypted passwords are then stored in $HOME/.psr_authr/a.epass, $HOME/.psr_authr/b.epass, and $HOME/.psr_authr/c.epass, respectively.

When running on the A system, the A compute nodes will use A's private key to decrypt $HOME/.psr_authr/a.epass to obtain the plain-text password and then obtain an AFS token. Similarly, the B compute nodes use B's private key to decrypt $HOME/.psr_authr/b.epass, and so on. This is a particularly useful model when the A, B, and C systems are owned and operated by different groups.
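
The per-system file layout in this example can be expressed as a small helper. This is a sketch for illustration only: the function name `epass_path` is hypothetical, and lowercasing the system name is an assumption made to match the a.epass, b.epass, and c.epass file names above. The directory and file names at a real site are whatever the administrators' wrapper script chooses.

```python
import os
from pathlib import Path
from typing import Optional


def epass_path(system: str, home: Optional[str] = None) -> Path:
    """Return the encrypted-password file for `system`, following the
    $HOME/.psr_authr/<system>.epass layout used in the example above.

    Lowercasing the system name is an assumption of this sketch.
    """
    base = Path(home if home is not None else os.path.expanduser("~"))
    return base / ".psr_authr" / f"{system.lower()}.epass"


# One encrypted copy of the password per system, each made with that system's public key.
paths = {name: epass_path(name, home="/home/alice") for name in ("A", "B", "C")}
```

Each system only ever reads its own file, so the compromise of one system's private key does not expose the files encrypted for the others (although the underlying AFS password is the same in every file).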


When users change their password

Note that when users change their AFS passwords, they must re-run the pwstore program to re-encrypt their new password for PSR. If the user has jobs running when the password is changed and pwstore is re-run, the shepherd will use the new password the next time it wakes up to refresh the token.

If a user fails to re-run the pwstore program, their queued jobs will run with only system:anyuser access to the user's AFS space, which will likely cause the jobs to fail (most jobs try to read from or write to private AFS space). If jobs are running when the user changes their password without re-running pwstore, the shepherd will fail to refresh the AFS token the next time it wakes up, which will likely cause the job to fail for the same reason.