
LAM/MPI General User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-28 17:30:58


Wow. Wrong list entirely!

Sorry for the noise...

On Apr 28, 2009, at 4:37 PM, Jeff Squyres (jsquyres) wrote:

> Is anyone going to comment on this? I'm surprised / disappointed that
> it's been over 2 weeks with *no* comments.
>
> Roland can't lead *every* discussion...
>
>
>
> On Apr 13, 2009, at 12:07 PM, Jeff Squyres (jsquyres) wrote:
>
> > The following is a proposal from several MPI implementations to the
> > OpenFabrics community (various MPI implementation representatives
> > CC'ed). The basic concept was introduced in the MPI Panel at Sonoma
> > (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip);
> > it was further refined in discussions after Sonoma.
> >
> > Introduction:
> > =============
> >
> > MPI has long had a problem maintaining its own verbs memory
> > registration cache in userspace. The main issue is that user
> > applications are responsible for allocating/freeing their own data
> > buffers -- the MPI layer does not (usually) have visibility when
> > application buffers are allocated or freed. Hence, MPI has had to
> > intercept deallocation calls in order to know when its registration
> > cache entries have potentially become invalid. Horrible and dangerous
> > tricks are used to intercept the various flavors of free, sbrk,
> > munmap, etc.
> >
> > Here's the classic scenario we're trying to handle better:
> >
> > 1. MPI application allocs buffer A and MPI_SENDs it
> > 2. MPI library registers buffer A and caches it (in user space)
> > 3. MPI application frees buffer A
> > 4. page containing buffer A is returned to the OS
> > 5. MPI application allocs buffer B
> > 5a. B is at the same virtual address as A, but different physical
> >     address
> > 6. MPI application MPI_SENDs buffer B
> > 7. MPI library thinks B is already registered and sends it
> >    --> the physical address may well still be registered, so the send
> >        does not fail -- but it's the wrong data
> >
> > Note that the above scenario occurs because before Linux kernel
> > v2.6.27, the OF kernel drivers are not notified when pages are
> > returned to the OS -- we're leaking registered memory, and therefore
> > the OF driver/hardware have the wrong virtual/physical mapping. It
> > *may* not segv at step 7 because the OF driver/hardware can still
> > access the memory and it is still registered. But it will definitely
> > be accessing the wrong physical memory.
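> >
> > To make step 7 concrete, here is a minimal sketch -- hypothetical
> > structure and helper names, needs <infiniband/verbs.h> -- of the kind
> > of address-keyed cache an MPI keeps in userspace today; the lookup
> > still "hits" for buffer B even though the cached lkey refers to A's
> > old physical pages:
> >
> >    struct reg_cache_entry {
> >        void          *addr;  /* user virtual address that was registered */
> >        size_t         len;
> >        struct ibv_mr *mr;    /* stale after step 4: still maps A's pages */
> >    };
> >
> >    /* Called before ibv_post_send(); returns a cached registration. */
> >    static struct ibv_mr *cache_lookup(struct reg_cache_entry *cache,
> >                                       int n, void *addr, size_t len)
> >    {
> >        for (int i = 0; i < n; ++i)
> >            if (cache[i].addr == addr && cache[i].len >= len)
> >                return cache[i].mr;  /* step 7: false "hit" for B */
> >        return NULL;
> >    }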
> >
> > In discussions before the Sonoma OpenFabrics event this year, several
> > MPI implementations got together and concluded that userspace
> > "notifier" functions might solve this issue for MPI (as proposed by
> > Pete Wyckoff quite a while ago). Specifically, when memory is
> > unregistered down in the kernel, a flag is set in userspace that
> > allows the userspace to know that it needs to make a [potentially
> > expensive] downcall to find out exactly what happened. In this way,
> > MPI can know when to update its registration cache safely.
> >
> > After further post-Sonoma discussion, it became evident that the
> > so-called userspace "notifier" functions may not solve the problem --
> > there seem to be unavoidable race conditions, particularly in
> > multi-threaded applications (more on this below). We concluded that
> > what could be useful is to move the registration cache from
> > userspace/MPI down into the kernel and maintain it on a per-protection
> > domain (PD) basis.
> >
> > Short version:
> > ==============
> >
> > Here's a short version of our proposal:
> >
> > 1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
> >    If this flag is set in the call to ibv_reg_mr(), the following
> >    occurs down in the kernel (a rough sketch of this bookkeeping
> >    appears after item 5):
> >    - look for the memory to be registered in the PD-specific cache
> >    - if found
> >      - increment its refcount
> >    - else
> >      - try to register the memory
> >      - if the registration fails because no more memory is available
> >        - traverse all PD registration caches in this process,
> >          evicting/unregistering each entry with a refcount <= 0
> >        - try to register the memory again
> >      - if the registration succeeds (either the 1st or the 2nd time),
> >        put it in the PD cache with a refcount of 1
> >
> >    If this flag is *not* set in the call to ibv_reg_mr(), then the
> >    following occurs:
> >
> >    - try to register the memory
> >    - if the registration fails because no more registered memory is
> >      available
> >      - traverse all PD registration caches in this process,
> >        evicting/unregistering each entry with a refcount <= 0
> >      - try to register the memory again
> >
> >    If an application never uses IBV_ACCESS_CACHE, registration
> >    performance should be no different. Registration costs may
> >    increase slightly in some cases if there is a non-empty
> >    registration cache.
> >
> > 2. The kernel side of the ibv_dereg_mr() deregistration call now does
> >    the following:
> >    - look for the memory to be deregistered in the PD's cache
> >    - if it's in the cache
> >      - decrement the refcount (leaving the memory registered)
> >    - else
> >      - unregister the memory
> >
> > 3. A new verb, ibv_is_reg(), is created to query if the entire buffer
> >    X is already registered. If it is, increase its refcount in the
> >    reg cache. If it is not, just return an error (and do not register
> >    any of the buffer). (A possible prototype is sketched below.)
> >
> >    --> An alternate proposal for this idea is to add another
> >        ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of
> >        a new verb. But that might be a little odd in that we don't
> >        want the memory registered if it's not already registered.
> >
> >    This verb is useful for pipelined protocols to offset the cost of
> >    registration of long buffers (e.g., if the buffer is already
> >    registered, just send it -- otherwise let the ULP potentially do
> >    something else). See below for a more detailed explanation / use
> >    case.
> >
> > 4. A new verb, ibv_reg_mr_limits(), is created to specify some
> >    configuration information about the registration cache.
> >    Configuration specifics TBD here, but one obvious possibility here
> >    would be to specify the maximum number of pages that can be
> >    registered by this process (which must be <= the value specified
> >    in limits.conf, or it will fail).
> >
> > 5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal
> >    registration cache and actually de-register any item with a
> >    refcount <= 0. The intent is to give applications the ability to
> >    forcibly deregister any still-existing memory that has been
> >    ibv_reg_mr(..., IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed.
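> >
> > For concreteness, the kernel-side bookkeeping behind items 1 and 2
> > could look roughly like the sketch below (structure and helper names
> > are made up for illustration, not a proposed implementation):
> >
> >    struct pd_cache_entry {
> >        u64               start, length;   /* registered region     */
> >        struct ib_mr     *mr;
> >        int               refcount;        /* <= 0 means evictable  */
> >        struct list_head  list;            /* linked off the PD     */
> >    };
> >
> >    /* reg path with IBV_ACCESS_CACHE (item 1) */
> >    static struct ib_mr *cached_reg(struct ib_pd *pd, u64 start, u64 length)
> >    {
> >        struct pd_cache_entry *e = pd_cache_find(pd, start, length);
> >        struct ib_mr *mr;
> >
> >        if (e) {
> >            e->refcount++;
> >            return e->mr;
> >        }
> >        mr = register_region(pd, start, length);
> >        if (!mr) {
> >            evict_unused_entries_in_process();     /* refcount <= 0 */
> >            mr = register_region(pd, start, length);
> >        }
> >        if (mr)
> >            pd_cache_insert(pd, start, length, mr, 1 /* refcount */);
> >        return mr;
> >    }
> >
> >    /* dereg path (item 2) */
> >    static void cached_dereg(struct ib_pd *pd, struct ib_mr *mr)
> >    {
> >        struct pd_cache_entry *e = pd_cache_find_mr(pd, mr);
> >
> >        if (e)
> >            e->refcount--;          /* leave the memory registered */
> >        else
> >            unregister_region(mr);
> >    }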
> >
> > These proposals assume that the new IOMMU notify system in >=2.6.27
> > kernels will be used to catch when memory is returned from a process
> > to the kernel, and will both unregister the memory and remove it from
> > the kernel PD reg caches, if relevant.
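> >
> > To summarize the proposed API surface, possible prototypes for the
> > new verbs (signatures are only a sketch; the exact parameter lists
> > are open questions) might be:
> >
> >    /* item 3: query-only -- bumps the cache refcount on success and
> >       registers nothing on failure */
> >    int ibv_is_reg(struct ibv_pd *pd, void *addr, size_t length);
> >
> >    /* item 4: per-PD limits on the kernel registration cache */
> >    int ibv_reg_mr_limits(struct ibv_pd *pd, uint64_t max_num_pages);
> >
> >    /* item 5: deregister every cached entry with a refcount <= 0 */
> >    int ibv_reg_mr_clean(struct ibv_pd *pd);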
> >
> > More details:
> > =============
> >
> > Starting with Linux kernel v2.6.27, the OF kernel drivers can be
> > notified when pages are returned to the OS (I don't know if they yet
> > take advantage of this feature). However, we can still run into
> > pretty much the same scenario -- the MPI userspace registration cache
> > can become invalid even though the kernel is no longer leaking
> > registered memory. The situation is *slightly* better because the
> > ibv_post_send() may fail because the memory will (in a single-threaded
> > application) likely be unregistered.
> >
> > Pete Wyckoff's solution several years ago was to add two steps into
> > the scenario listed above; my understanding is this is now possible
> > with the IOMMU notifiers in 2.6.27 (new steps 4a and 4b):
> >
> > 1. MPI application allocs buffer A and MPI_SENDs it
> > 2. MPI library registers buffer A and caches it (in user space)
> > 3. MPI application frees buffer A
> > 4. page containing buffer A is returned to the OS
> > 4a. OF kernel driver is notified and can unregister the page
> > 4b. OF kernel driver can twiddle a bit in userspace indicating that
> >     something has changed
> > ...etc.
> >
> > The thought here is that the MPI can register a global variable during
> > MPI_INIT that can be modified during step 4b. Hence, you can add a
> > cheap "if" statement in MPI's send path like this:
> >
> >    if (variable_has_changed_indicating_step_4b_executed) {
> >        ibv_expensive_downcall_to_find_out_what_happened(..., &output);
> >        if (need_to_register(buffer, mpi_reg_cache, output)) {
> >            ibv_reg_mr(buffer, ...);
> >        }
> >    }
> >    ibv_post_send(...);
> >
> > You get the idea -- check the global variable before invoking
> > ibv_post_send() or ibv_post_recv(), and if necessary, register the
> > memory that MPI thought was already registered.
> >
> > But wacky situations might occur in a multithreaded application where
> > one thread calls free() while another thread calls malloc() and gets
> > the same virtual address that was just free()d but has not yet been
> > unregistered in the kernel; a subsequent ibv_post_send() may then
> > succeed but be sending the wrong data.
> >
> > Put simply: in a multi-threaded application, there's always the chance
> > that the notify won't get to the user-level process until after the
> > global notifier variable has been checked, right? Or, putting it the
> > other way: is there any kind of notify system that could be used that
> > *can't* create a potential race condition in a multi-threaded user
> > application?
> >
> > NOTE: There's actually some debate about whether this "bad" scenario
> >       could actually happen -- I admit that I'm not entirely sure.
> >       But if this race condition *can* happen, then I cannot think
> >       of a kernel notifier system that would not have this race
> >       condition.
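> >
> > For concreteness, the kind of interleaving in question (assuming it
> > can occur at all) would look like this:
> >
> >    thread 1: free(A)                  // A's page heads back to the OS
> >    thread 2: malloc() returns B == A  // same virtual address reused
> >    thread 2: checks notifier flag     // kernel has not set it yet
> >    thread 2: ibv_post_send(B)         // uses A's stale registration
> >    kernel:   notification reaches userspace -- too late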
> >
> > So a few of us hashed this around and came up with an alternate
> > proposal:
> >
> > 1. Move the entire registration cache down into the kernel.
> >    Supporting rationale:
> >    1a. If all ULPs (MPIs, in this case) have to implement registration
> >        caches, why not implement it *once*, not N times?
> >    1b. Putting the reg cache in the kernel means that with the IOMMU
> >        notifier system introduced in 2.6.27, the kernel can call back
> >        to the device driver when the mapping changes so that a) the
> >        memory can be deregistered, and b) the corresponding item can
> >        be removed from the registration cache. Specifically: the race
> >        condition described above can be fixed because it's all located
> >        in one place in the kernel.
> >
> > 2. This means that the userspace process must *always* call
> >    ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the
> >    reference counts on the kernel reg cache. But in practice,
> >    on-demand registration/de-registration is only done for long
> >    messages (short messages typically use
> >    copy-to-pre-registered-buffers schemes). So the additional
> >    ibv_reg_mr() before calling ibv_post_send() / ibv_post_recv() for
> >    long messages shouldn't matter (see the sketch after item 3).
> >
> > 3. The registration cache in the kernel can lazily deregister cached
> >    memory, as described in the "short version" discussion, above
> >    (quite similar to what MPIs do today).
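> >
> > As a sketch of what item 2 means for a ULP's long-message send path
> > (the helper around the verbs calls is hypothetical):
> >
> >    /* long-message send: always go through the verbs, and let the
> >       kernel cache make the "already registered" case cheap */
> >    mr = ibv_reg_mr(pd, buffer, len,
> >                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_CACHE);
> >    post_long_send(qp, mr, buffer, len);   /* hypothetical helper */
> >    /* ... wait for the send completion ... */
> >    ibv_dereg_mr(mr);   /* just decrements the kernel refcount */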
> >
> > To offset the cost of large memory registrations (because registration
> > cost is linearly proportional to the size of the buffer being
> > registered), pipelined protocols are sometimes used. As such, it seems
> > useful to have an "is this memory already registered?" verb -- a ULP
> > can check to see if an entire long message is already registered, and
> > if so, do a single large RDMA action. If not, the ULP can use a
> > pipelined protocol to loop over registering a portion of the buffer
> > and then RDMA'ing it.
> >
> > Possible pipelined pseudocode can look like this:
> >
> >    if (ibv_is_reg(pd, buffer, len)) {
> >        ibv_post_send(...);
> >        // will still need to ibv_dereg_mr() after completion
> >    } else {
> >        // pipeline loop
> >        for (i = 0; ...) {
> >            ibv_reg_mr(pd, buffer + i * pipeline_size,
> >                       pipeline_size, IBV_ACCESS_CACHE);
> >            ibv_post_send(...);
> >        }
> >    }
> >
> > The rationale here is that these verbs allow the flexibility of doing
> > something like the above scenario or just registering the whole long
> > buffer and sending it immediately:
> >
> >    ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
> >    ibv_post_send(...);
> >
> > It may also be useful to programmatically enforce some limits on a
> > given PD's registration cache. A per-process limit is already enforced
> > via /etc/security/limits.conf, but it may be useful to specify per-PD
> > limits in the ULP (MPI) itself. Note that most MPIs have controls
> > like this already; it's consistent with moving the registration cache
> > down to the kernel. A proposal for the verb could be:
> >
> >    ibv_reg_mr_cache_limits(pd, max_num_pages)
> >
> > Another userspace-accessible verb that may be useful is one that
> > traverses a PD's reg cache and actually deregisters any item with a
> > refcount <= 0. This allows a ULP to "clean out" any lingering
> > registrations, thereby freeing up registered memory for other uses
> > (e.g., being registered by another PD). This verb can have a
> > simple interface:
> >
> >    ibv_reg_mr_clean(pd)
> >
> > It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr()
> > will evict entries with <= 0 refcounts from any PD's registration
> > cache in this process, that might be enough. However, mixing
> > verbs-registered memory with other (non-verbs) pinned memory in the
> > same process may make this verb necessary.
> >
> > -----
> >
> > Finally, it should be noted that with 2.6.27's IOMMU notify system,
> > full on-demand paging / registering seems possible. On-demand paging
> > would be a full, complete solution -- the ULP wouldn't have to worry
> > about registering / de-registering memory at all (the existing
> > de/registration verbs could become no-ops for backwards
> > compatibility). I assume that a proposal along these lines would be
> > a [much] larger debate in the OpenFabrics community, and further
> > assume that the proposal above would be a smaller debate and actually
> > have a chance of being implemented in the not-distant future.
> >
> > (/me puts on fire suit)
> >
> > Thoughts?
> >
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

-- 
Jeff Squyres
Cisco Systems