/* -*- mode: C; c-basic-offset: 4 -*- */
#ident "Copyright (c) 2007 Tokutek Inc.  All rights reserved."

/* Readers/writers locks implementation
 *
 *****************************************
 *     Overview
 *****************************************
 *
 * TokuDB employs readers/writers locks for the ephemeral locks (e.g.,
 * on BRT nodes) Why not just use the toku_pthread_rwlock API?
 *
 *   1) we need multiprocess rwlocks (not just multithreaded)
 *
 *   2) pthread rwlocks are very slow since they entail a system call
 *   (about 2000ns on a 2GHz T2500.)
 *
 *     Related: We expect the common case to be that the lock is
 *     granted
 *
 *   3) We are willing to employ machine-specific instructions (such
 *   as atomic exchange, and mfence, each of which runs in about
 *   10ns.)
 *
 *   4) We want to guarantee nonstarvation (many rwlock
 *   implementations can starve the writers because another reader
 *   comes * along before all the other readers have unlocked.)
 *
 *****************************************
 *      How it works
 *****************************************
 *
 * We arrange that the rwlock object is in the address space of both
 * threads or processes.  For processes we use mmap().
 *
 * The rwlock struct comprises the following fields
 *
 *    a long mutex field (which is accessed using xchgl() or other
 *    machine-specific instructions.  This is a spin lock.
 *
 *    a read counter (how many readers currently have the lock?)
 *
 *    a write boolean (does a writer have the lock?)
 *
 *    a singly linked list of semaphores for waiting requesters.  This
 *    list is sorted oldest requester first.  Each list element
 *    contains a semaphore (which is provided by the requestor) and a
 *    boolean indicating whether it is a reader or a writer.
 *
 * To lock a read rwlock:
 *
 *    1) Acquire the mutex.
 *
 *    2) If the linked list is not empty or the writer boolean is true
 *    then
 *
 *       a) initialize your semaphore (to 0),
 *       b) add your list element to the end of the list (with  rw="read")
 *       c) release the mutex
 *       d) wait on the semaphore
 *       e) when the semaphore release, return success.
 *
 *    3) Otherwise increment the reader count, release the mutex, and
 *    return success.
 *
 * To lock the write rwlock is almost the same.
 *     1) Acquire the mutex
 *     2) If the list is not empty or the reader count is nonzero
 *        a) initialize semaphore
 *        b) add to end of list (with rw="write")
 *        c) release mutex
 *        d) wait on the semaphore
 *        e) return success when the semaphore releases
 *     3) Otherwise set writer=TRUE, release mutex and return success.
 *
 * To unlock a read rwlock:
 *     1) Acquire mutex
 *     2) Decrement reader count
 *     3) If the count is still positive or the list is empty then
 *        return success
 *     4) Otherwise (count==zero and the list is nonempty):
 *        a) If the first element of the list is a reader:
 *            i) while the first element is a reader:
 *                 x) pop the list
 *                 y) increment the reader count
 *                 z) increment the semaphore (releasing it for some waiter)
 *            ii) return success
 *        b) Else if the first element is a writer
 *            i) pop the list
 *            ii) set writer to TRUE
 *            iii) increment the semaphore
 *            iv) return success
 */