 ======================================
 vlocks for Bare-Metal Mutual Exclusion
 ======================================

 Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
 mechanism, with reasonable but minimal requirements on the memory
 system.

 These are intended to be used to coordinate critical activity among CPUs
 which are otherwise non-coherent, in situations where the hardware
 provides no other mechanism to support this and ordinary spinlocks
 cannot be used.


 vlocks make use of the atomicity provided by the memory system for
 writes to a single memory location.  To arbitrate, every CPU "votes for
 itself", by storing a unique number to a common memory location.  The
 final value seen in that memory location when all the votes have been
 cast identifies the winner.

 In order to make sure that the election produces an unambiguous result
 in finite time, a CPU will only enter the election in the first place if
 no winner has been chosen and the election does not appear to have
 started yet.


 Algorithm
 ---------

 The easiest way to explain the vlocks algorithm is with some pseudocode::


 	int currently_voting[NR_CPUS] = { 0, };
 	int last_vote = -1; /* no votes yet */

 	bool vlock_trylock(int this_cpu)
 	{
 		/* signal our desire to vote */
 		currently_voting[this_cpu] = 1;
 		if (last_vote != -1) {
 			/* someone already volunteered himself */
 			currently_voting[this_cpu] = 0;
 			return false; /* not ourself */
 		}

 		/* let's suggest ourself */
 		last_vote = this_cpu;
 		currently_voting[this_cpu] = 0;

 		/* then wait until everyone else is done voting */
 		for_each_cpu(i) {
 			while (currently_voting[i] != 0)
 				/* wait */;
 		}

 		/* result */
 		if (last_vote == this_cpu)
 			return true; /* we won */
 		return false;
 	}

 	void vlock_unlock(void)
 	{
 		last_vote = -1;
 	}


 The currently_voting[] array provides a way for the CPUs to determine
 whether an election is in progress, and plays a role analogous to the
 "entering" array in Lamport's bakery algorithm [1].

 However, once the election has started, the underlying memory system
 atomicity is used to pick the winner.  This avoids the need for a static
 priority rule to act as a tie-breaker, or any counters which could
 overflow.

 As long as the last_vote variable is globally visible to all CPUs, it
 will contain only one value that won't change once every CPU has cleared
 its currently_voting flag.
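
 As a rough illustration, the pseudocode can be turned into compilable
 C11 along the following lines.  This is only a sketch, not the kernel
 implementation: it assumes that sequentially consistent atomics from
 <stdatomic.h> stand in for the ordering that uncached memory provides,
 and the struct vlock type, the vlock_init() helper, the NR_CPUS value
 and the VLOCK_NO_VOTE name are all invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4          /* assumed CPU count for this sketch */
#define VLOCK_NO_VOTE (-1) /* same "no votes yet" marker as the pseudocode */

struct vlock {
	atomic_int currently_voting[NR_CPUS];
	atomic_int last_vote;
};

/* Put the lock into the unlocked state. */
void vlock_init(struct vlock *v)
{
	for (int i = 0; i < NR_CPUS; i++)
		atomic_init(&v->currently_voting[i], 0);
	atomic_init(&v->last_vote, VLOCK_NO_VOTE);
}

/* Returns true if this_cpu won the election and now holds the lock. */
bool vlock_trylock(struct vlock *v, int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	/* signal our desire to vote */
	atomic_store(&v->currently_voting[this_cpu], 1);
	if (atomic_load(&v->last_vote) != VLOCK_NO_VOTE) {
		/* someone else has already voted: withdraw */
		atomic_store(&v->currently_voting[this_cpu], 0);
		return false;
	}

	/* vote for ourselves, then mark our vote as cast */
	atomic_store(&v->last_vote, this_cpu);
	atomic_store(&v->currently_voting[this_cpu], 0);

	/* wait until every other voter has finished */
	for (int i = 0; i < NR_CPUS; i++)
		while (atomic_load(&v->currently_voting[i]) != 0)
			; /* spin */

	/* the last vote written wins */
	return atomic_load(&v->last_vote) == this_cpu;
}

/* Only the CPU that won the election should call this. */
void vlock_unlock(struct vlock *v)
{
	atomic_store(&v->last_vote, VLOCK_NO_VOTE);
}
```

 A caller would initialise the lock once with vlock_init(), and only a
 CPU for which vlock_trylock() returned true should later call
 vlock_unlock().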


 Features and limitations
 ------------------------

  * vlocks are not intended to be fair.  In the contended case, the
    _last_ CPU to attempt to get the lock is the one most likely to
    win.

    vlocks are therefore best suited to situations where it is necessary
    to pick a unique winner, but it does not matter which CPU actually
    wins.

  * Like other similar mechanisms, vlocks will not scale well to a large
    number of CPUs, since every lock attempt has to poll the
    currently_voting flag of each potential contender.

    vlocks can be cascaded in a voting hierarchy to permit better scaling
    if necessary, as in the following hypothetical example for 4096 CPUs::

 	/* first level: local election among the 16 CPUs of a town */
 	my_town = towns[this_cpu >> 4];
 	I_won = vlock_trylock(my_town, this_cpu & 0xf);
 	if (I_won) {
 		/* we won the town election, let's go for the state */
 		my_state = states[this_cpu >> 8];
 		I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
 		if (I_won) {
 			/* and so on */
 			I_won = vlock_trylock(the_whole_country, (this_cpu >> 8) & 0xf);
 			if (I_won) {
 				/* ... */
 				vlock_unlock(the_whole_country);
 			}
 			/* only the winner of a level releases that level */
 			vlock_unlock(my_state);
 		}
 		vlock_unlock(my_town);
 	}
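
 To make the arithmetic behind such a hierarchy concrete, the sketch
 below splits a CPU number into a lock index and a voter ID for each
 level.  It assumes 16 voters per lock and that each level takes its
 voter ID from a different 4-bit field of the CPU number, so that no two
 voters in one election share an ID.  struct vlock_path and
 vlock_path_for_cpu() are hypothetical names, not part of any existing
 interface.

```c
#include <assert.h>

#define VLOCK_LEVEL_BITS 4                          /* 16 voters per lock */
#define VLOCK_LEVEL_MASK ((1u << VLOCK_LEVEL_BITS) - 1)
#define VLOCK_TOTAL_CPUS 4096u                      /* 16 * 16 * 16 */

/* Hypothetical helper type: where one CPU votes at each level. */
struct vlock_path {
	unsigned int town;          /* index of the town lock, 0..255 */
	unsigned int town_voter;    /* voter ID within the town, 0..15 */
	unsigned int state;         /* index of the state lock, 0..15 */
	unsigned int state_voter;   /* voter ID within the state, 0..15 */
	unsigned int country_voter; /* voter ID in the top-level election */
};

struct vlock_path vlock_path_for_cpu(unsigned int cpu)
{
	struct vlock_path p;

	assert(cpu < VLOCK_TOTAL_CPUS);

	/* lowest field: which of the 16 CPUs in the town we are */
	p.town = cpu >> VLOCK_LEVEL_BITS;
	p.town_voter = cpu & VLOCK_LEVEL_MASK;

	/* middle field: the town winner represents its town in the state */
	p.state = cpu >> (2 * VLOCK_LEVEL_BITS);
	p.state_voter = (cpu >> VLOCK_LEVEL_BITS) & VLOCK_LEVEL_MASK;

	/* top field: the state winner represents its state in the country */
	p.country_voter = (cpu >> (2 * VLOCK_LEVEL_BITS)) & VLOCK_LEVEL_MASK;

	return p;
}
```

 For CPU 0xabc this gives town 0xab with voter ID 0xc, state 0xa with
 voter ID 0xb, and voter ID 0xa in the country-wide election.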


 ARM implementation
 ------------------

 The current ARM implementation [2] contains some optimisations beyond
 the basic algorithm:

  * By packing the members of the currently_voting array close together,
    we can read the whole array in one transaction (providing the number
    of CPUs potentially contending the lock is small enough).  This
    reduces the number of round-trips required to external memory.

    In the ARM implementation, this means that we can use a single load
    and comparison::

 	LDR	Rt, [Rn]
 	CMP	Rt, #0

    ...in place of code equivalent to::

 	LDRB	Rt, [Rn]
 	CMP	Rt, #0
 	LDRBEQ	Rt, [Rn, #1]
 	CMPEQ	Rt, #0
 	LDRBEQ	Rt, [Rn, #2]
 	CMPEQ	Rt, #0
 	LDRBEQ	Rt, [Rn, #3]
 	CMPEQ	Rt, #0

    This cuts down on the fast-path latency, as well as potentially
    reducing bus contention in contended cases.

    The optimisation relies on the fact that the ARM memory system
    guarantees coherency between overlapping memory accesses of
    different sizes, similarly to many other architectures.  Note that
    we do not care which element of currently_voting appears in which
    bits of Rt, so there is no need to worry about endianness in this
    optimisation.

    If there are too many CPUs to read the currently_voting array in
    one transaction then multiple transactions are still required.  The
    implementation uses a simple loop of word-sized loads for this
    case.  The number of transactions is still fewer than would be
    required if bytes were loaded individually.


    In principle, we could aggregate further by using LDRD or LDM, but
    to keep the code simple this was not attempted in the initial
    implementation.


  * vlocks are currently only used to coordinate between CPUs which are
    unable to enable their caches yet.  This means that the
    implementation removes many of the barriers which would be required
    when executing the algorithm in cached memory.

    Packing of the currently_voting array does not work with cached
    memory unless all CPUs contending the lock are cache-coherent, due
    to cache writebacks from one CPU clobbering values written by other
    CPUs.  (Though if all the CPUs are cache-coherent, you should
    probably be using proper spinlocks instead anyway.)


  * The "no votes yet" value used for the last_vote variable is 0 (not
    -1 as in the pseudocode).  This allows statically-allocated vlocks
    to be implicitly initialised to an unlocked state simply by putting
    them in .bss.

    An offset is added to each CPU's ID for the purpose of setting this
    variable, so that no CPU uses the value 0 for its ID.
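
 The data layout described above might look roughly like this in C.  The
 sketch only illustrates the packing and the zero "no votes yet"
 encoding; it is not the kernel's assembly code, and it omits everything
 needed for real concurrent access to uncached memory.  VLOCK_MAX_CPUS,
 VLOCK_ID_OFFSET (assumed to be 1 here), struct packed_vlock and both
 helper functions are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VLOCK_MAX_CPUS 8   /* assumed; kept a multiple of the word size */
#define VLOCK_ID_OFFSET 1  /* assumed offset so that no CPU votes with 0 */

struct packed_vlock {
	uint32_t owner;                 /* 0 means "no votes yet" */
	uint8_t voting[VLOCK_MAX_CPUS]; /* one flag byte per CPU, packed */
};

/* Value a CPU would store into owner when it votes: never 0. */
uint32_t vlock_vote_value(unsigned int cpu)
{
	assert(cpu < VLOCK_MAX_CPUS);
	return cpu + VLOCK_ID_OFFSET;
}

/* True if any CPU is still voting, checked one 32-bit word at a time. */
bool vlock_any_voting(const struct packed_vlock *v)
{
	for (size_t i = 0; i < VLOCK_MAX_CPUS; i += sizeof(uint32_t)) {
		uint32_t word;

		/*
		 * Four flag bytes are examined with a single comparison.
		 * Which byte lands in which bits of the word does not
		 * matter, so the check is independent of endianness.
		 */
		memcpy(&word, &v->voting[i], sizeof(word));
		if (word != 0)
			return true;
	}
	return false;
}
```

 Because the unlocked state is all zeroes, a statically allocated
 struct packed_vlock with no initialiser starts out unlocked.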


 Colophon
 --------

 Originally created and documented by Dave Martin for Linaro Limited, for
 use in ARM-based big.LITTLE platforms, with review and input gratefully
 received from Nicolas Pitre and Achin Gupta.  Thanks to Nicolas for
 grabbing most of this text out of the relevant mail thread and writing
 up the pseudocode.

 Copyright (C) 2012-2013  Linaro Limited
 Distributed under the terms of Version 2 of the GNU General Public
 License, as defined in linux/COPYING.


 References
 ----------

 [1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
     Problem", Communications of the ACM 17, 8 (August 1974), 453-455.

     https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

 [2] linux/arch/arm/common/vlock.S, www.kernel.org.