 	Mandatory File Locking For The Linux Operating System

 		Andy Walker <andy@lysaker.kvaerner.no>

 			   15 April 1996
 		     (Updated September 2007)

 0. Why you should avoid mandatory locking
 -----------------------------------------

 The Linux implementation is prey to a number of difficult-to-fix race
 conditions which in practice make it not dependable:

 	- The write system call checks for a mandatory lock only once
 	  at its start.  It is therefore possible for a lock request to
 	  be granted after this check but before the data is modified.
 	  A process may then see file data change even while a mandatory
 	  lock was held.
 	- Similarly, an exclusive lock may be granted on a file after
 	  the kernel has decided to proceed with a read, but before the
 	  read has actually completed, and the reading process may see
 	  the file data in a state which should not have been visible
 	  to it.
 	- Similar races make the claimed mutual exclusion between lock
 	  and mmap similarly unreliable.

 1. What is mandatory locking?
 -----------------------------

 Mandatory locking is kernel-enforced file locking, as opposed to the more usual
 cooperative file locking used to guarantee sequential access to files among
 processes. File locks are applied using the flock() and fcntl() system calls
 (and the lockf() library routine, which is a wrapper around fcntl()). It is
 normally a process's responsibility to check for locks on a file it wishes to
 update, before applying its own lock, updating the file and unlocking it again.
 The most commonly used example of this (and in the case of sendmail, the most
 troublesome) is access to a user's mailbox. The mail user agent and the mail
 transfer agent must guard against updating the mailbox at the same time, and
 prevent reading the mailbox while it is being updated.

 In a perfect world all processes would use and honour a cooperative, or
 "advisory" locking scheme. However, the world isn't perfect, and there's
 a lot of poorly written code out there.

 In trying to address this problem, the designers of System V UNIX came up
 with a "mandatory" locking scheme, whereby the operating system kernel would
 block attempts by a process to write to a file that another process holds a
 "read" -or- "shared" lock on, and block attempts to either read or write to a
 file that another process holds a "write" -or- "exclusive" lock on.

 The System V mandatory locking scheme was intended to have as little impact as
 possible on existing user code. The scheme is based on marking individual files
 as candidates for mandatory locking, and using the existing fcntl()/lockf()
 interface for applying locks just as if they were normal, advisory locks.

 Note 1: In saying "file" in the paragraphs above I am actually not telling
 the whole truth. System V locking is based on fcntl(). The granularity of
 fcntl() is such that it allows the locking of byte ranges in files, in addition
 to entire files, so the mandatory locking rules also have byte-level
 granularity.
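
 As a rough illustration of the byte-range granularity described in Note 1,
 the sketch below locks part of a file through fcntl(). The helper name
 lock_region() is hypothetical and exists only for this example. The call is
 the same for advisory and mandatory locks; which kind results depends on how
 the file is marked, not on anything passed here.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply a lock of the given type (F_RDLCK, F_WRLCK or
 * F_UNLCK) to `len` bytes of `fd`, starting at byte offset `start`.
 * A `len` of 0 means "from `start` to the end of the file".
 * Returns 0 on success, -1 with errno set on failure.
 */
int lock_region(int fd, short type, off_t start, off_t len)
{
	struct flock fl = { 0 };

	assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);

	fl.l_type = type;
	/* Interpret `start` relative to the beginning of the file. */
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;

	/* F_SETLK fails at once on a conflict; F_SETLKW would wait instead. */
	return fcntl(fd, F_SETLK, &fl);
}
```

 For example, lock_region(fd, F_WRLCK, 0, 10) requests an exclusive lock on the
 first ten bytes only, leaving the rest of the file unlocked.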

 Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
 borrowing the fcntl() locking scheme from System V. The mandatory locking
 scheme is defined by the System V Interface Definition (SVID) Version 3.

 2. Marking a file for mandatory locking
 ---------------------------------------

 A file is marked as a candidate for mandatory locking by setting the group-id
 bit in its file mode but removing the group-execute bit. This is an otherwise
 meaningless combination, and was chosen by the System V implementors so as not
 to break existing user programs.

 Note that the group-id bit is usually automatically cleared by the kernel when
 a setgid file is written to. This is a security measure. The kernel has been
 modified to recognize the special case of a mandatory lock candidate and to
 refrain from clearing this bit. Similarly the kernel has been modified not
 to run mandatory lock candidates with setgid privileges.

 3. Available implementations
 ----------------------------

 I have considered the implementations of mandatory locking available with
 SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.

 Generally I have tried to make the most sense out of the behaviour exhibited
 by these three reference systems. There are many anomalies.

 All the reference systems reject all calls to open() for a file on which
 another process has outstanding mandatory locks. This is in direct
 contravention of SVID 3, which states that only calls to open() with the
 O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
 definition, which is the "Right Thing", since only calls with O_TRUNC can
 modify the contents of the file.

 HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
 just mandatory locks. That would appear to contravene POSIX.1.

 mmap() is another interesting case. All the operating systems mentioned
 prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
 also disallows advisory locks for such a file. SVID actually specifies the
 paranoid HP-UX behaviour.

 In my opinion only MAP_SHARED mappings should be immune from locking, and then
 only from mandatory locks - that is what is currently implemented.

 SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
 mandatory locks, so reads and writes to locked files always block when they
 should return EAGAIN.

 I'm afraid that this is such an esoteric area that the semantics described
 below are just as valid as any others, so long as the main points seem to
 agree.

 4. Semantics
 ------------

 1. Mandatory locks can only be applied via the fcntl()/lockf() locking
    interface - in other words the System V/POSIX interface. BSD style
    locks using flock() never result in a mandatory lock.

 2. If a process has locked a region of a file with a mandatory read lock, then
    other processes are permitted to read from that region. If any of these
    processes attempts to write to the region it will block until the lock is
    released, unless the process has opened the file with the O_NONBLOCK
    flag, in which case the system call will return immediately with the error
    status EAGAIN.

 3. If a process has locked a region of a file with a mandatory write lock, all
    attempts by other processes to read or write to that region block until the
    lock is released, unless the process making the attempt has opened the file
    with the O_NONBLOCK flag, in which case the system call will return
    immediately with the error status EAGAIN.

 4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has
    any mandatory locks owned by other processes will be rejected with the
    error status EAGAIN.

 5. Attempts to apply a mandatory lock to a file that is memory mapped and
    shared (via mmap() with MAP_SHARED) will be rejected with the error status
    EAGAIN.

 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
    that has any mandatory locks in effect will be rejected with the error status
    EAGAIN.
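
 Rules 2 and 3 mean that a process using O_NONBLOCK has to be prepared for
 EAGAIN from an ordinary write(). The following is a minimal sketch of how a
 caller might classify the outcome. The enum and both function names are
 hypothetical, and EAGAIN is only taken to indicate a conflicting lock on the
 assumption that the descriptor refers to a regular file marked for mandatory
 locking.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

enum write_result { WRITE_OK, WRITE_WOULD_BLOCK, WRITE_FAILED };

/* Open an existing file so that lock conflicts return EAGAIN, not block. */
int open_nonblocking(const char *path)
{
	assert(path != NULL);
	return open(path, O_WRONLY | O_NONBLOCK);
}

/*
 * Hypothetical helper: attempt a single write() and classify the result.
 * Short writes are reported as WRITE_OK and are not retried here.
 */
enum write_result try_write(int fd, const void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL);
	n = write(fd, buf, len);
	if (n >= 0)
		return WRITE_OK;
	/* Under rules 2 and 3 this is the "region is locked" case. */
	if (errno == EAGAIN)
		return WRITE_WOULD_BLOCK;
	return WRITE_FAILED;
}
```

 A caller that receives WRITE_WOULD_BLOCK can retry later or report the
 conflict, rather than hanging until the lock holder releases the region.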

 5. Which system calls are affected?
 -----------------------------------

 Those which read or modify a file's contents, not just the inode. That gives
 read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and
 ftruncate(). truncate() and ftruncate() are considered to be "write" actions
 for the purposes of mandatory locking.

 The affected region is usually defined as stretching from the current position
 for the total number of bytes read or written. For the truncate calls it is
 defined as the bytes of a file removed or added (we must also consider bytes
 added, as a lock can specify just "the whole file", rather than a specific
 range of bytes.)
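
 The two definitions of "affected region" above can be written out as simple
 range arithmetic. This is an illustrative sketch only, not the kernel's code;
 the struct and function names are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>

/* Half-open byte range [start, end). */
struct byte_range {
	off_t start;
	off_t end;
};

/* Region touched by a read or write of `count` bytes at position `pos`. */
struct byte_range rw_region(off_t pos, off_t count)
{
	struct byte_range r;

	assert(pos >= 0 && count >= 0);
	r.start = pos;
	r.end = pos + count;
	return r;
}

/* Bytes removed or added when a file's size changes from old to new. */
struct byte_range truncate_region(off_t old_size, off_t new_size)
{
	struct byte_range r;

	assert(old_size >= 0 && new_size >= 0);
	r.start = old_size < new_size ? old_size : new_size;
	r.end = old_size < new_size ? new_size : old_size;
	return r;
}

/* Non-zero if the two ranges share at least one byte. */
int ranges_overlap(struct byte_range a, struct byte_range b)
{
	return a.start < b.end && b.start < a.end;
}
```

 With these definitions, a 50-byte write at position 100 affects bytes 100 to
 149, so it conflicts with a lock covering bytes 140 to 159 but not with one
 that starts at byte 150.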

 Note 3: I may have overlooked some system calls that need mandatory lock
 checking in my eagerness to get this code out the door. Please let me know, or
 better still fix the system calls yourself and submit a patch to me or Linus.

 6. Warning!
 -----------

 Not even root can override a mandatory lock, so runaway processes can wreak
 havoc if they lock crucial files. The way around it is to change the file
 permissions (remove the setgid bit) before trying to read or write to it.
 Of course, that might be a bit tricky if the system is hung :-(