The Unix File System

The Kernel Buffer Structures

The nature of physical disc drives is such that it is often only possible to transfer data in block size chunks. These are typically 512,1024 or 2048 bytes. It is most unlikely that these units and the arbitrary boundaries they impose on the data will correspond to anything useful to the end-user application. To avoid multiple successive accesses to the same physical disc blocks the kernel maintains a set of block size in-core buffers and attempts to satisfy user requests from these blocks only initiating a physical disc read if necessary. Similarly when an end-user application writes to disc the kernel will, instead, modify the in-core buffer, only actually initiating a physical disc read if the user process so requests or if and when the buffer is required for some other purpose.

The performance advantages of in-core buffering or disc caching are fairly obvious and most operating systems provide this sort of facility. The number of such buffers is usually a "tunable" parameter. The size of buffers is not likely to be alterable and will impose constraints on the attachment of discs of very different characteristics. A multi-tasking operating system, such as Unix, imposes further constraints on the design of the kernel buffering system.

Associated with every buffer is a buffer header which contains the following information.

Buffer Status.
- Locked/busy or Unlocked/free.
- Contains valid data.
- Buffer is dirty. I.e. the data has been modified since it was read from disc.
- The buffer is in use by the kernel.
- A process is waiting for the buffer.
Device number and block number associated with the buffer data.
Pointers to other buffer headers. There are actually two sets, one associated with the hash queue that the buffer happens to be on and one associated with the free buffer list. A buffer can be and often will be, on both a hash queue and the free list. This means that it contains valid data but that no process is currently accessing the data and there is no current I/O activity relating to the buffer.
Pointer to actual data area.
Size of data area. This is file system dependent. Earlier versions of Unix insisted on 512 byte buffers.

The kernel maintains a doubly linked list of free buffers known as the free list. Note that it is also possible for a buffer header not to be associated with a buffer, such headers are on the pfreelist. It also maintains a number of linked lists of used buffers known as hash queues. When a request is made for a disc block, the kernel hashes the device number and block number to decide which of the hash queues to explore looking for the block in question. The default number of hash queues is 64. If the required block is not found then a block is removed from the free list and a physical transfer is initiated. Of course it is very important to ensure consistency during these manipulations of the buffer structures and, to this end, the kernel will disable all interrupts while it manipulates certain key aspects of the buffer structures.

The operation of the kernel buffering system is described in terms of five algorithms, the names correspond to functions in the kernel source code.

getblk()
This associates a buffer number with device/block number. It may involve allocation of a new buffer via the geteblk() [get empty block] function. The buffer is locked and can be used for input or output. It is important to consider whether the buffer is "free" or not. In this context "free" means that there is currently no input/output activity in process for the buffer. The following possibilities may be encountered
1. The block is found and is free. The block is marked busy and removed from the free list.
2. The block is found and is busy. Under these circumstances there can be complex interactions between processes.
3. The block cannot be found so new block is obtained from the free list. This involves marking the block as busy, transferring it to the new hash queue and removing it from the free list.
4. The block cannot be found and a "delayed write" block is found on the free list. Asynchronous writing out of the "delayed write" block is initiated and an attempt is initiated to find another free block.
5. The block cannot be found and the free list is empty. The process that, indirectly, initiated the search is slept waiting the execution of the brelse()
The opreation of getblk can be described algorithmically thus.
```
while(buffer not found)
{
	if(block in hash queue)
	{
		if(buffer busy/locked)
		{
			sleep(buffer becomes free)
			continue
		}
		mark buffer busy/locked
		remove buffer from free list
		return buffer
	}
	else
	{
		if(no buffers on free list)
		{
			sleep(any buffer becomes free)
			continue;
		}
		remove buffer from free list
		if(delayed write buffer)
		{
			write buffer to disc (asynchronously)
			continue
		}
		remove buffer from old hash queue
		put buffer on new hash queue
		return buffer
	}
}
```
It should be noted that the algorithm returns (via continue) to re-examine the hash queues etc., whenever it wakes up after a sleep.

brelse()

This releases a used buffer back to the free list. The algorithm is

{
	wake up all processes waiting for this buffer
	wake up all processes waiting for a free buffer
	block interrupts
	if(buffer contents valid AND buffer not "old")
		put buffer at end of free list
	else
		put buffer at start of free list
	unblock interrupts
	mark buffer free
}

bread()

This reads a disc block via the buffering system. The algorithm is.

{
	getblk()
	if(buffer date valid) return buffer
	initiate disc read
	sleep (disc read complete)
	return buffer
}

breada()
This reads two disc blocks via the buffering system. The second block is read asynchronously which means that no process will be blocked awaiting completion of this speculative read ahead.

bwrite()

Writes data via the buffer system. This may be synchronous or asynchronous. The algorithm is

{
	initiate disc write
	if(I/O synchronous)
	{
		sleep(disc write complete)
		brelse(buffer)
	}
	else if(marked for delayed write)
		mark buffer to put at head of free list
}