The Kernel Buffer Structures
The nature of physical disc drives is such that it is often only possible to transfer data in block size chunks. These are typically 512,1024 or 2048 bytes. It is most unlikely that these units and the arbitrary boundaries they impose on the data will correspond to anything useful to the end-user application. To avoid multiple successive accesses to the same physical disc blocks the kernel maintains a set of block size in-core buffers and attempts to satisfy user requests from these blocks only initiating a physical disc read if necessary. Similarly when an end-user application writes to disc the kernel will, instead, modify the in-core buffer, only actually initiating a physical disc read if the user process so requests or if and when the buffer is required for some other purpose.
The performance advantages of in-core buffering or disc caching are fairly obvious and most operating systems provide this sort of facility. The number of such buffers is usually a "tunable" parameter. The size of buffers is not likely to be alterable and will impose constraints on the attachment of discs of very different characteristics. A multi-tasking operating system, such as Unix, imposes further constraints on the design of the kernel buffering system.
Associated with every buffer is a buffer header which contains the following information.
The kernel maintains a doubly linked list of free buffers known as the free list. Note that it is also possible for a buffer header not to be associated with a buffer, such headers are on the pfreelist. It also maintains a number of linked lists of used buffers known as hash queues. When a request is made for a disc block, the kernel hashes the device number and block number to decide which of the hash queues to explore looking for the block in question. The default number of hash queues is 64. If the required block is not found then a block is removed from the free list and a physical transfer is initiated. Of course it is very important to ensure consistency during these manipulations of the buffer structures and, to this end, the kernel will disable all interrupts while it manipulates certain key aspects of the buffer structures.
The operation of the kernel buffering system is described in terms of five algorithms, the names correspond to functions in the kernel source code.
This associates a buffer number with device/block number. It may involve allocation of a new buffer via the geteblk() [get empty block] function. The buffer is locked and can be used for input or output. It is important to consider whether the buffer is "free" or not. In this context "free" means that there is currently no input/output activity in process for the buffer. The following possibilities may be encountered
The opreation of getblk can be described algorithmically thus.
while(buffer not found) { if(block in hash queue) { if(buffer busy/locked) { sleep(buffer becomes free) continue } mark buffer busy/locked remove buffer from free list return buffer } else { if(no buffers on free list) { sleep(any buffer becomes free) continue; } remove buffer from free list if(delayed write buffer) { write buffer to disc (asynchronously) continue } remove buffer from old hash queue put buffer on new hash queue return buffer } }
It should be noted that the algorithm returns (via continue) to re-examine the hash queues etc., whenever it wakes up after a sleep.
This releases a used buffer back to the free list. The algorithm is
{ wake up all processes waiting for this buffer wake up all processes waiting for a free buffer block interrupts if(buffer contents valid AND buffer not "old") put buffer at end of free list else put buffer at start of free list unblock interrupts mark buffer free }
This reads a disc block via the buffering system. The algorithm is.
{ getblk() if(buffer date valid) return buffer initiate disc read sleep (disc read complete) return buffer }
This reads two disc blocks via the buffering system. The second block is read asynchronously which means that no process will be blocked awaiting completion of this speculative read ahead.
Writes data via the buffer system. This may be synchronous or asynchronous. The algorithm is
{ initiate disc write if(I/O synchronous) { sleep(disc write complete) brelse(buffer) } else if(marked for delayed write) mark buffer to put at head of free list }