File System Interface
(read chapter 11 of the dinosaur book)
One of the most visible aspects of a computer is the data it stores. This view generally takes the form of files and directories.
The operating system abstracts away the hardware details of storing a file and presents a clean file interface through which we can access and modify files. Users generally do not care how a file is stored on the hardware, how the sectors on the hard drive are arranged, etc. All those low-level details are handled by the operating system.
File systems are generally arranged in a tree-like manner, where the root and branches are directories and the files are leaves.
Various operating systems may store various attributes for files. Some will store the owner of the file, while others won't. Generally, there are a few attributes that are always there: the name of the file, the size, type, location, time when it was created, protection information, etc.
The information about files is kept in the directory where these files are residing.
Treating the file as an abstract data type, we can imagine performing abstract operations on them, like:
Creating a file: Two steps are needed to create a file. First, space must be found for the file. Then, an entry needs to be created in the directory to record that the file exists.
Writing a file: To write a file we use a system call. We provide a file identifier, and the information to be written. The operating system performs the write (or lets some driver handle it). The operating system must maintain the write pointer in the file.
Reading a file: To read a file we use a system call. We provide the file identifier, and the address where we want the data to be read (and the size of how much data to read). The operating system handles the read (or lets some driver handle it). The operating system must maintain the read pointer in the file (which is generally the same as the writing pointer).
Repositioning within a file: Moving the read/write pointer in the file (or moving through the directory entries) is one of the basic operations on files. This does not require any IO, since the data structures that maintain the read/write pointer are in memory and can be updated there.
Deleting a file: The operating system must provide for the deletion of a file. This involves freeing the space allocated to the file, and removing the file entry from the directory. (the reverse operation of creating a file).
Truncating a file: Truncating a file means erasing its contents while keeping the directory entry (freeing the space allocated to the file, but not touching its directory entry). The file will effectively be empty.
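The operations above can be sketched with Python's os module, whose functions are thin wrappers over the POSIX system calls (the file name "demo.txt" is just illustrative):

```python
import os

def file_lifecycle(path):
    # Create: allocate space and make a directory entry.
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC)

    # Write: the OS advances the write pointer for us.
    os.write(fd, b"hello")

    # Reposition: move the read/write pointer; no disk IO needed.
    os.lseek(fd, 0, os.SEEK_SET)

    # Read: bytes come back from the current pointer position.
    data = os.read(fd, 5)

    # Truncate: free the file's data but keep the directory entry.
    os.ftruncate(fd, 0)
    size_after = os.fstat(fd).st_size
    os.close(fd)

    # Delete: remove the directory entry and free the space.
    os.unlink(path)
    return data, size_after

print(file_lifecycle("demo.txt"))  # (b'hello', 0)
```

Note that the file pointer is maintained by the operating system between calls: the read picks up exactly where the lseek left it.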
For open files, the operating system usually maintains several lists. There is a global list, which contains references to open files across the whole system (for all processes), and there is also a per-process list of open files.
Depending on the setup, these lists and data structures will contain:
File Pointer: The read/write pointer inside the file.
File Open Count: In the global table the system needs to maintain how many instances of this file are currently opened by various processes.
Disk location of the file: The file's location on the hard drive is usually kept in memory so that the system does not have to search the disk for it on every read or write.
Access Rights: Each file is opened in some access mode (read only, etc.) An operating system can store this in a per-process open file table so that it can deny or allow some file operation on a per-process basis.
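Here is a toy sketch (not a real OS implementation) of the two tables: a system-wide table keyed by file, holding the open count and the disk location, and a per-process table holding each open instance's file pointer and access mode.

```python
# System-wide open file table: file name -> {"count", "disk_location"}.
system_table = {}

def os_open(process_table, name, mode, disk_location=None):
    # Bump (or create) the system-wide entry for this file.
    entry = system_table.setdefault(
        name, {"count": 0, "disk_location": disk_location})
    entry["count"] += 1                      # one more process has it open
    # Per-process entry: its own file pointer and access rights.
    fd = len(process_table)                  # toy file descriptor
    process_table[fd] = {"file": name, "pointer": 0, "mode": mode}
    return fd

def os_close(process_table, fd):
    name = process_table.pop(fd)["file"]
    system_table[name]["count"] -= 1
    if system_table[name]["count"] == 0:     # last closer: drop the entry
        del system_table[name]

# Two processes open the same file; the global count reflects both.
proc_a, proc_b = {}, {}
os_open(proc_a, "log.txt", "r")
os_open(proc_b, "log.txt", "w")
print(system_table["log.txt"]["count"])  # 2
```

The point of the open count is visible in os_close: the system-wide entry can only be discarded once every process has closed the file.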
There are various file types. The problem for users and the operating system is how to keep track of a file's type. Some operating systems (notably Windows) rely on the file extension to indicate the type of file. Others, like UNIX, rely on a magic number in the file header to indicate the type.
Determining the structure of the file sometimes goes hand in hand with the file type. If we know the type, we can usually determine the file format.
For example, if we know that the file is plain ASCII, we can be pretty sure it contains text. If we know that the extension is .java, we know it contains Java code. Similarly, if we know the type is executable, we can expect the structures normally found in executable files.
There are also raw binary files, which have no inherent type or structure. These are the most flexible, but provide the least support. Applications using such files must define their own file format and read/write that format.
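The UNIX-style magic-number approach can be sketched as follows. The signatures below are real (ELF executables start with the bytes 0x7f 'E' 'L' 'F', PNG images with an 8-byte signature), but the table is obviously far from exhaustive:

```python
# Known magic numbers, mapped to a human-readable type.
MAGIC = {
    b"\x7fELF": "ELF executable",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
}

def file_type(header: bytes) -> str:
    # Compare the first bytes of the file against each known signature.
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown (treat as raw binary)"

print(file_type(b"%PDF-1.4 rest of file..."))  # PDF document
```

Note that unlike an extension, the magic number travels with the file's contents: renaming the file does not change what file_type reports.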
Internal File Structure
Files are generally stored on block devices, where data is stored in blocks. These blocks can be hard drive sectors, or some other logically defined entity.
What this means is that the file, as a stream of bytes, may need to be subdivided into these blocks. And tables will need to be maintained to indicate which block belongs to what file.
There is also the problem of internal fragmentation, where part of the last block is wasted. For example, if a file is 513 bytes and blocks are 512 bytes, then we need at least 2 blocks to store the file; the second block holds only 1 byte, and its other 511 bytes are wasted.
So, while increasing the block size reduces the size of the tables that need to be maintained for the disk, it increases internal fragmentation.
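The 513-byte example above, computed directly:

```python
def blocks_and_waste(file_size, block_size):
    # Ceiling division: the last, partially filled block still counts.
    blocks = (file_size + block_size - 1) // block_size
    # Internal fragmentation: allocated bytes minus bytes actually used.
    waste = blocks * block_size - file_size
    return blocks, waste

print(blocks_and_waste(513, 512))   # (2, 511)
print(blocks_and_waste(513, 4096))  # bigger blocks: fewer blocks, more waste
```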
There are several ways of accessing a file in a system.
Files can be accessed sequentially, meaning we read the data in order from beginning to end, never repositioning the file pointer. This mode of access is very common. Programs like compilers and many text editors use it.
This access method also allows buffering to be used to the fullest. For example, the library or operating system can read an entire hardware block (a sector, for example) into memory and then simply serve bytes from memory, meaning the hard drive is only touched once per 512 bytes (if that's the sector/buffer size) read via the read system call.
This access method is so common that the notion of standard IO is based on it. Even in Java, the most common way to do IO is via Streams, which provide essentially sequential file IO.
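This buffering can be simulated with a toy class: read_byte() hands out one byte at a time, but the underlying "device" is only read once per 512-byte block, with raw_reads counting the simulated disk accesses.

```python
class BufferedFile:
    BLOCK = 512

    def __init__(self, data: bytes):
        self.data = data        # stands in for the block device
        self.buf = b""          # the current in-memory block
        self.pos = 0            # file pointer
        self.raw_reads = 0      # how many block reads hit the "disk"

    def read_byte(self) -> bytes:
        if self.pos % self.BLOCK == 0:   # buffer exhausted: fetch next block
            self.buf = self.data[self.pos:self.pos + self.BLOCK]
            self.raw_reads += 1
        off = self.pos % self.BLOCK
        b = self.buf[off:off + 1]
        self.pos += 1
        return b

f = BufferedFile(b"x" * 1000)
for _ in range(1000):
    f.read_byte()
print(f.raw_reads)  # 2 block reads serve 1000 byte reads
```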
For some file manipulation sequential access is not enough. This is where we need direct access. Direct access lets us seek to some location, and read/write data from that file location.
Obviously we can simulate sequential access via direct access (by seeking every time we read a byte).
Many databases use this method, since sequentially going through many records is time consuming. If you know where the record is, why not just go to it directly?
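With fixed-size records, direct access is just arithmetic: record n lives at byte offset n * RECORD_SIZE, so a single seek puts the file pointer right at it. A sketch (the file name and record size are made up):

```python
import os

RECORD_SIZE = 16  # every record occupies exactly 16 bytes

def read_record(fd, n):
    # Jump straight to record n; no need to read records 0..n-1 first.
    os.lseek(fd, n * RECORD_SIZE, os.SEEK_SET)
    return os.read(fd, RECORD_SIZE)

# Build a small "database" of 10 fixed-size records.
fd = os.open("records.db", os.O_CREAT | os.O_RDWR | os.O_TRUNC)
for i in range(10):
    os.write(fd, f"record {i:02d}".encode().ljust(RECORD_SIZE))

print(read_record(fd, 7))   # record 07, padded to 16 bytes
os.close(fd)
os.unlink("records.db")
```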
Other Access Methods
We can build many other access methods on top of direct-access method. The book goes into some detail here, but we'll side step it, and assume you can figure it out if you ever need to.
In order to store many files, we need some organization in the way we store and reference them. Commonly, there are several levels of abstraction.
First is that disks are broken up into partitions. Each partition is logically treated as a separate hard drive. A disk can have many partitions. There is usually at least one partition on every hard drive.
Each partition (being a logical hard drive) has a directory that maintains a list of the files on that partition.
Just as with files, there are several basic operations on a directory:
Search for a file: we need to be able to search the directory for a specific file.
Create a file: We need to be able to create files and place the entry in the directory.
Delete a file: We need to be able to delete a file and remove its entry from the directory.
List a directory: We need to be able to get a list of files in the directory.
Rename a file: We need to be able to rename files in the directory.
Traverse the file system: We need some way (or ability) to traverse the directory. This is useful in backups, compression (zipping the entire directory), etc.
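The directory operations above map directly onto the usual POSIX-style calls, again via Python's os module (the directory and file names here are illustrative):

```python
import os

os.mkdir("demo_dir")                                  # create a directory
open(os.path.join("demo_dir", "a.txt"), "w").close()  # create a file entry
os.rename(os.path.join("demo_dir", "a.txt"),
          os.path.join("demo_dir", "b.txt"))          # rename a file
print(os.listdir("demo_dir"))                         # list the directory
assert "b.txt" in os.listdir("demo_dir")              # search for a file
for root, dirs, files in os.walk("demo_dir"):         # traverse (backups, etc.)
    pass
os.unlink(os.path.join("demo_dir", "b.txt"))          # delete a file
os.rmdir("demo_dir")                                  # remove the directory
```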
Directory structures are usually not simple, and some simple systems provide only a limited version of what we know today as a directory structure.
Some systems only provide a single level directory structure. This means that we cannot nest directories. There is a pictorial view of this in section 11.3.1 of the dinosaur book.
A single-level directory structure has some major limitations. The namespace is fairly limited and restricts any serious use of the system. There are some other good reasons illustrated in the book, but you can more or less figure out the limitations by simply imagining your computer providing just a single flat directory.
Two level directories aren't much better, but they provide a bit more abstraction. For example, we can have several users having their own set of single-level directories.
Most operating systems today use a variant of the tree-structured directory. What this means is that directories are like branches in a tree, and files are like leaves.
The tree has a root, which is a directory containing a list of files. Each file is either a regular file or a subdirectory, and a subdirectory can in turn contain files or other subdirectories.
This structure in effect allows arbitrary nesting, though there are some practical limitations (a maximum depth to which directories can be nested).
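The tree structure can be modeled with nested dicts: a dict is a directory (branch), anything else is a regular file (leaf). The names below are made up.

```python
tree = {
    "bin": {"ls": "file", "cp": "file"},
    "home": {"alice": {"notes.txt": "file"}},
}

def list_paths(node, prefix=""):
    # Recursively collect the full path of every leaf (regular file).
    paths = []
    for name, child in node.items():
        path = prefix + "/" + name
        if isinstance(child, dict):            # subdirectory: recurse
            paths += list_paths(child, path)
        else:                                  # regular file: a leaf
            paths.append(path)
    return paths

print(list_paths(tree))
# ['/bin/ls', '/bin/cp', '/home/alice/notes.txt']
```

Because the structure is a tree, this simple recursion is guaranteed to terminate: every node is reachable by exactly one path.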
Tree Structured directories, by definition, do not allow sharing of files (otherwise they wouldn't be a 'tree' structure).
Thus, acyclic graph directories were created, so that users can share files and directories.
What this structure adds are link files: directory entries that point to other files or directories elsewhere on the system.
There are some serious limitations to this approach as we'll discuss next.
General Graph Directories
Acyclic-graph directories carry a hidden cost: the structure must be kept free of cycles. Since we are intentionally restricting the structure to have no cycles, every time we create a link file we need to check and ensure that no cycle has been created. This is not an easy task.
Now, we can have general graph directories, which allow cycles in the graph. Unfortunately, the algorithms that traverse general graphs and acyclic graphs are quite different. For a graph with possible cycles in it, we need to maintain a list of nodes we've already visited, so that we do not visit them again when they come up in a cycle.
For large file systems, this storage of nodes may be prohibitive.
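A cycle-safe traversal looks like this sketch, where the visited set is what the last paragraph refers to (a real system might key it on device and inode number rather than names):

```python
def traverse(graph, node, visited=None):
    # visited is the extra storage that makes cycles safe to walk.
    if visited is None:
        visited = set()
    if node in visited:          # already seen: we hit a cycle, stop here
        return []
    visited.add(node)
    order = [node]
    for child in graph.get(node, []):
        order += traverse(graph, child, visited)
    return order

# "root" links to "a", and "a" links back to "root": a cycle.
graph = {"root": ["a", "b"], "a": ["root"], "b": []}
print(traverse(graph, "root"))  # ['root', 'a', 'b'] - no infinite loop
```

Without the visited set, the call on this graph would recurse forever; with it, the traversal costs one set entry per node, which is the storage overhead the text describes.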