Once you begin creating files, both you and the operating system need some way to keep track of them. How is this done? In DOS, the answer lies with something called the File Allocation Table (FAT). To understand how this important component of DOS functions, let us first look at how disks are organized to store data.
Each disk is divided into sectors, which are 512 bytes in size. These sectors lie along tracks, which are concentric rings on the disk. On a hard drive, these are created as part of the low-level formatting process, which is done at the factory. In the old days users performed the low-level format themselves, but this has not been necessary in some years. On an old floppy drive, you could conceivably use sectors as the basic unit for storing file data, since the total number of sectors would not be that large. On a 360K floppy disk, you would have 720 sectors to deal with. But on a large hard drive (like 100MB!!), you would need to keep track of about 200,000 of these sectors, with all of the overhead of assigning addresses to each sector and storing information about them in a table. Also, 512 bytes is pretty small as files go. Most files would require multiple sectors to store their information, possibly hundreds of them. So the sectors were collected into larger units, called clusters. The cluster is sometimes referred to as the allocation unit, because it is the minimum amount of space that can be allocated to a file.
For example, suppose the size of a cluster is 4096 bytes (i.e. 8 sectors). If you have a file that is 3000 bytes in size, it will be saved using one cluster, and 1096 bytes of that cluster will be wasted. That is because only one file can ever “own” a cluster. If your file was 5000 bytes, you would use two clusters (total 8192 bytes), and 3192 bytes of the second cluster would be wasted. Assuming that file sizes are effectively random, you can quickly show that on average you waste one-half of a cluster per file saved. So there is some incentive to minimize this wastage, and the best way is to reduce the size of the partition. The reason for this has to do with how cluster sizes are determined, and that leads us to the File Allocation Table.
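The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name `cluster_usage` is made up for illustration, and the 4096-byte cluster size is simply the figure used in the example above.

```python
def cluster_usage(file_size, cluster_size=4096):
    """Return (clusters_used, wasted_bytes) for a single file."""
    # A file always occupies a whole number of clusters, so round up.
    clusters = -(-file_size // cluster_size)
    wasted = clusters * cluster_size - file_size
    return clusters, wasted

for size in (3000, 5000):
    used, wasted = cluster_usage(size)
    print(f"{size}-byte file: {used} cluster(s), {wasted} bytes wasted")
```

Running it reproduces the two cases in the text: one cluster with 1096 bytes wasted, and two clusters with 3192 bytes wasted.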
The File Allocation Table (FAT) is a place on the disk where information about where files are stored is kept. Metaphorically, it is like the catalog in a library. It works together with the directory entries described later in this lesson: a directory entry stores the name of each file and a pointer to the place on the disk where that file begins, and the FAT records which clusters come next. The FAT has one entry for each cluster on the disk. These entries are stored as binary numbers, and the number of bits used determines the type of FAT in use. FAT-12, which is used for floppy disks (and for hard disks smaller than 17MB, should you ever encounter one <g>), stores the information in 12 bits per cluster. FAT-16, used in DOS and in versions of Windows prior to the OSR-2 version of Windows 95, stores the information in 16 bits. FAT-32, introduced to some computers in Windows 95 OSR-2, and in general to most people in Windows 98, uses 32 bits to store this information. Why does this matter? Because the maximum number of clusters is determined by the bits available to address each one. Since each bit is a binary 0 or 1, the formula is based on powers of two. Note that in FAT-12 and FAT-16, a few of the theoretically available slots have been reserved for the use of the file system. In FAT-32, 4 of the 32 bits in each address have been reserved for other uses, leaving 28 bits for pure addressing.
File System | Possible Entries | Actual Entries |
---|---|---|
FAT-12 | 2^12=4096 | 4086 |
FAT-16 | 2^16=65,536 | 65,526 |
FAT-32 | 2^28=268,435,456 | 268 million |
With this information, we can begin to do some calculations on cluster sizes. On a hard drive formatted using FAT-16, here is what you would find. (Note: these numbers are approximate, since hard drive sizes are stated differently in some cases.) I will assume 5,000 files per hard drive as an example. Note that a cluster must contain a power-of-two number of sectors (512 bytes each), so if you are doing the calculations, divide the drive size by the number of available clusters and round the result up to the next power-of-two multiple of 512.
Hard Drive Size | Cluster Size | Estimated Wastage (5,000 Files) |
---|---|---|
100 MB | 2,048 (4 sectors) | 5 MB |
500 MB | 8,192 (16 sectors) | 20 MB |
800 MB | 16,384 (32 sectors) | 41 MB |
1.2 GB | 32,768 (64 sectors) | 82 MB |
Since on a large hard drive the figure of 5,000 files is a drastic underestimate (and note that you need to throw in all of the directories and
subdirectories, each of which uses a slot), you can see why FAT-16 is just not
acceptable for larger hard drives.
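Here is a rough sketch of that calculation in Python. It assumes the 65,526 usable FAT-16 entries quoted earlier, and it assumes that a cluster must hold a power-of-two number of 512-byte sectors, which is how FAT volumes are actually formatted. The function names are invented for this illustration.

```python
SECTOR = 512
FAT16_CLUSTERS = 65_526   # usable FAT-16 entries, as quoted in the text
MB = 1024 * 1024

def fat16_cluster_size(drive_bytes):
    """Smallest power-of-two cluster size (bytes) that covers the drive."""
    size = SECTOR
    while size * FAT16_CLUSTERS < drive_bytes:
        size *= 2          # 1, 2, 4, 8, ... sectors per cluster
    return size

def estimated_wastage(cluster_size, files=5000):
    """Average waste is half a cluster per file."""
    return files * cluster_size // 2

for drive_mb in (100, 500):
    size = fat16_cluster_size(drive_mb * MB)
    wasted_mb = round(estimated_wastage(size) / MB)
    print(f"{drive_mb} MB drive: {size}-byte clusters, about {wasted_mb} MB wasted")
```

For the 100 MB and 500 MB drives this gives 2,048-byte and 8,192-byte clusters, with roughly 5 MB and 20 MB lost to partly filled clusters.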
The Structure of FAT
Assuming you have a FAT-16 file system, you have 65,526 clusters available for use when you begin. Of course, installing the operating system is going to use up a lot of those slots, and each additional program you install uses up many more. Here is how the FAT is structured:
Cluster Number | Contents |
---|---|
Cluster 0 | Reserved for DOS |
Cluster 1 | Reserved for DOS |
Cluster 2 | FFFFh (used to store a small file that fits in one cluster) |
Cluster 3 | 4 (used to store data, extends to cluster 4) |
Cluster 4 | 5 (used to store data, extends to cluster 5) |
Cluster 5 | 7 (used to store data, extends to cluster 7) |
Cluster 6 | 0 (empty, available for use) |
Cluster 7 | FFFFh (used to store data, is the last cluster in the chain) |
Cluster 8 | 0 (empty, available for use) |
... | ... |
Cluster 65524 | 0 (empty, available for use) |
Cluster 65525 | 0 (empty, available for use) |
Cluster 65526 | 0 (empty, available for use) |
As you can see, in each slot of the FAT there is status information. If the cluster is free, the value of zero is recorded. If the data extends over multiple clusters, the number of the next cluster in the chain is stored. If this is the last cluster in the chain, an end-of-file marker is stored (the hexadecimal number FFFF in FAT-16, or FFF in FAT-12). A file whose data fits in one cluster is simply a chain of length one, so its only cluster holds the end-of-file marker.
Ordinarily, you should not have any problems retrieving a file. The directory entry for your file MYFILE.TXT would say that the file begins in cluster 10793, for instance, and DOS would go there first and retrieve what is in that cluster. Looking at the FAT entry for cluster 10793, it would see the number 10794, for instance, and know that the next cluster in that chain was 10794, and it would go there and retrieve the contents of that cluster and append them to the end of the contents of the first cluster. It would keep doing this until it reached the cluster whose FAT entry held the end-of-file marker, and it would know that this meant it had found the end of the file and could stop.
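The chain-following process just described can be sketched in a few lines of Python. This is a toy model, not how DOS is implemented: the FAT is represented as a dictionary, the name `read_chain` is invented, and the end-of-file value `0xFFFF` is an assumption that corresponds to FAT-16 (FAT-12 uses a 12-bit marker).

```python
END_OF_CHAIN = 0xFFFF   # assumed FAT-16 end-of-file marker
FREE = 0

def read_chain(fat, start):
    """Return the cluster numbers that make up one file, in order."""
    chain = []
    cluster = start
    while True:
        chain.append(cluster)
        nxt = fat[cluster]              # look up this cluster's FAT entry
        if nxt == END_OF_CHAIN:
            return chain                # reached the end of the file
        if nxt == FREE or nxt in chain:
            raise ValueError(f"broken chain at cluster {cluster}")
        cluster = nxt                   # move on to the next cluster

# Toy FAT: MYFILE.TXT starts in cluster 10793 and occupies three clusters.
fat = {10793: 10794, 10794: 10795, 10795: END_OF_CHAIN}
print(read_chain(fat, 10793))   # [10793, 10794, 10795]
```

The clusters in a chain do not have to be neighbors; the same function would follow a chain such as 3, 4, 5, 7 just as well.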
Two things can go wrong with this, though. First, you can have a situation where two different clusters, each part of a different file, point to the same cluster as part of their chain. These are called cross-linked files. The second problem is when you have clusters that appear to be part of a chain, but the whole chain is not present. These are referred to as lost clusters. When either problem is present, your file system is unreliable and must be fixed. In early versions of DOS, you would fix this using the external command CHKDSK.EXE, which is short for Check Disk. This program would fix the file system by taking the clusters that were apparently part of a chain (the lost clusters) and converting them to a file, usually something like FILE0001.CHK. If you see this on your hard drive, you can usually delete it safely, since it is probably something that you cannot make sense of anyway. But if you want, you can try opening it in a text editor to see if it contains anything you have been missing. If you have cross-linked files, CHKDSK.EXE will convert them to two separate files that are no longer cross-linked. Of course, at least one of them must be corrupt, since two different files cannot both use the one cluster. In later versions of DOS, and in Windows, CHKDSK.EXE was replaced with a new utility called SCANDISK.EXE, which does essentially the same things.
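To make the two failure modes concrete, here is a small Python sketch that scans a toy FAT for them. It is not how CHKDSK or SCANDISK actually work internally. The dictionary-based FAT, the `check_fat` name, and the `0xFFFF` end-of-file value are all assumptions made for illustration.

```python
END_OF_CHAIN = 0xFFFF   # assumed FAT-16 end-of-file marker
FREE = 0

def check_fat(fat, file_starts):
    """Return (cross_linked, lost) sets of cluster numbers.

    fat:         dict mapping cluster number -> FAT entry
    file_starts: dict mapping file name -> starting cluster
    """
    owners = {}   # cluster -> names of the files whose chains reach it
    for name, cluster in file_starts.items():
        seen = set()
        while cluster not in (END_OF_CHAIN, FREE) and cluster not in seen:
            seen.add(cluster)
            owners.setdefault(cluster, []).append(name)
            cluster = fat[cluster]
    # Cross-linked: a cluster that more than one file's chain passes through.
    cross_linked = {c for c, names in owners.items() if len(names) > 1}
    # Lost: a cluster marked as in use that no file's chain ever reaches.
    lost = {c for c, entry in fat.items() if entry != FREE and c not in owners}
    return cross_linked, lost

fat = {2: 3, 3: END_OF_CHAIN, 4: 3, 5: 6, 6: END_OF_CHAIN, 7: FREE}
files = {"A.TXT": 2, "B.TXT": 4}
print(check_fat(fat, files))
```

In this toy FAT, cluster 3 is claimed by both files (cross-linked), while clusters 5 and 6 form a chain that no file points to (lost clusters).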
Because of these and other problems that can occur, each FAT volume actually stores two consecutive copies of the FAT. The first is the normal working copy, and the second is a backup copy that is used if the first becomes corrupted.
A related issue is file fragmentation. When a file is deleted, the clusters it used are marked with a zero to indicate that they are available for use. The contents are not removed, though, which is why you can sometimes “undelete” a file if you act before those clusters have been reallocated to a new file. Now, when a file is saved, the operating system consults the FAT, and begins saving the file in the first available cluster. If a second cluster is required, the next available cluster is used. But this second cluster may be nowhere near the first. And perhaps a third cluster is required, and it is nowhere near the other two. This is file fragmentation. This can reduce performance since the heads of the hard drive must travel some distance between each cluster to load the file. It is a good idea to periodically defragment the drive, which means to use a utility that moves the data contained in various clusters around so that each file uses a series of contiguous clusters that are not spread out all over the place. This means also updating all the records in the FAT so that the file can be retrieved after the defragmentation has occurred. DOS has an external command called DEFRAG that can be used to do this, and many utility packages (such as Norton Utilities) had utilities for this purpose as well.
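The way fragmentation arises can be shown with a simplified model of the allocation rule described above (always take the first available cluster). The real DOS allocation strategy has more detail than this, so treat the sketch as an illustration only; `allocate` and `delete` are invented names and `0xFFFF` is an assumed FAT-16 end-of-file value.

```python
END_OF_CHAIN = 0xFFFF   # assumed FAT-16 end-of-file marker
FREE = 0

def allocate(fat, clusters_needed):
    """Claim the first free clusters found and link them into a chain."""
    free = [c for c in sorted(fat) if fat[c] == FREE][:clusters_needed]
    if clusters_needed < 1 or len(free) < clusters_needed:
        raise OSError("cannot allocate the requested clusters")
    for current, nxt in zip(free, free[1:]):
        fat[current] = nxt              # each cluster points to the next
    fat[free[-1]] = END_OF_CHAIN
    return free

def delete(fat, chain):
    """Mark a file's clusters as free; the data itself is left in place."""
    for cluster in chain:
        fat[cluster] = FREE

fat = {c: FREE for c in range(2, 12)}   # a tiny disk: clusters 2 to 11
a = allocate(fat, 2)                    # gets clusters 2 and 3
b = allocate(fat, 2)                    # gets clusters 4 and 5
delete(fat, a)                          # clusters 2 and 3 become free again
d = allocate(fat, 4)                    # must skip over file b
print(d)                                # [2, 3, 6, 7]
```

The last file ends up split into two pieces on either side of file `b`, which is exactly the situation a defragmenter cleans up.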
Directories and Directory Entries
In each FAT volume, right after the two copies of the FAT, we come to the root directory. In DOS, this is represented by the symbol \ (in Unix, it is just the opposite: / ). This is the top of the directory structure, and is always created when the disk is formatted and FAT is installed. The word directory, in this context, actually has two different meanings.
Technically, a directory is a listing of contents. But in common usage, we often use it to denote the container of the contents. For example, if you go into a large office building, there is frequently a directory in the lobby that tells you where you can find the particular office you are looking for. But that directory does not “contain” the office, it simply tells you where to find it. Yet in computers, we often use the word directory to mean the place where a file is located, rather than the table where we look up its location. This can get confusing. It is better to use the word directory to mean the table where we look up the information, and use a different word, such as “folder”, to mean where a file is located. Of course, on a deep level these are all metaphors we use to help us make sense of what the computer is doing. The computer never gets confused, it is just us poor carbon-based life forms that get turned around.
If we use the word directory to mean the table where we look things up, the root directory is a table that records the location of all of the top-level folders (directories) on the drive, and of any files that are not in one of those folders. This table (on a FAT-16 hard drive) has 512 slots, and in each slot there is room for a 32-byte entry. It follows that the root can hold a maximum of 512 “objects”, where those objects are either files or other folders. When a folder is created, that folder gets a directory table of its own, built from the same 32-byte entries. Unlike the root directory, a folder’s table is stored in ordinary clusters like a file, so it can grow as entries are added. The 32-byte description allows 8 bytes for the file (or folder) name, 3 bytes for the file’s extension, and additional bytes describe the attributes (read-only, system file, hidden, archive, etc.), the time and date last modified, etc. The last six bytes store the starting cluster number (2 bytes) and the file size in bytes (4 bytes). Incidentally, the space reserved for the root directory on a floppy disk is smaller, so only 224 entries are possible on a 1.44MB floppy.
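As a sketch of what one of these 32-byte entries looks like, the Python below unpacks one with the standard `struct` module. The field layout is the classic DOS one as I understand it (8-byte name, 3-byte extension, 1 attribute byte, 10 reserved bytes, 2-byte time, 2-byte date, 2-byte starting cluster, 4-byte size). The sample entry and the `parse_entry` name are made up for the example.

```python
import struct

# Assumed layout of a classic 32-byte DOS directory entry (little-endian):
# name(8) ext(3) attributes(1) reserved(10) time(2) date(2) cluster(2) size(4)
ENTRY = struct.Struct("<8s3sB10sHHHI")

ATTRS = {0x01: "read-only", 0x02: "hidden", 0x04: "system",
         0x10: "directory", 0x20: "archive"}

def parse_entry(raw):
    """Decode one 32-byte directory entry into a dictionary."""
    name, ext, attr, _reserved, _time, date, start, size = ENTRY.unpack(raw)
    return {
        "name": name.decode("ascii").rstrip(),   # names are space-padded
        "ext": ext.decode("ascii").rstrip(),
        "attributes": [label for bit, label in ATTRS.items() if attr & bit],
        # DOS packs the date as: year since 1980 (7 bits), month (4), day (5)
        "date": (1980 + (date >> 9), (date >> 5) & 0x0F, date & 0x1F),
        "start_cluster": start,
        "size": size,
    }

# Build a sample entry for MYFILE.TXT, dated 15 October 1998.
dos_date = ((1998 - 1980) << 9) | (10 << 5) | 15
raw = ENTRY.pack(b"MYFILE  ", b"TXT", 0x20, bytes(10), 0, dos_date, 10793, 5000)
print(parse_entry(raw))
```

Note how the name is padded with spaces to fill all 8 bytes, and how the starting cluster in the entry is the number DOS takes to the FAT to begin following the chain.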
Because the root directory can only hold 512 entries, and modern hard drives typically hold many thousands of files, it is necessary that a directory structure be created. The mechanics of how to do this in DOS are the subject of the next lesson, but it is absolutely necessary. Periodically someone will encounter a problem saving a file, and when you investigate, it turns out they were trying to save every file in the root directory and eventually ran out of slots. On Windows 9x machines, this happens even more quickly because of long file name support. If you noted above that only 8 bytes are reserved for the file name in each entry, and if you realized that each character requires one byte to describe it, you quickly see why DOS only allows 8-character file names. In Windows 9x, you can use longer file names, but only by using multiple directory entries for each long file name. It is not unusual, therefore, to have the root directory in Windows 9x fill up when only 200-300 items are stored, if long file names are used.
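A quick estimate shows how fast the slots disappear. The sketch below assumes that each extra long-file-name entry holds 13 characters of the name and that every long name also needs one ordinary 8.3 entry, which is my understanding of how Windows 9x stores them. The function names are invented for the example.

```python
ROOT_SLOTS = 512           # root directory entries on a FAT-16 hard drive
CHARS_PER_LFN_ENTRY = 13   # assumed characters held by each extra entry

def entries_needed(name_length):
    """Directory entries used by one file stored with a long name."""
    extra = -(-name_length // CHARS_PER_LFN_ENTRY)   # round up
    return 1 + extra                                 # plus the 8.3 entry

def files_that_fit(name_length, slots=ROOT_SLOTS):
    """How many such files fit before the root directory is full."""
    return slots // entries_needed(name_length)

for length in (13, 20, 40):
    print(f"{length}-character names: {entries_needed(length)} entries each, "
          f"{files_that_fit(length)} files fill the root")
```

With names of 13 characters or fewer the root directory is full after 256 files, and with 20-character names after 170, which is in line with the few hundred items mentioned above.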
That is the technical reason for creating a directory structure. There is also a practical reason, and that is that a good directory structure can help you organize your data in useful ways. Imagine a company that stored all of its documents in a “documents room”. Every day, people would open the door, throw in a bunch of documents, and close the door again. One day, you need to find a particular document, so you have to go into this room, and look at each document one at a time until you find the one you want. Obviously, finding it could take a lifetime, and this is a really stupid way to save documents. Instead, you would create a filing system, using file cabinets, each divided into drawers, and in each drawer a bunch of hanging folders, and in each hanging folder a number of manila folders, and in each of them a number of documents. Then, when you wanted to find a particular document, you would look up in a directory to see which filing cabinet it was in, then read the drawer labels to see which drawer it was in, then read the labels on the folders, etc. until you had the document. You might perform this task in only a few minutes if the filing system was logical. Well, this is what you want to do with your hard drive. Under the root directory, you create your top-level “directories”, which are the equivalent of your filing cabinets. Then inside of each of these you can create sub-folders (“drawers”), and in each of these sub-folders you can create additional sub-folders (“hanging folders”), etc. Then, when you need to find the memo you wrote to your boss in October of 1998, it will be easy to find it.