DOS Lesson 9: The File System

Once you begin creating files, both you and the operating system
need some way to keep track of them. How is this done? In DOS, the answer lies
with something called the File Allocation Table (FAT). To understand how this
important component of DOS functions, let us first look at how disks are
organized to store data.

Each disk is divided into sectors, which are 512 bytes in size. These sectors
lie along tracks, which are concentric rings on the disk. On a hard drive, these
are created as part of the low-level formatting process, which is done at the
factory. In the old days users performed the low-level format themselves, but
this has not been necessary for some years. On an old floppy drive, you could
conceivably
use sectors as the basic unit for storing file data, since the total number
of sectors would not be that large. On a 360K floppy disk, you would have 720
sectors to deal with. But on a large hard drive (like 100MB!!), you would need
to keep track of 200,000 of these sectors, with all of the overhead of assigning
addresses to each sector and storing information about them in a table. Also,
512 bytes is pretty small as files go. Most files would require multiple sectors
to store their information, possibly hundreds of them. So the sectors were collected
into larger units, called clusters.
The cluster is sometimes referred to as the allocation unit, because
it is the minimum amount of space that can be allocated to a file.

For example, suppose the size of a cluster is 4096 bytes (i.e.
8 sectors). If you have a file that is 3000 bytes in size, it will be saved
using one cluster, and 1096 bytes of that cluster will be wasted. That is because
only one file can ever “own” a cluster. If your file was 5000 bytes,
you would use two clusters (total 8192 bytes), and 3192 bytes of the second
cluster would be wasted. Assuming that file sizes are random, you can
quickly show that on average you waste one-half of a cluster per file saved.
So there is some incentive to minimize this wastage, and the best way is to reduce
the size of the partition. The reason for this has to do with how cluster sizes
are determined, and that leads us to the File Allocation Table.
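
The arithmetic in the example above can be checked with a short Python sketch.
The function name cluster_usage is my own invention for illustration; the
4096-byte default cluster size is simply the one used in the example.

```python
def cluster_usage(file_size, cluster_size=4096):
    """Return (clusters_used, wasted_bytes) for a single file.

    A cluster can belong to only one file, so the file size is
    rounded up to a whole number of clusters.
    """
    clusters = (file_size + cluster_size - 1) // cluster_size  # round up
    wasted = clusters * cluster_size - file_size
    return clusters, wasted


print(cluster_usage(3000))  # (1, 1096)
print(cluster_usage(5000))  # (2, 3192)
```

Trying a few more file sizes shows the waste bouncing around between zero and
almost a full cluster, which is where the "half a cluster on average" rule of
thumb comes from.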

The File Allocation Table (FAT) is a place on the disk where information
about the files is stored. Metaphorically, it is like the catalog in a library.
It is a table that stores the name of each file, and has a pointer to the place
on the disk where that file can be found. It also has a few other things. These
address pointer entries are stored as binary numbers, and the number of bits
used determines the type of FAT in use. FAT-12, which is used for floppy disks
(and for hard disks smaller than 17MB, should you ever encounter one <g>),
stores the information in 12 bits per cluster. FAT-16, used in DOS and in versions
of Windows prior to the OSR-2 version of Windows 95, stores the information
in 16 bits. FAT-32, introduced to some computers in Windows 95 OSR-2, and in
general to most people in Windows 98, uses 32 bits to store this information.
Why does this matter? Because the maximum number of clusters is determined by
the bits available to address each one. Since each bit is a binary 0 or 1, the
formula is based on powers of two. Note that in FAT-12 and FAT-16, a few of
the theoretically available slots have been reserved for the use of the file
system. In FAT-32, 4 of the 32 bits in each address have been reserved for other
uses, leaving 28 bits for pure addressing.

File System    Possible Entries         Actual Entries
FAT-12         2^12 = 4,096             4,086
FAT-16         2^16 = 65,536            65,526
FAT-32         2^28 = 268,435,456       268 million
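
If you want to verify the "Possible Entries" column yourself, a few lines of
Python will do it. This is only a sketch: the dictionary name ADDRESS_BITS and
the function possible_entries are made up for this illustration, and the bit
counts are the ones given above (with 28 usable bits for FAT-32).

```python
# Bits available for addressing clusters in each FAT variant.
ADDRESS_BITS = {"FAT-12": 12, "FAT-16": 16, "FAT-32": 28}


def possible_entries(bits):
    """Each bit doubles the number of distinct cluster addresses."""
    return 2 ** bits


for name, bits in ADDRESS_BITS.items():
    print(f"{name}: 2^{bits} = {possible_entries(bits):,}")
```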

With this information, we can begin to do some calculations on
cluster sizes. On a hard drive formatted using FAT-16, here is what you would
find. (Note: these numbers are approximate, since hard drive sizes are stated
differently in some cases.) I will assume 5,000 files per hard drive as an example.
Note that cluster size has to be a whole number of sectors (512 bytes each),
so if you are doing the calculations you will need to round up to the next whole
multiple of 512.

Hard Drive Size    Cluster Size (bytes)    Estimated Wastage (5,000 Files)
100 MB             2,048 (4 sectors)       5 MB
500 MB             8,192 (16 sectors)      20 MB
800 MB             12,800 (25 sectors)     32 MB
1.2 GB             18,944 (37 sectors)     47 MB
Since on a large hard drive the figure of 5,000 files is a drastic
underestimate (and note that you need to throw in all of the directories and
subdirectories, each of which uses a slot), you can see why FAT-16 is just not
acceptable for larger hard drives.

The Structure of FAT

Assuming you have a FAT-16 file system, you have 65,526 clusters
available for use when you begin. Of course, installing the operating system
is going to use up a lot of those slots, and each additional program you install
uses up many more. Here is how the FAT is structured:

Cluster Number    Contents
Cluster 0         Reserved for DOS
Cluster 1         Reserved for DOS
Cluster 2         2 (used to store a small file)
Cluster 3         4 (used to store data, extends to cluster 4)
Cluster 4         5 (used to store data, extends to cluster 5)
Cluster 5         7 (used to store data, extends to cluster 7)
Cluster 6         0 (empty, available for use)
Cluster 7         FFFh (used to store data, is the last cluster in the chain)
Cluster 8         0 (empty, available for use)
...
Cluster 65524     0 (empty, available for use)
Cluster 65525     0 (empty, available for use)
Cluster 65526     0 (empty, available for use)

As you can see, in each slot of the FAT there is status information.
If the cluster is free, the value of zero is recorded. If the cluster contains
data, but all of the data fits in that one cluster, the cluster number itself
is stored. If the data extends over multiple clusters, the number of the next
cluster in the chain is stored. If this is the last cluster in the chain, an
end-of-file marker is stored (the hexadecimal number FFF).

Ordinarily, you should not have any problems retrieving a file.
The FAT would have a pointer that says that your file MYFILE.TXT begins in cluster
10793, for instance, and DOS would go there first and retrieve what is in that cluster.
In looking at the FAT entry, it would see the number 10794, for instance, and
know that the next cluster in that chain was 10794, and it would go there and
retrieve the contents of that cluster and append them to the end of the contents
of the first cluster. It would keep doing this until it reached the cluster
that had FFF stored, and it would know that this meant it had found the end
of the file and could stop.
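
The retrieval process just described is a simple loop, and a small Python
sketch may make it clearer. This follows the simplified model used in this
lesson (zero means free, FFF marks the end of a chain, and a one-cluster file
stores its own cluster number). The dictionary fat and the function
follow_chain are illustrations of my own, not anything DOS provides.

```python
FREE = 0
END_OF_CHAIN = 0xFFF  # end-of-file marker as written in this lesson

# Toy FAT based on the table above: key = cluster number, value = FAT entry.
fat = {2: 2, 3: 4, 4: 5, 5: 7, 6: FREE, 7: END_OF_CHAIN, 8: FREE}


def follow_chain(fat, start):
    """Return the list of clusters that make up the file starting at `start`."""
    chain = []
    cluster = start
    while True:
        chain.append(cluster)
        entry = fat[cluster]
        if entry == END_OF_CHAIN or entry == cluster:
            return chain  # last (or only) cluster of the file
        if entry == FREE or entry in chain:
            raise ValueError(f"damaged chain at cluster {cluster}")
        cluster = entry  # move on to the next cluster in the chain


print(follow_chain(fat, 3))  # [3, 4, 5, 7]
print(follow_chain(fat, 2))  # [2]
```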

Two things can go wrong with this, though. First, you can have
a situation where two different clusters, each part of a different file, point
to the same cluster as part of their chain. This is called cross-linked
files. The second problem is when you have clusters that appear to be
part of a chain, but the whole chain is not present. These are referred to as
lost clusters. When either problem is present, your file system
is unreliable and must be fixed. In early versions of DOS, you would fix this
using the external command CHKDSK.EXE, which is short for Check Disk. This program
would fix the file system by taking the clusters that were apparently part of
a chain (the lost clusters) and converting them to a file, usually something
like FILE0001.CHK. If you see this on your hard drive, you can usually delete
it safely since it is probably something that you cannot make sense of anyway.
But if you want you can try opening it in a text editor and see if it contains
anything you have been missing. If you have cross-linked files, CHKDSK.EXE will
convert them to two separate files that are no longer cross-linked. Of course,
at least one of them must be corrupt, since you cannot have two different files
use the one cluster. In later versions of DOS, and in Windows, CHKDSK.EXE was
replaced with a new utility called SCANDISK.EXE, which does essentially the
same things.
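
To see how a checking utility might spot these two problems, here is a
simplified Python sketch. It is not how CHKDSK or SCANDISK is actually
written; the function check_fat, the toy FAT, and the file names are all
hypothetical. The idea is to walk every chain that a directory knows about,
note any cluster claimed by two files, and then look for used clusters that
no file claims.

```python
FREE = 0
END_OF_CHAIN = 0xFFF  # end marker as used in this lesson


def check_fat(fat, start_clusters):
    """Return (cross_linked_clusters, lost_clusters) for a toy FAT.

    fat:            dict of cluster number -> FAT entry
    start_clusters: dict of file name -> first cluster of that file
    """
    owner = {}            # cluster -> name of the file that claimed it first
    cross_linked = set()
    for name, cluster in start_clusters.items():
        seen = set()
        while cluster not in seen:
            seen.add(cluster)
            if cluster in owner and owner[cluster] != name:
                cross_linked.add(cluster)  # two files share this cluster
                break
            owner[cluster] = name
            entry = fat.get(cluster, FREE)
            if entry in (FREE, END_OF_CHAIN) or entry == cluster:
                break
            cluster = entry
    # Lost clusters: marked as in use, but not reachable from any file.
    lost = sorted(c for c, e in fat.items() if e != FREE and c not in owner)
    return sorted(cross_linked), lost


fat = {2: 3, 3: END_OF_CHAIN, 4: 3, 5: 6, 6: END_OF_CHAIN, 7: FREE}
files = {"A.TXT": 2, "B.TXT": 4}
print(check_fat(fat, files))  # ([3], [5, 6])
```

In this toy disk, both files run into cluster 3 (cross-linked), and clusters
5 and 6 form a chain that no file points to (lost clusters).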

Because of these, and other problems that can occur, each DOS
FAT is actually stored as two consecutive copies. The first is
the normal working copy, and the second is a backup copy that is used if the
first becomes corrupted.

A related issue is file fragmentation.
When a file is deleted, the clusters it used are marked with a zero to indicate
that they are available for use. The contents are not removed,
though, which is why you can sometimes “undelete” a file if you act
before those clusters have been reallocated to a new file. Now, when a file
is saved, the operating system consults the FAT, and begins saving the file
in the first available cluster. If a second cluster is required, the next available
cluster is used. But this second cluster may be nowhere near the first. And
perhaps a third cluster is required, and it is nowhere near the other two. This
is file fragmentation. This can reduce performance since the heads of the hard
drive must travel some distance between each cluster to load the file. It is
a good idea to periodically defragment the drive, which means
to use a utility that moves the data contained in various clusters around so
that each file uses a series of contiguous clusters that are not spread out
all over the place. This means also updating all the records in the FAT so that
the file can be retrieved after the defragmentation has occurred. DOS has an
external command called DEFRAG that can be used to do this, and many utility
packages (such as Norton Utilities) had utilities for this purpose as well.
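
The way fragmentation arises can be shown with a toy allocator in Python. This
is only a sketch under the "first available cluster" rule described above; the
functions allocate and is_fragmented are hypothetical names, and the FAT is the
same kind of simplified dictionary model used for illustration in this lesson.

```python
FREE = 0
END_OF_CHAIN = 0xFFF  # end marker as used in this lesson


def allocate(fat, clusters_needed):
    """Claim the first free clusters for a new file and link them into a chain."""
    if clusters_needed < 1:
        raise ValueError("a file needs at least one cluster in this sketch")
    free = [c for c in sorted(fat) if fat[c] == FREE][:clusters_needed]
    if len(free) < clusters_needed:
        raise OSError("disk full")
    for current, following in zip(free, free[1:]):
        fat[current] = following     # point each cluster at the next one
    fat[free[-1]] = END_OF_CHAIN     # mark the last cluster
    return free


def is_fragmented(chain):
    """A file is fragmented if its clusters are not consecutive."""
    return any(b != a + 1 for a, b in zip(chain, chain[1:]))


# Clusters 4, 7 and 8 were freed when older files were deleted.
fat = {2: 3, 3: END_OF_CHAIN, 4: FREE, 5: 6, 6: END_OF_CHAIN, 7: FREE, 8: FREE}
chain = allocate(fat, 3)
print(chain, is_fragmented(chain))  # [4, 7, 8] True
```

A defragmenter would, in effect, move data around until is_fragmented came
back False for every file, and then rewrite the FAT entries to match.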

Directories and Directory Entries

In each FAT volume, right after the two copies of the FAT, we
come to the root directory. In DOS, this is represented by the symbol \ (in Unix,
it is just the opposite: / ). This is the top of the directory structure, and
is always created when the disk is formatted and FAT is installed. The word
directory, in this context, actually has two different meanings.
Technically, a directory is a listing of contents. But in common usage, we often
use it to denote the container of the contents. For example, if you go into
a large office building, there is frequently a directory in the lobby that tells
you where you can find the particular office you are looking for. But that directory
does not “contain” the office, it simply tells you where to find it.
Yet in computers, we often use the word directory to mean the place where a
file is located, rather than the table where we look up its location. This can
get confusing. It is better to use the word directory to mean the table where
we look up the information, and use a different word, such as “folder”,
to mean where a file is located. Of course, on a deep level these are all
metaphors we use to help us make sense of what the computer is doing. The computer
never gets confused, it is just us poor carbon-based life forms that get turned
around.

If we use the word directory to mean the table where we look things
up, the root directory is a table that records the location of all of the folders
(directories) on the drive, and of any files that are not in one of those folders.
This table (on a hard drive) has 512 slots, and in each slot there is room for
a 32-byte entry. When a folder is created, that folder has a directory table
that also has 512 slots, each with a 32-byte entry. It follows that each folder,
from the root on down, can hold a maximum of 512 “objects”, where
those objects are either files or other folders. The 32-byte description allows
8 bytes for the file (or folder) name, 3 bytes for the file’s extension, and
additional bytes describe the attributes (read-only, system file, hidden, archive,
etc.), the date created or last modified, etc. The last six bytes store
the starting cluster number (two bytes) and the file size in bytes (four bytes).
Incidentally,
the space reserved for the root directory on a floppy disk is smaller, so only
224 entries are possible.
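
For the curious, here is a Python sketch that packs and unpacks one 32-byte
directory entry. The lesson only summarizes the layout, so the exact field
order used here (8-byte name, 3-byte extension, 1 attribute byte, 10 reserved
bytes, 2-byte time, 2-byte date, 2-byte starting cluster, 4-byte size) is
filled in from the standard FAT directory entry format. The function
parse_entry and the sample values are hypothetical.

```python
import struct

# Little-endian layout of one 32-byte FAT directory entry.
ENTRY = struct.Struct("<8s3sB10sHHHI")


def parse_entry(raw):
    """Decode one 32-byte directory entry into a dictionary."""
    name, ext, attr, _reserved, _time, _date, start, size = ENTRY.unpack(raw)
    return {
        "name": name.decode("ascii").rstrip(),   # names are padded with spaces
        "ext": ext.decode("ascii").rstrip(),
        "read_only": bool(attr & 0x01),
        "hidden": bool(attr & 0x02),
        "system": bool(attr & 0x04),
        "directory": bool(attr & 0x10),
        "archive": bool(attr & 0x20),
        "start_cluster": start,
        "size": size,
    }


# Build a sample entry for MYFILE.TXT: archive bit set, 5000 bytes long,
# starting in cluster 10793 (time and date left as zero).
raw = ENTRY.pack(b"MYFILE  ", b"TXT", 0x20, bytes(10), 0, 0, 10793, 5000)
print(len(raw))          # 32
print(parse_entry(raw))
```

Notice that the entry records only the first cluster. Everything after that
comes from following the chain in the FAT.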

Because the root directory can only hold 512 entries, and modern
hard drives typically hold many thousands of files, it is necessary that a directory
structure be created. The mechanics of how to do this in DOS are the subject
of the next lesson, but it is absolutely necessary. Periodically someone will
encounter a problem saving a file, and when you investigate, it turns out they
were trying to save every file in the root directory and eventually ran out
of slots. In Windows 9.x machines, this happens even more quickly because of
long file name support. If you noted above that only 8 bytes
are reserved for the file name in each entry, and if you realized that each
character requires one byte to describe it, you quickly see why DOS only allows
8-character file names. In Windows 9.x, you can use longer file names, but only
by using multiple directory entries for each long file name. It is not unusual,
therefore, to have a directory in Windows 9.x fill up when only 200-300 items
are stored, if long file names are used.
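
A quick Python sketch shows why long names eat up slots so fast. The lesson
does not say how many characters fit in each extra entry; the figure of 13
characters per long-name entry used below is my assumption, taken from the
usual description of Windows 95 long file name support. The function names are
made up, and the sketch ignores the fact that a name already fitting the 8.3
pattern may need no extra entries at all.

```python
CHARS_PER_LFN_ENTRY = 13  # assumption: characters held by each long-name entry


def slots_needed(filename):
    """Directory slots used by one file with a long name."""
    lfn_entries = -(-len(filename) // CHARS_PER_LFN_ENTRY)  # ceiling division
    return 1 + lfn_entries  # one ordinary 8.3 entry plus the long-name entries


def files_that_fit(name_length, directory_slots=512):
    """How many files with names of this length fit in one directory table."""
    return directory_slots // slots_needed("x" * name_length)


print(slots_needed("x" * 20))  # 3
print(files_that_fit(20))      # 170
```

With 20-character names, a 512-slot table fills up after only 170 files,
which is in the same ballpark as the 200-300 items mentioned above.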

That is the technical reason for creating a directory structure.
There is also a practical reason, and that is that a good directory structure
can help you organize your data in useful ways. Imagine a company that stored
all of its documents in a “documents room”. Every day, people would
open the door, throw in a bunch of documents, and close the door again. One
day, you need to find a particular document, so you have to go into this room,
and look at each document one at a time until you find the one you want. Obviously,
finding it this way could take a lifetime, and this is a really stupid way to save documents.
Instead, you would create a filing system, using file cabinets, each divided
into drawers, and in each drawer a bunch of hanging folders, and in each hanging
folder a number of manila folders, and in each of them a number of documents.
Then, when you wanted to find a particular document, you would look up in a
directory to see which filing cabinet it was in, then read the drawer labels
to see which drawer it was in, then read the labels on the folders, etc. until
you had the document. You might perform this task in only a few minutes if the
filing system was logical. Well, this is what you want to do with your hard
drive. Under the root directory, you create your top-level “directories”,
which are the equivalent of your filing cabinets. Then inside of each of these
you can create sub-folders (“drawers”), and in each of these sub-folders
you can create additional sub-folders (“hanging folders”), etc. Then,
when you need to find the memo you wrote to your boss in October of 1998, it
will be easy to find it.