HTML and SGML
HTML stands for HyperText Markup Language. It is a coding language that
uses a method called markup to create hypertext. HTML was defined as an
application of a more general markup language called SGML, which stands for
Standard Generalized Markup Language. It is far simpler than SGML itself, but
is gradually returning to its SGML roots as it evolves. This evolution of HTML
is worth knowing at least a little about, since HTML is not set in stone. The
changes that are occurring have their reasons, mostly to add capabilities that
previous versions lacked.
In the beginning…
In 1989, Tim Berners-Lee, working at the European particle physics
laboratory known as CERN (from the French Conseil Européen pour la Recherche
Nucléaire), proposed a system to allow scientists to share papers with one
another over electronic networks. His idea became what is called the World-Wide
Web. Since these documents were to be shared, some common method of coding them
needed to be developed. Tim Berners-Lee suggested that it be based on the
already existing SGML. Here are a few quotes from a 1990 CERN memo that
Berners-Lee wrote with Robert Cailliau:
HyperText is a way to link and access information of various kinds
as a web of nodes in which the user can browse at will. It provides a single
user-interface to large classes of information (reports, notes, data-bases,
computer documentation and on-line help).
We propose a simple scheme incorporating servers already available
at CERN…
A program which provides access to the hypertext world we call a
browser…
It would be inappropriate for us (rather than those responsible)
to suggest specific areas, but experiment online help, accelerator online
help, assistance for computer center operators, and the dissemination of information
by central services such as the user office and CN [Computing & Networks]
and ECP [Electronics & Computing for Physics] divisions are obvious candidates.
WorldWideWeb (or W3) intends to cater for these services across the HEP
[High Energy Physics] community.
As you can see, Tim Berners-Lee put all of the basic pieces into place.
In 1992, when there were all of 50 web servers in the world, CERN released
the portable Web browser as freeware. Marc Andreessen, who was working at the
National Center for Supercomputing Applications, created a browser called Mosaic
which was released in 1993. Shortly after that, he left NCSA to found Netscape.
The first version of the Netscape browser implemented HTML 1.0.
HTML 1.0 and 2.0
In 1992, Berners-Lee and the CERN team released the first draft of HTML 1.0,
which was finalized in 1993. This specification was so simple it could be
printed on one side of a piece of paper, but even then it contained the basic
idea that has become central in the recent evolution of HTML: the separation
between logical structure (what a piece of text is, such as a heading or a
paragraph) and presentation (how it should look on the screen). This is the
single most important idea to grasp in learning HTML, IMHO. In 1994, HTML 2.0
was developed by the Internet Engineering Task Force’s HTML Working Group. This
group was later disbanded in favor of the World Wide Web Consortium
(http://www.w3.org), which continues to develop HTML.
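To make the distinction between logical structure and presentation concrete, here is a small sketch of the same two lines of text marked up both ways. It is an illustration only, not taken from the HTML 1.0 or 2.0 specifications; the FONT tag in particular arrived later, and the heading text is just sample content.

```html
<!-- Presentational markup: says how the text should look, not what it is -->
<font size="5"><b>Browsers and HTML</b></font>
<p>Each browser contains a <i>rendering engine</i>.</p>

<!-- Logical markup: says what the text is, and leaves the look to the browser -->
<h2>Browsers and HTML</h2>
<p>Each browser contains a <em>rendering engine</em>.</p>
```

In a graphical browser the two versions can look much the same. The difference is that only the second one tells a program (a text browser, a speech reader, an indexing tool) that the first line is a heading and that the phrase is emphasized.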
Browsers and HTML
Netscape was just one of a number of browsers available. Mosaic was still offered
by NCSA, Lynx was available on Unix machines, and a few other companies were
creating browsers. One of them, Spyglass, licensed its Mosaic-based code to
Microsoft, where it became the basis for Internet Explorer. Each browser
contains, at its heart, a rendering engine, which is the code that tells it how
to take your HTML and turn it into something you can see on the screen. What
happened at this point is that each company, most particularly Netscape and
Microsoft, started to develop its own “extensions” to HTML, often going in
different directions. This problem bedevils us to this day, though the upcoming
Netscape 6 browser may resolve this by being 100% compliant with the published
HTML standards. We are still waiting to see what this will look like.
W3C takes over: HTML 3.0 and HTML 3.2
The World Wide Web Consortium (W3C), which had taken over HTML development,
attempted to create some standardization in HTML 3.0. But there was so much
argument over what should be included that it never got beyond the draft discussion
stage. Finally, in 1996 a consensus version, HTML 3.2, was issued. This added
features like tables and text flowing around images to the official
specification, while maintaining backwards compatibility with HTML 2.0. This is
also a convenient point at which to mark where practice diverged from the
separation that Berners-Lee first made between logical structures and
presentational elements. As the Web took off in popularity, this breakdown
became widespread and serious, and the main focus of the W3C since then has
been to rectify the situation. An example of the breakdown is the widespread
use of tables and transparent “shim” GIFs to create page layout. While this
creates pages that look right in a graphical browser, the logical structure of
the page is pretty much destroyed, and such pages are frequently useless to
anyone using a text browser or a text-to-speech reader.
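The table-and-shim technique described above looks roughly like the following sketch. The file names (home.html, news.html, shim.gif) and the pixel widths are hypothetical, chosen only for illustration.

```html
<!-- Layout table: the grid exists only to position things on the screen -->
<table border="0" cellpadding="0" cellspacing="0" width="600">
  <tr>
    <!-- Left column: navigation links -->
    <td width="150" valign="top">
      <a href="home.html">Home</a><br>
      <a href="news.html">News</a>
    </td>
    <!-- shim.gif: a 1x1 transparent image stretched to force a 20-pixel gutter -->
    <td><img src="shim.gif" width="20" height="1" alt=""></td>
    <!-- Right column: the actual content of the page -->
    <td width="430" valign="top">
      <h1>Welcome</h1>
      <p>The main text of the page goes here.</p>
    </td>
  </tr>
</table>
```

Nothing in this markup says that the first cell is navigation, that the second is empty spacing, or that the third holds the real content. A text browser or speech reader simply works through the cells in source order, and has no way to tell a layout table like this one from a table of actual data.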
HTML 4.0x
The W3C released the HTML 4.0 specification at the end of 1997, and followed
with HTML 4.01 in 1999, which mostly corrected a few errors in the 4.0 specification.
This release attempted to correct some of the more egregious errors that 3.2
had allowed (encouraged?) designers to commit, particularly by introducing
Cascading Style Sheets. But in fact the W3C has abandoned HTML as the default
standard in favor of a move back towards its roots in SGML, a larger and more
complex language. There will probably never be another HTML specification.
XHTML 1.0
This is the successor to HTML. The “X” stands for Extensible. This
is a reformulation of HTML 4.01 within XML (Extensible Markup Language), which
has far more rigorous syntax rules, and is intended to start moving the
creation of Web pages away from HTML. XHTML 1.0 was released earlier this year,
and is the most current standard for creating Web pages. It introduces some
interesting changes in coding. For example, every element now has to be closed,
including paragraphs, and tag names must be written in lower case. Other tags,
like the FONT tag, are deprecated, and banished altogether from the Strict
version of the standard, in favor of using Cascading Style Sheets to control
all presentational elements.
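As a rough before-and-after sketch of these changes, here is the same fragment written in a loose HTML 4 style and then in XHTML 1.0 style. The class name "warning" and the sample text are made up for the example, and the style rule would normally sit in the head of the page or in a separate style sheet.

```html
<!-- Loose HTML 4 style: uppercase tags, unclosed paragraphs, FONT for color -->
<P><FONT COLOR="red">Warning:</FONT> save your work first.
<P>Then restart the browser.<BR>

<!-- XHTML 1.0 style: lowercase tags, every element closed, color moved to CSS -->
<style type="text/css">
  .warning { color: red; }
</style>
<p><span class="warning">Warning:</span> save your work first.</p>
<p>Then restart the browser.<br /></p>
```

Note the closing p tags and the self-closing form of the line break in the second version. Because the color now lives in a single style rule, changing every warning on a site from red to orange means editing one line rather than every FONT tag.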
Back to the browsers
Now, while standards are wonderful, that does not mean that browsers follow
them. No browser currently available is completely consistent with HTML 4.0,
which is already two and a half years old. Support for Cascading Style Sheets
(CSS), for instance, is spotty and incomplete in all browsers. Also, each browser
(rendering engine) interprets the specifications in different ways, leading
to the eternal complaint of pages looking different in different browsers. Plus,
most browsers have tried to maintain backwards compatibility with older standards,
which complicates things when a newer standard invalidates some aspect of an
older standard.
As I mentioned before, Netscape 6, which is still in development, is claimed
to be 100% standards compliant with HTML 4.0, XHTML 1.0, CSS1, and partially
compliant with CSS2. If they can pull it off, this would be wonderful for Web
developers. But we have to wait and see what happens. Also, Netscape is not
the only browser on the market. The leader, Microsoft’s Internet Explorer, has
better standards support than Netscape does among current browsers, but
Microsoft appears to have dropped full compliance from its plans for IE, and
has received a lot of criticism from the Web developer community on that
account. Netscape, meanwhile, has made the decision to drop backwards
compatibility from its rendering engine so as to get a lean, efficient,
standards-compliant browser. It is entirely possible, therefore, that many
pages that work fine now will stop working in Netscape 6 because they use
methods that are no longer acceptable.