Converting to XHTML: Simple Rules, Easy Guidelines

Converting from traditional HTML to XHTML 1.0 Transitional is quick and painless, as long as you observe a few simple rules and easy guidelines. (We really can't get enough of that phrase.) If you've written HTML, you can write XHTML. If you've never written HTML, you can still write XHTML. Let's zip through the (simple and easy) basics, shall we? Here are the rules of XHTML.

Open with the Proper DOCTYPE and Namespace

XHTML documents begin with markup that tells browsers how to interpret them and validation services how to test them for conformance. The first piece is the DOCTYPE (short for "document type") declaration. This handy little declaration informs the validation service which version of XHTML or HTML you're using. For reasons known only to a W3C committee member, the word DOCTYPE is always written in all caps.

Why a DOCTYPE?

XHTML allows designers/developers to author several different types of documents, each bound by different rules. The rules of each type are spelled out within the XHTML specifications in a long piece of text called a document type definition (DTD). Your DOCTYPE declaration informs validation services and modern browsers which DTD you followed in crafting your markup. In turn, this information tells those validation services and browsers how to handle your page.

DOCTYPE declarations are a key component of compliant web pages; your markup and CSS won't validate unless your XHTML source begins with a proper DOCTYPE. In addition, your choice of DOCTYPE affects the way most modern browsers display your site. The results might surprise you if you're not expecting them. In Chapter 11, "Working with Browsers Part I: DOCTYPE Switching and Standards Mode," we'll explain the impact of DOCTYPE in Internet Explorer and in Gecko-based browsers like Netscape, Mozilla, and Camino.

XHTML 1 offers three yummy choices of DTD and three corresponding DOCTYPE declarations: Strict, Transitional, and Frameset.

Which DOCTYPE Is Your Type?

Of the three flavors listed in the previous section, XHTML 1.0 Transitional is the one that's closest to the HTML we all know and love. That is to say, it's the only one that forgives presentational markup structures and deprecated elements and attributes.

The target attribute of the a (anchor) element is one such bit of deprecated business. If you want linked pages to open in new windows (or even if you don't want that but your client insists), Transitional is the only XHTML DTD that lets you do so with the target attribute:

<p>Visit <a href="http://www.whatever.org" target=
"_blank">whatever.org</a> in a new window.</p>

<p>Visit <a href="http://www.whatever.org/" target=
"bob">whatever.org</a> in a named new window.</p>

To open linked pages in new windows under XHTML 1.0 Strict, you would need to write JavaScript, and you'd also need to make sure the links work in a non-JavaScript-capable environment. Whether you should open linked pages in new windows is beside the point here. The point is that XHTML 1.0 Transitional lets you do so with a minimum of fuss.

XHTML 1.0 Transitional also tolerates background colors applied to table cells and other such stuff you really ought to do with CSS instead of in your markup. If your DOCTYPE declaration states that you've written XHTML 1.0 Strict but your page includes the deprecated bgcolor attribute, validation services will flag it as an error, and some compliant browsers will ignore it (that is, they will not display the background color). By contrast, if you declare that you're following the XHTML 1.0 Transitional DTD, bgcolor will not be marked as an error, and browsers will honor it instead of ignoring it.

In short (well, maybe it's too late for "in short"), XHTML 1.0 Transitional is the perfect DTD for designers who are making the transition to modern web standards. They don't call it Transitional for nothing.

It could be argued that XHTML 1.0 Strict is the best choice for transitioning designers and developers, as it could be argued that joining the Marine Corps is a great way to shed unwanted pounds. Leaping from sloppy old HTML 4 to rigorous XHTML 1.0 Strict will put hair on the chest of your understanding that markup is for structure, not visual effects, and it might be the right decision for you. But for this chapter's (and most readers') purposes, we'll use XHTML 1.0 Transitional. Its DOCTYPE declaration appears next:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

The frameset DOCTYPE is used for those documents that have a <frameset> element in them; in fact, you must use this DOCTYPE with your <frameset> documents.

The DOCTYPE declaration must be typed into the top of every XHTML document, before any other code or markup. It precedes the <head> element, the <title> element, the meta elements, and the links to style sheet and JavaScript files. It also, quite naturally, comes before your content. In short, the DOCTYPE declaration precedes everything.

(Standards-savvy readers might wonder why we haven't mentioned something that can come before the DOCTYPE declaration, namely the optional XML prolog. We'll get to it in a few paragraphs.)

Follow DOCTYPE with Namespace

The DOCTYPE declaration is immediately followed by an XHTML namespace declaration that enhances the old-fashioned <html> element:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en">

A namespace in XML is a collection of element types and attribute names, and the namespace declaration identifies the one you're using by a unique URI, which in this case is http://www.w3.org/1999/xhtml. (The URI serves as a name; it is not a file that browsers download.) The two additional attributes both specify that your document is written in English: xml:lang says so to XML processors, and lang says so to browsers that treat the page as HTML.

With the DOCTYPE and namespace declarations in place, your XHTML 1.0 Transitional page would start out like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en">
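If you're curious how a strict XML parser sees that opening tag, here is a minimal sketch in Python. It assumes Python's standard xml.etree.ElementTree module as a stand-in for "an XML parser" (nothing in this chapter prescribes a particular tool), and the head, title, and body elements are filler added only so the fragment is complete enough to parse.

```python
import xml.etree.ElementTree as ET

# The URI ElementTree reports for the reserved "xml:" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Filler head/body content is an assumption, added so the fragment parses.
page = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>Transitional Industries</title></head>"
    "<body></body>"
    "</html>"
)

root = ET.fromstring(page)

root_tag = root.tag                         # element name, qualified by its namespace
xml_lang = root.get("{%s}lang" % XML_NS)    # the xml:lang attribute
plain_lang = root.get("lang")               # the plain lang attribute

print(root_tag)    # {http://www.w3.org/1999/xhtml}html
print(xml_lang)    # en
print(plain_lang)  # en
```

The parser folds the namespace URI into the element's name, which is why the declaration matters to XML software even though browsers that treat the page as HTML mostly shrug at it.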

Declare Your Content Type

To be correctly interpreted by browsers and to pass markup validation tests, all XHTML documents must declare the type of character encoding that was used in their creation, be it Unicode (typically UTF-8), ISO-8859-1 (also known as Latin-1), or what have you.

There are three ways to tell browsers what kind of character encoding you're using, but only one works reliably as of this writing, and it's not one the W3C especially recommends.

The XML Prolog (And How to Skip It)

Many XHTML pages begin with an optional XML prolog, also known as an XML declaration. When used, the XML prolog precedes the DOCTYPE and namespace declarations described earlier, and its mission in life is to specify the version of XML and declare the type of character encoding being used in the page.

The W3C recommends beginning any XML document, including XHTML documents, with an XML prolog. To specify ISO-8859-1 (Latin-1) encoding, for example, you would use the following XML prolog:

<?xml version="1.0" encoding="ISO-8859-1"?>

Nothing complicated is going on here. The declaration tells the browser that XML version 1.0 is in use and that the character encoding is ISO-8859-1. About the only thing new or different about it is the question mark that opens and closes it.

Unfortunately, many browsers, even those from "nice" homes, can't handle their XML prolog. After imbibing this XML element, they stagger and stumble and soil themselves, bringing shame to their families and eventually losing their place in society.

Actually, the browsers go unpunished. It's your visitors who suffer when the site fails to work correctly. In some cases, your entire site might be invisible to the user. It might even crash the user's browser. In other cases, the site does not crash, but it displays incorrectly. (This is what happens when IE6/Windows encounters the prolog.)

Fortunately, there is a solution. In place of the troublesome prolog, you can specify character encoding by inserting a Content-Type meta element into the <head> of your document. To specify ISO-8859-1 encoding, type the following:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

The beginning of your XHTML document would then look something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Transitional Industries: Working for Change</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
</head>

If you're working on an international site that will include a plethora of non-ASCII characters, you might author in Unicode (UTF-8) and insert the following Content-Type meta element into your markup:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
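Why all the fuss about declaring an encoding? The short Python sketch below may help. It is only an illustration: the sample word "café" is our own arbitrary choice, and Python's built-in codecs stand in for whatever a browser does internally. The same character is stored as different bytes under each encoding, so a reader who guesses the wrong one sees garbage.

```python
# "caf\u00e9" is the word "café"; the escape keeps this file pure ASCII.
text = "caf\u00e9"

latin1_bytes = text.encode("iso-8859-1")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")         # two bytes for the same letter

# Reading UTF-8 bytes as if they were ISO-8859-1 garbles the accented letter.
misread = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'
print(misread == text)  # False
```

The declared charset must match the encoding your editor actually used when it saved the file; the meta element describes the bytes, it does not convert them.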

You might also want to forget all the geeky details of the preceding discussion. Many designers know nothing about these issues aside from which tags to copy and paste into the tops of their templates, and they seem to live perfectly happy and productive lives.

Aside from one even geekier topic to be touched upon with merciful brevity a few pages from now, you have now waded through the brain-addling, pocket-protector-equipped portion of this chapter. Congratulations! The rest will be cake.

Write All Tags in Lowercase

Unlike HTML, XML is case sensitive. Thus, XHTML is case sensitive. All XHTML element and attribute names must be typed in lowercase, or your document will not validate. (Validation confirms that your markup follows the rules of the DTD you declared.)

Let's look at a typical HTML element:

<TITLE>Transitional Industries: Our Privacy Policy</TITLE>

You will recognize this as the TITLE element, and you will recognize the Privacy Policy page as the one nobody outside your legal department ever reads. Translating this element to XHTML is as simple as switching from uppercase to lowercase:

<title>Transitional Industries: Our Privacy Policy</title>

Likewise, <P> becomes <p>, <BODY> becomes <body>, and so on.
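To see what "case sensitive" means to XML software, consider this small Python sketch. It assumes the standard xml.etree.ElementTree module as a stand-in for a strict XML parser. Note that such a parser checks only well-formedness, not validity against the XHTML DTD, so it accepts an all-caps element but treats it as a different element from the lowercase one.

```python
import xml.etree.ElementTree as ET

upper = ET.fromstring("<TITLE>Transitional Industries: Our Privacy Policy</TITLE>")
lower = ET.fromstring("<title>Transitional Industries: Our Privacy Policy</title>")

# To an XML parser, TITLE and title are two unrelated names.
same_name = upper.tag == lower.tag

# Opening in one case and closing in another is a fatal error.
try:
    ET.fromstring("<TITLE>Our Privacy Policy</title>")
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False

print(upper.tag, lower.tag)  # TITLE title
print(same_name)             # False
print(mixed_case_ok)         # False
```

Because the XHTML DTDs define only the lowercase names, the uppercase version is an unknown element as far as a validator is concerned.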

Of course, if your original HTML used lowercase element and attribute names throughout, you won't have to change them. But most of us learned to write our HTML element and attribute names in all caps, so we'll have to change them to lowercase when converting to XHTML.

Popular HTML editors like BareBones BBEdit, Optima-Systems PageSpinner, and Allaire HomeSite let you automatically convert tag and attribute names to lowercase, and the free tool HTML Tidy will do this for you as well.

Don't Worry About the Case of Attribute Values or Content

In the preceding example, notice that only the element name (title) converted to lowercase. "Transitional Industries: Our Privacy Policy" could stay just the way it was, initial caps and all. For that matter, the title element's content could be set in all caps (TRANSITIONAL INDUSTRIES: OUR PRIVACY POLICY) and would still be valid XHTML, although it would hurt your eyes to look at it.

Element and attribute names must be lowercased; attribute values and content need not be. All the following are perfectly valid XHTML:

<img src="/images/whopper.jpg" alt="Big John catches a
whopper." />

<img src="/images/WHOPPER.JPG" alt="Big John catches a
whopper." />

<img src="/images/whopper.jpg" alt="Big John catches a
Whopper and fries." />

Note that, depending on your server software, the filename mentioned in the src attribute might be case sensitive, but XHTML doesn't care. Class and ID values, on the other hand, are case sensitive.

Be careful with mixed-case attribute names. If you use a WYSIWYG tool like Macromedia Dreamweaver or an image editor like Adobe ImageReady to generate JavaScript rollovers, you will need to change the mixed-case onMouseOver to the lowercase onmouseover. Yes, really.

The following will get you into trouble:

onMouseOver="changeImages

But this is perfectly okay:

onmouseover="changeImages

Tidy Time

By far, the easiest method of creating valid XHTML pages is to write them from scratch. But much web design is really redesign, and you'll often find yourself charged with updating old pages. Redesign assignments provide the perfect opportunity to migrate to XHTML, and you don't have to do so by hand. The free tool HTML Tidy can quickly convert your HTML to valid XHTML.

Tidy was created by standards geek Dave Raggett and is now maintained as open source software at SourceForge (http://tidy.sourceforge.net/), although individuals have created some versions as a labor of love.

There are online versions of Tidy as well as downloadable binaries for Windows, UNIX, various Linux distributions, Mac (OS 9 and OS X), and other platforms. Some versions work as plug-ins to enhance the capability of existing web software. For instance, BBTidy is a plug-in for BareBones Software's BBEdit (X)HTML editor.

Each version offers different capabilities and consequently includes quite different documentation. Tidy might look simple, but it is a power tool, and reading the manual can save you a lot of grief.

Quote All Attribute Values

In HTML, you needn't quote attribute values, but in XHTML, they must be quoted (height="55", not height=55). That's pretty much all there is to say about this one. La de da. Nice weather we're having.

Okay, here's something else worth mentioning. Suppose your attribute value includes quoted material. For instance, what if your alt attribute value must read, "The Happy Town Reader's Theater presents 'A Christmas Carol.'" How would you handle that? You would do it like this, of course:

<img src="/images/carol.jpg" alt="The Happy Town Reader's
Theater presents 'A Christmas Carol.'" />

If you preferred to get fancy and use numeric character references for typographically correct apostrophes and single quotation marks, you would do that like this:

<img src="/images/carol.jpg" alt="The Happy Town
Reader&#8217;s Theater presents &#8216;A Christmas
Carol.&#8217;" />

Now that you are quoting attributes, you must separate your attributes with blanks. The following is an error:

<hr width="75%"size="7" />

Note: The W3C validator has to handle both HTML and XHTML, and its parser will not detect this error. Any parser that is designed to work with pure XML will catch it.
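For the skeptical, here is a quick way to watch a pure XML parser object. This is a sketch, not a validator: it assumes Python's standard xml.etree.ElementTree module, and the helper name is_well_formed is our own invention. It tests well-formedness only, which is exactly the level at which unquoted and jammed-together attributes fail.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


unquoted = is_well_formed('<img src="/images/carol.jpg" height=55 />')
jammed = is_well_formed('<hr width="75%"size="7" />')
spaced = is_well_formed('<hr width="75%" size="7" />')

print(unquoted, jammed, spaced)  # False False True
```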

If, for some strange reason, you need straight quotes in an attribute value, use &quot;, as in the following:

<img src="/images/hello.jpg" alt="Mrs. O'Hara says,
&quot;Hello&quot; to us." />

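You can use apostrophes to quote an attribute; if you need an embedded apostrophe, use &apos; (or the numeric reference &#39;, which is safer because &apos; is defined in XML but not in HTML 4, and some older browsers don't recognize it), as in the following: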

<img src="/images/hello.jpg" alt='Mrs. O&apos;Hara says,
"Hello" to us.' />

Quoting attribute values was optional in HTML, but many of us did it, so converting to XHTML often represents no work at all in that area. In an effort to trim their bandwidth, some commercial sites avoided quoting attribute values. Those sites will have to start quoting those values when they convert to XHTML.
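The two quoting styles for Mrs. O'Hara's greeting look different in the source but mean the same thing to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module as the parser and reuses the hello.jpg example, joined onto one line for convenience.

```python
import xml.etree.ElementTree as ET

# Double-quoted value, with &quot; for the embedded straight quotes.
double_quoted = ET.fromstring(
    """<img src="/images/hello.jpg" alt="Mrs. O'Hara says, &quot;Hello&quot; to us." />"""
)

# Apostrophe-quoted value, with &apos; for the embedded apostrophe.
single_quoted = ET.fromstring(
    """<img src="/images/hello.jpg" alt='Mrs. O&apos;Hara says, "Hello" to us.' />"""
)

alt_double = double_quoted.get("alt")
alt_single = single_quoted.get("alt")

print(alt_double)                # Mrs. O'Hara says, "Hello" to us.
print(alt_double == alt_single)  # True
```

Pick whichever style leaves you with fewer characters to escape.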

HTML Tidy can quote all your attribute values automatically. In fact, it can automatically perform every conversion task mentioned in this chapter.

All Attributes Require Values

All attributes must have values; thus, the attributes in the following HTML

<td nowrap>
<input type="checkbox" name="shirt" value="medium" checked>
<hr noshade>

must be given values. The value must be identical to the attribute name.

<td nowrap="nowrap">
<hr noshade="noshade" />
<input type="checkbox" name="shirt" value="medium" checked=
"checked" />

We know, we know. Looks weird, feels weird, takes getting used to.
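Weird or not, the rule comes straight from XML, as this sketch suggests. It assumes Python's standard xml.etree.ElementTree module as a stand-in for a strict XML parser; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


minimized_markup = '<input type="checkbox" name="shirt" value="medium" checked />'
expanded_markup = '<input type="checkbox" name="shirt" value="medium" checked="checked" />'

minimized = is_well_formed(minimized_markup)
expanded = is_well_formed(expanded_markup)

# Once the attribute has a value, the parser can report it like any other.
checked_value = ET.fromstring(expanded_markup).get("checked")

print(minimized, expanded)  # False True
print(checked_value)        # checked
```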

Close All Tags

In HTML, you have the option to open many tags such as <p> and <li> without closing them. The following is perfectly acceptable HTML but bad XHTML:

<p>This is acceptable HTML but would be invalid XHTML.
<p>I forgot to close my Paragraph tags!
<p>But HTML doesn't care. Why did they ever change these
darned rules?

In XHTML, every tag that opens must close:

<p>This is acceptable HTML and it is also valid XHTML.</p>
<p>I close my tags after opening them.</p>
<p>I am a special person and feel good about myself.</p>
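Here is the same contrast as a strict XML parser experiences it. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the enclosing body element are our own additions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# The body wrapper is an assumption, added so each sample has a single root element.
unclosed = is_well_formed(
    "<body>"
    "<p>This is acceptable HTML but would be invalid XHTML."
    "<p>I forgot to close my Paragraph tags!"
    "</body>"
)
closed = is_well_formed(
    "<body>"
    "<p>This is acceptable HTML and it is also valid XHTML.</p>"
    "<p>I close my tags after opening them.</p>"
    "</body>"
)

print(unclosed, closed)  # False True
```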

This rule—every tag that opens must close—makes more sense than HTML's confusing and inconsistent approach, and it might help avoid trouble nobody needs. For instance, if you don't close your paragraph tags, you might run into CSS display problems in some browsers. XHTML forces you to close your tags, and in so doing, helps ensure that your page works as you intend it to.

Close "Empty" Tags, Too

In XHTML, even "empty" tags such as <br> and <img> must close themselves by including a forward slash /> at the end of the tag:

<br />
<img src="zeldman.gif" />

Note the slash /> at the end of the XHTML break tag. Then note the slash /> at the end of the XHTML image tag. See that a blank space precedes each instance of the slash to avoid confusing browsers that were developed prior to the XHTML standard. None of this is rocket science.

Nor does it require much (if any) work. Recent versions of BBEdit, PageSpinner, and HomeSite automatically add the required space and slash to "empty" tags if you tell these editors you're working in XHTML. Visual web editing tools Dreamweaver and GoLive do likewise.

Naturally, to remain valid and accessible, the image element in the second example would also include an alt attribute, and an optional title attribute wouldn't hurt:

<img src="zeldman.gif" alt="Jeffrey Zeldman, author of
Designing with Web Standards." title="Jeffrey Zeldman,
debonair web designer and billionaire author of Designing
with Web Standards, now in its 400th printing." />

Now that is good XHTML.
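XML tools write empty elements the same way. The sketch below assumes Python's standard xml.etree.ElementTree module (our choice of tool, not one this chapter depends on); the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# An element with no content is serialized in self-closing form,
# with a blank space before the slash.
br_markup = ET.tostring(ET.Element("br"))

html_style = is_well_formed("<p>First line<br>second line</p>")
xhtml_style = is_well_formed("<p>First line<br />second line</p>")

print(br_markup)                # b'<br />'
print(html_style, xhtml_style)  # False True
```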

No Double Dashes Within a Comment

Double dashes can occur only at the beginning and end of an XHTML comment. That means that these are no longer valid:

<!--Invalid -- and so is the classic "separator" below. -->
<!------------------------------------>

Either replace the inner dashes with equal signs, or put spaces between the dashes:

<!-- Valid - - and so is the new separator below -->
<!--==================================-->
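A strict XML parser enforces the comment rule without mercy. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the enclosing p element are our own additions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Each comment is wrapped in a <p> element so the sample has a root element.
inner_dashes = is_well_formed('<p><!--Invalid -- and so is the classic "separator" below. --></p>')
dash_rule = is_well_formed("<p><!------------------------------------></p>")
spaced_dashes = is_well_formed("<p><!-- Valid - - and so is the new separator below --></p>")
equals_rule = is_well_formed("<p><!--==================================--></p>")

print(inner_dashes, dash_rule)     # False False
print(spaced_dashes, equals_rule)  # True True
```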

Encode All < and & Characters

Any less-than signs (<) that aren't part of a tag must be encoded as &lt;, and any ampersands (&) that aren't part of an entity must be encoded as &amp;. Thus,

<p>She & he say that x < y when z = 3.</p>

must be marked up as this:

<p>She &amp; he say that x &lt; y when z = 3.</p>

The W3C validation service will let you off with warning messages on the nonencoded markup; a pure XML parser will generate a fatal error.
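The fatal error is easy to reproduce, and so is the cure. This sketch assumes Python's standard library: xml.etree.ElementTree plays the pure XML parser, and xml.sax.saxutils.escape does the encoding for you. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(markup):
    """Return True if the strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


sentence = "She & he say that x < y when z = 3."

raw_ok = is_well_formed("<p>" + sentence + "</p>")

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
encoded = escape(sentence)
encoded_ok = is_well_formed("<p>" + encoded + "</p>")

# The parser hands back the original characters.
round_trip = ET.fromstring("<p>" + encoded + "</p>").text

print(raw_ok, encoded_ok)  # False True
print(encoded)             # She &amp; he say that x &lt; y when z = 3.
```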

Note: We recommend that you always encode > as &gt;. Even though you never have to encode it (with the exception of one highly esoteric circumstance), it is there for symmetry, and your markup will be easier for others to read if you use it.

Let's review the XHTML rules we've learned.

Executive Summary: The Rules of XHTML

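Open with the proper DOCTYPE and namespace.
Declare your content type using the meta Content-Type element.
Write all element and attribute names in lowercase.
Quote all attribute values.
Assign values to all attributes.
Close all tags.
Close "empty" tags with a space and a slash.
Do not put double dashes within a comment.
Encode all less-than signs and ampersands as &lt; and &amp;.

Viewed as a short list in this Executive Summary, the rules of XHTML look few and simple enough, and they are few and simple enough. One sadly dull additional point must be made before we move on to the good stuff.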