Dev.Opera - Follow the standards, break the rules

MAMA: W3C validator research

By Brian Wilson · 15 Oct, 2008

Page 1 index : Page 2 index : Page 3 index

  1. About markup validation—an introduction
  2. Previous validation studies
  3. Sources and tools: The URL set and the validator
  4. What use is markup validation to an author?
  5. How many pages validated?
  6. Interesting views of validation rates, part 1: W3C-Member companies

Note that this document is large, so has been broken up into 3 pages; use the navigation at the bottom of the document to navigate between pages.

About markup validation—an introduction

MAMA is an in-house Opera research project developed to create a repeatable and cross-referenced analysis of a significant population of Web pages that represent real world markup. Of course, part of that examination must also cover markup validation—an important measure of a page's adherence to a specific standard. The W3C markup validation tool produces useful metrics that add to the rest of MAMA's breakdown of its URL set. We will look at what validation reveals about these URLs, what it means to validate a document, and what benefits or drawbacks are derived from the process.

The readership of this section of MAMA's research is expected to be the casual Web page author out for a relaxing weekend browse, as well as those developing the W3C validator tool itself, looking for incisive statistics about the validation "State Of The Union". As a result of this diverse audience, some readers will find that some sections are redundant or mystifying (possibly both at the same time even!). Feel free to skip around the article as needed, but the best first-time reading flow is definitely a linear read. Some of the data presented may need some prerequisite knowledge, but I hope that even the most detailed examinations here may be of interest to all readers in some way. There are some positive trends, some surprises, and some disappointments in the figures to follow.

A quick summary:

The good news: Markup validation pass rates are definitely improving over time.
The bad news: The overall validation pass rate is still miserably low and is not increasing as fast as one would hope.

Previous validation studies

There are two previous, large-scale studies of markup validation to which we can compare MAMA's results regarding markup validation trends. Direct correlation with these previous studies was not an original goal of MAMA, but it is a happy accident, given that many of MAMA's design choices happen to coincide.

The analysis tools and target URL group were roughly the same between MAMA and these other projects. Both Parnas's and Saarsoo's studies used the WDG validator (see next section), which shares much of the same back-end mechanics with the W3C validator. Both studies also used the DMoz URL set (see next section). The main difference between the URL sets used lies in the amount of DMoz analyzed; where MAMA's research overlaps with Parnas's and Saarsoo's studies, we will attempt to compare results.

Fig 2-1: URL set sizes of validation studies
Study | Date | URL set | Full DMoz size | Study set size
Parnas | Dec. 2001 | DMoz | ~2.5 million | ~2.4 million
Saarsoo | Jun. 2006 | DMoz | ~4.4 million | ~1.0 million
MAMA | Jan. 2008 | DMoz | ~4.7 million | ~3.5 million

Sources and tools: The URL set and the validator

[For more details about the URLs and tools used in this study, take a look at the Methodology Appendix section of this document.]

Treading on familiar ground: The Open Directory Project (DMoz)

There is a lot of MAMA coverage elsewhere about the DMoz URL set and the decision to use it as the basis of MAMA's research. MAMA did not analyze ALL of the DMoz URLs, though. Transient network issues, dead URLs, and other problems inevitably reduced the set of URLs actually analyzed to a final total of about 3.5 million. The number of URLs from any given domain was also limited in order to decrease per-domain bias in the results. This was an important design decision, because DMoz has a big problem with domain bias (~5% of all URLs in it are from cnn.com alone, for example). Parnas and Saarsoo did not do this, but it has proven to be a useful strategy to employ. I set an arbitrary per-domain limit of 30 URLs, and this seems to be a fair limitation. This restriction policy also helps track per-domain trends; if any are noticeable, they will be presented where they seem interesting.
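
The per-domain limit described above can be sketched in a few lines of Python. This is only an illustration, not MAMA's actual code: the function name is hypothetical, and it assumes that "domain" means the lowercased host name of each URL and that the first 30 URLs encountered for a host are the ones kept.

```python
from collections import Counter
from urllib.parse import urlsplit


def cap_urls_per_domain(urls, limit=30):
    """Keep at most `limit` URLs per host, preserving input order."""
    kept = []
    seen_per_host = Counter()
    for url in urls:
        # Assumption: "domain" means the lowercased host name of the URL.
        host = (urlsplit(url).hostname or "").lower()
        if seen_per_host[host] < limit:
            seen_per_host[host] += 1
            kept.append(url)
    return kept


# Hypothetical input: 40 URLs on one host and 2 on another.
sample = [f"http://www.cnn.com/story/{i}" for i in range(40)]
sample += ["http://example.org/", "http://example.org/about"]
capped = cap_urls_per_domain(sample)
print(len(capped))  # 30 from the first host plus 2 from the second
```

A cap like this trades coverage of very large sites for a URL set in which no single domain dominates the statistics.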

Any comparison of MAMA's data to other similar studies, even if they also use DMoz, must take into account that DMoz grows and changes over time as editors add, freshen, or delete URLs from its roster. URLs can grow stale or obsolete through removal, and domains can and do die on a distressingly regular basis. The aggregation source of these URLs remains the same, but the set itself is an evolving, dynamic entity.

The W3C validator

To test the URL set, MAMA used the W3C Markup Validator tool (http://validator.w3.org/, v. 0.8.2 released Oct. 2007), which uses the OpenSP parser for its main validation engine. The W3C Markup Validator is a free service from the W3C that helps authors improve the quality of their documents by checking adherence to standards via DTDs. The Parnas and Saarsoo studies both used the WDG validator, but for MAMA's analysis, the W3C validator was the validation tool of choice. As stated on the WDG's Web site, there are many similarities between these two validators,

"Most of the previous differences between the two validators have disappeared with recent development of the W3C validator".

So, even though the validators used are different, there is significant overlap between MAMA's validation study data and the other previous studies. The W3C Quality Assurance group has produced many excellent tools and processes over the years, and that hard work definitely deserves to be showcased in a study like this. Kudos to the W3C validator team!

What use is markup validation to an author?

Why would an author validate a document at all? A validator does not write a Web page for you; the inspiration and perspiration must still come completely from the author. There do not appear to be any real negative consequences to omitting this step. Sticking rigorously to a standard does not necessarily spell success: using a validator on a page and correcting any problems it brings to light does not guarantee that the result will look right on one browser, let alone all of them. Conversely, an invalid page may render exactly the way an author was expecting.

Both authors and readers have come to expect that all browsers perform impeccable error recovery in the face of the worst tag soups the Web can throw at them. Forgiveness is perhaps the most under-appreciated yet important feature we expect from a browser. However, that is asking a lot, especially for the increasingly lightweight devices that are being used to browse the Web. If there are any consequences for sloppy authoring practices, it would be here.

Henri Sivonen properly framed the role of the markup validator in an author's toolkit:

"[A] validator is just a spell checker for the benefit of markup writers so that they can identify typos and typo-like mistakes instead of having to figure out why a counter-intuitive error handling mechanism kicks in when they test in browsers."

Continuing with the spell-checker analogy, there are no dire consequences for a page failing to validate, just as there is seldom a serious consequence of having spelling typos in a document—the overall full meaning is still conveyed well enough to get the point across.

Using the spell-checker analogy also helps dispel a practice that the W3C encourages, something that we will talk more about in a later section—proclaiming that a page has been validated. This is a pointless exercise and means nothing (W3C tool evangelism aside). It is like saying a document has been spell-checked at some time during its history. Any subsequent change to a document can introduce errors—both spelling and syntax-wise—and make the claim superfluous code baggage. As we will show in later sections, pages that have passed validation in the past often do not STAY validated!

Markup validation is a useful tool to help ensure that a page conforms to the target you are aiming for. The most obvious thing to take away from the entirety of the MAMA research is that people are BAD at this "HTML thing". Improper tag nesting is rampant, and misspelled or misplaced element and attribute names happen all the time. It is very easy to make silly, casual mistakes; we all make them. Validation of Web pages would expose all these types of simple (and avoidable) errors in moments.

For even more (and probably better) reasons to validate your documents, have a look at the W3C's excellent treatment of the subject: "Why Validate?".

How many pages validated?

The raw validation numbers

The validator's SOAP response has an <m:validity> element whose content is a Boolean value, either "true" or "false". A "true" value is considered a successful validation. MAMA found that 145,009 out of 3,509,180 URLs passed validation.
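
As a rough illustration of reading that element, the sketch below pulls the validity flag out of a SOAP response using only the Python standard library. The sample response is hand-written and heavily trimmed, and its namespace URIs are an assumption based on the validator's SOAP 1.2 output format. To avoid depending on them, the hypothetical helper matches the element by its local name.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; the namespace URIs are assumptions.
SAMPLE_RESPONSE = """\
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse
        xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:uri>http://example.org/</m:uri>
      <m:validity>false</m:validity>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>"""


def passed_validation(soap_xml):
    """Return True if the response's validity element contains "true"."""
    root = ET.fromstring(soap_xml)
    for element in root.iter():
        # ElementTree spells namespaced tags as "{uri}local"; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "validity":
            return (element.text or "").strip().lower() == "true"
    raise ValueError("no validity element found in response")


print(passed_validation(SAMPLE_RESPONSE))  # False
```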

Fig 5-1: Validation pass rate studies
Study | Date | Passed validation | Total validated | Percentage
Parnas | Dec. 2001 | 14,563 | 2,034,788 | 0.71%
Saarsoo | Jun. 2006 | 25,890 | 1,002,350 | 2.58%
MAMA | Jan. 2008 | 145,009 | 3,509,180 | 4.13%

Another interesting view of MAMA's URL validation study is how many of the domains MAMA examined contained ANY page that validated: 130,398 of 3,011,661 distinct domains (4.33%).
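
The difference between the per-URL and per-domain views can be made concrete with a small sketch. It assumes validation results are available as (URL, passed) pairs and that a domain is identified by its host name; the function name and sample data are hypothetical, and the input is assumed to be non-empty.

```python
from urllib.parse import urlsplit


def pass_rates(results):
    """Return (per-URL rate, per-domain rate) as percentages.

    `results` is an iterable of (url, is_valid) pairs. A domain counts as
    passing if ANY of its URLs validated.
    """
    url_total = url_passed = 0
    domain_passed = {}
    for url, is_valid in results:
        url_total += 1
        url_passed += bool(is_valid)
        host = (urlsplit(url).hostname or "").lower()
        domain_passed[host] = domain_passed.get(host, False) or bool(is_valid)
    url_rate = 100.0 * url_passed / url_total
    domain_rate = 100.0 * sum(domain_passed.values()) / len(domain_passed)
    return url_rate, domain_rate


# Hypothetical results for four URLs on two hosts.
sample = [
    ("http://a.example/1", True),
    ("http://a.example/2", False),
    ("http://b.example/1", False),
    ("http://b.example/2", False),
]
url_rate, domain_rate = pass_rates(sample)
print(f"{url_rate:.2f}% of URLs, {domain_rate:.2f}% of domains")

# The same arithmetic applied to the totals reported above.
print(f"{100 * 145_009 / 3_509_180:.2f}%")  # per-URL: 4.13%
print(f"{100 * 130_398 / 3_011_661:.2f}%")  # per-domain: 4.33%
```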

Validation rates where select Web-page authoring features are also involved

Now, we need to ask the same basic "does it validate?" question multiple ways, keeping our main variable (validation rate) constant, while varying other criteria. This has the potential to say some interesting things about the validation rates as a whole, while also providing insight to biases that can arise when mixing popular factors and technologies found in web pages. Note: instead of listing overall URL totals, the totals mentioned are only for the URLs that use each technology.

Fig 5-2: Validation pass rates relating to various features
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.
Authoring feature used | Criteria used to match | Quantity validating | Total quantity using technology | Percentage
Script/JavaScript | Any "javascript:" URL; any external script pointed to by SCRIPT element; any script embedded in a SCRIPT element; any known event handler content (for attributes beginning with "on") | 99,299 [90,233] | 2,617,828 [2,306,921] | 3.79% [3.91%]
CSS | Any Style attribute content; any content of STYLE element; any external stylesheet pointed to by LINK element (Rel="stylesheet") | 129,893 [117,361] | 2,821,141 [2,487,898] | 4.64% [4.72%]
Adobe Flash | EMBED: MIME type of the Src attribute contains "flash"; PARAM: element contains the string ".swf" or "flash"; OBJECT: MIME type of the object contains "flash"; Script: any mention of "flash" or ".swf" | 44,491 [41,058] | 1,176,227 [1,050,121] | 3.78% [3.91%]
Frames | Usage of the FRAMESET element | 5,905 [5,741] | 378,033 [354,321] | 1.56% [1.62%]
Iframes | Usage of the IFRAME element | 4,615 [4,238] | 222,462 [193,489] | 2.07% [2.19%]
Font | Usage of the FONT element (common, CSS-obsoleted formatting markup) | 29,723 [27,491] | 2,061,422 [1,762,528] | 1.44% [1.56%]
IIS Web Server | Detection of "iis" string in HTTP header Server field | 24,743 [22,227] | 883,854 [769,375] | 2.80% [2.89%]
Apache Web Server | Detection of "apache" string in HTTP header Server field | 110,834 [99,866] | 2,347,328 [2,011,088] | 5.38% [4.97%]
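
To give a feel for how matching criteria like these might be applied to a page, here is a minimal sketch built on Python's standard html.parser module. It is not MAMA's implementation. It covers only the Script, CSS, Frames, Iframes, and Font rows, treats any non-empty attribute whose name begins with "on" as an event handler (MAMA matched known handlers only), and looks for "javascript:" URLs in an assumed short list of URL-bearing attributes. All names in it are hypothetical.

```python
from html.parser import HTMLParser

# Attributes that commonly carry URLs; an assumption, not MAMA's actual list.
URL_ATTRIBUTES = {"href", "src", "action"}

# Elements whose mere presence marks a feature.
ELEMENT_FEATURES = {
    "script": "script",
    "style": "css",
    "frameset": "frames",
    "iframe": "iframes",
    "font": "font",
}


class FeatureDetector(HTMLParser):
    """Collect the authoring features seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.features = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag in ELEMENT_FEATURES:
            self.features.add(ELEMENT_FEATURES[tag])
        for name, value in attrs:
            value = (value or "").strip()
            if name == "style" and value:
                self.features.add("css")
            elif name.startswith("on") and value:
                # Simplification: any on* attribute counts as an event handler.
                self.features.add("script")
            elif name in URL_ATTRIBUTES and value.lower().startswith("javascript:"):
                self.features.add("script")
            elif tag == "link" and name == "rel" and "stylesheet" in value.lower().split():
                self.features.add("css")


def detect_features(html):
    detector = FeatureDetector()
    detector.feed(html)
    return detector.features


sample = """<html><head><link rel="stylesheet" href="site.css"></head>
<body onload="init()"><font color="red">Hello</font>
<iframe src="inner.html"></iframe></body></html>"""
print(sorted(detect_features(sample)))  # ['css', 'font', 'iframes', 'script']
```

The Flash and Web server rows would need MIME type and HTTP header checks, which are left out here to keep the sketch short.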

Validation, content management systems (CMS), and editors

MAMA looked at the META "Generator" value to find popular CMSs and editors in use for the following tables, looking for any noticeable trends in validation rates. One might expect per-domain numbers to be more interesting in this case than per-URL, because sites are often developed using a single platform, but there is very little difference between the two views. In general, CMSs generate valid pages at markedly higher rates than the overall average, with "Typo3" variants leading at almost 13%. On the other hand, the editor situation has some wild differences. Microsoft's FrontPage has a VERY wide deployment rate, but a depressingly low validation pass rate of ~0.5%. Apple's iWeb editor, however, has a freakishly high validation rate of ~82%. Kudos to iWeb for this happy discovery.
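
Reading the META "Generator" value from a page is straightforward. The sketch below, again using Python's standard html.parser, shows one way it could be done. The class and function names are hypothetical, and how MAMA grouped the raw values into products such as "Typo3 variants" is not shown.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute names arrive lowercased; compare the value case-insensitively.
        if (attributes.get("name") or "").strip().lower() == "generator":
            self.generators.append((attributes.get("content") or "").strip())


def find_generators(html):
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generators


page = '<html><head><META NAME="Generator" CONTENT="Microsoft FrontPage 4.0"></head></html>'
print(find_generators(page))  # ['Microsoft FrontPage 4.0']
```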

Fig 5-3: Validation pass rates relating to editors
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.
Editor | Quantity passing validation | Total occurrences | Percentage
Apple iWeb | 2,051 [2,016] | 2,504 [2,465] | 81.91% [81.78%]
Microsoft FrontPage | 1,923 [1,846] | 347,095 [305,220] | 0.55% [0.60%]
Adobe GoLive | 1,086 [1,057] | 41,865 [39,035] | 2.59% [2.71%]
NetObjects Fusion | 802 [793] | 26,355 [25,466] | 3.04% [3.11%]
IBM WebSphere | 626 [585] | 32,218 [24,460] | 1.94% [2.39%]
Microsoft MSHTML | 518 [502] | 40,030 [38,328] | 1.29% [1.31%]
Microsoft Visual Studio | 272 [245] | 22,936 [21,051] | 1.19% [1.16%]
Adobe Dreamweaver | 205 [198] | 5,954 [5,647] | 3.44% [3.51%]
Microsoft Word | 154 [153] | 24,892 [22,503] | 0.62% [0.68%]
Adobe PageMill | 100 [92] | 15,148 [12,142] | 0.66% [0.76%]
Claris Home Page | 48 [41] | 6,259 [4,798] | 0.77% [0.85%]

Fig 5-4: Validation pass rates relating to CMS
Quantities are per-URL. Numbers in "[]" brackets indicate per-domain quantities.
CMS | Quantity passing validation | Total occurrences of CMS | Percentage
Typo3 | 2,301 [2,170] | 18,067 [16,930] | 12.74% [12.82%]
Joomla | 2,248 [2,233] | 34,852 [34,237] | 6.45% [6.52%]
WordPress | 1,494 [1,472] | 16,594 [16,046] | 9.00% [9.17%]
Blogger | 30 [30] | 9,907 [9,808] | 0.30% [0.31%]

Interesting views of validation rates, part 1: W3C-Member companies

The W3C is the organization that creates the markup standards and the markup validator used in this study. One would hope that the individual companies that support and comprise the W3C would spearhead the effort to follow the standards that the W3C creates. Well, it turns out that is indeed the case. The top pages of W3C-member companies definitely adhere to markup standards at much higher rates than the rest of the Web. However, these "standard-bearers" (pun intended) could definitely do better at this than they currently do.

In February 2002, Marko Karppinen validated 506 URLs of all the W3C-member companies at that time. Only 18 of these pages passed validation. Compared to Parnas's validation study of the DMoz URLs just two months before, the W3C-member company validation rate of 3.56% was considerably better than the 0.7% rate for URLs "in the wild", but it is nothing for the paragons of Web standards to brag about. Such a low validation pass rate could easily be perturbed by any number of transient conditions or other factors.

Saarsoo also did a study of W3C-member company validation rates in Jun. 2006. By that point, the validation situation for the member companies had improved nicely, to 17.00%. Fast-forward now to Jan. 2008 [W3C-member-company list snapshot], and we see that the general Web-at-large has caught up to, and even exceeded, the validation pass rate of W3C-member companies from Karppinen's study era. The general validation pass rate in the DMoz population is now running at ~4.13%, and the W3C-member company pass rate is a strong 20.15%, with more member companies than ever claiming the validation crown.

Fig 6-1: W3C-Member-company list validation studies
W3C-member list study | Date | Total in member list | Total validated | Passed validation | Percentage
Marko Karppinen | Feb. 2002 | 506 | 506 | 18 | 3.56%
Saarsoo | Jun. 2006 | 401 | 352 | 61 | 17.00%
MAMA | Jan. 2008 | 429 | 412 | 83 | 20.15%

Just showcasing the increased validation rate does not tell the whole story. Saarsoo left an excellent data trail to which to compare the present validation pass rate. It is interesting to note that, although the overall pass rate has increased, many of the sites that passed validation previously no longer do so at the time of writing. Achieving a passing validation status does not seem to be as hard as maintaining that status over time. Compared to Saarsoo's study, there are just as many URLs that previously validated but currently do not as there are URLs that maintained their passing validation status.

Fig 6-2: Validation comparison to Saarsoo's W3C-Member-Company study
Validation comparison | Quantity
URLs that validated before and do now | 25
URLs that validated before but do not now and are still in W3C-member-company list | 25
URLs that validated before but are no longer in W3C-member-company list | 11
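
A comparison like this comes down to a few set operations. The sketch below shows one way the three categories could be derived. The function name and the URLs are hypothetical placeholders, and it assumes every URL that passes now is also on the current member list.

```python
def compare_studies(passed_before, passed_now, members_now):
    """Split the URLs that passed an earlier study into three groups."""
    passed_before = set(passed_before)
    passed_now = set(passed_now)
    members_now = set(members_now)
    return {
        "validated before and now": passed_before & passed_now,
        "validated before, not now, still a member":
            (passed_before & members_now) - passed_now,
        "validated before, no longer a member": passed_before - members_now,
    }


# Hypothetical placeholder URLs, not data from either study.
before = {"http://a.example/", "http://b.example/", "http://c.example/", "http://d.example/"}
now = {"http://a.example/", "http://e.example/"}
members = {"http://a.example/", "http://b.example/", "http://c.example/", "http://e.example/"}

groups = compare_studies(before, now, members)
for label, urls in groups.items():
    print(label, len(urls))
```

Under that assumption the three groups partition the earlier passing set, which agrees with the reported figures: 25 + 25 + 11 = 61, the number of URLs that passed validation in Saarsoo's study.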

Saarsoo commented in 2006 on the dynamic nature of the W3C company roster. In early 2002 there were 506 member companies; the list dipped to 401 in mid-2006 and is back up to 429 at the present time (early 2008). To put the change in some perspective, the net loss of companies in the list over this time-frame is 77 (506 minus 429), which is almost as many companies as the 83 that currently pass validation. Put simply, a pessimist might say that a company on this list is just about as likely to drop out of the W3C as it is to achieve a successful validation.

The W3C-Member List successful validation Honor Roll

In his 2002 study, Karppinen prominently listed the W3C-member companies whose main URLs passed validation in order to,

"highlight the effort that goes into making an interoperable web site".

This is an excellent idea and is becoming a bit of a time-honored tradition that both the Saarsoo study and this one have followed. The first list from Karppinen was easy to keep inline with the rest of the study, because it was (unfortunately) short and sweet. As the pass rate has improved over time, this list becomes progressively longer. This is the goal, though; everyone wants the list to be too long to display easily. [See the Honor Roll list here.]

And the crown goes to ...

Two companies have kept their sites valid throughout all three studies, from 2002 to 2008. These companies deserve extra congratulations for this feat.

Many sites are constantly changing, but being a member of an organization that creates standards should be compulsion enough to attain a recognized level of excellence in those standards. Saarsoo ended his 2006 look at the W3C-member list with an optimistic wish for the future,

"Maybe at 2008 we have 50% of valid W3C member sites."

Unfortunately, that number is nowhere close to the current reality. It may be too much for the W3C to require its member-companies' sites to pass validation, but they should definitely try to push for higher levels than they currently attain, to serve as a good example if nothing else.
