Dev.Opera - Follow the standards, break the rules

MAMA: W3C validator research

  1. Interesting views of validation rates, part 2: Alexa Global Top 500
  2. Validation badge/icons: An interesting diversion?
  3. Doctypes
  4. Character sets

Interesting views of validation rates, part 2: Alexa Global Top 500

About the Alexa Global Top 500

Next, we will look at another "interesting" small URL set, drawn from Amazon's Alexa service. Alexa uses Web crawling and user-installed browser toolbars to track "important sites". It maintains, among many other useful measures, a global "Top 500" list of URLs considered popular on the Web. The Alexa list was chosen primarily because it is similar in size to the W3C list, so even though MAMA might be comparing apples to oranges, at least it compares a fairly equal number of apples and oranges. The W3C-company list skews toward academic and "big money" commercial computer sites. The Alexa list is more representative of what people actually use and experience on the Web on a day-to-day basis.

While few would dispute that Alexa's "Top 500" list is relevant and popular, it has some definite biases:

  • It is prejudiced toward big/popular sites with many country-specific variants, such as Google, Yahoo!, and eBay. This ends up reducing the breadth of the list. Google is the most extreme example of this, with 63 of the 487 URLs in the analyzed set being various regional Google sites.
  • It includes the top pages of domain aggregators with varied user content, such as LiveJournal, Facebook, and fc2.com. These top pages are not representative of the wide variety of the user-created content they contain.
  • The list consists entirely of top-level, entrance, or "surface" pages of a site. There is no intentional "deep" URL representation.

Validating the Alexa Top 500

On 28 January 2008, the then-latest Alexa Top 500 list was inserted into MAMA [January 2008 snapshot list, latest live version]. About half of these URLs were already in MAMA, having been part of other sources. Of the 500 URLs in this list, 487 were successfully analyzed and validated. Only 32 of these URLs passed validation (6.57%). This is a slightly higher percentage rate than the much larger overall MAMA population, but the quantity and difference are still too small to declare any trends.

Fig 7-1: Alexa Top 500 validation studies
Alexa Top 500 list study    Date         Passed validation    Total set size    Percentage
MAMA                        Jan. 2008    32                   487               6.57%

For future Alexa studies

OK, so the Alexa Top 500 does have some drawbacks. Should the URL set be tossed out entirely? Can this set be improved? Aside from the Top 500, Alexa has a very deep catalog and categorization of URLs, some of them available freely, but most are available only for a fee. Some categories of URLs include division by country and by language. Alexa currently has publicly-available lists of the top 100 URLs for 21 different languages (2,100 URLs) and 117 countries (11,700 URLs). Note: The per-country list represents popularity among users in a country, not sites hosted in the country. An undoubtedly-interesting expanded list of the Alexa Global Top 500 could be created by aggregating all of these sources, which would probably yield 5,000-10,000 URLs (if duplicates were eliminated).

If the validation rates of the Alexa Global Top 500 are studied in the future, the Top 500 list of URLs will likely be quite different from what it is at the time of writing. The topicality of the list is a strength that promotes the relevance of the analysis, but it also makes cross-comparisons over time difficult. Documenting the list used in each analysis will make such comparisons easier.

Validation badge/icons: An interesting diversion?

Before MAMA had validated even a single URL, the author discovered this page at the W3C's site: http://www.w3.org/QA/Tools/Icons. This page lists icons that,

"may be used on documents that successfully passed validation for a specific technology, using the W3C validation services".

It seemed like an interesting idea to compare the pages that were using these images claiming validation with how they actually validate. This can only be a crude measure for a number of reasons, but, by far, the main one is as follows: an author can easily host the validation icon/badge on their own server and name it anything they want.

For those gearheads in the audience who have some "regexp savvy", the following Perl regular expression was used to identify validation icon/badges utilizing the W3C naming scheme. This pattern match was used against the Src attribute of the IMG elements of URLs analyzed:

Regexp:
/valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i || /(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i

This seems to capture all the variations of the W3C's established naming conventions (any corrections are very welcome if it does not). Note that the regexp errs on the cautious side and can also capture unintended matches, such as JPEG files that follow the naming scheme. One might think this an error, but it turns out it is not. JPEG versions of the validation icons are not (currently) listed on the W3C's Web site, but a random spot-check showed that the JPEG images detected this way by MAMA ARE validation badge icons! In this case, what appear to be false positives are actually valid after all.

Ex: http://www.w3.org/Icons/valid-html401-blue.png is stored as 'html401-blue'
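
As a rough illustration, the two patterns can be ported from Perl to Python almost verbatim. This is only a sketch: the function name badge_key is hypothetical, and the rule for building the stored value (capture group 1 plus the optional "-blue" suffix) is an assumption chosen because it reproduces the 'html401-blue' example above, not a documented detail of MAMA.

```python
import re

# Python port of the article's two Perl patterns (re.IGNORECASE mirrors /i).
VALID_RE = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE,
)
WCAG_RE = re.compile(r"(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$", re.IGNORECASE)


def badge_key(src):
    """Return a badge name for an IMG src value, or None if nothing matches.

    Assumption: the stored key is group 1 plus the optional "-blue" suffix.
    """
    m = VALID_RE.search(src)
    if m:
        return m.group(1) + (m.group(3) or "")
    m = WCAG_RE.search(src)
    return m.group(1) if m else None


print(badge_key("http://www.w3.org/Icons/valid-html401-blue.png"))  # html401-blue
```

Because the file extension is optional in the pattern, a name such as valid-xhtml10.jpg still matches; the lazy group simply swallows the ".jpg", which is how the JPEG badges mentioned above end up being captured.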

Validation rates of URLs having validation badge/icons

Now we will look at the list of W3C Validation Image Badges found in MAMA by URL [also by domain]. Even with the various pitfalls that could occur with MAMA's pattern matching, there is still a comparison that is interesting to explore: how many pages that use a badge actually validate? If we consider that the only type of badge of real interest in our sample is an HTML variant (html, xhtml), looking for the substrings "html" and "xhtml" within this field in MAMA gives us:

Fig 8-1: Validation rates of URLs with validation icons
Type of badge identified    Total     Actually validated    Percentage
xhtml                       11,657    5,480                 47.01%
html                        22,033    10,995                49.90%

This is just under 50% in each case, which is frankly a rather miserable hit ratio. If these URLs do not validate, do they bear ANY resemblance to the badge they are claiming?

Comparison of stated validation badge/icon type versus actual detected Doctype

Next, we will compare the Doctypes actually detected with the badges claiming compliance with those Doctypes. Doctypes detected in both the validator and MAMA analyses are listed for comparison. The situation definitely improves here over the previous figures. Note: Fatal validation errors cause the validator to under-report Doctypes by reporting no Doctype at all in such cases.

Fig 8-2: Reported validation icon type versus MAMA-detected Doctype
Type of badge identified    Validator-detected Doctype    MAMA-detected Doctype    Total according to badge/icon
xhtml                       10,553                        11,054                   11,657
html                        20,570                        21,475                   22,033

The validation badges certainly increase public awareness of validation as something for which authors strive, but they do not appear to be the best measure of reality. For the half of badged URLs that claim validation compliance but currently do not validate, one has to wonder whether they ever did validate in the past. Pages definitely tend to change over time, and removing or updating an icon badge may not be high on most authors' lists of "Things To Do". The next time you see such an icon, consider its current state with a grain of salt.

For future W3C badge studies

After this survey was completed, the following rather prominent quote was noticed on the W3C's Validation Icons page,

"The image should be used as a link to re-validate the document."

It may be useful to incorporate this fact to identify further validation badges in the future.
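
One way this could be approached is to look for images that sit inside a link pointing at the validator. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, and treating any href containing "validator.w3.org" as a re-validation link is an assumption rather than a rule taken from the W3C page.

```python
from html.parser import HTMLParser


class BadgeLinkFinder(HTMLParser):
    """Collect IMG src values that sit inside a link to the W3C validator."""

    def __init__(self):
        super().__init__()
        self.in_validator_link = False
        self.badges = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = (attrs.get("href") or "").lower()
            # Assumption: any link into validator.w3.org counts as "re-validate".
            self.in_validator_link = "validator.w3.org" in href
        elif tag == "img" and self.in_validator_link:
            self.badges.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_validator_link = False


def linked_badges(html):
    """Return the src of every image wrapped in a validator link."""
    finder = BadgeLinkFinder()
    finder.feed(html)
    return finder.badges
```

A check of this kind does not depend on the image's file name, so it could in principle catch badges that an author has renamed or hosted locally.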

Doctypes

What are we examining?

First up is the Doctype. The Doctype statement tells the validator which DTD to use when validating—it is the basic evaluation metric for the document. MAMA used its own methods to divine the Doctype for every document, but the validator actually detects the Doctype in two slightly different ways: one by the validator itself and the other by the SGML parser at the core of the validator.

Fig 9-1: Detected Doctype factors used in this study
Source of Doctype    Information being used
MAMA                 Detected Doctype statement
Validator            SOAP <m:doctype > content
Validator            'W09'/'W09x' warning messages

This is a good time to dissect a Doctype and see what makes it tick. We will look at a typical Doctype statement, and examine all of its parts:

Ex: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Fig 9-2: Components of a DTD
Component Description
"<!DOCTYPE" The beginning of the Doctype
"html" This string specifies the name of the root element for the markup type.
"PUBLIC" This indicates the availability of the DTD resource. It can be a publicly-accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers.
"-//W3C//DTD XHTML 1.0 Transitional//EN" This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion).
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" The System Identifier (SI); the URL location of the DTD specified in the FPI
">" The ending of the Doctype
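
The components listed above can be pulled apart with a single regular expression. This is a minimal sketch rather than MAMA's actual method: the function name parse_doctype is hypothetical, and the pattern assumes double-quoted identifiers and no internal DTD subset.

```python
import re

DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[^\s>]+)'       # root element name
    r'(?:\s+(?P<kind>PUBLIC|SYSTEM)'       # availability keyword
    r'(?:\s+"(?P<first>[^"]*)")?'          # FPI (PUBLIC) or SI (SYSTEM)
    r'(?:\s+"(?P<second>[^"]*)")?)?'       # SI that follows a PUBLIC FPI
    r'\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Split the first Doctype statement in text into its components."""
    m = DOCTYPE_RE.search(text)
    if not m:
        return None
    kind = (m.group("kind") or "").upper() or None
    first, second = m.group("first"), m.group("second")
    if kind == "SYSTEM":
        fpi, si = None, first   # a SYSTEM Doctype carries only a System Identifier
    else:
        fpi, si = first, second
    return {"root": m.group("root"), "kind": kind, "fpi": fpi, "si": si}


example = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
print(parse_doctype(example)["fpi"])  # -//W3C//DTD XHTML 1.0 Transitional//EN
```

Splitting the resulting FPI on "//" gives its individual fields, with the label ("DTD XHTML 1.0 Transitional") in the third position.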

MAMA's analysis stores the entire DOCTYPE statement, but the validator's SOAP response returns only a portion of it: generally the FPI, but in some situations the SI instead, or even nothing at all if an error condition is detected. These situations are infrequent, though; only 70 URLs analyzed by the validator returned the Doctype's SI, for example.

!Doctypes!

The validator examined 3,509,180 URLs overall. Of those, the validator says that 1,474,974 (42.03%) "definitely" did not use a DOCTYPE (indicated by empty content for the <m:doctype > element in the SOAP response). In addition to the empty <m:doctype > element in the SOAP response, the validator also returns explicit warnings in the instances where it does not encounter a Doctype statement: specifically, warning codes 'W09' and 'W09x' are generated by the SGML parser layer of the validator. Is there any correlation between these warning codes and the "official" empty Doctype mentioned in the SOAP response? The quick answer is yes. Some 1,373,352 URLs have either the 'W09' or 'W09x' warnings. Looking closer for a direct correlation, 1,371,899 URLs were issued a 'W09'/'W09x' warning AND do not have a Doctype listed in the SOAP response. This leaves 1,453 URLs that had some sort of validator-detectable Doctype, but for which a No Doctype warning was issued anyway. Sampling several URLs from this set showed that, in every case, the Doctype statement was not at the very beginning of the document. So, it appears that the OpenSP parser does not like this, but the validator itself is OK with this scenario.

MAMA also looked at Doctypes in its main analysis. We have compared cases where both tools found no Doctype. MAMA found 1,720,886 URLs without a Doctype. This is a rather large discrepancy compared to the validator's numbers above. We must alter this figure further because the SOAP response for a validation failure error returns empty <m:doctype > and <m:charset > elements. To improve the quality of our comparison between MAMA and the validator's results, we must exclude from our mutual examination all URLs with a positive validator failure count. After this minor adjustment, the numbers are much more in line with each other. To the numbers:

Fig 9-3: Scenarios where Doctype is not present
Situation Qty
MAMA detected no Doctype. 1,465,367
Validator detected no Doctype. 1,474,974
MAMA and validator both detected no Doctype. 1,423,478
MAMA detected no Doctype, but the validator did. 41,889
Validator detected no Doctype, but MAMA did. 51,496
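
The bookkeeping behind a table like Fig 9-3 can be sketched as a simple tally. The record layout (MAMA's Doctype, the validator's Doctype, the validator failure count) and the function name are assumptions made for illustration only; the sketch just applies the exclusion rule described above before counting.

```python
from collections import Counter


def tally_missing_doctype(records):
    """Count 'no Doctype' scenarios from (mama, validator, failures) tuples."""
    counts = Counter()
    for mama_doctype, validator_doctype, failures in records:
        if failures > 0:
            # A failed validation returns an empty <m:doctype >, which says
            # nothing about the page itself, so the URL is left out.
            continue
        if not mama_doctype and not validator_doctype:
            counts["both_none"] += 1
        elif not mama_doctype:
            counts["mama_none_only"] += 1
        elif not validator_doctype:
            counts["validator_none_only"] += 1
    return counts
```

With the figures above, the two single-tool rows each add to the shared row to give that tool's total: 1,423,478 + 41,889 = 1,465,367 for MAMA, and 1,423,478 + 51,496 = 1,474,974 for the validator.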

The final two numbers are the most interesting. These discrepancies are still quite large (~3% of the overall 'no Doctype detected' count). What could account for this? Some reasons noticed for the differences (there could be others):

  • MAMA did not look for a Doctype in the destination document of a META refresh/redirect. The validator appears to do this.

    Ex: http://disneyworld.disney.go.com/wdw/parks/parkLanding?id=TLLandingPage

  • MAMA does not request or handle gzipped content, but it was occasionally served to it anyway. The validator appears to handle this.

    Ex: http://nds.gamezone.com/gamesell/p29690.htm

  • MAMA looked anywhere in the document for a Doctype, but the validator only looks near the beginning of the document. A rather large set of URLs unfortunately fit this description.

    Ex: http://www.ruready.com/

  • URL content can change over time, including the addition or deletion of Doctypes. MAMA's analysis occurred in November 2007, and the validation of those same URLs happened in January 2008—over 2 months later. In sampling random parts of the URL set where MAMA did not initially detect a Doctype, a current, live analysis by MAMA does indeed detect a Doctype in most cases tried. Other than a bug existing in MAMA (unfortunately, always possible in any software), this is the best explanation to put forth.
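
The gzip case in the list above is the easiest of these to guard against. A minimal sketch, assuming the raw response body is available as bytes: check for the two-byte gzip signature and inflate the content even when compression was never requested. The function name decode_body is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(raw):
    """Return page bytes, inflating gzip content that was served unasked."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw
```

Sniffing the signature rather than trusting the Content-Encoding header is a deliberate choice here, since the scenario described is a server sending gzipped content to a client that did not ask for it.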

Doctype statement present details

What about URLs that had validator-detectable Doctypes? We will linger on the comparison between MAMA's Doctype detection and the Validator's before looking in depth at what those Doctypes were.

Fig 9-4: Scenarios where Doctype is present
Situation Qty
MAMA detected a Doctype. 1,788,294
The validator detected a Doctype. 1,625,509
MAMA and the validator both detected a Doctype, and it was the same. 1,583,620
MAMA and the validator both detected a Doctype, and it was different. 36,119

Where MAMA and the validator both found a Doctype, they disagree 2.28% of the time. Other than the aforementioned time delay between the MAMA and validator analyses, could there be other reasons to account for this difference? Scanning a list of results for MAMA/validator Doctypes that differed, there may indeed be a trend—and a positive one at that. Of the 36,119 URLs that changed Doctype, 23,390 of them (64.76%) changed from an HTML Doctype to an XHTML Doctype. There are a few reasons mentioned above that could be affecting these results, and the above numbers could be a coincidence, but this looks like a data point supporting the gradual shift from HTML to XHTML.

To summarize the per-URL and per-domain frequency tables for validator Doctype, Transitional FPI flavors have a lock on the top three most popular positions. The other variants trail far behind. If a document has a Doctype, it is likely to be a Transitional flavor of XHTML 1.0 or (even more likely) HTML 4.0x. XHTML 1.0 Strict dominates over any other Strict variant (98% of all Strict types).

Totals for common substrings found in the validator Doctype field

A survey of the FPIs the validator exposed is like a microcosm of the evolution of HTML: there are documents claiming to adhere to "ancient" versions from the early days all the way through to the language's present XHTML incarnations. Searching for a few well-chosen substrings demonstrates this variety well, and we can see how often a given choice of Doctype FPI coincides with actually passing validation. Out of the 1,625,509 URLs exposing a Doctype to the validator, Strict Doctypes pass validation about twice as often as the other flavors, and XHTML Doctypes pass validation considerably more often than HTML Doctypes. More could be said about the final two items in the table below (to say the least), but that is left for a future discussion.

Fig 9-5: Detection of substrings in the Doctype field
Doctype flavor                  Qty          Percentage of total    Passing validation    Percentage of flavor
"Transitional"                  1,341,024    82.50%                 112,348               8.38%
"Strict"                        100,002      6.15%                  17,502                17.50%
"Frameset"                      57,225       3.52%                  4,133                 7.22%

Doctype markup language         Qty          Percentage of total    Passing validation    Percentage of markup language
" html 4" (HTML 4 variants)     987,701      60.76%                 66,535                6.74%
" xhtml 1.0"                    544,622      33.50%                 71,537                13.14%
" html 3.2"                     44,642       2.75%                  1,753                 3.93%
" xhtml 1.1"                    19,984       1.23%                  4,074                 20.39%
" html 2"                       4,792        0.29%                  176                   3.67%
" html 3.0"                     884          0.05%                  44                    4.98%
"WAP"                           789          0.05%                  468                   59.32%
" xhtml 2"                      11           0.00%                  0                     0.00%
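
The substring search behind this table can be sketched in a few lines. The helper names are hypothetical and case-insensitive matching is an assumption; the search strings are the ones listed in the table. The leading space in the markup-language strings matters, because without it "html 1.0" would also match inside "xhtml 1.0".

```python
FLAVORS = ("transitional", "strict", "frameset")
LANGUAGES = (" html 4", " xhtml 1.0", " html 3.2", " xhtml 1.1",
             " html 2", " html 3.0", "wap", " xhtml 2")


def classify(fpi):
    """Return the flavor and markup-language substrings found in an FPI."""
    lowered = fpi.lower()
    flavors = [f for f in FLAVORS if f in lowered]
    languages = [lang.strip() for lang in LANGUAGES if lang in lowered]
    return flavors, languages


def pass_rate(passed, total):
    """Percentage of URLs in a group that passed validation."""
    return round(100 * passed / total, 2)


print(classify("-//W3C//DTD XHTML 1.0 Transitional//EN"))
# (['transitional'], ['xhtml 1.0'])
print(pass_rate(17502, 100002))  # 17.5, the "Strict" row above
```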

The studies from Parnas and Saarsoo did not use the W3C validator, and, as a consequence, there was not such an extreme focus on Doctype usage. Generally, the validator they used only tracked whether a Doctype was used at all. The main reported error type in Parnas' study was a missing Doctype, with only 18.8% of URLs having one present. By the time of Saarsoo's study, the number of URLs having a Doctype moved up to 39.08%. Fast-forward to now, and that number has grown considerably yet again—to 57.7% according to the W3C validator. This is a very respectable increase over time. If few authors are actually creating valid documents, at least most of them seem to understand that there IS a standard to which they should be adhering.

Doctypes for our small, special interest URL sets

Backtracking just a little, the next two tables are a quick look at the Doctypes used for the W3C-member-company URLs and the Alexa Top 500 list. Almost 76% of those URLs passing validation are XHTML variants in the W3C-company set, and in the Alexa list it is almost 66%.

Fig 9-6: Doctype FPIs of W3C-Member-Company Web sites and validation rates
Doctype FPI                                Passed validation    Total    Percentage of FPI type
-//W3C//DTD XHTML 1.0 Transitional//EN     36                   145      24.83%
-//W3C//DTD XHTML 1.0 Strict//EN           23                   45       51.11%
-//W3C//DTD HTML 4.01 Transitional//EN     16                   95       16.84%
-//W3C//DTD XHTML 1.1//EN                  4                    8        50.00%
-//W3C//DTD HTML 4.0 Transitional//EN      3                    22       13.64%
-//W3C//DTD HTML 4.01//EN                  1                    7        14.29%
-//W3C//DTD HTML 3.2//EN                   0                    1        0.00%
-//W3C//DTD HTML 4.01 Frameset//EN         0                    1        0.00%
-//W3C//DTD HTML 3.2 Final//EN             0                    1        0.00%
-//W3C//DTD XHTML 1.0 Strict//FI           0                    1        0.00%
-//W3C//DTD XHTML 1.0 Frameset//EN         0                    1        0.00%
[None]                                     0                    85       0.00%

 

Fig 9-7: Doctype FPIs of Alexa Top 500 Web sites and validation rates
Doctype FPI                                                                 Passed validation    Total    Percentage of FPI type
-//W3C//DTD XHTML 1.0 Strict//EN                                            10                   37       27.03%
-//W3C//DTD XHTML 1.0 Transitional//EN                                      9                    130      6.92%
-//W3C//DTD HTML 4.01 Transitional//EN                                      5                    77       6.49%
-//W3C//DTD HTML 4.0 Transitional//EN                                       3                    22       13.64%
-//W3C//DTD HTML 4.01//EN                                                   2                    12       16.67%
-//W3C//DTD XHTML 1.1//EN                                                   2                    5        40.00%
-//iDNES//DTD HTML 4//EN                                                    1                    1        100.00%
-//W3C//DTD HTML 4.01 Frameset//EN                                          0                    1        0.00%
-//W3C//DTD XHTML 1.1//EN http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd      0                    1        0.00%
-//W3C//DTD XHTML 1.0 Strict //EN                                           0                    1        0.00%
-//W3C//DTD XHTML 1.0 Transitional//ES                                      0                    1        0.00%
-//W3C//DTD HTML 4.0 Strict//EN                                             0                    1        0.00%
[None]                                                                      0                    193      0.00%

Character sets

In the previous section on Doctypes, there were many ways to look at just a single variable (presence or lack of a Doctype). With character sets, it becomes even more complex. Even a simplistic view of character set determination can involve at least three aspects of a document. MAMA, the validator, and the validator's SGML parser ALL have something to say about the choice of a document's character set. To cover every permutation and difference between the many possible charset specification vectors would definitely exhaust the author and most likely bore the reader. Every effort will be made to present some of this data in a way that is not TOO overwhelming.

There are three main areas of interest when determining the character set to use when validating a document:

  • The charset parameter of the Content-Type field in a document's HTTP Header
  • The charset parameter of the Content attribute for a META "Content-Type" declaration
  • The encoding attribute of the XML prologue

For brevity, these will be shortened to "HTTP", "META", and "XML" respectively.
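
To make the three sources concrete, here is a rough sketch that extracts each one from a Content-Type header value and a page's markup. It is a simplification for illustration, not how MAMA or the validator actually parse documents: the function name is hypothetical, and regular expressions stand in for real HTTP and markup parsing.

```python
import re

CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([^\s;\"'>]+)", re.I)
META_TAG_RE = re.compile(r"<meta\b[^>]*>", re.I)
HTTP_EQUIV_RE = re.compile(r"http-equiv\s*=\s*[\"']?content-type", re.I)
XML_ENCODING_RE = re.compile(r"<\?xml[^>]*?encoding\s*=\s*[\"']([^\"']+)[\"']", re.I)


def charset_sources(content_type_header, markup):
    """Return the charset named by each source, or None where it is absent."""
    found = {"HTTP": None, "META": None, "XML": None}

    # HTTP: charset parameter of the Content-Type header field.
    m = CHARSET_RE.search(content_type_header or "")
    if m:
        found["HTTP"] = m.group(1).lower()

    # META: first <meta http-equiv="Content-Type"> tag that carries a charset.
    for tag in META_TAG_RE.findall(markup):
        m = CHARSET_RE.search(tag)
        if HTTP_EQUIV_RE.search(tag) and m:
            found["META"] = m.group(1).lower()
            break

    # XML: encoding attribute of the XML prologue.
    m = XML_ENCODING_RE.search(markup)
    if m:
        found["XML"] = m.group(1).lower()
    return found
```

Values are lowercased so that "UTF-8" and "utf-8" compare as equal; charset aliases (for example "x-sjis" versus "shift_jis") are not reconciled in this sketch.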

Character set differences between MAMA and the validator

An important difference exists between MAMA and the validator when talking about character sets. There is an HTTP header that allows a request to specify which character sets it prefers. MAMA sent this "Accept-Charset" header with a value of "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1". This header field value is used by Opera (9.10), and MAMA tried to emulate this browser as closely as possible. The character sets that were specified reflect the author's own particular language bias. The validator is another story. It does not send an "Accept-Charset" header field at all. This may cause differences between the two and affect the reported character set results.
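
For readers who want to reproduce this difference, a request carrying the same header value can be built with Python's standard library. This is only a sketch: the function name is hypothetical, nothing is sent until the request is passed to urllib.request.urlopen, and the other headers MAMA used to emulate Opera are not reproduced here.

```python
import urllib.request

# Header value quoted in the text above (as sent by Opera 9.10).
ACCEPT_CHARSET = "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"


def build_request(url):
    """Build (but do not send) a request that states charset preferences."""
    return urllib.request.Request(url, headers={"Accept-Charset": ACCEPT_CHARSET})
```

Omitting the headers argument gives a request with no "Accept-Charset" field at all, which corresponds to the validator's behavior as described.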

MAMA's view of character sets

First up is a look at what MAMA was able to determine about these three fields, and how they are used in combination with each other. The totals here account for all cases where a non-empty value was present for any of the HTTP/META/XML charset specification types. The following tables show the frequencies for the different ways that character sets are established and mixed. A document can have none, any, or all of these factors. Note: The XML level in Fig 10-1 appears to be very low in comparison to the other specification methods, but this is because the number of documents with an XML declaration is also rather low. Measured against only the documents that have an XML declaration, that ratio is actually the highest, being even more favorable than the META case at 96,264 of 104,722 URLs (91.92%). Fig 10-2 offers a breakdown of all the combinations of ways to specify a character set. By a large majority, authors do this using only the META element method. The final table, Fig 10-3, shows what happens when more than one source for a character set existed in a document, and whether these multiple values agreed with one another.

Fig 10-1: MAMA—How character sets are specified
Charset source    Number of occurrences    Total where any charset specified    Percentage where any charset specified
HTTP              686,749                  2,626,206                            26.15%
META              2,361,221                2,626,206                            89.91%
XML               96,264                   2,626,206                            3.67%

 

Fig 10-2: MAMA—How character sets are specified in combination
Charset specified in    Quantity     Total where any charset specified    Percentage where any charset specified
HTTP only               240,349      2,626,206                            9.15%
META only               1,872,497    2,626,206                            71.30%
XML only                17,858       2,626,206                            0.68%
HTTP and META           417,109      2,626,206                            15.88%
HTTP and XML            6,791        2,626,206                            0.26%
META and XML            49,115       2,626,206                            1.87%
All three sources       22,500       2,626,206                            0.86%

 

Fig 10-3: MAMA—How character sets disagree when specified in combination
Specified charset sources    Disagree    Total      Percentage
HTTP and META                123,245     417,109    29.55%
HTTP and XML                 2,238       6,791      32.96%
META and XML                 4,086       49,115     8.32%
All three sources            4,399       22,500     19.55%
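
The groupings used in the combination and disagreement tables can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes "disagree" means a case-insensitive mismatch between the literal charset strings; whether MAMA also treated charset aliases as equal is not stated in the text.

```python
def combination(sources):
    """Name the combination of sources that specify a charset."""
    present = [name for name in ("HTTP", "META", "XML") if sources.get(name)]
    if not present:
        return "none"
    if len(present) == 3:
        return "All three sources"
    if len(present) == 1:
        return present[0] + " only"
    return " and ".join(present)


def disagree(sources):
    """True when at least two specified sources name different charsets."""
    values = {value.strip().lower() for value in sources.values() if value}
    return len(values) > 1


page = {"HTTP": "ISO-8859-1", "META": "utf-8", "XML": None}
print(combination(page), disagree(page))  # HTTP and META True
```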

The validator's view of character sets

Now, we will look at the way the markup validator views charset information. The validator generally looks for the same three document sources mentioned previously to determine charset information. Before looking at these actual charset values, it is useful to examine whether the validator's view of charset information is internally consistent or not. It can also be instructive to compare, where possible, the validator's view of charset information versus MAMA's view.

To directly compare validator and MAMA charset information, we must remove some URLs from consideration. The validator's SOAP response returns an empty charset value in all cases where there is a validator failure. It is useful to know if the validator is returning a "truly" empty charset value, so all URLs with a failure error are removed from the examination set for this phase. This immediately reduces our URL group by 408,687 URLs.

The items of interest to look at in the validator response are the contents of the <m:charset > element and warnings issued for no detected charset or charset value mismatch from differing sources. We will explore how/if all these factors mesh when the validator is determining which charset to use.

Validator-detected charsets versus MAMA-detected charsets

The following table is mostly for sanity checking to see if the validator's results resemble MAMA's results. The first two entries have very low totals, but this may involve some corner charset detection cases worth taking a second glance. The third case is a definite indication that the validator has default fallback values used for character set when none is detected through the typical methods.

Fig 10-4: Validator versus MAMA charset detection
Validator charset detected    Scenario                                                              Total
No                            No MAMA charsets detected                                             47
No                            MAMA charset detected                                                 1,179
Yes                           No MAMA charsets detected                                             592,361
Yes                           Validator also issued: "Warning! Conflicting charsets..." message     118,367
Yes                           Validator also issued: "Warning! No charset found..." message         480,942

Validator Warning 04 issued: No character encoding found

This table might be a little confusing with some of the double negatives being tossed around. The presence of a Warning 04 means that the SGML parser portion of the validator did not detect a character set. This result may differ from what the validator ends up deciding should be used for the charset. Note that rows 1 and 2 together give the same total as rows 3 and 4 together (2,619,541), and that row 6 is the sum of rows 7 and 8. Row 5, which is empty, is another indication that the validator uses a default character set value.

Fig 10-5: Validator Warning 04 scenarios
Warning 04 Charset state Total
No No validator charset detected 1,226
No Validator charset detected 2,618,315
No No MAMA charset detected 137,286
No MAMA charset detected 2,482,255
Yes No validator charset detected
Yes Validator charset detected 480,942
Yes No MAMA charset detected 455,122
Yes MAMA charset detected 25,820

Validator Warnings 18-20 issued: Character encoding mismatches

In these cases, the validator discovers more than one encoding source, and there is some disagreement between them. The validator does not say what the disagreement was, so for some idea, we can look at the data MAMA discovered about these sources. Note that the final row in each table is the expected scenario for the warning to be generated; naturally, those totals are the highest by a wide margin. URLs from the other rows may merit further testing, but there is one reason mentioned before that can explain at least some of these quantities: the two-month delta between MAMA's analysis and the validator's analysis of the URL set.

Fig 10-6: Warning 18: Character encoding mismatch
(HTTP Header encoding/XML encoding)
MAMA detected HTTP    MAMA detected XML    Additional factor    Total
Yes                   No                   --                   483
No                    Yes                  --                   70
Yes                   Yes                  Both agree           80
Yes                   Yes                  Both different       2,517

 

Fig 10-7: Warning 19: Character encoding mismatch
(HTTP Header encoding/META encoding)
MAMA detected HTTP    MAMA detected META    Additional factor    Total
Yes                   No                    --                   6,712
No                    Yes                   --                   4,485
Yes                   Yes                   Both agree           4,153
Yes                   Yes                   Both different       97,028

 

Fig 10-8: Warning 20: Character encoding mismatch
(XML encoding/META element encoding)
MAMA detected XML    MAMA detected META    Additional factor    Total
Yes                  No                    --                   79
No                   Yes                   --                   50
Yes                  Yes                   Both agree           88
Yes                  Yes                   Both different       992

Validator-detected charset values

We have saved the best of our character set discussion for last: what values are actually used by the validator for character set? (We will be looking at similar frequency tables for each of the MAMA-detected charset sources (HTTP header, META, XML) in another section of this study.) The full per-URL and per-Domain frequency tables for validator charset show very little movement between the two—you have to go down to #17 before there is a difference! Below is an abbreviated per-URL frequency table for validator character-set values (out of 243 unique values found for this field).

Fig 10-9: Validator character-set short frequency table
Validator charset value    Frequency    Percentage
iso-8859-1                 1,510,827    43.05%
utf-8                      943,326      26.88%
windows-1252               293,595      8.37%
shift_jis                  87,593       2.50%
iso-8859-2                 60,663       1.73%
windows-1251               51,336       1.46%
windows-1250               30,353       0.86%
gb2312                     19,412       0.55%
iso-8859-15                12,276       0.35%
big5                       11,395       0.32%
windows-1254               9,756        0.28%
iso-8859-9                 9,091        0.26%
us-ascii                   8,134        0.23%
euc-jp                     7,174        0.20%
x-sjis                     5,564        0.16%
euc-kr                     4,768        0.14%
