MAMA: W3C validator research
- Interesting views of validation rates, part 2: Alexa Global Top 500
- Validation badge/icons: An interesting diversion?
- Doctypes
- Character sets
Interesting views of validation rates, part 2: Alexa Global Top 500
About the Alexa Global Top 500
Now, we will look at another "interesting" small URL set, this one from Alexa, a service from Amazon. Alexa utilizes Web crawling and user-installed browser toolbars to track "important sites". It maintains, among many other useful measures, a global "Top 500" list of URLs considered popular on the Web. The Alexa list was chosen primarily because it is similar in size to the W3C list, so even though MAMA might be comparing apples to oranges, at least it compares a fairly equal number of apples and oranges. The W3C-company list skews toward academic and "big money" commercial computer sites. The Alexa list, by contrast, is representative of what people actually use and experience on the Web on a day-to-day basis.
While few would dispute that Alexa's "Top 500" list is relevant and popular, there are some definite biases in its list:
- It is prejudiced toward big/popular sites with many country-specific variants, such as Google, Yahoo!, and eBay. This ends up reducing the breadth of the list. Google is the most extreme example of this, with 63 of the 487 URLs in the analyzed set being various regional Google sites.
- It includes the top pages of domain aggregators with varied user content, such as LiveJournal, Facebook, and fc2.com. These top pages are not representative of the wide variety of the user-created content they contain.
- The list consists entirely of top-level, entrance, or "surface" pages of a site. There is no intentional "deep" URL representation.
Validating the Alexa Top 500
On 28 January 2008, the then-latest Alexa Top 500 list was inserted into MAMA [January 2008 snapshot list, latest live version]. About half of these URLs were already in MAMA, having been part of other sources. Of the 500 URLs in this list, 487 were successfully analyzed and validated. Only 32 of these URLs passed validation (6.57%). This is a slightly higher percentage rate than the much larger overall MAMA population, but the quantity and difference are still too small to declare any trends.
| Alexa Top 500 List study | Date | Passed validation | Total set-size | percentage |
|---|---|---|---|---|
| MAMA | Jan. 2008 | 32 | 487 | 6.57% |
For future Alexa studies
OK, so the Alexa Top 500 does have some drawbacks. Should the URL set be tossed out entirely? Can this set be improved? Aside from the Top 500, Alexa has a very deep catalog and categorization of URLs, some of them available freely, but most are available only for a fee. Some categories of URLs include division by country and by language. Alexa currently has publicly-available lists of the top 100 URLs for 21 different languages (2,100 URLs) and 117 countries (11,700 URLs). Note: The per-country list represents popularity among users in a country, not sites hosted in the country. An undoubtedly-interesting expanded list of the Alexa Global Top 500 could be created by aggregating all of these sources, which would probably yield 5,000-10,000 URLs (if duplicates were eliminated).
If the validation rates of the Alexa Global Top 500 are studied in the future, the Top 500 list of URLs will likely be quite different from what it is at the time of writing. The topicality of the list is a strength that promotes the relevance of the analysis, but it also makes cross-comparisons over time difficult. Documenting the list that was used in each analysis will help with such comparisons.
Validation badge/icons: An interesting diversion?
Before MAMA had validated even a single URL, the author discovered this page at the W3C's site: http://www.w3.org/QA/Tools/Icons. This page lists icons that,
"may be used on documents that successfully passed validation for a specific technology, using the W3C validation services".
It seemed like an interesting idea to compare the pages that were using these images claiming validation with how they actually validate. This can only be a crude measure for a number of reasons, but, by far, the main one is as follows: an author can easily host the validation icon/badge on their own server and name it anything they want.
For those gearheads in the audience who have some "regexp savvy", the following Perl regular expression was used to identify validation icon/badges utilizing the W3C naming scheme. This pattern match was used against the Src attribute of the IMG elements of URLs analyzed:
Regexp:
/valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$/i ||
/(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$/i
This seems to fully capture all the variations of the W3C's established naming conventions (any corrections are very welcome if it does not). Note that the regexp errs on the cautious side and can also capture unintended matches, such as JPEG files matching the naming scheme. One might think this an error, but it turns out it is not. JPEG versions of the validation icons are not (currently) listed on the W3C's Web site, but a random spot-check showed that the JPEG images thus detected by MAMA ARE validation badge icons! In this case, what appear to be false positives are actually valid after all.
Ex: http://www.w3.org/Icons/valid-html401-blue.png
is stored as 'html401-blue'
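For readers who want to try the pattern, here is a minimal sketch that ports the two Perl expressions to Python. The function name badge_key is hypothetical, and the rule used to build the stored value (first capture group plus the optional "-blue" suffix) is an assumption inferred from the 'html401-blue' example, not a description of MAMA's actual code.

```python
import re

# Python ports of the two Perl patterns above (re.IGNORECASE stands in for /i).
VALID_BADGE = re.compile(
    r"valid-((css|html|mathml|svg|xhtml|xml).*?)(-blue)?(\.png|\.gif|-v\.svg|-v\.eps)?$",
    re.IGNORECASE,
)
WCAG_BADGE = re.compile(r"(wcag1.*?)(\.png|\.gif|-v\.svg|-v\.eps)?$", re.IGNORECASE)


def badge_key(src):
    """Return a badge identifier for an IMG Src value, or None if nothing matches.

    Assumption: the stored key is capture group 1 plus the optional "-blue"
    suffix, which reproduces the 'html401-blue' example from the text.
    """
    match = VALID_BADGE.search(src)
    if match:
        return match.group(1) + (match.group(3) or "")
    match = WCAG_BADGE.search(src)
    if match:
        return match.group(1)
    return None
```

Because every file extension in the pattern is optional, a Src value such as valid-html401.jpg still matches, and the unrecognized extension ends up inside the captured key. That is the "cautious" behavior described above.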
Validation rates of URLs having validation badge/icons
Now we will look at the list of W3C Validation Image Badges found in MAMA by URL [also by domain]. Even with the various pitfalls that could occur with MAMA's pattern matching, there is still a comparison that is interesting to explore: how many pages that use a badge actually validate? If we consider that the only type of badge of real interest in our sample is an HTML variant (html, xhtml), looking for the substrings "html" and "xhtml" within this field in MAMA gives us:
| Type of badge identified | Total | Actually validated | Percentage |
|---|---|---|---|
| xhtml | 11,657 | 5,480 | 47.01% |
| html | 22,033 | 10,995 | 49.90% |
This is just under 50% in each case, which is frankly a rather miserable hit ratio. If these URLs do not validate, do they bear ANY resemblance to the badge they are claiming?
Comparison of stated validation badge/icon type versus actual detected Doctype
Next, we will compare the Doctypes actually detected against the badges claiming compliance to those respective Doctypes. Doctypes detected in both the validator and MAMA analyses are listed for comparison. The situation definitely improves here over the previous figures. Note: Fatal validation errors cause the validator to under-report Doctypes by reporting no Doctype at all in such cases.
| Type of badge identified | Validator-detected Doctype | MAMA-detected Doctype | Total according to badge/icon |
|---|---|---|---|
| xhtml | 10,553 | 11,054 | 11,657 |
| html | 20,570 | 21,475 | 22,033 |
The validation badges certainly increase public awareness of validation as something for which authors strive, but they do not appear to be the best measure of reality. For the half of badged URLs that claim validation compliance but currently do not validate, one has to wonder whether they ever did validate in the past. Pages definitely tend to change over time, and removing or updating an icon badge may not be high on most authors' lists of "Things To Do". The next time you see such an icon, take its current state with a grain of salt.
For future W3C badge studies
After this survey was completed, the following rather prominent quote was noticed on the W3C's Validation Icons page,
"The image should be used as a link to re-validate the document."
It may be useful to incorporate this fact to identify further validation badges in the future.
Doctypes
What are we examining?
First up is the Doctype. The Doctype statement tells the validator which DTD to use when validating—it is the basic evaluation metric for the document. MAMA used its own methods to divine the Doctype for every document, but the validator actually detects the Doctype in two slightly different ways: one by the validator itself and the other by the SGML parser at the core of the validator.
| Source of Doctype | Information being used |
|---|---|
| MAMA | Detected Doctype statement |
| Validator | SOAP <m:doctype> content |
| Validator | 'W09'/'W09x' warning messages |
This is a good time to dissect a Doctype and see what makes it tick. We will look at a typical Doctype statement, and examine all of its parts:
Ex: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
| Component | Description |
|---|---|
| "<!DOCTYPE" | The beginning of the Doctype |
| "html" | This string specifies the name of the root element for the markup type. |
| "PUBLIC" | This indicates the availability of the DTD resource. It can be a publicly-accessible object ("PUBLIC") or a system resource ("SYSTEM") such as a local file or URL. HTML/XHTML DTDs are specified by "PUBLIC" identifiers. |
| "-//W3C//DTD XHTML 1.0 Transitional//EN" | This is the Formal Public Identifier (FPI). This compact, quoted string gives a lot of information about the DTD, such as its Registration, Organization, Type, Label, and the Encoding language. For HTML/XHTML DTDs, the most interesting part of this is the label portion (the "XHTML 1.0 Transitional" part). If the processing entity does not already have local access to this DTD, it can get it from the System Identifier (next portion). |
| "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" | The System Identifier (SI); the URL location of the DTD specified in the FPI |
| ">" | The ending of the Doctype |
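The breakdown in the table can be sketched in code. The following is an illustrative Python helper, not MAMA's or the validator's actual parsing logic; the name parse_doctype and the returned keys are hypothetical, and it deliberately handles only Doctypes with a PUBLIC identifier.

```python
import re

# Hypothetical helper: splits a Doctype with a PUBLIC identifier into the
# parts listed in the table. SYSTEM-only and malformed Doctypes return None.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)\s+(?P<availability>PUBLIC)\s+'
    r'"(?P<fpi>[^"]*)"(?:\s+"(?P<si>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(statement):
    match = DOCTYPE_RE.search(statement)
    if match is None:
        return None
    parts = match.groupdict()
    # The FPI fields are separated by "//"; the third field holds the
    # "DTD <label>" text and the fourth holds the language code.
    fields = parts["fpi"].split("//")
    if len(fields) == 4 and fields[2].upper().startswith("DTD "):
        parts["label"] = fields[2][4:]
    else:
        parts["label"] = None
    return parts
```

For the example Doctype above, this returns "html" as the root element, the full FPI string, the DTD URL as the SI, and "XHTML 1.0 Transitional" as the label. When the System Identifier is omitted, the "si" entry is None.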
MAMA's analysis stores the entire DOCTYPE statement, but the validator's SOAP response returns only a portion of it: generally the FPI, but in some situations the SI instead, or even nothing at all if an error condition is detected. These situations are infrequent, though; only 70 URLs analyzed by the validator returned the Doctype's SI, for example.
!Doctypes!
The validator examined roughly 3.51 million URLs overall. Of those, the validator says that 1,474,974 (42.03%) "definitely" did not use a DOCTYPE (indicated by empty content for the <m:doctype> element in the SOAP response). In addition to the empty <m:doctype> element, the validator also returns explicit warnings for the instances where it does not encounter a Doctype statement: specifically, warning codes 'W09' and 'W09x' are generated by the SGML parser layer of the validator. Is there any correlation between these warning codes and the "official" empty Doctype mentioned in the SOAP response? The quick answer is yes. Some 1,373,352 URLs have either the 'W09' or 'W09x' warnings. Looking closer for a direct correlation, 1,371,899 URLs were issued a 'W09'/'W09x' warning AND do not have a Doctype listed in the SOAP response. This leaves 1,453 URLs that had some sort of validator-detectable Doctype but were still issued a warning for No Doctype. Sampling several URLs from this set showed that, in every case, the Doctype statement was not at the very beginning of the document. So, it appears that the OpenSP parser does not like this, but the validator itself is OK with this scenario.
MAMA also looked at Doctypes in its main analysis, so we can compare the cases where each tool found no Doctype. MAMA found 1,720,886 URLs without a Doctype. This is a rather large discrepancy compared to the validator's numbers above. We must adjust this figure, because the SOAP response for a validation failure error returns empty <m:doctype> and <m:charset> elements. To improve the quality of our comparison between MAMA's and the validator's results, we must exclude from our mutual examination all URLs with a positive validator failure count. After this minor adjustment, the numbers are much more in line with each other. To the numbers:
| Situation | Qty |
|---|---|
| MAMA detected no Doctype. | 1,465,367 |
| Validator detected no Doctype. | 1,474,974 |
| MAMA and validator both detected no Doctype. | 1,423,478 |
| MAMA detected no Doctype, but the validator did. | 41,889 |
| Validator detected no Doctype, but MAMA did. | 51,496 |
The final two numbers are the most interesting. These discrepancies are still quite large (~3% of the overall 'no Doctype detected' count). What could account for this? Some reasons noticed for the differences (there could be others):
- MAMA did not look for a Doctype in the destination document of a META refresh/redirect. The validator appears to do this.
Ex: http://disneyworld.disney.go.com/wdw/parks/parkLanding?id=TLLandingPage
- MAMA does not request or handle gzipped content, but it was occasionally served to it anyway. The validator appears to handle this.
- MAMA looked anywhere in the document for a Doctype, but the validator only looks near the beginning of the document. A rather large set of URLs unfortunately fit this description.
- URL content can change over time, including the addition or deletion of Doctypes. MAMA's analysis occurred in November 2007, and the validation of those same URLs happened in January 2008—over 2 months later. In sampling random parts of the URL set where MAMA did not initially detect a Doctype, a current, live analysis by MAMA does indeed detect a Doctype in most cases tried. Other than a bug existing in MAMA (unfortunately, always possible in any software), this is the best explanation to put forth.
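The third reason in the list (searching the whole document versus only the beginning) can be made concrete with a small sketch. Neither tool's real detection logic is given in this study, so both functions below are hypothetical; in particular, what counts as "at the beginning" (only a byte-order mark, whitespace, an XML declaration, and comments may precede the Doctype) is an assumption chosen for illustration.

```python
import re

# Lenient check: a Doctype statement anywhere in the markup.
DOCTYPE_ANYWHERE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)

# Strict check. Assumption: only a byte-order mark, whitespace, an XML
# declaration, and comments may come before the Doctype.
DOCTYPE_AT_START = re.compile(
    r"\ufeff?\s*(?:<\?xml[^>]*\?>\s*)?(?:<!--.*?-->\s*)*<!DOCTYPE\b",
    re.IGNORECASE | re.DOTALL,
)


def doctype_anywhere(markup):
    return DOCTYPE_ANYWHERE.search(markup) is not None


def doctype_at_start(markup):
    return DOCTYPE_AT_START.match(markup) is not None
```

A document whose Doctype appears after the opening HTML tag is counted by the lenient check but not by the strict one, which is the kind of disagreement described above.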
Doctype statement present details
What about URLs that had validator-detectable Doctypes? We will linger on the comparison between MAMA's Doctype detection and the Validator's before looking in depth at what those Doctypes were.
| Situation | Qty |
|---|---|
| MAMA detected a Doctype. | 1,788,294 |
| The validator detected a Doctype. | 1,625,509 |
| MAMA and the validator both detected a Doctype, and it was the same. | 1,583,620 |
| MAMA and the validator both detected a Doctype, and it was different. | 36,119 |
Where MAMA and the validator both found a Doctype, they disagree 2.23% of the time (36,119 of 1,619,739 URLs). Other than the aforementioned time delay between the MAMA and validator analyses, could there be other reasons to account for this difference? Scanning a list of results for MAMA/validator Doctypes that differed, there may indeed be a trend, and a positive one at that. Of the 36,119 URLs that changed Doctype, 23,390 of them (64.76%) changed from an HTML Doctype to an XHTML Doctype. There are a few reasons mentioned above that could be affecting these results, and the above numbers could be a coincidence, but this looks like a data point supporting the gradual shift from HTML to XHTML.
To summarize the per-URL and per-domain frequency tables for validator Doctype, Transitional FPI flavors have a lock on the top three most popular positions. The other variants trail far behind. If a document has a Doctype, it is likely to be a Transitional flavor of XHTML 1.0 or (even more likely) HTML 4.0x. XHTML 1.0 Strict dominates over any other Strict variant (98% of all Strict types).
Totals for common substrings found in the validator Doctype field
A survey of the FPIs the validator exposed is like a microcosm of the evolution of HTML: there are documents claiming to adhere to "ancient" versions from the early days all the way through to the language's present XHTML incarnations. Searching for a few well-chosen substrings demonstrates this variety well, and we can see how often an author's choice of Doctype FPI results in actually passing validation. Out of the 1,625,509 URLs exposing a Doctype to the validator, Strict Doctypes pass validation twice as often as the other flavors, and XHTML Doctypes are much more likely to pass validation than HTML Doctypes. More could be said about the final two items in the table below (to say the least), but that is left for a future discussion.
| Doctype flavor | Qty | Percentage of total | Passing validation | Percentage of flavor |
|---|---|---|---|---|
| "Transitional" | 1,341,024 | 82.50% | 112,348 | 8.38% |
| "Strict" | 100,002 | 6.15% | 17,502 | 17.50% |
| "Frameset" | 57,225 | 3.52% | 4,133 | 7.22% |

| Doctype markup language | Qty | Percentage of total | Passing validation | Percentage of markup language |
|---|---|---|---|---|
| " html 4" (HTML 4 variants) | 987,701 | 60.76% | 66,535 | 6.74% |
| " xhtml 1.0" | 544,622 | 33.50% | 71,537 | 13.14% |
| " html 3.2" | 44,642 | 2.75% | 1,753 | 3.93% |
| " xhtml 1.1" | 19,984 | 1.23% | 4,074 | 20.39% |
| " html 2" | 4,792 | 0.29% | 176 | 3.67% |
| " html 3.0" | 884 | 0.05% | 44 | 4.98% |
| "WAP" | 789 | 0.05% | 468 | 59.32% |
| " xhtml 2" | 11 | 0.00% | 0 | 0.00% |
The studies from Parnas and Saarsoo did not use the W3C validator, and, as a consequence, there was not such an extreme focus on Doctype usage. Generally, the validator they used only tracked whether a Doctype was used at all. The main reported error type in Parnas' study was a missing Doctype, with only 18.8% of URLs having one present. By the time of Saarsoo's study, the number of URLs having a Doctype moved up to 39.08%. Fast-forward to now, and that number has grown considerably yet again—to 57.7% according to the W3C validator. This is a very respectable increase over time. If few authors are actually creating valid documents, at least most of them seem to understand that there IS a standard to which they should be adhering.
Doctypes for our small, special interest URL sets
Backtracking just a little, the next two tables are a quick look at the Doctypes used for the W3C-member-company URLs and the Alexa Top 500 list. Almost 76% of those URLs passing validation are XHTML variants in the W3C-company set, and in the Alexa list it is almost 66%.
| Doctype FPI | Passed validation | Total | Percentage of FPI type |
|---|---|---|---|
| -//W3C//DTD XHTML 1.0 Transitional//EN | 36 | 145 | 24.83% |
| -//W3C//DTD XHTML 1.0 Strict//EN | 23 | 45 | 51.11% |
| -//W3C//DTD HTML 4.01 Transitional//EN | 16 | 95 | 16.84% |
| -//W3C//DTD XHTML 1.1//EN | 4 | 8 | 50.00% |
| -//W3C//DTD HTML 4.0 Transitional//EN | 3 | 22 | 13.64% |
| -//W3C//DTD HTML 4.01//EN | 1 | 7 | 14.29% |
| -//W3C//DTD HTML 3.2//EN | 0 | 1 | 0.00% |
| -//W3C//DTD HTML 4.01 Frameset//EN | 0 | 1 | 0.00% |
| -//W3C//DTD HTML 3.2 Final//EN | 0 | 1 | 0.00% |
| -//W3C//DTD XHTML 1.0 Strict//FI | 0 | 1 | 0.00% |
| -//W3C//DTD XHTML 1.0 Frameset//EN | 0 | 1 | 0.00% |
| [None] | 0 | 85 | 0.00% |

| Doctype FPI | Passed validation | Total | Percentage of FPI type |
|---|---|---|---|
| -//W3C//DTD XHTML 1.0 Strict//EN | 10 | 37 | 27.03% |
| -//W3C//DTD XHTML 1.0 Transitional//EN | 9 | 130 | 6.92% |
| -//W3C//DTD HTML 4.01 Transitional//EN | 5 | 77 | 6.49% |
| -//W3C//DTD HTML 4.0 Transitional//EN | 3 | 22 | 13.64% |
| -//W3C//DTD HTML 4.01//EN | 2 | 12 | 16.67% |
| -//W3C//DTD XHTML 1.1//EN | 2 | 5 | 40.00% |
| -//iDNES//DTD HTML 4//EN | 1 | 1 | 100.00% |
| -//W3C//DTD HTML 4.01 Frameset//EN | 0 | 1 | 0.00% |
| -//W3C//DTD XHTML 1.1//EN http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd | 0 | 1 | 0.00% |
| -//W3C//DTD XHTML 1.0 Strict //EN | 0 | 1 | 0.00% |
| -//W3C//DTD XHTML 1.0 Transitional//ES | 0 | 1 | 0.00% |
| -//W3C//DTD HTML 4.0 Strict//EN | 0 | 1 | 0.00% |
| [None] | 0 | 193 | 0.00% |
Character sets
In the previous section on Doctypes, there were many ways to look at just a single variable (presence or lack of a Doctype). Now, with character sets, it becomes even more complex. Even a simplistic view of character set determination can involve at least three aspects of a document. MAMA, the validator, and the validator's SGML parser ALL have something to say about the choice of a document's character set. To cover every permutation and difference between the many possible charset specification vectors would definitely exhaust the author and most likely bore the reader. Every effort will be made to present some of this data in a way that is not TOO overwhelming.
There are three main areas of interest when determining the character set to use when validating a document:
- The charset parameter of the Content-Type field in a document's HTTP Header
- The charset parameter of the Content attribute for a META "Content-Type" declaration
- The encoding attribute of the XML prologue
For brevity, these will be shortened to "HTTP", "META", and "XML" respectively.
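As a rough illustration of where these three values live, the sketch below pulls each one out of a response's Content-Type header value and its markup. It is a simplified, regex-based approximation with hypothetical names (charset_sources and the pattern constants); it is not the detection code used by MAMA or the validator, and it ignores details such as byte-order marks and unusual attribute ordering.

```python
import re

# Matches charset=... in an HTTP header value or a META Content attribute.
CHARSET_PARAM = re.compile(r"""charset\s*=\s*["']?([^\s;"'>]+)""", re.IGNORECASE)
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV_CT = re.compile(r"""http-equiv\s*=\s*["']?content-type""", re.IGNORECASE)
XML_ENCODING = re.compile(
    r"""<\?xml[^>]*?\bencoding\s*=\s*["']([^"']+)["']""", re.IGNORECASE
)


def charset_sources(content_type_header, markup):
    """Return the charset named by each source, lowercased, or None if absent."""
    found = {"HTTP": None, "META": None, "XML": None}

    match = CHARSET_PARAM.search(content_type_header or "")
    if match:
        found["HTTP"] = match.group(1).lower()

    # Only META elements that declare http-equiv="Content-Type" are considered.
    for tag in META_TAG.findall(markup):
        if HTTP_EQUIV_CT.search(tag):
            match = CHARSET_PARAM.search(tag)
            if match:
                found["META"] = match.group(1).lower()
                break

    # The XML prologue must be the first thing in the document.
    match = XML_ENCODING.match(markup.lstrip())
    if match:
        found["XML"] = match.group(1).lower()
    return found
```

A document can supply any combination of the three, so each entry in the returned dictionary is filled in independently of the others.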
Character set differences between MAMA and the validator
An important difference exists between MAMA and the validator when talking about character sets. There is an HTTP header that allows a request to specify which character sets it prefers. MAMA sent this "Accept-Charset" header with a value of "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1". This header field value is used by Opera (9.10), and MAMA tried to emulate this browser as closely as possible. The character sets that were specified reflect the author's own particular language bias. The validator is another story. It does not send an "Accept-Charset" header field at all. This may cause differences between the two and affect the reported character set results.
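To make the difference concrete, here is a minimal sketch of building a request with and without the header, using Python's standard urllib. The header value is the one quoted above; the function name build_request and the flag are hypothetical, and no request is actually sent.

```python
from urllib.request import Request

# Header value quoted in the text (the one MAMA sent, matching Opera 9.10).
ACCEPT_CHARSET = "windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1"


def build_request(url, send_accept_charset=True):
    """Build (but do not send) a request, optionally with Accept-Charset.

    send_accept_charset=True mimics MAMA as described above;
    send_accept_charset=False mimics the validator, which omits the header.
    """
    headers = {"Accept-Charset": ACCEPT_CHARSET} if send_accept_charset else {}
    # Note: urllib stores header names internally as "Accept-charset".
    return Request(url, headers=headers)
```

A server that honors Accept-Charset could, in principle, return different encodings to the two request shapes, which is one way the reported character sets might diverge.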
MAMA's view of character sets
First up is a look at what MAMA was able to determine about these three fields, and how they are used in combination with each other. The totals here account for all cases where a non-empty value was present for any of the HTTP/META/XML charset specification types. The following tables show the frequencies for the different ways that character sets are established and mixed. A document can have none, any, or all of these factors. Note: The XML level in Fig 9-1 appears to be very low in comparison to the other specification methods, but this is because the number of documents with an XML declaration is also rather low. Measured against only those documents, the XML ratio is actually the highest, at 96,264 of 104,722 URLs (91.92%), which is even more favorable than the META case. Fig 9-2 offers a breakdown of all the combinations of ways to specify a character set. By a large majority, authors do this using only the META element method. The final table, Fig 9-3, shows what happens when more than one source for a character set existed in a document, and whether these multiple values agreed with one another.
| Charset source | Number of occurrences | Total where any charset specified | Percentage where any charset specified |
|---|---|---|---|
| HTTP | 686,749 | 2,626,206 | 26.15% |
| META | 2,361,221 | 2,626,206 | 89.91% |
| XML | 96,264 | 2,626,206 | 3.67% |
| Charset specified in | Quantity | Total where any charset specified | Percentage where any charset specified |
|---|---|---|---|
| HTTP only | 240,349 | 2,626,206 | 9.15% |
| META only | 1,872,497 | 2,626,206 | 71.30% |
| XML only | 17,858 | 2,626,206 | 0.68% |
| HTTP and META | 417,109 | 2,626,206 | 15.88% |
| HTTP and XML | 6,791 | 2,626,206 | 0.26% |
| META and XML | 49,115 | 2,626,206 | 1.87% |
| All three sources | 22,500 | 2,626,206 | 0.86% |
| Specified charset sources | Disagree | Total | Percentage |
|---|---|---|---|
| HTTP and META | 123,245 | 417,109 | 29.55% |
| HTTP and XML | 2,238 | 6,791 | 32.96% |
| META and XML | 4,086 | 49,115 | 8.32% |
| All three sources | 4,399 | 22,500 | 19.55% |
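The bookkeeping behind the last two tables can be sketched as follows. The function name compare_sources is hypothetical, and the agreement rule (a case-insensitive comparison of charset names, with no alias handling) is an assumption; the study does not say exactly how MAMA decided that two values disagree.

```python
def compare_sources(found):
    """Classify which charset sources are present and whether they agree.

    `found` maps "HTTP", "META", and "XML" to a charset name or None.
    Returns (label, agree); agree is None when fewer than two sources exist.
    Assumption: values agree when they match case-insensitively, so aliases
    such as "x-sjis" and "shift_jis" count as a disagreement here.
    """
    order = ["HTTP", "META", "XML"]
    present = {name: found[name].strip().lower() for name in order if found.get(name)}
    names = [name for name in order if name in present]

    if not names:
        return ("none", None)
    if len(names) == 1:
        return (names[0] + " only", None)

    label = "All three sources" if len(names) == 3 else " and ".join(names)
    agree = len(set(present.values())) == 1
    return (label, agree)
```

Tallying the returned labels over a URL set would give a combination breakdown, and counting the cases where agree is False would give the disagreement totals for each combination.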
The validator's view of character sets
Now, we will look at the way the markup validator views charset information. The validator generally looks for the same three document sources mentioned previously to determine charset information. Before looking at these actual charset values, it is useful to examine whether the validator's view of charset information is internally consistent or not. It can also be instructive to compare, where possible, the validator's view of charset information versus MAMA's view.
To directly compare validator and MAMA charset information, we must remove some URLs from consideration. The validator's SOAP response returns an empty charset value in all cases where there is a validator failure. It is useful to know if the validator is returning a "truly" empty charset value, so all URLs with a failure error are removed from the examination set for this phase. This immediately reduces our URL group by 408,687 URLs.
The items of interest to look at in the validator response are the contents of the <m:charset> element and warnings issued for no detected charset or charset value mismatch from differing sources. We will explore how/if all these factors mesh when the validator is determining which charset to use.
Validator-detected charsets versus MAMA-detected charsets
The following table is mostly for sanity checking to see if the validator's results resemble MAMA's results. The first two entries have very low totals, but they may involve some corner cases of charset detection that are worth a second glance. The third case is a definite indication that the validator uses default fallback values for the character set when none is detected through the typical methods.
| Validator charset detected | Scenario | Total |
|---|---|---|
| No | No MAMA charsets detected | 47 |
| No | MAMA charset detected | 1,179 |
| Yes | No MAMA charsets detected | 592,361 |
| Yes | Validator also issued: "Warning! Conflicting charsets..." message | 118,367 |
| Yes | Validator also issued: "Warning! No charset found..." message | 480,942 |
Validator Warning 04 issued: No character encoding found
This table might be a little confusing with some of the double negatives being tossed around. The presence of a Warning 04 means that the SGML parser portion of the validator did not detect a character set. This result may differ from what the validator ends up deciding should be used for the charset. Note that rows 1 and 2 together add up to the same total as rows 3 and 4 together. Likewise, row 6 equals the sum of rows 7 and 8, because row 5 contains no URLs. Row 5 is another indication that the validator uses a default character set value.
| Warning 04 | Charset state | Total |
|---|---|---|
| No | No validator charset detected | 1,226 |
| No | Validator charset detected | 2,618,315 |
| No | No MAMA charset detected | 137,286 |
| No | MAMA charset detected | 2,482,255 |
| Yes | No validator charset detected | 0 |
| Yes | Validator charset detected | 480,942 |
| Yes | No MAMA charset detected | 455,122 |
| Yes | MAMA charset detected | 25,820 |
Validator Warnings 18-20 issued: Character encoding mismatches
In these cases, the validator discovers more than one encoding source, and there is some disagreement between them. The validator does not say what the disagreement was, so to get some idea, we can look at the data MAMA discovered about these sources. Note that the final row in each table is the expected scenario for the warning to be generated; naturally, those totals are the highest by a wide margin. URLs from the other rows may merit further testing, but there is one reason mentioned before that can explain at least some of these quantities: the two-month delta between MAMA's analysis and the validator's analysis of the URL set.
| MAMA Detected HTTP | MAMA Detected XML | Additional Factor | Total |
|---|---|---|---|
| Yes | No | -- | 483 |
| No | Yes | -- | 70 |
| Yes | Yes | Both agree | 80 |
| Yes | Yes | Both different | 2,517 |

| MAMA Detected HTTP | MAMA Detected META | Additional Factor | Total |
|---|---|---|---|
| Yes | No | -- | 6,712 |
| No | Yes | -- | 4,485 |
| Yes | Yes | Both agree | 4,153 |
| Yes | Yes | Both different | 97,028 |

| MAMA Detected XML | MAMA Detected META | Additional Factor | Total |
|---|---|---|---|
| Yes | No | -- | 79 |
| No | Yes | -- | 50 |
| Yes | Yes | Both agree | 88 |
| Yes | Yes | Both different | 992 |
Validator-detected charset values
We have saved the best of our character set discussion for last: what values are actually used by the validator for character set? (We will be looking at similar frequency tables for each of the MAMA-detected charset sources (HTTP header, META, XML) in another section of this study.) The full per-URL and per-Domain frequency tables for validator charset show very little movement between the two—you have to go down to #17 before there is a difference! Below is an abbreviated per-URL frequency table for validator character-set values (out of 243 unique values found for this field).
| Validator charset value | Frequency | Percentage | Validator charset value | Frequency | Percentage |
|---|---|---|---|---|---|
| iso-8859-1 | 1,510,827 | 43.05% | iso-8859-15 | 12,276 | 0.35% |
| utf-8 | 943,326 | 26.88% | big5 | 11,395 | 0.32% |
| windows-1252 | 293,595 | 8.37% | windows-1254 | 9,756 | 0.28% |
| shift_jis | 87,593 | 2.50% | iso-8859-9 | 9,091 | 0.26% |
| iso-8859-2 | 60,663 | 1.73% | us-ascii | 8,134 | 0.23% |
| windows-1251 | 51,336 | 1.46% | euc-jp | 7,174 | 0.20% |
| windows-1250 | 30,353 | 0.86% | x-sjis | 5,564 | 0.16% |
| gb2312 | 19,412 | 0.55% | euc-kr | 4,768 | 0.14% |