Dev.Opera - Follow the standards, break the rules

By blooberry, Wednesday, 15. October 2008, 00:21:19

MAMA: Key findings



( Read the article )

By porneL, Wednesday, 15. October 2008, 19:48:26

Nice findings!

Do you detect SWFObject script? (technically it may not contain "flash" or ".swf")

Do you group domains somehow? e.g. does *.blogspot.com count as one or million domains?

Do you try to detect and eliminate inflated results from countless wordpress installs (and other popular CMS/blog software)?

How about weighting results by pagerank or similar? Geocities is probably full of sites that use font, etc. and that nobody cares about.

And can you change layout of tables from weird two-tables-in-a-table layout to something simpler? It's hard to read, especially top 10 lists are ambiguous.

By nathany, Thursday, 16. October 2008, 17:28:56

Very interesting findings. If I understand correctly, these statistics are all at a page level? I'm guessing that URL is the same as page, rather than domain.

It would be interesting to see the statistics by site/domain instead. For example, XMLHttpRequest often is used via a JavaScript library, and would appear in a single .js file rather than on several pages. That being the case, 3% doesn't seem representative of the real usage.

Flash usage could be similarly skewed, in that many web sites using Flash have only a single page.

Overall, I think a per site stat would be much more relevant.

By blooberry, Thursday, 16. October 2008, 20:38:55

Originally posted by porneL:

Do you detect SWFObject script? (technically it may not contain "flash" or ".swf")



I actually didn't add that as part of MAMA's group of detections for determining Flash usage, so I just did some checking on that.

Use of SWFObject does not appear to make a difference in this total. SWFObject was found as a script identifier in 55,903 cases, and all but 9 cases (55,894) fell under MAMA's flash detection strategies. The remaining 9 were almost universally the same code template system doing the same exact thing.

Ex: http://www.americanroofing.com/home.html

I'm now mulling over adding that detection, but it doesn't seem to be a big enough issue to worry about.
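To illustrate the gap porneL is asking about, here is a minimal sketch of keyword-based detection. It is not MAMA's actual detection code: the patterns and the `classify` helper are assumptions made up for this example, based only on the "flash" / ".swf" hints mentioned in the question.

```python
import re

# Hypothetical patterns, for illustration only (not MAMA's real rules).
FLASH_HINTS = re.compile(r"flash|\.swf", re.IGNORECASE)
SWFOBJECT_HINT = re.compile(r"swfobject", re.IGNORECASE)

def classify(markup):
    """Report which of the two hint groups appear in a page's markup."""
    return {
        "flash": bool(FLASH_HINTS.search(markup)),
        "swfobject": bool(SWFOBJECT_HINT.search(markup)),
    }

# A page that only loads the SWFObject script matches neither "flash"
# nor ".swf", which is the case the question describes.
only_script = classify('<script src="swfobject.js"></script>')
plain_embed = classify('<embed src="movie.swf">')
```

A page would be one of the "remaining" cases when `swfobject` is true and `flash` is false.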

Originally posted by porneL:

Do you group domains somehow? e.g. does *.blogspot.com count as one or million domains?



I instituted what I call "domain capping": limiting the number of URLs in MAMA from any single domain. This was especially necessary because of CNN's over-representation in DMoz (~5%). The upper bound I set for the domain cap was 30 URLs. 4,413 domains hit the domain cap in this study.

Keeping track of domains also allows MAMA to perform queries solely based on those domains, so you can make more blanket statements like "X domains have at least 1 URL that..."
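The capping idea can be sketched roughly like this. This is only an illustration, not MAMA's code: `cap_by_domain` is a hypothetical helper, and treating the full hostname as the "domain" is an assumption (so subdomains such as those under blogspot.com would each count separately here).

```python
from collections import defaultdict
from urllib.parse import urlsplit

DOMAIN_CAP = 30  # the upper bound mentioned above

def cap_by_domain(urls, cap=DOMAIN_CAP):
    """Keep at most `cap` URLs per hostname, preserving input order."""
    kept = []
    counts = defaultdict(int)  # hostname -> number of URLs kept so far
    for url in urls:
        # Assumption: the hostname stands in for the "domain".
        host = urlsplit(url).hostname or ""
        if counts[host] < cap:
            counts[host] += 1
            kept.append(url)
    return kept

# 50 URLs from one host plus 1 from another: only 30 + 1 survive.
sample = [f"http://www.cnn.com/page{i}" for i in range(50)]
sample.append("http://example.org/")
capped = cap_by_domain(sample)
```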

Originally posted by porneL:

Do you try to detect and eliminate inflated results from countless wordpress installs (and other popular CMS/blog software)?



Not directly, no. Only on a domain basis so far. I'm of two minds about changing toward filtering out some results because we think they are over-represented. By the same token, we could limit URLs using CSS because there are too many. It seems like a slippery slope. Domain categorization is useful, and (I hope) the domain cap allows for *some* variety within the domain for authoring differences between different pages.

Originally posted by porneL:

How about weighting results by pagerank or similar? Geocities is probably full of sites that use font, etc. and that nobody cares about.



That is certainly something to consider, but access to a search engine's page ranking was not something I could easily get at the time, and I'm not sure I can guarantee it yet. Geocities and sites like it present a unique problem though: unlike many commercial sites that are produced by the same entity (and will likely have the same or similar coding throughout), community sites like Geocities have a broader authoring base that should be taken into account.

Originally posted by porneL:

And can you change layout of tables from weird two-tables-in-a-table layout to something simpler? It's hard to read, especially top 10 lists are ambiguous.



Yeah, we're trying to address that. The original version had a vertical gutter in between the two column groups, but dev.opera's publishing system appears to have some stringent limitations that I am not growing fond of. =) It strips out some things from my markup and keeps them out.

Hope this is helpful information,

-Brian

By porneL, Thursday, 16. October 2008, 20:48:48

Very helpful, thanks.

As for "PageRank": in a similar study I used Yahoo's free search API to get the number of sites linking to a given one. Unfortunately the API is limited to 5000 queries per IP per day, so you might need a special deal...

By blooberry, Thursday, 16. October 2008, 21:57:58

Originally posted by nathany:

Very interesting findings. If I understand correctly, these statistics are all at a page level? I'm guessing that URL is the same as page, rather than domain.



Not *all* of the results are at the URL level. MAMA tracks things by domain as well, so you can get stats like "at least 1 URL in the domain satisfies X condition". In a few places in the research I included domain-centric results, but I'm sure I could put out even more analysis in some spots.

Originally posted by nathany:

It would be interesting to see the statistics by site/domain instead. For example, XMLHttpRequest often is used via a JavaScript library, and would appear in a single .js file rather than on several pages. That being the case, 3% doesn't seem representative of the real usage.

Flash usage could be similarly skewed, in that many web sites using Flash have only a single page.



For XMLHttpRequest:
- By URL: 112,277 (3.20% of all URLs) using XMLHttpRequest
- By Domain: 94,767 (3.15% of all domains) using XMLHttpRequest

For Flash:
- By URL: 1,176,227 (33.5% of all URLs) using Flash
- By Domain: 1,050,121 (34.87% of all domains) using Flash

So, the overall percentages don't seem to move much. I agree that, given MAMA's current URL disposition (top-page/surface URL-centric), it would be better in many cases to examine usage by domain instead of by URL.
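For anyone who wants to try the same comparison on their own crawl data, the two ways of counting can be sketched as follows. This is a simplified illustration, not MAMA's query code: `usage_stats` is a hypothetical helper, the hostname is assumed to stand in for the domain, and a domain counts as a hit when at least 1 of its URLs uses the feature.

```python
from urllib.parse import urlsplit

def usage_stats(pages):
    """pages: iterable of (url, uses_feature) pairs.

    Returns (percent of URLs, percent of domains) using the feature.
    """
    url_total = 0
    url_hits = 0
    domains = {}  # hostname -> True if any URL on it uses the feature
    for url, uses_feature in pages:
        host = urlsplit(url).hostname or ""
        url_total += 1
        url_hits += bool(uses_feature)
        domains[host] = domains.get(host, False) or bool(uses_feature)
    if url_total == 0:
        return (0.0, 0.0)
    domain_hits = sum(domains.values())
    return (100.0 * url_hits / url_total,
            100.0 * domain_hits / len(domains))

# Made-up sample: 4 URLs across 3 hosts, one URL uses the feature.
by_url, by_domain = usage_stats([
    ("http://a.example/1", True),
    ("http://a.example/2", False),
    ("http://b.example/", False),
    ("http://c.example/", False),
])
```

The two figures diverge most when a feature is concentrated on a few pages of multi-page domains, which is the skew nathany describes.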
