Dev.Opera - Follow the standards, break the rules

Discuss the articles posted on Dev.Opera.

By blooberry, Wednesday, 15. October 2008, 00:21:31

MAMA: The URL Set



( Read the article )

By scipio, Saturday, 18. October 2008, 21:04:53

Given the goal of MAMA and the other uses that you've thought of:

Originally posted by Brian Wilson:

The intent has always been for MAMA to provide those developing the Opera Web browser with a tool to quickly find live examples of markup and other Web page structural components. We at Opera believe this tool can also be useful to other stakeholders in the standards and browser-making world. For example:

  • Browser manufacturers and others can use MAMA data on the popularity of widely used technologies to prioritize bugs and justify adding support for new technology to in-progress releases.
  • Standards bodies can use the data to measure the success and adoption rates of various technologies.
  • Web developers can use the same data to justify support of various technologies in their work.
    It can provide real-world, practical samples of the Web developer's "art", for inspiration and instruction.
(article)

what do you think are ways to improve the URL set? Something I immediately thought of was that you don't need a population that is as large as possible to be representative of all the web, but rather a population that accounts for a significant part of web usage. In other words, to determine which technologies are popular, do you really need to know how many webmasters have used them? Isn't it more relevant to browser manufacturers to know how often browser users encounter these technologies? I was therefore a bit surprised that the country frequency list of MAMA didn't quite match data on the popularity of country code TLDs (even though admittedly that in itself doesn't indicate web site popularity either). I would think that at Opera - if you ignore other criteria for prioritizing bug fixes - bugs that can be witnessed at high-profile sites (such as CNN.com) have a higher priority than bugs that occur on some obscure pages nobody visits.
In the section about Surface vs. Deep URLs, you too seem to suggest that this is what you're really after:

Originally posted by Brian Wilson:

We are interested in an average user browsing experience, (...)
(article)


I haven't really thought about how you could create a URL set that takes web site popularity into account, but don't you think that in that case a (much) smaller set than the current MAMA set would suffice to cover most browser users' needs?
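To make the distinction between "how many sites use a technology" and "how often users encounter it" concrete, here is a minimal sketch. The site records, the visit counts and the feature name are all made up for illustration; nothing here comes from MAMA itself.

```python
def site_share(sites, feature):
    """Fraction of sites using `feature`; every site counts once."""
    return sum(feature in s["features"] for s in sites) / len(sites)


def visit_share(sites, feature):
    """Fraction of visits that land on a site using `feature`; sites are weighted by traffic."""
    total_visits = sum(s["visits"] for s in sites)
    hits = sum(s["visits"] for s in sites if feature in s["features"])
    return hits / total_visits


# Hypothetical data: one busy site uses the feature, two quiet sites do not.
sites = [
    {"visits": 900, "features": {"x"}},
    {"visits": 50, "features": set()},
    {"visits": 50, "features": set()},
]
print(site_share(sites, "x"), visit_share(sites, "x"))
```

With numbers like these, a feature used by a third of the sites can still account for most of what users actually see, which is the case for a popularity-weighted URL set.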

By the way, the MAMA project is VERY interesting and I loved reading your articles about it.

By blooberry, Tuesday, 28. October 2008, 20:09:28

Hi and thanks for your thoughts. You raise many great points. I held off responding in order to organize my thoughts a bit on the issue (not that I haven't already been thinking about this a lot :wink: )

Originally posted by scipio:

Given the goal of MAMA and the other uses that you've thought of:
{snip}
what do you think are ways to improve the URL set? Something I immediately thought of was that you don't need a population that is as large as possible to be representative of all the web, but rather a population that accounts for a significant part of web usage.

The first thing I want to do is even out the skew between surface URLs and deep URLs. The DMoz URL set is heavily skewed toward surface URLs, while most documents on the Web are not surface pages. The rough plan I have for that is to look for deep URLs when analyzing a surface URL, adding them to a queue of URLs still to be analyzed. MAMA already looks at all hyperlinks in a document, so this should not be much of a stretch to do. Deep URLs would be chosen at random from the available candidates. After a static URL set is analyzed, with deep URL candidates vetted and added where possible, post-processing of the list will determine where the set is weak on surface URLs (domains with only deep URLs represented) and add those surface URLs to the queue to be analyzed. I think a good goal would be at least a 1:1 ratio of surface to deep URLs, but an even better next step would be a 2:1 ratio of deep to surface URLs.
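The plan above could be sketched roughly as follows. This is only an illustration, not MAMA's code: the function names are hypothetical, and the test for "surface" (no path beyond "/" and no query string) is an assumption that may differ from the definition used in the article.

```python
import random
from urllib.parse import urlparse


def is_surface(url):
    """Assumed definition: a surface URL has no path beyond '/' and no query."""
    parts = urlparse(url)
    return parts.path in ("", "/") and not parts.query


def pick_deep_urls(page_links, per_page=2, rng=None):
    """Randomly choose up to `per_page` deep URLs from the links found on one page."""
    rng = rng or random.Random()
    candidates = sorted({u for u in page_links if not is_surface(u)})
    return rng.sample(candidates, min(per_page, len(candidates)))


def missing_surface_urls(url_set):
    """Post-processing: surface URLs for hosts represented only by deep URLs."""
    surface_hosts = {urlparse(u).netloc for u in url_set if is_surface(u)}
    deep_hosts = {urlparse(u).netloc for u in url_set if not is_surface(u)}
    return sorted(f"http://{host}/" for host in deep_hosts - surface_hosts)
```

Raising `per_page` to 2 per analyzed surface URL is one simple way to push the set toward the 2:1 deep-to-surface ratio, although pages without usable deep links will pull the real ratio lower.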

DMoz is a great URL set starting point, although MAMA is already bigger than that. I'm trying to be very egalitarian in URL acquisition. Quantcast has a significant URL list available. Alexa has URLs available for a fee. Opera's MINI platform may have URLs to add as well. MAMA itself also stored lists of external domains it encountered that weren't in its database already, and those could be used. Until infrastructure issues are worked out, the size of the set will be the least of MAMA's problems. :smile:

Originally posted by scipio:

In other words, to determine which technologies are popular, do you really need to know how many webmasters have used them? Isn't it more relevant to browser manufacturers to know how often browser users encounter these technologies? I was therefore a bit surprised that the country frequency list of MAMA didn't quite match data on the popularity of country code TLDs (even though admittedly that in itself doesn't indicate web site popularity either). I would think that at Opera - if you ignore other criteria for prioritizing bug fixes - bugs that can be witnessed at high profile sites (such as for example CNN.com) have a higher priority than bugs that occur on some obscure pages nobody visits.

Geographical location is just one measure, and it is one I focused on occasionally in the writeup, but there are definitely others that can be used. These are all constraints that narrow the set in some way, so the overall URL set needs to be large enough to encompass as many of these constraints as we want to use and still give useful results under any given constraint. Popularity is another measure, and a really useful one. For this research I didn't have an easy way for MAMA to incorporate popularity, although something like Alexa or Google PageRank would do. URLs accessed through Opera MINI may be usable too, but I haven't talked to those guys about it and I don't know how any privacy concerns might impact a choice like that.

One constraint that I abandoned early on can still be added to MAMA - categorization. Among the DMoz metadata for URLs is category information. Unfortunately, with a set this large, you would have a hard (read: impossible) time extending the categorization manually without HUGE effort, so it would be limited to the DMoz set. I want MAMA to be able to say "how popular is X in Australia?", but also "how popular is X among the top Y sites? The Y*Z sites?"...or even "how popular is X among shopping sites?" Different information consumers will want to know different things, even combining some of those factors. Each constraint narrows the set, and we want the result set to be big enough to be interesting and persuasive - oh, and to be as accurate as possible! :wink:
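As a rough illustration of how such constraints stack, here is a small sketch. The record layout and field names are hypothetical and do not reflect MAMA's actual database. The function returns the sample size along with the share, since every added constraint shrinks the set the answer is based on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UrlRecord:
    url: str
    country: str
    category: str
    rank: Optional[int]  # popularity rank; None when unknown
    features: set = field(default_factory=set)


def popularity(records, feature, country=None, category=None, top_n=None):
    """Return (share, sample_size) of records using `feature` under the given constraints."""
    subset = [
        r for r in records
        if (country is None or r.country == country)
        and (category is None or r.category == category)
        and (top_n is None or (r.rank is not None and r.rank <= top_n))
    ]
    if not subset:
        return None, 0  # no records left: no meaningful answer
    hits = sum(1 for r in subset if feature in r.features)
    return hits / len(subset), len(subset)
```

A call such as `popularity(records, "X", country="AU", category="shopping", top_n=1000)` combines three constraints at once; if the returned sample size is tiny, the share is not worth much, which is why the overall set has to be large.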

MAMA's current URL set skew can be blamed almost entirely on DMoz. If its country skew does not match the TLD popularity list, it can only be the responsibility of DMoz and its editors. Since DMoz is entirely human-filtered, this skew must be deliberate in some way (whether through sins of commission or omission, I don't know.) Efforts could be made in MAMA to further crawl and only add URLs that help balance that skew if that were deemed a worthwhile activity.

Originally posted by scipio:


In the section about Surface vs. Deep URLs, you too seem to suggest that this is what you're really after:

Originally posted by Brian Wilson:

We are interested in an average user browsing experience, (...)


I haven't really thought about how you could create a URL set that takes web site popularity into account, but don't you think that in that case a (much) smaller set than the current MAMA set would suffice to cover most browser users' needs?

We already have some smaller popularity classifications (top 1000 and some other small sets), but I did not constrain the URL set that way in the write-up. When you see the eventual size of this write-up, you'll thank me. =) But all this should definitely be possible to add.

Regarding the "average user browsing experience", you run into a number of factors that are not just popularity-based. Say a user browses to CNN's web site. That is a very popular web site, sure...but what about popular browsing flows? (I haven't seen any data on this, so forgive me.) One way to view an average user's experience could be as a series of website categories visited. Say a typical chain is represented by a browsing session 10 sites long. That chain might include 2 news sites, followed by a series of 4 blogs, 1 entertainment site, followed by a banking site, an online comic and a personals site. One could build statistical profiles based on what is encountered in the aggregate of the top 10 sites in each category, or even narrowed further by country. The variables tend to make the head swim. In sum, there are many ways to build an "average" experience, each one with some interesting validity.
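One way to turn the 10-site chain above into a number is to weight per-category statistics by how often each category appears in the chain. The sketch below assumes a usage rate per category has already been measured (for example over the top 10 sites in that category); the rates shown are placeholders, not real data.

```python
def chain_profile(chain, category_rates):
    """Average a per-category usage rate, weighted by each category's count in the chain."""
    total_sites = sum(chain.values())
    weighted = sum(count * category_rates[cat] for cat, count in chain.items())
    return weighted / total_sites


# The 10-site session described above.
chain = {
    "news": 2,
    "blogs": 4,
    "entertainment": 1,
    "banking": 1,
    "comics": 1,
    "personals": 1,
}

# Placeholder rates for some technology X in each category (made up).
rates = {
    "news": 0.5,
    "blogs": 0.25,
    "entertainment": 0.0,
    "banking": 0.0,
    "comics": 0.0,
    "personals": 0.0,
}

print(chain_profile(chain, rates))
```

A different chain, or rates narrowed by country, would give a different "average" from the same underlying data.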

So, I'm not sure if this loooong response has really answered your questions well enough to satisfy, but I can promise that the URL set does need some polishing, additions, and additional metadata, and that it is all definitely on the to-do list.

By scipio, Sunday, 2. November 2008, 22:05:04

Originally posted by blooberry:

So, I'm not sure in this loooong response if I've really answered your questions well enough to satisfy, but I can promise that the URL set does need some polishing, adding and additional metadata, and that it is all definitely on the to-do list.


Thanks for your reply. I guess your article invited readers (me at least) to think about the URL selection so I raised a few points I thought were relevant. Your response shows you've spent much more time thinking about it, so I can only say: keep up the good work. :smile:

By dantesoft, Tuesday, 27. January 2009, 18:10:14

Alexa offers a free download of the top 1,000,000 sites: http://www.alexa.com/site/ds/top_sites?ts_mode=global

By blooberry, Thursday, 29. January 2009, 19:42:25

Originally posted by dantesoft:

Alexa offers a free download with the top 1,000,000 sites

I'm actually already integrating that into the next crawl, but thanks for the suggestion...it is spot on! It looks like MAMA had already surveyed about 25% of those Alexa domains via the DMoz URL set (I was surprised that the number was so low).

One thing that this will also give MAMA for "free" is an easy way to rank URLs and domains. I just added an "Alexa ranking" field to one of the MAMA database tables yesterday. :smile: I'll be looking for interesting ways to look at URL data with this extra selection criterion. Ideas welcome!
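For anyone who wants to try a similar overlap check and ranking lookup on their own URL list, a minimal sketch follows. It assumes the ranking list is a CSV of "rank,domain" rows and that stripping a leading "www." is enough to match hosts to domains; both are assumptions, and the function names are made up.

```python
import csv
import io
from urllib.parse import urlparse


def load_ranks(csv_text):
    """Parse 'rank,domain' rows (assumed format) into a {domain: rank} dict."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[1].lower(): int(row[0]) for row in rows if row}


def normalize(url):
    """Reduce a URL to a bare host name, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def coverage(ranks, surveyed_urls):
    """Return (share of ranked domains already surveyed, {url: rank or None})."""
    if not ranks:
        return 0.0, {u: None for u in surveyed_urls}
    surveyed = {normalize(u) for u in surveyed_urls}
    covered = surveyed & ranks.keys()
    url_ranks = {u: ranks.get(normalize(u)) for u in surveyed_urls}
    return len(covered) / len(ranks), url_ranks
```

The second return value is the kind of per-URL ranking that could be stored in a ranking field, with `None` marking URLs whose domain is not in the list.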

By dantesoft, Thursday, 29. January 2009, 20:09:30

I'd assume that the most popular sites pull in more third-party content/advertisements and contain more outbound links than the general population. They'd also be more likely to employ user tracking, I think.
