By blooberry
Wednesday, 15. October 2008, 00:21:31
By scipio
Saturday, 18. October 2008, 21:04:53
Originally posted by Brian Wilson:
what do you think are ways to improve the URL set? Something I immediately thought of was that you don't need a population that is as large as possible to be representative of all the web, but rather a population that accounts for a significant part of web usage. In other words, to determine which technologies are popular, do you really need to know how many webmasters have used them? Isn't it more relevant to browser manufacturers to know how often browser users encounter these technologies? I was therefore a bit surprised that the country frequency list of MAMA didn't quite match data on the popularity of country code TLDs (even though admittedly that in itself doesn't indicate web site popularity either). I would think that at Opera - if you ignore other criteria for prioritizing bug fixes - bugs that can be witnessed at high profile sites (such as for example CNN.com) have a higher priority than bugs that occur on some obscure pages nobody visits.
The intent has always been for MAMA to provide those developing the Opera Web browser with a tool to quickly find live examples of markup and other Web page structural components. We at Opera believe this tool can also be useful to other stakeholders in the standards and browser-making world. For example:
(article)
- Browser manufacturers and others can use MAMA data on the popularity of widely used technologies to prioritize bugs and justify adding support for new technology to in-progress releases.
- Standards bodies can use the data to measure the success and adoption rates of various technologies.
- Web developers can use the same data to justify support of various technologies in their work.
It can provide real-world, practical samples of the Web developer's "art", for inspiration and instruction.
Originally posted by Brian Wilson:
We are interested in an average user browsing experience, (...)
(article)
By blooberry
Tuesday, 28. October 2008, 20:09:28
Originally posted by scipio:
The first thing I want to do is even the skew between surface URLs and deep URLs. The DMoz URL set has a high skew toward surface URLs and most documents on the Web are not surface pages. The rough plan I have for that is to look for deep URLs when analyzing a surface URL, adding them to a group of URLs still to be analyzed. MAMA already looks at all hyperlinks in a document, so this should not be much of a stretch to do. Deep URLs would be chosen at random from the available candidates. After a static URL set is analyzed, with deep URL candidates vetted and added where possible, post-processing of the list will determine where the set is weak on surface URLs (domains with only deep URLs represented), and add those to the queue to be analyzed. I think a good goal would be at least a 1:1 ratio of surface to deep URLs, but an even better next step would be a 2:1 ratio of deep to surface URLs.
Given the goal of MAMA and the other uses that you've thought of:
{snip}
what do you think are ways to improve the URL set? Something I immediately thought of was that you don't need a population that is as large as possible to be representative of all the web, but rather a population that accounts for a significant part of web usage.
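The deep-URL plan described above (random selection from a surface page's links, then a post-processing pass for domains that only have deep URLs) could look roughly like the sketch below. This is not MAMA's code: the function names, the "no path and no query" definition of a surface URL, the same-host restriction, and the `http://` scheme used for queued roots are all assumptions made for illustration.

```python
import random
from urllib.parse import urlsplit


def is_surface(url):
    """Assumed definition: a surface URL has no path beyond '/' and no query."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def pick_deep_urls(surface_url, links, per_page=2, rng=random):
    """Randomly choose same-host deep URLs from the links found on a surface page."""
    host = urlsplit(surface_url).netloc
    # De-duplicate and sort so the candidate order does not depend on set ordering.
    candidates = sorted(
        {u for u in links if urlsplit(u).netloc == host and not is_surface(u)}
    )
    return rng.sample(candidates, min(per_page, len(candidates)))


def missing_surface_urls(analyzed_urls):
    """Post-processing: hosts represented only by deep URLs get their root queued."""
    all_hosts = {urlsplit(u).netloc for u in analyzed_urls}
    hosts_with_surface = {urlsplit(u).netloc for u in analyzed_urls if is_surface(u)}
    # The http:// scheme here is an assumption; a real crawl would keep the original.
    return sorted(f"http://{h}/" for h in all_hosts - hosts_with_surface)
```

With `per_page=2`, each analyzed surface page contributes up to two deep URLs, which is one way to move toward the 2:1 deep-to-surface ratio mentioned above.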
Originally posted by scipio:
Geographical location is just one measure, and it is one I focused on occasionally in the writeup, but there are definitely others that can be used. These are just constraints that will narrow the set in some way, so the overall URL set needs to be large so as to encompass as many of these constraints as we want to use and still have useful results under any given constraint. Popularity is just one measure, and it is a really useful one. For this research I didn't have an easy solution to muster for MAMA to incorporate popularity, although something like Alexa or Google PageRank would do. URLs accessed through Opera MINI may be leverage-able too, but I haven't talked to those guys about it and I don't know how any privacy concerns might impact a choice like that.
In other words, to determine which technologies are popular, do you really need to know how many webmasters have used them? Isn't it more relevant to browser manufacturers to know how often browser users encounter these technologies? I was therefore a bit surprised that the country frequency list of MAMA didn't quite match data on the popularity of country code TLDs (even though admittedly that in itself doesn't indicate web site popularity either). I would think that at Opera - if you ignore other criteria for prioritizing bug fixes - bugs that can be witnessed at high profile sites (such as for example CNN.com) have a higher priority than bugs that occur on some obscure pages nobody visits.
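The difference between "how many webmasters have used a technology" and "how often browser users encounter it" can be made concrete with a small sketch. Everything here is hypothetical: the page records, the `visits` figures and the feature names are made-up toy values, and a real popularity signal would have to come from something like the sources mentioned above.

```python
def page_share(pages, feature):
    """Fraction of pages that use the feature (the 'how many webmasters' view)."""
    return sum(feature in p["features"] for p in pages) / len(pages)


def visit_weighted_share(pages, feature):
    """Fraction of visits that land on a page using the feature (the 'user' view)."""
    total = sum(p["visits"] for p in pages)
    hits = sum(p["visits"] for p in pages if feature in p["features"])
    return hits / total


# Toy data, invented for illustration only.
pages = [
    {"url": "http://popular.example/", "visits": 900, "features": {"flash"}},
    {"url": "http://small-a.example/", "visits": 50, "features": {"svg"}},
    {"url": "http://small-b.example/", "visits": 50, "features": {"svg"}},
]

print(page_share(pages, "svg"))            # two of the three pages use it
print(visit_weighted_share(pages, "svg"))  # but they receive 100 of 1000 visits
```

In this toy set the unweighted count makes the feature look common while the visit-weighted count makes it look rare, which is the gap the question above is pointing at.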
Originally posted by scipio:
We already have some smaller popularity classifications (top 1000 and some other small sets), but I did not constrain the URL set that way in the write up. When you see the eventual size of this write-up, you'll thank me. =) But all this should definitely be possible to add.
In the section about Surface vs. Deep URLs, you too seem to suggest that this is what you're really after:
Originally posted by Brian Wilson:
We are interested in an average user browsing experience, (...)
I haven't really thought about how you could create a URL set that takes web site popularity into account, but don't you think that in that case a (much) smaller set than the current MAMA set would suffice to cover most browser users' needs?
By scipio
Sunday, 2. November 2008, 22:05:04
Originally posted by blooberry:
So, I'm not sure in this loooong response if I've really answered your questions well enough to satisfy, but I can promise that the URL set does need some polishing, adding and additional metadata, and that it is all definitely on the to-do list.
By dantesoft
Tuesday, 27. January 2009, 18:10:14
By blooberry
Thursday, 29. January 2009, 19:42:25
Originally posted by dantesoft:
I'm actually already integrating that into the next crawl, but thanks for the suggestion...it is spot on! It looks like MAMA had already surveyed about 25% of those Alexa domains via the DMoz URL set (I was surprised that the number was so low).
Alexa offers a free download with the top 1,000,000 sites
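A coverage figure like the "about 25%" above could be computed along these lines. This is only a sketch under assumptions: the ranked list is taken to be CSV rows of the form `rank,domain`, hosts are matched exactly after stripping a leading `www.` (so other subdomains would not count), and all function names are hypothetical.

```python
import csv
import io
from urllib.parse import urlsplit


def load_ranked_domains(csv_text):
    """Parse 'rank,domain' rows (assumed layout) into a set of lowercase domains."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[1].strip().lower() for row in reader if len(row) >= 2}


def host_of(url):
    """Lowercased host of a URL, with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def coverage(ranked_domains, surveyed_urls):
    """Fraction of ranked domains that have at least one surveyed URL."""
    if not ranked_domains:
        return 0.0
    surveyed_hosts = {host_of(u) for u in surveyed_urls}
    return len(ranked_domains & surveyed_hosts) / len(ranked_domains)
```

The ranked domains that are not covered (`ranked_domains - surveyed_hosts`) would be the natural candidates to add to the next crawl.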
By dantesoft
Thursday, 29. January 2009, 20:09:30