Slashdot
A Statistical Review of 1 Billion Web Pages
Posted by
ScuttleMonkey
on Wed Jan 25, 2006 03:41 PM
from the demanding-a-recount dept.
chrisd writes "As part of a recent examination of the most popular HTML authoring techniques, my colleague Ian Hickson parsed through a billion web pages from the Google repository to find out what are the most popular class names, elements, attributes, and related metadata. We decided that publishing this would be of significant utility to developers. It's also a fascinating look into how people create web pages. For instance, one thing that surprised me was that the <title> is more popular than <br>. The graphs in the report require a browser with SVG and CSS support (like Firefox 1.5!). Enjoy!"
I clicked I'm Feeling Lucky on this article (Score:2, Funny)
Sheesh.
We've come a long way (Score:4, Funny)
Blink (Score:5, Funny)
Re:Blink (Score:4, Funny)
I must have blinked, I didn't see it the first time.
Re:Blink (Score:3, Funny)
Re:Blink (Score:5, Funny)
Every other usage just caused me to browse elsewhere.
Re:We've come a long way (Score:2)
Re:We've come a long way (Score:2)
I'm guessing one reason for its decreased use is that a lot of browsers refuse to honor that tag.... On the other hand, most browsers still honor the property in CSS. :-D
is more popular than (Score:5, Funny)
Re: is more popular than (Score:2, Funny)
Hooray! I've never been so happy to see a period!
Re: is more popular than (Score:5, Funny)
Re: is more popular than (Score:5, Funny)
The reason not to do this (Score:5, Informative)
i helped my uncle jack off a horse
Finally... (Score:5, Funny)
Re:Pretty crappy page authoring... (Score:3, Insightful)
It's explicitly mentioned on the very first page ("Note: You will need a browser with SVG and CSS support to view the result graphs correctly. We recommend Firefox 1.5.").
Re:Pretty crappy page authoring... (Score:3, Insightful)
Gecko fascism indeed, I mean what a bunch of bastards, using completely valid SVG files [w3.org], oooh the nerve of them blokes...
BR tag? (Score:5, Insightful)
Re:BR tag? (Score:4, Interesting)
Small stat? are you joking?
This is about the number of sites that use the tag, not the number of tags out in the wild. <br> is used on more pages than <table>, and there are as many pages with at least one <br> as pages with at least one <img> tag.
That's freaking huge, for a tag that should almost never be used.
Re:BR tag? (Score:3, Insightful)
The <br> element type is kept around for a few minority uses. Things like poetry, code listings, etc, where dividing something up into lines is necessary. These things are rare, which is why masklinn said "should almost never be used" and not "should never be used".
Yes, and if you take into account the idea that most pages that use the <br> element type do so in precisely this
Re:BR tag? (Score:3, Insightful)
My site is XHTML, so the closing tag is required (not that that's stopping me).
Re:BR tag? is used in 7 out of 8 pages (Score:4, Informative)
The study states that there are more pages using title than pages using br, NOT that more title tags are used than br tags.
Approximately 98% of all pages have a title tag and approximately 7 out of 8 pages have (at least one, probably more) br tags.
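To make the distinction concrete, here is a minimal sketch of the two ways of counting. It is not the study's actual tooling; the TagCollector class, the count_tags function, and the two sample pages are made up for illustration, and only Python's standard html.parser module is assumed.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the name of every start tag seen in one document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def count_tags(pages):
    """Return (pages_using, total_occurrences) for a list of HTML strings."""
    pages_using = Counter()        # how many pages contain the tag at least once
    total_occurrences = Counter()  # how many times the tag appears overall
    for html in pages:
        collector = TagCollector()
        collector.feed(html)
        collector.close()
        total_occurrences.update(collector.tags)
        pages_using.update(set(collector.tags))  # each page counts once per tag
    return pages_using, total_occurrences


# Made-up sample pages, for illustration only.
SAMPLE_PAGES = [
    "<html><head><title>One</title></head><body>a<br>b<br>c<br>d</body></html>",
    "<html><head><title>Two</title></head><body><p>no line breaks</p></body></html>",
]

pages_using, total_occurrences = count_tags(SAMPLE_PAGES)
print("pages using:", pages_using["title"], "title,", pages_using["br"], "br")
print("occurrences:", total_occurrences["title"], "title,", total_occurrences["br"], "br")
```

In this toy sample title appears on more pages than br, even though there are more br tags than title tags in total, which is the situation the comment describes.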
Not complete (Score:5, Funny)
what's the point of a 1 billion page sample? (Score:3, Interesting)
Aside from the cool factor of saying they sampled a billion pages, I don't see what extra benefits are gained from that extra effort.
Re:what's the point of a 1 billion page sample? (Score:5, Informative)
Re:what's the point of a 1 billion page sample? (Score:5, Insightful)
Re:what's the point of a 1 billion page sample? (Score:3, Informative)
If you start with a sample size of 1000 and add an additional 10000, the accuracy will increase dramatically. But if you start with 1,000,000,000, and increase it by another 1,000,000,000, the accuracy won't go up even by as much as 0.0001%
Yes, I'm pulling the numbers out of the air, but the point is that there exists a sweet spot where the additional effort does not pay off.
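The diminishing returns can be put into rough numbers with the textbook margin of error for a proportion, z * sqrt(p * (1 - p) / n). This is only a sketch under assumptions the thread does not confirm: it treats the pages as a simple random sample, uses the worst case p = 0.5, and uses a 95% confidence level (z = 1.96). The margin_of_error function is a name made up for this example.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p estimated from n samples.

    Assumes a simple random sample; p=0.5 is the worst case and z=1.96
    corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# The sample sizes mentioned in the comment above.
for n in (1_000, 11_000, 1_000_000_000, 2_000_000_000):
    print(f"{n:>13,d} pages: +/- {100 * margin_of_error(n):.5f} percentage points")
```

Under these assumptions, going from 1,000 to 11,000 pages cuts the margin from roughly 3 percentage points to under 1, while going from one billion to two billion changes it by less than a thousandth of a percentage point. The formula says nothing about rare items, though: a class name or attribute that appears on only a tiny fraction of pages may not show up at all in a small sample.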
Re:what's the point of a 1 billion page sample? (Score:3, Interesting)
A couple of people have pointed out that the larger the sample size, the less chance there is to attribute a meaningful difference to a situation that is actually a random fluctuation. That may be true, but I believe the point the parent is trying to make is that one of the key advantages of statistical modeling is that you can accurately model very large groups by studying very small samples of that group. If there was actually a need for this large a sample, then fine. Otherwise, the sample size is more s
dude (Score:2)
\. shows up in the Web Authoring Statistics (Score:5, Funny)
The br element is a simple one, yet used on so many pages that it is the 8th most-used element. It is used more than the p element.
clear, style, class, soft, id, and \.
Wow! I never knew you guys were that popular.
Re:\. shows up in the Web Authoring Statistics (Score:5, Funny)
(sheesh)
Best bash I've seen in a long time: (Score:5, Funny)
Some of these results... (Score:4, Insightful)
Prove that most people (and WYSIWYGs) don't know how to produce valid and accessible markup. The img alt attribute (an accessibility requirement) was found significantly less than width, height, and border.
I'm working on a site now where the project owner is continually reducing usability and accessibility of the entire site (Never mind that he secretly had a third party come up with an ugly design and ambushed the dev team with it).
I keep telling everyone to deconstruct the adage "form follows function". It means function comes first. He doesn't care what anything *is* or how it *works*, only what it looks like. And, of course, that it's ugly.
Ad for anti-IE (Score:5, Insightful)
Way to go Google! Pour on the pressure!
Re:Ad for anti-IE (Score:4, Informative)
I don't believe Ian Hickson has been involved with Firefox; if I remember correctly, he used to hack on Mozilla, but then started work at Opera before Firefox took off.
I don't think it's a jab at Internet Explorer, it's just that he knows that the target audience is likely to have a decent browser, so he's used the features likely to be available.
Opera also supports SVG (Score:5, Informative)
Heh (Score:4, Interesting)
I wonder how much of what they found is influenced by how people learned to write HTML - which in all likelihood was to copy code from existing pages... might explain parts of what they found, such as:
Font still popular (Score:3, Interesting)
Of course, there may have been a lot of old pages in the sample, or pages built with older versions of HTML. But I've seen first-hand people using font tags to make an error message red, for example, even in a page that's using XHTML 1.0. I try to explain to the developers I work with why they shouldn't use them. I remove the font tags when those same developers add them to pages I've laid out for them. Zombie-like, they refuse to die.
table with no (Score:5, Informative)
Your code usually goes like this:
So it is quite easy to get the empty table if the collection is empty.
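The poster's original snippet did not survive, so the following is only a guess at the pattern being described: a page template that opens the table, loops over a collection to emit rows, and closes the table. The render_table function and the sample rows are hypothetical.

```python
from html import escape


def render_table(rows):
    """Render rows (an iterable of tuples) as an HTML table string."""
    parts = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")  # emitted even when the loop never ran
    return "".join(parts)


print(render_table([("apples", 3), ("pears", 5)]))
print(render_table([]))  # an empty collection yields a table with no rows
```

Because the opening and closing tags are written outside the loop, an empty collection produces a table element with nothing inside it.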
What about plugins? (Score:3, Insightful)
For folks who do not (want to) run Firefox (Score:4, Informative)
Note that I have had an SVG plugin installed for ages, yet Safari didn't display the graphs. Is it because I am not using "a browser with CSS"? Well, nevermind really...
This is the thing why I and others have negative views against firefox, svg and even
Wisdom (Score:3, Interesting)
There are several statistics they quoted which I have suspected for a long time, but only now can confirm with numbers.
I can't begin to describe the frustration I feel when I'm forced to use Internet Explorer and clicking links causes pages to fire up in a million new windows. Whether or not a link opens in a new window, a new tab, or the current window/tab really should be a client-side choice. Webmasters think they're being helpful by letting you separate your workspace into many windows, but they're really just slowing people down. Thank God for Firefox.
This makes perfect sense. While colors, fonts and styles are pretty much standard in a cross-browser environment, the many differing interpretations of the CSS Box Model make coding layout purely in CSS a terrible chore. It's usually much quicker to do a few simple layouts in tables (header, sidebar, content) and use CSS for pretty much everything else.
Set-Cookie2 insecure? (Score:3, Interesting)
Re:Set-Cookie2 insecure? (Score:3, Informative)
Fix for Firefox 1.5 (Score:3, Informative)
Apparently there's a problem in Firefox 1.5 regarding SVG images if you
had SVG in the registry. Try following the steps described here:
https://bugzilla.mozilla.org/show_bug.cgi?id=3035
I'm feeling violated (Score:3, Insightful)
Re:No GOTOs? (Score:5, Funny)
IF(Post=Old_And_Tired) GOTO Mod_Down
Re:Benford's Law (Score:4, Interesting)
You see, my hard drive crashed about two weeks ago. It had three partitions on it, and two of them are still perfectly readable. The third is pretty well shot. (Fortunately, it was the most useless partition; its main content was Windows itself. This does mean ANOTHER Windows installation -- after having to do one a few weeks before -- but really that's no biggie compared with my actual data. And while I'm on that subject, I had two hard drives; when I got the newer one, I put all my work stuff on it as well as a new Linux installation specifically because it was less likely to fail, and I look back at that decision now with great happiness, because it is that foresight that has made this no big deal at all.)
I've been trying to recover data off of the third partition, and it seems that if you do a full scan of the partition it appears as if the data was just deleted. Most of the time it's able to recover information, but not always: folder names are often lost. They show up in the recovery programs I tried as just Folder2393 for example. (Numbers ranged from 2 to 5 digits.)
The folder numbers approximately follow Benford's law.
Here is the approximate distribution:
Most significant digit   % of folders   Ideal Benford %
1                        32             30.1
2                        15             17.6
3                        12             12.5
4                        12              9.7
5                        19              7.9
6                         3              6.7
7                         3              5.8
8                         2              5.1
9                         2              4.6
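For reference, the "ideal" column follows from Benford's law, which gives the probability of leading digit d as log10(1 + 1/d). A short sketch that recomputes that column and prints it next to the observed percentages from the table; the names benford_percent and OBSERVED_PERCENT are made up for this example.

```python
import math


def benford_percent(digit):
    """Expected share (in percent) of numbers whose leading digit is `digit`."""
    return 100.0 * math.log10(1.0 + 1.0 / digit)


# Observed folder percentages copied from the table above.
OBSERVED_PERCENT = {1: 32, 2: 15, 3: 12, 4: 12, 5: 19, 6: 3, 7: 3, 8: 2, 9: 2}

for digit in range(1, 10):
    expected = benford_percent(digit)
    observed = OBSERVED_PERCENT[digit]
    print(f"{digit}  observed {observed:>2d}%  expected {expected:4.1f}%  "
          f"difference {observed - expected:+5.1f}")
```

The largest gap is at digit 5, where the observed 19% is well above the expected 7.9%.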
Re:Dumb (Score:5, Insightful)
Re:Worst use of SVG ever (Score:3, Funny)
Re:Poor style by Google (Score:3, Insightful)
I don't think that's the point
If you view code of one of the graphs http://code.google.com/webst [google.com]