A Statistical Review of 1 Billion Web Pages

chrisd writes "As part of a recent examination of the most popular HTML authoring techniques, my colleague Ian Hickson parsed through a billion web pages from the Google repository to find out which class names, elements, attributes, and related metadata are the most popular. We decided that publishing this would be of significant utility to developers. It's also a fascinating look into how people create web pages. For instance, one thing that surprised me was that the <title> is more popular than <br>. The graphs in the report require a browser with SVG and CSS support (like Firefox 1.5!). Enjoy!"
  • and all I got was Britney Spears.

    Sheesh.
  • by suso ( 153703 ) * on Wednesday January 25, 2006 @03:42PM (#14561604) Journal
    if the tag isn't on the top elements list.
  • by InsideTheAsylum ( 836659 ) on Wednesday January 25, 2006 @03:46PM (#14561636)
    well when people talk like this and dont bother using punctuation spacekeys or any of the skills that they have been taught in school its no wonder why webpages turn out like this not to mention those long runon sentences and also all that broken code that are the fist attempt at a webpage by a twelve year old kid who tried to steal someone elses layout and replaced the word with his own then you start to look at all of those dynamically generated webpages and the layouts and the style sheets and its no wonder why the good old br tag never get a work out.
  • Finally... (Score:5, Funny)

    by RandoX ( 828285 ) on Wednesday January 25, 2006 @03:46PM (#14561637)
    An un-slashdottable server.
  • BR tag? (Score:5, Insightful)

    by p0 ( 740290 ) on Wednesday January 25, 2006 @03:46PM (#14561638)
    With CSS you really do not need to use br; maybe that is the reason for the tag's relatively small usage stats?
    • Re:BR tag? (Score:4, Interesting)

      by masklinn ( 823351 ) <<ten.nnilksam> <ta> <gro.todhsals>> on Wednesday January 25, 2006 @03:53PM (#14561711)

      Small stats? Are you joking?

      This is about the number of pages that use the tag, not the number of tags out in the wild. <br> is used on more pages than <table>, and there are as many pages with at least one <br> as there are pages with at least one <img> tag.

      That's freaking huge, for a tag that should almost never be used.

      • a tag that should almost never be used. I don't understand what you're basing that opinion on, as the BR tag is not deprecated in either the HTML 4.x or XHTML 1.x standards. Demanding a line break at a particular location is perfectly cromulent syntactic markup. (Actually, it's more of a suggestion than a demand; non-page-based devices will quietly ignore the tag, should anyone ever develop a practical non-page-based device for the web.) What SHOULD never happen, I think, is for BR to be treated as a substitute for proper block-level delineation.
        • Re:BR tag? (Score:3, Insightful)

          by Bogtha ( 906264 )

          The <br> element type is kept around for a few minority uses. Things like poetry, code listings, etc, where dividing something up into lines is necessary. These things are rare, which is why masklinn said "should almost never be used" and not "should never be used".

          What SHOULD never happen, I think, is for BR to be treated as a substitute for proper block-level delineation.

          Yes, and if you take into account the idea that most pages that use the <br> element type do so in precisely this

      • I disagree. <br> is extremely useful when you want to allow your users to enter certain HTML tags without allowing them to launch XSS attacks.

        For that matter, <br> is useful when users enter a combination of text and HTML. Putting a BR where the newline was preserves the formatting of the text as the user entered it (for example, see the HTML of this Slashdot post: I'm entering it as plain old text and placed no BR tags in it). A tag like <pre> may be better for that, though.
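
        For the curious, here is a rough sketch of that approach in Python. The whitelist and function name are made up for illustration, not Slashdot's actual filter: escape everything first, then re-allow a few harmless tags and turn newlines into <br>.

        import html
        import re

        ALLOWED_TAGS = {"b", "i", "em", "strong"}  # hypothetical whitelist

        def render_user_text(text):
            """Escape user input, then re-allow whitelisted tags and
            convert newlines to <br> so the original line breaks survive."""
            escaped = html.escape(text)
            # html.escape turned <b> into &lt;b&gt;; restore whitelisted tags only.
            for tag in ALLOWED_TAGS:
                escaped = re.sub(rf"&lt;(/?){tag}&gt;", rf"<\1{tag}>", escaped)
            # Preserve the user's line breaks.
            return escaped.replace("\n", "<br>\n")

        print(render_user_text("hi <script>alert(1)</script>\n<b>bold</b>"))
        # hi &lt;script&gt;alert(1)&lt;/script&gt;<br>
        # <b>bold</b>
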
        • Of course with <pre> you get those asshole side-scroller trolls (scroll-trolls?).

          -l

    • It's weird; a lot of this study seems to ignore CSS where it's fairly obvious that's what's going on.

      You're right about BR. It's just about useless these days.

      Look at this sentence from the 'HTTP Headers' section:

      There are pages that use the Window-Target header, and even some that use the Link header (though we haven't yet checked what for!). There are even some pages that include the Content-Style-Type header.

      Excuse me? The link header is for including stylesheets (among other uses). The fact th

      • Euh... who is being thick now? That page is about HTTP Headers, not elements you find in the <header> element of an HTML page. Specifically, the Link HTTP header mentioned refers to Section 19.6.2.4 of RFC 2068.

        And yeah... it ignored CSS. It's looking at page elements in order to help out the WHAT folks.
      • the link header is for including stylesheets (among other uses).

        This kind of misunderstanding is why people should learn the proper names for things. The study is referring to the Link HTTP header. You are referring to the <link> HTML element type. Headers are not element types, even if most people call both of them "tags".

        Using the Link HTTP header for stylesheets is not practical because most browsers don't support it and those that do only added support recently.
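
        For reference, this is roughly what serving a stylesheet via the Link HTTP header would look like. The header syntax comes from RFC 2068; the tiny server below is just an illustration (whether any given browser honors it is another matter).

        from http.server import BaseHTTPRequestHandler, HTTPServer

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = b"<p>Styled via an HTTP header, not a link element.</p>"
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                # The Link header (RFC 2068, section 19.6.2.4):
                self.send_header("Link", '<style.css>; rel="stylesheet"')
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

        if __name__ == "__main__":
            HTTPServer(("localhost", 8000), Handler).serve_forever()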

    • by TekGoNos ( 748138 ) on Wednesday January 25, 2006 @06:14PM (#14562967) Journal
      The summary got it wrong: the study states that there are more pages using title than pages using br, NOT that more title tags are used than br tags.

      Approximately 98% of all pages have a title tag, and approximately 7 out of 8 pages have (at least one, probably more) br tags.
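
      The difference is easy to see in code; a toy sketch with two made-up pages:

      import re

      pages = [
          "<html><head><title>a</title></head><body>x<br>y<br>z<br></body></html>",
          "<html><head><title>b</title></head><body>plain text</body></html>",
      ]

      def pages_using(tag):
          # The study's metric: pages containing the tag at least once.
          return sum(1 for p in pages if re.search(rf"<{tag}\b", p))

      def total_occurrences(tag):
          # The metric the summary implied: raw tag count.
          return sum(len(re.findall(rf"<{tag}\b", p)) for p in pages)

      print(pages_using("title"), pages_using("br"))              # 2 1 -> title "wins"
      print(total_occurrences("title"), total_occurrences("br"))  # 2 3 -> br "wins"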
  • by Anonymous Coward on Wednesday January 25, 2006 @03:47PM (#14561653)
    It didn't have everything of course. Some elements were censored on behalf of the Chinese government.
  • by ecklesweb ( 713901 ) on Wednesday January 25, 2006 @03:49PM (#14561674)
    I have to ask, what's the purpose of a 1-BILLION page sample? That's the beautiful thing about statistics. If you can say something about the distribution of characteristics within a population, you don't have to survey the entire population to get meaningful results. Are the study authors proposing that no standard distribution can be applied to the entire universe of web pages? If that's the case, then do the statistics they apply to their sample of one billion really say anything predictive about the entire population?

    Aside from the cool factor of saying they sampled a billion pages, I don't see what extra benefits are gained from that extra effort.
    • Well, I'm guessing that 1000 was too small of a sample :-)

    • by Anonymous Coward on Wednesday January 25, 2006 @03:53PM (#14561708)
      You get a decrease in the variance of the mean.
    • by Durinthal ( 791855 ) on Wednesday January 25, 2006 @03:55PM (#14561728)
      If you can have a larger sample, why not use it? It's more accurate that way.
      • Because with statistics, increasing the sample size does not result in a uniform increase in accuracy.

        If you start with a sample size of 1000 and add an additional 10000, the accuracy will increase dramatically. But if you start with 1,000,000,000, and increase it by another 1,000,000,000, the accuracy won't go up even by as much as 0.0001%

        Yes, I'm pulling the numbers out of the air, but the point is that there exists a sweet spot where the additional effort does not pay off.
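
        A rough illustration of that sweet spot, using the standard error of an estimated proportion (this assumes simple random sampling, and the numbers are illustrative, not from the study):

        import math

        def margin_of_error(p, n):
            # 95% margin of error for a proportion p estimated from n samples.
            return 1.96 * math.sqrt(p * (1 - p) / n)

        p = 0.5  # worst case for a proportion
        for n in (1_000, 1_000_000, 1_000_000_000):
            print(f"n={n:>13,}  margin of error ~ {margin_of_error(p, n):.4%}")
        # n=        1,000  margin of error ~ 3.0990%
        # n=    1,000,000  margin of error ~ 0.0980%
        # n=1,000,000,000  margin of error ~ 0.0031%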
      • If you can have a larger sample, why not use it? It's more accurate that way.

        Because there's a point of diminishing returns.

        If a 1-million-page sample gives you 85% accuracy, and a 2-million-page sample gives you 95% accuracy, it may be worth the extra time and effort to process the 2-million-page sample. But if reaching 96% accuracy requires you to process 1 BILLION pages, it's probably not worth the time or the effort.
        • But if reaching 96% accuracy requires you to process 1 BILLION pages, it's probably not worth the time or the effort.

          You're assuming it took significantly more work. They just pulled all of these from Google's cache, so the extra work may have been letting their script run overnight instead of for an hour in the morning. More pages make it more accurate, and I'm sure they are better qualified to judge the work/reward tradeoff than anyone not doing the project.

    • I doubt Google was doing that just for the purposes of data gathering, though.

      Imagine - they were able to scale the system to process 1 BILLION webpages. That is a significant achievement, which means that somewhere in Google, they have the ability to not only gather and sort/search a lot of data, but also derive meaning from it (statistical or otherwise).

      That is a significant achievement.

      Data by itself becomes fairly pointless after a while, however finding relations and meaning within that data is what ma
        Obviously I don't know whether Google performed more (or more sophisticated) analysis on the billion pages. But if it's simply calculating sums and means, it's more a matter of time than sexy algorithms.

        I mean, just distribute the counting over processors - this problem seems trivially parallel.

        But of course, I don't work for Google, so who knows what those wizards are doing with the stats!
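
        A toy version of that trivially parallel counting; the multiprocessing pool here is just a stand-in for whatever Google actually distributes this across:

        import re
        from collections import Counter
        from multiprocessing import Pool

        TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")

        def count_tags(page_html):
            # Map step: which element names appear on this page (page counts,
            # not raw occurrences, to match the study).
            return Counter(set(t.lower() for t in TAG_RE.findall(page_html)))

        def merge(counters):
            # Reduce step: sum the per-page counters.
            total = Counter()
            for c in counters:
                total += c
            return total

        if __name__ == "__main__":
            pages = ["<html><body><br><br></body></html>",
                     "<HTML><BODY><TABLE></TABLE></BODY></HTML>"]
            with Pool() as pool:
                print(merge(pool.map(count_tags, pages)))
            # counts: html 2, body 2, br 1, table 1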

          I would think that tokenising and statistically analyzing such data would not be a trivial task for that large a sample.

          Then again, maybe someone from Google could tell us? (Chris?)
          • So I'd like to preface my response to your question by saying that I don't want to sound like we are showing off here, but Google has invested a lot of time and resources into making this kind of thing somewhat trivial to do. From a computer science and cpu-time perspective, not so trivial, but we do have the available spare facilities to do this kind of thing as we like within a reasonable amount of time.

            Chris

    • A billion is way beyond cool. Do you even understand how much a billion is? For a billion dollars you could buy your own small country; a billion bricks would build an unbelievably big tower. And so on.

      But that billion is the most interesting thing here. The other part is just statistics that are fun, nothing more.
    • I agree with the people who said that basically a billion sounds cool. I suppose you could use a million, but that would not be as cool as a billion. A billion is the new million. A company that is as media savvy as Google understands this.
    • I thought Google had like 10 billion pages archived?

      That would be 10%, which is still pretty large, I guess.

      IANAStatistician, and I never understood how a confidence interval isn't tied to the population size...

      Too weird.
    • A couple of people have pointed out that the larger the sample size, the less chance there is to attribute a meaningful difference to a situation that is actually a random fluctuation. That may be true, but I believe the point the parent is trying to make is that one of the key advantages of statistical modeling is that you can accurately model very large groups by studying very small samples of that group. If there was actually a need for this large a sample, then fine. Otherwise, the sample size is more s

    • by xant ( 99438 )
      So far everyone who has replied to you has ignored one thing. A thousand may be fine for seeing a simple "A or B" statistical difference at significant levels; with ANOVA you can even track a few different significant traits.

      The number of traits they were trying to discover was unknown at the start; furthermore, they expected it to be very high. Lots of different HTML tags in the standards, but even more nonstandard tags, nonstandard attributes; they even found information about how different attributes a
  • I am still at the 22nd page, lot more to go (1 billion? OMG!).. see you all there
  • by digitaldc ( 879047 ) * on Wednesday January 25, 2006 @03:52PM (#14561701)
    The 'br' element [google.com]

    The br element is a simple one, yet used on so many pages that it is the 8th most-used element. It is used more than the p element.

    clear, style, class, soft, id, and \.


    Wow! I never knew you guys were that popular.
  • Not just non-evil. This is useful and interesting stuff.
  • by Benanov ( 583592 ) <brian,kemp&member,fsf,org> on Wednesday January 25, 2006 @03:53PM (#14561710) Journal
    From TFA, the classes page:

    The rest of the top 20 classes are either presentational or otherwise meaningless (msonormal, for example, which is one of the classes that Microsoft Office uses in its "HTML" output).
  • > As part of a recent examination of the most popular html authoring techniques, my colleague Ian Hickson parsed through a billion web pages from the Google repository to find out what are the most popular class names, elements, attributes, and related metadata.

    "Unfortunately, it was also of significant interest to the DOJ, who wanted to know how many times the word 'boobs' appeared in the first 50 characters after the string "IMG SRC". Because we didn't actually look for this data, and because th

  • by Dracos ( 107777 ) on Wednesday January 25, 2006 @03:55PM (#14561727)

    Prove that most people (and WYSIWYGs) don't know how to produce valid and accessible markup. The img alt attribute (an accessibility requirement) was found significantly less often than width, height, and border.

    I'm working on a site now where the project owner is continually reducing the usability and accessibility of the entire site (never mind that he secretly had a third party come up with an ugly design and ambushed the dev team with it).

    I keep telling everyone to deconstruct the adage "form follows function". It means function comes first. He doesn't care what anything *is* or how it *works*, only what it looks like. And, of course, that it's ugly.

  • So I'm using Debian sarge (yes, flame away about dozens of other distros), but currently I'm too lazy[1] to upgrade to etch or anything else. And sarge ships Firefox 1.0.4, without SVG. Does anyone know of backported debs for sarge that provide SVG support?

    [1] Everything is about priorities: I spend some time reading /., but in fact I have some work to do, and that work is not switching Linux distros around.
  • Ad for anti-IE (Score:5, Insightful)

    by jamienk ( 62492 ) on Wednesday January 25, 2006 @04:01PM (#14561779)
    It looks like a subtle push against IE: many mentions of the HTML 5 spec (which is being written by the WHAT WG, a working group that includes many browser companies but not MS); use of SVG; written by a major FF developer.

    Way to go Google! Pour on the pressure!
    • Re:Ad for anti-IE (Score:4, Informative)

      by Bogtha ( 906264 ) on Wednesday January 25, 2006 @04:48PM (#14562251)

      written by a major FF developer

      I don't believe Ian Hickson has been involved with Firefox; if I remember correctly, he used to hack on Mozilla, but then started work at Opera before Firefox took off.

      I don't think it's a jab at Internet Explorer, it's just that he knows that the target audience is likely to have a decent browser, so he's used the features likely to be available.

  • I'm curious to see how closely Benford's Law [wikipedia.org] is followed by these pages. It should be easy for Google to run the stats.
    • Re:Benford's Law (Score:4, Interesting)

      by EvanED ( 569694 ) <evaned@g[ ]l.com ['mai' in gap]> on Wednesday January 25, 2006 @04:38PM (#14562140)
      I had an interesting run-in with Benford's law a bit ago. I had this typed up already, so here goes (description of the law omitted; read the Wikipedia link in the parent -- it's really cool):

      You see, my hard drive crashed about two weeks ago. It had three partitions on it, and two of them are still perfectly readable. The third is pretty well shot. (Fortunately, it was the most useless partition; its main content was Windows itself. This does mean ANOTHER Windows installation -- after having to do one a few weeks before -- but really that's no biggie compared with my actual data. And while I'm on that subject, I had two hard drives; when I got the newer one, I put all my work stuff on it as well as a new Linux installation specifically because it was less likely to fail, and I look back at that decision now with great happiness, because it is that foresight that has made this no big deal at all.)

      I've been trying to recover data off of the third partition, and it seems that if you do a full scan of the partition it appears as if the data was just deleted. Most of the time it's able to recover information, but not always: folder names are often lost. They show up in the recovery programs I tried as just Folder2393 for example. (Numbers ranged from 2 to 5 digits.)

      The folder numbers approximately follow Benford's law.

      Here is the approximate distribution:
      M.S. digit   % of folders   Ideal Benford %
          1             32             30.1
          2             15             17.6
          3             12             12.5
          4             12              9.7
          5             19              7.9
          6              3              6.7
          7              3              5.8
          8              2              5.1
          9              2              4.6
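
      Checking against the law is a one-liner per digit, by the way; the expected share of leading digit d is log10(1 + 1/d). A quick sketch in Python, using the observed percentages from the table above:

      import math

      # Observed leading-digit percentages from the recovered folder names above.
      observed = {1: 32, 2: 15, 3: 12, 4: 12, 5: 19, 6: 3, 7: 3, 8: 2, 9: 2}

      for d in range(1, 10):
          expected = 100 * math.log10(1 + 1 / d)  # Benford's expected percentage
          print(f"{d}: observed {observed[d]:>2}%  expected {expected:.1f}%")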
  • One thing that screws up web page studies is that some sites duplicate pages hundreds or thousands of times.

    Oliver Steele did a cute study on how to spell aargh. [osteele.com]

    Unfortunately, much of his data is screwed up because he counted pages for each spelling, not unique pages.

    For this study, I don't see this problem occurring.
  • by TheJavaGuy ( 725547 ) on Wednesday January 25, 2006 @04:15PM (#14561901) Homepage
    FYI, Opera also supports SVG. I'm surprised that Ian Hickson didn't also mention Opera on that Google page; after all, he worked at Opera until a few months ago.
  • For instance one thing that surprised me was that the <title> is more popular than <br>

    I'm not surprised. The TITLE container is required for every HTML page to be considered valid across all versions and is the most important text on the page, used by search engines to link to the page. Though browsers will accept pages without it, you'd be a damn fool not to use it.

    BR is optional and generally unnecessary when P handles your general hard line breaking needs. Even with TITLE being once, only
  • Heh (Score:4, Interesting)

    by Z0mb1eman ( 629653 ) on Wednesday January 25, 2006 @04:20PM (#14561944) Homepage
    This reminds me of the old joke that there only ever was one 'make' script, and everyone else modified it.

    I wonder how much of what they found is influenced by how people learned to write HTML, which in all likelihood was by copying code from existing pages... it might explain parts of what they found, such as:

    Most people (roughly 98%) include head, html, title and body elements. This is somewhat ironic, since three of those four elements are optional in HTML
    • Re:Heh (Score:2, Informative)

      by Blink Tag ( 944716 )

      Most people (roughly 98%) include head, html, title and body elements. This is somewhat ironic, since three of those four elements are optional in HTML

      Somewhat true. The HEAD tag is technically optional (per spec), but TITLE is required, and must be in the HEAD. Thus HEAD is required in practice.

      From the HTML 4.01 spec [w3.org]:

      Every HTML document must have a TITLE element in the HEAD section.

      Though marked as "start tag optional"/"end tag optional", the BODY and HTML tags do provide useful semantic relevance

  • Font still popular (Score:3, Interesting)

    by superflippy ( 442879 ) on Wednesday January 25, 2006 @04:22PM (#14561961) Homepage Journal
    In their list of the 19 most popular elements, the font tag was #16. This element was deprecated when, back in 2000 or so?

    Of course, there may have been a lot of old pages in the sample, or pages built with older versions of HTML. But I've seen first-hand people using font tags to make an error message red, for example, even in a page that's using XHTML 1.0. I try to explain to the developers I work with why they shouldn't use them. I remove the font tags when those same developers add them to pages I've laid out for them. Zombie-like, they refuse to die.
    • I've seen first-hand people using font tags to make an error message red

      You know, there's something to be said for the straightforwardness of the "Font. Color. Red. Do it." approach.

      With CSS, the developer has to decide whether to set the color as an inline style, as a page-defined style, or as part of an external stylesheet. Whether to apply that style to an existing element containing the error message, or to wrap the error text in a new SPAN element. Whether the CSS style should be applied based on tag
  • table with no (Score:5, Informative)

    by saigon_from_europe ( 741782 ) on Wednesday January 25, 2006 @04:32PM (#14562070)
    From the article:
    If someone can explain why so many pages would use a <table> tag and then not put any cells in it, please let us know.
    I don't know if they counted dynamic pages, but I guess they did. In dynamic pages, an empty table is quite normal.

    Your code usually goes like this:
    <table>
    <% for each element in collection %>
    <tr><td> something </td></tr>
    <% end for %>
    </table>

    So it is quite easy to get the empty table if the collection is empty.
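
    In plain Python the same pattern (and the obvious fix, guarding the wrapper rather than each row) looks like this; it's just an illustration, not the poster's actual code:

    def render_table(collection):
        # Mirrors the template above: the <table> wrapper is emitted
        # unconditionally, so an empty collection yields a cell-less table.
        rows = "".join(f"<tr><td>{item}</td></tr>" for item in collection)
        return f"<table>{rows}</table>"

    def render_table_guarded(collection):
        # Guarding the wrapper avoids the empty <table> the study noticed.
        if not collection:
            return ""
        rows = "".join(f"<tr><td>{item}</td></tr>" for item in collection)
        return f"<table>{rows}</table>"

    print(render_table([]))          # <table></table>  <- the empty table
    print(render_table_guarded([]))  # (empty string)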
    • I don't know if they counted dynamic pages, but I guess they did. In dynamic pages, an empty table is quite normal.

      I doubt it. This is from Google, which only searches the server's output, not the uncompiled code.
  • by AndrewStephens ( 815287 ) on Wednesday January 25, 2006 @04:43PM (#14562190) Homepage
    I would be interested in seeing how many web pages use Java applets, Flash, Shockwave, Quicktime, ActiveX controls etc, etc. Sadly the authors did not include this information.
  • Among the top 15 attributes used in the [script] tag are the following:

    "langauge"
    "langugage"
    "languaje"

    Link to that page in the stats:
    http://code.google.com/webstats/2005-12/scripting.html [google.com]

    I just have no comment on this.
  • Web developers shouldn't aim to write for one browser, but for as many as possible.

    They're doing the exact opposite of what they should be doing.

    They're doing what led us into this shitty IE situation in the first place: targeting specific browsers instead of the public.

    Can anyone tell me what's here that can't be visualized with GIFs?

    Even if it'd mean fewer features for the user, they should at least fall back gracefully to a more basic technology than SVG.

    How do these pages look on IE, Opera, Safari, or
    • by pbhj ( 607776 )
      >>> "Can anyone tell me what's here that can't be visualized with GIF's?"

      I don't think that's the point ... it's about the creation of the images, not their visualisation. These images can be created on the fly from varying data with only textual manipulation of the code - the processing will be extremely light as will the data load on the servers. Presumably the xml-to-image parsing in the browser incurs a processing penalty though.

      If you view code of one of the graphs http://code.google.com/webst [google.com]
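
      To make "created on the fly with only textual manipulation" concrete, here's a minimal sketch that builds a bar chart as an SVG string; the data and sizes are made up:

      def svg_bar_chart(values, bar_width=40, max_height=100):
          # Emit a tiny SVG bar chart as plain text; no image library involved.
          peak = max(values)
          bars = []
          for i, v in enumerate(values):
              h = round(max_height * v / peak)
              bars.append(
                  f'<rect x="{i * bar_width}" y="{max_height - h}" '
                  f'width="{bar_width - 4}" height="{h}" fill="steelblue"/>'
              )
          width = bar_width * len(values)
          return (f'<svg xmlns="http://www.w3.org/2000/svg" '
                  f'width="{width}" height="{max_height}">'
                  + "".join(bars) + "</svg>")

      print(svg_bar_chart([98, 88, 70, 21]))  # made-up page percentages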
  • by Ilgaz ( 86384 ) on Wednesday January 25, 2006 @05:43PM (#14562733) Homepage
    http://www.adobe.com/svg/viewer/install/main.html [adobe.com] has suitable plugins for the browser/OS of your choice.

    Note that even though I have had the SVG plugin installed for ages, Safari didn't display the graphs. Is it because I am not using "a browser with CSS"? Well, never mind really...

    This is why I and others have negative views of Firefox, SVG, and even .ogg: rootless promotion of this kind...

  • Wisdom (Score:3, Interesting)

    by AeroIllini ( 726211 ) <aeroilliniNO@SPAMgmail.com> on Wednesday January 25, 2006 @05:44PM (#14562737)
    They've really hit on some wisdom here.

    There are several statistics they quoted which I have suspected for a long time, but only now can confirm with numbers.

    more than half of pages use the target attribute on the a element somewhere.


    I can't begin to describe the frustration I feel when I'm forced to use Internet Explorer and clicking links causes pages to fire up in a million new windows. Whether or not a link opens in a new window, a new tab, or the current window/tab really should be a client-side choice. Webmasters think they're being helpful by letting you separate your workspace into many windows, but they're really just slowing people down. Thank God for Firefox.

    It seems most pages use presentational attributes: the fourth most used attribute across all elements is the table element's border attribute, followed by the height and width attributes on img, followed by <table width="">, <table cellspacing="">, <img border="">, and <table cellpadding="">. Interestingly, though, the most frequently used attribute on the body element (namely bgcolor) is only used on around half of pages, with all the other presentational attributes on body being used even less. One possible explanation is that on average, colors are mostly done using CSS, while layout is mostly done using HTML tables.


    This makes perfect sense. While colors, fonts, and styles are pretty much standard in a cross-browser environment, coding layout purely in CSS can be a terrible chore due to the many varying interpretations of the CSS box model. It's usually much quicker to do a few simple layouts in tables (header, sidebar, content) and use CSS for pretty much everything else.
  • by tedhiltonhead ( 654502 ) on Wednesday January 25, 2006 @07:03PM (#14563304)
    The linked site claims the Set-Cookie header is "considered insecure":
    The Set-Cookie header (which is one of the ten most-used headers) is present on about two orders of magnitude more pages than the Set-Cookie2 header (despite the former being considered insecure).
    After glancing over the RFC [ietf.org] for Set-Cookie2, I can't see where it says Set-Cookie is "insecure". Google turns up nothing useful. Does anybody know more about this?
    • by hixie ( 116369 )
      Yeah, I misspoke on this. Set-Cookie is insecure (due to domain-crossing problems -- should a cookie sent to a.b.c get sent to z.b.c? Depends on "b" and "c" in ways that depend on month-to-month political changes around the globe), but as far as I can tell, Set-Cookie2 is also insecure. I had thought it fixed this, but apparently not.
  • Fix for Firefox 1.5 (Score:3, Informative)

    by bigbadbuccidaddy ( 160676 ) on Wednesday January 25, 2006 @08:10PM (#14563792)
    If your Firefox 1.5 doesn't display the graphs, or crashes, do the following as suggested by the Google webstats author:

    Apparently there's a problem in Firefox 1.5 regarding SVG images if you
    had SVG in the registry. Try following the steps described here:

          https://bugzilla.mozilla.org/show_bug.cgi?id=303581#c3 [mozilla.org]

  • by Sontas ( 6747 ) on Friday January 27, 2006 @01:23AM (#14576587)
    1 billion pages! Talk about a violation of privacy! The justice department is only asking for a random sample of 1 million addresses and the search results for any 1 week period. This guy gets access to 1 billion pages via the google repository (whatever that is), conducts detailed analysis of the contents of those pages, and nary a word of dissent from the vast Slashdot audience.

"Hello again, Peabody here..." -- Mister Peabody

Working...