Googlebot and Document.Write 180

Posted by kdawson on Monday March 12, 2007 @01:06AM from the ajax-the-foaming-indexer dept.

With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.

This discussion has been archived. No new comments can be posted.

Googlebot and Document.Write

Load All Comments

Search 180 Comments Log In/Create an Account

Comments Filter:

Nonsense words? (Score:5, Funny)

by Whiney Mac Fanboy ( 963289 ) * writes: <whineymacfanboy@gmail.com> on Monday March 12, 2007 @01:09AM (#18312800) Homepage Journal

An alert came in in the late evening of March 10th for "zonkdogfology", one of the words in the first pair

zonkdogfology is a real word:
zonk-dog-fol-o-gy zohnk-dog--ful-uh-jee
noun, plural -gies.

1. the name given to articles from zonk where the summary makes no sense whatsoever.
Serious question now - is the author of the article worried that the ensuing slashdot discussion will mention all his other nonsense words? I've no doubt slashdotters will find & mention the other words here, polluting google's index....

Share
twitter facebook
- Re:Nonsense words? (Score:4, Funny)
  
  by Anonymous Coward writes: on Monday March 12, 2007 @01:26AM (#18312902)
  
  zonkdogfology is a real word:
  
  It's a perfectly cromulent word, and it's use embiggens all of us.
  
  Parent Share
  twitter facebook
- - Re: (Score:2)
    
    by Dogtanian ( 588974 ) writes:
    
    Seriously, he shouldn't have posted these words until he was done with the test.
    Absolutely, and a major problem I have with taking it seriously now is that Google uses words in links *to* a particular site. This means that there is now a very high risk of false positives if *anyone* has done so with the words that are only dynamically-written on the original page.
    
    If he'd been serious that "Over the next two weeks, I'll be watching to see two things", he should have kept his mouth shut for those two weeks.
The Results: (Score:5, Informative)

by XanC ( 644172 ) writes: on Monday March 12, 2007 @01:12AM (#18312812)

Save a click: No, Google does not "see" text inserted by Javascript.

Share
twitter facebook
- Re:The Results: (Score:5, Informative)
  
  by temojen ( 678985 ) writes: on Monday March 12, 2007 @01:27AM (#18312904) Journal
  
  And rightly so. You should be hiding & un-hiding or inserting elements using the DOM, never using document.write (which F's up your DOM tree).
  
  Parent Share
  twitter facebook
  - How does document.write mess up your DOM tree? (Score:2)
    
    by catbutt ( 469582 ) writes:
    
    I don't believe you.
    - Re:How does document.write mess up your DOM tree? (Score:5, Informative)
      
      by XanC ( 644172 ) writes: on Monday March 12, 2007 @01:58AM (#18313033)
      
      If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.
      
      In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
      
      Parent Share
      twitter facebook
      - Re:How does document.write mess up your DOM tree? (Score:5, Insightful)
        
        by jesser ( 77961 ) writes: on Monday March 12, 2007 @02:40AM (#18313193) Homepage Journal
        
        Perhaps more importantly, document.write can't be used to modify a page that has already loaded, limiting its usefulness for AJAX-style features.
        
        Parent Share
        twitter facebook
        
        Re:How does document.write mess up your DOM tree? (Score:4, Funny)
        
        by hackstraw ( 262471 ) writes: on Monday March 12, 2007 @09:14AM (#18315201)
        
        One of the most clever uses of document.write I've seen was something like: document.write("<--") YOU NEED JAVSCRIPT FOR THIS PAGE document.write("--&gt")
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by aymanh ( 892834 ) writes:
        
        Whatever happened to the <noscript> element [w3.org]?
      - Re: (Score:2)
        
        by CastrTroy ( 595695 ) writes:
        
        What if you need to insert large amounts of HTML into a page? What if you don't have all your HTML that you want to insert laid into a perfect, XML compliant document? I realize that in most cases document.createElement is the better of the 2 methods, but it isn't always possible to not use document.write. There are some instances where it is unavoidable.
        
        Re: (Score:2)
        
        by XanC ( 644172 ) writes:
        
        There are some instances where it is unavoidable.
        
        Can you give an example?
      - Re: (Score:2)
        
        by CarpetShark ( 865376 ) writes:
        
        because there's no way to guarantee the document will continue to be valid.
        
        Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)
        
        Not that it's wrong to have failsafes in place, and not that XHTML isn't fine without document.write, but this "validity guarantee" argument is a little worrying.
        
        Re: (Score:2)
        
        by thePowerOfGrayskull ( 905905 ) writes:
        
        Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;) Not that it's wrong to have failsafes in place, and not that XHTML isn't fine without document.write, but this "validity guarantee" argument is a little worrying.
        That's right up there with saying it's a little worrisome that an invalid cast in a strictly typed language generates a compiler error. After all, can't we trust humans to know what they're doing?
        
        Re: (Score:3, Insightful)
        
        by ultranova ( 717540 ) writes:
        
        Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)
        Based on all the segfaults, blue screens of death, X-Window crashes, Firefox crashes, code insertion bugs et cetera I've seen, I'd say that no, in general programmers don't know what they're doing, and certainly shouldn't be trusted to not fuck it up. The less raw access to any resource - be it memory or document stream - they are given, the better.
        
        Re: (Score:2)
        
        by CarpetShark ( 865376 ) writes:
        
        That's fine up to a point, but there should be a way around these limitations. In C, it's all too easy to screw things up with null pointers etc., but if we didn't have those low-level features, a lot of important software would be impossible to write.
        
        I'm not saying that Javascript should ENCOURAGE low-level access to the document, but to flatly deny those things is to falsely limit a language. Languages, after all, are supposed to allow you to express ANYTHING.
        
        Re: (Score:2)
        
        by ultranova ( 717540 ) writes:
        
        That's fine up to a point, but there should be a way around these limitations.
        No. If there's a way around these limitations, then most programmers will simply turn them off because they are experts and know what they're doing. And then the user has to suffer the consequences of the expert's ego and laziness.
        No, bounds checking needs to be mandatory, not voluntary; otherwise it goes unused and the problems continue.
        In C, it's all too easy to screw things up with null pointers etc., but if we didn't h
      - Re: (Score:2)
        
        by jcuervo ( 715139 ) writes:
        
        Hmm. I just do <span> or <div> and document.getElementById('whatever').innerHTML = "...";
        
        Am I wrong?
        
        Re: (Score:2)
        
        by Sancho ( 17056 ) * writes:
        
        innerHTML is nonstandard, but supported in most modern browsers, so I often do it anyway.
        Ah, the mentality that makes so many web pages crap out in Firefox or Opera...
        
        Code to the standard and let your client decide what obscure browser to use.
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
        
        Re: (Score:2)
        
        by Raenex ( 947668 ) writes:
        
        Code to the standard and let your client decide what obscure browser to use.
        Actually you should code to what works in all the major browsers. If that's the standard, great. Even better, as much as possible use a library that takes care of it for you.
        
        Re: (Score:3, Interesting)
        
        by Sancho ( 17056 ) * writes:
        
        How should 'major', though? When most Firefox-borked sites were coded, Firefox probably had less than 5% (around what Safari had, last I heard). Is 5% enough to overlook? What about 3%? 1%?
        
        If you code to the standard, at least you can blame browsers for their broken implementation.
        
        Re: (Score:2)
        
        by Raenex ( 947668 ) writes:
        
        Back before there was Firefox there was Netscape/Mozilla. Firefox inherited from that. So if you were coding for the "major" browsers you would probably have been ok when Firefox came out.
        
        There are also standards which neither browser support. Bottom line is you have to code to what works, and preferably the standards if the browsers support them. I recommend watching the lecture: An Inconvenient API: The Theory of the DOM (three parts, downloadable here [yahoo.com]).
        
        I wish it was just as simple as "follow the sta
      - Re: (Score:2)
        
        by suv4x4 ( 956391 ) writes:
        
        If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.
        
        In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
        
        element.innerHTML works even on XHTML MIME documents however (Firefox, Opera etc), and there's no significant hurdle to support doc
  - True (Score:2)
    
    by XanC ( 644172 ) writes:
    
    I doubt Google will notice DOM-created elements, either. But the author should re-test with that. And I would suggest that he post the result only if it turns out Google can see that, because we all assume it can't.
- google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)
  
  by kale77in ( 703316 ) writes: on Monday March 12, 2007 @02:30AM (#18313159) Homepage
  I think the actual experiment here is:
  
  Create a 6-odd-paragraph page saying what everybody already knows.
  
  Slashdot it, by suggesting something newsworthy is there.
  
  Pack the page with Google ads.
  
  Profit.
  
  I look forward to the follow-up piece which details the financial results.
  Parent Share
  twitter facebook
  - Re:google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)
    
    by Scarblac ( 122480 ) writes: <slashdot@gerlich.nl> on Monday March 12, 2007 @03:46AM (#18313467) Homepage
    
    Exactly, this is the typical sort of fluff that Digg seems to love. As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by caluml ( 551744 ) writes:
      
      But with the Firehose, Slashdot will now start using the "wisdom" of crowds to produce the same pap that Digg does.
      Shall we all migrate to Technocrat, anyone? It has decent stories.
    - Re: (Score:3, Insightful)
      
      by dr.badass ( 25287 ) writes:
      
      As far as I know, Slashdot had avoided this particular type of adword blog post crap until now
      
      It used to be that the web as a whole avoided this crap. Now, it's so easy to make stupid amounts of money from stupid content that a huge percentage of what gets submitted only even exists for the money -- it's like socially-acceptable spam. Digg is by far the worst confluence of this kind of crap, but the problem is web-wide, and damn near impossible to avoid.
      - Re: (Score:2)
        
        by Raenex ( 947668 ) writes:
        
        The sad thing is I bet the vast majority of crap like this earns enough to buy lunch or something. There's a lot of people running around trying to get rich doing this, but Google is the real winner.
    - Re: (Score:2)
      
      by ColaMan ( 37550 ) writes:
      
      As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.
      
      Two words:
      Roland Piquepaille.
  - Re: (Score:2)
    
    by Restil ( 31903 ) writes:
    
    I've noticed something with regards to my own site and the few google ads I have placed on the back pages. On those occasions when I get heavy traffic from a link on a popular tech site, my average click ratio goes way down. Slashdot users aren't going to pages to search for products to buy, so it's highly unlikely more than a very few will ever click on any ads, if any at all. Now if the article was promoting a product that the average geek would be interested in, and there were ads on the page for that
  - - Re: (Score:2)
      
      by geoffspear ( 692508 ) writes:
      
      I'm seeing ads right after the first paragraph in the blog story. You must be blocking them.
      
      I'm not sure if 2 of the 3 ads are for brain injury and brain tumor treatment because the name of his blog is "Brain handles" or because you'd need to have a brain injury to do the "research" he's doing.
Google Pigeon technolog (Score:3, Funny)

by sdugoten2 ( 449392 ) writes: on Monday March 12, 2007 @01:14AM (#18312832)

The Google Pigeon [google.com] is smart enough to read through Document.write. Duh!

Share
twitter facebook
If they weren't, then they're trying (Score:4, Interesting)

by AnonymousCactus ( 810364 ) writes: on Monday March 12, 2007 @01:18AM (#18312866)

Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.

Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...

Share
twitter facebook
- Re: (Score:2)
  
  by TubeSteak ( 669689 ) writes:
  
  Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.
  
  Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead.
  Google tends to nuke those sites from orbit once it disc
- Re:If they weren't, then they're trying (Score:5, Insightful)
  
  by gregmac ( 629064 ) writes: on Monday March 12, 2007 @03:16AM (#18313339) Homepage
  
  You have to also remember though, that often the content generated dynmically is going to be of no use to a search engine, it will often be user-specific - there's obviously some reason it's being generated that way.
  
  And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering.. well, they deserve what they get :)
  
  Parent Share
  twitter facebook
- Re: (Score:2)
  
  by jrumney ( 197329 ) writes:
  
  Google should index the static content, but run/analyse the Javascript and throw out any pages where the user-visible content changes drastically. To be 100% effective though, they'd have to fake the IE or Firefox User-Agent, and use IP addresses from an ISP's dynamically assigned range for their crawling, which some people might see as evil.
- Re: (Score:2)
  
  by CastrTroy ( 595695 ) writes:
  
  Rightfully so for google news. Is there any way to configure google news only to show links to articles on certain sites? Or to blacklist certain sites? I really hate those "news" sites that put javascript on every seventh word, so that if you hover over the word, it shows a little pop-up div type ad. It's especially annoying because I like to highlight text as I read it, because I find it easier. I wish google would run all the JS in a page, and lower the ranking if it contained too many ads.
- - Re: (Score:2)
    
    by Skreems ( 598317 ) writes:
    
    They already can and do serve different content to GoogleBot than to normal web users. All it takes is checking the client string.
    - Re: (Score:2)
      
      by Mixel ( 723232 ) writes:
      
      Which is a PITA when the IEEE does it. Look! Free searchable online papers in pdf format!... NOT*!
      
      *apologies, watched the Borat trailer too many times
How did this make the front page? (Score:2, Insightful)

by Anonymous Coward writes:

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would show me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.
- Re: (Score:2)
  
  by Jake73 ( 306340 ) writes:
  
  Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.
  - Re: (Score:2)
    
    by kv9 ( 697238 ) writes:
    
    Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.
    well that too, but in general it's even easier. just aim low and hope for the best. it's not very hard to appeal to the mainstream. shit, it's the largest audience out there.
- - Re: (Score:2)
    
    by NoTheory ( 580275 ) writes:
    
    Why should it? I mean, isn't the point of javascript content that responds dynamically to the intentions of an agent? The googlebot, although an extremely complicated AI agent, isn't intentional. It doesn't know what it's doing on a site, and so i figure probably shouldn't just be allowed out to wreak havoc. Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?
    - - Re: (Score:2)
        
        by zobier ( 585066 ) writes:
        
        Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?
        JavaScript doesn't have fork AFAIK.
        The setTimeout function can do a similar thing.
        
        Re: (Score:3, Informative)
        
        by Bitsy Boffin ( 110334 ) writes:
        
        From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream, you will not get two threads in the same javascript code pile running simultaneously, the timeout will not fire until the execution stream is idle.
        
        Re: (Score:2)
        
        by zobier ( 585066 ) writes:
        
        From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream, you will not get two threads in the same javascript code pile running simultaneously, the timeout will not fire until the execution stream is idle.
        Uh-uh, it has its own execution context. You can absolutely run timed out functions concurrently. Try this:
        <script type="text/javascript">  </script>
        
        Re: (Score:2)
        
        by zobier ( 585066 ) writes:
        
        Without any output it doesn't prove concurrent execution. It could just be a garden variety infinite loop, albeit one with an indirect execution path.
        You want output, how's this for output?
        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transition al.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <me ta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title></title> </head> <body> <script type="text/javascript"> <!-- var total = 0; function counter ( ) {
      - Re: (Score:3, Informative)
        
        by VGPowerlord ( 621254 ) writes:
        
        Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.
        Did you know that 99% of all statistics are made up?
        
        I can source some Javascript statistics: W3Schools reports [w3schools.com] that, as of January 2007, 94% of their audience has Javascript turned on, a significantly lower statistic than you are reporting. Not only that, but it is actually the highest percentage since they started recording them binannually in late 2002.
        
        It's a moot poi
        
        Re: (Score:2)
        
        by xoyoyo ( 949672 ) writes:
        
        "You cannot - as a web developer - rely only on statistics. "
        
        No, indeed, because doing things based on empirical evidence is foolish behaviour. On the other hand you should take a political position (that data, presentation and behaviour should be kept separate) and behave as though that was in some way more true than a statistical value.
        
        I'm not saying that the separation of data, presentation and behaviour is wrong, just that you have to realise that it's a human engineered best practise, not a law of the
        
        Re: (Score:2)
        
        by xoyoyo ( 949672 ) writes:
        
        (Slashdot swallowed my sarcasm tag there - the first paragraph should be read in a mildly mocking voice - the second one is the meat of the matter)
  - Re: (Score:3, Interesting)
    
    by vidarh ( 309115 ) writes:
    
    Because doing so without massive limitations would involve the halting problem. A search engine simply CAN'T determine whether a certain piece of javascript will terminate in the general case. In lots of special cases, yes (such as when there's no control constructs, or the control constructs can't possibly cause loops or recursion etc.) and they could use timeouts etc. or only execute the first "n" steps of an interpreter, yes. But all of it would mean essentially crippling the feature.
    And for what? So t
Google request external JavaScript file? (Score:4, Insightful)

by JAB Creations ( 999510 ) writes: on Monday March 12, 2007 @01:22AM (#18312878) Homepage

Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.

Share
twitter facebook
- Re: (Score:3, Informative)
  
  by The Amazing Fish Boy ( 863897 ) writes:
  
  I have actually seen some reports [google.com] of a "new" Googlebot requesting the CSS and Javascript. The rumour I heard was that it was using the Gecko rendering engine or something along those lines. This was some time ago. I'm not sure what ever became of this.
Doesn't work; Good (kind of) (Score:5, Insightful)

by The Amazing Fish Boy ( 863897 ) writes: on Monday March 12, 2007 @01:29AM (#18312919) Homepage Journal
FTFA:
Why was I interested? Well, with all the "Web 2.0 technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine if the content is searchable.
Good. I am glad it doesn't work. Google's crawler should never support Javascript.

The model for websites is supposed to work something like this:
- (X)HTML holds the content
- CSS styles that content
- Javascript enhances that content (e.g. provides auto-fill for a textbox)
In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.

Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.

The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
Share
twitter facebook
- Re: (Score:3, Informative)
  
  by doormat ( 63648 ) writes:
  
  I thought I remember a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course the easy work around for that is to use an image for your background and then that may fool the bot, but who knows, they could code to accomidate that too.
  
  Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.
  - Re: (Score:2)
    
    by zobier ( 585066 ) writes:
    
    I thought I remember a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course the easy work around for that is to use an image for your background and then that may fool the bot, but who knows, they could code to accomidate that too.
    You could use OCR to detect that (and to index images used for text content).
- Re: (Score:3, Insightful)
  
  by cgenman ( 325138 ) writes:
  
  In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.
  
  Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer. To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
  
  Web pages ar
  - Re: (Score:3, Insightful)
    
    by WNight ( 23683 ) writes:
    
    I don't know about you, but I write my webpages so that when the style goes away, the page still views in a basic 1996 kind of style. Put the content first and your index bars and ads last then use CSS to position them first, visibly. This way if a blind user or someone without style sheets sees the site it at least reads in order.
    - Re: (Score:2)
      
      by caluml ( 551744 ) writes:
      
      View, Page Style, No Style in Firefox will show you what your page looks like to browsers/spiders.
  - Re: (Score:3, Insightful)
    
    by The Amazing Fish Boy ( 863897 ) writes:
    
    Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer.
    How's this? Disable CSS on Slashdot. First you get the top menu, then some options to skip to the menu, the content, etc. Then you get the menu, then the content. It's very easy to use it that way.
    To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible
- Re: (Score:3, Insightful)
  
  by Animats ( 122034 ) writes:
  The model for websites is supposed to work something like this:
  If only. Turn off JavaScript and try these sites:
  
  Ford Motor Company [ford.com]
  
  Jeep [jeep.com]
  
  Credit Suisse [credit-suisse.com]
- Re: (Score:2)
  
  by Lord Ender ( 156273 ) writes:
  
  The old model is dying. Simple web pages are on the way out. Web applications are the future.
  
  A search engine that indexes web applications is more useful to me than one that can not.
  
  Google realizes that, and you don't.
- Re: (Score:2)
  
  by foniksonik ( 573572 ) writes:
  
  The model should really be
  
  DOM holds the content (whether HTML/XHTML/XML or plain text; static/dynamic or mixed)
  CSS styles that content
  Javascript enhances that content (e.g. provides auto-fill for a textbox)
  
  Google should be indexing the DOM and it's contents, not the code in the file. That's like indexing the english Dictionary and saying you've indexed the english language.
  
  Websites are going to be more and more dynamic. Content is going to be added directly to the page from an amalgamation of sources with t
- Re: (Score:2)
  
  by suv4x4 ( 956391 ) writes:
  
  The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
  
  Google has a bot that understands CSS and JavaScript, based roughly on the Mozilla source code (wondered why they hire so many Firefox develop
- - Re: (Score:2)
    
    by catbutt ( 469582 ) writes:
    
    Who's talking about Java?
    - Re:Doesn't work; Good (kind of) (Score:4, Informative)
      
      by VGPowerlord ( 621254 ) writes: on Monday March 12, 2007 @04:50AM (#18313695)
      
      In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." – Webmaster Guidelines [google.com], Technical Guidelines section, bullet point 1.
      
      Parent Share
      twitter facebook
      - Looking for good/current Lynx for Windows/XP (Score:2)
        
        by martyb ( 196687 ) writes:
        
        In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." - Webmaster Guidelines, Technical Guidelines section, bullet point 1. (emphasis added)
        Can anyone here recommend a good place to download a current port of Lyn
- - Re: (Score:3, Insightful)
    
    by Rakishi ( 759894 ) writes:
    
    Huh? He's talking about browser generated content, most dynamic content is server side generated (like slashdot but I think slashdot may have flat files as cache for speed reasons). No one said that nice xml file can't be generated by the server when the page is called.
    - - Re: (Score:2)
        
        by Rakishi ( 759894 ) writes:
        
        Yes, I'm well aware of that but he's more specifically talking about certain uses of AJAX. The poster I replied to said dynamic content which is a whole lot more than either the original poster meant or even what AJAX encompasses. Databases and non-flat things are not AJAX and not in any way what the original poster meant. I was simply pointing out that the person I replied to can't read.
        
        AJAX should degrade gracefully, if you don't have javascript things should still work which means that search spiders sho
- - I would make normal links, then use JS on top (Score:4, Insightful)
    
    by The Amazing Fish Boy ( 863897 ) writes: on Monday March 12, 2007 @03:33AM (#18313417) Homepage Journal
    
    So, what do you have to say about websites that have their entire user-interfaces built with content that gets filled by javascript asynchronously from a single html page?
    If I understand you, you something like this: The site has two parts, a menu and content. When you click a menu item, rather than being taken to a new URL, it executes Javascript which fetches only the new content from the web server, then replaces the content section. So the URL doesn't change.
    
    It's a nice improvement. Less bandwidth used, and a quicker interface.
    
    Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this:
    // menuLink is the DOM element for each menu link. // (i.e. get it from document.getElementById(), etc.) menuLink.onclick = function() { getNewContent(); return false; }
    
    (FYI, this is how I do pop-up windows, too.)
    
    Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.
    
    I think Gmail actually offers two versions. One for older browser that uses no (or little?) Javascript, and the other which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though it may encourage them to update to a newer browser, I guess.
    
    (Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
    
    Parent Share
    twitter facebook
    - - Re: (Score:2)
        
        by The Amazing Fish Boy ( 863897 ) writes:
        
        I agree with you that with a small enough user base (or one that is adequately controlled), you can cut some corners, especially if time is a constraint. Generally I would say whenever your users could reasonably demand their browser work. That is, if the site is going to be publicly accessible, I would not make Javascript a requirement. I'm not sure what the actual "limit" I would put on the number of users would be; I think that would vary from project to project.
        
        It's a matter of project requirements,
  - Re: (Score:2)
    
    by maxwell demon ( 590494 ) writes:
    
    And how do you bookmark a certain view of that page (which, to you as page user, is a separate page after all)?
Accessibility? (Score:2, Informative)

by BladeMelbourne ( 518866 ) writes:

The bottom line is your web sites should probably degrade nice enough when JavaScript is not enabled. It might not flow as nice, the user may have to submit more forms, but the core functionality should still work and the core content should still be available.

DDA / Section 508 / WCAG - the no JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beems for eyes making lawsuits and all.)
Document.write() is not the way to go (Score:2)

by Max Romantschuk ( 132276 ) writes:

Document.write() is executed as the page loads. Most AJAX-style implementation rely on either the innerHTML-property or creating nodes through the DOM. Testing these would tell us much more than testing Document.write().
From TFA: (Score:2)

by bennomatic ( 691188 ) writes:

So, some friends and I have been bantering back and forth about how Google treats content that has been inserted into a page using Javascript. So I decided to do an experiment. This page has six nonsense words. Two are hardcoded into the page via straight HTML. Two are inserted via Javascript, but the script is part of the page HTML. The last two are inserted via Javascript, but the script is on a remote server. The purpose of the test is to see three things... * The time lapse between when the words
- - - Re: (Score:2)
      
      by maxwell demon ( 590494 ) writes:
      
      Except that
      
      content may depend on user actions in a non-trivial way (i.e. if the page contains things like onclick or onmouseover, the dependence on the sequence of events occuring may be quite complex),
      
      content may be requested by callbacks to the server (after all, that's what AJAX is all about), in which case I'm not sure it's a good idea for the search engine to execute it,
      running a certain script might be expensive in time and/or memory,
      by processing JavaScript, the search engine might open up itsel
(tagging beta) (Score:2)

by Jack Schitt ( 649756 ) writes:

I predict that from now on, zonkdogfology will be a common tag for all articles that relate to google search...
Google doesn't, but it's possible (Score:3, Informative)

by Animats ( 122034 ) writes: on Monday March 12, 2007 @03:49AM (#18313481) Homepage

I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth [sitetruth.com]) and extract data, and I've been considering how to deal with JavaScript effectively.
Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages and run Javascript like a browser, builds the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a psuedo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.
Ghostscript [ghostscript.com] had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text out.
OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.
Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

Share
twitter facebook
- Re: (Score:2)
  
  by dargaud ( 518470 ) writes:
  
  OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
  Yes, and it should work like that too. If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too. I suggested that during a job interview witha *cough* serious search engine: use a secondary crawler reporting as a normal IE/firefox, load a page usin
  - OCR and web sites (Score:2)
    
    by Animats ( 122034 ) writes:
    
    If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too.
    Web site images with logos against faint but busy backgrounds are moderately common. I'm talking about stuff like this. [ddfurnitur...dators.com] Commercial OCR programs interpret that as "a picture". Because we're working to automatically extract business identities from uncooperative websites, we sometimes need heavier technology than the search engines.
- Re: (Score:2)
  
  by VGPowerlord ( 621254 ) writes:
  
  Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.
  Does Google run macros in Word documents? No? Then why are you even comparing this? I can parse a PDF document or a Word document without having to have a script interpreter running.
  
  I imagine that the Googlebot crawler is a rather simplistic program that only knows how to:
  1. Read robots.txt
  2. Read meta tags (robot tags in particular)
  3. Fi
- Re: (Score:2)
  
  by imroy ( 755 ) writes:
  
  Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language.
  Ghostscript had to deal with what problem? Yes, PostScript is a programming language with built-in graphics primitives. What does that have to do with search engines? It doesn't have to recognise certain outlines as being text (i.e text drawn without using the PostScript primitive for drawing text), it just draws it. Ghostscript is just another implementation of a lang
  - Re: (Score:2)
    
    by shish ( 588640 ) writes:
    
    Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language.
    Ghostscript had to deal with what problem? Yes, PostScript is a programming language with built-in graphics primitives. What does that have to do with search engines?
    Postscript is a programming language, not a page description language; you need to write a language interpreter, not just a data parser, to get the most from it. HTML + Javascript also requires an interp
If you want to see (Score:4, Funny)

by BrynM ( 217883 ) * writes: on Monday March 12, 2007 @05:28AM (#18313837) Homepage Journal

If you want to see through a search engine's eyes, open the page in Lynx [browser.org]. The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

(method does not account for image crawlers)

Share
twitter facebook
Google holds back the web! (Score:2, Insightful)

by mumblestheclown ( 569987 ) writes:

this is a pretty straightforward example of how google holds back the web. this is not google's fault, per se, but it definitely is true. We routinely resort to older, inefficient technologies for our websites simply to please google. it works well for us from an advertising standpoint, but is often incredibly stupid technologically.
- Re: (Score:2)
  
  by mumblestheclown ( 569987 ) writes:
  
  As always, I'm underwhelmed by idiot slashdot moderators who mark my comment as 'flamebait.' You may not agree with it, but 'flamebait?' It's not even CLOSE to flamebait. / most of the comments i make on slashdot that get moderated get BOTH {troll/flamebait} AND {insightful/interesting}. If that doesn't suggest that slashdot's moderation system isn't heavily broken, then I don't know what does. I'd understand {overrated}+{interesting} or something like that, but my comments show rather clearly that mod
- - Re: (Score:2)
    
    by mumblestheclown ( 569987 ) writes:
    
    Your whole argument rests on the belief that the developers are doing something wrong. If you start from that premise, of course you are going to come to that conclusion. Just like we do things that we wouldn't otherwise do because of google, you encourage others to. The point is that a better search engine would not force people to change their technologies / add 5-10 lines of "extra" code. extra means extra. A better search engine would do this for you and allow people to use the full extent of tools
AJAX is for writing applications not Documents (Score:2, Interesting)

by e-Trolley ( 771869 ) writes:

AJAX is for writing applications not Documents. Why and how should an application be indexed?
- Re: (Score:2)
  
  by Raenex ( 947668 ) writes:
  
  The line between application vs document gets blurry, fast. Consider a site like Try Ruby! [hobix.com]. There's definitely content hidden inside the tutorial, yet a search engine will never see it.
err news? (Score:2)

by Vexorian ( 959249 ) writes:

Hey I didn't think that after read-skipping the first paragraphs the article would actually state the obvious, that javascript generated content is not indexed... I would have expected an article to appear in case he found out it WAS INDEXED...

In other news: Most plants are green!
Intent-based markup -- a look ahead (Score:2)

by leighklotz ( 192300 ) writes:

Take a look at the Web Accessibility roadmap from the W3C, and in particular the section on intent-based markup [w3.org].
What a dumb idea (Score:2)

by Sloppy ( 14984 ) writes:

Now, its too early to say conclusively that Google will never index the JavaScript-generated content..
..but still, we can hope Google doesn't completely cave in to useless trendy bullshit.
- Re: (Score:2)
  
  by DrXym ( 126579 ) writes:
  
  I do this also on my pages. I also mangle certain things like affiliate links, AdSense ids etc. not for any particular reason except I don't like the idea of any search engine inadvertantly indexing them.
- Re: (Score:2)
  
  by DrSkwid ( 118965 ) writes:
  
  If you want people to contact you, you should provide the contact details properly and suck up the spam yourself.
  
  Choose a non-default email i.e. not webmaster but web-master and deal with the consequences.
  
  In my eyes, a customer/client/new friend not being able to contact you is far more expensive than dealing with some *more* spam.
- - Re: (Score:2)
    
    by Megane ( 129182 ) writes:
    
    Meh. I have mine double-escaped, using two unescape() calls. The first hides the e-mail address, and the second hides the HTML for the mailto link. It even has a <noscript> condition to point out that the user does not have Javascript enabled. I've been scrupulous about noscript ever since one web site that just displayed a blank black page with JS disabled.
    If I was really paranoid, I'd probably come up with some sort of while loop to decode the mail address, and a skip-over condition to change the i
- Re: (Score:2)
  
  by julesh ( 229690 ) writes:
  
  Web apps these days consist nearly entirely of dynamic content invisible to googlebot. If you try to make your page visible on the web, this is really a problem. But think twice before adding invisible div's or alike in order to achieve proper seach results: Google might as well ban you (since they don't check whether or not the keywords you name in your invisible divs do in fact relate to the page's purpose or contents).
  
  OTOH, there's nothing wrong at all with having static content that is only displayed to
  - Re: (Score:2)
    
    by julesh ( 229690 ) writes:
    
    AFAIK, google does index the content of tags.
    
    Erm. "of <NOSCRIPT> tags". Sorry.
- Re: (Score:2)
  
  by leighklotz ( 192300 ) writes:
  
  This TAG finding is all the more reason for the W3C to support declarative approaches to markup, which allow you to express intent in markup, and leave another level to convert that intent to presentation. This approach starts at the top with technologies such as CSS, but the need for dynamic pages is better addressed by recent additions such as XBL (here's an example in mozilla [mozilla.org] -- think of it as like CSS but binding to script instead of to a fixed set of attributes) and XForms [wikibooks.org] (think of it as a 3-layer mo

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Nonsense words? (Score:5, Funny)

Re:Nonsense words? (Score:4, Funny)

Re: (Score:2)

The Results: (Score:5, Informative)

Re:The Results: (Score:5, Informative)

How does document.write mess up your DOM tree? (Score:2)

Re:How does document.write mess up your DOM tree? (Score:5, Informative)

Re:How does document.write mess up your DOM tree? (Score:5, Insightful)

Re:How does document.write mess up your DOM tree? (Score:4, Funny)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Interesting)

Re: (Score:2)

Re: (Score:2)

True (Score:2)

google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)

Re:google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Google Pigeon technolog (Score:3, Funny)

If they weren't, then they're trying (Score:4, Interesting)

Re: (Score:2)

Re:If they weren't, then they're trying (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

How did this make the front page? (Score:2, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Interesting)

Google request external JavaScript file? (Score:4, Insightful)

Re: (Score:3, Informative)

Doesn't work; Good (kind of) (Score:5, Insightful)

Re: (Score:3, Informative)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re:Doesn't work; Good (kind of) (Score:4, Informative)

Looking for good/current Lynx for Windows/XP (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2)

I would make normal links, then use JS on top (Score:4, Insightful)

Re: (Score:2)

Re: (Score:2)

Accessibility? (Score:2, Informative)

Document.write() is not the way to go (Score:2)

From TFA: (Score:2)

Re: (Score:2)

(tagging beta) (Score:2)

Google doesn't, but it's possible (Score:3, Informative)