Google User Journal

Googlebot and Document.Write

With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.
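A minimal version of the test page described might look like the following; the placeholder words, filename, and external host are illustrative stand-ins, not the author's actual test values:

```html
<!-- Illustrative reconstruction of the experiment, not the author's actual page -->
<html>
  <body>
    <!-- Two unique words in the plain HTML -->
    <p>plainwordone plainwordtwo</p>

    <!-- Two unique words written by an inline script -->
    <script type="text/javascript">
      document.write("<p>inlinewordone inlinewordtwo</p>");
    </script>

    <!-- Two unique words written by a script sourced from a different server;
         words.js would contain:
         document.write("<p>externalwordone externalwordtwo</p>"); -->
    <script type="text/javascript" src="http://other-server.example/words.js"></script>
  </body>
</html>
```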
This discussion has been archived. No new comments can be posted.

  • The Results: (Score:5, Informative)

    by XanC ( 644172 ) on Monday March 12, 2007 @01:12AM (#18312812)
    Save a click: No, Google does not "see" text inserted by Javascript.
  • Re:The Results: (Score:5, Informative)

    by temojen ( 678985 ) on Monday March 12, 2007 @01:27AM (#18312904) Journal
    And rightly so. You should be hiding & un-hiding or inserting elements using the DOM, never using document.write (which F's up your DOM tree).
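A minimal sketch of the hide-and-un-hide approach: the content is always present in the markup (so crawlers and no-JS users see it), and script only toggles its visibility. `setVisible` is an illustrative helper, not a standard API; `el` can be a real DOM element or any object with a `style` bag:

```javascript
// Toggle visibility of content that already exists in the markup.
// Crawlers index the static HTML; script only changes presentation.
function setVisible(el, visible) {
  // An empty string restores the element's default display value.
  el.style.display = visible ? "" : "none";
  return el;
}
```

In a browser you would call this from an event handler, e.g. `setVisible(document.getElementById("details"), true)`.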
  • Accessibility? (Score:2, Informative)

    by BladeMelbourne ( 518866 ) on Monday March 12, 2007 @01:38AM (#18312963)
    The bottom line is your web sites should probably degrade nicely enough when JavaScript is not enabled. It might not flow as nicely, the user may have to submit more forms, but the core functionality should still work and the core content should still be available.

    DDA / Section 508 / WCAG - the no-JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beams for eyes making lawsuits and all.)
  • by doormat ( 63648 ) on Monday March 12, 2007 @01:46AM (#18312991) Homepage Journal
    I seem to remember that a while ago some search engine was using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course, the easy workaround for that is to use an image for your background, and that may fool the bot, but who knows, they could code to accommodate that too.

    Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.
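The kind of heuristic described above might look roughly like this; the function names and the distance threshold are illustrative assumptions, not anything a search engine documents:

```javascript
// Parse a "#rrggbb" hex color into its RGB channels.
function hexToRgb(hex) {
  const n = parseInt(hex.replace("#", ""), 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Naive "hidden text" check: flag text whose color is identical or very
// close to the background color. The threshold is arbitrary and purely
// illustrative.
function looksHidden(textHex, bgHex, threshold = 32) {
  const a = hexToRgb(textHex);
  const b = hexToRgb(bgHex);
  const dist = Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
  return dist < threshold;
}
```

As the comment notes, a background image defeats a color-only check like this, which is why such heuristics are only one signal among many.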
  • by XanC ( 644172 ) on Monday March 12, 2007 @01:58AM (#18313033)

    If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.

    In this day and age, document.write should never be used; prefer the more verbose but more future-proof document.createElement and document.createTextNode notation.
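A minimal sketch of that replacement, with `doc` standing in for the browser `document` so the helper stays testable outside a browser; `appendParagraph` is an illustrative name, not a standard API:

```javascript
// Build nodes with createElement/createTextNode and attach them via
// appendChild, instead of writing raw markup into the document stream.
// This keeps the DOM tree valid, which document.write cannot guarantee.
function appendParagraph(doc, parent, text) {
  const p = doc.createElement("p");
  p.appendChild(doc.createTextNode(text));
  parent.appendChild(p);
  return p;
}
```

In a page this would be called as `appendParagraph(document, document.body, "Hello")`.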

  • by Animats ( 122034 ) on Monday March 12, 2007 @03:49AM (#18313481) Homepage

    I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth [sitetruth.com]) and extract data, and I've been considering how to deal with JavaScript effectively.

    Conceptually, it's not that hard. You need a skeleton of a browser: one that can load pages, run JavaScript, and build the document tree like a browser, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
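The flow described can be sketched as a toy: pull out a page's inline scripts, execute them against a stub document, then read the text that results. This is only an illustration of the idea under very naive assumptions (regex "parsing", a one-method document stub); a real crawler would need a full DOM and JS engine, as the comment suggests:

```javascript
// Toy "skeleton browser": collect inline <script> bodies, run them against
// a minimal document stub, and return the combined visible text.
function extractRenderedText(html) {
  const scripts = [];

  // Crude stand-in for parsing: strip inline scripts out of the markup.
  const staticMarkup = html.replace(
    /<script>([\s\S]*?)<\/script>/g,
    (_, body) => { scripts.push(body); return ""; }
  );

  // Minimal document stub: just enough surface for document.write.
  const written = [];
  const documentStub = { write: (s) => written.push(s) };

  // "Run the page": execute each script with the stub in scope.
  for (const body of scripts) {
    new Function("document", body)(documentStub);
  }

  // Strip tags from both the static markup and the written markup.
  const stripTags = (s) =>
    s.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
  return stripTags(staticMarkup + " " + written.join(" "));
}
```

The key point is that the script-generated text only exists after execution, which is exactly what a static crawler misses.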

    It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.

    Ghostscript [ghostscript.com] had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.

    OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.

    Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.

    Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

  • by VGPowerlord ( 621254 ) on Monday March 12, 2007 @03:54AM (#18313505)

    Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

    Did you know that 99% of all statistics are made up?

    I can source some JavaScript statistics: W3Schools reports [w3schools.com] that, as of January 2007, 94% of their audience has JavaScript turned on, a significantly lower figure than the one you are citing. Not only that, but it is actually the highest percentage since they started recording the statistic biannually in late 2002.

    It's a moot point, though: as the W3Schools stats page states, "You cannot - as a web developer - rely only on statistics. Statistics can often be misleading." Meaning that you should always code things so that they work with plain HTML/CSS, then use JavaScript to make them look/act nicer.
  • by VGPowerlord ( 621254 ) on Monday March 12, 2007 @04:50AM (#18313695)
    In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." – Webmaster Guidelines [google.com], Technical Guidelines section, bullet point 1.
  • by Bitsy Boffin ( 110334 ) on Monday March 12, 2007 @05:04AM (#18313745) Homepage
    From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream; you will not get two threads in the same JavaScript code pile running simultaneously, and the timeout will not fire until the execution stream is idle.
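That single-threaded behaviour can be demonstrated directly; this sketch runs the same way in Node.js or a browser:

```javascript
// Even a zero-delay timeout cannot preempt running code: its callback is
// queued and only runs once the current execution stream has finished.
const order = [];
setTimeout(() => order.push("timeout"), 0);
order.push("sync-1");
order.push("sync-2");
// At the end of the synchronous portion, `order` is still
// ["sync-1", "sync-2"]; the timeout callback runs only afterwards.
```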
  • by The Amazing Fish Boy ( 863897 ) on Monday March 12, 2007 @06:49AM (#18314253) Homepage Journal
    I have actually seen some reports [google.com] of a "new" Googlebot requesting the CSS and Javascript. The rumour I heard was that it was using the Gecko rendering engine or something along those lines. This was some time ago. I'm not sure what ever became of this.

"Floggings will continue until morale improves." -- anonymous flyer being distributed at Exxon USA

Working...