Googlebot and Document.Write 180
With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.
The Results: (Score:5, Informative)
Re:The Results: (Score:5, Informative)
Accessibility? (Score:2, Informative)
DDA / Section 508 / WCAG - the no JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beems for eyes making lawsuits and all.)
Re:Doesn't work; Good (kind of) (Score:3, Informative)
Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.
Re:How does document.write mess up your DOM tree? (Score:5, Informative)
If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.
In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
Google doesn't, but it's possible (Score:3, Informative)
I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth [sitetruth.com]) and extract data, and I've been considering how to deal with JavaScript effectively.
Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages and run Javascript like a browser, builds the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a psuedo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.
Ghostscript [ghostscript.com] had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text out.
OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.
Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.
Re:How did this make the front page? (Score:3, Informative)
Did you know that 99% of all statistics are made up?
I can source some Javascript statistics: W3Schools reports [w3schools.com] that, as of January 2007, 94% of their audience has Javascript turned on, a significantly lower statistic than you are reporting. Not only that, but it is actually the highest percentage since they started recording them binannually in late 2002.
It's a moot point, though: As W3Schools stats page states "You cannot - as a web developer - rely only on statistics. Statistics can often be misleading." Meaning that you should always code things so that they work with HTML/CSS, then use javascript to make it look/act nicer.
Re:Doesn't work; Good (kind of) (Score:4, Informative)
Re:How did this make the front page? (Score:3, Informative)
Re:Google request external JavaScript file? (Score:3, Informative)