
Google Launches Google Sitemaps

Ninwa writes "Google has launched Google Sitemaps. It seems to be a service that allows webmasters to define how often their sites' content is going to change, to give Google a better idea of what to index. It uses some basic XML as the method of submitting a sitemap. More information on the protocol is available in an FAQ. What's most interesting is that Google is licensing the idea under the Attribution/Share Alike Creative Commons license. According to the Google Blog, this is being done '...so that other search engines can do a better job as well. Eventually we hope this will be supported natively in webservers (e.g. Apache, Lotus Notes, IIS).' They even offer an open source client in Python."
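For readers who don't want to dig through the FAQ, a sitemap in the described format is essentially a list of URLs with optional hints about last modification, change frequency, and relative priority. Below is a minimal sketch, using only Python's standard library, of what generating one might look like; the namespace URI, example URLs, and dates are assumptions for illustration rather than anything taken verbatim from Google's documentation.

import xml.etree.ElementTree as ET

# Assumed namespace for the 0.84-era protocol; check the FAQ for the authoritative value.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"

urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
pages = [
    # (location, last modified, change frequency, relative priority)
    ("http://www.example.com/", "2005-06-03", "daily", "1.0"),
    ("http://www.example.com/archive/", "2005-01-01", "monthly", "0.5"),
]
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = priority

# Write the sitemap file to be placed on the site and submitted to the search engine.
ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8")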
  • great interview (Score:5, Informative)

    by professorhojo ( 686761 ) * on Friday June 03, 2005 @10:56AM (#12713975)
    for more crunchy detail, here's a great Q&A interview i found with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps:

    http://blog.searchenginewatch.com/blog/050602-195224 [searchenginewatch.com]
  • by sachmet ( 10423 ) on Friday June 03, 2005 @10:56AM (#12713976)
    Everyone else defines a protocol. But apparently Google defines protocools.

    I guess the rest of the world has a long way to go to catch up...
    • by Pac ( 9516 )
      [To ELP's "Lucky Man"]

      They had white pages
      And hits by the score
      All the people's queries
      Waiting by the door

      Ooooh, what a search engine it was
      Ooooh, what a search engine it was

      Many geeks and hackers
      They made up its core
      Everybody's dearest
      A daily stop for more

      Ooooh, what a search engine it was
      Ooooh, what a search engine it was

      It went to the market
      Of the engines it was king
      Of his honor and his glory
      Slashdot would sing

      Ooooh, what a search engine it was
      Ooooh, what a search engine it was

      A burst had found it
      • I find your lack of faith disturbing.
      • You know, that was pretty clever, but a hint as to how this burst happened would be helpful.

        Especially since Google really does have a great idea here. I know Slashdotters on the whole love Google, and I know there's a bit of a backlash, but for the sake of the integrity of the argument, let's have that backlash be for some legitimate reason, not just because Google's too popular because, well, it really is great.

        D
        • Please, I was just making a quick joke - no predictions, nothing so serious. I love Google dearly (I have even installed the Accelerator for a day or two until it bothered me with that "23 minutes saved" message) and I think it is popular because of its merits and the hard work of its people, not because they got lucky or something. But even great companies can eventually disappear for one reason or another.
  • Cool idea (Score:5, Interesting)

    by aftk2 ( 556992 ) on Friday June 03, 2005 @11:00AM (#12714006) Homepage Journal
    This is a cool idea, because I've often wondered about being able to "talk" to search engines at a slightly higher level than robots.txt allows.

    For example, a website we launched a couple months ago is primarily images. We played nice - all of the images have legitimate alt tags, and we tried to let the site degrade properly in older browsers (although you really wouldn't get much, in those instances).

    But the biggest problem we had was trying to get the site spidered by Google. It would be, and it would appear in the index, but it would be listed far below sites that linked to it. I don't believe Google likes sites that are primarily images. We populated meta tags with descriptions, but they weren't included; we even tried using hidden text - legitimate, hidden text that would serve as the site's description, but not break the design - but you know how Google feels about those sorts of things. We had to walk a fine line. This'll be nicer.
    • Re:Cool idea (Score:2, Interesting)

      I think Google doesn't like NEW sites. I run a high school alumni website, and it was at least 6 months before you could type in the title of the homepage (which was "[Town nobody has heard of] Alumni") into Google and have it listed in the top 10. Once it did start appearing in the top ten, it was still below sites that linked to it. Most of the higher results simply had "Alumni" in them and nothing with the town name. After about 9 months, my site now has the #1 slot for that search string.
      • Re:Cool idea (Score:4, Informative)

        by Eric Giguere ( 42863 ) on Friday June 03, 2005 @11:12AM (#12714118) Homepage Journal

        Quite right, a new site can be listed in the Google index pretty quickly -- it only took a few days for my latest site to be found by the Googlebot -- but it takes a while before any PageRank gets assigned to its pages, especially if there are no inbound links to the site. No PageRank, no top listing...

        Eric
        Currently at #1 for adsense tips [google.com]
      • Re:Cool idea (Score:5, Informative)

        by rehannan ( 98364 ) on Friday June 03, 2005 @11:25AM (#12714229) Homepage
        I just put a new site online. About 4 or 5 days after submitting it to google, it was the number one hit when searching for the title of the site.
        • That's pretty strange, because Google definitely has a sandbox that they keep sites in for 6-8 months.

          Maybe you had few competitors and those competitors (for the search result) were also new.
        • Re:Cool idea (Score:4, Informative)

          by singleantler ( 212067 ) on Friday June 03, 2005 @12:16PM (#12714690) Homepage Journal

          It's quite common to be high up for matching terms for about a week, then disappear for three months or so. This seems to be normal behaviour for new sites; it's nicknamed the Google sandbox [google.com] and appears to have been confirmed by the recently published patent application.

          The sandbox is just an artificial lowering, so if you're a match for a rare term you can still be found quite easily.

        • I just put a new site online. About 4 or 5 days after submitting it to google, it was the number one hit when searching for the title of the site.

          So you're the one who came up with "DISCREET ONLINE PHARMACY" ?? :)

          Seriously though, if there aren't a lot of other sites containing your title, that's easy. If you're one among a dozen or so, not so easy.
        • Re:Cool idea (Score:3, Informative)

          by mgbaron ( 457884 )
          I think I can shed a little light on this situation as I have had both of the above cases happen to me.

          This is how the system works. Google can index your site very quickly (within a couple of days), if you have an incoming link or submit to their crawler. If your site is well keyword optimized for a fairly rare keyword, it is entirely plausible that it would come up number one fairly quickly.

          What takes a long time is for google to update their pagerank index. This is where your site will sit in the Go
      • I'm still waiting for my site, Calum [calum.org] to get indexed. The bots come regularly, but nothing in there. If people could just paste the following link onto their pages, Calum [calum.org], I'm sure everything would be right with Calum [calum.org] and Google. I'm sure Google doesn't hate Calum [calum.org], and that there is just some misunderstanding.
        :)
    • On February 16th I sent Google the following email to suggestions@google.com: Hi,
      This is a suggestion for the people who take care of indexing web sites.
      Because Google is the first search engine of choice, it has enough influence to point noses in the same direction.
      So, I propose a new element to be added to websites: a sitemap file. Similar to the favicon file, every site could have an (XML?) file containing information about the info and the info-topography on the site.
      Google already has a 'si
  • by Anonymous Coward
    Had to say it:

    http://www.fuckedgoogle.com/ [fuckedgoogle.com]
  • Sitemaps abuse? (Score:3, Insightful)

    by iolagnm ( 645827 ) <iolagnm@gmail.PERIODcom minus punct> on Friday June 03, 2005 @11:03AM (#12714033) Homepage
    It will take a company with enough influence like Google to really promote XML sitemaps, which could lead to a great thing... but what is to stop them from becoming like MetaTags where companies will just flood them with useless keywords and entries in an attempt to get better search rankings?
    • I'd really like to see a site-influenced system like this that defines areas of news and areas of non-news. I'm tired of searching for multiple terms and getting main articles devoted to one of the terms and sidebar links to one of the others. For example, [insert notebook model] and Linux.. you might get a site like Slashdot where there's an article about the new notebook and many, many sidebar items about Linux.
      • Re:Sitemaps abuse? (Score:3, Interesting)

        by Jellybob ( 597204 )
        Using XHTML this shouldn't be too hard - something along the lines of:
        <goog:index>
        Stuff that actually matters
        </goog:index>
        Advertising crap which people don't care about.
        It's not going to fix the problem on sites which are doing this deliberately, but for those of us who actually care about getting indexed relevantly it would be great.
    • I've not seen anything to suggest sitemaps will improve your ranking, just get you indexed more often.

      If you claim pages update every day, but they don't, it will be pretty easy for the spider to tell. So the spider could stop the frequent scans if they aren't really needed - if, after say a month, the supposed daily updates never happened.
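      A rough, entirely hypothetical sketch of how a spider might do that check - this is not Google's actual logic, just an illustration of why lying about changefreq buys you little:

      import hashlib
      import urllib.request

      def fingerprint(url):
          # Hash the page body so two fetches can be compared cheaply.
          with urllib.request.urlopen(url) as resp:
              return hashlib.md5(resp.read()).hexdigest()

      def observed_change_fraction(fingerprints):
          # Fraction of consecutive visits on which the content actually changed.
          pairs = list(zip(fingerprints, fingerprints[1:]))
          if not pairs:
              return 0.0
          return sum(a != b for a, b in pairs) / len(pairs)

      # If a page claims changefreq=daily but a month of daily fetches shows
      # almost no changes, the spider can simply fall back to a slower schedule.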

    • Re:Sitemaps abuse? (Score:2, Interesting)

      by drnlm ( 533500 )
      That's really up to the search engine implementation, isn't it?

      Anyway, a brief look at the proposed format gives very little scope for abuse - you can specify location, change frequency, last modified and a priority, and that's it. The priority is specified as only applying to URLs from the same site, so what you can do with it is fairly limited. Overall, it reads as a set of additional hints to spiders crawling the site.
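      For the curious, pulling those four hints back out of a sitemap file is about this much work - a sketch using Python's standard library; the namespace URI is an assumption and would need to match whatever the file actually declares:

      import xml.etree.ElementTree as ET

      # Assumed namespace; adjust to whatever the sitemap file declares.
      NS = "{http://www.google.com/schemas/sitemap/0.84}"

      def sitemap_hints(path):
          hints = []
          for url in ET.parse(path).getroot().findall(NS + "url"):
              entry = {}
              for tag in ("loc", "lastmod", "changefreq", "priority"):
                  element = url.find(NS + tag)
                  entry[tag] = element.text if element is not None else None
              hints.append(entry)
          return hints

      There's simply nowhere in that structure to stuff extra keywords, which is the point being made above.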

    • Re:Sitemaps abuse? (Score:4, Informative)

      by ArbitraryConstant ( 763964 ) on Friday June 03, 2005 @11:54AM (#12714488) Homepage
      Well, I noticed two things about it...

      First, the priority is a relative priority, so if you want to set every page to 1.0 (defined as the highest priority) it'll mean nothing.

      Second, if you lie about update frequency or the date of the last update they'll figure it out pretty quick.

      These aren't commands, they're hints.
  • I would love to see a new meta tag for address to become common. Could make things like Google local even more useful.
  • Ermm this is all well and good and such but isn't a large chunk of this information already made available via Cache-Control and Last-Modified HTTP headers?

    Reminds me of blog pings - what's wrong with using the Referer header? Doing some checking and then fetching the referring page and checking for linkage?

    Has the world gone XML crazy?
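    For comparison, this is roughly the HTTP-level mechanism being described: a crawler sends If-Modified-Since and inspects Last-Modified / Cache-Control on the response. A sketch (the URL and date are made up):

    import urllib.error
    import urllib.request

    req = urllib.request.Request(
        "http://www.example.com/page.html",
        headers={"If-Modified-Since": "Fri, 03 Jun 2005 10:00:00 GMT"},
    )
    try:
        resp = urllib.request.urlopen(req)
        # Page changed since the last crawl; headers say when and how cacheable it is.
        print(resp.headers.get("Last-Modified"), resp.headers.get("Cache-Control"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("not modified since the last crawl")
        else:
            raise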
    • Actually, no, I don't think it has. Precisely as you observe, only a large chunk is available. Now the fact that the vanilla aspects you mention can already be achieved is not a good enough reason to avoid implementing some kind of value-added extensible version of anything that is useful. This is the net evolving to serve humans better, right in front of us.

      Just think of this sort of thing as inter-linking web services sitting on top of the http protocol.

      Justin.

      • But as it stands this XML Sitemap index doesn't provide any new information that HTTP headers don't (assuming dynamic pages handle them well), except for the priority weighting... which should be derived from update frequency.

        I don't see how centralising all this header information serves webmasters better. Only Google.
    • Cache-Control only works on a per-request basis, and Last-Modified only works if you decide to check again. They're designed for clients like web browsers, where you only care about whether there have been changes when the user is checking on the site; they're not good for trying to schedule spidering, because many things specify "no-cache" (if the user wants to look at the page, just get a new one) and doing HEAD requests on the whole web for the Last-Modified dates is going to be slow.
      • "Cache-Control only works on a per-request basis"

        I believe proxies cache the headers as well, unless must-revalidate is specified, in which case they must do an If-Modified-Since or similar request which will return fresh headers. How is it not Google's responsibility to remember when to crawl your page anyway? That's exactly what they intend to do.

        "They're designed for clients like web browsers, where you only care about whether there have been changes when the user is checking on the site"

        Why is Google any d
  • by stlhawkeye ( 868951 ) on Friday June 03, 2005 @11:07AM (#12714067) Homepage Journal
    I envision the interior of Google as this huge warehouse full of oversized transistors, data streams with paddleboats, waterfalls of caffeinated beer, chairs contoured like a keyboard key, where diminutive men in green hair sing songs about electrons and logic gates and if you wander into the room where Duke Nukem 3D is being tested you'll be thrown out.
    • if you wander into the room where Duke Nukem 3D is being tested you'll be thrown out.

      I think you mean Duke Nukem Forever.
      Duke Nukem 3D is nearly ten years old, I remember playing it at high school on my Pentium-100 laptop.

    • I never saw any paddleboats, but they did have a keg of beer outside the cafe yesterday. And there's no shortage of caffeinated drinks in the mini-kitchens.

      I can neither confirm nor deny the existence of any secret video game testing rooms.

      -B

    • Yeah, you've mostly got the description of the public tour, but if you step off the boat and go searching around, you'll find a room with this 3-story-tall slug, spewing out search results from its backside! It's a disturbing sight, but I still can't get myself to stop using Google!
  • by broward ( 416376 )
    It's not surprising that Google is using a Creative Commons license. The meme has been steadily gaining strength for over a year.

    http://www.realmeme.com/miner/preinflection.php?startup=/miner/preinflection/creativecommonscontentDejanews.png [realmeme.com]
  • This sounds like a really cool idea.

    Livejournal.com has had a number of problems with Google, and often just plain outright bans them from spidering the site. Part of the problem is that all the registered users have their journals at journalname.livejournal.com as well as livejournal.com/users/journalname. This means indexing the journals for registered users doubles the load on their server farm!

    With something like this, livejournal would be able to define exactly how often the indexing process occurs,
  • It's too bad they couldn't figure out a way to add additional keywords to robots.txt (w/o breaking it). Now one needs to create both files for a site to index properly.
    • Google wants this sitemap functionality to make it into the web server itself. So, it looks like they're opting for the long-term solution.

  • by yotto ( 590067 ) on Friday June 03, 2005 @11:11AM (#12714108) Homepage
    In other news, the Google Evil Index went down 3.2 points today, and is currently at 13.8, the lowest it's been since right before the beta rollout of Google Web Accelerator.
  • Somebody's using Lotus Notes as a webserver? May God have mercy on their souls.

    (The submitter probably meant Lotus Domino, which is still a bad webserver, but not nearly as bad as Notes would be.)
    • They may really have meant Notes; this has been seen in the wild... the person I knew who worked there lamented it quite a lot. But it is done... sad to say.
  • If people use this, it will likely remove much redundancy from google's indexing processes, possibly freeing up bandwidth and processing power in their datacenters for other projects like more web-based applications...
  • by 823723423 ( 826403 ) on Friday June 03, 2005 @11:22AM (#12714201)
    Navigation is sometimes the hardest part of the internet. A tree structure is often the second easiest way of searching/browsing for information (the first being keyword searching). So maybe if more web designers set up server-side solutions, it will lower the burden on web designers - and more importantly, move navigation away from web designers to users, just as Google moved content away from web designers onto searchers. Instead of overburdening web servers the way this Firefox extension with screenshot [extensionsmirror.nl] does by automatically generating a sitemap through crawling a site, sites could expose a sitemap via a favicon.ico-like convention or a link rel="sitemap.rdf" / "sitemap.xml" protocol, just as Netscape Navigator originally proposed a while back. I think web designers should pay attention - at least those that don't use Flash for their whole site. The web is slowly becoming a database of content rather than style. See the Webmonkey/Wired article on the Netscape sitemap feature Sitemap rdf [wired.com] or the sitemap slide here Slide from seminar [ukoln.ac.uk]
  • "Google is licensing the idea under the Attribution/Share Alike Creative Commons license. "

    And I'm willing to license my idea, "better search engines with better user interfaces", to Google, for a modest sum.
  • I'm wondering: why do you need a license to implement this? Did Google patent this?

    In any case, patented or not, the CC license that this falls under seems acceptable for an open standard, even if it is patented, because it is transferable and because its requirements are minimal. Contrast this with the Microsoft Office XML license, which is royalty-free (for now...), but non-transferable.
  • It needs Python 2.2, and I only have 1.5 running. Unfortunately, so many things depend on it (*cough* Ensim *cough*) that attempting to upgrade is a death wish.

    Will wait until I get my new server. :)
  • An idea cannot be copyrighted, and thus cannot be licensed under a copyright license like Creative Commons. File formats, being facts, shouldn't be copyrightable either. If the text of the spec is licensed as Attribution-ShareAlike, then all this allows is people to fork the spec, causing confusion.
  • The thing that seems so cool about this sort of thing is that it opens up the search service to the rest of us to help us make our content easier to find when it is updated. One thing that I have come to really respect about Google is that they don't rely on the government to beat Microsoft back down the way Netscape did. Google has managed to make a product that 47% of the US Internet users want to use, even though MSN is the default in IE. Remember Netscape 4? There's a reason that bloated POS failed, any

  • any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines.

    If you have a bunch of data in a MySQL database, ordinarily Google can't find it. You have to create a static link somewhere with a URL for the search you want to make googlable. Those take maintenance.

    There may be some sites that want certain areas crawled, but not others, and those areas aren't maintained by the webmaster or only the top-level part should be h

  • Someone alerted me to google watch [google-watch.org] the other day. It's definitely an interesting take on the company, I have to say.

    You do have to wonder how much of the 'do no evil' philosophy is cover for the "let us store and index all information about everything, including you" philosophy. Not that I'm going to stop using Google until their results become less usable than Yahoo's results...

  • by md17 ( 68506 ) <james@@@jamesward...org> on Friday June 03, 2005 @12:21PM (#12714733) Homepage
    OMG!!! We finally /.'d Google!
  • Instead of having to notify search engines (blech), what about a robots.txt extension to define the location of the sitemap index?
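    For what it's worth, such an extension would be trivial for a crawler to consume. A sketch, assuming a hypothetical "Sitemap:" line in robots.txt - the directive is the parent's proposal, not part of the current protocol:

    import urllib.request

    def sitemaps_from_robots(site):
        # e.g. site = "http://www.example.com"
        robots = urllib.request.urlopen(site + "/robots.txt").read().decode("utf-8", "replace")
        return [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]

    # A robots.txt containing the line
    #   Sitemap: http://www.example.com/sitemap.xml
    # would yield ["http://www.example.com/sitemap.xml"].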
  • by neves ( 324086 ) on Friday June 03, 2005 @01:00PM (#12715378) Homepage
    My RSS feeds already publish my newest/freshest pages. Why didn't they just extend that with some additional attributes/tags instead of forcing me to implement another XML format?
  • I see an execution gap here, though. My blog is, what, 2600 pages? I'm obviously not going to build that XML file manually (with one node for each page). Google does provide a Sitemap Generator, but it's Python code meant to be run on my web server. My Python skills are nil, so that route isn't viable for me either. I expect that there are a good many 'webmasters' (as in, people who design and run websites) who don't know Python from Perl. Given the CC license, though, maybe somebody will grab the code and b
  • by e**(i pi)-1 ( 462311 ) on Friday June 03, 2005 @02:47PM (#12716546) Homepage Journal
    I had been writing a primitive sitemap generator myself with shell scripts,
    essentially using "find" and "grep" alone, but this tool is much better,
    faster and easier to configure. Cool.

    Note that this tool will allow google to reach files which never would be
    found by spidering a site, because the files are not linked. If you
    include something like

    <directory path="/var/www/html" url="http://www.example.com/" />

    in your config.xml and run "sitemap_gen.py" on it, you will give the world
    access to a large amount of material
    (like test versions of your website or source code you did not want to
    make accessible). We might see a lot more material which had been
    'hidden'.
