geekfiend writes "Today Google updated their website to indicate over eight billion pages crawled, cached and indexed. They've also added an entry to their blog explaining that they still have tons of work to do."
Personally I find that the lack of relevant pages is the biggest problem with search engines, not the lack of pages with information. It seems I always find what I'm looking for eventually; what I need improved is the time I spend looking through spam-bomb pages before I find a page with the correct information.
These spam-pages seem to be increasing; I mean those pages with just a bunch of keywords or the output of some search system.
I wonder if it'll take longer to index twice as many pages? Or if they, along with this change, improved their spider and/or added hardware. Otherwise I'm not sure this change is for the better, unless you like to search for really obscure topics.
Does every minor Google or Apple related thing deserve a Slashdot story? Can Slashdot create a "Fanboy" section for insignificant stories advocating Google (with their software patent) and Apple (with their iTunes DRM)? That way I could filter them out more easily.
Google needs to stop obsessing about the number of indexed pages, and start concentrating on the quality. Since pagerank was switched off, 2 out of 5 searches now seem to be jammed with pages full of nothing but random words and adverts. It's even more galling when the adverts are Google Ads. Much as I love Google, they're becoming increasingly less effective as a tool.
Does this mean that I've been missing a huge amount of important information until now? I'd just assumed that Google covered the entire relevant web, but now it seems to cover the whole same amount again. My Google alerts [googlealert.com] also seem to have started producing a lot more results, which suggests that a lot of these new pages are rated quite highly. Who knows how much more quality content on the web we're just not seeing?
To paraphrase Churchill, Google is the worst system devised by the wit of man, except for all the others. Where else would you go? Yahoo? Hey, how about AltaVista?
The problems faced by Google in their battle against the scumbags who would game the system are faced by every other search engine. Google, IMHO, handles them better.
I don't quite believe that Google would've limited themselves that way (using 32 bit identifiers for documents) - that would've been incredibly short-sighted.
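For what it's worth, the arithmetic behind the 32-bit theory is easy to check. A rough sketch, assuming one unsigned 32-bit ID per document (that is the rumor being discussed here, not anything Google has confirmed); the eight billion figure is the one from the story:

```python
# Assumption: one unsigned 32-bit integer ID per indexed document.
# This is the rumored limit under discussion, not a confirmed Google design.
MAX_32BIT_IDS = 2 ** 32           # 4,294,967,296 distinct IDs
NEW_INDEX_SIZE = 8_000_000_000    # "over eight billion pages" per the story

def fits_in_32_bits(doc_count: int) -> bool:
    """Return True if doc_count documents can each get a unique 32-bit ID."""
    return doc_count <= MAX_32BIT_IDS

print(MAX_32BIT_IDS)
print(fits_in_32_bits(NEW_INDEX_SIZE))
```

Whatever the old index used, eight billion pages cannot each get a unique plain 32-bit ID.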
I'm especially irritated by the increasing number of highly-ranked pages that are nothing more than another search engine's results. If Google could find some way to identify and remove these from my result set, Google's usefulness to me would increase 10 times over.
It isn't about having a better search engine, so much as it is knowing how to use it. If you are looking for information on a recipe for oriental rice using asian spice, how would you search?
Bad search example:
oriental rice recipe asian spice
Good search example:
recipe+"oriental rice"+spice
See the difference? Google tries its best to get rid of the spam pages, but it won't ever combat them all. Half of the work has to be done by you: understanding the best way to describe to the search engine what it is you want. The better you explain it, the better it can search for you.
You can rant all you want, but Google still has a fair use right to your images. They are reduced resolution images and therefore legal for non-commercial use.
Not to mention robots.txt, but that is so obvious it shouldn't need to be mentioned.
I am feeding this troll because there are people who really _do_ think like that and I wish I could yell at them to their faces:)
You put content in a place where it is publicly accessible. You explicitly and proactively made that content available to everyone, including 'the average surfer' and googlebots. You took no steps to make it available only to the select few of whom you approve.
Now you are all cross and bothered because average surfers / googlebots have read / copied your content, such as it is.
The solution is to drown yourself in a bucket. I have a bucket.
What I've read on the Google help pages seems to indicate that they don't index punctuation or capitalization. When you search for something, your string is looked for within an existing index, and appropriate reference materials are shown. Including punctuation wouldn't result in any hits within their index, meaning no results.
Now, obviously, it is theoretically possible to do just about anything. But in this case, with the architecture they have in place, anyone ever doing what you're asking would require a full-text search through their multi-TB dataset, which I suspect is highly impractical.
My point is that, as I understand it, Google has coded a number of shortcut tricks which allow reasonable search times, and full-text string-exact searching would prevent them from using those shortcuts, resulting in search times they don't seem to think are reasonable.
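To make that concrete, here is a toy inverted index in Python. It is only a sketch of the general technique, assuming a tokenizer that lowercases and drops punctuation; the names (tokenize, build_index, search) are made up for illustration and say nothing about Google's actual code:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Made-up sample documents.
docs = {
    1: "Oriental rice recipe, with spice!",
    2: "RICE: a history",
    3: "C++ tips and tricks",
}
index = build_index(docs)
print(search(index, "rice"))
print(search(index, "Rice!!!"))
print(search(index, "c++"))
```

Punctuation and case are gone before anything reaches the index, so "Rice!!!" and "rice" are the same lookup and "c++" is just "c". Matching the punctuation exactly would mean going back to the original text of every candidate page.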
Robots.txt isn't something that only applies to Google. It is (supposed to be) honoured by all search engines, and it follows the Robots Exclusion Standard. So, when you claim these are Google's arbitrary rules, you are in fact wrong. They are neither Google's nor arbitrary (at least no more than any web standard).
So your "clue" is not much of a clue, as robots.txt doesn't fit your description.
As for why you should know about it, you are putting up a web site, it is part of running a web site. You might as well complain why you need to know about HTML, CSS or registering a domain name. Quite what coming from the UK has to do with it (something I also do), I have no idea.
"I simply do not want the average surfer to be able to visit my site, I am not interested in serving my pages to them, they simply would not appreciate or understand what it is I am showing."
Then a publicly accessible website is the wrong place. It is not your personal space, and it isn't private. You made it available to the world; nobody made you. To turn around and complain when (some of) the world visits it is hypocrisy.
It's like putting up posters around a town, then running around complaining that all these people are looking at them, won't appreciate them, and you don't want them to. It also comes across as condescending and arrogant, which probably explains the nastiness of some of the responses.
You opted in when you put up the publicly accessible website. If all search engines had to be opt-in, nobody could find anything on the web, and it would lose a lot of its utility. You're assumed to want them crawling because the vast majority of people do; they want their site to be found. If you don't, though, no problem: just use the standards for stopping searches, or password protect the site. No scandal at all, just hysterics.
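For anyone who hasn't seen one, a robots.txt is only a few lines. Below is a sketch using Python's standard urllib.robotparser to show how a well-behaved crawler would read it; the rules, the /private/ path, the bot name SomeOtherBot and the example.com URLs are all invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out entirely,
# keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Ask whether the given crawler may fetch the given URL."""
    return parser.can_fetch(agent, url)

print(allowed("Googlebot", "http://example.com/index.html"))
print(allowed("SomeOtherBot", "http://example.com/index.html"))
print(allowed("SomeOtherBot", "http://example.com/private/pics.html"))
```

Note this only works for crawlers that choose to honour the file. Anything that really must stay private still needs a password.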
Showing the low res thumbnail of your image isn't violating your copyright either. The only legitimate claim you have is the amount of time it took to remove something from the cache.
The "thieves" accusation is even more ridiculous. If you put something up on the web that people can see for free, you can't complain. There are options if you want to protect it. Google doesn't claim your work as theirs (which would be 'stealing' or at least copyright violation); they help people find your public web site.
If you don't want a public website but made one, whose fault is that? If you are going to run a website and can't be bothered to find out how to do it properly, you can't blame Google.
How about a NEAR operator? Sure, AND OR NOT are nice, but my results would be a lot more relevant if I could eliminate results where the search terms appeared a thousand words apart.
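A NEAR check is not much code if the engine keeps word positions. A rough Python sketch, assuming a simple tokenizer and a hypothetical helper called near(); a real engine would presumably do this against a positional index rather than raw text:

```python
import re

def positions(text, word):
    """Return the word offsets at which `word` occurs in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == word.lower()]

def near(text, word_a, word_b, max_distance=10):
    """True if word_a and word_b occur within max_distance words of each other."""
    pos_a = positions(text, word_a)
    pos_b = positions(text, word_b)
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

# Made-up sample texts: one with the terms close, one with them far apart.
close = "a recipe for oriental rice with asian spice"
far = "rice " + "filler " * 1000 + "spice"

print(near(close, "rice", "spice"))
print(near(far, "rice", "spice"))
```

A plain AND would match both texts. The distance limit is what throws out the page where the two terms sit a thousand words apart.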
MSN's "msnbot" has been crawling/spidering my webserver (which runs Geeklog and is just another blog of my random crap) pretty extensively for weeks now (like 5 times a day, it seems). Searching on Google for my site's name now reveals more results from my site, but not a lot of those circle-jerk style search results pages that are just trying to generate some ad revenue. However, using the beta.search.msn.com site DOES yield a lot more random crap (mostly blogs and personal webservers) that somehow generated some kind of link to my site because of the title of one of my articles, someone linking to my site in one of their blog posts, etc.
I have a feeling MSN's new search site is gonna be mostly blogs and advertisements, not relevant information. I think it's good Google has indexed more pages, but I still believe their algorithm will continue to provide more USEFUL results than MSN. (BTW, the googlebot doesn't hit my site too frequently which tells me Google's bot understands that my site isn't updated too frequently, nor is it linked to from other important sites)
More pages v.s more relevant pages (Score:5, Insightful)
Do this affect how fresh their index will be? (Score:4, Insightful)
Google makes minor change to website - news at 11! (Score:3, Insightful)
Quality - not quantity (Score:3, Insightful)
Re:This is news ? (Score:3, Insightful)
Yeah, but it'd be news if the sun set twice in one night or rose twice as bright.
It's more the exponential increase in the size of the index rather than the piecemeal addition.
Makes you wonder... (Score:5, Insightful)
Re:Quality - not quantity (Score:2, Insightful)
Re:What is new about this. (Score:3, Insightful)
Re:More pages v.s more relevant pages (Score:5, Insightful)
Does this mean...? (Score:4, Insightful)
Re:What? (Score:2, Insightful)
Re:Google thieves my bandwidth (Score:2, Insightful)
Re:Google domination. (Score:3, Insightful)
I think it is more that many users of IE just do not twig that their failed page access resulted in an automatic query to MSN.
In reality, most users make occasional deliberate queries to Google and more frequent accidental queries to MSN.
So, to sum up... (Score:5, Insightful)
Proximity search will help (Score:3, Insightful)
This is why I've been begging the Google folks to implement a NEAR [pandia.com] operator!
Here is an example msn search: http://search.msn.com/results.aspx?FORM=SMCRT&q=f
Re:What is new about this. (Score:3, Insightful)
It now no longer shows microsoft.com as top hit.
Haha, I guess the joke reached MS headquarters.
Re:More pages v.s more relevant pages (Score:5, Insightful)
Re:Read more carefully. (Score:5, Insightful)
Re:More pages v.s more relevant pages (Score:3, Insightful)
Re:Searching LiveJournal.com (Score:2, Insightful)