Millions of Pages Google Hijacked using ODP Feed
The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
Robot.txt (Score:3, Insightful)
Law of the Internet (Score:5, Insightful)
Why? (Score:2, Insightful)
"Oh! Look! Something beautiful! Something impressive! I must destroy it!"
pah. feeling jaded today, i guess.
Do what I'm going to do... (Score:4, Insightful)
Web presence pressure (Score:5, Insightful)
Re:Ugh. This is so not true. (Score:2, Insightful)
Re:Robot.txt (Score:5, Insightful)
Why? It can be completely automated. A million is no harder than four.
Re:Easy to prosecute, hmmm? (Score:5, Insightful)
Re:Why? (Score:2, Insightful)
The people using this exploit to get fake listings (just like all of the spam pages we see in search engines) aren't doing it for the fun of it.
Search engines should devalue redirects (Score:5, Insightful)
It will also break many "click trackers", "portals", "directory sites", "search engine optimizers", and other annoyances, which is probably a plus for Google users. You know, those sites where you click on some phrase in Google and, three redirects later, you're at some irrelevant porno site.
Doesn't seem like the end of the world (Score:3, Insightful)
Not true I think (Score:1, Insightful)
I'm not convinced by this whole 302 nonsense. I haven't seen a single search where a 302 scraper site is ranking above the site it 302s for the scammed text.
To me it sounds like people's sites drop for whatever reason, then they look for a reason and they grasp at this 302 story.
I do an allinurl on my various sites (8 of them) and 6 have scrapers attached, only 1 has disappeared recently and that seems to have been caused by a change of IP address or maybe the loss of the yahoo directory link or perhaps because I have lots of pages with 20-30% similar content.
But if I only had 1 site I could easily blame a 302 problem.
Re:RTFA (Score:5, Insightful)
The article is confused and badly written. It never explains the exploit being used. So stop dumping on people. It is not at all surprising that people don't get what is going on when the description is crud.
What is really going on has nothing to do with 302, or at least very little. What these people are doing is to set up fake web sites using content filched from genuine Web sites. This allows (or is believed to allow) them to climb the google rankings.
I don't see why someone would use a 302 response when they can just copy the entire content unless there is some sort of bug in Google's pagerank that is not being explained. Copying the entire content is much simpler.
So what the attacker does is to set up their site so that when the googlebot comes round it publishes some legitimate content, then when other folk follow the site from a google search they get pages infested with spyware or the like.
This would certainly explain the number of times I have done a Google search and ended up at an idiotic 'search site' that does nothing for me.
treat redirects as one-link pages (Score:3, Insightful)
It seems that when page A redirects to B, Google not only considers that a hit for A, but also assigns B's content to A (I just skimmed through all the posts here so maybe that's not what happens).
In that case, it seems to make more sense to just ignore A altogether since the hit and content rightfully belong to B.
This could be done by treating redirects as empty one-link pages, thus unifying the handlers and defeating this practice.
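A minimal sketch of that idea, assuming a hypothetical `fetch(url)` callable that returns `(status, location, body)`; this is an illustration of the suggestion, not how Googlebot actually works. The redirecting URL is stored as an empty page whose only link is the redirect target, so the content is only ever credited to the URL that serves it.

```python
def crawl(fetch, urls):
    """Index pages, treating any redirect as an empty page with one link.

    `fetch` is a hypothetical callable: fetch(url) -> (status, location, body).
    """
    content = {}   # url -> body text credited to that url
    outlinks = {}  # url -> urls it links (or redirects) to
    queue = list(urls)
    seen = set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        status, location, body = fetch(url)
        if status in (301, 302) and location:
            # The redirecting URL gets no content of its own, only a link.
            content[url] = ""
            outlinks[url] = [location]
            queue.append(location)
        elif status == 200:
            content[url] = body
            outlinks.setdefault(url, [])
    return content, outlinks
```

With this handling, page A in the example above ends up with an empty body and a single link to B, and only B is indexed for the text.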
Re:The super-slashdotting (Score:3, Insightful)
Re:Ugh. This is so not true. (Score:2, Insightful)
Re:Not a surprise (Score:5, Insightful)
Re:Ugh. This is so not true. (Score:5, Insightful)
Well shucks GG, not every webmaster is glued to WMW and other forums.. and even if they did the signal/noise ratio on this topic is so low that you probably couldn't find the information even if you were looking. It's hardly an obvious reporting mechanism. Although posting it on /. should help some, so that's appreciated. Thanks.
But look - what we have here are a whole bunch of webmasters who have been nuked off the face of the earth by 302 redirects and just don't have the technical knowledge to try and fix it. Mom and Pop stores, hobbyists, nonprofits etc etc. These people are just gonna get pasted.. they'll just be wondering why they don't get any visitors any more.
This is a HUGELY serious problem - and it's getting worse all the time as more and more people deliberately try to exploit the 302 bug. I've been hit by this bug myself, and let me tell you that unless you know EXACTLY what to look for you'd be stuffed - all you'd see is your traffic flatlining.
The key issue here - and it's the kind of issue that will really, really hit the headlines when it's exploited - is redirection. Sure, I can use a 302 and send Googlebot to the correct page.. so first of all I basically 0wn the content of that page, not the publisher. *Then* I insert an exploit into the 302 redirect.. and hey presto, I've 0wned hundreds of thousands if not millions of computers. *That's* going to make unpleasant reading for Google when it hits the headlines - "Use Google and Get Owned". Nasty.
Go Phish (Score:2, Insightful)
Imagine someone searching for their bank's website on Google (because some think that [searching] is how the web works!) and clicking the wrong link. That link takes them to a site that looks just like their bank's website, and maybe there is a security alert on the front page asking them to verify their information. After doing so, they could be redirected to their real bank's site, never having realized their error.
Experience has shown me that most non-techies know they type an address into their browser, but after that, they pay no attention to it which makes this a real possibility.
Mod parent up. (Score:3, Insightful)
Re:fraud, copyright, phishing, decency laws (Score:3, Insightful)
prosecution can't fix this problem.
Re:302 (Score:3, Insightful)
Re:Ugh. This is so not true. (Score:2, Insightful)
Google/GoogleGuy isn't being evil, just seemingly suffering from ignorance [yahoo.com] and/or apathy [yahoo.com].
That said, I'm reminded of a quote I heard once: "The only thing necessary for the triumph of evil is for good men to do nothing". Please stop doing nothing, Google.
Re:Kindly extract your head from wherever it is (Score:1, Insightful)
Obviously you did not google for it.
This is an idiotic response. Why on earth do people mod stuff like this up? Who in the hell is going to google for "canonicalpage"??? That is the solution you moron. Let me see you search for and find the solution without entering the term for the solution itself.
You are a moron, and whoever modded you up is even stupider than you.
Re:Exactly. (Score:4, Insightful)
A "Don't show me any results from this subnet + domain from now on" feature would be nice, as would google banning some of the worst offenders (which it seems to have done).
Re:Ugh. This is so not true. (Score:3, Insightful)
It's EXTREMELY informative, because it tells you what Google's official position is. Whether you like it or not, you need to know that. "Informative" doesn't mean "good".
If Bill Gates posted here in defence of some MS policy, it would hopefully similarly be modded "informative".
OK, I'll bite ... (Score:4, Insightful)
However, if this is Google's PR method, I think you are kind of asking for it! In the absence of information, the internet community will speculate until the cows come home. I'm not saying it's right, I'm just saying that's reality. Even though I said on my site that I thought Google didn't do anything underhanded I bet a lot of people were still not convinced. Google can do a little better than this, and although you have been fairly nice to me (thanks) this response is a little flamebaity for PR. Please understand that I mean no offense, it's just constructive criticism. Even if everything you say is true, a representative of the company should always at least attempt to sugar coat something like your last paragraph.
Also, on a more personal note, maybe Google should embrace the people that are involved [clsc.net] in researching [gregduffy.com] these problems instead of using this broken communications policy. I know that in my case I contacted you guys 5 *months* ago about the Google Print problem I described and never got any followup except for my t-shirt (which I really like). I have some great ideas about possible solutions to the problem I described, and as far as I can see Google has not fixed the root of the problem. When are you guys going to contact me?
-Greg Duffy
Simple Answer (Score:5, Insightful)
In all other cases treat a 302 (temporary) as a 301 (permanent) redirect, thus giving credit for the content to the actual hoster of the content.
This allows webmasters to continue using 302s to set up logical URLs that mask the organization of the underlying content, but eliminates the ability to hijack completely.
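A rough sketch of such a rule, assuming the special case is a 302 that stays on the same host (an assumption borrowed from the "same domain" suggestion made elsewhere in this thread). The function name `credited_url` is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def credited_url(requested_url, status, location=None):
    """Return the URL that should get credit for the content served."""
    if status not in (301, 302) or not location:
        return requested_url
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(requested_url, location)
    if status == 301:
        return target
    same_host = urlsplit(requested_url).hostname == urlsplit(target).hostname
    # Assumed special case: a same-host 302 keeps the "logical" URL.
    # Every other 302 is treated like a 301, crediting the actual host.
    return requested_url if same_host else target
```

A site's own 302 from a tidy URL to an ugly internal one keeps the tidy URL, while a 302 from some other host to your page credits your page.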
Re:Web presence pressure (Score:3, Insightful)
Re:OK, an example (Score:3, Insightful)
Re:No, it's not about redirecting the user... (Score:1, Insightful)
This may not be a problem because the PageRank shown in the toolbar is generally not the real PageRank Google uses to determine its positions.
These techniques may be nothing more than a placebo, and there's probably a few Google employees who get a good laugh out of webmasters using such techniques.
I don't get it... (Score:3, Insightful)
It's pretty simple; 302 redirects allow bad guys to exploit Google.
It doesn't matter that it's the wrong way to use a 302 redirect. They are the BAD GUYS. Remember the "spammers lie" truism?
It's the Google rule that is broken. 302 should be treated as "can't find site" in their search rankings rather than assuming the data sent by the web server is honest. It sucks that some legit users of 302 won't get ranked as well because of it, but boo hoo. Let anybody that has hardware or software problems get better equipment in the first place if their freaking world ends when they don't get ranked in their keyword group. I have NO SYMPATHY for someone that shoestrings their vital revenue stream infrastructure and then wonders why things go bad. It reminds me of my job too much.
Buy Google ADs if you need to make money off your site traffic.
Google will change the rule or they won't. If they want to stay relevant, they'd better. I find myself getting irritated with Google's crappy search results a lot nowadays; sooner or later I will find one of the little startups to use and they can kiss off if it keeps up. So I figure they will get to it. They are Google, they are good at what they do.
Now what I think they should do is download snippets of pages via the Google toolbar, which then sends the data to Google, to make a massively distributed bot-net spider that is indistinguishable from the web-using masses. At that point, as far as exploiting Google via IP of the bot or user agent of the bot IT IS ALL OVER.
Move along, nothing to see here but a bunch of people that don't understand redirect and HTTP protocols.
Re:Robot.txt (Score:1, Insightful)
Re:Ugh. This is so not true. (Score:4, Insightful)
As an alternative, I'd love a cookie based version of this where you could click "ignore all results from this domain". After a couple of weeks you'd get rid of most of them on your personal browser. Make the lists sharable even. All the pagerank wannabes can do is start from scratch with new URLs.
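For what it's worth, the client-side half of that is simple. A sketch, assuming the ignore list is just a set of domain names (however it gets stored or shared) and that results arrive as a list of URLs; `filter_results` is a made-up name, not an existing Google feature.

```python
from urllib.parse import urlsplit


def filter_results(result_urls, ignored_domains):
    """Drop results whose host is an ignored domain or one of its subdomains."""
    ignored = {d.lower() for d in ignored_domains}
    kept = []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in ignored):
            continue  # user asked never to see this domain again
        kept.append(url)
    return kept
```

Sharing a list would then just mean merging two sets of domain names.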
Re:Won't work: Robots don't send the referrer (Score:2, Insightful)
More precisely, googlebot always sends the same referrer. Here's a snippet from an apache access log.
In practice, a static referrer and no referrer amount to the same thing so you're right from a practical standpoint. The referrer is not useful.
But that's OK because the system I described does not depend on the referrer header. If a referrer header is available, it will use it as a shortcut to determine whether the client was referred by an internal link and potentially bypass the whole redirect process. This saves system and network use for the majority of cases when the client is an ordinary web browser, but it's not essential and clearly won't be useful when the client is googlebot (or some other robot that does not provide a referrer).
If the client is a googlebot, the filter will see that there's no referrer. It will then check its stateful cache to determine if it has seen this robot recently. If so, it will let the robot right through and the request will be processed normally. If not, it will issue the slightly obfuscated 301 redirect. When the robot follows this redirect, the filter will be invoked again. This time, it will recognize the robot from its previous visit and will let it through.
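The flow described above can be sketched roughly as follows. Everything here is an assumption for illustration: the class and method names, keying the cache on IP plus user agent, the one-hour expiry, and redirecting to the plain canonical URL (the obfuscation step is left out because its details aren't given here).

```python
import time
from urllib.parse import urlsplit


class RedirectFilter:
    def __init__(self, site_host, ttl_seconds=3600, clock=time.monotonic):
        self.site_host = site_host
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}  # (ip, user agent) -> time of last request

    def handle(self, client_ip, user_agent, path, referrer=None):
        """Return ("pass", None) or ("redirect", url) for one request."""
        # Shortcut: an internal referrer means an ordinary browser click.
        if referrer and urlsplit(referrer).hostname == self.site_host:
            return ("pass", None)
        key = (client_ip, user_agent)
        now = self.clock()
        previous = self.last_seen.get(key)
        self.last_seen[key] = now
        if previous is not None and now - previous <= self.ttl:
            # Seen recently: this client already followed our redirect.
            return ("pass", None)
        # Unknown client: answer with a 301 to the canonical URL.
        return ("redirect", f"http://{self.site_host}{path}")
```

The first request from a robot gets the 301; when it follows the redirect, the second request finds the cache entry and is served normally.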
Re:Re-re-explained (Score:3, Insightful)
well, a bunch of people have suggested that 302s should only be honored by crawlers if the domain is the same. i think that's a pretty good idea.
It's not Google that's broken--it's the web. It's just that the two-legged weasels are only now starting to pry open the cracks.
why do you say that? how is the web broken because of the way google crawls it? the http standard was designed before googlebots were crawling it. long long before. the googlebot needs to be more intelligent is all.
Re:Actually an example has been posted (Score:2, Insightful)
Thanks for mentioning this search; it's a good point. We've already made some changes to improve our heuristics, and you can see that improvement in the fact that current urls look better than the supplemental urls.
Re:Re-re-explained (Score:2, Insightful)
In a sense, of course, there's little google can do to prevent this, because even if they weighted 302-redirects lower in their "throw out duplicates" stage, I could always just go snag a copy of your website each time googlebot visits, in essence doing the redirection myself.
However, doing it through 302 redirects means that google pays for the bandwidth to go get your page, not me.
Ah, but doing it through a 302 also means that the target site can't notice you making regular hits to it and block your IP address.
There's also perhaps a legal distinction. Actively copying someone else's site without permission is pretty clearly copyright infringement. Just 302ing to it most likely isn't.