Google 302 Exploit Knocks Sites Out
clsc writes "The exploit: Redirect via 302 to another page of your choice, then watch as the URL of your redirect script replaces the URL of that carefully selected page in Google's search results. Once this happens, feel free to redirect any visitor that is not Googlebot to any other page of your choice. This also applies to other search engines (not Yahoo!, though)."
Re:everybody uses 302 (Score:5, Informative)
You use 302 to hijack someone else's page in Google's search results. Your bogus ad infested page shows up instead of the actual content the user was searching for (and thought they were going to see), while the real website that you hijacked doesn't get any more Google traffic. That's the exploit.
Dumbass.
No 302? (Score:2, Informative)
Create a page that, when accessed by Googlebot, opens its own HTTP connection to a different, highly ranked page and returns that page's contents to Googlebot, but returns your own contents to everyone other than Googlebot.
Ooops - no 302 needed? Houst^H^HGoogle, we have a problem.
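For the curious, here is a rough sketch of that cloaking idea as a tiny Python WSGI app; the target URL, port, and placeholder content are made-up values, not anything from the comment:

    #!/usr/bin/env python3
    # Sketch of user-agent cloaking: requests that look like Googlebot get a
    # relayed copy of a high-ranked page, everyone else gets our own content.
    from urllib.request import urlopen
    from wsgiref.simple_server import make_server

    TARGET = "http://www.example.com/highly-ranked-page"  # hypothetical victim page

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" in ua:
            body = urlopen(TARGET).read()   # fetch the victim page and relay it
        else:
            body = b"<html><body>our own content</body></html>"
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]

    make_server("", 8080, app).serve_forever()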
Re:WTF (Score:5, Informative)
*it allows a hijacking website to replace pages belonging to target websites in the Search Engine Results Pages*
that's what it does. think about it for a while. sure they could have protection, but at this time it seems they DO NOT.
*What does it look like?
The Search Engine Results Pages ("SERPs") will look just like normal results to the searcher when a page hijack has occurred. On the other hand, to a webmaster who knows where one of his pages used to be listed, it will look a little different. The webmaster will be able to identify it because (s)he will see his/her page listed with a URL that does not belong to the site. The URL is the part in green text under listings in Google.*
a lot of people use google as a sort of bookmarks page (with keywords they remember), so potentially this could hurt them. what more likely happens if it isn't fixed is that advertisers start to pollute the results even more, eventually making google useless.
This is just plagiarism/cloaking (Score:3, Informative)
This means that you can't reliably hijack the page unless you have a higher PR than it. But if you have a higher PR than that page, then you could just as well copy its content, wait till you're spidered, and then substitute whatever you want.
In other words, this is nothing more than another way to exploit two existing problems: (a) that you can steal anyone's content on the web (though see this [copyscape.com] for a way to detect it) and (b) you can cloak your site for the search engines (though I'm sure they notice that too).
In summary, there is nothing new in this whatsoever.
Re:Fake Banks (Score:2, Informative)
I use google all the time if I'm on someone else's computer since my bank has a strange URL.
However, if you search for say "Chevy Chase Bank" and then click on a link where the address clearly has nothing to do with Chevy Chase...well, Darwin had some things to say about that.
Re:Fake Banks (Score:5, Informative)
Not news.
I agree it's old; even the guy who wrote the article admits it goes back a few years. But you are wrong about how it works. These aren't just extra pages; they replace the hijacked site's pages in the results.
Why This is Such a Big Deal (A Summary) (Score:5, Informative)
Suppose you have a small business under the domain http://xyz.com/, and search engines bring you a lot of traffic because you rank high for keywords in your market. You have a lot of people out there linking to you, a lot of satisfied customers, good content on your site. You're always in the top 10 somewhere when people search for "xyz widgets".
Well, this issue with Google makes it very easy -- incredibly easy -- for someone to knock your site out of the rankings entirely. And I mean for *everything*, to where searching for your own company name in quotes literally buries you hundreds of pages deep in the results. We're talking sites going from getting 1000 unique hits to 10 overnight.
And here's the kicker: It requires absolutely no technical knowledge, no time investment, and is perfectly legal...
All I have to do is have another domain handy that is roughly as popular as yours. And I make a "links" page, like one of those directory services, that lists your website. But instead of being a normal hyperlink, it's a CGI (or PHP or ASP or whatever) script that generates a 302 redirect to your domain... Now, these are very simple, common scripts. One-liners that you can download from cgiscripts.com and stick on your server. The original intent of these scripts is to track which links are being clicked on your site. But now they've found a new use, because when Google gets that 302, all hell breaks loose.
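For concreteness, here is roughly what such an exit-link tracker looks like, sketched as a Python CGI script; the parameter name, log path, and fallback URL are illustrative choices rather than anything from the comment:

    #!/usr/bin/env python3
    # Minimal "exit link tracker": log the outbound click, then answer 302.
    import os
    from urllib.parse import parse_qs

    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    target = query.get("url", ["http://www.example.com/"])[0]

    # Record the click -- the original, innocent purpose of these scripts.
    with open("/tmp/clicks.log", "a") as log:
        log.write(target + "\n")

    # A 302 (temporary) redirect: the status Google misattributes.
    print("Status: 302 Found")
    print("Location: " + target)
    print()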
See, according to the HTTP spec, 302 is a *temporary* redirect, which means Google is supposed to interpret whatever content it finds at the 302 target (your site) as really belonging to the URL of the source (my site). Google is just obeying the spec strictly here, and with devastating results. Why? BECAUSE THE DUPE FILTER NOW KICKS IN! You see, Google has a "dupe filter" that says if the same exact content is found for two unique URLs, then one of the URLs is obliterated in the rankings. Because after all, searchers don't want to be finding the same content over and over. If that happens, they'll start using a different search engine. But Google, sticking strictly to the HTTP spec, doesn't know who the content really belongs to when it gets a 302.
So Google essentially flips a coin. And if it comes up tails, say bye-bye to your domain in the rankings. Your *entire* domain. Because the dupe filter isn't limited to just the page that the 302 is pointing to -- it applies across your entire domain.
These 302 "exit-link-trackers" are all over the web. They've been used by webmasters for years. But it's just recently that Google has started treating 302 this way, so it didn't have any bad effect before. But now it kills you.
The funny thing is, the solution seems pretty simple: Just stop treating 302s this way if they point to a different domain. But for whatever reason Google isn't listening. Hopefully the press that's being generated now will give them the kick in the ass that they need.
It happened to me.. (Score:5, Informative)
Well, I knew about the 302 bug (in fact, it's been known for months in professional webmaster circles).. so, I did an allinurl:mydomain.com/mypage.htm search on Google to find the culprit. Lo and behold, it was some blog page about one PR below my page with a script that redirected through a 302. The catch was that this redirect script ONLY worked if you clicked on it from the blog itself - if you clicked on it from the Google SERPs you got a 500 server error.. so in effect, Google misidentified the redirect page as my actual page, subsequently tried to spider it from the URL directly, and got a 500 error.. the result being that I was dropped from the index. Was this malicious? Hardly - the webmaster had compiled a small list of cool, useful links, not knowing that his buggy redirector was killing those sites off.
So whaddya do? I tried emailing the webmaster but everything bounced. It looks like he was out of the country. I tried giving Google feedback, but frankly that's just like offering up a prayer to the Great Google God - so I also used the BASE HREF trick mentioned in the article, and after a few days the page came back in the index as normal. So, either that trick worked or the Google God answered my prayers. I'm guessing at the former.
301 and 302 have very different meanings. (Score:5, Informative)
This "exploit" isn't very interesting and the author really doesn't seem to have a good grasp of the HTTP protocol design, the end-to-end model, or the internet in general.
I'd be very careful before I blindly changed all my redirects to 301s. The semantics behind a 301 and 302 are VERY different and unless you want people to replace the original URI with the target in your 301s, forever, you might be entering a world of hurt.
From RFC 2616 -- HTTP/1.1 [ietf.org]:
10.3.2 301 Moved Permanently
The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.
10.3.3 302 Found
The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.
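To make the contrast concrete, here is a small sketch using Python's standard http.server; the paths, port, and target URL are illustrative assumptions:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/moved":
                # 301: clients SHOULD re-link to the new URI and may cache it.
                self.send_response(301)
            else:
                # 302: clients SHOULD keep using the original Request-URI.
                self.send_response(302)
            self.send_header("Location", "http://www.example.com/new-location")
            self.end_headers()

    HTTPServer(("", 8000), RedirectHandler).serve_forever()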
This is a common theme in the high-tech world: Joe Hacker figures out a problem and a 'solution'. Problem is, he doesn't understand all the implications of the solution. That doesn't stop him from yelling loudly about it. Without a thorough analysis of the impact of the 'solution', you might just be causing yourself harm in other areas down the road.
Education and thorough analysis are always a good idea when you are dealing with complex systems that might have emergent behaviors. This is certainly one of the bigger pet-peeves at the IETF [ietf.org] and with the IESG [iesg.org].
This has been known for more than 2 years now (Score:2, Informative)
I have two sites: one is the main site, which we'll call www.widgets.com, and one is a site with a catchy name that automatically redirects to www.widgets.com, which we'll call www.widgetscatchy.com.
I was kind of confused that the www.widgetscatchy.com site had a PR5, so I checked the incoming links, and for some reason when I check the links to this site it shows www.widgets.com's links instead of its own. Even when listing the site, Google states 'Searched for pages linking to AYdabadfa:www.widgets.com/' instead of 'Searched for pages linking to AY4cSZStU-0J:www.widgetscatchy.com/'.
The sites are using the same hosting company, but they're two completely separate accounts and have completely different content.
Why has Google amalgamated these two sites' links? I'm just slightly worried that Googlebot will drop both sites from the index if it decides that the two sites are the same.
Re:Bollox (Score:4, Informative)
I have seen this exploit used in a variety of ways.
For instance, this kind of redirect could be used to hijack Amazon.com - the user types Amazon into a search box, sees the title and snippet that match Amazon, clicks it, and the hijacker gets affiliate commission credit for sending people to amazon.com.
Basically the 302 link makes the linking site appear to host the target site's page, and it replaces it in the search results.
You can pretty much do it for any site. In the case of Amazon, they'd likely void your affiliate commissions - if they noticed (which they would eventually) but if you did it for a few days before, say, Christmas, and took it down after it worked, you might net 8 - 15K in a single day.
Another danger is a malicious site whose redirect page sniffs for JavaScript. User agents with JS deactivated would be redirected straight to, say, CNN; if the UA accepted JS, it could start loading one of the many spyware "tools" that forcefeed affiliate tracking cookies onto the user's computer, or much worse.
There are tens of thousands of searches for "cnn.com" in the search engines a day - even if the hijacker only managed to replace CNN for a day, the harm would be widespread.
Unfortunately, Google PageRank is not considered when ranking the sites, as Google basically considers www.example.com/302.php?www.cnn.com to actually be www.cnn.com - it will show CNN.com's backlinks when you query backlinks for the hijack URL, for example.
Re:goog (Score:5, Informative)
The basic issue is that not only can purposeful individuals kick you out of the SERPs with a simple 302 from a higher-PageRank page, but people who use 302 redirects to track outgoing links from their site (and several content management software packages do this by default) can accidentally do the same thing, and there isn't anything the real webmaster can do about it.
It's been discussed in much greater detail in a thread at webmaster world [webmasterworld.com] for a while, as well.
Re:yawn (Score:1, Informative)
That's a pretty broad statement, and while I'm far from being a prude and have seen my fair share of porn, I've NEVER fantasized about being anywhere else with anyone else while having sex, and I've been with my partner for 10 years now.
So, yeah. Different strokes, as it were.
Re:yawn (Score:4, Informative)
Best Buy owned the magazine stand, the counter, the time of the person you were outright harassing, the building the exchange took place in, the merchandise you were holding in your hand until you handed over your money for it... in short, it's their private property! If you don't like it, go away!
I'm no fan of T&A magazines, if for no other reason than because it's a lame and overused marketing gimmick. But you ask someone to change what they're doing in their own store, you do not demand. And if they say "no," that's the end of the matter. You have no right to dictate the lives and decisions of other people, no matter what your religion may tell you.
"Divert your eyes. Divert your thoughts."
Do what you will with your eyes and thoughts. Leave mine alone.
The only problem I see.... (Score:4, Informative)
I would imagine that this could cause a problem with getting a website into the listings while it is in the process of moving, but if Google simply waited until it gets an actual 200 status code, then redirections would get ignored (since they're not 200s).
From the W3C document:
The temporary URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).
Again, since even the temporary URI doesn't have to be given, 302s should be ignored. Even 301s and 303s are not acceptable, since the new URI doesn't have to be given.
The harder way to fix it is to only accept 3xx response codes that give the new URI in response. Even then, I assume it's possible to still fake a 200 response code if you modify the http daemon, and make a transparent redirection... thus fooling the search engine in every respect.
I don't see a way around it unless you include signature files or such... but even if you used an SSL connection, it's probably still exploitable.
I guess you're damned any way you look at it.
Re:Can I use this to knock out a fraudulent site? (Score:1, Informative)
Their hosting shut them down a few weeks after that.
Your first port of call is your credit card company, as others have said. After that, just keep up a barrage of (very, very polite) emails to the hoster and registrar.
Good luck.
Part of the reason for this... (Score:2, Informative)
The problem the spider has to deal with in trying to organize and rank the results is the way web servers handle default web pages for a domain or a directory:
http://www.xyz.com/ actually pulls up http://www.xyz.com/index.html (because apache or the web server has been told to use index.html if no page component is in the URI) - but there is no requirement to communicate the "index.html" page name to the client, and very few servers actually do that (if they do, you'll see the URL change in the browser)
Some of the incoming links point to just the domain; other links point to the fully qualified URL. More than likely, your spider will eventually follow both and then receive web pages that are nearly identical.
At some point, xyz.com discovers PHP (yay!)... but they have traffic and PageRank associated with index.html. They put up a 302 redirect pointing index.html -> index.php.
Or they symlink index.html to index.php and tell PHP to parse index.html even though the extension is .html.
So from google's perspective:
http://xyz.com/
http://www.xyz.com/
http://xyz.com/index.html
http://www.xyz.com/index.php
http://www.xyz.com/index.html
http://xyz.com/index.php
all return identical content and the web has links pointing to every one of those names (and those links almost never go away or are corrected once created). From the Search Engine's perspective, which is the "real" URL/URI for the page?
Google (and the visitor) generally would like the answer to be
http://www.xyz.com/
Using the BASE URL tag tells Google the actual page name and clears up any ambiguity, which is why using one partially fixes the problem in some cases.
<head>...<base href="http://www.xyz.com/index.php"></head>
Now, let's make it uglier:
Ecommerce web site is installed in subdirectory, but wants its main page to be the "default" page for the domain - referral tracking and cookie management depends on this - however the web pages rely on the package existing in a subdirectory of the document root:
Actual URI is http://www.xyz.com/ecommerce/index.php
How do you get to that page as the default without confusing the search engines or losing the referring URL? Possible answers:
1) Use a meta refresh - doing that loses tracking information, as the landing page becomes the referring page. Google will also not be happy as this looks like a doorway page, and the redirect page itself has no real "content" to index
2) Use a 301 redirect - Bzzzzt - wrong answer - if you do this, you're telling the world that http://www.xyz.com/ no longer exists, in all perpetuity.
3) Use a 302 redirect - clears up the ambiguity, but confuses PageRank at least temporarily, since your incoming links mostly point at http://www.xyz.com/, not http://www.xyz.com/ecommerce/index.php
4) Use a Base Ref on the default page, as described above.
5) Have the web server return a Content-Location: header. This is similar to the base URL, except it is done at the HTTP level, not within the HTML. Content-Location: can either be relative to the request or absolute. It isn't authoritative, but could be helpful. In general, a cross-domain Content-Location header would have to be ignored, otherwise you would have the same exploit: you request the hijacker's page and its Content-Location header claims the content really lives at the target site's URL.
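Here is a minimal sketch of option 5 using Python's standard library: serve the storefront content at the root URL but include a Content-Location header naming the resource it really came from. The body, path, and port are illustrative assumptions:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<html><body>storefront</body></html>"  # stand-in for the real page
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            # Non-authoritative hint: this representation actually lives here.
            self.send_header("Content-Location", "/ecommerce/index.php")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), Handler).serve_forever()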