Millions of Pages Google Hijacked using ODP Feed

The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
This discussion has been archived. No new comments can be posted.

  • Robot.txt (Score:3, Insightful)

    by superpulpsicle ( 533373 ) on Wednesday March 23, 2005 @11:40AM (#12024068)
    I am really, truly confused by the article. Is the hijacking more or less about Google digging into your site even when your robots.txt is refusing Googlebot entrance?

  • by Cytlid ( 95255 ) on Wednesday March 23, 2005 @11:43AM (#12024124)
    For every Good Thing, there are at least 100 different ways to abuse it.
  • Why? (Score:2, Insightful)

    by dep01 ( 730107 ) on Wednesday March 23, 2005 @11:46AM (#12024177) Homepage
    Why is it seemingly man's mission to "bring down" something that seems to provide such a great service for everyone?

    "Oh! Look! Something beautiful! Something impressive! I must destroy it!"

    pah. feeling jaded today, i guess.

  • by Not_Wiggins ( 686627 ) on Wednesday March 23, 2005 @11:46AM (#12024181) Journal
    buy GOOG on the dip as many non-techie investors panic sell. 8)
  • by gitana ( 756955 ) on Wednesday March 23, 2005 @11:47AM (#12024195) Homepage
    As web presence - defined as appearing within roughly the first 10-20 results of a search - becomes more and more important to "success," black hat techniques such as this one, aimed at eliminating competitors, will become more and more common. Google, or any other search tool, needs to be able to stay above the fray and not be subject to hacks such as this.
  • by Anonymous Coward on Wednesday March 23, 2005 @11:49AM (#12024212)
    Wow, getting modded up just for leaving a message on our answering machine! I guess it's true, just like with Wil Wheaton, if you claim to be (or are) someone of alleged importance, you too can get +5 Informative on every post, no matter what you say (or don't)!
  • Re:Robot.txt (Score:5, Insightful)

    by PornMaster ( 749461 ) on Wednesday March 23, 2005 @11:49AM (#12024219) Homepage
    I do think the figure of millions of pages being hijacked is a little steep, though.

    Why? It can be completely automated. A million is no harder than four.
  • by jridley ( 9305 ) on Wednesday March 23, 2005 @11:59AM (#12024379)
    Prosecute for what? Is there a law against redirecting web pages? I think this would be a pretty difficult prosecution. Google's going to have to take technical steps on this one.
  • Re:Why? (Score:2, Insightful)

    by a16 ( 783096 ) on Wednesday March 23, 2005 @12:18PM (#12024649)
    In this case, it's more a case of "I must make money from it".

    The people using this exploit to get fake listings (just like all of the spam pages we see in search engines) aren't doing it for the fun of it.
  • by Animats ( 122034 ) on Wednesday March 23, 2005 @12:24PM (#12024734) Homepage
    Redirects to a page should be treated as having far less PageRank value than the page itself. That will fix the problem.

    It will also break many "click trackers", "portals", "directory sites", "search engine optimizers", and other annoyances, which is probably a plus for Google users. You know, those sites where you click on some phrase in Google and, three redirects later, you're at some irrelevant porno site.
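    A minimal sketch of how a crawler might apply that idea when building its link graph -- the weights, the adjacency-dict structure, and the damping factor are illustrative assumptions, not anything Google has published:

    -----------------
    # Hypothetical sketch: weight edges created by HTTP redirects far below
    # ordinary links before running PageRank. The weights are made-up values.
    REDIRECT_WEIGHT = 0.05   # a redirect passes only 5% of a normal link's weight
    LINK_WEIGHT = 1.0

    def add_edge(graph, src, dst, via_redirect=False):
        """Record a weighted edge src -> dst in an adjacency dict."""
        weight = REDIRECT_WEIGHT if via_redirect else LINK_WEIGHT
        graph.setdefault(src, {})[dst] = weight

    def pagerank(graph, iterations=20, d=0.85):
        """Plain weighted PageRank over the adjacency dict."""
        nodes = set(graph) | {n for out in graph.values() for n in out}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {n: (1 - d) / len(nodes) for n in nodes}
            for src, out in graph.items():
                total = sum(out.values())
                if not total:
                    continue
                for dst, w in out.items():
                    new[dst] += d * rank[src] * (w / total)
            rank = new
        return rank
    -----------------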

  • by Hornsby ( 63501 ) on Wednesday March 23, 2005 @12:26PM (#12024767) Homepage
    Why not just fix the bug and then recreate the rankings index? Googlebot hits my sites all the time, so I know that it covers the rest of the internet quite often as well. With their amount of hardware, it probably wouldn't take long.
  • Not true I think (Score:1, Insightful)

    by Anonymous Coward on Wednesday March 23, 2005 @12:27PM (#12024789)
    allinurl: returns results where the search terms appear in the URL. So they *should* be returned.

    I'm not convinced by this whole 302 nonsense. I haven't seen a single search where a 302 scraper site is ranking above the site it 302s for the scammed text.

    To me it sounds like people's sites drop for whatever reason, then they look for a reason and they grasp at this 302 story.

    I do an allinurl: on my various sites (8 of them), and 6 have scrapers attached. Only 1 has disappeared recently, and that seems to have been caused by a change of IP address, or maybe the loss of the Yahoo directory link, or perhaps because I have lots of pages with 20-30% similar content.

    But if I only had 1 site I could easily blame a 302 problem.
  • Re:RTFA (Score:5, Insightful)

    by Zeinfeld ( 263942 ) on Wednesday March 23, 2005 @12:27PM (#12024794) Homepage
    Read the fucking article - you don't have to have any access to the victim site to do this - you only need to have a higher pagerank than them.

    The article is confused and badly written. It never actually explains the exploit being used, so stop dumping on people. It is not at all surprising that people don't get what is going on when the description is crud.

    What is really going on has nothing to do with 302, or at least very little. What these people are doing is setting up fake web sites using content filched from genuine web sites. This allows (or is believed to allow) them to climb the Google rankings.

    I don't see why someone would use a 302 response when they can just copy the entire content, unless there is some sort of bug in Google's PageRank that is not being explained. Copying the entire content is much simpler.

    So what the attacker does is set up their site so that when the Googlebot comes round it publishes some legitimate content; then, when other folk follow the site from a Google search, they get pages infested with spyware or the like.

    This would certainly explain the number of times I have done a Google search and ended up at an idiotic 'search site' that does nothing for me.

  • by wotevah ( 620758 ) on Wednesday March 23, 2005 @12:28PM (#12024804) Journal

    It seems that when page A redirects to B, Google not only considers that a hit for A, but also assigns B's content to A (I just skimmed through all the posts here so maybe that's not what happens).

    In that case, it seems to make more sense to just ignore A altogether since the hit and content rightfully belong to B.

    This could be done by treating redirects as empty one-link pages, thus unifying the handlers and defeating this practice.
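    A rough sketch of that "empty one-link page" treatment, assuming a toy crawler that keeps an index of page text and an outlink graph (the requests-based fetch and the data structures here are just illustrative):

    -----------------
    # Hypothetical sketch: index a redirecting URL as an empty page whose only
    # outlink is its target, so content and hits are credited to B rather than A.
    import requests

    def fetch_for_index(url, index, graph):
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302) and "Location" in resp.headers:
            target = resp.headers["Location"]
            index[url] = ""                           # A contributes no content of its own
            graph.setdefault(url, set()).add(target)  # ...just a single link to B
            return target                             # let the crawl queue fetch B normally
        index[url] = resp.text                        # ordinary page: keep its own content
        return None
    -----------------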

  • by Nessak ( 9218 ) on Wednesday March 23, 2005 @12:34PM (#12024882) Homepage
    I think that is the RIAA's wet dream -- to have every web page point to it. Don't they believe the only way to save music is to kill the web?
  • by ghoti ( 60903 ) on Wednesday March 23, 2005 @12:35PM (#12024884) Homepage
    Why don't you just pick the new URL as the canonical one? This way, any hijacking attempts would have no effect. And if I really want to do a permanent redirect, I don't want the old URL to stay in Google's database, anyway. I guess transferring the PageRank would be tricky (would make it possible to hurt a page by redirecting from a very low-rated one), but this still seems to be a lot less open to abuse.
  • Re:Not a surprise (Score:5, Insightful)

    by GoogleGuy ( 754053 ) * on Wednesday March 23, 2005 @12:36PM (#12024895) Homepage
    Hey, if you've run across spammy sites, have you filled out a spam report and used the keyword slashdot? I mentioned in an earlier comment on a different story [slashdot.org] that you can do this. We got eight reports last time, and the responses are on their way. We do check that data to look for new tricks that spammers are trying.
  • by Dynamoo ( 527749 ) on Wednesday March 23, 2005 @12:38PM (#12024915) Homepage
    You contact user support and use the keyword "canonicalpage" in your report... So how many reports has all this work gotten me? The last time I checked, it was under 30.

    Well shucks GG, not every webmaster is glued to WMW and other forums... and even if they were, the signal/noise ratio on this topic is so low that you probably couldn't find the information even if you were looking. It's hardly an obvious reporting mechanism. Although posting it on /. should help some, so that's appreciated. Thanks.

    But look - what we have here is a whole bunch of webmasters who have been nuked off the face of the earth by 302 redirects and just don't have the technical knowledge to try and fix it. Mom and Pop stores, hobbyists, nonprofits, etc. These people are just gonna get pasted... they'll just be wondering why they don't get any visitors any more.

    This is a HUGELY serious problem - and it's getting worse all the time as more and more people deliberately try to exploit the 302 bug. I've been hit by this bug myself, and let me tell you that unless you know EXACTLY what to look for you'd be stuffed - all you'd see is your traffic flatlining.

    The key issue here - and it's the kind of issue that will really, really hit the headlines when it's exploited - is redirection. Sure, I can use a 302 and send Googlebot to the correct page... so first of all I basically 0wn the content of that page, not the publisher. *Then* I insert an exploit into the 302 redirect... and hey presto, I've 0wned hundreds of thousands if not millions of computers. *That's* going to make unpleasant reading for Google when it hits the headlines - "Use Google and Get Owned". Nasty.

  • Go Phish (Score:2, Insightful)

    by MacFanMR ( 870166 ) on Wednesday March 23, 2005 @12:40PM (#12024957)
    This has very real potential to be taken advantage of for phishing scams.

    Imagine someone searching for their bank's website on Google (because some think that [searching] is how the web works!) and clicking the wrong link. That link takes them to a site that looks just like their bank's website, and maybe there is a security alert on the front page asking them to verify their information. After doing so, they could be redirected to their real bank's site, never having realized their error.

    Experience has shown me that most non-techies know to type an address into their browser, but after that they pay no attention to the address bar, which makes this a real possibility.
  • Mod parent up. (Score:3, Insightful)

    by MyLongNickName ( 822545 ) on Wednesday March 23, 2005 @12:48PM (#12025066) Journal
    This is hilarious! Someone please mod up! Hope I get the above mods in M2.
  • by gl4ss ( 559668 ) on Wednesday March 23, 2005 @12:48PM (#12025079) Homepage Journal
    and that prosecutor has to get pretty imaginative to get jurisdiction over the people in some countries.

    prosecution can't fix this problem.
  • Re:302 (Score:3, Insightful)

    by ari_j ( 90255 ) on Wednesday March 23, 2005 @12:49PM (#12025103)
    Thanks. And remember, identity theft is not a joke, unless you steal the identity of a clown.
  • by salvorHardin ( 737162 ) <{adwulf} {at} {gmail.com}> on Wednesday March 23, 2005 @01:11PM (#12025376) Journal
    What happened to Google's 'don't be evil' policy? Guess that only applies when it's convenient. Personally... I would have given GoogleGuy -1 Troll.

    Google/GoogleGuy isn't being evil, just seemingly suffering from ignorance [yahoo.com] and/or apathy [yahoo.com].

    That said, I'm reminded of a quote I heard once: "The only thing necessary for the triumph of evil is for good men to do nothing". Please stop doing nothing, Google.
  • by Anonymous Coward on Wednesday March 23, 2005 @01:26PM (#12025559)

    Obviously you did not google for it.

    This is an idiotic response. Why on earth do people mod stuff like this up? Who in the hell is going to google for "canonicalpage"??? That is the solution, you moron. Let me see you search for and find the solution without entering the term for the solution itself.

    You are a moron, and whoever modded you up is even stupider than you.

  • Re:Exactly. (Score:4, Insightful)

    by loraksus ( 171574 ) on Wednesday March 23, 2005 @01:26PM (#12025560) Homepage
    I sort of agree; it was really bad a month or two ago, but it has been getting better for most of the "commonly searched" terms. Some fairly obscure searches still turn up a bit of crap, but you can't clean it up for everyone.
    A "Don't show me any results from this subnet + domain from now on" feature would be nice, as would Google banning some of the worst offenders (which it seems to have done).

  • by 1u3hr ( 530656 ) on Wednesday March 23, 2005 @01:27PM (#12025577)
    I wish I had mod points for you. If this were MS, everyone here would be screaming bloody murder. Instead GoogleGuy gets modded +5 Informative.

    It's EXTREMELY informative, because it tells you what Google's official position is. Whether you like it or not, you need to know that. "Informative" doesn't mean "good".

    If Bill Gates posted here in defence of some MS policy, it would hopefully similarly be modded "informative".

  • OK, I'll bite ... (Score:4, Insightful)

    by isometrick ( 817436 ) on Wednesday March 23, 2005 @01:33PM (#12025636)
    Look, there *was* circumstantial evidence for the "Greg Duffy" thing ... i.e. just enough to make it a discussion. I agree that fearmongering is not the way to go. I appreciate that you looked into the issue (and my first instinct is to trust your explanation, that it was a DNS issue).

    However, if this is Google's PR method, I think you are kind of asking for it! In the absence of information, the internet community will speculate until the cows come home. I'm not saying it's right, I'm just saying that's reality. Even though I said on my site that I thought Google didn't do anything underhanded, I bet a lot of people were still not convinced. Google can do a little better than this, and although you have been fairly nice to me (thanks), this response is a little flamebaity for PR. Please understand that I mean no offense; it's just constructive criticism. Even if everything you say is true, a representative of the company should always at least attempt to sugar coat something like your last paragraph.

    Also, on a more personal note, maybe Google should embrace the people that are involved [clsc.net] in researching [gregduffy.com] these problems instead of using this broken communications policy. I know that in my case I contacted you guys 5 *months* ago about the Google Print problem I described and never got any followup except for my t-shirt (which I really like). I have some great ideas about possible solutions to the problem I described, and as far as I can see Google has not fixed the root of the problem. When are you guys going to contact me?

    -Greg Duffy
  • Simple Answer (Score:5, Insightful)

    by rabtech ( 223758 ) on Wednesday March 23, 2005 @01:38PM (#12025694) Homepage
    There is a simple solution for Google: only honor 302 redirects when the original and target domains match (or the target is a subdomain of the original domain).

    In all other cases, treat a 302 (temporary) as a 301 (permanent) redirect, thus giving credit for the content to the host that actually serves it.

    This allows webmasters to continue using 302s to set up logical URLs that mask the organization of the underlying content, but it completely eliminates the ability to hijack.
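    A hedged sketch of that rule, assuming the indexer can see both the original and target URLs (the naive hostname comparison here ignores details like the public suffix list):

    -----------------
    # Hypothetical sketch: honor a 302 only within the same site; otherwise
    # behave as if it were a 301 and credit the destination host.
    from urllib.parse import urlparse

    def same_site(src_url, dst_url):
        src = urlparse(src_url).hostname or ""
        dst = urlparse(dst_url).hostname or ""
        return dst == src or dst.endswith("." + src)

    def effective_status(status, src_url, dst_url):
        """Return the status code the indexer should act on."""
        if status == 302 and not same_site(src_url, dst_url):
            return 301   # cross-site "temporary" redirects get no indexing credit
        return status
    -----------------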
  • by filmmaker ( 850359 ) * on Wednesday March 23, 2005 @01:48PM (#12025825) Homepage
    Exactly. And if they'd just stop giving PageRank credit to the redirect destination, it'd all be over. In fact, the algorithm should check the link density between two disparate domains before it even caches 302'ed content. In these scam cases, the perpetrator never has an inbound link from the victim domain, so Google could "grade" this relationship as very one-sided and not generally trustworthy. The more interlinkage, the more trust. But assigning PageRank on 302s is just nuts.
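    One way that one-sidedness check might look, assuming a link graph that maps each domain to the set of domains it links out to (the structure and the back-link requirement are assumptions for illustration, not a description of Google's algorithm):

    -----------------
    # Hypothetical sketch: before letting a cross-domain 302 pass any credit,
    # require at least one link from the destination back to the redirector.
    def redirect_looks_one_sided(redirector, destination, outlinks):
        """outlinks: dict mapping a domain to the set of domains it links to."""
        links_back = redirector in outlinks.get(destination, set())
        return not links_back   # no back-link: treat the relationship as untrusted
    -----------------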
  • Re:OK, an example (Score:3, Insightful)

    by That's Unpossible! ( 722232 ) * on Wednesday March 23, 2005 @02:02PM (#12025977)
    The problem you are describing here is not a 302 hijacking. Those sites don't do any redirecting, and they aren't duplicating your site's pages and causing you to be bumped out of the loop. They just happen to have a link to your site and your "motto" on their page. The fact that their page comes up before yours does seem stupid, but it is unrelated to the 302 hijacking issue.
  • by Anonymous Coward on Wednesday March 23, 2005 @02:20PM (#12026275)
    The same thing can be done with a CNAME record. Give your domain a CNAME to www.google.com. It will eventually obtain a PageRank of 10. But this PageRank is useless for obtaining better search positions, as it will go away once the real PageRank is calculated.

    This may not be a problem because the PageRank shown in the toolbar is generally not the real PageRank Google uses to determine its positions.

    These techniques may be nothing more than a placebo, and there's probably a few Google employees who get a good laugh out of webmasters using such techniques.
  • I don't get it... (Score:3, Insightful)

    by jafiwam ( 310805 ) on Wednesday March 23, 2005 @03:19PM (#12026967) Homepage Journal
    Why all the yammering and discussion on this?

    It's pretty simple; 302 redirects allow bad guys to exploit Google.

    It doesn't matter that it's the wrong way to use a 302 redirect. They are the BAD GUYS. Remember the "spammers lie" truism?

    It's the Google rule that is broken. A 302 should be treated as "can't find site" in their search rankings rather than assuming the data sent by the web server is honest. It sucks that some legit users of 302s won't get ranked as well because of it, but boo hoo. Let anybody that has hardware or software problems get better equipment in the first place if their freaking world ends when they don't get ranked in their keyword group. I have NO SYMPATHY for someone who shoestrings their vital revenue-stream infrastructure and then wonders why things go bad. It reminds me of my job too much.

    Buy Google ads if you need to make money off your site traffic.

    Google will change the rule or they won't. If they want to stay relevant, they'd better. I find myself getting irritated with Google's crappy search results a lot nowadays; sooner or later I will find one of the little startups to use, and they can kiss off if it keeps up. So I figure they will get to it. They are Google; they are good at what they do.

    Now what I think they should do is download snippets of pages via the Google toolbar, which then sends the data to Google, making a massively distributed bot-net spider that is indistinguishable from the web-using masses. At that point, as far as exploiting Google via the bot's IP or user agent goes, IT IS ALL OVER.

    Move along, nothing to see here but a bunch of people who don't understand redirects and the HTTP protocol.
  • Re:Robot.txt (Score:1, Insightful)

    by HEXAN ( 790837 ) on Wednesday March 23, 2005 @03:44PM (#12027264)
    The change is simple and very marginal. Don't follow a 302 to a different domain:

    1. Scammer at iblow.com 302's to ferrari.com
    2. Googlebot indexes iblow.com and receives the 302
    3. Googlebot refuses to follow the 302 outside iblow.com
    4. Googlebot continues to the next page on iblow.com
    5. The iblow.com 302 is placed in /dev/null
  • by glesga_kiss ( 596639 ) on Wednesday March 23, 2005 @08:15PM (#12030486)
    Google has login accounts, so let logged-in users have a link saying "report spam site".

    As an alternative, I'd love a cookie-based version of this where you could click "ignore all results from this domain". After a couple of weeks you'd get rid of most of them in your personal browser. Make the lists sharable, even. All the PageRank wannabes can do is start from scratch with new URLs.
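    A small sketch of the client-side version of that idea, assuming search results arrive as a plain list of URLs and the blocklist lives with the user (both are assumptions for illustration):

    -----------------
    # Hypothetical sketch: drop any result whose host matches a personal
    # "never show me this domain again" list.
    from urllib.parse import urlparse

    def filter_results(results, blocked_domains):
        kept = []
        for url in results:
            host = urlparse(url).hostname or ""
            if any(host == d or host.endswith("." + d) for d in blocked_domains):
                continue   # the user has asked never to see this domain again
            kept.append(url)
        return kept
    -----------------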

  • by accidentalGeek ( 870241 ) on Wednesday March 23, 2005 @09:08PM (#12030983)

    More precisely, googlebot always sends the same referrer. Here's a snippet from an apache access log.

    -----------------
    64.68.80.4 - - [01/Mar/2005:16:19:24 -0500] "GET /robots.txt HTTP/1.0" 200 770 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    -----------------

    In practice, a static referrer and no referrer amount to the same thing so you're right from a practical standpoint. The referrer is not useful.

    But that's OK, because the system I described does not depend on the referrer header. If a referrer header is available, it will use it as a shortcut to determine whether the client was referred by an internal link, and potentially bypass the whole redirect process. This saves system and network resources in the majority of cases, when the client is an ordinary web browser, but it's not essential, and it clearly won't be useful when the client is Googlebot (or some other robot that does not provide a referrer).

    If the client is a Googlebot, the filter will see that there's no referrer. It will then check its stateful cache to determine whether it has seen this robot recently. If so, it will let the robot right through and the request will be processed normally. If not, it will issue the slightly obfuscated 301 redirect. When the robot follows this redirect, the filter will be invoked again. This time, it will recognize the robot from its previous visit and will let it through.
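    A rough sketch of that filter as a tiny WSGI middleware -- the in-memory cache keyed by client IP, the one-hour TTL, and the token query parameter standing in for the "slightly obfuscated" redirect are all assumptions for illustration:

    -----------------
    # Hypothetical sketch: bounce a referrer-less client through one 301,
    # remember it, and let it straight through on the follow-up request.
    import time
    import secrets

    SEEN = {}    # client ip -> (token, last seen timestamp)
    TTL = 3600   # forget clients after an hour

    def redirect_filter(app):
        def middleware(environ, start_response):
            ip = environ.get("REMOTE_ADDR", "")
            referrer = environ.get("HTTP_REFERER")
            now = time.time()
            token, seen_at = SEEN.get(ip, (None, 0.0))
            if referrer or (token and now - seen_at < TTL):
                # Ordinary browser, or a robot we have already bounced once.
                SEEN[ip] = (token or secrets.token_urlsafe(8), now)
                return app(environ, start_response)
            # First contact with no referrer: redirect once and remember the client.
            SEEN[ip] = (secrets.token_urlsafe(8), now)
            location = environ.get("PATH_INFO", "/") + "?via=" + SEEN[ip][0]
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return middleware
    -----------------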

  • Re:Re-re-explained (Score:3, Insightful)

    by yulek ( 202118 ) on Wednesday March 23, 2005 @09:10PM (#12030993) Homepage Journal
    If, for example, I use redirects to distribute traffic between multiple servers on multiple hosts, the GoogleBot's behaviour of treating the redirecting host as the website's canonical host is correct. I want users to use the referring host so that I can change physical hosts with impunity.

    well, a bunch of people have suggested that 302s should only be honored by crawlers if the domain is the same. i think that's a pretty good idea.

    It's not Google that's broken--it's the web. It's just that the two-legged weasels are only now starting to pry open the cracks.

    why do you say that? how is the web broken because of the way google crawls it? the http standard was designed before googlebots were crawling it. long, long before. the googlebot needs to be more intelligent, is all.
  • by GoogleGuy ( 754053 ) * on Wednesday March 23, 2005 @09:12PM (#12031017) Homepage
    claus, I'm glad that you mentioned this search. I looked through those 100 results. Every example that I saw in those results was from a while ago--they were all listed with the Supplemental Result tag. So this is already handled correctly in our main index, and as urls are updated in the supplemental index, those examples should be handled correctly as well.

    Thanks for mentioning this search; it's a good point. We've already made some changes to improve our heuristics, and you can see that improvement in the fact that current urls look better than the supplemental urls.
  • Re:Re-re-explained (Score:2, Insightful)

    by anthony_dipierro ( 543308 ) on Wednesday March 23, 2005 @10:01PM (#12031457) Journal

    In a sense, of course, there's little google can do to prevent this, because even if they weighted 302-redirects lower in their "throw out duplicates" stage, I could always just go snag a copy of your website each time googlebot visits, in essence doing the redirection myself.

    However, doing it through 302 redirects means that google pays for the bandwidth to go get your page, not me.

    Ah, but doing it through a 302 also means that the target site can't notice you making regular hits to it and block your IP address.

    There's also perhaps a legal distinction. Actively copying someone else's site without permission is pretty clearly copyright infringement. Just 302ing to it most likely isn't.
