Google Admits to Using Sohu Database 209

Posted by CowboyNeal on Monday April 09, 2007 @07:46PM from the cut-and-paste dept.

prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged that Google improperly tapped its database for Google's own Pinyin IME product, stirring controversy over whether the two databases were similar merely because of the normal research process. Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"

This discussion has been archived. No new comments can be posted.

Dictionary mistakes. (Score:5, Funny)

by Tackhead ( 54550 ) writes: on Monday April 09, 2007 @07:53PM (#18669449)

> Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents.
...including the ones for "plagiarize", "research", and apparently a new one for the 2000s under "leverage".
Leverage! Leverage!
Let no one else's work cut short your edge,
Against the truth you can surely hedge,
So don't cut short your edge,
But leverage, leverage, leverage!

(One man deserves the credit! One man deserves the blame!
And Sergei Brin Ivanovich Lobachevsky is his name!)

- - - Re: (Score:2)
      
      by Warg! The Orcs!! ( 957405 ) writes:
      
      ...lehrverage....
Google's initial explanation (Score:5, Funny)

by Anonymous Coward writes: on Monday April 09, 2007 @07:55PM (#18669459)

"In the future, Google invents a time machine that's used by a rogue employee to travel back in time to give Sohu this database. It's clear then that Sohu stole our database."

- Re: (Score:2, Funny)
  
  by BungaDunga ( 801391 ) writes:
  
  In fact, if we hadn't used their database, our employee wouldn't be able to go back in time to give it to Sohu, and we wouldn't have been able to steal their database. QED.
Have no fear! (Score:2)

by mattgreen ( 701203 ) writes:

I'm sure someone will step up and help them save face in this embarrassing situation! When in doubt, you can always try to change the subject; that worked well in the previous thread. Now that I think about it, we need a RoughlyDrafted-esque site for Google. Anyone up to the task?
This reminds me of (Score:5, Interesting)

by Diordna ( 815458 ) writes: on Monday April 09, 2007 @07:57PM (#18669475) Homepage

"Stolen from Apple Computer" (whole story [folklore.org])

Turnitin.com Subscription Coming (Score:4, Funny)

by slashbob22 ( 918040 ) writes: on Monday April 09, 2007 @07:58PM (#18669481)

I guess Google Labs will have to subscribe to Turnitin.com now.

So... (Score:5, Interesting)

by Anonymous Coward writes: on Monday April 09, 2007 @07:58PM (#18669491)

When caught making a mistake, they admit it, work to resolve it, and move on?
I think there are a few other companies who could learn from that approach ...

- Re:So... (Score:5, Insightful)
  
  by Timesprout ( 579035 ) writes: on Monday April 09, 2007 @08:16PM (#18669611)
  
  'Mistake' is a bit euphemistic here. The dictionary was never made public, yet Google somehow managed to acquire it. They have not complied with Sohu's requests to date. They dragged their feet over the whole issue and only came clean when there was more than sufficient proof they were infringing.
  
  It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
  
  - On what do you base your judgment? (Score:4, Insightful)
    
    by Anonymous Coward writes: on Monday April 09, 2007 @08:35PM (#18669709)
    
    > They have not complied with Sohu's requests to date.
    
    One of Sohu's demands was to remove it. They did that, even prior to the cease & desist deadline, per the article. It sounds like they'll have to compensate Sohu next, which isn't overly surprising. As for where they got it, perhaps someone sold it to them? We don't know, so I'll reserve judgment about whether it was acquired in an un-Google "evil" way until we hear the rest of the story.
    
    > It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
    
    As for the copyright stance, I honestly don't care. Yes, I dislike Microsoft's hypocrisy concerning copyright, but I don't really give a damn about imaginary property at this point in time, and I don't see Google out there telling people that copyright infringement is evil, wrong, Communist and anti-American.
    
    Frankly, I'm more inclined to distribute my works with only one request: that you do not acknowledge my authorship in any way. Of course, almost the only way to enforce that is to post AC :-)
    
    - Re:On what do you base your judgment? (Score:5, Informative)
      
      by Daengbo ( 523424 ) writes: <daengbo&gmail,com> on Monday April 09, 2007 @09:36PM (#18670127) Homepage Journal
      
      In my mind, there is some question of whether a database of facts should, in fact (hee hee), be copyrightable at all. The characters were not original. The pinyin is not original. The pinyin for each character is, in fact, well established. Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?
      
      It reminds me of a court case a few years ago in Thailand, where a judge put several Thai fonts into the public domain, stating "No one owns the Thai alphabet. It belongs to the people."
      
      - Re: (Score:3, Interesting)
        
        by QuantumG ( 50515 ) writes:
        
        meh, the argument for why compilations of public domain "facts" should be considered a copyrightable work is that it is work to compile those facts. Why people can't understand that not all work results in property is beyond me, but there's ya reasoning.
        
        Nobody cares about "work." (Score:2)
        
        by Kadin2048 ( 468275 ) writes:
        
        > meh, the argument for why compilations of public domain "facts" should be considered a copyrightable work is that it is work to compile those facts. Why people can't understand that not all work results in property is beyond me, but there's ya reasoning.
        
        I don't know about in China (does China even have a copyright system to begin with?), but in the U.S., the amount of "work" you put into something doesn't matter one whit in terms of it being copyrightable. You could spend your entire life compiling statistic
        
        Re: (Score:2, Funny)
        
        by heinousjay ( 683506 ) writes:
        
        So the slogan is data entry wants to be free?
      - And it isn't (Score:2)
        
        by phorm ( 591458 ) writes:
        
        The language isn't copyrighted, and google was more than free to come up with their own dictionary/database. However, in this case they used somebody else's. The infringement is not against the language itself, but against the use of somebody's precompiled database (inclusive of errors, amusingly enough).
        
        it is not known data (Score:3, Informative)
        
        by phorm ( 591458 ) writes:
        
        Again, it is not the "known data" that is at question here, but the database as an object in its entirety.
        
        Nobody is accusing Google of "copying Chinese characters", but rather of copying a specific collection that somebody has invested time and money in creating. This is not a corpus, but rather more like a dictionary. Anyone can create one, but google - which I have eminent respect for in other areas, but not this one - has decided to take somebody else's "dictionary" rather than creating their own. The
      - Re: (Score:2)
        
        by buro9 ( 633210 ) writes:
        
        Any book out there is merely a collection of public-domain words; it's the arrangement of them into a single collection that is copyrighted.
        
        A database is little different.
        
        There is of course time and effort spent in creating the collection, and some of the interpretation could be argued to be a creative effort in and of itself.
        
        A map is public-domain knowledge, but the compiled article is copyrighted. It's hard to imagine why this database should be exempt from copyright when every other instance of compiled
        
        Re: (Score:2, Informative)
        
        by Daengbo ( 523424 ) writes:
        
        Well, Duke's law page [duke.edu] makes it clear that copyright is based on originality and not "sweat of the brow."
        
        The relevant portion:
        In 1991, the Supreme Court addressed this question in Feist Publications v. Rural Telephone Co.10 Feist is a publishing company specializing in area-wide telephone directories, and Rural is a public utility company that provides telephone service to Northwest Kansas. Feist had almost 50,000 white page listings in fifteen counties, while Rural had fewer than 8,000. The white pages li
      - Re: (Score:2)
        
        by asninn ( 1071320 ) writes:
        
        > Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?
        
        Sohu is probably asserting copyright over the errors they introduced. ;)
      - Re: (Score:2)
        
        by Plutonite ( 999141 ) writes:
        
        That is a great point, but they would argue that it is the effort put into the work that makes it "theirs". Same with an encyclopedia - you can use (and cite) it, but you sure as hell can't produce a carbon copy under a different name with zero recognition. Recognition of authorship, among free(beer) work at least, is a courtesy we have no need to abandon.
        
        Which brings me to the GP issue: why don't you want your name on things you've done? Recognition is a "nice" thing. If all of mathematics was written dow
      - Re: (Score:2)
        
        by Jeff DeMaagd ( 2015 ) writes:
        
        I suppose a book shouldn't be copyrighted, because it uses letters and words that already exist?
      - Re: (Score:2)
        
        by That's Unpossible! ( 722232 ) writes:
        
        Exactly! Writing a book is simply re-arranging factual letters into known words, and common sentences.
        How is that considered ORIGINAL? Bahhh...
      - Re: (Score:2)
        
        by Kadin2048 ( 468275 ) writes:
        
        Someone has taken the time to compile the data into the database. It cost time and money to do so. Google chose to take the shortcut and use that db instead of making their own (which further hints that the work involved was not trivial at all - you really can't argue that Google made a mistake, that they didn't know what they were doing).
        
        Doesn't matter; copyright -- at least U.S. and I think British copyright, I have no idea what if any philosophy underlies the Chinese system, if indeed they have one -- is
  - Re: (Score:2)
    
    by inviolet ( 797804 ) writes:
    
    > It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
    Perhaps so. But then, Google has billions of dollars in the bank. They have no need to steal anything from anyone, and every reason not to.
    Can you really suppose that anyone in Google management decided to snag Sohu's database? Google is in the database business, so they know all about the salting of databases. They had to know that any commercial database will be filled with giveaway records (e.
    - - Re: (Score:2)
        
        by ClosedSource ( 238333 ) writes:
        
        That would make some sense except for the fact that J++ was presented as a Java clone from day one. Sun sued MS on the basis of violating a contract; Sun never claimed that MS had stolen anything because nothing was.
- Re:So... (Score:5, Insightful)
  
  by Breakfast Pants ( 323698 ) writes: on Monday April 09, 2007 @08:16PM (#18669613) Journal
  
  Actually, when caught, they just removed the developers' names from the dictionary. When a big deal of it was made, *then* they went to town 'not doing evil'. They still haven't said how it happened; I bet they will quietly settle it, and we will never hear more.
  
- Re: (Score:3, Insightful)
  
  by suv4x4 ( 956391 ) writes:
  
  > When caught making a mistake, they admit it, work to resolve it, and move on?
  > I think there are a few other companies who could learn from that approach ...
  
  What a great approach indeed! Steal, and if caught, deny it a little, then cover it up.
  
  Actually I think Google learned that from someone else's company, or is Google "innovating" here? A debate for the coming generations.
Cmon Google... (Score:3, Funny)

by Anonymous Coward writes: on Monday April 09, 2007 @08:02PM (#18669519)

surely after helping so many students copy their research papers you should know the number 1 rule of copying another person's work: Change the F*CKING NAME!

I wonder... (Score:2, Interesting)

by flyboy81 ( 698817 ) writes:

Is this a single isolated incident or simply the first one of more coming from the company that does no evil?
- Re: (Score:2)
  
  by AmberBlackCat ( 829689 ) writes:
  
  I guess the only thing reasonably certain is it's the first time they got caught.
Time for a slogan change? (Score:5, Funny)

by GFree ( 853379 ) writes: on Monday April 09, 2007 @08:12PM (#18669587)

"Do no evil"

should be changed to

"Do just a tiny bit of evil"

which at this rate will probably end up as

"All your web are belong to us"

- Re: (Score:3, Funny)
  
  by Ngarrang ( 1023425 ) writes:
  
  Do no evil, or don't get caught.
  We redefine evil.
  Emulate or Innovate, which ever is more convenient.
- Re:Time for a slogan change? (Score:5, Insightful)
  
  by LarsG ( 31008 ) writes: on Monday April 09, 2007 @08:33PM (#18669695) Journal
  
  This reminds me of Animal Farm and how the commandments on the barn wall changed.
  
  The people outside looked from Google to MS, and from MS to Google, and from Google to MS again; but already it was impossible to say which was which.
  
  - Re: (Score:2, Funny)
    
    by Anonymous Coward writes:
    
    It's not gotten to that point yet. If you want to figure out which is Google and which is MS, if you're ducking chairs or you hear the distant chant of "developers, developers, developers", it's MS.
  - Re: (Score:2)
    
    by nephridium ( 928664 ) writes:
    
    My sentiments exactly. What will prevent Google from becoming evil? The company is growing rapidly. The information it possesses about many people is enough to theoretically pinpoint who/where they are, what their political affiliations might be, what they like to do in their spare time etc.etc. - THAT is power! And power corrupts. Maybe not now, maybe the "Google guys" have enough foresight and prudence to guard against their company becoming too evil for now, but things are bound to change in a generation o
Car stereo (Score:4, Funny)

by DogDude ( 805747 ) writes: on Monday April 09, 2007 @08:17PM (#18669623)

So then, the guy who stole my car stereo: was he "leveraging some non-car thief assets"?

- Re: (Score:2, Insightful)
  
  by iminplaya ( 723125 ) writes:
  
  Did he leave you an exact copy?
Do no evil (Score:5, Insightful)

by z-j-y ( 1056250 ) writes: on Monday April 09, 2007 @08:26PM (#18669671)

Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.

Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.

Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.

For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.
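
As a rough illustration of the point about mapping, here is a minimal sketch of the lookup at the heart of a Pinyin input method. The dictionary contents, the name `PINYIN_DICTIONARY`, and the function `candidates` are hypothetical and are not taken from Sohu's or Google's products; a real IME dictionary holds far more entries plus frequency and context data, which is where the effort goes.

```python
# Hypothetical, tiny pinyin dictionary (an assumption for illustration only).
# Each ASCII pinyin key maps to a list of candidate characters or phrases.
PINYIN_DICTIONARY = {
    "nihao": ["你好"],
    "zhongguo": ["中国"],
    "ma": ["吗", "妈", "马"],  # one syllable, several candidates, illustrative order
}


def candidates(pinyin):
    """Return the candidate characters/phrases for an ASCII pinyin string."""
    # Normalise case and drop spaces so "Ni hao" and "nihao" hit the same key.
    key = pinyin.lower().replace(" ", "")
    return PINYIN_DICTIONARY.get(key, [])


print(candidates("ni hao"))
print(candidates("ma"))
```

Even in this toy form, a single wrong key or a planted entry would be copied along with everything else if someone reused the table wholesale.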

- Re: (Score:2, Interesting)
  
  by maxume ( 22995 ) writes:
  
  There is no way to tell if the copying was done by 'Google' or if it was done by some engineer on their own. Sure, 'Google' needs to take steps to make sure that what they put out meets some sort of standard, but the backpedaling and what not is pretty much the response you would get no matter how the copying was initiated, so there isn't much reason to assume where the responsibility for the copying lies.
  - Re: (Score:3, Insightful)
    
    by QuantumG ( 50515 ) writes:
    
    Or done by a Chinese company which Google outsourced to. Isn't that how all corporations do their evil? Outsource it to Evil Inc. Everyone except Microsoft and Enron I guess.
- Re: (Score:2)
  
  by homer_s ( 799572 ) writes:
  
  > Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough?
  
  So, offering a 'me too' product is now evil?
  - Re: (Score:2)
    
    by The_Wilschon ( 782534 ) writes:
    
    When the me-tooist is a corporate giant and the me-firsters are still quite small, the me-tooist will typically crush the me-firsters merely by virtue of its size, name recognition, and ability to lose money on a market for a while in order to gain a monopoly of it.
    
    Even if they hadn't ganked anybody's data to do it, shoehorning themselves into a market full of players much smaller than themselves is not very nice.
    
    Gratuitous analogy: Michael Johnson steals a kid's shoes and then wears them to run at a hi
- Re:Do no evil (Score:5, Insightful)
  
  by ShawnDoc ( 572959 ) writes: on Monday April 09, 2007 @09:07PM (#18669943) Homepage
  
  This is a serious problem when dealing with Chinese companies. Now that Google has opened offices in China and has staffed them with native Chinese people, they're going to have a hard time enforcing western style ideas about copyright and what constitutes "doing no evil". It's a problem we've run into in the past with our Chinese operations. The way the problem was "solved", by removing the engineers' names, but still clearly using the other company's engine (they didn't remove the identical bugs), is something I have seen happen in the past when dealing with our R&D team in China when we've found them using code they "borrowed" either from open source code or from an engineer's past employer. I've never seen it handled in public like this however. Google is going to need to take some serious QA steps in their Chinese offices to keep stuff like this from happening again or else risk their Chinese office ruining the entire company's reputation.
  
- Re: (Score:3, Insightful)
  
  by ReallyEvilCanine ( 991886 ) writes:
  
  I'm appalled, too. I'm also surprised. What I'm not is a Google apologist. I still stand by the crux of my comment [slashdot.org] based on my work in I18N and with IMEs.
  > Google must have committed certain crimes to obtain the data.
  No, or at least, "Not necessarily intentionally". The dictionary could've been indexed via the spiders. It could've been indexed via the desktop search app. There are lots of ways that Google could've got the information. Anyone who works for Google, knows the deep ins and outs of their data
  - Re: (Score:2)
    
    by Achromatic1978 ( 916097 ) writes:
    
    > The dictionary could've been indexed via the spiders.
    The database wasn't bulk browseable.
    > It could've been indexed via the desktop search app.
    I certainly hope not. I would be horrified to find that my desktop search database was being uploaded to Google.
    The information was NOT publicly available. Making it out as though Google just happened upon the database because "Google is information" (?!?) just reeks of a new way to spin.
- Re: (Score:2)
  
  by asninn ( 1071320 ) writes:
  
  > Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough?
  Um, no, that's not evil at all - it's called capitalism. Now, you might argue that capitalism in general is evil, but that'd hardly be Google's fault...
  Seriously, if Google doesn't have anything new to offer, no innovations, no improvements or changes over existing products, then they won't do very well in the "wel
About that do no evil stuff.... (Score:2)

by pcause ( 209643 ) writes:

OK, so we do do some evil, but just with our competitor's code. That isn't so bad, is it?
Exactly how did they get a copy of the DB? (Score:2)

by WoTG ( 610710 ) writes:

OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

I suspect that there's more to this story that we're not hearing.
- Re:Exactly how did they get a copy of the DB? (Score:5, Informative)
  
  by tooyoung ( 853621 ) writes: on Monday April 09, 2007 @09:45PM (#18670171)
  
  > OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?
  >
  > I suspect that there's more to this story that we're not hearing.
  
  Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. The issue is whether or not this data was publicly available.
  
  For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person that compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' are labeled as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
  
  I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Based on the fact that Google is in the web-mining business, I would guess that they just grabbed this data off of the net, and someone forgot to think about if they had the right to use it.
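
The contrived 'n'/'q' example can be sketched in a few lines. This is only an illustration under stated assumptions: the one-number "features" and the data set are made up, and two very simple classifiers (nearest centroid and 1-nearest-neighbour) stand in for the neural net and the naïve-Bayes classifier mentioned above. Both are trained independently, share no code, and still repeat the same labeling mistake.

```python
# Hypothetical toy data: one numeric feature per handwriting sample.
# Samples that are really 'n' cluster near 0.0 and samples that are really
# 'q' cluster near 1.0, but whoever compiled the data swapped the labels.
TRAINING_DATA = [
    (0.0, "q"), (0.1, "q"), (0.2, "q"),  # really the letter 'n'
    (0.9, "n"), (1.0, "n"), (1.1, "n"),  # really the letter 'q'
]


def train_centroid(data):
    """Classifier A: predict the label whose mean feature value is closest."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(centroids[label] - x))


def train_nearest_neighbour(data):
    """Classifier B: predict the label of the single closest training sample."""
    return lambda x: min(data, key=lambda sample: abs(sample[0] - x))[1]


classifier_a = train_centroid(TRAINING_DATA)
classifier_b = train_nearest_neighbour(TRAINING_DATA)

# Fresh samples: 0.05 is really an 'n', 0.95 is really a 'q'.
results = [(classifier_a(x), classifier_b(x)) for x in (0.05, 0.95)]
print(results)  # both classifiers give the swapped label for each sample
```

Identical errors therefore point at shared data, not necessarily at shared code.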
  
  - Re: (Score:3, Insightful)
    
    by martin-boundary ( 547041 ) writes:
    
    To paraphrase Wirth: "Programs = Code + Data"
    According to TFA, the data (which apparently was built by the Sohu company) was not publicly available and was not licensed to other companies. Obviously, the data must exist in some form within the product itself. That would suggest that either the company had some unsecured internal servers, or that Google hired some of their people who conveniently kept a copy of the data, or they figured out how to decode the data dictionary from a copy of the product.
    I
  - Re: (Score:2, Informative)
    
    by PassBy ( 1086365 ) writes:
    
    I think you are misunderstanding how a Pinyin input works. But anyhow, it is rumored that Sohu had put some "database fingerprints" in their database. That means there are hard-coded patterns of Chinese characters that you wouldn't normally get by typing in the corresponding English letters (e.g. names of some Sohu employees). The mistake confirmed by Chinese users is in fact a misspelling. A Chinese comedian's name, which should be spelled "feng gong" (two characters), can only be outputted by typing "
this is quite troubling (Score:3, Insightful)

by martin-boundary ( 547041 ) writes: on Monday April 09, 2007 @08:34PM (#18669705)

It is clear from this example that _some_ Google engineers have not the first clue about what clean room engineering [groklaw.net] is and when it should be used. Everyone in the software industry is under pressure to produce; that doesn't mean cutting corners is acceptable.
This reminds me of the recent story about GPL code found in OpenBSD [slashdot.org]. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.
Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.

- Re: (Score:2)
  
  by QuantumG ( 50515 ) writes:
  
  Uh huh. Are you trying to suggest that there is something wrong with this:
  
  1. Take existing code under incompatible license
  2. Write new functionality and integrate into your code
  3. Test and develop your application until it is "ready"
  4. Replace incompatible code with your own code
  
  I mean, if you were talking about using proprietary code in the first step then I could imagine that you might have some kind of argument.. but it's GPL code man.. you're free to do whatever you want with it. Only when you distrib
  - Re: (Score:2)
    
    by martin-boundary ( 547041 ) writes:
    
    > .. but it's GPL code man.. you're free to do whatever you want with it.
    Of course you can. But if you modify _it_, then the end product is covered under the GPL. Let's take your example:
    > 1. Take existing code under incompatible license
    No problem there. At this point you have a copy of the GPL'd code, and no code of your own. You can do anything you like with the code.
    > 2. Write new functionality and integrate into your code
    At this point you have a derivative of the original GPL'd code. No prob
    - Re: (Score:2)
      
      by QuantumG ( 50515 ) writes:
      
      > At this point you have a derivative of the original GPL'd code. No problem there, you can do anything you like with the code.
      No.. if you distribute it *then* you are obligated to release your code under the GPL, *but not before*.
      - Re: (Score:2)
        
        by martin-boundary ( 547041 ) writes:
        
        > No.. if you distribute it *then* you are obligated to release your code under the GPL, *but not before*.
        
        The GPL applies to the source code throughout its existence, not merely to distributed source code if and when it gets distributed. In fact, the line "Copyright (C) DATE AUTHOR" which is filled in somewhere near the top of the disclaimer is a statement of ownership.
        
        Re: (Score:2)
        
        by QuantumG ( 50515 ) writes:
        
        Dude, you don't know what you are talking about ok? Stop speaking now.
        
        Fucking Slashdot.
        
        Re: (Score:2)
        
        by martin-boundary ( 547041 ) writes:
        
        Is that a quick way to "save face" and retire from the thread? I can accept that.
  - - Re: (Score:2)
      
      by QuantumG ( 50515 ) writes:
      
      Yeah, you're on crack if you think that new code you write is a derivative work just because you have read some GPL code.
- No symmetry (Score:2)
  
  by mangu ( 126918 ) writes:
  
  > This reminds me of the recent story about GPL code found in OpenBSD. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license
  That's just like that old story about the resort where there were girls looking for husbands and husbands looking for girls. It's not a symmetrical situation. If BSD coders feel it's all right to give their work away for free to commercial companies, it doesn't mean GPL coders should be forced to do the same. Even if the BSD pe
- Re: (Score:2)
  
  by Dun Malg ( 230075 ) writes:
  
  > It is clear from this example that _some_ Google engineers have not the first clue about what clean room engineering [groklaw.net] is and when it should be used.
  What kind of idiot "clean room engineers" a freakin' dictionary? You "clean room" the software that uses the dictionary...
  - Re: (Score:2)
    
    by martin-boundary ( 547041 ) writes:
    
    > What kind of idiot "clean room engineers" a freakin' dictionary? You "clean room" the software that uses the dictionary...
    
    Spot on. But remember that the dictionary was probably obtained in some form by examining the software that directly uses it. In other words, Google's programmers were reverse engineering a competitor's product.
Ironic (Score:5, Funny)

by smackt4rd ( 950154 ) writes: on Monday April 09, 2007 @08:42PM (#18669775)

So now american companies are pirating chinese software? Oh the irony! :)

Their new spokesperson ... (Score:2, Funny)

by myster0n ( 216276 ) writes:

... Theo De Raadt says that the Chinese are INHUMAN.

*ducks*
- Re: (Score:2)
  
  by micromuncher ( 171881 ) writes:
  
  C'mon. Anyone who misuses root access and a position of authority such as sysadmin to delete a term paper of someone who disagrees with them and is subsequently fired by the university who needs to send the police after him to retrieve server room keys MUST be an authority on authority.
Were the errors intentional? (Score:4, Informative)

by SuperBanana ( 662181 ) writes: on Monday April 09, 2007 @09:22PM (#18670025)

If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try and tell if/when someone steals their data. If Zyugyz Road in Somecity, CA turns up in someone else's map, the legal team fires at will.
It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.
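
The planted-entry trick is easy to picture as a set intersection. The following is a minimal sketch, not any provider's actual method: `CANARY_STREETS`, `find_planted_entries`, and both sample maps are hypothetical, and "Qwompf Lane" is an invented second canary added only for illustration.

```python
# Hypothetical canary entries a data provider might plant in its own data.
# "Zyugyz Road" is the example from the comment; "Qwompf Lane" is made up.
CANARY_STREETS = {"Zyugyz Road", "Qwompf Lane"}


def find_planted_entries(suspect_streets, canaries=CANARY_STREETS):
    """Return the planted entries that also appear in a suspect data set."""
    return sorted(canaries & set(suspect_streets))


# Two hypothetical competitor data sets.
independent_map = ["Main Street", "Oak Avenue", "Elm Street"]
copied_map = ["Main Street", "Oak Avenue", "Zyugyz Road"]

print(find_planted_entries(independent_map))  # nothing planted turns up
print(find_planted_entries(copied_map))       # the planted street turns up
```

Since a planted street cannot be surveyed in the real world, its presence in a rival data set is strong evidence of copying. The engineers' names reported in the Sohu dictionary would work as the same kind of marker.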

- Re: (Score:2)
  
  by Dan East ( 318230 ) writes:
  
  The same happens with government medical related data. Take the ICD9 database for example. It is distributed in a database format not conducive to programmatic access. For example, there are hundreds of codes with the description of "Other". Its description only makes sense in the context of all its parent levels, which then produces an extremely large, redundant description. Companies will simply reformat the data, take copyright and profit.
  
  Dan East
Shame! (Score:3, Funny)

by BluBall ( 16231 ) writes: on Monday April 09, 2007 @09:24PM (#18670047)

Following the protocols established by the recent OpenBSD/Linux Broadcom driver fiasco, the proper response would be to denounce Sohu for having been ripped off by Google.

Shame on you Sohu! This is inhuman!

Right! Google is evil! (Score:4, Insightful)

by SEE ( 7681 ) writes: on Monday April 09, 2007 @09:33PM (#18670097) Homepage

After all, we know that all Google employees are under Total Management Mind Control, and that Google Knows Everything Everyone's Doing. It's not even remotely possible that a handful of Google employees in China could shadily cut corners (using an already-extant database instead of compiling one from their own company's data) without Sergey Brin and Larry Page having personally authorized it from Mountain View, or that it would actually take a bit of time for upper management to investigate an issue when it's uncovered.

- Oh please... (Score:3, Insightful)
  
  by Moraelin ( 679338 ) writes:
  
  Oh please... if Google wanted to distance itself from it, they could have done so long ago. "Sorry, mates, some of our employees fucked up, they've been fired and the offending code/product/database is now being pulled off the market until we build our own replacement."
  
  The whole bullshit, including trying to get away with just deleting the original developers' names, and press releases about "leveraging non-Google assets" is what's damning Google. It's not just that the original incident happened, it's tha
  - - Re: (Score:2)
      
      by Moraelin ( 679338 ) writes:
      
      They obviously had the time to first try to "fix" it by removing the original developers' names, and now to pull the weird "we're just leveraging non-Google assets" statement. It seems to me like it would have taken exactly the same time to make a less irritating statement.
      
      Let me also say that I seriously doubt that they could replace such a database in-house within a week. There's a _lot_ of work involved in such a thing. Even if you have the most l33t code ever, the research involved isn't something you'd
Tutorial on Chinese input (Score:5, Informative)

by microbee ( 682094 ) writes: on Monday April 09, 2007 @10:02PM (#18670255)

There are a lot of misunderstandings about how an IME works and how Google copied non-public databases. So let me explain.

An IME accepts keyboard input and converts it into characters of a given language. There are many different input methods that decide how to generate Chinese characters from an English keyboard, and pinyin is one of them (and the most popular one).

Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin can convert into many different characters. In general, the same sequence could map to many different words (sometimes several dozen), and you usually need to select from them by pressing 1, 2, 3, ... (the input bar displays the candidates for you to choose from, sometimes requiring a page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.

A good implementation uses the following approaches:
1. Adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
2. Allow partial input for common phrases. This inputs a whole phrase at once, with each character requiring only the first letter of its pinyin. It speeds up input significantly.
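
The two approaches above can be sketched roughly like this in Python. This is a toy under stated assumptions, not how Sohu's or Google's IME actually works: the `ToyPinyinIME` class and its method names are invented, and the sample characters are picked for illustration rather than taken from any real dictionary.

```python
from collections import Counter, defaultdict


class ToyPinyinIME:
    """Toy pinyin input method: frequency ranking plus initial-letter phrases."""

    def __init__(self):
        self.table = defaultdict(list)  # typed letters -> candidate texts
        self.freq = Counter()           # candidate text -> times chosen

    def _register(self, key, text):
        if text not in self.table[key]:
            self.table[key].append(text)

    def add(self, pinyin, text):
        """Add `text` under a space-separated pinyin string such as "zhong guo"."""
        syllables = pinyin.split()
        self._register("".join(syllables), text)
        if len(syllables) > 1:
            # Approach 2: a phrase can also be typed by first letters only.
            self._register("".join(s[0] for s in syllables), text)

    def candidates(self, typed):
        """Approach 1: most frequently chosen candidates come first."""
        # sorted() is stable, so ties keep dictionary order.
        return sorted(self.table.get(typed, []), key=lambda t: -self.freq[t])

    def choose(self, text):
        """Record that the user picked `text`."""
        self.freq[text] += 1


ime = ToyPinyinIME()
for char in ["是", "时", "事"]:  # three of the many characters read "shi"
    ime.add("shi", char)
ime.add("zhong guo", "中国")

before = ime.candidates("shi")
ime.choose("事")
ime.choose("事")
ime.choose("时")
after = ime.candidates("shi")
phrase = ime.candidates("zg")
print(before, after, phrase)
```

After the user picks a character a few times it moves to the front of the list, and typing "zg" brings up the whole phrase. A real dictionary would ship with frequency counts already filled in, which is the expensive part.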

So the quality of the pinyin method depends heavily on how well the input could guess and prioritize the guesses, and thus the dictionary that is being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.

The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary by just installing its input method, and Google has likely obtained it this way. However, I don't think it's in an open-standard format, so Google has probably done some reverse-engineering to be able to actually use it in its own software.

That shouldn't be copyrightable (Score:5, Interesting)

by wrook ( 134116 ) writes: on Monday April 09, 2007 @10:03PM (#18670265) Homepage

I've been thinking about this. Throwing the evilness of Google aside for a moment, why should someone be able to copyright a listing of the phonetic pronunciation of an alphabet?

Let's just imagine how I might create this list. I would have to hire people who speak Chinese. Then I would ask them to record the pronunciation of each character that they know. This is pretty easy because in Chinese most characters have only one pronunciation (per dialect, anyway). There are about 3500 characters that you need to know in order to be literate. And all of these people would have learned these at school.

But how did they learn them? Well, they had a textbook and they memorized the list from the textbook.

Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.

But let's say I do that. Now I have a list of the 3500 most common characters. And with that, I've probably got 99% of everything that's in a newspaper. But that's probably not good enough. I probably want a list of say 60,000 characters. Otherwise it's pretty useless in a general sense. Uncommon characters are uncommon, but you *will* bump into the words over time.

So where do I find these characters? Can I hire some guy that knows them all? It would be very difficult. The best place to look is in a book. But wait... what am I going to do? Every time I find a character my people don't know, look it up in a book? Why don't I just copy it from the book in the first place? That's just copyright infringement again.

Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.

But if I'm going to have a community project like this, what the heck do I need copyright for? What am I protecting? If everyone is going to contribute, everyone should benefit.

So, personally, I don't think one should have copyright on this kind of material (same thing for spelling). It's just not in the public interest. This goes doubly so now that we have the internet and creating these kinds of projects is very inexpensive.

OK, I've gone on long enough... But one more rant. What's with this "do no evil" thing? Isn't that setting the bar a little low? If I told my parents that I'd work hard not to be evil, I think they'd be somewhat disappointed in me. If Google wanted to actually "do some good" rather than "do no evil", they could start a community project to collect this data and share it with the world.

Sigh... I guess we'll have to wait for some guy in his garage (but here's betting that someone has already started something).

- Re: (Score:3, Informative)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:2)
  
  by dominator ( 61418 ) writes:
  
  Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.
  That's not how copyright law works, at least in the USA. Lists are facts. Facts are not copyrightable, nor are compilations thereof. Mainly because copyright isn't determined by the "sweat of the brow" rule, but rather "the sine
- Re: (Score:2)
  
  by Jeff DeMaagd ( 2015 ) writes:
  
  I think there's a difference between just copying someone else's list and compiling your own from numerous sources that aren't identifiable.
  
  One secret I've heard about the textbook industry is that an author might use dozens of sources, and reinterpret the information in their own words. Much of that information in a textbook is often public knowledge or from public domain sources, but there's the work needed to compile that information into that particular order. I think using many different sources a
Finally we steal some IP from them! (Score:3, Funny)

by gatkinso ( 15975 ) writes: on Monday April 09, 2007 @10:38PM (#18670471)

TURN ABOUT IS FAIR PLAY.

Ok fine, we have stolen from them before... but Beef and Broccoli don't count.

Here's your wallet back mate (Score:2, Funny)

by Paranoia Agent ( 887026 ) writes:

Sorry, I was just leveraging some non-personal resources.
Provincialist Americans (Score:2)

by Keith McClary ( 14340 ) writes:

In the US, a list of words in lexicographic order is not necessarily copyrightable (e.g. phone books).

Is it also so in China? And does China have laws making databases IP like the US?

Americans seem to think that their bizarre and extreme notions of IP are universal law.

Perhaps someone here is an expert on Chinese IP law - did Google-China do anything illegal?
Begs teh question. (Score:2, Funny)

by Anonymous Coward writes:

Sohu cares?
Google's response (Score:2, Funny)

by Loconut1389 ( 455297 ) writes:

The person responsible for the copying has been sacked. ...
The person responsible for the sacking has been sacked...
- Mistakes are (Score:2)
  
  by EmbeddedJanitor ( 597831 ) writes:
  
  The mistakes were the giveaway. Surely these are "creative works"?
- Re:Any surprise this was done in China? (Score:4, Insightful)
  
  by BiggerIsBetter ( 682164 ) writes: on Monday April 09, 2007 @09:33PM (#18670093)
  
  Google may be filled with the best engineers, but once you move out of North America, they know nothing about ethics or morality.
  
  I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
  
  - Re: (Score:2)
    
    by Kristoph ( 242780 ) writes:
    
    Ummm ... hi there ... Canadian here ... please can we not get dragged into this :-)
    
    ]{
  - Re: (Score:2)
    
    by asninn ( 1071320 ) writes:
    
    I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
    
    Make that 95%, and count me in as one of those who'd disagree and who're curious as well.
  - Re: (Score:2)
    
    by steelfood ( 895457 ) writes:
    
    I think GP is a troll, but the actual point is valid. It isn't that parts of the world "know nothing about ethics or morality." It is that other cultures have other standards of ethics and morality. While most cultures have similar basic ethics and morals (do not kill, do not steal--actually a generalization of the first, etc.), something that falls into a gray area like reusing the IP of another will be inconsistent throughout the world. Besides, we don't really have an established moral outlook on IP infr
    - Re: (Score:2)
      
      by BiggerIsBetter ( 682164 ) writes:
      
      I don't believe that morality comes into it. Possibly ethics, but my limited experience with the US tells me that if you can a) gain advantage, b) get away with it, and c) the exposure is less than the cost of doing it yourself, then you steal/copy/infringe on the "IP". Anything less would be bad business. China isn't so different...
- Oblig futurama quote (Score:5, Funny)
  
  by pedantic bore ( 740196 ) writes: on Monday April 09, 2007 @10:03PM (#18670263)
  
  "The internet is about the free exchange of other people's ideas!"
  
- - - Re: (Score:2)
      
      by mattgreen ( 701203 ) writes:
      
      But they SAID they weren't evil, therefore that MUST make them good! Or, at least, that is how I fit into my naive worldview! Everything is either absolutely evil (Microsoft) or absolutely good (Google). There is no in-between.
    - Re:Do no evil? (Score:4, Interesting)
      
      by setagllib ( 753300 ) writes: on Monday April 09, 2007 @11:00PM (#18670643)
      
      They're significantly reducing the lock-in to Microsoft products, by encouraging, buying and thereafter funding web application projects that often overlap with what is currently locked in to Microsoft. They even brew some of their own sometimes. They continue the development of Linux and Python with a wide adoption of both. All of these things are creating wealth for everyone, and crippling Microsoft little by little, which we know is what we want. I'd much rather have a Google & Microsoft duopoly if it means Microsoft would finally have to clean up its shit and accommodate whatever open source platform Google would support in that scenario.
      
- Re: (Score:2)
  
  by Achromatic1978 ( 916097 ) writes:
  
  Tell you what, grab an M16 and man the borders. What the fuck piece of xenophobic, nationalistic tripe is this? "no more good people left in the world"?
- Re: (Score:2, Funny)
  
  by hackingbear ( 988354 ) writes:
  
  The advanced feature will be:
  
  When you are typing your term paper using this IME, the IME will automatically google the Web and find other papers on the same topic, so you can just stop thinking and typing and instead copy from those papers at the click of a button.
- Re: (Score:2)
  
  by rm69990 ( 885744 ) writes:
  
  More like Google being a business. Seriously, get over this "Evil" crap, it's marketing speak for crying out loud. Nothing more, nothing less.
  
  If you actually believe that Google will forgo profits to avoid appearing evil to an extremely small percentage of the population that actually give a shit what Google as a company does, I have a bridge to sell you.
  
  Every Slashdot user could quit using Google, and the effect on their financial situation would be negligible. So they don't really care what you think, or
- Re: (Score:2)
  
  by ajs ( 35943 ) writes:
  
  It's neither. It's a mistake, and human beings make mistakes. They hired someone who did the wrong thing, and I'm sure that mistake is being rectified.... Opening foreign offices is tricky stuff, and controlling them is trickier. Google is just starting to figure this out.
- - Doing evil to combat evil.. (Score:2)
    
    by iendedi ( 687301 ) writes:
    
    It's both. Do evil to combat evil. That's the American way now, didn't you get the memo?
    That is only one step away from "Doing evil to combat perceived evil". Or is that even one step?
    
    At any rate, since human perception is highly flawed, the practice of "Doing evil to combat perceived evil" can really be reduced to "Doing evil and hoping it limits the evil that others do". However, "Doing evil and hoping it limits the evil that others do" is really the same thing as simply "Doing evil." In fact, it is even worse, because it is really "Doing evil while competing with other evil in the ho
  - - Re: (Score:3, Funny)
      
      by jstomel ( 985001 ) writes:
      
      Chinks are chinese, gooks are vietnamese. People need to learn to keep their racial slurs straight or soon we won't be able to tell who anybody hates, and that would be terrible!
      - Re:Is this... (Score:5, Interesting)
        
        by 808140 ( 808140 ) writes: on Tuesday April 10, 2007 @01:43AM (#18671923)
        
        No, actually, "gook" is a term that originated in the Korean war for Korean people. Because many of the soldiers who fought in the Korean war were officers in the Vietnam war, their racial slurs were adopted and modified by a new generation, leading to great confusion about the origins of the term.
        
        The etymology of the word gook is interesting, because it may be one of the few racial slurs that originated with a people's term for themselves. In Korean, guk means "country" and by extension a country's people; when it is not modified (cf. waiguk, outside country, foreigner) it is understood to be Korea or its peoples. Speakers of Chinese will recognize the word as having Sinitic origin (guó, country, and wàiguó, foreign country, respectively, in Mandarin).
        
        The term was appropriated by the Americans during the Korean war and used as a racial slur for Korean people in general, which must have been confusing to the Koreans (imagine someone using "American" as a slur for Americans to get an idea). Then, in Vietnam, the old "Asians are all the same" mentality prompted GIs to extend its meaning (imagine "American" being a racial slur for all white people, for example -- yes, I know many Americans aren't white, it's not a perfect analogy, deal with it).
        
        
        - Re: (Score:2, Insightful)
          
          by Anonymous Coward writes:
          
          imagine someone using "American" as a slur for Americans to get an idea
          
          Why imagine? Come to Europe! But make sure to say you're Canadian...
        - Re: (Score:2)
          
          by AlecC ( 512609 ) writes:
          
          Something like using "Yankee" or "Yank" to refer to all Americans. But we wouldn't do that, would we?
- - Re: (Score:2)
    
    by zippthorne ( 748122 ) writes:
    
    The middle road approach typically works itself out eventually. see the French Revolution.
