Slashdot
Google Admits to Using Sohu Database
Posted by
CowboyNeal
on Mon Apr 09, 2007 06:46 PM
from the cut-and-paste dept.
prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged that Google improperly tapped its database for its Pinyin IME product, stirring controversy over whether the two databases were similar simply as a result of a normal research process. Today Google admitted that its new product for the Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"
Related Stories
Google Faces Plagiarism Questions Over Chinese Software 187 comments
yaohua2000 writes "Google's laboratory in China has launched its first product, a Pinyin Input Method Editor. The software lets users type romanized Pinyin on a QWERTY keyboard and converts it into Chinese characters. Users soon discovered that the data Google used for the product was unusually similar to the data used by a Chinese rival, Sogou. Google has evaded the question about software similarities, reports PC World. 'The similarities, which included an error involving the name of a celebrity, were noted on a Google Labs discussion board about its Pinyin IME. Users noted that entering the Pinyin pinggong into the Google IME incorrectly produced the name of Feng Gong, an actor and comedian.'"
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Dictionary mistakes. (Score:5, Funny)
Leverage! Leverage!
Let no one else's work cut short your edge,
Against the truth you can surely hedge,
So don't cut short your edge,
But leverage, leverage, leverage!
(One man deserves the credit! One man deserves the blame!
And Sergei Brin Ivanovich Lobachevsky is his name!)
Google's initial explanation (Score:5, Funny)
Have no fear! (Score:2)
This reminds me of (Score:5, Interesting)
Turnitin.com Subscription Coming (Score:4, Funny)
So... (Score:5, Interesting)
I think there are a few other companies who could learn from that approach
Re:So... (Score:5, Insightful)
It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
Parent
On what do you base your judgment? (Score:4, Insightful)
One of Sohu's demands was to remove it. They did that, even prior to the cease & desist deadline, per the article. It sounds like they'll have to compensate Sohu next, which isn't overly surprising. As for where they got it, perhaps someone sold it to them? We don't know, so I'll reserve judgment about whether it was acquired in an un-Google "evil" way until we hear the rest of the story.
> It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
As for the copyright stance, I honestly don't care. Yes, I dislike Microsoft's hypocrisy concerning copyright, but I don't really give a damn about imaginary property at this point in time, and I don't see Google out there telling people that copyright infringement is evil, wrong, Communist and anti-American.
Frankly, I'm more inclined to distribute my works with only one request: that you do not acknowledge my authorship in any way. Of course, almost the only way to enforce that is to post AC
Parent
Re:On what do you base your judgment? (Score:5, Informative)
It reminds me of a court case a few years ago in Thailand, where a judge put several Thai fonts into the public domain, stating "No one owns the Thai alphabet. It belongs to the people."
Parent
Re: (Score:3, Interesting)
it is not known data (Score:3, Informative)
Nobody is accusing Google of "copying Chinese characters", but rather of copying a specific collection that somebody has invested time and money in creating. This is not a corpus, but rather more like a dictionary. Anyone can create one, but Google, which I have eminent respect for in other areas, but not this one, has decided to take somebody else's "dictionary" rather than creating their own. The
Re: (Score:2)
Perhaps so. But then, Google has billions of dollars in the bank. They have no need to steal anything from anyone, and every reason not to.
Can you really suppose that anyone in Google management decided to snag Sohu's database? Google is in the database business, so they know all about the salting of databases. They had to know that any commercial database will be filled with giveaway records (e.
Re:So... (Score:5, Insightful)
Parent
Re: (Score:3, Insightful)
I think there are a few other companies who could learn from that approach
What a great approach indeed! Steal, and if caught, deny it a little, then cover it up.
Actually I think Google learned that from someone else's company, or is Google "innovating" here? A debate for the coming generations.
Cmon Google... (Score:3, Funny)
I wonder... (Score:2, Interesting)
Time for a slogan change? (Score:5, Funny)
should be changed to
"Do just a tiny bit of evil"
which at this rate will probably end up as
"All your web are belong to us"
Re: (Score:3, Funny)
We redefine evil.
Emulate or Innovate, whichever is more convenient.
Re:Time for a slogan change? (Score:5, Insightful)
The people outside looked from Google to MS, and from MS to Google, and from Google to MS again; but already it was impossible to say which was which.
Parent
Car stereo (Score:4, Funny)
Re: (Score:2, Insightful)
Do no evil (Score:5, Insightful)
Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.
Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.
For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.
Re: (Score:2, Interesting)
Re: (Score:3, Insightful)
Re: (Score:2)
So, offering a 'me too' product is now evil?
Re:Do no evil (Score:5, Insightful)
Parent
Re: (Score:3, Insightful)
Google must have committed certain crimes to obtain the data.
No, or at least, "Not necessarily intentionally". The dictionary could've been indexed via the spiders. It could've been indexed via the desktop search app. There are lots of ways that Google could've got the information. Anyone who works for Google, knows the deep ins and outs of their data
About that do no evil stuff.... (Score:2)
Exactly how did they get a copy of the DB? (Score:2)
I suspect that there's more to this story that we're not hearing.
Re:Exactly how did they get a copy of the DB? (Score:5, Informative)
Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. It is whether or not this data was publicly available that is the issue.
For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person that compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' are labeled as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
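To make the contrived example concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: instead of handwriting images it uses made-up 2-D points, and instead of a neural net and a naïve-Bayes classifier it swaps in two simpler, unrelated algorithms (a nearest-centroid classifier and a 1-nearest-neighbor classifier). All names and numbers are hypothetical.

```python
import random

def make_samples(center, label, count, rng):
    """Generate noisy 2-D points around a center, all tagged with one label."""
    return [((center[0] + rng.gauss(0, 0.5), center[1] + rng.gauss(0, 0.5)), label)
            for _ in range(count)]

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def train_centroid(data):
    """Model A: remember the mean point of each label."""
    sums = {}
    for (x, y), label in data:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict_centroid(model, point):
    """Model A prediction: label whose mean point is closest."""
    return min(model, key=lambda label: sq_dist(model[label], point))

def predict_nearest_neighbor(data, point):
    """Model B prediction: label of the single closest training sample."""
    return min(data, key=lambda item: sq_dist(item[0], point))[1]

rng = random.Random(0)
# Assumed ground truth: 'n' shapes cluster near (0, 0), 'q' shapes near (5, 5).
# The shared data set has the two labels swapped by mistake.
shared_training_data = (make_samples((0, 0), "q", 50, rng) +
                        make_samples((5, 5), "n", 50, rng))

centroid_model = train_centroid(shared_training_data)
true_n = (0.1, -0.2)   # really an 'n'
true_q = (4.8, 5.3)    # really a 'q'

results = {
    "centroid": (predict_centroid(centroid_model, true_n),
                 predict_centroid(centroid_model, true_q)),
    "nearest_neighbor": (predict_nearest_neighbor(shared_training_data, true_n),
                         predict_nearest_neighbor(shared_training_data, true_q)),
}
print(results)
```

Both models call the real 'n' a 'q' and the real 'q' an 'n', even though they share no code; the only thing they have in common is the mislabeled data.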
I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Based on the fact that Google is in the web-mining business, I would guess that they just grabbed this data off of the net, and someone forgot to think about if they had the right to use it.
Parent
Re: (Score:3, Insightful)
According to TFA, the data (which apparently was built by the Sohu company) was not publicly available and was not licensed to other companies. Obviously, the data must exist in some form within the product itself. That would suggest that either the company had some unsecured internal servers, or that Google hired some of their people who conveniently kept a copy of the data, or they figured out how to decode the data dictionary from a copy of the product.
I
this is quite troubling (Score:3, Insightful)
This reminds me of the recent story about GPL code found in OpenBSD [slashdot.org]. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.
Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.
Re: (Score:2)
1. Take existing code under incompatible license
2. Write new functionality and integrate into your code
3. Test and develop your application until it is "ready"
4. Replace incompatible code with your own code
I mean, if you were talking about using proprietary code in the first step then I could imagine that you might have some kind of argument.. but it's GPL code man.. you're free to do whatever you want with it. Only when you distrib
Ironic (Score:5, Funny)
Their new spokesperson ... (Score:2, Funny)
*ducks*
Were the errors intentional? (Score:4, Informative)
If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc.) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try and tell if/when someone steals their data. If Zyugyz Road in Somecity, CA shows up in someone else's product, the legal team fires at will.
It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.
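The trap-entry check described above can be sketched in a few lines of Python. This is a hedged illustration, not how any map vendor actually does it: the record format, the function name `find_planted_entries`, and every street other than the Zyugyz Road example are assumptions made up for the sketch.

```python
def find_planted_entries(suspect_records, planted_records):
    """Return the planted (fake) records that also appear in the suspect data."""
    return sorted(set(suspect_records) & set(planted_records))

# Hypothetical fake streets the publisher salted into its own map data.
planted = {("Zyugyz Road", "Somecity, CA"), ("Qwerty Lane", "Othertown, CA")}

# Hypothetical competitor data: real streets plus one of the planted ones.
competitor = [
    ("Main Street", "Somecity, CA"),
    ("Zyugyz Road", "Somecity, CA"),
    ("Oak Avenue", "Othertown, CA"),
]

hits = find_planted_entries(competitor, planted)
print(hits)
```

A street that exists only in the publisher's data cannot turn up in an independent survey, so any hit is strong evidence of copying. The same reasoning applies to the shared Pinyin mistakes in this story.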
Re: (Score:2)
Dan East
Shame! (Score:3, Funny)
Shame on you Sohu! This is inhuman!
Right! Google is evil! (Score:4, Insightful)
Oh please... (Score:3, Insightful)
The whole bullshit, including trying to get away with just deleting the original developers' names, and press releases about "leveraging non-Google assets" is what's damning Google. It's not just that the original incident happened, it's tha
Tutorial on Chinese input (Score:5, Informative)
An IME (input method editor) accepts keyboard input and converts it into characters of a particular language. There are many different input methods for generating Chinese characters from an English keyboard, and pinyin is one of them (and the most popular one).
Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin can convert into many different characters.
A good implementation uses the following approaches:
1. Adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
2. Allow partial input for common phrases. This inputs a whole phrase at once, with each character requiring only the first letter of its pinyin. It speeds up input significantly.
So the quality of the pinyin method depends heavily on how well the input could guess and prioritize the guesses, and thus the dictionary that is being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.
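The two approaches above can be sketched as a toy Python class. This is a minimal illustration under stated assumptions, not how Google's or Sohu's IME works: the class `ToyPinyinIME`, its methods, and the five-entry dictionary are all made up for the example, and a real dictionary would hold a very large number of entries with prior frequencies.

```python
from collections import defaultdict

class ToyPinyinIME:
    """Toy input method: maps pinyin keys to candidates ranked by usage count."""

    def __init__(self, entries):
        # entries: iterable of (tuple of pinyin syllables, Chinese text)
        self.candidates = defaultdict(list)   # key -> candidate words
        self.use_count = defaultdict(int)     # word -> times the user picked it
        for syllables, word in entries:
            self.candidates["".join(syllables)].append(word)
            if len(syllables) > 1:
                # Approach 2: also index phrases by each syllable's first letter.
                initials = "".join(s[0] for s in syllables)
                self.candidates[initials].append(word)

    def lookup(self, key):
        # Approach 1: most frequently chosen words first. sorted() is stable,
        # so words with equal counts keep their dictionary order.
        return sorted(self.candidates.get(key, []),
                      key=lambda w: -self.use_count[w])

    def choose(self, word):
        """Record that the user selected this word."""
        self.use_count[word] += 1

# Hypothetical miniature dictionary.
entries = [
    (("shi",), "是"),
    (("shi",), "十"),
    (("shi",), "时"),
    (("zhong", "guo"), "中国"),
    (("zhe", "ge"), "这个"),
]

ime = ToyPinyinIME(entries)
before = ime.lookup("shi")     # dictionary order
ime.choose("时")
ime.choose("时")
after = ime.lookup("shi")      # "时" has moved to the front
phrases = ime.lookup("zg")     # initials match both two-character phrases
print(before, after, phrases)
```

Even in this toy the ranking code is trivial and all of the value sits in `entries`, which is why the dictionary is the part worth copying.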
The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary by just installing its input method, and Google has likely obtained it this way. However, I don't think it's in an open-standard format, so Google probably had to do some reverse engineering to be able to actually use it in its own software.
That shouldn't be copyrightable (Score:5, Interesting)
Let's just imagine how I might create this list. I would have to hire people who spoke Chinese. Then I would ask them to record the pronunciation of each character that they know. This is pretty easy because in Chinese each character has only one pronunciation (per dialect, anyway). There are about 3500 characters that you need to know in order to be literate. And all of these people would have learned these at school.
But how did they learn them? Well, they had a textbook and they memorized the list from the textbook.
Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.
But let's say I do that. Now I have a list of the 3500 most common characters. And with that, I've probably got 99% of everything that's in a newspaper. But that's probably not good enough. I probably want a list of say 60,000 characters. Otherwise it's pretty useless in a general sense. Uncommon characters are uncommon, but you *will* bump into the words over time.
So where do I find these characters? Can I hire some guy that knows them all? It would be very difficult. The best place to look is in a book. But wait... what am I going to do? Every time I find a character my people don't know, look it up in a book? Why don't I just copy it from the book in the first place? That's just copyright infringement again.
Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.
But if I'm going to have a community project like this, what the heck do I need copyright for? What am I protecting? If everyone is going to contribute, everyone should benefit.
So, personally, I don't think one should have copyright on this kind of material (same thing for spelling). It's just not in the public interest. This goes doubly so now that we have the internet and creating these kinds of projects is very inexpensive.
OK, I've gone on long enough... But one more rant. What's with this "do no evil" thing? Isn't that setting the bar a little low? If I told my parents that I'd work hard not to be evil, I think they'd be somewhat disappointed in me. If Google wanted to actually "do some good" rather than "do no evil", they could start a community project to collect this data and share it with the world.
Sigh... I guess we'll have to wait for some guy in his garage (but here's betting that someone has already started something).
Re: (Score:3, Informative)
In the case of Mandarin, while it is the case most of the time that each character has only one pronunciation, there are cases where a character may have a different reading depending on the compound word it is in. The case with simplified Chinese as per the mainland makes this even more burdensome as multiple characters with different tones or different pronunciations altogether were combined to make
Finally we steal some IP from them! (Score:3, Funny)
Ok fine, we have stolen from them before... but Beef and Broccoli don't count.
Mistakes are (Score:2)
Re:Any surprise this was done in China? (Score:4, Insightful)
I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
Parent
Oblig futurama quote (Score:5, Funny)
Parent
Re:Do no evil? (Score:4, Interesting)
Parent
Re: (Score:3, Funny)
Re:Is this... (Score:5, Interesting)
The etymology of the word gook is interesting, because it may be one of the few racial slurs that originated with a people's term for themselves. In Korean, guk means "country" and by extension a country's people; when it is not modified (cf. waiguk, outside country, foreigner) it is understood to be Korea or its peoples. Speakers of Chinese will recognize the word as having Sinitic origin (guó, country, and wàiguó, foreign country, respectively, in Mandarin).
The term was appropriated by the Americans during the Korean war and used as a racial slur for Korean people in general, which must have been confusing to the Koreans (imagine someone using "American" as a slur for Americans to get an idea). Then, in Vietnam, the old "Asians are all the same" mentality prompted GIs to extend its meaning (imagine "American" being a racial slur for all white people, for example -- yes, I know many Americans aren't white, it's not a perfect analogy, deal with it).
Parent