
Google Admits to Using Sohu Database 209

prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged that Google improperly tapped its database for its Pinyin IME product, stirring controversy over whether the two databases were similar simply as a result of the normal research process. Today Google admitted that its new product for the Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"
This discussion has been archived. No new comments can be posted.

  • by SuperBanana ( 662181 ) on Monday April 09, 2007 @09:22PM (#18670025)

    If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc.) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try to tell if and when someone steals their data. If Zyugyz Road in Somecity, CA shows up in someone else's product, the legal team fires at will.
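    A minimal sketch of how such a check might work, assuming the provider keeps a private list of planted entries. The street names, the `TRAP_STREETS` set, and the `find_trap_hits` helper are all hypothetical and do not describe any vendor's actual process.

```python
# Hypothetical canary ("trap") entries that exist only in our own dataset.
TRAP_STREETS = {
    ("Zyugyz Road", "Somecity, CA"),
    ("Qwelp Lane", "Othertown, CA"),
}

def find_trap_hits(competitor_rows):
    """Return the trap entries that also appear in a competitor's data."""
    # Normalize case and whitespace so trivial edits do not hide a match.
    seen = {(name.strip().lower(), city.strip().lower())
            for name, city in competitor_rows}
    return sorted(
        trap for trap in TRAP_STREETS
        if (trap[0].lower(), trap[1].lower()) in seen
    )

competitor = [
    ("Main Street", "Somecity, CA"),   # a real street proves nothing
    ("zyugyz road", "Somecity, CA"),   # a planted street is hard to explain away
]
hits = find_trap_hits(competitor)
print(hits)
```

    Real streets appearing in both datasets prove nothing; only the planted entries carry evidential weight, which is why they are kept secret.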

    It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.

  • In my mind, there is some question of whether a database of facts should, in fact (hee hee), be copyrightable at all. The characters were not original. The pinyin is not original. The pinyin for each character is, in fact, well established. Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?

    It reminds me of a court case a few years ago in Thailand, where a judge put several Thai fonts into the public domain, stating "No one owns the Thai alphabet. It belongs to the people."
  • by tooyoung ( 853621 ) on Monday April 09, 2007 @09:45PM (#18670171)

    OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

    I suspect that there's more to this story that we're not hearing.


    Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. The issue is whether or not this data was publicly available.

    For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person who compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' as the letter 'n'. Person B comes along and uses the same data set to train a naïve Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
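    The contrived example above can be sketched in a few lines. To keep it short, a nearest-centroid classifier and a 1-nearest-neighbor classifier stand in for the neural net and the naïve Bayes classifier, and each "image" is reduced to a single made-up number; all names and values here are assumptions for illustration.

```python
from statistics import mean

# Toy 1-D "images": true 'n' samples sit near 1.0, true 'q' samples near 5.0.
# Whoever compiled the data set swapped the two labels.
TRAINING = [
    (0.9, "q"), (1.0, "q"), (1.2, "q"),   # really 'n'
    (4.8, "n"), (5.0, "n"), (5.3, "n"),   # really 'q'
]

def centroid_classifier(training):
    """Classifier A: predict the label whose mean feature value is closest."""
    by_label = {}
    for x, label in training:
        by_label.setdefault(label, []).append(x)
    centroids = {label: mean(xs) for label, xs in by_label.items()}
    return lambda x: min(centroids, key=lambda label: abs(centroids[label] - x))

def nearest_neighbor_classifier(training):
    """Classifier B: predict the label of the single closest training sample."""
    return lambda x: min(training, key=lambda sample: abs(sample[0] - x))[1]

classify_a = centroid_classifier(TRAINING)
classify_b = nearest_neighbor_classifier(TRAINING)

# Fresh samples paired with their true labels.
TEST = [(1.1, "n"), (0.8, "n"), (5.1, "q"), (4.9, "q")]
for x, truth in TEST:
    print(x, "truth:", truth, "A:", classify_a(x), "B:", classify_b(x))
```

    The two classifiers share no code, yet they agree with each other and disagree with the truth on every test sample, because both inherited the swapped labels.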

    I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Given that Google is in the web-mining business, I would guess that they just grabbed this data off the net, and someone forgot to think about whether they had the right to use it.
  • by microbee ( 682094 ) on Monday April 09, 2007 @10:02PM (#18670255)
    There are a lot of misunderstandings about how an IME works and how Google copied non-public databases. So let me explain.

    An IME (input method editor) accepts keyboard input and converts it into characters of a particular language. There are many different input methods for generating Chinese characters from an English keyboard, and pinyin is one of them (and the most popular one).

    Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin can convert into many different characters. In general, the same sequence could map to many different words (possibly several dozen), and you usually need to select among them by pressing 1, 2, 3, ... (the input bar displays the candidates for you to choose from, sometimes requiring page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.
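    To make the aliasing problem concrete, here is a rough sketch of a naive lookup, assuming a tiny hand-made dictionary. The `CANDIDATES` table, the page size of nine, and `candidate_page` are illustrative assumptions, not taken from any real IME.

```python
# Hypothetical mini-dictionary: one toneless pinyin syllable, many characters.
CANDIDATES = {
    "shi": ["是", "时", "事", "十", "市", "使", "世", "式", "师", "史", "石", "室"],
}
PAGE_SIZE = 9  # assumed: candidates on one page are picked with keys 1-9

def candidate_page(pinyin, page=0):
    """Return the numbered candidates shown on a given page of the input bar."""
    words = CANDIDATES.get(pinyin, [])
    start = page * PAGE_SIZE
    return list(enumerate(words[start:start + PAGE_SIZE], start=1))

print(candidate_page("shi", 0))  # first page: nine numbered choices
print(candidate_page("shi", 1))  # the rest needs a page-down
```

    With a fixed ordering like this, any character that happens to sit on the second page costs an extra keystroke every single time, which is the slowness described above.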

    A good implementation uses the following approaches:
    1. Adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
    2. Allow partial input for common phrases. This enters a whole phrase at once, with each character requiring only its first English letter. It speeds up input significantly.
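    The two approaches above can be sketched as a toy class. This is only an illustration under stated assumptions: `TinyPinyinIME`, its method names, and the space-separated pinyin keys are hypothetical, and a real IME would persist usage counts and use far more sophisticated ranking.

```python
from collections import Counter

class TinyPinyinIME:
    """Toy model of usage-based ranking and initial-letter phrase input."""

    def __init__(self, dictionary):
        # dictionary maps pinyin (syllables separated by spaces) to a word list
        self.dictionary = dictionary
        self.usage = Counter()
        # Index entries by the first letter of each syllable: "zhong guo" -> "zg".
        self.by_initials = {}
        for pinyin, words in dictionary.items():
            initials = "".join(syllable[0] for syllable in pinyin.split())
            self.by_initials.setdefault(initials, []).extend(words)

    def candidates(self, typed):
        """Look up full pinyin first, then fall back to initials."""
        words = self.dictionary.get(typed) or self.by_initials.get(typed, [])
        # Stable sort: most-used first, ties keep dictionary order.
        return sorted(words, key=lambda word: -self.usage[word])

    def select(self, word):
        """Record that the user picked this word."""
        self.usage[word] += 1

ime = TinyPinyinIME({
    "shi": ["是", "时", "事"],
    "zhong guo": ["中国"],
    "zhi gao": ["至高"],
})
ime.select("事")
ime.select("事")
ime.select("时")
print(ime.candidates("shi"))  # ranking now follows past selections
print(ime.candidates("zg"))   # two letters bring up whole phrases
```

    Note that both features depend entirely on the dictionary: the phrase list determines what "zg" can expand to, and the initial ordering determines how good the suggestions are before any usage has been recorded.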

    So the quality of a pinyin input method depends heavily on how well it can guess and prioritize its guesses, and thus on the dictionary being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.

    The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary just by installing its input method, and Google likely obtained it this way. However, I don't think it's in an open-standard format, so Google probably did some reverse engineering to be able to actually use it in its own software.

  • by PassBy ( 1086365 ) on Tuesday April 10, 2007 @12:11AM (#18671299)
    I think you are misunderstanding how a Pinyin input method works. But anyhow, it is rumored that Sohu put some "database fingerprints" in its database. That is, there are hard-coded sequences of Chinese characters that you wouldn't normally get by typing the corresponding English letters (e.g., the names of some Sohu employees). The mistake confirmed by Chinese users is in fact a misspelling: a Chinese comedian's name, which should be spelled "feng gong" (two characters), can only be produced by typing "ping gong" in both IMEs. Let me try to explain why this is obvious proof of "leveraging". Names of people and other things in Chinese are mostly combinations of characters that have no logical or other connection to one another. That means an algorithm alone won't produce a name just from its pronunciation; the name has to be in the dictionary, so an identical wrong entry in both products points to a shared source.
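    A rough sketch of the comparison being described, assuming each dictionary can be reduced to pinyin-to-word pairs. The `shared_anomalies` helper and all three sample dictionaries are hypothetical; the characters 冯巩 are my assumption for the comedian's name, and "qq ww" is a made-up stand-in for a planted fingerprint.

```python
def shared_anomalies(dict_a, dict_b, reference):
    """Entries present in both A and B that an independent reference lacks."""
    common = set(dict_a.items()) & set(dict_b.items())
    return sorted(common - set(reference.items()))

# Independent reference with the correct spelling.
reference = {"feng gong": "冯巩", "ni hao": "你好"}

# Two vendors' dictionaries that share a misspelling and a planted entry.
vendor_a = {"ping gong": "冯巩", "ni hao": "你好", "qq ww": "某某"}
vendor_b = {"ping gong": "冯巩", "ni hao": "你好", "qq ww": "某某"}

suspicious = shared_anomalies(vendor_a, vendor_b, reference)
print(suspicious)
```

    Correct entries such as "ni hao" drop out because anyone would arrive at them independently; what remains are the shared errors and fingerprints, which independent work would be very unlikely to reproduce.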
  • Well, Duke's law page [duke.edu] makes it clear that copyright is based on originality and not "sweat of the brow."

    The relevant portion:

    In 1991, the Supreme Court addressed this question in Feist Publications v. Rural Telephone Co. Feist is a publishing company specializing in area-wide telephone directories, and Rural is a public utility company that provides telephone service to Northwest Kansas. Feist had almost 50,000 white page listings in fifteen counties, while Rural had fewer than 8,000. The white pages listed the names, phone numbers, and towns of residence of all of the residents in a particular area alphabetically by last name. The two companies competed vigorously for yellow page advertisements. Feist copied Rural's collection of white page listings in order to compile its own. The district court granted summary judgment to Rural, relying on the 'sweat of the brow' doctrine, which justified protection because of the labor involved in collecting and arranging the facts.

    The Supreme Court rejected this doctrine because, with the Copyright Act of 1976, Congress made it clear that originality was a requirement for copyright protection.
    I submit that there is no originality in the character-to-Pinyin pairing, though perhaps there is in the use of the engineers' names.
  • it is not known data (Score:3, Informative)

    by phorm ( 591458 ) on Tuesday April 10, 2007 @04:34AM (#18672593) Journal
    Again, it is not the "known data" that is in question here, but the database as an object in its entirety.

    Nobody is accusing Google of "copying Chinese characters", but rather of copying a specific collection that somebody has invested time and money in creating. This is not a corpus, but rather more like a dictionary. Anyone can create one, but Google, which I have eminent respect for in other areas but not this one, has decided to take somebody else's "dictionary" rather than creating its own. The compilation existed as somebody else's work. Likely Google could have made an attempt to buy it. Equally likely, they could have produced a similar offering on their own. Instead, they chose to take another group's work, gave said group no adequate compensation, and even denied having taken it from said group.
