The Internet Technology

New Online Dictionaries Automate Away the Linguistic Middleman

An article in The New York Times highlights two growing collections of words online that effectively bypass the traditional dictionary publishing system of slow aggregation and curation. Wordnik is a private venture that has already raised more than $12 million in capital, while the Corpus of Contemporary American English is a project started by Brigham Young professor Mark Davies. These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary in their emphasis on avoiding human intervention rather than fostering it. Says Wordnik founder Erin McKean in the linked article, 'Language changes every day, and the lexicographer should get out of the way. ... You can type in anything, and we'll show you what data we have.'

  • by hawks5999 ( 588198 ) on Sunday January 01, 2012 @02:34PM (#38557316)
    You can type in anything and we'll show you the data we have sounds a lot like Google search.
  • CCAE is an annotated corpus more than a dictionary. It counts words, word co-occurrences, etc. It's also manually annotated with parts of speech and other such things, not fully automated. Its scope is bigger and more recent than what was possible before computers, but the general idea is ancient: 18th-century classicists would manually compile frequency and word co-occurrence tables for ancient languages to try to get an understanding of their structure.

    • Having access to a good corpus is really helpful, but once you start hitting the 2k word count additional entries aren't really that helpful to anybody other than hardcore linguists. At that point it's generally more helpful to have information about what words frequently travel together and where they're likely to appear in a sentence.

      • It depends on what you're doing. I've spent a while dealing with the Scottish Corpus of Texts and Speech, and there the size is around four million words. If you're doing anything based upon dialects, size does make a very real difference, because you're interested in the density of usage by area. Personally, even in a non-linguistic context, I find it useful to know whether someone in x is likely to know (by virtue of using) a word y.

      • If you're a trained computational linguist then obviously yes, you can generate this information (collocations and concordances) from corpora. The COCA actually has some pretty neat features along those lines, but it's not user-intuitive.
  • At the risk of being elitist, I wonder if I should adjust my use of language to that of the average American.

    • It's inevitable, language always adjusts to popular usage eventually, even with guards in place that act as filters.

      Though I still cringe when people say they "could care less."

      Not that all rules set in place by self-anointed authorities make sense. I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.

      • by vlm ( 69642 ) on Sunday January 01, 2012 @03:05PM (#38557550)

        Though I still cringe when people say they "could care less."

        That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.

        • Re: (Score:2, Informative)

          by Anonymous Coward
          You're sentence could of been improved if you had leveraged a preposition to end it with.
        • by Trepidity ( 597 )

          To be quite honest, it's not an, uh, a very uncommon pattern of speech, if I may say so, to interject one's spoken English with, discourse... discourse particles, and, well, other minor disfluencies, which do--- which do vary by social class, but more in, uh, word choice than in what you might call actual frequency.

        • Re:Good idea? (Score:4, Interesting)

          by bigstrat2003 ( 1058574 ) on Sunday January 01, 2012 @03:40PM (#38557730)
          This post proves that there should be a "made my brain explode" moderation option.
        • Though I still cringe when people say they "could care less."

          That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.

          I live in the corner of a quad of homes that creates an interesting
          amplifying effect of sounds, within the area. So that a house that
          is completely on the other side, hundreds of feet away, you can
          clearly hear people talk. [Yeah, it DOES suck].

          So, the other day, I heard this teen-thing speaking to her folks
          and about the 20th like, I was gonna "say loudly" since that's
          all one has to do...

          "Like will you shut the fuck up"

          But tis the season and all that crap.

          -AI

        • For all intensive purposes, yes

      • I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.

        So I'm not the only one? Yeah!!! Although I believe that you may have a question mark outside of quotes if the sentence (and not the quoted material) is a question.

      • Re:Good idea? (Score:5, Insightful)

        by Samantha Wright ( 1324923 ) on Sunday January 01, 2012 @05:48PM (#38558558) Homepage Journal
        Oh, that's purely typographical. When moving blocks of metal type around, a full-stop/period or comma is more delicate than a quotation mark, since it's only x-height and not capital letter height. Typographers got in the habit of putting them on the inside to keep them safe. That's also why certain ligatures of f and the long s were preserved from scribal writing: those letters were designed to hook over others, and if the next letter was tall then it would create a structural instability (an x-height hole.) If modern punctuation had evolved before the invention of moveable type, we would probably put the quotation mark directly above the other punctuation mark, and use logical punctuation for ? and !. However, it didn't, so it was all put inside to stay consistent.

        To be honest, I find it visually more pleasant. After looking at code that passes strings around as arguments in C-style imperative languages all day, it's nice to see something without a big gap on the baseline (this "is," an "example", for you.) Since the quotation mark is already floating up and away from the letters, it's less jarring to see it separated from the word than a comma or period. (This is more or less the modern aesthetic justification for keeping it the traditional way. However, modern typographers don't always agree with traditionalists: watch what happens when you point out that the "single" space used to separate sentences prior to the invention of the typewriter was actually larger than a standard double space.)
        • What about punctuation for other languages such as and or the Spanish inverted question mark at the beginning and ? at the end of a question

          • I didn't know this one off the top of my head, but Wikipedia says [wikipedia.org] they were introduced in Spanish in 1754 because there's no way to recognize that a sentence is a question just from looking at the words; it's purely a tonal difference—and for really long sentences it can get disorienting if you have to go back and re-read it because you just found out that it was a question when you got to the end. I imagine the exclamation point was just made to be consistent.

            What were the other symbols you tried
  • by Anonymous Coward

    Let's eliminate the making-sense and explaining that human beings can do. The absurdity of most spell check and voice recognition "did you mean" suggestions doesn't give me much hope that it's all just a matter of having enough data. Yes, Google can seem almost prescient, but only if thousands of other people are looking for the same things as I am. When I could really use a hint, Google never comes up with something useful. On the contrary, then I have to coax it not to replace my carefully selected search

  • What are these guys? All we get is what they're not:

    traditional dictionary publishing system

    slow aggregation

    curation

    crowd-sourced effort

    human intervention

    I'm guessing they are also not street taco vendors, Catholic priests or Christmas tree salesmen. Great, that really narrows it down. So, what are they? I mean in terms of workflow, or data diagrams, or even user experience. And who are their users, anyway? Unless they provide a really good reason, the rest of the world will continue to use wikipedia/wikimedia products, google (let's face it, mostly google), and the urban dictionary (dare I invoke encycl

  • by PPH ( 736903 ) on Sunday January 01, 2012 @03:08PM (#38557556)

    ... if it's used, it is automatically entered into this 'dictionary'. On one hand, I shudder to think of the direction that various languages might take. On the other hand, there could be hope for words like malamanteau [xkcd.com]. That seems perfectly cromulant to me.

  • by Compaqt ( 1758360 ) on Sunday January 01, 2012 @03:09PM (#38557566) Homepage

    Obviously, I'd suppose you still needed a few lexicographers to come up with the system.

    And to maintain it, right?

    The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?

    • Obviously, I'd suppose you still needed a few lexicographers to come up with the system.

      And to maintain it, right?

      The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?

      Syntax error on line(s): 1 thru 1
      Ambiguous contraction in "I'd".

      Syntax error on line(s): 1 thru 1
      Mixed tense in "still needed".
      Note: Root word "need" satisfies the expression.

      Syntax error on line(s): 3 thru 3
      Incomplete sentence.

      Syntax error on line(s): 5 thru 5
      Expected colon after "be" in "to be when".

      Syntax error on line(s): 5 thru 5
      Expected capitalization of "when" in "to be when".

      Syntax error on line(s): 5 thru 5
      Extraneous comma.
      Note: This message is generated only once for multiple errors.

      Point taken: Screw the Lexicographers!

  • So if I type in "anything" I won't get just an interpreted response
    but really -- what... everything?

    bjd

  • by NaCh0 ( 6124 ) on Sunday January 01, 2012 @04:05PM (#38557870) Homepage

    I wonder what kind of sales pitch it takes to get $12 million for a free web dictionary.

    'Just imagine if we could provide 100 definitions from other people for the word "butt", how much is that worth to you?'

  • Telivision (Score:5, Insightful)

    by aembleton ( 324527 ) <aembleton@gmaiRASPl.com minus berry> on Sunday January 01, 2012 @04:14PM (#38557932) Homepage
    It doesn't detect that telivision is an incorrect spelling because there are so many authoritative examples of that spelling: http://www.wordnik.com/words/telivision [wordnik.com]

    Google seems to do a good job of detecting spelling errors and automatically updating its dictionary, and of course it also shows you websites where that word is used. I don't really see what Wordnik provides.
    • Re: (Score:2, Troll)

      by AK Marc ( 707885 )
      What's funny is that 4 of the top 5 examples are by conservatives attacking liberals (and one transcription error on a CNN interview). What's that say about where our language is going and who is taking it there?
      • Language use, and interpretations thereof, is not politically bias-free. Even more so opinions, scientific or not, on language use.
        • by AK Marc ( 707885 )
          If we find one group, say Wal-Mart shoppers, who use words that don't exist like misunderestimate and nuk-u-lar more than others, does that mean anything? And if so, what?
    • I second this notion. I frequently use the define: $searchTerm query with Google.

      For example: telivision [google.com],
      or: Wordnik [google.com]

      Compare the latter to the same search on Wordnik: Wordnik [wordnik.com]

      Bonus: Those Google links are wrapped in TLS, so no one sees the query terms or results in transit. https://www.wordnik.com/ [wordnik.com] takes you to their developer site...

  • we've eliminated the middle man by letting users submit whatever they want, and pocketing all the money!
  • These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it.

    You make it sound like they're completely removing the human elements. And just, a corpus by nature does that, as they're only really involved in setting the bounds of the collection and letting the authors speak for themselves. Wordnik, on the other hand, allows *anyone* to contribute, but they're not allowe

    • That's a stupid idea. To use an analogy that Slashdot understands: a traditional dictionary is like a standards document. It's useful to promote interoperability between speakers both during a single transaction (conversation between two parties), and also in log files (written documents to be read again later).

      Collecting random words on the web into a dictionary is like getting rid of standards altogether, or saying that every piece of software out there, no matter what it does, is standards compliant. W

      • Current generation nonsense, it's high time we return to Latin. Ita et vos per linguam nisi manifestum sermonem dederitis, quo modo scietur id quod dicitur? eritis enim in aëra loquentes.
        But I'd accept Old English.

      • Do you really mean to tell me that you only use words as they're defined in the dictionary? And if so, which dictionary? Because as we all know, there's lots of different standards out there. And then there's versioning of the standards, and those implementations that aren't quite compliant (in language, those would be regional dialects). Language is not as cut and dried as you think it might be.

        But your suggestion is actually done in other countries -- the French have a government group that officially

  • Regarding Wordnik, I don't think Rick Santorum is going to be a fan of their site.

  • All of the interviewed persons as well as the author of the NY Times article leave a major issue unmentioned, and that is historical word use. As a very enthusiastic user of the Oxford English Dictionary (yes, it has the place of honour in my living room), each time I look up a word in the venerable OED I am amazed at the thick and variegated strata of historical meaning, and the gradual shifting in it, even for words we think of as "simple".

    To wit, neither the Wordnik nor the CCAE person mentioned these
