The Internet Google Technology

How Do You Visualize 100 GB of Google Text Data? 117

An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac." There are a lot of these things, and they're really interesting to browse through.
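For readers unfamiliar with the terms in the summary: in this context a bigram is a run of two consecutive words and a trigram a run of three. The following is a minimal sketch of how such counts could be tallied. It is not Chris Harrison's method, and the sample sentence and the helper names `ngrams` and `follower_counts` are invented for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

def follower_counts(tokens, first_word):
    """Count the words that directly follow first_word (bigram counts)."""
    return Counter(b for a, b in ngrams(tokens, 2) if a == first_word)

# Toy corpus, made up for this example; it is NOT the Google data set.
text = ("he argues that she loves music and he argues again "
        "while she loves art and he says no")
tokens = text.lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Most common followers of each pronoun in the toy corpus.
print(follower_counts(tokens, "he").most_common())
print(follower_counts(tokens, "she").most_common())
```

On a real corpus the same tallies would be far too large to hold in a single in-memory counter, which is presumably why the published data set ships precomputed counts.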

  • by grub ( 11606 )

    Was this "anonymous reader" the guy who owns the blog?
    • Re: (Score:2, Insightful)

      by Anonymous Coward

      You're not missing anything - the images are unreadable even at 200% or more.

      Anyway, I don't get what they're illustrating. Word relations? So what.

      This is a "Digg" sort of submission ... back over to Fark for me.

    • by icebike ( 68054 )

      Quite possibly.

      That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting or unexpected.

      Word association games have been played for centuries.

      Picking sets of following words given any first word is child's play, and doing it by computer is pretty meaningless until you add other characteristics, such as regional differences, time differences (50 years ago vs Today) or something to actually reveal something useful.

      More interesting would

      • by Desler ( 1608317 )

        That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting or unexpected.

        That wasn't part of the original submission. That was added by Taco.

  • due to the server being slashdotted. Anyone have a mirror or alternate link?
  • his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

    • by Anonymous Coward

      by jcombel (1557059) writes: Alter Relationship on 01-11-11 11:11 (#34837900)

      Sweet, 511!

    • by icebike ( 68054 )

      his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

      Scalability is my guess. I found that using Chrome I could zoom such that the smallest text is visible (within the browser). Same with Foxit PDF reader.

      No unreadable lines seen here.

      • If he wanted scalability, he should have saved them in SVG format. As it stands now, I can't read them; Okular is rendering them pretty weirdly.
        • by icebike ( 68054 )

          I have trouble with Okular and Adobe Reader on linux as well.
          I suspect some form of embedded font was used that works well on Windows but not elsewhere.

          Oddly enough, google chrome's internal sandboxed pdf rendering engine has no problem on Windows or Linux, and since that is my normal browser I didn't even notice problems on Linux.

          • I suspect some form of embedded font was used that works well on Windows but not elsewhere.

            Doesn't work on Windows either. And why would embedded fonts be platform-dependent anyway? Don't PDF renderers do document rendering internally?

            I suspect that the PDF files are simply faulty.

            • by icebike ( 68054 )

              I suspect that the PDF files are simply faulty.

              Then how do you explain they work just fine for me on Win 7 and also in Google Chrome regardless of platform?

      • Works fine in Safari (on Mac) at maximum zoom, the smallest text appears like a 36pt font, with no jaggies...

        • Like I thought. It's a font issue. What font is it?

          Linux/Windows people: If you want to view these things, you need to get some Mac fonts.

          • by icebike ( 68054 )

            Linux/Windows people: If you want to view these things, you need to get some Mac fonts.

            Not true.

    • by jandrese ( 485 )
      Yeah, they're totally unreadable (missing blocks everywhere) with Acrobat reader.
    • I'm on XP and with Adobe Reader X 10.0.0, had the same black line overlay problem at all zoom levels. Dunno why.
    • Agreed. They're illegible even if I view them in the latest version of Adobe Reader on either Linux or Windows. They're not images, though, they're text rotated using PostScript/PDF commands. Any reports from the iPeople? It may be a font issue.

    • his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

      Same problem here with XP and Adobe Reader 10.

    • by fyndor ( 895340 )
      The answer to why is that it is not a graphic/image. It is text shaped in a "half circle". I use Chrome, and as others say, that works. He probably didn't notice the problem because he likely uses Chrome (and so should you? After all, it is freaking fast as hell; I use it for my day to day). SVG seems like a bad idea as well because it is not supported by IE except for v9 beta (which btw renders this incorrectly as well). I am not even sure what he should have used since it's not a good idea to either pu
  • by presidenteloco ( 659168 ) on Tuesday January 11, 2011 @02:14PM (#34837936)

    With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

    Yes it will be biased and partial and rough, but it's a good start.

    More formal reasoning and association techniques, such as bayesian stuff, logic, etc will also be needed for general AI, but for the
    knowledge base to be grounded in human concerns and human perceptions; that's a key to an ai we can relate to and which can
    relate to us.

    I imagine this kind of semantic network will be usable for google 2.0 "pre-emptive search" or "my virtual social planner and concierge".

    • Yes it will be biased and partial and rough...

      Just like most humans.

      More formal reasoning and association techniques, such as bayesian stuff, logic, etc will also be needed for general AI...

      Because we all know that most people use reasoning and bayesian logic everyday.

    • Semantics is all fluffy and stuff, but you are nowhere near AI until the computer can actually comprehend meaning. Semantics is just yet another buzzword for 'dead data, somewhat organized, but still dead, which we hope will make AI'. Building larger or better organized datasets will get us nowhere if we cannot put the initial 'cogito, ergo sum' into the machine. (And yes I know the 'cogito' is not the ultimate first thought of any mind.) The defining characteristic of life is the fact that data has meaning

      • by tb()ne ( 625102 )

        The defining characteristic of life is the fact that data has meaning to it.

        I'm guessing most biologists would disagree.

        • I'm guessing most biologists would disagree.

          Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.

          • I'm guessing most biologists would disagree.

            Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.

            ...right. Because biosemiotics is a field dedicated to studying how living organisms process and interpret data. Your statement is tautological. Biosemioticists have no question because their field is predicated on it.

            Making the claim that anything is the 'defining' characteristic of life is a little rash, because the definition of life is still kind of up in the air. Clearly, there is some disagreement as to what constitutes life.

      • Semantics is all fluffy and stuff, but you are nowhere near AI until the computer can actually comprehend meaning.

        It already is. A hard drive controller comprehends the meaning of alternating magnetic patterns on the disk: a sequence of ones and zeroes. A processor comprehends a higher-level meaning: a stream of assembly instructions. An operating system comprehends the yet higher level of meaning: a page of code belonging to firefox.exe that was just swapped in and began executing.

        This phenomenon should b

        • by glwtta ( 532858 )
          Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

          We don't know. We don't have even the faintest beginnings of a "theory of intelligence".

          Which doesn't mean that you can just ignore it, start throwing data at simplistic machines and expect (strong) AI to just happen.
          • Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

            We don't know. We don't have even the faintest beginnings of a "theory of intelligence".

            Yes we do. We have a whole branch of science [wikipedia.org] concerning the matter. Which is precisely why I asked: the grandparent post sounds suspiciously like semi-mystical pseudophilosophy that gets thrown around because people don't actually want to know how their minds work and prefer to think them as ma

            • by glwtta ( 532858 )
              Yes we do. We have a whole branch of science concerning the matter.

              Sure we do, it's just that so far they have not come up with anything concrete. Oh, they've done lots of work poking around the edges, but the main question is still pretty much the same - what is intelligence? Perhaps "not the faintest beginnings" was a little strong, I'll rephrase as "have not made consistent progress towards" understanding intelligence.

              Which is precisely why I asked: the grandparent post sounds suspiciously like se
        • The _engineer_ behind the hard drive controller comprehends the meaning of alternating magnetic patterns on a disk. The hard drive controller or any automated system comprehends it no more than a clock comprehends time. Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming. And making an honest face while selling hot air.

          • Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming.

            Your brain is a clockwork mechanism, yet it somehow manages to be "smart", or at least appears that way to you.

        • So, I'd say it's simply a matter of overall complexity whether we'd call something alive or not.

          I'm sure the whole internet is more complex than an amoeba, but that doesn't mean it's alive.

    • by icebike ( 68054 )

      I doubt you can derive human like artificial intelligence from simple word order frequency charts.

      People, or at least intelligent people, start saying something with a destination in mind, not simply to mimic some statistical summary.

      Word order charts made today will be different in 6 months, as new phrases enter common usage, but does that mean human relationships or topics change that much over 6 months?

      This reminds me more of the Bing TV ads than anything else.

      • I doubt you can derive human like artificial intelligence from simple word order frequency charts.

        It's been done already, and the resulting AI [mit.edu] was good enough to get three papers submitted to a computer science conference.

    • With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

      Wouldn't it make more sense to simply point it to Wikipedia?

    • by tgv ( 254536 )

      No, this just leads to symptom modeling. There is no relation between "he" and "argues" or "she" and "loves" other than that they occur more frequently in the texts that comprise the corpus. I've done corpus studies, and if you look at word frequencies from a certain corpus, i.e. unigrams, they look ok, until you compare them to another one. One of them had 3rd person personal pronouns high, but the rest low, but in another, the 1st person singular (I) was the most frequent word. The difference? The former

  • I'd like to see what the correlation is between the two words "microsoft" and "sucks".
  • Go to the http://ngrams.googlelabs.com/ [googlelabs.com] site and compare word frequency between 'pirates' and 'ninjas'. Please.

  • Just use grep, or vi with a heavy object on the down-arrow key. What did I win?

    • by JamesP ( 688957 )

      nah, just use cat and read really fast

      • nah, just use cat and read really fast

        RYRYRYRYRYRYRYRYRY...

        This is an obscene abuse of a perfectly innocent program meant to concatenate files.

        I'll have you know I've called the Unix Police and they will be picking you up shortly.

        And you don't have to read fast. All you need is a 45.5 baud teletype machine and filename > /dev/tty

        Personally I prefer to read the punchtape directly though ... with a torch.

  • by Kupfernigk ( 1190345 ) on Tuesday January 11, 2011 @02:22PM (#34838020)
    He does these really interesting data visualisations and publishes them for free - and what do people do?

    "Was this "anonymous reader" the guy who owns the blog?"

    "his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.

    I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)

    • by Anonymous Coward

      I'm sorry, but there is no rationale to call this ( http://gyazo.com/57fe0a7de30d5bfbbeb4998b74730fc3.png ) GOOD. Who failed here? Sure, Adobe has their part after .pdf "being demonstrated" [sic!] as a very "robust" format at the 27c3 (you can put all kinds of shit into an uncompiled pdf - it will compile and execute on launch without asking).

      But I have done comparably complex graphics in pdf and those did not fail - so what's the problem? I use win7x64.

      • by Anonymous Coward

        Windows XP (at work), and I've got the same problem.

    • by FrankDrebin ( 238464 ) on Tuesday January 11, 2011 @02:36PM (#34838204) Homepage

      'he' is often tied to 'argues,'

      I don't agree.

    • Why are all the posters bitching about the PDFs ACs?

      I ask simply because I have viewed them today on the latest Chrome on Ubuntu 10.10 and Windows 7, and I cannot reproduce the problem, even on a crappy 4 year old laptop.

  • Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example

    cold
    winter steel case turkey
    blood ...
    weather ...
    spring ...
    air ...
    water ...
    springs spots ...
    products new spot ...
    hot

    with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.

    • Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example

      cold

      winter steel case turkey

      blood ...

      weather ...

      spring ...

      air ...

      water ...

      springs spots ...

      products new spot ...

      hot

      with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.

      I think that Tufte would agree with you.

  • by Beelzebud ( 1361137 ) on Tuesday January 11, 2011 @02:26PM (#34838072)
    We'll start off by imagining 1 GB of data. Now multiply that by 100!
  • Looking at the Cat vs. Dog picture, all I can say is, "What's wrong with dog people?"

    • Well you don't see people talking about having sex "kitty style" now do you? So some of the hits on dog may be due to that and not just people who like to feed their dog peanut butter...
  • That one is kind of disturbing.
  • Dog-Cat chart NSFW

  • How Do You Visualize 100 GB of Google Text Data?

    Easy:

    $$$$$$$$$$

  • Visualization = Dark Background + Light Words + Pretty Lines

    How does that give me any sort of understanding of the content?

    • I agree somewhat. The problem is twofold: graphing libraries do the same things and there is not much meaning to be had in the raw data. For the former item, many visualization libraries are designed to display graph/network data somewhat gracefully. Consequently, many visualizations center around "how do we put this thing in graph form?" rather than "what interface naturally explains this data best?" The second problem is that this huge morass of data just has frequency counts and n-grams. So, we sorta know

  • by foobsr ( 693224 ) on Tuesday January 11, 2011 @03:08PM (#34838578) Homepage Journal
    ... progress.

    Corpus linguistics

    http://en.wikipedia.org/wiki/Quantitative_linguistics [wikipedia.org]

    Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.

    CC.
  • I can guess some of the words... but it required blowing up the pics to 2400% and I was using Adobe PDF.

    My only question is what to do with it. If you are trying to add keywords that will make your site more search worthy, I can understand, or to show a line of thinking how people associate terms. 'Hot and cold' gets you to "environment" "water" "pool"... Might be fun for word association tests.

  • It's easy to visualize 100GB of data. Just view it as a percentage of the Library of Congress -- e.g. a door, or small closet.
  • I wish people would stop using the words "bigram" and "trigram" incorrectly. The "-gram" suffix comes from a Greek word for "a written character", the same root is in the word "grapheme". Hence bigram == a two-character substring, and trigram == a three-character substring. And these words are actually being used in the correct sense as well. Two-word and three-word substrings should IMHO be called "bilexes" and "trilexes", or something similar. But a good first step is to stop calling them bigrams and trig
  • File -> Print
  • With your eyes. Your eyes.
