How Do You Visualize 100 GB of Google Text Data?
An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac."
There are a lot of these things, and they're really interesting to browse through.
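For readers unfamiliar with the terms in the summary, here is a minimal sketch of how bigram and trigram counts of this kind could be tallied. It is not Harrison's actual pipeline; the `ngrams` and `following_words` helpers and the sample sentence are made up for illustration, and a real run over Google's data set would need proper tokenization and far more memory-conscious counting.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def following_words(text, n=2):
    """Map each leading word to a Counter of the words that complete its n-grams."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    table = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        table[gram[0]][" ".join(gram[1:])] += 1
    return table


# Hypothetical sample text, chosen only to echo the summary's example.
sample = "he argues that she loves cats and he argues again while she loves dogs"
table = following_words(sample, n=2)
print(table["he"].most_common(1))
print(table["she"].most_common(1))
```

Passing `n=3` would group trigrams by their first word in the same way, which is roughly the shape of data the charts appear to draw from.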
/.ed (Score:2)
Was this "anonymous reader" the guy who owns the blog?
Re: (Score:2, Insightful)
You're not missing anything - the images are unreadable even at 200% or more.
Anyway, I don't get what they're illustrating. Word relations? So what.
This is a "Digg" sort of submission ... back over to Fark for me.
OT: old, busted news ( was Re:/.ed ) (Score:1)
Also, I'll read stories on Fark that I'll see a couple weeks later here. It's getting to the point where I don't need to come here for news anymore. Check the 'Geek' tab on Fark, or check my feeds from Techdirt, New Scientist, Wired, CNet or SciAm, and I've got all the news a good week before /. has it. Once in a while, there is a rare gem on the feed here, but it's sad, as I came here a lot a year or two ago... now, I just come here to check what the iFanboys like to say, and to hear what Linux and Microsoft
Re: (Score:2)
Quite possibly.
That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting, or unexpected.
Word association games have been played for centuries.
Picking sets of following words given any first word is child's play, and doing it by computer is pretty meaningless until you add other characteristics, such as regional differences, time differences (50 years ago vs Today) or something to actually reveal something useful.
More interesting would
Re: (Score:2)
That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting, or unexpected.
That wasn't part of the original submission. That was added by Taco.
Having trouble visualizing (Score:1)
Re: (Score:1)
Here is a copy of the PDFs, if you want to just view the results.
http://www.mediafire.com/?ua4dhfxmry2nnhn
Posting anon for non-karma whoring reasons.
Re: (Score:2)
Bing Cache [encycloped...matica.com]
pdf (Score:2)
his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?
Re: (Score:1)
by jcombel (1557059) writes: Alter Relationship on 01-11-11 11:11 (#34837900)
Sweet, 511!
Re: (Score:2)
his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?
Scalability is my guess. I found that using Chrome I could zoom such that the smallest text is visible (within the browser). Same with Foxit PDF reader.
No unreadable lines seen here.
Re: (Score:2)
I have trouble with Okular and Adobe Reader on linux as well.
I suspect some form of embedded font was used that works well on Windows but not elsewhere.
Oddly enough, google chrome's internal sandboxed pdf rendering engine has no problem on Windows or Linux, and since that is my normal browser I didn't even notice problems on Linux.
Re: (Score:2)
Doesn't work on Windows either. And why would embedded fonts be platform-dependent anyway? Don't PDF renderers do document rendering internally?
I suspect that the PDF files are simply faulty.
Re: (Score:2)
I suspect that the PDF files are simply faulty.
Then how do you explain they work just fine for me on Win 7 and also in Google Chrome regardless of platform?
Re: (Score:2)
Both platforms have excessive fault tolerance not found elsewhere?
Re: (Score:2)
Works fine in Safari (on Mac) at maximum zoom, the smallest text appears like a 36pt font, with no jaggies...
Re: (Score:1)
Like I thought. It's a font issue. What font is it?
Linux/Windows people: If you want to view these things, you need to get some Mac fonts.
Re: (Score:2)
Linux/Windows people: If you want to view these things, you need to get some Mac fonts.
Not true.
Re: (Score:1)
Agreed. They're illegible even if I view them in the latest version of Adobe Reader on either Linux or Windows. They're not images, though, they're text rotated using PostScript/PDF commands. Any reports from the iPeople? It may be a font issue.
Re: (Score:2)
his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?
Same problem here with XP and Adobe Reader 10.
Re: (Score:1)
XP, Adobe Reader 9 or some shit with all updates, black lines all over the place.
Works fine if I open it up in Adobe Acrobat 8.
Couldn't care less about figuring out why, or the content of the PDFs. Take your fucking word graphs, tag clouds, and other useless shit back to 1999, where you'll still be recognized as completely useless.
This can be used to preload a "human-like" ai (Score:5, Interesting)
With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.
Yes it will be biased and partial and rough, but it's a good start.
More formal reasoning and association techniques, such as Bayesian stuff, logic, etc. will also be needed for general AI, but for the knowledge base to be grounded in human concerns and human perceptions; that's a key to an ai we can relate to and which can relate to us.
I imagine this kind of semantic network will be usable for google 2.0 "pre-emptive search" or "my virtual social planner and concierge".
Re: (Score:2)
Yes it will be biased and partial and rough...
Just like most humans.
More formal reasoning and association techniques, such as bayesian stuff, logic, etc will also be needed for general AI...
Because we all know that most people use reasoning and bayesian logic every day.
Re: (Score:1)
Semantics is all fluffy and stuff, but you are nowhere near AI until the computer can actually comprehend meaning. Semantics is just yet another buzzword for 'dead data, somewhat organized, but still dead, which we hope will make AI'. Building larger or better organized datasets will get us nowhere if we cannot put the initial 'cogito, ergo sum' into the machine. (And yes I know the 'cogito' is not the ultimate first thought of any mind.) The defining characteristic of life is the fact that data has meaning
Re: (Score:1)
I'm guessing most biologists would disagree.
Re: (Score:1)
I'm guessing most biologists would disagree.
Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.
Re: (Score:2)
I'm guessing most biologists would disagree.
Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.
...right. Because biosemiotics is a field dedicated to studying how living organisms process and interpret data. Your statement is tautological. Biosemioticists have no question because their field is predicated on it.
Making the claim that anything is the 'defining' characteristic of life is a little rash, because the definition of life is still kind of up in the air. Clearly, there is some disagreement as to what constitutes life.
Re: (Score:2)
It already is. A hard drive controller comprehends the meaning of alternating magnetic patterns on the disk: a sequence of ones and zeroes. A processor comprehends a higher-level meaning: a stream of assembly instructions. An operating system comprehends the yet higher level of meaning: a page of code belonging to firefox.exe that was just swapped in and began executing.
This phenomenon should b
Re: (Score:3)
We don't know. We don't have even the faintest beginnings of a "theory of intelligence".
Which doesn't mean that you can just ignore it, start throwing data at simplistic machines and expect (strong) AI to just happen.
Re: (Score:2)
Yes we do. We have a whole branch of science [wikipedia.org] concerning the matter. Which is precisely why I asked: the grandparent post sounds suspiciously like semi-mystical pseudophilosophy that gets thrown around because people don't actually want to know how their minds work and prefer to think them as ma
Re: (Score:2)
Sure we do, it's just that so far they have not come up with anything concrete. Oh, they've done lots of work poking around the edges, but the main question is still pretty much the same - what is intelligence? Perhaps "not the faintest beginnings" was a little strong, I'll rephrase as "have not made consistent progress towards" understanding intelligence.
Which is precisely why I asked: the grandparent post sounds suspiciously like se
Re: (Score:2)
The _engineer_ behind the hard drive controller comprehends the meaning of alternating magnetic patterns on a disk. The hard drive controller or any automated system comprehends it no more than a clock comprehends time. Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming. And making an honest face while selling hot air.
Re: (Score:2)
Your brain is a clockwork mechanism, yet it somehow manages to be "smart", or at least appears that way to you.
Re: (Score:1)
Your brain is a clockwork mechanism
In which case, why don't you just build one and prove it?
Oh, that's right, you can't.
Re: (Score:1)
So, I'd say it's simply a matter of overall complexity whether we'd call something alive or not.
I'm sure the whole internet is more complex than an amoeba, but that doesn't mean it's alive.
Re: (Score:2)
I doubt you can derive human like artificial intelligence from simple word order frequency charts.
People, or at least intelligent people, start saying something with a destination in mind, not simply to mimic some statistical summary.
Word order charts made today will be different in 6 months, as new phrases enter common usage, but does that mean human relationships or topics change that much over 6 months?
This reminds me more of the Bing TV ads than anything else.
Re: (Score:2)
It's been done already, and the resulting AI [mit.edu] was good enough to get three papers submitted to a computer science conference.
Re: (Score:3)
Wouldn't it make more sense to simply point it to Wikipedia?
Re: (Score:2)
No, this just leads to symptom modeling. There is no relation between "he" and "argues" or "she" and "loves" other than that they occur more frequently in the texts that comprise the corpus. I've done corpus studies, and if you look at word frequencies from a certain corpus, i.e. unigrams, they look ok, until you compare them to another one. One of them had 3rd person personal pronouns high, but the rest low, but in another, the 1st person singular (I) was the most frequent word. The difference? The former
Can I do my own searches? (Score:2)
Re: (Score:1)
The results sort of surprised me.
microsoft sucks vs. microsoft doesn't suck [googlefight.com]
Hmm, let's see what's going on here.
microsoft doesn't suck vs. microsoft doesn't suck that much [googlefight.com]
That makes more sense.
Re: (Score:2)
Slashdot appears twice as often as MSNBC [googlefight.com].
ngrams (Score:2)
Go to the http://ngrams.googlelabs.com/ [googlelabs.com] site and compare word frequency between 'pirates' and 'ninjas'. Please.
Easy! (Score:2)
Just use grep, or vi with a heavy object on the down-arrow key. What did I win?
Re: (Score:2)
nah, just use cat and read really fast
Cat Abuse (Score:2)
nah, just use cat and read really fast
RYRYRYRYRYRYRYRYRY...
This is an obscene abuse of a perfectly innocent program meant to concatenate files.
I'll have you know I've called the Unix Police and they will be picking you up shortly.
And you don't have to read fast. All you need is a 45.5 baud teletype machine and filename > /dev/tty
Personally I prefer to read the punchtape directly though ... with a torch.
Kudos to Chris Harrison, though (Score:4, Insightful)
"Was this "anonymous reader" the guy who owns the blog?"
"his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.
I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)
Re: (Score:1)
I'm sorry, but there is no rationale to call this ( http://gyazo.com/57fe0a7de30d5bfbbeb4998b74730fc3.png ) GOOD. Who failed here? Sure, Adobe has their part after .pdf "being demonstrated" [sic!] as a very "robust" format at the 27c3 (you can put all kinds of shit into an uncompiled pdf - it will compile and execute on launch without asking).
But I have done comparably complex graphics in pdf and those did not fail - so what's the problem? I use win7x64.
Re: (Score:1)
Windows XP (at work), and I've got the same problem.
Re:Kudos to Chris Harrison, though (Score:4, Funny)
I don't agree.
Responding to myself (I don't respond to ACs) (Score:2)
I ask simply because I have viewed them today on the latest Chrome on Ubuntu 10.10 and Windows 7, and I cannot reproduce the problem, even on a crappy 4 year old laptop.
Poor way of presenting (Score:2)
Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example
cold ... ... ... ... ... ... ...
winter steel case turkey
blood
weather
spring
air
water
springs spots
products new spot
hot
with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.
Re: (Score:2)
Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example
cold
winter steel case turkey
blood ...
weather ...
spring ...
air ...
water ...
springs spots ...
products new spot ...
hot
with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.
I think that Tufte would agree with you.
Easy. (Score:3)
Cats v. Dogs (Score:1)
Looking at the Cat vs. Dog picture, all I can say is, "What's wrong with dog people?"
Re: (Score:1)
Pretty sure there isn't such a thing as 'kitty-style'.
Amateur.
Women and Men (Score:1)
Warning - unfiltered (Score:2)
Dog-Cat chart NSFW
How GOOG does it: (Score:1)
How Do You Visualize 100 GB of Google Text Data?
Easy:
$$$$$$$$$$
Visualization? (Score:1)
Visualization = Dark Background + Light Words + Pretty Lines
How does that give me any sort of understanding of the content?
Re: (Score:1)
I agree somewhat. The problem is twofold: graphing libraries do the same things and there is not much meaning to be had in the raw data. For the former item, many visualization libraries are designed to display graph/network data somewhat gracefully. Consequently, many visualizations center on "how do we put this thing in graph form?" rather than "what interface naturally explains this data best?" The second problem is that this huge morass of data just has frequency counts and n-grams. So, we sorta know
Astonishing ... (Score:3)
Corpus linguistics
http://en.wikipedia.org/wiki/Quantitative_linguistics [wikipedia.org]
Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.
CC.
guess the word (Score:1)
I can guess some of the words... but it required blowing up the pics to 2400% and I was using Adobe PDF.
My only question is what to do with it. If you are trying to add keywords that will make your site more search worthy, I can understand, or to show a line of thinking how people associate terms. 'Hot and cold' gets you to "environment" "water" "pool"... Might be fun for word association tests.
Psychologists. (Score:2)
as a fraction (Score:2)
Some more Google N-Gram finds (Score:2)
http://ngrams.googlelabs.com/graph?content=blue%2Cred%2Cgreen%2Cyellow&year_start=1880&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
http://ngrams.googlelabs.com/graph?content=Britannica%2CWikipedia&year_start=1800&year_end=2010&corpus=0&smoothing=3 [googlelabs.com]
http://ngrams.googlelabs.com/graph?content=1881%2C1891%2C1901%2C1911%2C1921%2C1931%2C1941%2C1951%2C1961%2C1971%2C1981%2C1991&year_start=1880&year_end=2008&corpus=0&smoothing=3 [googlelabs.com]
http://ngrams.googlelabs.com/graph?content=poker%2Cc [googlelabs.com]
bigram means two characters (Score:2)
How? (Score:1)
Same way you visualize anything else (Score:2)