Google Snaps Up Stats Tool from Swedish Charity
paulraps writes "A stats program that began as a teaching aid for a university lecture has just been bought by Google for an undisclosed sum. The statistics tool, Trendalyzer, was developed by a professor and his son at Stockholm's Karolinska Institute. Unfortunately for the developers, the project has been run under the auspices of a charity, Gapminder, and financed over the last seven years by public money. Maybe that seemed smart at the time, but the professor, admitting that he won't see a dime of Google's cash, now seems regretful. As for what Google has purchased: 'Public organizations around the world invest 20 billion dollars a year producing different kinds of statistics. Until now, nobody has thought of collecting all the information in the same place. That should be possible with Trendalyzer, which will be able to present that quantity of data in a clear way as well as giving the user the ability to compare many different kinds of information.'"
Re:What does it do? (Score:5, Interesting)
Don't dismiss this without knowing anything about it.
Significance levels and missing data (Score:5, Interesting)
Shortly thereafter, a site called Nation Master [nationmaster.org] cropped up, with a somewhat flashier and simpler user interface, but focused on CIA World Fact Book data rather than the states of the US. (The same folks later did State Master [statemaster.org] using similar UI technology.)
Finally, Google tested Gapminder [google.com] with an even spiffier and simpler UI -- again focusing on by-nation correlations.
Aside from the usual complaints about "The Ecological Fallacy" [wikipedia.org] (a fallacy that cuts both ways BTW) there are two big pitfalls for this stuff:
What I did about missing data was simply eliminate any data points where data was missing from one or both of the variables being correlated. This reduces the sample size, and hence the statistical significance, but it sidesteps arguments over what sort of substitute values should be used. The Netflix Prize [netflixprize.com] is coming up with really good algorithms to impute missing data efficiently and accurately, so maybe there is hope for something more effective here.
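The parent's pairwise-deletion approach is easy to sketch. The few lines below are my own illustration (Python and NumPy are my choice, not anything the parent used): drop every row where either variable is missing, correlate what survives, and report the reduced sample size alongside the coefficient.

```python
import numpy as np

def pairwise_correlation(x, y):
    """Correlate two variables, dropping any pair where either value is NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~(np.isnan(x) | np.isnan(y))   # keep only complete pairs
    n = int(mask.sum())                   # reduced sample size
    r = np.corrcoef(x[mask], y[mask])[0, 1]
    return r, n

# One missing value in each variable drops those rows entirely:
x = [1.0, 2.0, np.nan, 4.0, 5.0]
y = [2.1, 3.9, 6.2, np.nan, 10.1]
r, n = pairwise_correlation(x, y)   # only 3 complete pairs remain
```

Note that the returned n, not the original length, is what belongs in any significance lookup -- exactly the sample-size cost the parent describes.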
Statistical significance is more difficult to deal with. Usually one must look at tables for the statistical significance of correlations under the assumption that the variables each follow a normal distribution. Unfortunately, many variables follow polynomial (like squared) or exponential distributions, so you have to do things like take the sqrt or log of one or both variables to try to normalize them. However, when you are looking for correlations, sometimes it is the relationship that is polynomial or exponential -- in which case you can apply sqrt or log to get the maximum correlation coefficient at the sacrifice of normality of one or both variables. Unfortunately, there is no simple arithmetic formula for calculating the significance level of a correlation given a non-normal distribution -- you can't just plug in the skewness, kurtosis, etc. along with the sample size and correlation coefficient and get out a valid statistical significance. Therefore it is hard to make good statements about many very important correlations without watering them down to meaninglessness.
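One distribution-free workaround the parent doesn't mention (so this is my suggestion, not theirs) is a permutation test: shuffle one variable many times and count how often the shuffled |r| matches or beats the observed |r|. It needs no normality assumption, only exchangeability under the null. A minimal Python/NumPy sketch:

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a Pearson correlation.
    The null distribution is built by shuffling y, so no normality is assumed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)   # add-one so p is never exactly 0

# Strongly related but heavily skewed data -- far from normal -- still
# comes out clearly significant without any transform:
x = np.arange(1, 31, dtype=float)
y = np.exp(x / 5.0)
p = permutation_pvalue(x, y, n_perm=2000)
```

The price is computation (thousands of shuffles per pair), which matters if you are scanning every variable against every other, but it sidesteps the skewness/kurtosis problem entirely.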
Also, a complaint about the "simple" user interfaces:
Some of the worst reporting from news media comes when they refuse to report statistics in terms remotely related to anything meaningful -- for example you will frequently hear statements to the effect that "California has the most orange trees in the nation." or some such. Such statistics are nonsense for the purposes of correlation studies since the size of the ecology (California state) is all you are really measuring with such statements. You have to divide by the population or divide by the total GDP or something to rationalize the ecology against other ecologies.
In Laboratory of the States, I did this with all my variables but I also left the raw variables around and allowed people to do arithmetic on them -- like dividing them -- to get their own rational comparisons if for some reason my choices were not adequate. This problem isn't as bad with Gapminder as it is with Nation Master and State Master -- but Gapm
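The "divide by population" fix from a couple of paragraphs up is trivial to demonstrate. The state names and counts below are made up for illustration, not real data:

```python
# Hypothetical raw counts and populations -- illustrative numbers only.
raw_counts = {"California": 80_000_000, "Florida": 74_000_000, "Vermont": 1_000}
population = {"California": 39_000_000, "Florida": 22_000_000, "Vermont": 650_000}

# Per-capita rate: raw count divided by population, as the parent suggests.
per_capita = {state: raw_counts[state] / population[state] for state in raw_counts}

# Ranking by raw count and by rate can disagree completely:
by_raw = max(raw_counts, key=raw_counts.get)    # biggest state wins on raw count
by_rate = max(per_capita, key=per_capita.get)   # normalizing changes the leader
```

With these numbers the raw leader and the per-capita leader are different states, which is exactly why "California has the most X" headlines are useless for correlation work.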
Re:Hopefully they'll hire him (Score:4, Interesting)
Re:If this was developed with public money... (Score:1, Interesting)
Because it would be a violation of privacy.
Any Regrets? (Score:2, Interesting)
At least he can be content to know that Google will be the bestest, most very perfect company ever, since they come right out and say, at every opportunity, that their policy is "don't be evil".
And since they say they won't be evil, we know they can't be lying! (Please ignore how they help totalitarian right-wing regimes to identify people who speak out against them, and empower governments to clamp down on free speech)
Re:If this was developed with public money... (Score:5, Interesting)
The law regarding software and publicly-funded inventions has not always been as it is now. It used to be the case that most significant publicly-funded software HAD to be in the public domain, which AFAIK is why we have the BSD license today. Also witness early versions of Gaussian (quantum chemistry).
These days lots of 100% publicly-funded software is not automatically released to the public domain but instead held ransom by the author or university with a separate license permitting unlimited government use. This directly affects me: essentially ALL of the current quantum chemistry code that produces publishable results is no longer free for everyone to use. Though most programs come with source (they have to, for some of the systems we need to run them on), their license restrictions are very onerous for developers: only the PI can register to download it, or it costs 5000 euros per seat, or it cannot be ported to other platforms, etc. One program even revokes licenses from academics who use competing software in the same domain! And this is almost ALL software written by tenured professors and their graduate students, funded from government grants.
I think we all did much better with the old formula. University-developed code should be available for everyone to use, even if that means someone can later come along and compete with a closed-source version.
I'm curious if the Swedish system more closely resembles the current USA system or the old USA system.
Re:How do you use this? (Score:4, Interesting)
It's also not as new as people are making it out to be: it's a variant of the scatter plot, and those have been around for a while. To murder a quote from Hamlet:
There are more things in infographic design, OldBaldGuy, Than are dreamt of by Microsoft Excel.
see Rosling demonstrate it himself (Score:4, Interesting)
Re:Nobody thought about it before? (Score:2, Interesting)
Check out this video [google.com], as found in one of Zonk's links. [gapminder.org]
The idea is NOT to collect all the data of the world centrally; it is to link to the pre-existing data and display it in a useful way. The software looks incredibly innovative. I doubt there is anything similar, for two reasons: (1) Google wouldn't have bought it otherwise; (2) TV stations here in Australia would already be showing trends with the software, just as they now show various parts of the earth with Google Earth.
Re:Wrong license? (Score:3, Interesting)
Gapminder appears to be made from mostly open source code:
http://osflash.org/pipermail/osflash_osflash.org/
Gapminder TechTalk (Score:3, Interesting)