
How Big Data Creates False Confidence (nautil.us) 69

Mr D from 63 shares an article from Nautilus urging skepticism of big data: "The general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry... But there's a problem: It's tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn't be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus -- and the reasons why should give us pause about any research that blindly trusts big data."
For example, Google's database of scanned books represents 4% of all books ever published, but in this data set, "The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria." And the name Lanny appears to be one of the most common in early-20th century fiction -- solely because Upton Sinclair published 11 different novels about a character named Lanny Budd.

The problem seems to be skewed data and misinterpretation. (The article points to the failure of Google Flu Trends, which it turns out "was largely predicting winter".) The article's conclusion? "Rather than succumb to 'big data hubris,' the rest of us would do well to keep our skeptic hats on -- even when someone points to billions of words."
This discussion has been archived. No new comments can be posted.

    • If you think buzzwords are annoying, how about this little gem from the summary:

      But the bigness...

      I understand that language evolves over time... but when a simple word like "size" eludes someone, it's time that they returned to square one and tried again. In this particular instance, I'm afraid nothing short of crawling back in their mom's vagina is going to cut it.

  • by known_coward_69 ( 4151743 ) on Sunday April 24, 2016 @12:56PM (#51978297)
    A lot of the big prediction sites have been predicting the Kansas City Royals to be average to poor the last few years. So far they have been to the World Series twice and are in first place in their division this year, all with average to slightly above-average mainstream stats. But if you look at them, they built a team using a team strategy instead of simply signing guys and looking at individual stats.
    • by sjames ( 1099 )

      That's a serious limitation of sabermetrics. Baseball is a subtle game. The more you study it, the more subtleties you find.

      A good manager's gut feeling takes far more factors into consideration than sabermetrics even looks at.

      A perfect example is the steal. Sabermetrics followers claim the steal is a losing proposition. That may be true by the numbers being tracked, but it fails to account for the effect on the pitcher after a steal happens. Rattling the pitcher is a real thing.

      • Baseball is about the least subtle game around. Almost everything comes down to individual players' performance. The team aspect is nearly nonexistent. That's why you can throw a bunch of baseball players together to have an All-Star game and it's a pretty good game. The Pro Bowl sucks because that doesn't work in football: the players actually have to work as a team. And your example is far from perfect. Sabermetrics predicts a particular positive value for the steal and a negative value for the caught st
        • by sjames ( 1099 )

          You haven't looked nearly deep enough. For one, in spite of having the best players in the game, the all-star game is filled with lackluster gameplay. It is understood to be more spectacle than sport. The Yankees often have the same problem. They pay top dollar to attract great players, you'd think they would be serious contenders for the World Series each and every year. They're not. The difference between a really good pitcher and a great pitcher often comes down to working well with a catcher to frame a

  • by Etcetera ( 14711 ) on Sunday April 24, 2016 @01:10PM (#51978341) Homepage

    Getting folks in the Bay Area to realize that is still an unsolved problem. Maybe they have an AI team working on it.

    In all seriousness, I saw this a lot when working within a monitoring team, and in consulting I've done for other orgs. Big Data is great for vast, multi-dimensional analysis of massive amounts of data, but it's not a substitute for domain knowledge about *WHAT* you're monitoring, critically thinking about what you're looking for and what types of failure modes might occur, and simple(r) heuristics for triggers.

    Trend analysis is very useful as an adjunct, for example, but within a server monitoring context it's not a *substitute* for having hard limits on, say, CPU load, or HTTP response time, or memory usage.
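A minimal sketch of the distinction drawn above, with a fixed-threshold check sitting next to a simple trend check. The metric names, limit values, window size, and factor are all hypothetical and only there for illustration.

```python
from statistics import mean

# Hypothetical hard limits; real values depend on the system being monitored.
HARD_LIMITS = {
    "cpu_load": 8.0,
    "http_response_ms": 2000.0,
    "memory_used_pct": 90.0,
}


def hard_limit_alerts(sample, limits=HARD_LIMITS):
    """Return the sorted names of metrics whose current reading exceeds its limit."""
    return sorted(
        name for name, limit in limits.items()
        if sample.get(name, 0.0) > limit
    )


def trend_alert(history, window=5, factor=1.5):
    """Return True when the mean of the last `window` readings exceeds
    `factor` times the mean of all earlier readings."""
    if len(history) < 2 * window:
        return False  # not enough history to say anything about a trend
    recent = mean(history[-window:])
    baseline = mean(history[:-window])
    return baseline > 0 and recent > factor * baseline
```

The hard-limit check fires on a single bad reading with no history at all, while the trend check stays silent until it has enough data, which is one way to read the "adjunct, not substitute" point.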

    Somehow, people managed to come to conclusions and make good decisions even before we had terabytes of raw data being sifted through by statistical algorithms to come up with a result.

    To place it into a broader cultural context, I see this in parallel with "data fetishisation" where nothing at all can be possibly true unless Science. And Data. Hipster praying at the altar of data.gov as some sort of left-wing (or Millennial) shibboleth for smug certainty when the basics -- the entry-level, basic 101 class of domain knowledge for the field -- are being forgotten.

    I'm all for bringing in new tech and new analytic techniques, but you can't look at it as a panacea for failing to understand what's going on in your domain on a philosophical level.

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Getting data is dead easy. I can get you gobs of it. I can store it fairly quickly.

      Now for the hard part. What are you trying to find? Have we been collecting the right data? Is it in the right form? Do we actually have enough? Is it at the proper interval? These are where most people fail, and they just keep collecting more of the same data even though it has zero use for them, somehow magically expecting the data to organize itself.

    • by Anonymous Coward

      Exactly! Big Data is about finding patterns, not conclusions. The whole point is that humans are capable of searching through only so much information, and at some point you need a computer to do it for you.

      Of course, once a pattern is found, it's up to humans to determine if it makes sense -- and you'd do the same for any pattern found by a human.


      • Big Data is about finding patterns, not conclusions.

        Gary Taubes (author of "Good calories - bad calories" and "Why we get fat and what to do about it") is my favourite scientist because he exhibits such a healthy, integrated "given that what we believe today is correct" attitude, i.e. being totally open to being proven incorrect. There is a saying, "follow those that seek the truth, run from those that claim to have found it", and Gary is most certainly a truth seeker in that respect.

        For instance in t

    • by quax ( 19371 ) on Sunday April 24, 2016 @02:37PM (#51978789)

      "data fetishisation" where nothing at all can be possibly true unless Science. And Data.

      The problem is that it's all data, very little science.

      Real scientists know how to scrutinize their data, and how to rule out false positives. Actual science will not only give you a statistical level of confidence, but use domain expertise to the uttermost to rule out systematic errors. A nice case study in that regard is the recent LIGO gravitational wave results.

      Most of the people who like to call themselves "data scientists" these days know as much about science as "computer engineers" know about proper engineering.

    • by jma34 ( 591871 )

      The lack of thinking is somewhat appalling. I am a "data scientist". I came from one of the science fields that understands data, high energy particle physics. People are often surprised when I tell them that their fancy map-reduce tools are not particularly interesting when it comes to actually understanding your data. The tools are not interesting. Do you hear that, "big data" conference organizers? Too little time is spent understanding what the data is telling and how do you know that it is telling

  • by iMadeGhostzilla ( 1851560 ) on Sunday April 24, 2016 @01:12PM (#51978355)

    "Well, if I generate (by simulation) a set of 200 variables — completely random and totally unrelated to each other — with about 1,000 data points for each, then it would be near impossible not to find in it a certain number of “significant” correlations of sorts. But these correlations would be entirely spurious. And while there are techniques to control the cherry-picking (such as the Bonferroni adjustment), they don’t catch the culprits — much as regulation didn’t stop insiders from gaming the system. You can’t really police researchers, particularly when they are free agents toying with the large data available on the web.

    I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack."
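A rough sketch of the simulation the quote describes, using only the Python standard library. The function names, the Gaussian noise, the 0.05 threshold, and the Fisher z approximation for the p-values are illustrative assumptions, not anything taken from the quoted text.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(r, n):
    """Approximate two-sided p-value for r under the null hypothesis of no
    correlation, via the Fisher z-transform (a normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious(n_vars=200, n_points=1000, alpha=0.05, seed=0):
    """Generate independent random variables, test every pair, and return
    (number of tests, naive 'significant' count, Bonferroni-adjusted count)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]
    pvals = [two_sided_p(pearson(a, b), n_points)
             for a, b in combinations(data, 2)]
    n_tests = len(pvals)
    naive = sum(p < alpha for p in pvals)
    bonferroni = sum(p < alpha / n_tests for p in pvals)
    return n_tests, naive, bonferroni
```

With the default sizes there are 19,900 pairs, so roughly 5% of them (on the order of a thousand) would be expected to pass the naive threshold by chance alone. The pure-Python loops make the default run slow; smaller values of `n_vars` and `n_points` show the same effect quickly. The Bonferroni count only helps when the number of tests is reported honestly, which is the quote's complaint.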

    • by Anonymous Coward

      "Well, if I generate (by simulation) a set of 200 variables — completely random and totally unrelated to each other — with about 1,000 data points for each, then it would be near impossible not to find in it a certain number of “significant” correlations of sorts. But these correlations would be entirely spurious.

      Exactly - determining a question after looking at the data is statistical BS. E.g. deal any bridge hand, look at it, then ask the question "was this a random deal?"
      The odds of that exact hand (in the order dealt) are 1 in 52!/39! ~ 1 in 4e21, obviously a crooked deal at the 99.99...9 confidence level. (Even if we ignore order, the odds are ~ 1 in 6.35e11.)
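A quick way to check counts like these, assuming Python 3.8 or later for `math.perm` and `math.comb`. The variable names are illustrative.

```python
import math

# Ordered deals of 13 cards from a 52-card deck: 52!/39!
ordered_hands = math.perm(52, 13)

# Unordered 13-card hands: 52! / (13! * 39!)
unordered_hands = math.comb(52, 13)

print(f"ordered:   1 in {ordered_hands:.3e}")
print(f"unordered: 1 in {unordered_hands:,}")
```

Any specific hand is this unlikely before the deal, which is why asking "was this random?" only after seeing the hand tells you nothing.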

  • A lot of people put in crap to get past prompts, or answer ideally instead of truthfully, etc. You have to imagine a lot of this data is biased and doesn't reflect reality anyway.

  • The key advantage of big data is the ability to show us where to look. But after that we need to dig further with much smaller data and science to see what the cause is.

  • There was a very interesting bit of sleuthing [drhagen.com] done to track down biases in the popularity of certain dates in scanned books, like the prevalence of Sept 11 *before* 2001.

  • xkcd (Score:3, Funny)

    by Anonymous Coward on Sunday April 24, 2016 @02:29PM (#51978753)


  • I call this approach "adding more haystacks".

  • ...false data create big confidence.
  • You measure the wrong thing ten times, and get the wrong answer. You measure the wrong thing ten billion times, and you still get a wrong answer.

    It's almost like quantity and quality are different things!
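A small simulation of that point, as a sketch only. The true value, the size of the systematic offset, and the noise level are made-up numbers chosen for illustration.

```python
import random
from statistics import mean


def biased_estimate(n, true_value=10.0, bias=2.0, noise=1.0, seed=0):
    """Average n noisy readings from an instrument that is off by a constant
    `bias`. More readings shrink the random noise, never the bias."""
    rng = random.Random(seed)
    return mean(true_value + bias + rng.gauss(0, noise) for _ in range(n))


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(biased_estimate(n), 3))
```

As `n` grows, the average settles ever more tightly around `true_value + bias`, so the estimate becomes more precise without becoming any more accurate.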

  • I'm reminded somewhat of "The Bible Code" - the theory/idea that there is a bunch of stuff hidden in the bible, visible when viewed different ways (like when skipping characters, etc. - Google it). The reality is that the bigger the dataset, the more patterns, even false patterns, may be present in it. If I had a billion money's, what would they type...
    • Good question. If your moneys were well diversified, they would likely display mostly type S risk and minimal type I risk, leading to market average results. But if your moneys cluster together, they would display more type I risk, and therefore produce volatile results.
  • I know a lot of people don't like the way the big websites save their information and sell it. But there is a reason for this: take Google, which sells that information to anyone who wants to promote their products through Google. That means Google creates more jobs and worth in the world. Companies can promote their products and their customers can see them in Google's search engine. A lot of SEO companies [skaftekster.nu] are dependent on Google's way of selling more information.
  • Many use big data systems and techniques to:

    - Identify potential new customers for products and services. Mistakes here result in poor choices and losses.

    - Identify and prevent fraud of all types. Mistakes here result in losses.

    - Identify existing customers that could be successfully marketed new or additional products and services. Mistakes here result in disgruntled customers and losses.

    Sometimes it's hard to determine if a technology is useful or even functional. Money often is a good indicator.

  • Oh man, I ran square into this just last week. This guy was claiming to work in big data as an economist. Said any sort of inefficiency should ultimately impact the GDP. I countered that there are lies, damned lies, and statistics, and that it might not make him comfortable, but the metrics he's using could be lying to him.

    And get this: "I work with data. Statistics is for losers". ... Can you believe that guy? Even after I pointed out that while every call to Map() might sort data very nicely, every Reduce(

  • The summary example (Google scanning 4% of books), while it may be "a lot" of data, isn't really big data, is it? I understand the whole point about more data not necessarily being better, but here I don't even think the example proves the point.

  • Big Data and Statistics were the problems Hari Seldon ran into, weren't they? Only worked with supra-large populations.
