Google's Medical AI Was Super Accurate in a Lab. Real Life Was a Different Story. (technologyreview.com) 52

The covid-19 pandemic is stretching hospital resources to the breaking point in many countries around the world. It is no surprise that many people hope AI could speed up patient screening and ease the strain on clinical staff. But a study from Google Health -- the first to look at the impact of a deep-learning tool in real clinical settings -- reveals that even the most accurate AIs can actually make things worse if not tailored to the clinical environments in which they will work. From a report: Existing rules for deploying AI in clinical settings, such as the standards for FDA clearance in the US or a CE mark in Europe, focus primarily on accuracy. There are no explicit requirements that an AI must improve the outcome for patients, largely because such trials have not yet been run. But that needs to change, says Emma Beede, a UX researcher at Google Health: "We have to understand how AI tools are going to work for people in context -- especially in health care -- before they're widely deployed." Google's first opportunity to test the tool in a real setting came from Thailand. The country's ministry of health has set an annual goal to screen 60% of people with diabetes for diabetic retinopathy, which can cause blindness if not caught early. But with around 4.5 million patients to only 200 retinal specialists -- roughly double the ratio in the US -- clinics are struggling to meet the target. Google has CE mark clearance, which covers Thailand, but it is still waiting for FDA approval. So to see if AI could help, Beede and her colleagues outfitted 11 clinics across the country with a deep-learning system trained to spot signs of eye disease in patients with diabetes.

In the system Thailand had been using, nurses take photos of patients' eyes during check-ups and send them off to be looked at by a specialist elsewhere -- a process that can take up to 10 weeks. The AI developed by Google Health can identify signs of diabetic retinopathy from an eye scan with more than 90% accuracy -- which the team calls "human specialist level" -- and, in principle, give a result in less than 10 minutes. The system analyzes images for telltale indicators of the condition, such as blocked or leaking blood vessels. Sounds impressive. But an accuracy assessment from a lab goes only so far. It says nothing of how the AI will perform in the chaos of a real-world environment, and this is what the Google Health team wanted to find out. Over several months they observed nurses conducting eye scans and interviewed them about their experiences using the new system. The feedback wasn't entirely positive.

  • When it worked well, Beede and her colleagues saw how the AI made people who were good at their jobs even better. "There was one nurse that screened 1,000 patients on her own, and with this tool she's unstoppable," she says.

    • by Moryath ( 553296 ) on Tuesday April 28, 2020 @11:12AM (#60000494)
      ... which is basically the problem with "machine learning" AI at almost all times. The classic GIGO problem, Garbage In, Gospel Out.

      If the dataset contains errors you get errors.

      If the dataset contains bias, the AI replicates the bias.

      If the dataset is incomplete, the AI doesn't learn much if anything that can let it go outside the boundaries of what it was given.

      "When it worked well" in this case is meaningless, because it couldn't scale up as hoped.

      • by nagora ( 177841 ) on Tuesday April 28, 2020 @02:00PM (#60001178)

        "When it worked well" in this case is meaningless, because it couldn't scale up as hoped.

        And that in turn is because it's not Intelligent - it's not actually learnt anything it can reflect on and consider.

        It's like Google's AlphaZero chess database thing. It can beat any human, but I can say to my wife "Let's play maharaja, which is chess but the white side has just one piece with the power of a queen and a knight combined," and we can sit down and play. AlphaZero can't - it knows how to beat you at chess but it doesn't *know* anything about chess, so it can't apply it outside of its training.

        • AlphaGo Zero learned Go from the rules only and in 3 days managed to beat AlphaGo Lee, which had beaten the world champion Lee Sedol in 2016. I guess it could do the same in fewer days against you playing maharaja.

          • by nagora ( 177841 )

            AlphaGo Zero learned Go from the rules only and in 3 days managed to beat AlphaGo Lee, which had beaten the world champion Lee Sedol in 2016. I guess it could do the same in fewer days against you playing maharaja.

            It didn't learn anything - it built a database. At the end of the process it knew no more about Go than my cat.

            For bonus "so what?" points, the resulting database isn't even in a format that can be interrogated.

            • by Kjella ( 173770 )

              It didn't learn anything - it built a database. At the end of the process it knew no more about Go than my cat.

              Clearly you know as much about Go as your cat. There are about 10^172 possible board positions and only 10^78 to 10^82 atoms in the universe. Hell, you can't even make a database of all the chess positions. I do agree there are lots of people who don't understand what "AI" is, but you're one of them.

              For bonus "so what?" points, the resulting database isn't even in a format that can be interrogated.

              This is true. But we also tried for many, many years to have grandmasters explain to programmers why a particular chess move was good; they were never able to express all the subtleties and nuances to other humans s

              • by nagora ( 177841 )

                Clearly you know as much about Go as your cat. There are about 10^172 possible board positions and only 10^78 to 10^82 atoms in the universe. Hell, you can't even make a database of all the chess positions.

                I didn't say it made a database of all chess positions. However, your estimate of possible board positions is far too high, and I suspect you underestimate just how big Alpha's database is. Playing against it is basically a question of finding a path through the gaps - which is hard because the gaps are small and they're not themselves paths. That is, it isn't a case of finding a line of development it's not seen; it's a case of continually avoiding boards it's never considered. Because ches

        • Not sure what your point is. You communicated to your wife the new rules and she used her existing knowledge (dataset) of chess to understand those rules and start playing.

          If you actually bothered to do the same with an AI, that is actually input the rules for the behaviour of this new piece, then it would go off of its existing knowledge (dataset) of chess to understand and start playing. Remember that all of the other pieces continue to operate the same so the general play style of chess still applies, ju

          • by nagora ( 177841 )

            Not sure what your point is. You communicated to your wife the new rules and she used her existing knowledge (dataset) of chess to understand those rules and start playing.

            If you actually bothered to do the same with an AI, that is actually input the rules for the behaviour of this new piece, then it would go off of its existing knowledge (dataset) of chess to understand and start playing.

            I don't think you understand just how stupid AlphaGo is.

            Remember that all of the other pieces continue to operate the same so the general play style of chess still applies, just the particular tactics and overall strategy involving this new piece will be different.

            Alpha doesn't know anything about tactics or strategy - that, in fact, is its big trick.

            Initially your wife might play better at maharaja than the AI does, but because the AI is so much faster and can play games against itself and learn from those games, by the time your wife finished her first game against you, the AI would be ready to play her and wipe the floor with her.

            Which is fine if you have the budget to start from scratch every time because your so-called artificial intelligence is retarded compared to a 4-year-old child. Building that database costs serious money in hardware and electricity.

      • In particular, the headline could be rewritten from:

        Google's Medical AI Was Super Accurate in a Lab. Real Life Was a Different Story

        to:

        $insert-company-here Fast Pattern-Matching System Was Super Accurate on Training Data. Real Life Was a Different Story.

        to make it pretty much universally applicable.

    • by Cipheron ( 4934805 ) on Tuesday April 28, 2020 @12:36PM (#60000790)

      The original fall-back was that they send the photos to a specialist and it takes 10 weeks to get results. If this machine can't process quite 100% of the photos, then naturally why not just send the remaining ones to the specialists as before?

      The workload for the specialists will also be reduced, since fewer are being sent through than before, so even the manual process should now be faster than the 10 weeks it took before. If they halve the number being sent through, they should knock weeks off the waiting time.

      So no, the headline is wrong; this didn't make things worse. Not being 100% perfect shouldn't be the baseline for being useful. The baseline should merely be that it's better than what it replaces, and this is clearly better than the universal 10-week wait.

      • *sorry not headline, but the summary where it says " even the most accurate AIs can actually make things worse". The ones you couldn't scan had to be manually done before, and they have to be manually done now. But a large number of them *don't* have to be.

        Even the "borderline" ones the machine couldn't classify where they're going "why do we have to send these to the specialist? It's a waste of time!" Well, what were you doing before you had the classifying machine? Presumably, 100% went to the specialists

      • Also this sounds like a job for a second simple AI that can be run on a laptop. It doesn't need to classify the image, it just needs to give a greenlight on whether the image is of a "high quality" or not.

        Train a much, much less sophisticated offline neural network on half-resolution images, and I bet they could get an instant result on whether or not they need to retake it with better lighting.
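
        A minimal sketch of that idea, assuming a hypothetical folder of nurse-taken photos hand-labelled "usable" or "retake" (the directory name, image size, and training settings are illustrative, not anything Google has published):

        import tensorflow as tf

        # hypothetical layout: quality_data/{retake,usable}/*.jpg
        train_ds = tf.keras.utils.image_dataset_from_directory(
            "quality_data", label_mode="binary", validation_split=0.2,
            subset="training", seed=42,
            image_size=(112, 112),  # roughly half resolution to keep it laptop-sized
            batch_size=32)
        val_ds = tf.keras.utils.image_dataset_from_directory(
            "quality_data", label_mode="binary", validation_split=0.2,
            subset="validation", seed=42,
            image_size=(112, 112), batch_size=32)

        # a deliberately tiny CNN: it only has to judge image quality, not diagnose
        model = tf.keras.Sequential([
            tf.keras.layers.Rescaling(1.0 / 255),
            tf.keras.layers.Conv2D(16, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = usable, 0 = retake
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(train_ds, validation_data=val_ds, epochs=10)

        Run against the camera feed, any photo scoring below a chosen cutoff would prompt the nurse to adjust the lighting and reshoot immediately, rather than waiting for the cloud service to reject it.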

  • I have gone through such planning sessions for these types of projects over the years.
    The problem is they don't ever ask for all the necessary data to work correctly. Especially around the medical billing information, as they assume that isn't good clinical information.

    However, knowing how much an action costs compared to others, and knowing how well it works compared to others, can greatly strengthen the case. In other cases, they pick only the best-quality images, while real life has different problems.

  • I doubt they expected all the feedback to be positive. So long as they have identified some inputs that they were incorrectly classifying, this has been a success from a testing perspective
    • by kqs ( 1038910 )

      Yeah, my read of TFA is "The technology works. Some problems, mostly fixable."

      Some combination of making the AI work better with imperfect images, and improving the procedure for taking images, would fix many of the problems. Not sure what to do about poor internet access, though once the service is cheaper and more available, that will be a smaller problem.

      The point about "doctors often disagree, so 100% accuracy is not a requirement" is very reasonable and is something many people seem to miss about AI-

      • by dvice ( 6309704 )

        Let me make your dilemma simpler to understand:
        A) We are testing you. First picture failed. We would like to install a new light bulb and try to take a new picture so we can be sure that everything is right.
        B) We tested you. The picture was not perfect so the results are 10% less accurate
        C) ... 20% ...
        D) ... 30% ... ...

        Which one would you pick for yourself? How many percent of inaccuracy would you be OK with? I personally would like them to get their lighting fixed and get as accurate results as possible,

        • by Shotgun ( 30919 )

          Doctors often disagree because one or more of them is wrong. I don't want that. I want them to be right at least most of the time. AI can do that better than doctors.

          AI can NOT do better than doctors, but it can perform as well as doctors more regularly.

          The advantage of computers is that they don't get tired, hungry, have a headache or a wife that is sleeping around on them. Humans have these sorts of problems.

          The disadvantage of computers is that the ONLY thing they know is what the doctors have told them. The computers are just fancy calculators, after all.

          So, the computer can only perform as well as the best doctor involved in programming it. But, it can perform

          • I've read of a few image classifying cases in medical diagnostics where the AIs were actually slightly better (like 92% vs 89%) than the doctors at diagnosing illness.

          • by Kjella ( 173770 )

            So, the computer can only perform as well as the best doctor involved in programming it.

            No, you don't program a neural network with domain knowledge. All you need is two folders of tagged images and you can train a classifier to tell apples from oranges, cats from dogs, roses from tulips, whatever you want really, without having any clue about the topic. In this case you just need the doctors to tag images as diabetic retinopathy or not diabetic retinopathy; the rest is computer science, not medicine (a minimal sketch of that two-folder workflow is below).

            The problem is in-distribution data versus out-of-distribution data: let's say you've built a cl
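
            For the "two folders of tagged images" workflow, a minimal sketch (folder names, backbone choice, and settings are assumptions for illustration, not Google's actual pipeline) might reuse a frozen pretrained backbone so the programmer encodes no eye-disease knowledge at all:

            import tensorflow as tf

            # hypothetical layout: scans/{healthy,retinopathy}/*.jpg, tagged by clinicians
            ds = tf.keras.utils.image_dataset_from_directory(
                "scans", label_mode="binary", image_size=(224, 224), batch_size=32)

            # frozen pretrained backbone: generic image features, no medical knowledge
            base = tf.keras.applications.MobileNetV2(
                include_top=False, weights="imagenet",
                input_shape=(224, 224, 3), pooling="avg")
            base.trainable = False

            model = tf.keras.Sequential([
                tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
                base,
                tf.keras.layers.Dense(1, activation="sigmoid"),  # output ~ P(retinopathy)
            ])
            model.compile(optimizer="adam", loss="binary_crossentropy",
                          metrics=["accuracy"])
            model.fit(ds, epochs=5)

            Everything medical lives in the labels the doctors supplied; the code itself is the same whether the folders hold retinal scans or cats and dogs.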

  • by Headw1nd ( 829599 ) on Tuesday April 28, 2020 @11:49AM (#60000630)

    After having RTFA, accuracy wasn't the issue - the problem was that in order to achieve accuracy the system rejected a large number of scans as too low quality or inconclusive, without providing feedback. The other issues seemed to revolve around difficulties uploading the images on the Thai network. The summary is written in a way that makes it seem like the whole thing was a fiasco by saying the feedback wasn't entirely positive, when in fact it seems the feedback from the experiment has been mostly positive.

    Also featured prominently in the article is that Google is on the ground working to collect feedback and improve the system during the test implementation. This isn't a cautionary tale, it's just an example of the difficulties any project can have during implementation, and it seems like Google is doing a pretty good job of overcoming them.

    • by ceoyoyo ( 59147 ) on Tuesday April 28, 2020 @12:08PM (#60000702)

      the problem was that in order to achieve accuracy the system rejected a large number of scans as too low quality or inconclusive

      This actually suggests Google is doing a decent job. Simple machine learning implementations often give an answer, no matter what. Google has correctly realized that "not enough information, can't say" is a good answer.
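
      As a toy illustration of that reject option (the threshold value and function name are made up for the example, not taken from Google's system):

      import numpy as np

      def classify_or_abstain(probs: np.ndarray, threshold: float = 0.9):
          """Return (label, confidence), or (None, confidence) to abstain."""
          best = int(np.argmax(probs))
          confidence = float(probs[best])
          if confidence < threshold:
              return None, confidence  # "not enough information, can't say"
          return best, confidence

      # a borderline scan gets referred to a human instead of a forced guess
      label, conf = classify_or_abstain(np.array([0.55, 0.45]))
      print("refer to specialist" if label is None else f"class {label}", round(conf, 2))

      The point is simply that abstaining on low-confidence inputs is a deliberate design choice, not a failure mode.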

      • Why did you not quote the following 3 words after the part you did quote? Those words were: "without providing feedback".

        That rather makes the rest of your comment suspect, since in fact it appears Google's AI did NOT say "not enough information, can't say" as its answer for those poor-quality images.

        • by ceoyoyo ( 59147 )

          Because those three words are very likely meaningless. "Rejected" is feedback. The most important kind.

    • > ...the system rejected a large number of scans...

      LOL don't try that at home kids.

      Patient: What was the result of my test?
      Doctor: Computer says no.
      Patient: No what?
      Doctor: Error.
      Patient: I want my money back.
      Doctor: Please click yes to Accept.
      Patient: No I don't accept.
      Doctor: I'm sorry I did not understand that. Please click yes to Accept.

      • by Shotgun ( 30919 )

        Do you think the doctor will be able to give you a proper diagnosis if she can't get a good look at you? Why would you expect the AI to work on crap data?

  • We see this all the time. A new technology comes out, the staff have to change their workflow, and some staff resent the new tech and actively do things incorrectly. They will take bad pictures or not label them correctly. The staff know they won't get pushback from the machine. The AI is not going to tell the nurse off that she should smarten up and take better pictures. Background - There are new hand-held blood testing units that work better than lab tests in many ways, but they require nur
  • Software is much more stringent. A human is likely to make do with a not-so-perfect image, but software is going to make you redo it even if it's slightly out of spec.
    • by Shotgun ( 30919 )

      And I predict the human will be more likely to misdiagnose from the fuzzy image. GIGO applies to humans, too.

  • From climate to Covid, if we've learned one thing, it's that modelling, as used in government at least, is massively, horribly inaccurate. It is so inaccurate that there is no point in using it at all. You could literally take a random guess and outperform the models.

    And yet, I know that more accurate models exist, since I do similar work myself. Makes you wonder then, why are the models picked by decision makers and the press so bad...
    • by frank_adrian314159 ( 469671 ) on Tuesday April 28, 2020 @01:03PM (#60000922) Homepage

      Makes you wonder then, why are the models picked by decision makers and the press so bad...

      Check your assumptions. Most weather models are quite good. I have a feeling that models used by "decision makers and the press" aren't as bad as you believe, it's just that you don't like what they tell you and then you make a biased evaluation.

      • Makes you wonder then, why are the models picked by decision makers and the press so bad...

        Check your assumptions. Most weather models are quite good. I have a feeling that models used by "decision makers and the press" aren't as bad as you believe, it's just that you don't like what they tell you and then you make a biased evaluation.

        I still see ice on poles and nowhere near 2 million dead from Covid in the US...

        • by swillden ( 191260 )

          Makes you wonder then, why are the models picked by decision makers and the press so bad...

          Check your assumptions. Most weather models are quite good. I have a feeling that models used by "decision makers and the press" aren't as bad as you believe, it's just that you don't like what they tell you and then you make a biased evaluation.

          I still see ice on poles and nowhere near 2 million dead from Covid in the US...

          That's good, because if you did see those things, that would have proven the models to be horribly bad, since in neither case have the leading models predicted anything like that by this point in time. The climate models don't predict ice-free poles at all, and the only Covid models that predicted 2M dead were those that assumed no response whatsoever... and even then it would have taken several more months to reach those levels.

        • The champion of the let's-not-trust-the-experts movement predicted a few deaths and everyone better by April.

          So even the straw man outlier modeler you invented who may have said 2 million deaths (in some parallel universe) is proving more accurate than you are willing to admit.

          For the record, there were people, including myself, who said if we do nothing and let life roll forward as normal, we could easily see 2+ million deaths in the USA. That "if we do nothing" part is relevant.

          • The champion of the let's-not-trust-the-experts movement predicted a few deaths and everyone better by April.

            Many times have I quoted real, accredited experts in the given subject matter (not in this discussion). What is funny is that, if that expert happens to disagree with the narrative, they magically cease to be an expert.

            I know it sounds incredible but, on every single issue there are experts on both sides of the argument. The only difference is which side is getting the government funding and media attention.

        • nowhere near 2 million dead from Covid in the US...

          "You are going to go bankrupt if you don't cut spending."
          *cuts spending, doesn't go bankrupt*
          "That financial advisor was a moron! He said I would go bankrupt and I still am fine!"

      • Also OP totally misses the feedback cycle of models for climate and Covid.

        Collision Avoidance Model: "You're going to run into a tree."
        *Driver turns steering wheel.*
        Collision Avoidance Model: "You're going to be fine."
        Driver: "OMG these models are shit, it just said I was going to hit a tree so I changed direction now it's saying I'll be fine!? Why did I bother turning?"

        Researchers ran old climate models on historical behavior and they were quite accurate. Models aren't fortune tellers. They can't pre

  • by superdave80 ( 1226592 ) on Tuesday April 28, 2020 @02:40PM (#60001322)

    reveals that even the most accurate AIs can actually make things worse if not tailored to the clinical environments in which they will work.

    No, the clinics had substandard equipment and internet connections. That has nothing to do with AI.

    With nurses scanning dozens of patients an hour and often taking the photos in poor lighting conditions, more than a fifth of the images were rejected.

    Because the system had to upload images to the cloud for processing, poor internet connections in several clinics also caused delays.

    Give an AI poor quality images and no way to quickly transfer them, and OF COURSE it's not going to work well.

  • I remember working on the Kaggle retinopathy contest and as always no algorithm made by anyone had perfect predictability. However improvements can always be made, as with any software. For those foggy images there might be some useful image preprocessing that could be done. The AI is always improving. However it's just a tool that works on a limited set of things. Just like any tool. AGI does not exist yet. To promise that any AI works better than humans on all tasks is overselling. The danger in t
  • by Shotgun ( 30919 ) on Tuesday April 28, 2020 @04:27PM (#60001616)

    Google introduces computerized solution to a problem that takes 10 weeks. Patients and nurses complain that results may take 4 hours.

    Solution? Take the picture and tell the patient to go home. Results will be available in less than 10 weeks. Everyone is pleased to get results the next day.

    Duh.

    And as far as the AI not working on bad images, could the doctor do any better if they were trying to work in the dark with equipment that did not allow them to see clearly? My fear is that in those cases the doctor might try to guess, whereas the AI will rightfully say that the data is inconclusive.

  • by MpVpRb ( 1423381 ) on Tuesday April 28, 2020 @04:31PM (#60001630)

    The problems were low-quality photographs and poor internet connectivity
