Google Unveils 'Imagen' Text-To-Image Diffusion Model, Claims It's Better Than DALL-E 2 (techcrunch.com) 62

An anonymous reader quotes a report from TechCrunch: The AI world is still figuring out how to deal with the amazing show of prowess that is DALL-E 2's ability to draw/paint/imagine just about anything, but OpenAI isn't the only one working on something like that. Google Research has rushed to publicize a similar model it's been working on, which it claims is even better. [...] Imagen starts by generating a small (64x64 pixel) image and then does two "super resolution" passes on it to bring it up to 1024x1024. This isn't like normal upscaling, though, as AI super-resolution creates new details in harmony with the smaller image, using the original as a basis.
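
As a rough sketch of that cascade (a purely hypothetical illustration, not Google's code; the stand-in functions below are placeholders, though the 64 -> 256 -> 1024 progression and the frozen text encoder match the paper's description):

    # Minimal sketch of the cascaded pipeline described above: a base
    # diffusion model generates a 64x64 image, then two "super resolution"
    # diffusion stages bring it to 1024x1024. All three stand-ins are
    # hypothetical; the real stages are text-conditioned diffusion models.
    import numpy as np

    def encode_text(prompt: str) -> np.ndarray:
        """Stand-in for the frozen text encoder (the paper uses T5-XXL)."""
        rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
        return rng.standard_normal(128)

    def sample_base(text_emb: np.ndarray) -> np.ndarray:
        """Stand-in for the base text-to-image diffusion model (64x64)."""
        rng = np.random.default_rng(0)
        return rng.random((64, 64, 3))

    def sample_super_res(image: np.ndarray, text_emb: np.ndarray, size: int) -> np.ndarray:
        """Stand-in for a super-resolution diffusion stage.

        A real stage *generates* new detail in harmony with the low-res
        input; this placeholder just nearest-neighbor upsamples.
        """
        factor = size // image.shape[0]
        return image.repeat(factor, axis=0).repeat(factor, axis=1)

    def generate(prompt: str) -> np.ndarray:
        emb = encode_text(prompt)
        img = sample_base(emb)                  # 64x64 base sample
        img = sample_super_res(img, emb, 256)   # first super-res pass
        img = sample_super_res(img, emb, 1024)  # second super-res pass
        return img

    print(generate("a panda making latte art").shape)  # (1024, 1024, 3)

The key design point is that the text embedding conditions every stage, not just the first, which is why the generated detail stays consistent with the prompt rather than being generic sharpening.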

The advances Google's researchers claim for Imagen are several. They say that existing large text models can be used for the text encoding portion, and that the encoder's quality matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for. For instance, in the paper describing Imagen (PDF), they compare results for it and DALL-E 2 on "a panda making latte art." In all of the latter's images, it's latte art of a panda; in most of Imagen's, it's a panda making the art. (Neither was able to render a horse riding an astronaut; both showed the opposite in all attempts. It's a work in progress.)

In Google's tests, Imagen came out ahead in human evaluations of both accuracy and fidelity. This is obviously quite subjective, but to even match the perceived quality of DALL-E 2, which until today was considered a huge leap ahead of everything else, is pretty impressive. I'll only add that while it's pretty good, none of these images (from any generator) will withstand more than cursory scrutiny before people notice they're generated or have serious suspicions. OpenAI is a step or two ahead of Google in a couple of ways, though. DALL-E 2 is more than a research paper; it's a private beta with people using it, just as they used its predecessor and GPT-2 and GPT-3. Ironically, the company with "open" in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.

Comments Filter:
  • by Anonymous Coward
    You dump a novel in, and it spits out a movie. Actually, three separate movies to maximize box office revenue.
  • by AmiMoJo ( 196126 ) on Tuesday May 24, 2022 @09:23AM (#62561286) Homepage Journal

    If you were wondering why Google hasn't made this available for you to try yourself, from the research paper...

    The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.

    From what I understand, the DALL-E team hand-curated their dataset to avoid this problem.

    • by nagora ( 177841 )

      Yeah, we don't want anyone seeing naked bodies or swearing! The horror of it all!

      • by Kokuyo ( 549451 )

        For your consideration:

        "Joe Biden bowing and kissing Vladimir Putin's ring on his right hand"

        Just imagine what "proof" one could come up with.

        "Xi Jinping executing a kneeling woman"

        I'm sure more dastardly characters could come up with way more nefarious imagery.

      • by rgmoore ( 133276 )

        A more accurate description is that Google is aiming for as large a market as possible, and the nature of the market says that being blandly inoffensive is the smart choice. People who don't want to see nudity are likely to be a lot more offended (and less likely to continue using the product) if it produces some for them unasked than people who are OK with nudity are by never chancing on any. Some people will probably be disappointed the program isn't able to meet their requests for porn, but even they might be will...

    • by OzPeter ( 195038 )

      If you were wondering why Google hasn't made this available for you to try yourself, from the research paper...

      The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.

      From what I understand the DALL-E team hand curated their dataset to avoid this problem.

      So can I request it to paint a flower in the style of Georgia O'Keeffe?

      There is no way, given the diversity of the entire population of the Earth, that any image produced by one group won't be found offensive by another group, no matter how benign that image is. To even state that this is possible shows complete ignorance of who we are as a people.

      (For those that don't know, Georgia O'Keeffe [wikipedia.org] was an American painter who, among other things, famously painted flowers with the appearance of vaginas.)

      • by AmiMoJo ( 196126 )

        I don't think they are trying to avoid offending anyone, just removing the categories they mention: porn, racism, homophobia and the like.

        • by OzPeter ( 195038 ) on Tuesday May 24, 2022 @10:33AM (#62561490)

          I don't think they are trying to avoid offending anyone, just removing the categories they mention: porn, racism, homophobia and the like.

          If you are not removing that stuff to avoid offending people, then why are you doing it in the first place?

          But the point I was making is that you can remove as much as you like, and then someone will come along and appropriate some innocuous image that will then be used as a new vector of offense. A classic example of this is the use of Winnie the Pooh to criticize Xi Jinping. There is no algorithm or curation that could have seen in advance how those two items would be connected.

          • by AmiMoJo ( 196126 )

            No, I mean they are not trying to avoid offending everyone in the entire world. They are just removing certain stuff that is widely agreed to be undesirable for a family friendly product.

            • by OzPeter ( 195038 )

              No, I mean they are not trying to avoid offending everyone in the entire world. They are just removing certain stuff that is widely agreed to be undesirable for a family friendly product.

              I think you should apply for a role on Broadway - your tap dancing is immaculate /s

              • by AmiMoJo ( 196126 )

                I just phrased it as "anyone" when it would have been better phrased as "everyone". I'd just been speaking in Japanese, and it's phrased more like "anyone" there. Brain is not good at context switching sometimes.

          • by Pascoea ( 968200 )

            There is no algorithm or curation that could have seen in advance how those two items would be connected.

            To be fair, now that the connection has been made it's pretty hard to un-see it.

          • by rgmoore ( 133276 )

            There's a real difference between eliminating images that are known to be offensive to many people today and eliminating all possibility of offense in the future. It's true that you can't conceive of every way something might be used in an offensive way in the future, but you can categorize some kinds of images as offensive to broad swaths of the public today. That's all they're trying to do: avoid producing images that they know in advance will offend many people. It's a reasonable goal, even if it is...

          • But the point I was making is that you can remove as much as you like, and then someone will come along and appropriate some innocuous image that will then be used as a new vector of offense. A classic example of this is the use of Winnie the Pooh to criticize Xi Jinping. There is no algorithm or curation that could have seen in advance how those two items would be connected.

            Sure, you can't ensure that nothing offensive will ever come out of it. But at the very least they could and should make sure that if you ask it to draw a black person, the AI doesn't reproduce a racist caricature because it learned from decades of cartoons, ads and other media, for example.

    • What I'm wondering is why it was named after a finicky printer company from the 1980s that used a multibus system running V7 or SysV or such as a RIP.
  • What kind of security classification is "E"? Obviously it's actually "Dall-Y-2".

  • by OzPeter ( 195038 ) on Tuesday May 24, 2022 @10:00AM (#62561374)

    During the DALL-E presentation video, my jaw dropped when the presenter said something like "And you can have it draw a Koala Bear".

    Will Google's Imagen have that same level of stupidity?

    • by gweihir ( 88907 )

      From the sample images, yes. Basically artificially created meaningless stock photos.

  • Here is a link to the LAION-400M dataset [laion.ai] that was used for training if anyone is curious.

    This section on Imagen caught my eye:

    While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.

    I'm curious how and what they consider "inappropriate content"?

    • So I followed a link from your link and searched for nsfw [github.io] and got back a bunch of results. I know, I actually read the page. What am I doing here? What am I doing in this handbasket? And where am I going?

      Anyway for those who don't want to peruse the results, most of the hits are of pink pastries, literally. Like, literal donuts with literal pink frosting on them, on a pink background. The next most popular category seems to be children's dolls, some of them in suggestive poses. Then the next popular might b...

      • Thanks for following this up and the link.

        The biggest hit being pink pastries is too funny. Guess this tool is biased against donut shops. Or classical art such as da Vinci's Vitruvian Man. Or classic Greek statues such as Venus de Milo [wikipedia.org].

        That CLIP front end you linked has got me curious about what they consider "violence" in an image. LUL.

        [ ] Safe mode
        [ ] Remove violence

        • I was pretty surprised that it was that easy, but I followed the thread from the top and the citation stated they were classified as 'nsfw', so I gave 'er the ol' college try, and bingo.

          If you think about it, the donut images they classified make total sense. You've got a whole lotta pink, you've got a hole, and you've got glistening frosting. I can absolutely see why it's making this hilarious mistake. In fact, there is a long and hallowed history of referring to portions of women's anatomy with the names of bre...

      • by gweihir ( 88907 )

        Hmm. Why are strawberries NSFW? Does this thing use colour-statistics to make its decision? Seems to be artificial stupidity at its best...

        • by narcc ( 412956 )

          Pink frosting != Strawberries

        • It clearly is doing some serious color-related matching. It seemed to me like it was recognizing a whole bunch of pink with a hole in it. Meanwhile, the arrays of strawberry slices could be interpreted as a vagina-related deepdream. There might also just be something there about symmetry. That's the miracle of "AI" though, right? Figuring out why it does what it does in specifics is mind-bending. I scrolled for a little while, and if I squinted I could easily mistake a lot of those images for some graphic p...

          • by gweihir ( 88907 )

            And that is one of the problems with artificial stupidity: It will misclassify without ever noticing.
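
            (To make the guess in this sub-thread concrete: a toy, purely hypothetical sketch of a naive colour-statistics heuristic; the names and thresholds are made up, and LAION's actual filter is CLIP-based, not this. It just shows why "mostly pink" would flag frosted donuts:)

                # Hypothetical naive heuristic: flag images dominated by
                # pink/skin-tone pixels. Demonstrates how a pink-frosted
                # donut becomes a false positive. NOT the real LAION filter.
                import numpy as np

                def pink_fraction(image: np.ndarray) -> float:
                    """Fraction of pixels in a rough pink/skin RGB range."""
                    r, g, b = image[..., 0], image[..., 1], image[..., 2]
                    pinkish = (r > 0.6) & (g > 0.2) & (g < 0.7) & (b > 0.3) & (b < 0.8)
                    return float(pinkish.mean())

                def naive_nsfw_flag(image: np.ndarray, threshold: float = 0.4) -> bool:
                    return pink_fraction(image) > threshold

                # A solid pink image: think frosted donut on a pink background.
                donut = np.full((64, 64, 3), [0.95, 0.55, 0.65])
                print(naive_nsfw_flag(donut))  # True: a false positive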

      • by ceoyoyo ( 59147 )

        Their NSFW filter seems to have a thing for ice cream and macarons.

        With a bit of Marge Simpson fan art thrown in as spice.

    • by gweihir ( 88907 )

      I'm curious how and what they consider "inappropriate content"?

      Anything that they think would negatively affect their bottom-line. This is not about morals or anything. This is purely about money.

  • ... sometimes is like talking to someone with severe Alzheimer's. Really annoying, with bizarre side-effects.

    However, there are instances where it works, and AFAICT those are increasing in proportion. It is quite likely that ML will reach critical mass and people will quickly become irrelevant for a large number of tasks [youtube.com].

    The downsides of this AI stuff are numbingly frustrating; it's like trying to do higher math with a toddler. But if they manage to train AI with intelligent settings, data-correction and...

    • by narcc ( 412956 )

      But if they manage to train AI with intelligent settings, data-correction and other methods of sophistication, true replacement of humans even for brain jobs can't be that far away.

      It's been just 10 years away for the last 50 years or so...

  • by systemd-anonymousd ( 6652324 ) on Tuesday May 24, 2022 @11:49AM (#62561712)

    It's SOO powerful that peons on the Internet aren't permitted to even touch it in a sandbox, because the unwashed masses might be "problematic, harmful, and racist" with it. I'm so sick of that tired excuse, when I accessed their own website through a fucking web browser on the Internet! I could be 100x more problematic, harmful, and racist by just going to the wrong website and sharing the wrong opinions.

    If the Internet were released today some big corporation would be bragging about it but saying that it has too much potential and is too dangerous for mere normal users to use. Only a select few who work for a billion dollar corporation have a chance of gaining access, because having money obviously makes you more ethical. I feel like we're hearing some Catholic preacher say we can only read the Bible in Latin because a translation would be too potentially harmful to our salvation.

    We've done a 180 from the way that tech used to be.

    • by narcc ( 412956 )

      If you don't like it, make your own. Nobody owes you a racist porn generator or whatever it is you want.

      • by systemd-anonymousd ( 6652324 ) on Tuesday May 24, 2022 @02:53PM (#62562164)

        It costs ~$100M to build the datacenters and buy the GPUs necessary to train diffusion models of that size, but nice try. And nothing in my comment has to do with wanting racism.

        Even foundations built with the purpose of giving normal people access to corporation-level AI have reneged on their promise when they saw dollar signs, like Sam Altman's "OpenAI." They released GPT-2, which got insanely popular, then decided the most ethical thing to do was coincidentally the one that made them the most money: keep GPT-3 and DALL-E 2 closed and charge people money for them. They then came up with a bunch of trendy bullshit about how people could generate "violent content," oh nooo! Violent drawings! What a bunch of transparent two-faced bullshit.

        • by narcc ( 412956 )

          Don't you have bootstraps? Why do you want everything handed to you?

          Also, why do you think you get to dictate what other people do with their own property? Get over yourself.

    • If the Internet were released today some big corporation would be bragging about it but saying that it has too much potential and is too dangerous for mere normal users to use. Only a select few who work for a billion dollar corporation have a chance of gaining access

      That alternative would be far superior to the present reality in which technology is accessible and cheap enough for scammers to use it non-stop to harass the remainder of the world, just to earn a few bucks.

      • In that alternate reality everything is a walled garden and you have to sign up for 50 subscription services to get access to a silo of corporate consumer content. Plus, what makes you think you wouldn't be scammed and hassled for money? That's half the purpose of corporate media. The pleasant Internet is the one driven by normal people who just want to interact and exchange information.

    • I suspect that the reason it is not available to the general public is that it is hardly as good as they say it is.

      • I've heard some first-hand accounts of DALL-E 2 at least, and though it's not perfect and requires manual curation of the results, it's still a massive step forward. It's probably equivalent to what GPT-2 was compared to other text completion models before it, and should be an order of magnitude better than CLIP.

  • I mean that is where the real money is after all.

  • Sure as hell beats anything I've read about lately. Not that I specialize in graphics, but this has so many use cases my brain hurts. FML, I love science, but we gotta leverage this for good before evil.
