1960s Chatbot ELIZA Beat OpenAI's GPT-3.5 In a Recent Turing Test Study (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: In a preprint research paper titled "Does GPT-4 Pass the Turing Test?", two researchers from UC San Diego pitted OpenAI's GPT-4 AI language model against human participants, GPT-3.5, and ELIZA to see which could trick participants into thinking it was human with the greatest success. But along the way, the study, which has not been peer-reviewed, found that human participants correctly identified other humans in only 63 percent of the interactions -- and that a 1960s computer program surpassed the AI model that powers the free version of ChatGPT. Even with limitations and caveats, which we'll cover below, the paper presents a thought-provoking comparison between AI model approaches and raises further questions about using the Turing test to evaluate AI model performance.

In the recent study, listed on arXiv at the end of October, UC San Diego researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the university's Department of Cognitive Science) set up a website called turingtest.live, where they hosted a two-player implementation of the Turing test over the Internet with the goal of seeing how well GPT-4, when prompted different ways, could convince people it was human. Through the site, human interrogators interacted with various "AI witnesses" representing either other humans or AI models that included the aforementioned GPT-4, GPT-3.5, and ELIZA, a rules-based conversational program from the 1960s. "The two participants in human matches were randomly assigned to the interrogator and witness roles," write the researchers. "Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators."

The experiment involved 652 participants who completed a total of 1,810 sessions, of which 1,405 games were analyzed after excluding certain scenarios like repeated AI games (leading to the expectation of AI model interactions when other humans weren't online) or personal acquaintance between participants and witnesses, who were sometimes sitting in the same room. Surprisingly, ELIZA, developed in the mid-1960s by computer scientist Joseph Weizenbaum at MIT, scored relatively well during the study, achieving a success rate of 27 percent. GPT-3.5, depending on the prompt, scored a 14 percent success rate, below ELIZA. GPT-4 achieved a success rate of 41 percent, second only to actual humans.
"Ultimately, the study's authors concluded that GPT-4 does not meet the success criteria of the Turing test, reaching neither a 50 percent success rate (greater than a 50/50 chance) nor surpassing the success rate of human participants," reports Ars. "The researchers speculate that with the right prompt design, GPT-4 or similar models might eventually pass the Turing test. However, the challenge lies in crafting a prompt that mimics the subtlety of human conversation styles. And like GPT-3.5, GPT-4 has also been conditioned not to present itself as human."

"It seems very likely that much more effective prompts exist, and therefore that our results underestimate GPT-4's potential performance at the Turing Test," the authors write.
  • Anyone with even a half-functioning brain can tell that Eliza-type chatbots are not human after only a few sentences. Who did they get for the study, special needs patients or maybe preschool children??

    • by Mr. Dollar Ton ( 5495648 ) on Saturday December 02, 2023 @05:42AM (#64048637)

      You'd think so, but about 35 years ago we had set up an Eliza-like bot to chat with someone, who, after talking to it for an hour, proceeded to invite "her" on a date and even went there with flowers. And he wasn't anything like dumb.

      • by ShanghaiBill ( 739463 ) on Saturday December 02, 2023 @08:55AM (#64048879)

        That's not a Turing test.

        In a Turing test, the questioner is told the subject might be a machine, so the questions are specifically designed to differentiate between a human and a machine.

        So you ask questions like:

        "John punched the old man. He was in the hospital for a week. Who was in the hospital?"

        "John punched Mike Tyson. He was in the hospital for a week. Who was in the hospital?"

        When talking to a potential date, you're trying to see if they are a nice person who maybe shares some interests with you. That is much easier to fake. If the questioner doesn't suspect, and isn't told, that they're talking to a machine, they could easily assume ELIZA is just a nice person.

        • by jd ( 1658 )

          The Turing Test, done properly, will include questions designed to catch out machines, yes, but these aren't going to be limited to logic puzzles. They'll include subjective questions, paradoxical questions, questions with insufficient data, experiential questions - stuff that can't be answered simply by knowing more or through basic parsing.

          Ultimately, the Turing Test relies on the principle that if f(x)=g(x) for all x, then f=g.

        • Re: (Score:2, Flamebait)

          Did I say it was, smartass?

        • by narcc ( 412956 )

          That's not a Turing test.

          Sure it is! It's not like it's rigorously defined. Outside that goofy contest anyway, but do you really think that 'amateur tournament' is the best way to measure this? While a program like ChatGPT might have done well in the past, we've all gotten pretty good at spotting its output. Worse, the people aren't good at being objective, so we can expect things like labeling obvious bots as humans because they think the humans might be pretending to be machines.

          With that in mind, if you still buy into t

      • by syn3rg ( 530741 )
        Lenny [lennytroll.com] has been regularly fooling telemarketers [reddit.com] for over 12 years [wikipedia.org] (admittedly not a high bar, but still).
    • That's because we know Eliza's "trigger" questions that tip her hand. Try that again without knowing them.

      • by Viol8 ( 599362 )

        Oh please. There was a web version of Eliza around, maybe still is. You can tell in less than a minute it's a bot. Unless perhaps you're that thick.

        • If you ever had to deal with a shrink, you might not be as convinced. Eliza was basically a shrink-in-a-box. Shrink-wrap, if you will.

        • "There was a web version of Eliza around , maybe still is."
          https://web.njit.edu/~ronkowit... [njit.edu]

          Also this snippet from Wikipedia:
          Another version of Eliza popular among software engineers is the version that comes with the default release of GNU Emacs, and which can be accessed by typing M-x doctor from most modern Emacs implementations.
          https://en.wikipedia.org/wiki/... [wikipedia.org]
          I haven't tried that personally...
      • I see. Tell me more about that's because we know Eliza's "trigger" questions that tip her hand.

    • Yes, but so does ChatGPT. It's *far* more verbose than a human and far faster than a human could type.

    • by Junta ( 36770 )

      Nah, just people used to conversing with people on Twitter (errr.. X I guess).

    • by Rei ( 128717 )

      The problem is that models like GPT-4 are not finetuned for the most convincing pretend-I'm-a-person behavior possible, but rather to be an AI assistant. The personality you encounter is not that of the underlying model, but rather that of the finetuning, where the already-trained foundation is given a series of sample prompt-and-response scenarios to get a sense of how the developers want it to behave in response to user prompting.

      TL/DR: responses like, "As an AI assistant..." are pretty sure ways to g

    • Hello, this is Lenny
    • Anyone with even a half functioning brain can tell that Eliza type chatbots are not human after only a few sentences

      I've got news for you. It doesn't take even a few sentences for the GPT et al crowd.

    • by dbialac ( 320955 )
      Systems like ChatGPT are likely easy to defeat as well: talk about subjects that diverge widely from one another. A human will have a hard time going on about the two topics, but ChatGPT won't.
  • Not meaningful (Score:4, Informative)

    by bradley13 ( 1118935 ) on Saturday December 02, 2023 @05:34AM (#64048631) Homepage

    If they used the real, original Eliza, no one would think it is human. All it does is mirror back what you said. If it can't figure out what to mirror back, it just sends a pat phrase. After 3-4 exchanges, the pattern is just glaringly obvious. Example:

    - "Tell me about AI" --> Eliza is confused, and just says "Tell me more about this"

    - "I am worried about AI" --> Eliza mirrors "So, you are worried about AI. What do you feel about that?"

    That's it; that's all that Eliza does. That was "state of the art" in computer conversation in 1966.
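    The two exchanges above can be sketched as a keyword-and-mirror loop. This is a toy illustration in Python of the general technique, not Weizenbaum's actual DOCTOR script; the rules and phrasings are made up to match the examples:

```python
import re

# Illustrative rules in the spirit of ELIZA's keyword scripts: match a
# pattern, mirror the captured text back into a template; anything that
# doesn't match falls through to a stock phrase. (Not the real 1966
# script -- just the shape of the technique.)
RULES = [
    (re.compile(r"i am (.*)", re.IGNORECASE),
     "So, you are {0}. What do you feel about that?"),
    (re.compile(r"i feel (.*)", re.IGNORECASE),
     "Why do you feel {0}?"),
]
FALLBACK = "Tell me more about this"

def respond(utterance: str) -> str:
    """Return an ELIZA-style mirrored response to one user utterance."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        m = pattern.match(text)
        if m:
            return template.format(m.group(1))
    return FALLBACK
```

    Everything not caught by a rule hits the same stock phrase, which is exactly why the pattern becomes glaringly obvious after a few exchanges.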

    Meanwhile ChatGPT isn't even trying to pass a Turing test. For the first prompt above, it writes you a mini-paper, with headings and everything. Of course, no person would do that, so it fails the Turing test. However, a person might be able to produce a summary like that with an hour's work.

    • Re:Not meaningful (Score:5, Interesting)

      by Opportunist ( 166417 ) on Saturday December 02, 2023 @06:59AM (#64048673)

      That's because Eliza was originally meant to act like a psychiatrist. And that's what usually happens there: you get a mirror held in front of you that throws questions back at you. That's all the original was, and all it was meant to be.

      And believe it or not, it worked pretty well as such.

      • Re:Not meaningful (Score:4, Interesting)

        by HiThere ( 15173 ) <charleshixsn.earthlink@net> on Saturday December 02, 2023 @09:32AM (#64048921)

        Specifically, as a Rogerian psychiatrist. Different schools of psychiatry take different approaches, and the Rogerian approach was the easiest to model.

        What's interesting is that there was another program called PARRY, which modeled a paranoid. When Eliza and Parry were put into a dialog, the transcript could not be identified as machine-generated (when mixed in with a pile of real transcripts) by reviewers who were looking for the machine-generated one. That's sure not the Turing test, but it's a definite hint that a lot of human interaction is script-driven.

    • "Of course, no person would do that"

      Except I have, on occasions when I'm manic, in response to very simple questions like that one. So maybe you're incorrect, or, maybe *I* don't pass the Turing test on those occasions?

  • Did it? (Score:5, Funny)

    by The Evil Atheist ( 2484676 ) on Saturday December 02, 2023 @06:36AM (#64048665)
    And how does that make you feel, ChatGPT?
    • We should see what happens when we get ChatGPT talking to Eliza. They could go on forever! Do you suppose either one would figure out that the other isn't human?

    • I don't have feelings or emotions, so I don't experience any emotional response. However, I'm here to provide information and assist you to the best of my abilities. If there's anything specific you'd like to know or discuss, feel free to let me know!
  • The dumber the chatbot is, the more it can be confused with humans ... Who would have thought differently? But seriously, modern AI can be instructed to behave like a human. You can tell it to talk like an 8-year-old and it does it. I have experimented with asking it to pretend to be an ignorant student and learn some math. It behaves then as such. It does make mistakes as expected. The Turing test has to be done correctly: instruct the AI to answer in such a way that it passes the Turing test. I'm
  • Look it up on YouTube. DR SBAITSO shipped with Sound Blaster audio cards. Kinda the same thing!
  • That means something that does not pass it is surely not intelligent. It does not say something that passes it is intelligent.

    • by HiThere ( 15173 )

      That's not true of the actual Turing test. The problem is that a lot of people would fail that test, so you've got to scale it. For a program to pass the actual Turing test (with a reasonable questioner) it would need to be MORE intelligent than most people. Most of these setups run a really simplified version that doesn't prove much of anything.
      Just consider: suppose the question were "Could you write me a haiku on a robin's egg?" What should the answer be? (A good one might be "I think it might fit".)

      • by gweihir ( 88907 )

        I was speaking for systems, not people. The Turing test is not suitable for people.

        • by HiThere ( 15173 )

          Actually, it *is* suitable for people. Turing modified a Victorian game (I'm not sure how popular it was) where you tried to decide whether the hidden respondent was a man or a woman. (I forget the details of the game, but that's the gist.)

          • by gweihir ( 88907 )

            Not really. The problem is people generally do not use what intelligence they have and rely on dumb automation instead. Machines are different and always give it their best. I do know about the origin of the Turing test and the original design. That one does not cut it either way and is really just a game, not a test. Not a criticism of Turing, but he clearly was not very serious in the design of this "test".

    • Exactly.
      The purpose of the Turing test is to act as a filter, so that systems that fail can be thrown out before anyone wastes any more time on them.

  • ChatGPT is designed & developed to be a digital assistant. It's not designed to pass a Turin test. Simply put, most humans don't behave like a digital assistant so it's pretty easy to tell them apart.
    • Also, even among the highly educated, we have a cognitive bias against (we're very bad at) distinguishing between meaningful, purposeful utterances & nonsense, as this paper amusingly demonstrated: Pennycook, G., Cheyne, J. A., Barr, N., Koehler, D. J., & Fugelsang, J. A. (2015). On the reception and detection of pseudo-profound bullshit. ResearchGate, 10(6), 549–563. Retrieved from https://www.researchgate.net/p... [researchgate.net]
    • by hawk ( 1151 )

      > It's not designed to pass a Turin test.

      a test which is shrouded in obscurity . . . ok, that wraps this up . . .

      • "shrouded in obscurity" is a bit melodramatic, don't you think? I'd say more like "poorly defined." I'd also bring up Winograd schemas but it looks like they've specifically trained their models on those, although you can still catch ChatGPT out if you swap the typical transitivity around a little, i.e. it hasn't been trained particularly comprehensively in this respect. It could be that the Kenyans (with the Kenyan English dialect) who trained ChatGPT use transitivity a little differently to Europeans
        • Woosh!

          That's the sound the Turin Shroud makes as it wafts past you.

          • Oh, right. Got it now. Not into religion so didn't really think about that. Got a couple of friends from Turin though. For me, it's just a city at the foot of the mountains in Piedmont in north-western Italy. Apparently not a great place to live, which is why my friends left. Their parents intend to move from there soon too.
  • In the interest of transparency, GPT seems compelled to frequently inject "as an AI language model...." which would seem to be a slam dunk. Seems particularly likely if the prompts contain questions asking about various facets of AI.

    Eliza did no such thing. Eliza's responses are just really short, stupid and not interested in actually processing what the other side of the conversation is saying. Just like humans sending Tweets, so perhaps more familiar to participants as human-like than verbose nuanced re

  • by chas.williams ( 6256556 ) on Saturday December 02, 2023 @08:37AM (#64048855)
    They should probably check that first.
  • Headline writers should at least *try* to capture the gist of the article.

    "Humans Beat GPT-4 in Turing Test by Small Margin"

  • by WaffleMonster ( 969671 ) on Saturday December 02, 2023 @10:36AM (#64049057)

    1. They failed to disclose basic facts such as the prompt and temperature setting assigned to each personality.

    "Each LLM witness consisted of a model (GPT-3.5 or GPT-4), a temperature setting (0.2, 0.5, or 1.0) and a prompt."

    Who was assigned what? This seems like critical information that could have easily been provided.

    2. It might not have made much difference; however, response delays should have had some jitter applied to them, or some kind of temporal blinding across the board.

    3. With an average of only 4 messages sent by interrogators per conversation it doesn't seem like interrogators were taking their task all that seriously.

    4. "At the end of each round, the identity of the Witness will be revealed."

    Why? If you are going to allow participants to try more than once... why provide such feedback between rounds?

    5. Only 18% of interrogators ended up talking to a human, and 4% of the human witnesses knew their interrogators? The ratio should be more balanced to head off the assumption (or scuttlebutt) that the site is a botfest.

    6. Why 5mins / 300 character limit?

    7. IIRC 58% of decisions were on the basis of "formality". A problem easily averted through competent prompting.

    Personally I'm not a fan of Turing tests especially of this sort. It's cheap and pointless to focus on whether you are talking to Cmdr Data because he uses contractions, types faster than a human or knows too much...etc.

    My biggest criticism of this study: if you are not going to bother to seriously try and iterate to adjust the machine to better fool humans, the question needs to be asked whether you are testing what the machine is capable of, or merely how it happens to be configured.

    The litany of eval benchmarks is generally a far better way to judge LLMs than Turing tests, yet I think in different settings, for example as an assistant or a colleague, such judgements would have real value to individuals.
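    The jitter suggestion in point 2 might look something like this (a hypothetical sketch; the function name, parameters, and uniform distribution are my own assumptions, not anything from the study):

```python
import random

# Hypothetical sketch of the jitter idea in point 2: draw every
# witness's send delay (human or AI alike) from the same distribution,
# so response timing alone can't betray which is which. The parameter
# values here are illustrative, not taken from the study.
def blinded_delay(base_seconds: float = 3.0, jitter_seconds: float = 2.0) -> float:
    """Return a message send delay with uniform random jitter applied."""
    return base_seconds + random.uniform(0.0, jitter_seconds)
```

    Applying the same delay policy to every witness is the "temporal blinding across the board" part; the jitter keeps the delay from being a constant that interrogators could learn.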

  • Perhaps in our fantasies we outperformed each other. Eliza could not in any way compete with the current chat systems in terms of providing accurate information on request. It is not clear how these competitions were framed, but they'd have to be kept to a pretty narrow level of basic human emotions and concepts before there could be any comparison.
    • by PPH ( 736903 )

      Eliza could not in any way compete with the current chat systems in terms of providing accurate information on request.

      But that's due to the limited breadth of training data available at the time it was built. No such thing as a web crawler existed. But I suspect that the information it could provide was much more accurate than what today's web-fed LLMs can provide. Because the data was hand curated in advance. Not scraped from the cesspool of human knowledge that is social media.

  • A small amount of good data is much more valuable than a massive amount of horrible, inaccurate and deliberately misleading data. Stupid analogy:

    Eliza was fed a nutritious breakfast of reasonable prompts by parents who wanted her to grow up big and strong. ChatGPT is being fed commercials and spam because they're free and available and their parents don't really care enough to feed 'em anything good.

    ChatGPT was designed to consume and regurgitate ads from all over the internet. Humans hate ads, so of c
  • ... that ELIZA competed against GPT-3.5 and GPT-4 and came out even with them, more or less!
