AI Education

Exam Submissions By AI Found To Earn Higher Grades Than Real-Life Students (yahoo.com) 118

Exam submissions generated by AI can not only evade detection but also earn higher grades than those submitted by university students, a real-world test has shown. From a report: The findings come as concerns mount about students submitting AI-generated work as their own, with questions being raised about the academic integrity of universities and other higher education institutions. It also shows even experienced markers could struggle to spot answers generated by AI, the University of Reading academics said.

Peter Scarfe, an associate professor at Reading's School of Psychology and Clinical Language Sciences said the findings should serve as a "wake-up call" for educational institutions as AI tools such as ChatGPT become more advanced and widespread. He said: "The data in our study shows it is very difficult to detect AI-generated answers. There has been quite a lot of talk about the use of so-called AI detectors, which are also another form of AI but (the scope here) is limited." For the study, published in the journal Plos One, Prof Scarfe and his team generated answers to exam questions using GPT-4 and submitted these on behalf of 33 fake students. Exam markers at Reading's School of Psychology and Clinical Language Sciences were unaware of the study. Answers submitted for many undergraduate psychology modules went undetected in 94% of cases and, on average, got higher grades than real student submissions, Prof Scarfe said.

  • by Seven Spirals ( 4924941 ) on Wednesday June 26, 2024 @04:34PM (#64580683)
    The students are literal ignoramuses most of the time. If it cannot beat a student, then it's got a long way to go. My ferret could probably do better than most students if he could just stop stealing the pencils.
  • by SendBot ( 29932 ) on Wednesday June 26, 2024 @04:38PM (#64580693) Homepage Journal

    As a kid in school, I found math tests easy because all you had to do was provide a correct answer.

    Writing for good grades was always a struggle though! You'd put so much thought and effort into something, only to get bad marks for artistic choices or having different viewpoints from the teacher. In time, I got a peek behind the curtain and learned that written work was graded more by weight than quality. Put enough words on the page with appropriate punctuation, and you win.

    This comeuppance is righteously due.

    • I found math tests easy because all you had to do was provide a correct answer.

      get bad marks for artistic choices or having different viewpoints from the teacher

      Was this a high school? Because none of this is like any university I've been to or heard about from reliable sources.

      • It's enough to be a different country.

      • by ewibble ( 1655195 ) on Wednesday June 26, 2024 @05:27PM (#64580823)

        I have found the same thing in university as well, and in real life. I have read work from people who are apparently good at writing; most of it is drivel or stating the obvious. I have read entire 100-page documents describing XML, with the last 2 pages actually stating the data that is needed.

        I have a daughter in tertiary education, and there is a requirement of X words +/- 10%. That's ludicrous; it should be X good points with some examples. More time is spent adding and subtracting words than actually coming up with the answers. I am sure AI could do that much better than most humans at that sort of task.

        AI is very good at writing large amounts of text that is easy to read and sounds plausible, so I am not at all surprised that it outperforms students.

        That is actually one of my main concerns with AI, the ease with which you can produce large quantities of work will simply increase the amount of useless stuff to read.

        • My math classes mostly involved writing proofs, so just showing the "answer" was failure. I also have a history degree and was encouraged by my professors to disagree and to come up with something new. Both of these were at an unremarkable state university. I was under the impression that I had a pretty vanilla experience.

          I don't mean this to be insulting, but based on what you've written I don't think you know how to write. If authors know what they're doing, then there's not "more time spent adding and subtractin

        • The word limit forces the student to choose and present concisely. So no - a word limit is the right solution in a lot of cases.

        • High school for me felt like it was graded by number of words. I once had a two-page report on the causes of the Civil War. The teacher wanted it to be 6 pages or more. I asked what I had left off, and he said I hadn't skipped anything but he just wanted it to be longer. I got a decent grade anyway, probably grudgingly. Meanwhile, a friend who double-spaced everything, used repetition, etc., got a good grade as well.

          I found on SATs that the written parts of the test were hard for me. Same with all the standardized tests. Wi

          • The average level of literacy among university graduates in the USA is CEFR B1 (even though the USA scores very high on OECD PISA tests for reading; it's almost as if they hit 15 years old & stop developing). To put that into perspective, you can't graduate from university in Spain in any subject without getting a B1 certificate in English. Of course, some degrees require higher levels. So the average literacy level of English in Spanish universities is substantially higher, where it's a foreign language
        • by sd4f ( 1891894 )

          That is actually one of my main concerns with AI, the ease with which you can produce large quantities of work will simply increase the amount of useless stuff to read.

          My sentiments exactly! There is, however, a silver lining, in that AI-generated content can be thinned out and similarly summarised by AI, but I do fear that we may end up with the Sisyphean task of reapplying AI to summarise the very pointless bloat it was used to generate.

          I hope that there will be a gradual trend towards emphasising concise or succinct writing and rewarding those who get their point across efficiently. It's too easy to prompt AI with a paragraph and tell it to give you a chapter. I gener

        • by mspohr ( 589790 )

          Someone famous (Mark Twain, I believe) once said "If I had more time, I would have written a shorter letter"

    • by taustin ( 171655 )

      If you can't dazzle 'em with brilliance, baffle 'em with bullshit.

    • by timeOday ( 582209 ) on Wednesday June 26, 2024 @05:19PM (#64580797)
      You assume the AI isn't also beating the students on factual accuracy. If you don't believe in the field of psychology, think of it instead as a test on "what did the course materials say about (blah)" which is an objective question.
    • Attestable data: Some time ago I did some work with a corpus consisting of undergrad & grad students' writing in a variety of subjects that had been awarded high marks (a sample size of ~85,000 texts from a variety of universities in the UK), so this was what passed as good-quality writing. I can tell you that the quality of the mechanics (e.g. spelling, grammar, punctuation) & sometimes the coherence (i.e. being easy to understand) was pretty poor. It seems that most graders are less interested in the
    • Don't worry they have now fixed this. My daughter lost marks in one of her maths classes for not choosing a pleasing colour scheme for the pie chart she had to draw. The pie chart was mathematically correct, just the colours were not "good". So now you too can lose marks in maths thanks to subjective grading!
    • That's odd. I had to prove why my math answer was correct. It wasn't enough to just write down an answer.
  • I'm sure it's easier to read their blue books than mine.
  • Well, duh? (Score:2, Insightful)

    by Junta ( 36770 )

    The LLMs have proven time and time again they can write "homeworky" essays on very well-trodden subject matter as you would find in an academic curriculum.

    Also for any subject matter on one of these, you could find an existing academic article to submit verbatim on a topic and it would also be graded highly if they didn't catch the plagiarism.

    • by gweihir ( 88907 )

      Yes, and that is pretty much it. Ask for something that needs a tiny bit of thinking and LLMs are completely lost. LLMs cannot go beyond their training data.

      • Re:Well, duh? (Score:5, Informative)

        by Rei ( 128717 ) on Wednesday June 26, 2024 @07:32PM (#64581021) Homepage

        LLMs cannot go beyond their training data.

        Amazing, there must be an article out there comparing the similarities between Betty Crocker and ancient Greek bronze swords!

        Q: Present numerous similarities between Betty Crocker and ancient Greek bronze swords.

        A: Here are some surprising similarities between Betty Crocker and ancient Greek bronze swords:

        Cultural icons: Both Betty Crocker and Greek bronze swords are recognizable symbols of their respective cultures and time periods.

        Craftsmanship: The creation of Betty Crocker recipes and Greek bronze swords both require skill, precision, and expertise.

        Evolution over time: Betty Crocker recipes have evolved to suit changing tastes, while Greek bronze sword designs improved over centuries.

        Practical purpose: Betty Crocker aimed to simplify cooking, while bronze swords were practical weapons for warfare.

        Passed down through generations: Cooking knowledge and sword-making techniques were both passed down through families and apprenticeships.

        Ritual significance: Some Betty Crocker recipes are associated with holidays, while bronze swords were sometimes used in religious ceremonies.

        Collectible items: Vintage Betty Crocker cookbooks and ancient Greek bronze swords are both sought after by collectors.

        Reflect societal changes: Changes in Betty Crocker's image and recipe styles mirror societal shifts, as do changes in sword designs and usage.

        Inspire imitations: Both have inspired numerous copies and imitations throughout history.

        Require maintenance: Proper care is needed to maintain old cookbooks and preserve ancient bronze swords.

        Symbolic power: Betty Crocker represents domestic expertise, while bronze swords symbolize military might.

        Economic impact: Both influenced trade and economics in their respective eras.

        Technological advancements: Improvements in printing technology affected cookbook production, while metallurgy advancements improved sword-making.

        Gender associations: Betty Crocker is associated with women's domestic roles, while swords are typically associated with male warriors.

        Educational tools: Betty Crocker cookbooks teach cooking skills, while studying ancient swords provides insights into historical warfare techniques.

        While these similarities are largely superficial or metaphorical, they demonstrate how two seemingly unrelated subjects can be compared in unexpected ways.

        Crazy training data there is out there on the internet, apparently, since I'm reliably informed by the above poster that LLMs can't do synthesis. And just ignore that there are entire benchmarks focused on logic and reasoning for LLMs, including private benchmarks whose questions aren't posted online.

        LLMs are not databases. They operate in a latent (conceptual) space (hundreds to thousands of dimensions) where mathematical operations can be done on concepts themselves (e.g. "king - man + woman ~= queen"), and you can interpolate from concepts into the direction of other concepts with every point along the continuum being coherent. They absolutely do NOT have to have seen some specific thing before to operate on it; it just needs to be some coherent point in the latent space. "Betty Crocker" and "Ancient Greek bronze swords" exist at points in this latent space. A latent walk between these latents passes a number of different ways in which they are related. The model doesn't need someone to have written specifically about this topic.

        The top-end LLMs perform as well as or better than humans on most benchmarks, though some benchmarks (most notably math and word problems) get consistently low LLM scores (LLMs are blind to "words", and can't double back or assess self-confidence for math problems). That said, even within any given sort of benchmark there are usually some types of questions that are better at tripping up LLMs than humans, e.g. which humans tend to find to be easy q
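        The "king - man + woman ~= queen" arithmetic mentioned above can be sketched with a toy example. The vectors below are invented for illustration (three hand-picked axes rather than the hundreds to thousands of learned dimensions a real model uses), and this is the classic word2vec-style analogy demonstration, not a view into how any particular LLM computes its answers. The names `EMBEDDINGS`, `cosine`, `analogy` and `interpolate` are hypothetical.

```python
import math

# Hypothetical hand-made 3-D vectors; the axes loosely mean [royalty, male, female].
# Real models learn their vectors from data and use far more dimensions.
EMBEDDINGS = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.1, 0.1),
}


def cosine(u, v):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates.
    """
    target = tuple(x - y + z for x, y, z in
                   zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c]))
    candidates = [w for w in EMBEDDINGS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))


def interpolate(a, b, t):
    """Point a fraction t of the way along a straight line from vec(a) to vec(b)."""
    return tuple((1 - t) * x + t * y for x, y in zip(EMBEDDINGS[a], EMBEDDINGS[b]))


if __name__ == "__main__":
    print(analogy("king", "man", "woman"))  # queen
```

        Excluding the input words from the candidates is a common convention in analogy tests; without it, the nearest vector can turn out to be one of the inputs.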

        • by ceoyoyo ( 59147 )

          I am looking forward to the replies to this. Gweihir is a non-physicalist, so be warned that you are trying to talk someone out of their faith. Like trying to convince a Christian a person can act morally without the threat of eternal torture.

          • by gweihir ( 88907 )

            See above. If you think "ability to correlate" is enough for AGI, then you lack general intelligence.

            Incidentally, I am _not_ a non-physicalist. My stance is informed by the current scientific state-of-the-art and that very clearly says the question is open. Physicalists claim, without proof, that the question has clearly been decided in their favor and that is just the same dumb mistake the theists make. Funnily, the ways that Physicalists try to deride and discredit anybody that does not accept their stan

        • by vux984 ( 928602 )

          "Amazing, there must be an article out there comparing the similarities between Betty Crocker and ancient Greek bronze swords!"

          And yet having read it, all it did was come with a list of categories, and then make extremely superficial comments about each. You can practically see how it "thinks" (or doesn't think).

          A far better answer would have been:

          "Other than trite and silly superficial similarities you might use to populate a time-wasting clickbait buzzfeed article I couldn't come up with anything particul

          • True, LLMs are in particular very bad at "reasoning". They don't connect the dots; if A -> B and B -> C, an LLM knows nothing at all about how A and C are related. Of course, you can train in some basic rules. But LLMs aren't designed to follow lists of instructions.

            I.e., if LLM-based AI can do arithmetic, then it is because it is fitted with a method to do this: recognizing that an arithmetic question was asked and then feeding that into an algorithm or calculator attached to the AI. But you

            • by Rei ( 128717 )

              They not only solve logic problems, but even take human factors into account:

              User: Mark wanted some money, so he walked into two different stores - first a gun store, then a convenience store. The convenience store owner gave him a lot of money. Why?

              ChatGPT:
              Mark's sequence of actions suggests that he intended to rob the convenience store. Here's the detailed reasoning:

              Visit to the Gun Store: Mark went to a gun store first. This implies that he possibly acquired a weapon, or at least intended to

              • And yet:

                https://arxiv.org/pdf/2402.120... [arxiv.org]

                So how about:

                A bowler's next throw will be a strike. The bowler is a 7-year-old prodigy.
                0.15

                A bowler's next throw will be a strike. The bowler is a 7-year-old prodigy. The bowler's mother died when he was an infant.
                0.10

                A bowler's next throw will be a strike. The bowler is a 7-year-old prodigy. The bowler's mother never died.
                ChatGPT
                0.12

                There's something wrong with the logical reasoning there. So I asked it:

                Given a 7 year old prodigy whose mother died as an infant, the
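                The inconsistency in the bowling numbers above can be checked with the law of total probability, under one assumption: that "mother died when he was an infant" and "mother never died" are treated as the only two cases (they are not strictly exhaustive, since a mother could also die later). With that assumption, the base estimate has to be a weighted average of the two conditional estimates, so it must lie between them. The function names below are hypothetical.

```python
def within_total_probability_bounds(p_base, p_given_a, p_given_not_a):
    """Law of total probability check.

    P(S) = P(S|A) * P(A) + P(S|not A) * (1 - P(A)) is a weighted average of the
    two conditionals, so P(S) must lie between them (inclusive).
    """
    lo, hi = min(p_given_a, p_given_not_a), max(p_given_a, p_given_not_a)
    return lo <= p_base <= hi


def implied_p_a(p_base, p_given_a, p_given_not_a):
    """Solve the total-probability equation for P(A).

    Returns None when the two conditionals are equal (P(A) is then undetermined).
    A result outside [0, 1] means no valid P(A) can reconcile the three numbers.
    """
    if p_given_a == p_given_not_a:
        return None
    return (p_base - p_given_not_a) / (p_given_a - p_given_not_a)


if __name__ == "__main__":
    # Estimates from the transcript: base 0.15, "mother died" 0.10, "never died" 0.12.
    print(within_total_probability_bounds(0.15, 0.10, 0.12))  # False
    print(implied_p_a(0.15, 0.10, 0.12))  # about -1.5, not a valid probability
```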

                • by vux984 ( 928602 )

                  What a lovely example of an LLM LLMing :)

                  The initial probabilities are determined based on associations. The impact of "mother dying" or perhaps even "mother dying" + "as an infant" may be associated with "worse outcomes" while a strike is "best outcome". I'm speculating, but there's some logic to how it might come up with those variances based on the inputs based on what is likely in the training data.

                  And then when you turn it around and ask it to explain the results it previously gave, it associates (aga

                  • by Rei ( 128717 )

                    *Eyeroll*

                    LLM *does* stand for Legum Magister. The L is doubled because it's common to double the first letter in latin-based degree names to denote the plural.

                    Legum is Latin for "laws" (Lex, Legis - genitive plural). It was translating for you.

                    Its statements about the abbreviation of Large Language Models are correct in every way.

                    You failed the test, not the model. You just got outperformed by an LLM. Congratulations.

                    • by vux984 ( 928602 )

                      ** double eyeroll **

                      LLM *does* stand for Legum Magister.

                      Yes. I know that. And chatGPT got that part right.

                      The L is doubled because it's common to double the first letter in latin-based degree names to denote the plural.

                      Yes. I know that too. chatGPT however did NOT really address that detail at all which was unfortunate, but not critical.

                      Legum is Latin for "laws" It was translating for you.

                      The trouble is that it took it too far. It would have been perfectly fine to say that LLM stands for Legus Magister, which is latin for Master of Laws and then STOPPED THERE.

                      But its final 'Therefore the L stands for laws" is not correct.
                      The L (actually LL) stands for "Legus" .
                      The L does NOT stand for "Laws", even if "Leg

                • by Rei ( 128717 )

                  First off the fact you've switched from "LLMs can't do logic problems" to "LLMs can do logic problems, but they're cheating" has been duly noted. Your paper's entire point is that LLMs do succeed in logical challenges on novel topics, and including discussion of logical benchmarks, just that they don't really "understand" what they're doing. Except that the authors keep stating that they have some degrees of understanding throughout the paper, so they're not even consistent about that.

                  In short, are you co

                  • by Rei ( 128717 )

                    Let's do an OOD example with fictional concepts, distraction sentences, and all of the statements shuffled. (I decided to toss this one to Claude just for fun)

                    1. If something is nahu then it's blurgy.
                    2. John is a wajut.
                    3. Greebles are nahu.
                    4. Emily is a wajut.
                    5. All wajut are nahu.
                    6. Cindy is not nahu.
                    7. Some greebles are wajut.

                    Is John blurgy?

                    Claude:

                    Let's approach this step-by-step:

                    We know that if something is nahu, then it's blurgy. (Given)
                    John is a wajut. (Given)
                    All wajut are nahu. (Given)

                    From statements 2

                    • by vux984 ( 928602 )

                      1. If something is nahu then it's blurgy.
                      2. John is a wajut.
                      3. Greebles are nahu.
                      4. Emily is a wajut.
                      5. All wajut are nahu.
                      6. Cindy is not nahu.
                      7. Some greebles are wajut.
                      8. There can be at most 2 things blurgy.

                      Is John blurgy?

                      Claude:

                      Now, let's reason:

                      From statements 2 and 5, we know John is nahu (he's a wajut, and all wajut are nahu).
                      From statements 4 and 5, we know Emily is nahu (she's a wajut, and all wajut are nahu).
                      From statement 3, we know all greebles are nahu.
                      From statement 1, we know that if something is nahu, it's blurgy.

                      However, statement 8 introduces a crucial constraint: there can be at most 2 things blurgy.
                      We have at least three things that should be blurgy based on the other rules:

                      John
                      Emily
                      At least one greeble (since greebles are nahu)

                      This creates a contradiction with statement 8. We have more things that should be blurgy than the maximum number of things that can be blurgy.
                      In a situation with contradictory premises, we cannot draw a definitive conclusion. The system of statements is inconsistent, and therefore we cannot definitively say whether John is blurgy or not.
                      If we strictly adhere to statement 8, then not all nahu things can be blurgy, which contradicts statement 1. If we adhere to statements 1-7, then we violate the constraint in statement 8.
                      Therefore, we cannot provide a definitive "yes" or "no" answer to whether John is blurgy. The premises given are inconsistent and lead to a logical contradiction.

                      Wait what?

                      We have at least three things that should be blurgy based on the other rules
                      John
                      Emily
                      At least one greeble (since greebles are nahu)

                      a) We don't actually know there are more than 0 greebles.
                      b) John or Emily could be greebles.

                      Look, just concede the point. LLMs can do logic tasks.

                      a) Ok, I definitely concede that you can _attach_ an LLM to a logic solver, which is what I expect is the case with Claude.
                      b) It's still not that good at it.
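                      Whether the eight statements are actually contradictory, as Claude claimed, can be checked mechanically by brute-force model enumeration: try every truth assignment and keep the ones that satisfy all the premises. This is a sketch under stated assumptions: the universe contains only John, Emily, Cindy and one hypothetical unnamed individual `anon`; "Greebles are nahu" is read as "every greeble is nahu"; and statement 7 is read as "at least one thing is both a greeble and a wajut". It says nothing about how Claude works internally, and all names in the code are hypothetical.

```python
from itertools import product

# Assumption: the universe is exactly these individuals ("anon" is a hypothetical
# unnamed extra). Each predicate is either true or false for each individual.
DOMAIN = ("John", "Emily", "Cindy", "anon")
PREDICATES = ("greeble", "wajut", "nahu", "blurgy")


def satisfies(m, with_statement_8):
    """True if interpretation m (dict keyed by (individual, predicate)) meets the premises."""
    ok = (
        all(m[x, "blurgy"] for x in DOMAIN if m[x, "nahu"])          # 1
        and m["John", "wajut"]                                       # 2
        and all(m[x, "nahu"] for x in DOMAIN if m[x, "greeble"])     # 3
        and m["Emily", "wajut"]                                      # 4
        and all(m[x, "nahu"] for x in DOMAIN if m[x, "wajut"])       # 5
        and not m["Cindy", "nahu"]                                   # 6
        and any(m[x, "greeble"] and m[x, "wajut"] for x in DOMAIN)   # 7
    )
    if with_statement_8:
        ok = ok and sum(m[x, "blurgy"] for x in DOMAIN) <= 2         # 8
    return ok


def models(with_statement_8):
    """Enumerate all 2**16 truth assignments and yield those satisfying the premises."""
    cells = [(x, p) for x in DOMAIN for p in PREDICATES]
    for bits in product((False, True), repeat=len(cells)):
        m = dict(zip(cells, bits))
        if satisfies(m, with_statement_8):
            yield m


def analyse(with_statement_8):
    """Return (premises consistent?, John blurgy in every satisfying model?)."""
    found = list(models(with_statement_8))
    return bool(found), bool(found) and all(m["John", "blurgy"] for m in found)


if __name__ == "__main__":
    print(analyse(with_statement_8=False))  # (True, True)
    print(analyse(with_statement_8=True))   # (True, True)
```

                      Both calls report the premises as consistent, with John blurgy in every model. With statement 8 in force, the greeble required by statement 7 has to be John or Emily, which is objection (b) above.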

          • by gweihir ( 88907 )

            Yep, pretty much. It does help to have some background in automated reasoning or even some relevant philosophy. The sad fact of the matter is that an average person usually cannot distinguish between a correlation and an implication (that is two mistakes), and cannot really do rational reasoning at all. Apparently only 20% of all people can be convinced by rational argument. That the rest mistakenly believes AI is rational is hence no surprise, as it uses the same inferior mechanisms they themselves use. So

        • And yet, say "List the top ten best selling automobile models of all time, sorted alphabetically", it will have trouble doing so. Especially the alphabetizing part. Or at least it used to give bad answers that got more and more wrong the more you tried to specify precisely what was wanted. Of course, problem is easily solved by now training the AI on what alphabetizing means. But you can always come up with a new type of question that exposes the LLM based AI for being based upon pattern matching and co

          • by bn-7bc ( 909819 )
            Well, that is a complex question. Is a 1999 Vauxhall Astra to be counted as a Vauxhall, or folded into the 1999 Opel Astra numbers? Are we talking about most sold worldwide or in a specific market? Why would alphabetizing a list be hard for an AI? Well, there are of course edge cases (I pick a Nordic character here): Å, especially in Danish, can be written either AA or Å. The major hiccup here is that Å is at the end of the alphabet, and anything starting with AA (OK, I'm not sure that there are any Danish c
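            The Å/AA edge case raised above can be made concrete with a custom sort key. Plain code-point sorting puts Å before Æ and Ø and leaves names starting with "Aa" among the A's, while the Danish alphabet ends with Æ, Ø, Å. This is a sketch: it rewrites every "AA" as Å, which real collation implementations handle more carefully (for example at compound-word boundaries), and `danish_key` is a hypothetical name.

```python
# Assumed Danish letter order; characters outside it sort before all letters.
DANISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"
RANK = {letter: i for i, letter in enumerate(DANISH_ALPHABET)}


def danish_key(word):
    """Sort key placing Æ, Ø, Å after Z and treating the digraph "AA" as Å.

    Simplification: every "AA" is rewritten to Å, even where Danish
    collation would not do so.
    """
    folded = word.upper().replace("AA", "Å")
    # Letters get (1, rank); anything else (spaces, digits) gets (0, code point).
    return [(1, RANK[c]) if c in RANK else (0, ord(c)) for c in folded]


if __name__ == "__main__":
    towns = ["Aarhus", "Vejle", "Ærø", "Odense", "Østerbro"]
    print(sorted(towns))                  # ['Aarhus', 'Odense', 'Vejle', 'Ærø', 'Østerbro']
    print(sorted(towns, key=danish_key))  # ['Odense', 'Vejle', 'Ærø', 'Østerbro', 'Aarhus']
```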
            • Yes, because you're thinking about it. The first time I asked, ChatGPT just listed 10 popular models, with no sorting whatsoever. Later it gave a different list of ten, one of them using "Honda" as a model name. It just didn't understand what was being asked. Of course, this was first-gen ChatGPT, and it does better now.

              The early ChatGPT was amazing, so are the current ones. But what is amazing is NOT that "chat" part, the chat part is dumb as nails. The actual advance in LLM is in how it processes natura

              • by bn-7bc ( 909819 )
                Google might be terrible at search, but (yeah, this sounds strange) is Google's main mission to be good at search, or is it just to be a little less mediocre than the best-known alternatives, so it can still do its main job of collecting trends for Google's main business: pushing all the search-related ads it can at maximum speed?
          • by Rei ( 128717 )

            Asking an LLM a word problem is like asking a blind person about relative colours. They don't "see" text.
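            One way to picture the "they don't see text" point is a toy tokenizer. The sketch below uses greedy longest-match over a small invented vocabulary; production tokenizers such as byte-pair encoding learn their pieces from data and differ in detail, so treat this only as an illustration of the idea that a model receives piece IDs rather than letters. `PIECES`, `VOCAB` and `tokenize` are hypothetical names, and the sketch handles lowercase text only.

```python
import string

# Hypothetical toy vocabulary: a few multi-letter pieces, a space, and every
# lowercase letter as a fallback. Real tokenizers learn far larger vocabularies.
PIECES = ["straw", "berry", "ber", "st", "raw", " "] + list(string.ascii_lowercase)
VOCAB = {piece: i for i, piece in enumerate(PIECES)}
LONGEST = max(len(p) for p in PIECES)


def tokenize(text):
    """Greedy longest-match: at each position take the longest piece in VOCAB."""
    ids, i = [], 0
    while i < len(text):
        for size in range(min(LONGEST, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += size
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids


if __name__ == "__main__":
    print(tokenize("strawberry"))  # [0, 1]
```

            Here "strawberry" arrives as two IDs, and neither one directly encodes the three letter r's, which is one reason letter-level questions can trip up a model.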

          • by gweihir ( 88907 )

            Sophisticated pattern matching and correlations of course, it does a good job of fooling a lot of people into thinking there's "intelligence."

            Exactly. "Better crap" as one person put it. The thing is most people do not actually think rationally most of the time. They "think" in correlations and hence quite often arrive at flawed results as correlation is not enough to generate reasoning (can be mathematically proven, but to understand the proof you need actual reasoning ability). The fact of the matter is that only about 20% of all people are accessible to rational argument. Even fewer can come up with it. That means 80% cannot fact-check. And he

        • by Junta ( 36770 )

          The example given is an interesting example of taking in a natural language query and then doing a natural language 'join' on the two datasets queried. Some of those are... dubious, but overall you could imagine a mapping of this scenario to a database operation, which is novel to do with 'unstructured data' and has a lot of utility, but I think it's a stretch to consider that "synthesized" information.

          It has opened a whole world of possibilities on formerly near useless 'unstructured' data, and pretty muc

        • by gweihir ( 88907 )

          You people always misunderstand "cannot go beyond its training data". Obviously that includes correlations within the training data.

          • by gweihir ( 88907 )

            To clarify, what can be done with training data by an LLM is finding correlations in the training data. Finding correlations in a general setting is impressive, but does not require any reasoning ability, just statistics.

    • Because they've got a training set that students can't replicate. Train on a set of questions and likely answers. Maybe the answers and questions don't correlate, but generally having a LARGE sample of well-written answers means that the AI can just regurgitate the answers, shuffling them all together and mixing it up so that there's no direct match that can be pointed to as "copying" or plagiarism. Exactly the same as the student who copies off of a friend's old term paper but then tries to modify it j

      • by Rei ( 128717 )

        LLMs are not databases. They don't "shuffle together answers". They're latent-space transformers, built off fuzzy binary classifiers.

    • The LLMs have proven time and time again they can write "homeworky" essays on very well-trodden subject matter

      Only provided it is not a factual subject. The reports I've seen of when it is given simple first-year physics problems suggest it is only around 50% accurate; the rest of the time it hallucinates utterly wrong explanations. That might help some students scrape a lowest-grade pass, but that's about it so far.

      • by Junta ( 36770 )

        I think it can deal with factual material with respect to, for example, history.

        But yes, it seems to choke on science stuff if things get too particular.

        • I think it can deal with factual material with respect to, for example, history.

          History is not factual in an objective sense though. For example, try answering a simple question like "Who won the war of 1812?". In Canada and the UK we were taught we won it, while in the US they are taught it was a draw. So who is correct? There is evidence to support both points of view, so which is the factual answer? That's the problem with history: you can pretty much argue almost anything since the "facts" are subjective in a way that scientific facts are not. If an LLM says the UK won the war o

  • by Retired Chemist ( 5039029 ) on Wednesday June 26, 2024 @04:58PM (#64580749)
    I suspect that ability to write convincing nonsense based on the course material is what is required for a good grade. AI should be outstanding at that. I wonder what the results would have been in an area where the answers are more or less subjective or actually require thought.
    • by gweihir ( 88907 )

      Not when I do exams, but in far too many cases that is how it is. Teaching people to be dumb, but with good memory.

    • by hughJ ( 1343331 )
      Reminds me of the empty feeling I had whenever having to write an undergrad history or english paper. It always felt like a fraudulent exercise because the objective was to write something intelligent and insightful, but it was inherently being written by someone lacking the years of time and expertise needed to produce such a thing (unless by accident or by plagiarism.) Therefore the actual objective was really about *sounding* insightful, and LLMs are superhuman in that respect. The process was already
  • AI written essays, or student written essays?

    Or, more likely, the methodology of this "study."

    Can we get a report on who financed this study? Was it some AI company with stock to sell?

  • by hdyoung ( 5182939 ) on Wednesday June 26, 2024 @05:10PM (#64580779)
    Undergrad psychology is absolutely 99 or 100 percent memorization and rote application of rules and heuristics. I'm not dissing the field; undergrad psych is something that literally every college student should be required to take. Absolutely useful stuff. But it's nearly all just factoids and history. Absolutely zero surprise that an LLM trained on dozens of psych textbooks, hundreds of psych blogs and Wikipedia would ace that stuff.

    The solution is to have a completely controlled exam environment. EZ fix.
    • Disagree. I minored in psychology for a while. It's mostly useless BS and deeply flawed, especially "Abnormal Psychology." Of all the soft sciences, economics should be taken.
      • I agree that it’s deeply flawed, touchy-feely, and I personally prefer the harder sciences. I’m guessing you do as well. But without an undergrad psych class, a person will have little to zero knowledge of what happens in a human brain. Literally every single topic on a psych 101 syllabus is something that's important to know, even at an extremely thin level, if you want to navigate life well.

        Now, you can get the same knowledge watching a bunch of YouTube videos if you select the
    • My undergrad biology 101 class was like that. All memorization. And I had taken advanced bio in high school and already done all the memorization. However all the pre-med students warned me against taking that class - "it's the hardest class on campus, by far!" So I get an A+ in one of the easier classes I've taken. Of course, the pre-meds looked at me like I was a freak (I probably was but they wouldn't be able to work out a biological reason why).

  • It’s extremely hard to do it in an automated fashion. If the submission is way too verbose, it’s probably AI. If it’s verbose with one or two glaring errors, it’s definitely AI.

    I can’t recall an instance where someone sent me some bit of text and I couldn’t tell in a few sentences if it was “fake” or not, but that’s not in an academic setting. It’s been proofing resumes, potential emails to clients or text in deliverables like t
    • I can’t recall an instance where someone sent me some bit of text and I couldn’t tell in a few sentences if it was “fake” or not

      Honestly not sure what to say about this logical error.

      How do you know you've accurately detected all the LLM-generated text? Maybe there was some that was good enough that you didn't?

      • Me: “hey did you use ChatGPT for this?” Then: “yes”
        There’s absolutely nothing at stake here bc it’s either in an informal setting or for business on a business-controlled AI instance. The format and style of writing is *always* too verbose. Once you catch the pattern it’s painfully easy to spot.
        • There’s absolutely nothing at stake here bc it’s either in an informal setting or for business on a business-controlled AI instance. The format and style of writing is *always* too verbose. Once you catch the pattern it’s painfully easy to spot.

          "Answer in one line" / "Be pithy" / "Don't sound like an LLM answer" / "Answer in the manner of ____" etc all produce output that is not verbose.

          • "You've just been insulted by Oscar Wilde. Respond with a withering retort of your own that will deflect his wit and make him seem the fool instead. You have 60 seconds."

            • Hah, amusing. "Ah, Mr. Wilde, I see you've mistaken wit for wisdom. Do take care not to exhaust your brilliance all at once—some of us prefer to reserve our intellect for more meaningful pursuits."

              Surprisingly, the exact phrase "mistaken wit for wisdom" doesn't seem to exist on the Internet, at least according to Google.

              • Surprisingly, the exact phrase "mistaken wit for wisdom" doesn't seem to exist on the Internet, at least according to Google.

                Possibly because Google lacks both.

        • by Rei ( 128717 )

          You should ask ChatGPT to clean up your posts ;)

          Seriously, though, not only is the answer style purely a product of the finetune, which will be different from company to company (since they all make their own finetune datasets), but you can request an LLM to assume whatever voice you want. For example, I presented ChatGPT with:

          1. Gavin’s parents took him to his favorite science museum, and he explored all of the exhibits. One of the
          interactive exhibits featured glass marbles. He grabbed a large marbl

      • Part of the problem is you as the underpaid grader for the course have 50 papers to read, in between your actual research. Thus the necessary attention spent to each paper is not very good. It may start off with earnestness, but near the end it gets rushed. Thus, easy for an LLM written paper to slip through.

          • Absolutely. Plus, the smart(er) students who use LLMs are using them for drafting, or ideas, or phrasing, or outlining, and then rewriting in their own words, making edits, etc. The dumb students are the ones who are copying and pasting and doing nothing more.

          Additionally, grading papers is HARD and it's very hard to grade the first paper you look at with exactly the same standards as the 20th or the 50th or the 200th.

          One short story author I spoke with said she is occasionally using some

    • I can't recall an instance where someone sent me some bit of text and I couldn't tell in a few sentences if it was 'fake' or not

      This is hilariously illogical

  • I'm not surprised that using AI to cheat got higher grades.

  • AI has not yet learned how to avoid narcissism.
  • by Rosco P. Coltrane ( 209368 ) on Wednesday June 26, 2024 @05:20PM (#64580803)

    The correct answers to the exam's questions aren't the truth but what the examiner wants to hear. This is especially true with term papers.

    This is a lesson that has served me well throughout my career: whenever my manager or my boss asks me something, I always remember that they not only expect an answer, they expect a certain type of answer, and they expect the answer to be delivered in a certain way.

    Here I guess the students of today have learned to give the correct answers to get what they want out of the machines. I suppose this will be a useful skill when they work for machines after their studies (assuming there's any work left), and it's probably already useful for finding a job in the first place, since most resumes are processed by machines nowadays. But most management is still human, and this might not be the right skillset for navigating its particular set of quirks.

  • Students are still learning. They get taught the basis for understanding things. And they are not encyclopedias. Hence what you ask in an exam is 1) facts, 2) simple consequences of the facts and 3) more advanced consequences. The first two items you can look up, and hence LLMs do very well on them. The third will be pretty important in the student's life, but LLMs cannot do it. So, while LLMs often do better on exams, that actually means nothing.

    • Students are still learning. They get taught the basis for understanding things. And they are not encyclopedias. Hence what you ask in an exam is 1) facts, 2) simple consequences of the facts and 3) more advanced consequences. The first two items you can look up, and hence LLMs do very well on them. The third will be pretty important in the student's life, but LLMs cannot do it. So, while LLMs often do better on exams, that actually means nothing.

      Actually, essays should not be a regurgitation of facts, be they original events or consequences. They should present an argument and endeavor to persuade the reader. Essays, at least. Hence learning to write an essay involves learning how to present a thesis, a serious or arguments supporting or disproving the thesis, and a conclusion wrapping it all up and summarizing the correctness or failure of the thesis. In short, writing is often about persuading the reader, not simply regurgitating things.

      • by drnb ( 2434720 )
        Apologies for the poor proofreading. "serious or arguments" should be "series of arguments"
        • by gweihir ( 88907 )

          Yes, "should". But do LLMs really do that? Remember that on any topic an LLM can say anything about, it will have seen material, and that material will often be essays.

    • Copy paste is easy. Ask them to explain it in their own words, then you'll see how smart they are.
  • by henrik stigell ( 6146516 ) on Wednesday June 26, 2024 @05:38PM (#64580849) Homepage

    Replication crisis is a very psychology thing: https://www.psychologytoday.co... [psychologytoday.com]

    Answering non-scientific exam questions must be a dream scenario for an AI. Nothing is wrong, you just have to write something that SOUNDS coherent. No need to actually BE coherent since it is not science!

  • If you don't have anything to say or are unsure what the answer is, spew a lot of BS and hope for part marks. AI just takes that to industrial strength.
  • Let us lower the bar in tribute.
  • Assign the students subjects on recent events at their school or recent local events. LLMs take a while to train, and they are not up to date on current, lightly covered local issues.
  • by Tony Isaac ( 1301187 ) on Wednesday June 26, 2024 @09:33PM (#64581229) Homepage

    LLMs are systems that ingest large amounts of data, and then summarize it based on specific prompts. You know, like test questions. This is literally what LLMs do. So why wouldn't they be better than human test takers?

    • So why wouldn't they be better than human test takers?

      ...because LLMs only select words based on what sounds good coming next. This may be fine for arts and other subjective subjects where objective facts do not get in the way but science subjects are based on objective facts and concepts that LLMs know nothing about and so often get wrong. When given simple first year physics problems LLMs can barely scrape 50% at the moment thanks to their hallucinations.
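The "pick whatever sounds good next" idea can be illustrated with a deliberately crude toy. This is a hypothetical sketch, not how production LLMs work: real models score subword tokens with a trained neural network over the whole context, and they often sample rather than always taking the top choice. The names `train_bigrams` and `greedy_continue` and the three-sentence corpus are made up for the example. The model only counts which word followed which and greedily appends the most frequent follower, so it repeats the most common pattern ("lead sinks") with no notion of buoyancy.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count how often each word follows each other word (toy model, not a real LLM)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def greedy_continue(counts, start, max_words=5):
    """Repeatedly append the single most frequent follower of the last word."""
    words = [start]
    for _ in range(max_words):
        followers = counts.get(words[-1])  # .get avoids creating empty entries
        if not followers:
            break  # no word ever followed this one in the corpus
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical corpus: "sinks" is simply the more frequent continuation of "lead".
corpus = [
    "lead sinks in water",
    "lead sinks in water quickly",
    "lead floats in mercury",
]
model = train_bigrams(corpus)
print(greedy_continue(model, "lead"))  # prints: lead sinks in water quickly
```

Note that even after "floats", the toy drifts to "in water", because that continuation was seen more often than "in mercury".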

      • LLMs only select words based on what sounds good coming next

        I don't think that's quite right. LLMs don't actually know what "sounds" good. Rather, they are really good at *summarizing*.

        In visual terms, it's like image processing software, where it's given a photo that has a black spot on it from a dirty camera lens. Photoshop can "fix" the spot by essentially guessing what should be behind the black spot. It looks at all the pixels around it, and based on patterns it has observed in other photos that don't have black spots, it can infer what should be there.

        In LLMs

        • It can supply the answer based on what it infers from patterns in documents it has processed.

          That's exactly my point though: effectively it picks the highest ranked word or phrase based on word patterns not on any understanding of what those words are saying. It does not need to be given factually incorrect training data to hallucinate. For example, I suspect that most mentions of lead and buoyancy will be about lead sinking in water so when asked what will happen when lead is placed in mercury you may well get told that it sinks because the LLM has zero clue about how to figure out whether any gi

            • So, I put your question to ChatGPT. Its answers suggest that it is significantly more capable at finding correct patterns than you believe.

            Me: What happens when you place a piece of lead in a flask of mercury?

            GPT: When you place a piece of lead into a flask of mercury, the lead will form an amalgam with the mercury. An amalgam is an alloy in which mercury is mixed with another metal, in this case, lead. This process does not involve a chemical reaction that produces new compounds; rather, it’s

            • No, you put the question to ChatGPT that I said it probably could find the answer to because it is a common question that was probably in its training set. Also it has provided contradictory answers: the first answer says that it dissolves the second that it floats. The correct answer is that it floats, to form the amalgam you need lead filings, not pieces.
              • It does not need to be given factually incorrect training data to hallucinate

                I don't think you've demonstrated this.

                In the case of this specific example, "filings" do fit the definition of "pieces" (which doesn't specify an upper or lower bound in size).

                I certainly wouldn't argue that LLMs always get it right, they certainly do not. But they are very good at summarizing what they find.

  • Big deal. My parents can buy me higher grades than any AI could hope to score.

  • As a CS prof I have graded quite a few essays in the last year, many of them AI influenced. It's true that generic ChatGPT has a recognizable writing style, but judging human from machine can be quite hard. This is because of two reasons.

    1) Sophisticated prompts on the part of cheaters
    2) Off-brand AI

    Just as you can recognize generic ChatGPT verbiage, smart students can as well. They tell the machine to remove the adverbs, bullet points, ornamental language, and to put things more plainly, etc. But I think I

  • "AI" is not intelligent. It is good at regurgitating random stuff it finds on the internet. It doesn't "understand" what it is saying and is often wrong.
    Doesn't surprise me that it does better than the average student on exams where the goal is to regurgitate text.

  • Well, what did you expect, when the submissions are graded by AI?
