'How Many AIs Does It Take To Read a PDF?' (theverge.com)

Despite AI's progress in building complex software, the ubiquitous PDF remains something of a grand challenge -- a format Adobe developed in the early 1990s to preserve the precise visual appearance of documents. PDFs consist of character codes, coordinates, and rendering instructions rather than logically ordered text, and even state-of-the-art models asked to extract information from them will summarize instead, confuse footnotes with body text, or outright hallucinate contents, The Verge writes.
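
To make the point about coordinates concrete, here is a minimal sketch of what "not logically ordered" means in practice. It assumes text has already been pulled out of a page as positioned fragments; the `Fragment` type, the `reading_order` function, and the sample coordinates are all hypothetical, and the line-bucketing is deliberately naive (no columns, no rotation, no footnotes).

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """A run of text plus the position where the page places it (hypothetical type)."""
    x: float  # horizontal offset from the left edge
    y: float  # vertical offset from the bottom edge (bottom-left origin assumed)
    text: str


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Group fragments into lines by y, then sort each line left to right."""
    lines: dict[int, list[Fragment]] = defaultdict(list)
    for frag in fragments:
        # Bucket nearby baselines together so small jitter stays on one line.
        lines[round(frag.y / line_tolerance)].append(frag)
    ordered = []
    # Larger y is higher on the page, so walk the buckets from top to bottom.
    for key in sorted(lines, reverse=True):
        row = sorted(lines[key], key=lambda f: f.x)
        ordered.append(" ".join(f.text for f in row))
    return ordered


# Fragments in an arbitrary stored order, not the order a person would read them.
stored = [
    Fragment(120, 700, "world"),
    Fragment(72, 680, "second"),
    Fragment(72, 700, "hello"),
    Fragment(130, 680, "line"),
]
print(reading_order(stored))  # ['hello world', 'second line']
```

Even this toy version has to guess at a line tolerance, which hints at why multi-column pages, tables, and footnotes are much harder to put back in order.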

Companies like Reducto are now tackling the problem by segmenting pages into components -- headers, tables, charts -- before routing each to specialized parsing models, an approach borrowed from computer vision techniques used in self-driving vehicles. Researchers at Hugging Face recently found roughly 1.3 billion PDFs sitting in Common Crawl alone, and the Allen Institute for AI has noted that PDFs could provide trillions of novel, high-quality training tokens from government reports, textbooks, and academic papers -- the kind of data AI developers are increasingly desperate for.
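
The segment-then-route idea can be sketched in a few lines. This is only an illustration of the dispatch pattern, not Reducto's actual pipeline: the `Region` record, the parser functions, and the `PARSERS` table are hypothetical, and the "parsers" are trivial stand-ins for what would be separate models.

```python
from typing import Callable

# Hypothetical region record produced by a layout-segmentation step:
# a component label plus the raw content found in that part of the page.
Region = tuple[str, str]


def parse_header(raw: str) -> dict:
    return {"kind": "header", "text": raw.strip()}


def parse_table(raw: str) -> dict:
    # Stand-in for a table model: split rows on newlines and cells on "|".
    rows = [[cell.strip() for cell in line.split("|")] for line in raw.splitlines()]
    return {"kind": "table", "rows": rows}


def parse_unknown(raw: str) -> dict:
    # Anything without a dedicated parser is passed through untouched.
    return {"kind": "unknown", "text": raw}


PARSERS: dict[str, Callable[[str], dict]] = {
    "header": parse_header,
    "table": parse_table,
}


def route(regions: list[Region]) -> list[dict]:
    """Send each region to the parser registered for its label."""
    return [PARSERS.get(kind, parse_unknown)(raw) for kind, raw in regions]


page = [("header", "  Q3 Report "), ("table", "a | b\nc | d"), ("chart", "<pixels>")]
print(route(page))
```

The fallback parser matters here: a region the segmenter labels with something unexpected is kept rather than silently dropped.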

  • you need to pay adobe $2.99/mo for AI access to pdf

    • I had this problem like 2 years ago, and assembled my own PDF > jpeg > OCR text. Works wonders. Doesn't do charts clearly because it won't detect a 50% / 50% pie chart, but certainly did tables and stuff, and if there were annotations or text describing it under the chart or image, it would capture that too.
      • I would think with a good AI you could go pdf->jpg and let the AI look at the rendered page.

        • by PPH ( 736903 )

          Yeah. 25 years ago. This was our approach for MSWord document assimilation.

          .doc -> .pdf -> OCR (with some bells and whistles)

          • by narcc ( 412956 )

            ... That is certainly a choice you could have made. I don't know why anyone would make that particular choice, given the tools that came with a typical Word installation at the time.

            I'm guessing you were already using OLE automation to create the PDF (print to Acrobat Distiller?), so why not just use that to extract the text instead? Just a few lines in VB or VBA is all you needed, less code than I guarantee it took you to create the PDF and run it through OCR.

            Why do so many developers go out of their way

        • You can.

          The real problem is that the LLM is more likely to, as the summary notes, summarize the content. Getting LLMs to dictate spatially structured text reliably is a bit trickier.
        • You would think that with an AI, it wouldn't be necessary to render as jpg. But then, there is no such thing as AI.

          There ought to be more information in the PDF than in a JPG. Transforming to a lossy format is, well, losing information. I understand that all the training is with images and not PDFs, so with current training, it is likely better to convert.

          • by taustin ( 171655 )

            There ought to be more information in the PDF than in a JPG.

            Therein, I think, lies the problem. There's too much information, more than the AI has any use for, but it tries to make use of it anyway.

          • I want you to open a PDF in notepad and describe a rendering of it for me. For the life of me, I can't figure out why you use your eyeballs when the bytes are right there for your consumption.
  • by ArchieBunker ( 132337 ) on Monday February 23, 2026 @02:58PM (#66005982)

    The only use I can see for AI is improving optical character recognition. I don't care one bit about some garbage summarize feature.

    • by allo ( 1728082 )

      AI did this 20 years ago. Google for MNIST. Today they recognize charts and whatever graphics.

      • by narcc ( 412956 )

        The MNIST dataset is a collection of labeled images of handwritten numbers. It's one of the "standard" datasets used by AI researchers and students. (If you're a student or in the field, you've heard about it and very likely used it.) It's been around a lot longer than 20 years, though it has been updated.

        AI is a very broad term. "They" are not some monolithic thing, nor is there some natural hierarchy. The article is talking about LLMs. That we've trained one kind of model for OCR using the MNIST dataset

        • by allo ( 1728082 )

          Years ago people stated that OCR by AI (CNN, sometimes RNN) can recognize text that humans are unable to decipher. VLM are a new concept and may or may not improve on OCR (using LLM for context is a nice addition), but overall AI has a lot of solutions for OCR and all modern OCR tools you may be using build on that.

    • A few years ago, I worked for a company that retrieved records for court cases. They needed to extract pertinent data, such as people's names and other information pertinent to the legal case. At the time, they were integrating earlier AI technologies to try to extract such information automatically, but it fell flat on its face. In any case, this would be another valid use case, if they can get it to work.

  • Are AIs even capable of reading PDFs? Or do they just regurgitate what people who have read them say about them?

    • Yes, they can, but they can't do it well. As another example to those in the original link, I asked Google Gemini to compare two PDFs to find differences. The PDFs in question were commuter train schedules with different effective dates. The PDFs had tables with stations and times. Some trains were express trains (skipping stops) and some made all stops. I asked, "The attached are railroad schedules for the same train line during different time periods. Summarize the differences between them" followed by "A
      • by caseih ( 160668 )

        My luck with Google Notebook LM and pdfs is incredibly good. At least if you want to be able to summarize and lookup information in a pdf. It seems able to understand tables and everything. Not sure why Gemini struggles when Notebook LM has few problems.

        • His experience is common for LLMs. I suspect Notebook LM almost certainly has the same issue- using a PDF as a source of context for an LLM works very well, and reliably.

          Asking it to directly dictate the text within it can get a little trickier. Often for the same reason a person would hang up on it.
          "What about this footnote here? How do I represent the text under the charts?"

          People are expecting the LLM to solve a problem without actually knowing what the problem is.
          • by narcc ( 412956 )

            Often for the same reason a person would hang up on it.

            This is delusional thinking.

            • It's not thinking at all, it's an observation.

              You hand a PDF with a spatially complex layout to some random desk worker with no idea what the context of the request is, and give them no way to find that out- and you get their best guess as to what you want. It is nearly universally wrong.

              Your post was predictably stupid.
              • by narcc ( 412956 )

                Seeing what you want to see is not "observation".

                Don't waste my time with your nonsense.

                • Ya, get the fuck out of here, you fucking poser.
                  You try to present yourself as some kind of authority on the topic, but your posting history is a laughable history of you learning the meaning of the words you use far after you try to use them.

                  Like when you told us all about how ChatGPT was an RNN, using thousands of words expositing to us the limits of RNNs, and that meant it could be Turing Complete, but then, upon learning what a Transformer was (guess that meaning of that T eluded you), and that the n
                  • by narcc ( 412956 )

                    Cry harder, troll. No one cares about your bullshit.

                    • Did that look like crying, you insecure child? Is that how you help yourself cope with the reality of your inadequacy?
                      Pretty fucking humorous coming from someone accusing anyone else of delusion.

                      Let's do a quick "best hits".
                      Miraculous, I say. There's so much stupid in this, I don't know where to start. [slashdot.org]
                      The fuck? Do you not speak a single word at time? ChatGP...erm...RNN?... [slashdot.org]

                      So much expository bullshit.
                      You're fucking weak.
                    • by narcc ( 412956 )

                      Yep, I said something stupid years ago. You got me. I completely missed the transformer revolution as I wasn't working in the field at the time. I've since published in the field, a thing I can do because I have an actual education, unlike you.

                      I seriously doubt you want to play the stupid post game. As you know, because you're bizarrely obsessed with me, that isn't going to end well for you. You've proven time and again that you don't have even a basic understanding of, well, anything related to AI.

                    • I've since published in the field, a thing I can do because I have an actual education, unlike you.

                      Na, you haven't, lol.

                      I seriously doubt you want to play the stupid post game. As you know, because you're bizarrely obsessed with me, that isn't going to end well for you.

                      Oh, lord knows I've said my share of stupid shit. But unlike you- I don't lie about what I do, and what I'm proficient in.
                      But ya, let's dance motherfucker.

                      Like I said, cry harder little troll. No one cares about your bullshit. Pathetic.

                      Bullshit artist calls person troll. Seismic fucking impact, right there.

                      Would you like a list of the dozen times you've told people they should stop arguing with you because of your "formal background in data science and machine learning"?
                      Would you then like a more comprehensive list of "stupid shit you said back when you didn't k

                    • by narcc ( 412956 )

                      Do you know why you have this weird and creepy obsession with me? It's easy enough to explain. You wish you were me. You wish you had my education and my accomplishments. You wish you could actually do the math, but you don't have the discipline or the intellect, just an over-inflated sense of your own capability. So you pretend that all that complicated stuff doesn't matter because you "understand the concepts", even when it's obvious to everyone else that you don't.

                      My guess is that you actually belie

                    • What's that [slashdot.org] you say? [slashdot.org]
                      Wish I was you?
                      A completely full-of-shit bullshit artist?
                      Look at that! From not knowing what a fucking transformer is 4 years after it changed the fucking landscape of AI, to "I'm also published" [slashdot.org]
                      No, bud.
                      You're a fucking loser. Act accordingly.
                    • by narcc ( 412956 )

                      The funniest part about your little tantrum is that while you don't know who I am (because psychos like you exist) you definitely know about something I've done.

                      Like I said. Cry harder, little troll. Pretending I'm not what I very obviously am won't make your life any better.

                      Also, I'm old. Four years is nothing. (You'll find that out someday if you somehow manage to avoid drowning yourself in the shower.) I wasn't doing any work in AI after 2017, so yes, I missed it. It happens. That doesn't change t

                    • You're obviously old. You're also very obviously sensitive over the fact that you can't effectively gatekeep anymore.

                      The thing is, you are a liar.
                      You will say whatever you need to say to soothe your inferiority complex.
                      You went from saying clearly and demonstrably wrong things with strong emotion and claims of authority, to claiming you're "also published" (as soon as someone else claimed it).

                      You're just a sad little fucker trying to project authority where you have none.

                      Your posts all demonstrate
                    • by narcc ( 412956 )

                      Oh, you poor deluded little troll. You can believe whatever nonsense makes you feel better. There are, however, a few facts that you can't change that clearly infuriate you. Let me point them out for you:

                      1. You used to admire me, until ...
                      2. ... I embarrassed you when I exposed your deep ignorance and lack of proper education.
                      3. I was only able to do that because, unlike you, I have an actual education.
                      4. I'm the same person I was when I was your hero.
                      5. You're still the same sad little troll you were th

                    • 1. You used to admire me, until ...

                      Negative. I found your proclivity for being unable to accept that you were wrong obnoxious from the start.

                      2. ... I embarrassed you when I exposed your deep ignorance and lack of proper education.

                      Incorrect.
                      I will say, you did help me become less ignorant on one or two things over time, but mostly in the process of finding out how you were wrong.

                      3. I was only able to do that because, unlike you, I have an actual education.

                      Doubtful. Confidence is quiet. You're not.
                      The better explanation for you is that your education is dated, if existing at all, and you know it.

                      4. I'm the same person I was when I was your hero.

                      This is some kind of bizarre fantasy you have. It may explain your emotional reactions to anyone challenging yo

                    • by narcc ( 412956 )

                      Oops! You've contradicted yourself. Too funny!

                      Did you know that I have immense power over you? It's true. Look how worked up you get. That's because of your bizarre obsession with me. Maybe you're a masochist?

                      I have never met one who argued for years from an incorrect foundation without once bothering to actually educate themselves, which I have demonstrated that you have done.

                      You're delusional. You found one mistake on my part that was 1) not foundational in any way and 2) was later corrected. You, on the other hand, have never once admitted error, even though the reason you're so obsessed with me is that I pointed out all the nonsense in just one of your posts. Pathe

                    • Oops! You've contradicted yourself. Too funny!

                      Doubt it.

                      Did you know that I have immense power over you? It's true. Look how worked up you get. That's because of your bizarre obsession with me. Maybe you're a masochist?

                      Worked up? This is clinical, not emotional.
                      You're projecting.

                      You're delusional. You found one mistake on my part that was 1) not foundational in any way and 2) was later corrected. You, on the other hand, have never once admitted error, even though the reason you're so obsessed with me is that I pointed out all the nonsense in just one of your posts. Pathetic.

                      It was never about the mistake, you fucking idiot.
                      It was a demonstration that you will argue precisely as you are now from a position of ignorance, including the claims of authority and the accusations of lack of education and - still bizarrely - crying.
                      It was a demonstration that your authority is no authority at all.
                      The second set was to demonstrate that even your supposed principles are upon the sacrificial altar of your pathologic

                    • by narcc ( 412956 )

                      This is clinical, not emotional.

                      Silly little troll. Your posts and bizarre obsession with me strongly suggest otherwise.

                      Your claim of being published was a mere 8 months

                      Sigh... While your ego depends on denying reality, I really am qualified. Yes, I missed the transformer revolution as I hadn't done any work in AI since ~2017 and was (foolishly) basing my comments then on what I expected the state of the field to be at that time. I had some catching up to do, sure, but it was hardly an impossible feat! The only thing I published in 2024 was indeed in AI, coauthored with a friend of

                    • Silly little troll. Your posts and bizarre obsession with me strongly suggest otherwise.

                      This argument just doesn't hold water.
                      I think in your non-computer oriented mind, you imagine I actually downloaded, printed, and then went over every post you've ever made or something.
                      I'm a software engineer. Certainly you can understand that just because you can't think of a better way to do something, that a better way to do something doesn't exist... Actually scratch that- I literally demonstrated that you do precisely that.

                      Sigh... While your ego depends on denying reality, I really am qualified. Yes, I missed the transformer revolution as I hadn't done any work in AI since ~2017 and was (foolishly) basing my comments then on what I expected the state of the field to be at that time. I had some catching up to do, sure, but it was hardly an impossible feat! The only thing I published in 2024 was indeed in AI, coauthored with a friend of mine, a sociologist, who reached out to me because of my background. I know that hurts your feelings, but you'll get over it.

                      You claimed to be published in 2023 in "machine learning and data science", 8

                    • by narcc ( 412956 )

                      You claimed to be published in 2023

                      Okay? That's when it was accepted. It was printed/published in 2024. (I checked.) I also don't know what your "8 months" claim has to do with anything? Do you think that every paper takes years to produce? If you're tenure track, you're expected to publish 2-4 times a year. During active research, people/teams/groups can publish considerably more. I'd say that's just one more thing you don't know anything about, but we knew that about you already. Your ignorance knows no bounds!

                      If you believe that chain of argumentation makes your point stronger

                      You're pretty stupid,

                    • Lol. Bullshit artist emits bullshit. So surprising.

                      Imagine being so desperate to make people believe you were something you're not. It really is pathetic, man. You got a BTC I can donate some money for your therapy at?
                    • by narcc ( 412956 )

                      Couldn't find anything, eh? No surprise there at all.

                      Here's hoping that healthy dose of reality will help you get over your delusions and bizarre obsession with me.

                      Get well soon!

  • by TomClancy_Jack ( 638962 ) on Monday February 23, 2026 @03:53PM (#66006094)

    This has been a pet-peeve of mine for years with resumes. I've literally reached out to Adobe Acrobat product managers on LinkedIn to try to get them to listen.

    Standard HR systems like Workday are HORRIBLE at ingesting and reading resumes from PDFs. Thus why you have to not only upload a PDF but also painstakingly enter standardized fields. And you end up making the resume super ugly to make it as readable as possible. The market for job seekers is HUGE. Millions of people submitting to jobs all the time.

    Adobe could fix this with a super low-tech answer. They could even charge a subscription. It would require setting an industry-wide resume data standard - but if anyone can do it, it's Adobe and their marketshare. All they have to do is create a back-end metadata entry specifically for resumes that is hidden from humans but readable by machines. Standard fields like Job 1 Title, Job 1 Company, Job 1 Description, Job 1 dates. People fill this out once and are done.

    The other upside to this is it would free humans to make resumes that are easier to read for humans on the other end. No ORC required.

    I will die on this hill haha.
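
The hidden, machine-readable resume fields proposed above could look something like the following. This is a sketch only: no such standard exists as far as this thread says, the `Job` and `ResumeMetadata` names and the JSON layout are made up for illustration, and how the payload would be embedded in a PDF is left out entirely.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Job:
    # Mirrors the fields named in the comment: title, company, description, dates.
    title: str
    company: str
    description: str
    dates: str


@dataclass
class ResumeMetadata:
    """Hypothetical payload a PDF tool could store alongside the visible resume."""
    jobs: list[Job] = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys keeps the output stable so two tools produce identical text.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ResumeMetadata":
        data = json.loads(payload)
        return cls(jobs=[Job(**job) for job in data["jobs"]])


resume = ResumeMetadata(
    jobs=[Job("Engineer", "Example Corp", "Built things.", "2020-2024")]
)
payload = resume.to_json()
print(payload)
```

An applicant system would then read the payload back with `ResumeMetadata.from_json(payload)` instead of guessing at the layout, which is the "fill this out once" part of the proposal.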

    • Re:Resumes (Score:4, Funny)

      by molarmass192 ( 608071 ) on Monday February 23, 2026 @04:09PM (#66006134) Homepage Journal

      I know it's a typo, but I like ORC much more than OCR. Can we re-arrange the words Optical Character Recognition as Optical Recognition of Characters with a silent "of"? Wait a sec, hold that thought, with a silent "by" we can do ORCA, Optical Recognition of Characters by AI! No? Anyone? Bueller? Bueller? Bueller?

  • by Big Hairy Gorilla ( 9839972 ) on Monday February 23, 2026 @03:56PM (#66006104)
    Right question: Can "AI" convert a pdf into a file editable by MS Word <shudder>?... or Libreoffice Writer ?
    Face it, it's the thing everyone wants to do.
    If it could, it would finally be a valuable use case for "AI".

    Forget about a cure for cancer. Being able to convert and edit a pdf, without errors, is truly the Holy Grail of "AI".
  • by Anonymous Coward

    A majority of the time that you have a PDF, you don't want precise visual formatting. As long as someone is looking at it on a screen, precise format preservation is generally a bad thing, which makes your system inferior to competitors.

    Indeed, the only times I've seen preservation of precise formatting actually be a good thing, is when the document in question is exclusively intended to be physically printed, on paper.

    But that doesn't change that tools which try to autocomplete sentences in a believable wa

  • PDF is complicated and made for display and print and not for parsing. Respect to everyone who implements a useful pdf to text tool.
    That said, modern AI tend to use screenshots of rendered PDF, probably for exactly that reason. It's probably easier just to render it with a headless libpoppler or whatever and then OCR it than parsing the mess directly.
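
A rough sketch of that render-then-OCR route, assuming Poppler's `pdftoppm` and the `tesseract` command-line tools are installed. The function names are hypothetical, and nothing here reflects how any particular AI product does it.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """Command line for pdftoppm: one PNG per page at the given resolution."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def ocr_cmd(image: str) -> list[str]:
    """Command line for tesseract, writing recognized text to standard output."""
    return ["tesseract", image, "stdout"]


def page_number(path: Path) -> int:
    # pdftoppm names its output "<prefix>-<page>.png"; sort on the numeric part.
    return int(path.stem.rsplit("-", 1)[1])


def pdf_to_text_via_ocr(pdf: str, dpi: int = 300) -> str:
    """Render every page to PNG, OCR each one, and join the results."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(render_cmd(pdf, str(Path(tmp) / "page"), dpi), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        texts = [
            subprocess.run(
                ocr_cmd(str(p)), check=True, capture_output=True, text=True
            ).stdout
            for p in pages
        ]
    return "\n".join(texts)
```

Calling `pdf_to_text_via_ocr("report.pdf")` would return the OCR text page by page. Charts and figures come back as whatever stray text the OCR engine finds in them, which matches the complaints elsewhere in the thread.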

    So is this article a Reductio ad or what?

  • I mean isn't the plain text right there inside the file along with all the markup? Should be one of the easier formats for AI to parse.
    • by znrt ( 2424692 ) on Monday February 23, 2026 @04:48PM (#66006230)

      if you ever have to work with pdf beyond merely staring at one you'll realize to what extent that format is an absolute disgrace. there are a zillion tools out there to manage pdf in a zillion ways and not a single one of them gets pdf parsing and layout right 100% of the time, not even adobe's. the only thing pdf had going for it was that it wasn't msword, and that's why it spread like a virus, but we could really benefit from having a proper truly portable (and universally adopted) document format even at the cost of a reduced functionality or scope.

      but the article isn't about that, like the post above yours says llms probably ignore the format altogether and just ocr-parse a printout. what this is about (apart from being clickbait to take you to the verge) is llms having problems in processing text in general exactly the way we would want them to, which is old news. the underlying cause is that llms simply predict text and lack real understanding, which is why their output often looks surprisingly good but is often not quite ok or completely wrong. this is the limit of what their statistical model can offer, now we're trying to furbish them with other supplementing techniques ... with only discrete success so far.

      • Mod parent up. Mod parent up.

        PDF sucks as a format. Problems too many to list. My pet peeve, as soon as paper size changes (e.g. Letter to A4), you've changed the document to fit the media. Invalidating any thought about preserving integrity.

      • PDF is just a read only universal format, the issue is real corporations simply use PDF to share presentations, has nothing to do with the format itself. If a PDF is basically a picture of a presentation, presentations use graphs (abstract graphs sometimes), abstract pictures/shapes along with text to describe a story or message. You actually have to be human to put it all together to understand 1 page. A picture is worth a million words. Try having AI read through a comic book and see how well it can s
      • TBH any document format is going to have the images and graphics and graphics as text inline of the document issue for AI to ingest it. Someone could just as easily stick that graphic, graph or block of text as an image into a word doc or an openoffice doc. For all of these AI would likely need to render the document to an image and then just ingest it all using OCR. Then there's probably even questions about things like graphs and charts. How does AI ingest this, as a raw bitmap to just be redisplayed, or
  • Why not just render the PDF as an image and then process the image just like AI already can do?

    I don't see the challenge, here.

    • Why not just render the PDF as an image and then process the image just like AI already can do?

      I don't see the challenge, here.

      My thoughts exactly. I routinely paste images into ChatGPT that it parses quite well, at least in terms of text they contain. For things like error messages that pop-up on a computer, I don't even type any context and get (mostly) useful interpretations.

  • Lately, many of the sites I use for things are down a lot more than usual. And the outages are far longer than in the past. I suspect too much vibe coding is the cause. Is anyone else seeing what I am seeing? What do others think?
    • The AI gold rush means hardware is expensive, which means cloud compute is expensive, so I'd imagine it is more that service providers are scaling down their costs by paying for less premium tiers of cloud infrastructure. You can see this in less and less previously free cloud functionality from apps and SAAS remaining free of a subscription.
  • Can regurgitate info, but can't get to context very well, if at all. I do wonder if it can ever catch up with what's known.
  • ... and do it well.

    To do a complex thing, string together multiple simple things.

  • by devslash0 ( 4203435 ) on Monday February 23, 2026 @07:07PM (#66006522)

    The only reliable way to parse a PDF is to flatten it to an image and parse using OCR. Anything else has been proven over and over again to put you in an asylum. It's also what most PDF parsing libraries do under the hood.

  • The problem with understanding pdfs is that it comes down to understanding how they look, and that's a computer vision problem, not a language problem solvable by LLMs. Vision is not as magical as LLMs yet.

    • by diz ( 10034 )

      This has been a solved problem for about 30 years, inside the opensource "xpdf" package secretly lives a standalone tool called "pdftotext"

      the use is simple:

      pdftotext -layout somefile.pdf

      enjoy reading somefile.txt as any other plaintext file. The -layout option is required because there is no requirement in pdf for characters to render in the order they will be read on the page. I personally generate pdf files from bottom to top in postscript because I can make the postscript file look like plaintext at
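
For anyone who wants to call the command above from a script, here is a small wrapper sketch. The function names are hypothetical; it relies on `pdftotext` accepting `-` as the output file name to write to standard output, which is the documented behavior in the xpdf and Poppler versions I am aware of.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf: str, layout: bool = True) -> list[str]:
    """Build the pdftotext command; "-" as the output name sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout, as in the command above
    return cmd + [pdf, "-"]


def extract_text(pdf: str, layout: bool = True) -> str:
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    result = subprocess.run(
        pdftotext_cmd(pdf, layout), check=True, capture_output=True, text=True
    )
    return result.stdout
```

`extract_text("somefile.pdf")` returns the same text the command would have written to somefile.txt. This only helps when the PDF actually contains text; a scanned page is just an image and still needs OCR.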
