'How Many AIs Does It Take To Read a PDF?' (theverge.com) 61
Despite AI's progress in building complex software, the ubiquitous PDF remains something of a grand challenge -- a format Adobe developed in the early 1990s to preserve the precise visual appearance of documents. PDFs consist of character codes, coordinates, and rendering instructions rather than logically ordered text, and even state-of-the-art models asked to extract information from them will summarize instead, confuse footnotes with body text, or outright hallucinate contents, The Verge writes.
Companies like Reducto are now tackling the problem by segmenting pages into components -- headers, tables, charts -- before routing each to specialized parsing models, an approach borrowed from computer vision techniques used in self-driving vehicles. Researchers at Hugging Face recently found roughly 1.3 billion PDFs sitting in Common Crawl alone, and the Allen Institute for AI has noted that PDFs could provide trillions of novel, high-quality training tokens from government reports, textbooks, and academic papers -- the kind of data AI developers are increasingly desperate for.
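The segment-then-route idea in the summary can be illustrated with a small sketch. This is not Reducto's implementation, which the article does not detail; the `Region` type, the parser functions, and the `PARSERS` table are all hypothetical names, and the "parsers" are toy stand-ins for the specialized models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One segmented piece of a page. All fields here are hypothetical."""
    kind: str      # e.g. "header", "table", "chart"
    content: str   # stand-in for the cropped pixels or raw bytes


def parse_header(region: Region) -> dict:
    return {"type": "header", "text": region.content.strip()}


def parse_table(region: Region) -> dict:
    # Toy parser: rows separated by newlines, cells by "|".
    rows = [[cell.strip() for cell in line.split("|")]
            for line in region.content.splitlines() if line.strip()]
    return {"type": "table", "rows": rows}


def parse_chart(region: Region) -> dict:
    # A real system would call a vision model here; this only records a caption.
    return {"type": "chart", "caption": region.content.strip()}


# Dispatch table: one specialized parser per component kind.
PARSERS: Dict[str, Callable[[Region], dict]] = {
    "header": parse_header,
    "table": parse_table,
    "chart": parse_chart,
}


def route(regions: List[Region]) -> List[dict]:
    """Send each region to the parser registered for its kind."""
    results = []
    for region in regions:
        parser = PARSERS.get(region.kind)
        if parser is None:
            # Unknown kinds are kept as-is and flagged rather than guessed at.
            results.append({"type": "unknown", "raw": region.content})
        else:
            results.append(parser(region))
    return results
```

Keeping the dispatch in a plain dictionary means a new component kind only needs a new entry, and anything the segmenter cannot classify is surfaced instead of silently summarized.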
you need to pay adobe $2.99/mo for AI access to pd (Score:2)
you need to pay adobe $2.99/mo for AI access to pdf
Re: (Score:2)
Re: you need to pay adobe $2.99/mo for AI access t (Score:2)
I would think with a good AI you could go pdf->jpg and let the AI look at the rendered page.
Re: (Score:2)
Yeah. 25 years ago. This was our approach for MSWord document assimilation.
Re: (Score:2)
... That is certainly a choice you could have made. I don't know why anyone would make that particular choice, given the tools that came with a typical Word installation at the time.
I'm guessing you were already using OLE automation to create the PDF (print to Acrobat Distiller?), so why not just use that to extract the text instead? Just a few lines in VB or VBA is all you needed, less code than I guarantee it took you to create the PDF and run it through OCR.
Why do so many developers go out of their way
Re: (Score:2)
The real problem is that the LLM is more likely to, as the summary notes, summarize the content. Getting LLMs to dictate spatially structured text reliably is a bit trickier.
Re: (Score:2)
Re: (Score:2)
You would think that with an AI, it wouldn't be necessary to render as jpg. But then, there is no such thing as AI.
There ought to be more information in the PDF than in a JPG. Transforming to a lossy format is, well, losing information. I understand that all the training is with images and not PDFs, so with current training, it is likely better to convert.
Re: (Score:2)
There ought to be more information in the PDF than in a JPG.
Therein, I think, lies the problem. There's too much information, more than the AI has any use for, but it tries to make use of it anyway.
Re: (Score:3)
The only good use case (Score:3)
I can see for AI is improving optical character recognition. I don't care one bit about some garbage summarize feature.
Re: (Score:3)
AI did this 20 years ago. Google for MNIST. Today they recognize charts and whatever graphics.
Re: (Score:3)
The MNIST dataset is a collection of labeled images of handwritten numbers. It's one of the "standard" datasets used by AI researchers and students. (If you're a student or in the field, you've heard about it and very likely used it.) It's been around a lot longer than 20 years, though it has been updated.
AI is a very broad term. "They" are not some monolithic thing, nor is there some natural hierarchy. The article is talking about LLMs. That we've trained one kind of model for OCR using the MNIST dataset
Re: (Score:2)
Years ago people stated that OCR by AI (CNN, sometimes RNN) can recognize text that humans are unable to decipher. VLMs are a new concept and may or may not improve on OCR (using an LLM for context is a nice addition), but overall AI has a lot of solutions for OCR, and all the modern OCR tools you may be using build on that.
Re: (Score:2)
A few years ago, I worked for a company that retrieved records for court cases. They needed to extract relevant data, such as people's names and other details bearing on the legal case. At the time, they were integrating earlier AI technologies to try to extract such information automatically, but the effort fell flat on its face. In any case, this would be another valid use case, if they can get it to work.
Can AIs read? (Score:2)
Are AIs even capable of reading PDFs? Or do they just regurgitate what people who have read them say about them?
Re: (Score:2)
Re: (Score:2)
My luck with Google Notebook LM and pdfs is incredibly good. At least if you want to be able to summarize and look up information in a pdf. It seems able to understand tables and everything. Not sure why Gemini struggles when Notebook LM has few problems.
Re: (Score:2)
Asking it to directly dictate the text within it can get a little trickier. Often for the same reason a person would hang up on it.
"What about this footnote here? How do I represent the text under the charts?"
People are expecting the LLM to solve a problem without actually knowing what the problem is.
Re: (Score:2)
Often for the same reason a person would hang up on it.
This is delusional thinking.
Re: (Score:2)
You hand a PDF with a spatially complex layout to some random desk worker with no idea what the context of the request is, and give them no way to find that out- and you get their best guess as to what you want. It is nearly universally wrong.
Your post was predictably stupid.
Re: (Score:2)
Seeing what you want to see is not "observation".
Don't waste my time with your nonsense.
Re: (Score:2)
You try to present yourself as some kind of authority on the topic, but your posting history is a laughable history of you learning the meaning of the words you use far after you try to use them.
Like when you told us all about how ChatGPT was an RNN, using thousands of words expositing to us the limits of RNNs, and that meant it could be Turing Complete, but then, upon learning what a Transformer was (guess the meaning of that T eluded you), and that the n
Re: (Score:2)
Cry harder, troll. No one cares about your bullshit.
Re: (Score:2)
Pretty fucking humorous coming from someone accusing anyone else of delusion.
Let's do a quick "best hits".
Miraculous, I say. There's so much stupid in this, I don't know where to start. [slashdot.org]
The fuck? Do you not speak a single word at time? ChatGP...erm...RNN?... [slashdot.org]
So much expository bullshit.
You're fucking weak.
Re: (Score:2)
Yep, I said something stupid years ago. You got me. I completely missed the transformer revolution as I wasn't working in the field at the time. I've since published in the field, a thing I can do because I have an actual education, unlike you.
I seriously doubt you want to play the stupid post game. As you know, because you're bizarrely obsessed with me, that isn't going to end well for you. You've proven time and again that you don't have even a basic understanding of, well, anything related to AI.
Re: (Score:2)
I've since published in the field, a thing I can do because I have an actual education, unlike you.
Na, you haven't, lol.
I seriously doubt you want to play the stupid post game. As you know, because you're bizarrely obsessed with me, that isn't going to end well for you.
Oh, lord knows I've said my share of stupid shit. But unlike you- I don't lie about what I do, and what I'm proficient in.
But ya, let's dance motherfucker.
Like I said, cry harder little troll. No one cares about your bullshit. Pathetic.
Bullshit artist calls person troll. Seismic fucking impact, right there.
Would you like a list of the dozen times you've told people they should stop arguing with you because of your "formal background in data science and machine learning"?
Would you then like a more comprehensive list of "stupid shit you said back when you didn't k
Re: (Score:2)
Do you know why you have this weird and creepy obsession with me? It's easy enough to explain. You wish you were me. You wish you had my education and my accomplishments. You wish you could actually do the math, but you don't have the discipline or the intellect, just an over-inflated sense of your own capability. So you pretend that all that complicated stuff doesn't matter because you "understand the concepts", even when it's obvious to everyone else that you don't.
My guess is that you actually belie
Re: (Score:2)
Wish I was you?
A completely full-of-shit bullshit artist?
Look at that! From not knowing what a fucking transformer is 4 years after it changed the fucking landscape of AI, to "I'm also published" [slashdot.org]
No, bud.
You're a fucking loser. Act accordingly.
Re: (Score:2)
The funniest part about your little tantrum is that while you don't know who I am (because psychos like you exist) you definitely know about something I've done.
Like I said. Cry harder, little troll. Pretending I'm not what I very obviously am won't make your life any better.
Also, I'm old. Four years is nothing. (You'll find that out someday if you somehow manage to avoid drowning yourself in the shower.) I wasn't doing any work in AI after 2017, so yes, I missed it. It happens. That doesn't change t
Re: (Score:2)
The thing is, you are a liar.
You will say whatever you need to say to soothe your inferiority complex.
You went from saying clearly and demonstrably wrong things with strong emotion and claims of authority, to claiming you're "also published" (as soon as someone else claimed it).
You're just a sad little fucker trying to project authority where you have none.
Your posts all demonstrate
Re: (Score:2)
Oh, you poor deluded little troll. You can believe whatever nonsense makes you feel better. There are, however, a few facts that you can't change that clearly infuriate you. Let me point them out for you:
1. You used to admire me, until ... ... I embarrassed you when I exposed your deep ignorance and lack of proper education.
2.
3. I was only able to do that because, unlike you, I have an actual education.
4. I'm the same person I was when I was your hero.
5. You're still the same sad little troll you were th
Re: (Score:2)
1. You used to admire me, until ...
Negative. I found your proclivity for being unable to accept that you were wrong obnoxious from the start.
2. ... I embarrassed you when I exposed your deep ignorance and lack of proper education.
Incorrect.
I will say, you did help me become less ignorant on one or two things over time, but mostly in the process of finding out how you were wrong.
3. I was only able to do that because, unlike you, I have an actual education.
Doubtful. Confidence is quiet. You're not.
The better explanation for you is that your education is dated, if existing at all, and you know it.
4. I'm the same person I was when I was your hero.
This is some kind of bizarre fantasy you have. It may explain your emotional reactions to anyone challenging yo
Re: (Score:2)
Oops! You've contradicted yourself. Too funny!
Did you know that I have immense power over you? It's true. Look how worked up you get. That's because of your bizarre obsession with me. Maybe you're a masochist?
I have never met one who argued for years from an incorrect foundation without once bothering to actually educate themselves, which I have demonstrated that you have done.
You're delusional. You found one mistake on my part that was 1) not foundational in any way and 2) was later corrected. You, on the other hand, have never once admitted error, even though the reason you're so obsessed with me is that I pointed out all the nonsense in just one of your posts. Pathe
Re: (Score:2)
Oops! You've contradicted yourself. Too funny!
Doubt it.
Did you know that I have immense power over you? It's true. Look how worked up you get. That's because of your bizarre obsession with me. Maybe you're a masochist?
Worked up? This is clinical, not emotional.
You're projecting.
You're delusional. You found one mistake on my part that was 1) not foundational in any way and 2) was later corrected. You, on the other hand, have never once admitted error, even though the reason you're so obsessed with me is that I pointed out all the nonsense in just one of your posts. Pathetic.
It was never about the mistake, you fucking idiot.
It was a demonstration that you will argue precisely as you are now from a position of ignorance, including the claims of authority and the accusations of lack of education and - still bizarrely - crying.
It was a demonstration that your authority is no authority at all.
The second set was to demonstrate that even your supposed principles are upon the sacrificial altar of your pathologic
Re: (Score:2)
This is clinical, not emotional.
Silly little troll. Your posts and bizarre obsession with me strongly suggest otherwise.
Your claim of being published was a mere 8 months
Sigh... While your ego depends on denying reality, I really am qualified. Yes, I missed the transformer revolution as I hadn't done any work in AI since ~2017 and was (foolishly) basing my comments then on what I expected the state of the field to be at that time. I had some catching up to do, sure, but it was hardly an impossible feat! The only thing I published in 2024 was indeed in AI, coauthored with a friend of
Re: (Score:2)
Silly little troll. Your posts and bizarre obsession with me strongly suggest otherwise.
This argument just doesn't hold water.
I think in your non-computer oriented mind, you imagine I actually downloaded, printed, and then went over every post you've ever made or something.
I'm a software engineer. Certainly you can understand that just because you can't think of a better way to do something, that a better way to do something doesn't exist... Actually scratch that- I literally demonstrated that you do precisely that.
Sigh... While your ego depends on denying reality, I really am qualified. Yes, I missed the transformer revolution as I hadn't done any work in AI since ~2017 and was (foolishly) basing my comments then on what I expected the state of the field to be at that time. I had some catching up to do, sure, but it was hardly an impossible feat! The only thing I published in 2024 was indeed in AI, coauthored with a friend of mine, a sociologist, who reached out to me because of my background. I know that hurts your feelings, but you'll get over it.
You claimed to be published in 2023 in "machine learning and data science", 8
Re: (Score:2)
You claimed to be published in 2023
Okay? That's when it was accepted. It was printed/published in 2024. (I checked.) I also don't know what your "8 months" claim has to do with anything? Do you think that every paper takes years to produce? If you're tenure track, you're expected to publish 2-4 times a year. During active research, people/teams/groups can publish considerably more. I'd say that's just one more thing you don't know anything about, but we knew that about you already. Your ignorance knows no bounds!
If you believe that chain of argumentation makes your point stronger
You're pretty stupid,
Re: (Score:2)
Imagine being so desperate to make people believe you were something you're not. It really is pathetic, man. You got a BTC I can donate some money for your therapy at?
Re: (Score:2)
Couldn't find anything, eh? No surprise there at all.
Here's hoping that healthy dose of reality will help you get over your delusions and bizarre obsession with me.
Get well soon!
Resumes (Score:3)
This has been a pet-peeve of mine for years with resumes. I've literally reached out to Adobe Acrobat product managers on LinkedIn to try to get them to listen.
Standard HR systems like Workday are HORRIBLE at ingesting and reading resumes from PDFs. That's why you have to not only upload a PDF but also painstakingly enter standardized fields. And you end up making the resume super ugly to make it as readable as possible. The market for job seekers is HUGE: millions of people are applying to jobs all the time.
Adobe could fix this with a super low-tech answer. They could even charge a subscription. It would require setting an industry-wide resume data standard - but if anyone can do it, it's Adobe and their marketshare. All they have to do is create a back-end metadata entry specifically for resumes that is hidden from humans but readable by machines. Standard fields like Job 1 Title, Job 1 Company, Job 1 Description, Job 1 dates. People fill this out once and are done.
The other upside to this is it would free humans to make resumes that are easier to read for humans on the other end. No ORC required.
I will die on this hill haha.
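For what it's worth, the hidden, machine-readable resume record proposed above could look something like the sketch below. No such standard exists as far as this thread says, so every field name and the JSON encoding are assumptions made up for illustration; how the record would actually be embedded in a PDF is left out entirely.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Job:
    """One job entry. Field names are hypothetical, not an Adobe standard."""
    title: str
    company: str
    description: str
    start: str  # year-month string, e.g. "2019-03"
    end: str    # "" for a current position


@dataclass
class ResumeMetadata:
    """The record a job seeker would fill out once and attach to the file."""
    name: str
    jobs: List[Job] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to plain JSON so any HR system could read it.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "ResumeMetadata":
        # Rebuild the typed record on the receiving side.
        data = json.loads(raw)
        return cls(name=data["name"], jobs=[Job(**j) for j in data["jobs"]])
```

An HR system that received this alongside the visual PDF would call `ResumeMetadata.from_json` and never need to parse the layout at all.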
Re:Resumes (Score:4, Funny)
I know it's a typo, but I like ORC much more than OCR. Can we re-arrange the words Optical Character Recognition as Optical Recognition of Characters with a silent "of"? Wait a sec, hold that thought, with a silent "by" we can do ORCA, Optical Recognition of Characters by AI! No? Anyone? Bueller? Bueller? Bueller?
Wrong question (Score:3)
Face it, it's the thing everyone wants to do.
If it could, it would finally be a valuable use case for "AI".
Forget about a cure for cancer. Being able to convert and edit a pdf, without errors, is truly the Holy Grail of "AI".
PDF isn't a handy format for humans, either (Score:1)
A majority of the time that you have a PDF, you don't want precise visual formatting. As long as someone is looking at it on a screen, precise format preservation is generally a bad thing, which makes your system inferior to competitors.
Indeed, the only times I've seen preservation of precise formatting actually be a good thing, is when the document in question is exclusively intended to be physically printed, on paper.
But that doesn't change that tools which try to autocomplete sentences in a believable wa
PDF is a fucking complicated format (Score:2)
PDF is complicated and made for display and print and not for parsing. Respect to everyone who implements a useful pdf to text tool.
That said, modern AI tend to use screenshots of rendered PDF, probably for exactly that reason. It's probably easier just to render it with a headless libpoppler or whatever and then OCR it than parsing the mess directly.
So is this article a Reductio ad or what?
Shouldn't this be easy? (Score:2)
Re:Shouldn't this be easy? (Score:4, Insightful)
if you have ever had to work with pdf beyond merely staring at one, you'll realize to what extent that format is an absolute disgrace. there are a zillion tools out there to manage pdf in a zillion ways and not a single one of them gets pdf parsing and layout right 100% of the time, not even adobe's. the only thing pdf had going for it was that it wasn't msword, and that's why it spread like a virus, but we could really benefit from having a proper, truly portable (and universally adopted) document format, even at the cost of reduced functionality or scope.
but the article isn't about that. like the post above yours says, llms probably ignore the format altogether and just ocr-parse a printout. what this is about (apart from being clickbait to take you to the verge) is llms having problems processing text in general exactly the way we would want them to, which is old news. the underlying cause is that llms simply predict text and lack real understanding, which is why their output often looks surprisingly good but is often not quite ok or completely wrong. this is the limit of what their statistical model can offer. now we're trying to furnish them with other supplementary techniques ... with only modest success so far.
Re: Shouldn't this be easy? (Score:2)
Mod parent up. Mod parent up.
PDF sucks as a format. Problems too many to list. My pet peeve: as soon as paper size changes (e.g. Letter to A4), you've changed the document to fit the media. Invalidating any thought about preserving integrity.
Why not just render the PDF as an image? (Score:2)
Why not just render the PDF as an image and then process the image just like AI already can do?
I don't see the challenge, here.
Re: (Score:2)
Why not just render the PDF as an image and then process the image just like AI already can do?
I don't see the challenge, here.
My thoughts exactly. I routinely paste images into ChatGPT that it parses quite well, at least in terms of the text they contain. For things like error messages that pop up on a computer, I don't even type any context and get (mostly) useful interpretations.
more site outtages (Score:2)
Re: (Score:2)
More like basic google imho (Score:2)
Do one simple thing... (Score:2)
... and do it well.
To do a complex thing, string together multiple simple things.
The only reliable way (Score:3)
The only reliable way to parse a PDF is to flatten it to an image and parse using OCR. Anything else has been proven over and over again to put you in an asylum. It's also what most PDF parsing libraries do under the hood.
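A rough sketch of that flatten-then-OCR approach, assuming poppler's `pdftoppm` and the `tesseract` command-line tool are both installed and on the PATH. The function names are made up for this example, and it shows one way to wire the two tools together rather than what any particular library does internally.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def render_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """Command that rasterizes every page to PNG with poppler's pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def ocr_cmd(image: str) -> List[str]:
    """Command that makes tesseract print the recognized text to stdout."""
    return ["tesseract", image, "stdout"]


def pdf_to_text(pdf: str, dpi: int = 300) -> str:
    """Flatten the PDF to page images, then OCR each page in order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = str(Path(tmp) / "page")
        subprocess.run(render_cmd(pdf, prefix, dpi), check=True)
        pages = []
        # Lexicographic sort; assumes the page numbers in the file names
        # are consistently zero-padded.
        for image in sorted(Path(tmp).glob("page*.png")):
            result = subprocess.run(ocr_cmd(str(image)), check=True,
                                    capture_output=True, text=True)
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `pdf_to_text("somefile.pdf")` returns the OCR output of all pages joined together. Everything the PDF knew about fonts, text encoding, and structure is thrown away on purpose, which is the trade-off the comment is arguing for.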
vision (Score:2)
The problem with understanding pdfs is that it comes down to understanding how they look, and that's a computer vision problem, not a language problem solvable by LLMs. Vision is not as magical as LLMs yet.
Re: (Score:1)
This has been a solved problem for about 30 years. Inside the open-source "xpdf" package secretly lives a standalone tool called "pdftotext".
the use is simple:
pdftotext -layout somefile.pdf
enjoy reading somefile.txt as any other plaintext file. The -layout option is required because there is no requirement in pdf for characters to render in the order they will be read on the page. I personally generate pdf files from bottom to top in postscript because I can make the postscript file look like plaintext at
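The point about draw order versus reading order can be shown with a toy example. The sketch below is not how pdftotext works internally; it is a minimal illustration, assuming text fragments arrive as hypothetical `(x, y, text)` tuples in whatever order the file drew them, with y growing upward as in PDF page coordinates.

```python
from typing import List, Tuple

Glyph = Tuple[float, float, str]  # (x, y, text); PDF y grows upward


def reading_order(glyphs: List[Glyph], line_tol: float = 2.0) -> str:
    """Rebuild lines from positioned text, ignoring the order it was drawn in."""
    lines: List[List[Glyph]] = []
    # Visit fragments from the top of the page down.
    for glyph in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tol:
            lines[-1].append(glyph)   # same baseline, within tolerance
        else:
            lines.append([glyph])     # start a new line
    # Within each line, read left to right.
    return "\n".join(
        " ".join(text for _, _, text in sorted(line, key=lambda g: g[0]))
        for line in lines
    )
```

Fed a page that was generated bottom to top, such as `[(0, 680, "second"), (50, 680, "line"), (40, 700, "line"), (0, 700, "first")]`, it puts the fragments back in top-to-bottom, left-to-right order. Multi-column pages, rotated text, and footnotes break this simple rule, which is where the hard part starts.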