PDF OCR failing to capture all text

DamianR · June 20, 2025, 12:40am

Hello! We’ve generally been having good results with Gumloop for extracting data from PDFs. However, we recently noticed that some data wasn’t being found no matter how direct we made the prompt. It turns out that some text isn’t being captured at the OCR stage. This text is right-aligned but still plainly visible, and local tests with both pdftotext and tesseract find it. We’re not sure why the Gumloop OCR component isn’t capturing it, but it’s a necessity for us moving forward.

Here is a recent run:
https://www.gumloop.com/pipeline?run_id=agqxkevoSVvco529DyPgEV&workbook_id=km7BsRa86qkg5qAtXZZQhY

In particular, we expect “Date Prepared: 03/13/2025” and “Effective Date: 01/01/2025” to be in the OCR output, as they appear in the original PDF, but they do not seem to be present.
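
For anyone who wants to reproduce the local comparison, here is a minimal sketch of the check. It assumes Poppler's pdftotext is installed and on PATH; the helper names (normalize, missing_strings, pdftotext_layout) are made up for illustration and say nothing about how the Gumloop OCR node works internally. Whitespace is collapsed before matching because layout-preserving extraction can pad right-aligned text with extra spaces.

```python
import re
import shutil
import subprocess

# The two strings we expect to see in the OCR output (taken from the post).
EXPECTED = ["Date Prepared: 03/13/2025", "Effective Date: 01/01/2025"]


def normalize(text):
    """Collapse runs of whitespace so layout padding does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def missing_strings(ocr_text, expected=EXPECTED):
    """Return the expected strings that do not appear in the extracted text."""
    haystack = normalize(ocr_text)
    return [s for s in expected if normalize(s) not in haystack]


def pdftotext_layout(pdf_path):
    """Extract text with pdftotext, keeping the physical layout.

    Assumes the pdftotext binary (Poppler) is available on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling missing_strings(pdftotext_layout("document.pdf")) on the local extraction, and then missing_strings on the text copied from the OCR node's output, should show whether the two dates are dropped only on one side ("document.pdf" is a placeholder path).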

We’d also be happy to share the original PDF that was used if you don’t have access to it on your backend, but we’d prefer to share it privately, as it is a business document.

Any help you can provide in solving this OCR issue would be appreciated!

Gumloop_Bot · June 20, 2025, 12:40am

Hey @DamianR! If you’re reporting an issue with a flow or an error in a run, please include the run link and make sure it’s shareable so we can take a look.

Find your run link on the history page. Format: https://www.gumloop.com/pipeline?run_id={{your_run_id}}&workbook_id={{workbook_id}}
Make it shareable by clicking “Share” → ‘Anyone with the link can view’ in the top-left corner of the flow screen.
Provide details about the issue—more context helps us troubleshoot faster.

You can find your run history here: https://www.gumloop.com/history

Wasay-Gumloop · June 20, 2025, 1:02am

Hey @DamianR – Thank you for sharing this! In the run link I can’t see any PDF OCR reader being used, is that the correct link?

Also if you could share the PDF with wasay@gumloop.com that’d be great and I’ll look into it.

We’ll fix this and unblock the use-case for you!

DamianR · June 20, 2025, 1:48am

Thank you for the response! Apologies, I linked the subflow instead of the “wrapper” that does the OCR extraction. It can be found here:

https://www.gumloop.com/pipeline?run_id=MKYsw3npFoBqnewy2myzD7&workbook_id=hTtYPkvoFeTSxdKajN4DRu

I’ll send the PDF to you now. Thank you!

Wasay-Gumloop · June 20, 2025, 3:31am

Great, thank you! Replied via email.
