PDF OCR failing to capture all text

Hello! We’ve generally been having good results with Gumloop for extracting data from PDFs. However, we recently noticed some data wasn’t being found no matter how direct we made the prompt. It turns out that some text wasn’t being captured at the OCR stage. This text is right-aligned but still plainly visible, and local tests with both pdftotext and tesseract find it. We’re not sure why the Gumloop OCR component isn’t capturing it, but we need this to work moving forward.

Here is a recent run:
https://www.gumloop.com/pipeline?run_id=agqxkevoSVvco529DyPgEV&workbook_id=km7BsRa86qkg5qAtXZZQhY

In particular, we are expecting “Date Prepared: 03/13/2025” and “Effective Date: 01/01/2025” to be in the OCR output, as they appear in the original PDF, but they do not seem to be present.
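For anyone wanting to reproduce the local comparison, here is a minimal sketch of the kind of check described above. It assumes the `pdftotext` command-line tool (from Poppler) is installed; the helper names `normalize`, `missing_phrases`, and `pdftotext_layout` are hypothetical and only for illustration, and this says nothing about how Gumloop's OCR component works internally.

```python
import re
import subprocess

# The two strings we expect to see in the extracted text.
EXPECTED = ["Date Prepared: 03/13/2025", "Effective Date: 01/01/2025"]


def normalize(text):
    """Collapse runs of whitespace so layout padding does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def missing_phrases(extracted_text, expected):
    """Return the expected phrases that do not appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [phrase for phrase in expected if normalize(phrase) not in haystack]


def pdftotext_layout(pdf_path):
    """Run `pdftotext -layout <pdf> -` and return the text written to stdout.

    Assumes pdftotext is on PATH; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example with an in-memory string standing in for real extractor output:
sample = "Statement                    Date Prepared:   03/13/2025\nTotal due"
print(missing_phrases(sample, EXPECTED))
```

With a real file, `missing_phrases(pdftotext_layout("document.pdf"), EXPECTED)` would list whichever of the two strings the extractor did not return (`document.pdf` is a placeholder path). Running the same check against the text copied from the Gumloop OCR output makes the gap easy to show side by side.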

We’d also be happy to share the original PDF that was used if you don’t have access to it on your backend, but we’d prefer to share it privately as it is a business document.

Any help you can provide in solving this OCR issue would be appreciated!

Hey @DamianR! If you’re reporting an issue with a flow or an error in a run, please include the run link and make sure it’s shareable so we can take a look.

  1. Find your run link on the history page. Format: https://www.gumloop.com/pipeline?run_id={{your_run_id}}&workbook_id={{workbook_id}}

  2. Make it shareable by clicking “Share” → ‘Anyone with the link can view’ in the top-left corner of the flow screen.

  3. Provide details about the issue—more context helps us troubleshoot faster.

You can find your run history here: https://www.gumloop.com/history

Hey @DamianR – Thank you for sharing this! In the run link I can’t see any PDF OCR reader being used, is that the correct link?

Also if you could share the PDF with wasay@gumloop.com that’d be great and I’ll look into it.

We’ll fix this and unblock the use-case for you!

Thank you for the response! Apologies, I linked the subflow instead of the “wrapper” that does the OCR extraction. It can be found here:

https://www.gumloop.com/pipeline?run_id=MKYsw3npFoBqnewy2myzD7&workbook_id=hTtYPkvoFeTSxdKajN4DRu

I’ll send the PDF to you now. Thank you!

Great, thank you! Replied via email.
