I tried to create an agent that helps extract data from my payslip emails. Every week, I receive an email containing a PDF file named PaySlip. The flow runs successfully, but all rows contain data from the first PDF file only. I tried another approach: saving the PDF files to Google Drive and making sure they have different names before running data extraction. The PDF files are named differently in the Google Drive folder; however, the content is still the same: the content of all 20 PDF files comes from the first PDF file. I hope to hear from everyone how to solve this issue. Thanks
Hey @sengsan - Could you please share the run link from the history page with share access enabled so I can take a look at the flow and identify the issue?
Hey @sengsan - This is the workbook link, could you please share the run link instead from the history page so I can take a look at this error:
The flow runs successfully but all rows contain data from the 1st PDF file only
The run link would have the run_id in the URL.
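For anyone unsure which kind of link they copied, here is a minimal Python sketch that checks whether a URL carries a `run_id` query parameter. The helper name `has_run_id` and the placeholder IDs are made up for illustration; this is not part of Gumloop itself.

```python
from urllib.parse import urlparse, parse_qs


def has_run_id(url: str) -> bool:
    """Return True if the URL's query string carries a non-empty run_id."""
    query = parse_qs(urlparse(url).query)
    return bool(query.get("run_id"))


# Placeholder IDs, not real workbooks or runs.
workbook_link = "https://www.gumloop.com/pipeline?workbook_id=EXAMPLE_WORKBOOK"
run_link = "https://www.gumloop.com/pipeline?workbook_id=EXAMPLE_WORKBOOK&run_id=EXAMPLE_RUN"

print(has_run_id(workbook_link))  # False: workbook link only
print(has_run_id(run_link))       # True: run link
```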
https://www.gumloop.com/pipeline?workbook_id=33MNEBivyuRLsnvT8peQxE&run_id=jVYq7mtRvz6ctsHH3HKBYq
Hope this is the correct one.
Thank you for sharing! The issue here is that the PDF Reader node does not natively work with Drive File URLs. In this case, you can connect the Attachment output from the Gmail Reader node to the File input of the PDF Reader node.
Here’s an example: https://www.gumloop.com/pipeline?workbook_id=mPGufFkyybBkbawdXiSp3B
Let me know if this works for you.
Thanks, Wasay, for the support. I tried the suggested method. It ran successfully, but the content is not accurate. It is still the same problem: all rows (all PDF files) were exactly the same, with no new data or files.
Could you share the run link for this, please? Access to the sheet you’re writing to would be helpful too: wasay@gumloop.com
Hi Wasay. Please kindly check this.
https://www.gumloop.com/pipeline?run_id=JncEEAyVX25FdgNM2njfwH&workbook_id=33MNEBivyuRLsnvT8peQxE
The Google Sheet is already shared with your email. I tried this flow with attached PDF files that have different names, and it works perfectly. But when the attached PDF files have the same name, it does not work.
Thank you
Thank you for sharing! The file name should be unique when processing in loop mode. Try using a subflow here, example setup: https://www.gumloop.com/pipeline?workbook_id=m3tbg5fjDQ6KoyAm5j5oMw
Let me know if that works for you.
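The fix suggested in this thread is the subflow. As a separate illustration of the "unique file name" requirement, here is a rough Python sketch of one way to make duplicate attachment names distinct by appending a counter. The helper `make_unique_names` is hypothetical and not a Gumloop feature; it assumes you have some step where file names can be rewritten before they reach the loop.

```python
from pathlib import PurePath


def make_unique_names(names):
    """Return the names in order, adding _2, _3, ... to any repeats."""
    used = set()
    result = []
    for name in names:
        path = PurePath(name)
        candidate = name
        counter = 1
        # Keep bumping the counter until the candidate has not been used yet.
        while candidate in used:
            counter += 1
            candidate = f"{path.stem}_{counter}{path.suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


print(make_unique_names(["PaySlip.pdf", "PaySlip.pdf", "PaySlip.pdf"]))
# ['PaySlip.pdf', 'PaySlip_2.pdf', 'PaySlip_3.pdf']
```

The first occurrence keeps its original name, so files that were already unique pass through unchanged.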
Hey @Wasay-Gumloop, I’m struggling with this issue where the PDF Reader can’t access PDFs stored in Gdrive via URL. Gummie gave me a few workarounds:
- use find and replace to edit the URL so that the file is downloadable (didn’t work after various iterations)
- Use Google Drive File Reader, and output the text string straight into the PDF reader node. (also not working, getting a blank output)
I really need to use the Gdrive URL because it’s in a Gsheet that I’m looping through in the larger flow. Is there a known workaround for this issue that you could point me to?
Hey! The PDF Reader node does not work directly with Drive URLs; however, you can use the Drive File Reader node to read the file. Here’s an example: https://www.gumloop.com/pipeline?workbook_id=7A4g4VNto526JcJwguqRAA&run_id=C6khiQAeEgELdgtq9rirfN
Let me know if this works for you.
Got it, so skip the PDF Reader altogether. Is there any downside to doing it that way? My understanding was that the PDF Reader had an advanced reader mode that made the PDF more readable to the LLMs before the extraction step. I was hoping that would make for a higher-quality output. Could you speak to that?
You’re right about the advanced reader mode, but since we’re working with Drive URLs in your case, the only option is to use the Drive File Reader node. That said, I’ve added a note to our backlog about Drive URL support in the PDF Reader node.
Thanks! Is there a way I can follow the ticket/feature request so I get notified if it gets built one day?
We’re not publicly sharing the timeline or backlog as of now, but you can keep an eye on https://www.gumloop.com/changelog