Best practice in formatting data extraction - transaction data

We are working on a flow to extract transaction data from bank statement PDFs.

Currently we are using the OCR PDF reader to extract data from the statements.

Right now, the “Use AI to extract data” node is returning three separate arrays for the transactions:

  • Date array
  • Description array
  • Amount array

Example:

Date = [
['12/27', '12/20', '12/24', '12/23', '12/23', '12/23', '12/24', '12/24', '12/23', '12/23', '12/23', …]
]
Description = [
['SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA', 'AUTOMATIC PAYMENT - THANK YOU', 'TST* BALDARAR ROSEVILLE MN', …]
]
Amount = [
['-1,580.00', '-3,570.00', '100.00', '159.11', '150.00', …]
]

Objective:

Unify these arrays into a single array of objects, something like:

[
{
"Date": "12/27",
"Description": "SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA",
"Amount": "-1,580.00"
},
{
"Date": "12/20",
"Description": "AUTOMATIC PAYMENT - THANK YOU",
"Amount": "-3,570.00"
},

]
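For reference, a minimal Python sketch of the unification step, of the kind that could be adapted for a “Run Code” node. The function names `flatten` and `combine_transactions` are hypothetical, and the sketch assumes each array arrives wrapped in one extra list, as in the example above.

```python
from typing import Dict, List


def flatten(nested: List) -> List[str]:
    """Flatten one level of nesting, e.g. [['a', 'b']] -> ['a', 'b']."""
    flat: List[str] = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def combine_transactions(dates: List, descriptions: List, amounts: List) -> List[Dict[str, str]]:
    """Zip three parallel arrays into a list of transaction objects."""
    dates = flatten(dates)
    descriptions = flatten(descriptions)
    amounts = flatten(amounts)

    # Refuse to pair values if the columns are out of step; zip() would
    # otherwise silently drop the extra entries.
    if not (len(dates) == len(descriptions) == len(amounts)):
        raise ValueError(
            f"Array lengths differ: {len(dates)} dates, "
            f"{len(descriptions)} descriptions, {len(amounts)} amounts"
        )

    return [
        {"Date": d, "Description": desc, "Amount": amt}
        for d, desc, amt in zip(dates, descriptions, amounts)
    ]


combined_transactions = combine_transactions(
    [["12/27", "12/20"]],
    [["SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA", "AUTOMATIC PAYMENT - THANK YOU"]],
    [["-1,580.00", "-3,570.00"]],
)
```

The length check is a deliberate choice: if the extraction node returns columns of different lengths, failing loudly is usually safer than producing misaligned rows.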

**Is there a Gumloop node, such as a “Transform Data” or “Function” node, that would be best to place right after the AI extraction node, or should I create a custom node?**

Additional Formatting Requirements:
Clean the amounts: How can I make sure commas and negative signs are properly recognized? I would like to remove commas (-1,580.00 -> -1580.00) in the transform node.
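One possible way to clean the amounts in Python is sketched below. The helper name `clean_amount` is hypothetical. It assumes amounts use a comma as the thousands separator and a leading minus sign for debits; the Unicode minus replacement is a guess at what OCR output might contain, not something shown in the sample data.

```python
from decimal import Decimal


def clean_amount(raw: str) -> Decimal:
    """Convert a string such as '-1,580.00' into Decimal('-1580.00')."""
    text = raw.strip()
    # Assumption: OCR may emit a Unicode minus sign instead of a hyphen.
    text = text.replace("\u2212", "-")
    # Remove thousands separators.
    text = text.replace(",", "")
    # Decimal raises decimal.InvalidOperation for text that is not a number.
    return Decimal(text)


print(clean_amount("-1,580.00"))  # -1580.00
```

`Decimal` is used instead of `float` so that currency values keep their exact two decimal places.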

Validate date formats: How can I convert dates to a standardized date format (e.g., YYYY-MM-DD)?
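A rough sketch of the date conversion follows. The extracted dates in the example are MM/DD with no year, so the year has to come from somewhere else; here it is passed in as a hypothetical `statement_year` parameter, and `to_iso_date` is likewise a made-up name.

```python
from datetime import datetime


def to_iso_date(month_day: str, statement_year: int) -> str:
    """Convert 'MM/DD' plus a known year into 'YYYY-MM-DD'.

    Raises ValueError if the text is not a valid calendar date.
    """
    parsed = datetime.strptime(f"{month_day.strip()}/{statement_year}", "%m/%d/%Y")
    return parsed.date().isoformat()


print(to_iso_date("12/27", 2024))  # 2024-12-27
```

A statement that spans a year boundary (December into January) would need extra logic to pick the right year for each row; this sketch does not handle that case.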

Address potential multi-line descriptions: Some statement descriptions might be split across lines or include extra spaces. How can I normalize or trim the description text?
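For the description cleanup, a small sketch that collapses line breaks and repeated spaces into single spaces. The name `normalize_description` is hypothetical.

```python
def normalize_description(raw: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace and
    # discards leading and trailing whitespace.
    return " ".join(raw.split())


print(normalize_description("TST* BALDARAR\n  ROSEVILLE   MN "))
# TST* BALDARAR ROSEVILLE MN
```

This only fixes whitespace inside a single description string. If the extraction step splits one description into two separate array entries, that has to be detected before the arrays are combined.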

Double-check statement boundaries: How to ensure the extraction node doesn’t mix transactions from multiple pages or multiple statements?
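One way to reduce the risk of mixing transactions is to combine the arrays one page at a time and tag every row with its origin. The sketch below assumes a hypothetical input shape, a list of per-page dictionaries with `statement_id`, `page`, `dates`, `descriptions`, and `amounts` keys; it does not reflect the actual output format of the extraction node.

```python
from typing import Dict, List


def combine_by_page(pages: List[Dict]) -> List[Dict]:
    """Combine arrays page by page, tagging each row with its source."""
    combined: List[Dict] = []
    for page in pages:
        dates = page["dates"]
        descriptions = page["descriptions"]
        amounts = page["amounts"]

        # A length mismatch within one page suggests rows were dropped
        # or merged during extraction.
        if not (len(dates) == len(descriptions) == len(amounts)):
            raise ValueError(
                f"Column lengths differ on statement "
                f"{page['statement_id']}, page {page['page']}"
            )

        for d, desc, amt in zip(dates, descriptions, amounts):
            combined.append({
                "Statement": page["statement_id"],
                "Page": page["page"],
                "Date": d,
                "Description": desc,
                "Amount": amt,
            })
    return combined
```

Keeping the statement and page on each row makes it possible to trace a suspicious transaction back to the page it came from.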

Any guidance on the best way to structure this would be greatly appreciated!

Update:
I am now using the “Run Code” node to implement this function, and it is working.

However, when trying to clean the amounts and address potential multi-line descriptions, I am running into this error:

Run Code Failed!
File "/workspace/index.py", line 27
return combined_transactions
^
IndentationError: unindent does not match any outer indentation level

I need to update the indentation on ‘return combined_transactions’ but it seems this field is immutable. Any ideas on how I could fix this?
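For context, Python raises this error when a line is dedented to a column that does not match any enclosing block. The sketch below reproduces it with a made-up function and checks a corrected version using the built-in `compile()`. The source strings and the `indentation_problem` helper are illustrative only; they are not the actual contents of the node's `index.py`.

```python
from typing import Optional

# The body is indented 4 spaces but the return uses 2, which matches
# neither the function body nor the top level.
BROKEN = (
    "def combine(rows):\n"
    "    combined_transactions = list(rows)\n"
    "  return combined_transactions\n"
)

# Same code with the return aligned to the rest of the body.
FIXED = (
    "def combine(rows):\n"
    "    combined_transactions = list(rows)\n"
    "    return combined_transactions\n"
)


def indentation_problem(source: str) -> Optional[str]:
    """Return a short error description, or None if indentation is valid."""
    try:
        compile(source, "index.py", "exec")
    except IndentationError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


print(indentation_problem(BROKEN))
print(indentation_problem(FIXED))  # None
```

If a single line cannot be edited in place, one workaround may be to replace the whole function body with a version that uses the same indentation (for example, 4 spaces) on every line.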


Hey @bdaaes - This topic was marked as solved by you but just making sure here that you’re unblocked on this. Let me know if you’re still facing any errors.

Hi Wasay-Gumloop, this has been resolved, but thanks for following up!

Awesome, glad to hear that!