We are working on a flow to extract transaction data from bank statement PDFs.
Currently we are using the OCR PDF reader to extract data from the statements.
Right now, the “Use AI to extract data” node is returning three separate arrays for the transactions:
- Date array
- Description array
- Amount array
Example:
Date = [
[‘12/27’,‘12/20’,‘12/24’,‘12/23’,‘12/23’,‘12/23’,‘12/24’,‘12/24’,‘12/23’,‘12/23’,‘12/23’, …]
]
Description = [
[‘SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA’, ‘AUTOMATIC PAYMENT - THANK YOU’, ‘TST* BALDARAR ROSEVILLE MN’, …]
]
Amount = [
[‘-1,580.00’, ‘-3,570.00’, ‘100.00’, ‘159.11’, ‘150.00’, …]
]
Objective:
Unify these arrays into a single array of objects, something like:
[
{
“Date”: “12/27”,
“Description”: “SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA”,
“Amount”: “-1,580.00”
},
{
“Date”: “12/20”,
“Description”: “AUTOMATIC PAYMENT - THANK YOU”,
“Amount”: “-3,570.00”
},
…
]
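To make the target shape concrete, here is a minimal sketch of the pairing step, assuming the node after the extraction can run Python and receives the three nested arrays as plain lists. The function name `unify` is hypothetical and not part of Gumloop.

```python
from itertools import chain

def unify(dates, descriptions, amounts):
    """Flatten each nested array and pair the values up by position."""
    # Each input looks like [[...]], so flatten one level first.
    flat = [list(chain.from_iterable(col)) for col in (dates, descriptions, amounts)]
    # Refuse to pair columns of different lengths, since zip() would
    # silently drop the extra values.
    if len({len(col) for col in flat}) != 1:
        raise ValueError("Date, Description and Amount arrays differ in length")
    return [
        {"Date": d, "Description": s, "Amount": a}
        for d, s, a in zip(*flat)
    ]
```

The length check matters because a missed value in one column would otherwise shift every later transaction onto the wrong date or amount.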
**Is there a Gumloop node, such as a “Transform Data” or “Function” node, that would be best to place right after the AI extraction node, or should I create a custom node?**
Additional Formatting Requirements:
Clean the amounts: How can I make sure commas and negative signs are properly recognized? For example, the transform node should remove commas (-1,580.00 -> -1580.00).
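For the amounts, something like the following is what I have in mind. It is only a sketch: the helper name `clean_amount` is hypothetical, and the handling of a `$` sign and of parentheses for negatives is an assumption about how some statements print amounts, not something seen in the example above.

```python
from decimal import Decimal

def clean_amount(raw: str) -> Decimal:
    """Turn a string such as '-1,580.00' into an exact decimal value."""
    text = raw.strip().replace(",", "").replace("$", "")
    # Assumption: some statements write negatives as (1,580.00).
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]
    # Decimal raises decimal.InvalidOperation on anything non-numeric.
    return Decimal(text)
```

`Decimal` is used instead of `float` so that currency values keep their exact two decimal places.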
Validate date formats: How can I convert dates to a standardized date format (e.g., YYYY-MM-DD)?
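A possible sketch for the dates, assuming they always arrive as MM/DD and that the year can be supplied separately (for example from the statement period). The example dates carry no year, so the `year` parameter and the helper name `to_iso` are assumptions.

```python
from datetime import date

def to_iso(mm_dd: str, year: int) -> str:
    """Convert 'MM/DD' plus a known year into 'YYYY-MM-DD'."""
    month, day = (int(part) for part in mm_dd.strip().split("/"))
    # date() raises ValueError for impossible dates such as 02/30,
    # which doubles as the validation step.
    return date(year, month, day).isoformat()
```

A statement that spans December and January would need extra logic to pick the right year per transaction; this sketch does not cover that case.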
Address potential multi-line descriptions: Some statement descriptions might be split across lines or include extra spaces. How can I normalize or trim the description text?
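For the descriptions, I assume collapsing every run of whitespace (including line breaks) into a single space would be enough. A minimal sketch, with a hypothetical helper name:

```python
import re

def normalize_description(raw: str) -> str:
    """Collapse line breaks and repeated spaces, then trim both ends."""
    return re.sub(r"\s+", " ", raw).strip()
```

For example, `normalize_description("  TST* BALDARAR\n   ROSEVILLE MN ")` returns `"TST* BALDARAR ROSEVILLE MN"`.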
Double-check statement boundaries: How can I ensure the extraction node doesn’t mix transactions from multiple pages or multiple statements?
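One idea for the boundaries, resting on an unverified assumption: that each inner list in the nested arrays corresponds to one page or statement. If that holds, the rows could be built group by group and tagged with their source index, so a mismatch is caught where it happens. The function name `rows_by_page` and the `Page` key are hypothetical.

```python
def rows_by_page(dates, descriptions, amounts):
    """Build rows one inner list at a time, tagging each with its group index."""
    if not (len(dates) == len(descriptions) == len(amounts)):
        raise ValueError("the three arrays contain different numbers of groups")
    rows = []
    groups = zip(dates, descriptions, amounts)
    for page, (d_col, s_col, a_col) in enumerate(groups, start=1):
        # A length mismatch inside one group means a value was missed there.
        if not (len(d_col) == len(s_col) == len(a_col)):
            raise ValueError(f"group {page}: column lengths differ")
        for d, s, a in zip(d_col, s_col, a_col):
            rows.append({"Page": page, "Date": d, "Description": s, "Amount": a})
    return rows
```

If the extraction node instead returns everything in a single inner list, this check cannot detect page boundaries and the page information would have to come from the PDF reader.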
Any guidance on the best way to structure this would be greatly appreciated!