Best practice in formatting data extraction - transaction data

bdaaes · February 13, 2025, 4:44pm

We are working on a flow to extract transaction data from bank statement PDFs.

Currently we are using the OCR PDF reader and extracting data from the statements.

Right now, the “Use AI to extract data” node is returning three separate arrays for the transactions:

Date array
Description array
Amount array

Example:

Date = [
  ['12/27', '12/20', '12/24', '12/23', '12/23', '12/23', '12/24', '12/24', '12/23', '12/23', '12/23', …]
]
Description = [
  ['SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA', 'AUTOMATIC PAYMENT - THANK YOU', 'TST* BALDARAR ROSEVILLE MN', …]
]
Amount = [
  ['-1,580.00', '-3,570.00', '100.00', '159.11', '150.00', …]
]

Objective:

Unify these arrays into a single array of objects, something like:

[
  {
    "Date": "12/27",
    "Description": "SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA",
    "Amount": "-1,580.00"
  },
  {
    "Date": "12/20",
    "Description": "AUTOMATIC PAYMENT - THANK YOU",
    "Amount": "-3,570.00"
  },
  …
]
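
For reference, the unification described above could be sketched in plain Python roughly as follows. This is only a minimal sketch: the function names `flatten_once` and `combine_transactions` are hypothetical (not part of Gumloop), and it assumes each field arrives as a list wrapped in an outer list, as in the example.

```python
def flatten_once(values):
    """Flatten one level of nesting, e.g. [['a', 'b']] -> ['a', 'b']."""
    flat = []
    for item in values:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def combine_transactions(dates, descriptions, amounts):
    """Zip three parallel arrays into a list of transaction objects."""
    dates = flatten_once(dates)
    descriptions = flatten_once(descriptions)
    amounts = flatten_once(amounts)

    # If the arrays differ in length, rows would be silently misaligned,
    # so fail loudly instead of zipping.
    if not (len(dates) == len(descriptions) == len(amounts)):
        raise ValueError("Date, Description and Amount arrays differ in length")

    return [
        {"Date": d, "Description": desc, "Amount": amt}
        for d, desc, amt in zip(dates, descriptions, amounts)
    ]


# Usage with the first two rows of the example data.
rows = combine_transactions(
    [["12/27", "12/20"]],
    [["SP UHIREATTACK COM HTTPS:// UHIREATTPACK.WA", "AUTOMATIC PAYMENT - THANK YOU"]],
    [["-1,580.00", "-3,570.00"]],
)
```

The length check is a design choice: a missing amount on one row would otherwise shift every later row without any error.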

**Is there a Gumloop node, such as a “Transform Data” or “Function” node, that would be best to place right after the AI extraction node, or should I create a custom node?**

Additional Formatting Requirements:
Clean the amounts: How can I make sure commas and negative signs are properly recognized? For example, removing commas (-1,580.00 -> -1580.00) in the transform node.
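
A possible sketch of the amount cleanup in plain Python, using the standard `decimal` module so that values like -1580.00 keep their exact two decimal places. The helper name `clean_amount` is hypothetical, and the handling of the Unicode minus sign is an assumption about what OCR output might contain, not something seen in the example data.

```python
from decimal import Decimal, InvalidOperation


def clean_amount(raw):
    """Convert a string such as '-1,580.00' into Decimal('-1580.00')."""
    text = raw.strip().replace(",", "")
    # Assumption: OCR may emit a Unicode minus (U+2212) instead of '-'.
    text = text.replace("\u2212", "-")
    try:
        return Decimal(text)
    except InvalidOperation:
        raise ValueError(f"unrecognized amount: {raw!r}") from None


cleaned = clean_amount("-1,580.00")
```

Calling `str(cleaned)` gives back a comma-free string if the downstream node expects text rather than a number.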

Validate date formats: How can I convert dates to a standardized format (e.g., YYYY-MM-DD)?
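
As a rough sketch of the date conversion: the example dates are in MM/DD form with no year, so the year has to come from somewhere else. The code below assumes the caller supplies it (for instance from the statement period); the helper name `normalize_date` and the `year` parameter are hypothetical.

```python
from datetime import datetime


def normalize_date(raw, year):
    """Convert 'MM/DD' plus a caller-supplied year to 'YYYY-MM-DD'.

    Raises ValueError for values that are not valid calendar dates,
    which doubles as a basic validation step.
    """
    parsed = datetime.strptime(f"{raw.strip()}/{year}", "%m/%d/%Y")
    return parsed.date().isoformat()


iso_date = normalize_date("12/27", 2024)
```

A statement that spans a year boundary (December into January) would need extra logic to pick the right year per row; this sketch does not attempt that.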

Address potential multi-line descriptions: Some statement descriptions might be split across lines or include extra spaces. How can I normalize or trim the description text?
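
One simple way to normalize descriptions, sketched in plain Python: splitting on any whitespace and rejoining with single spaces collapses line breaks, tabs and repeated spaces in one step. The helper name `normalize_description` is hypothetical.

```python
def normalize_description(raw):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(raw.split())


tidy = normalize_description("  TST* BALDARAR \n ROSEVILLE   MN ")
```

This only fixes whitespace inside a single description string. If the extraction returns one description as two separate array entries, the arrays will be misaligned and that has to be caught separately.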

Double-check statement boundaries: How can I ensure the extraction node doesn’t mix transactions from multiple pages or multiple statements?
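
One way to keep boundaries visible, as a sketch only: if extraction is run separately per statement (or per page), each result can be combined on its own and every row tagged with where it came from. The input shape below, a dict keyed by a statement label, and the name `combine_per_statement` are assumptions made for illustration and do not reflect how Gumloop actually passes data between nodes.

```python
def combine_per_statement(extractions):
    """Combine fields per source and tag each row with its source label.

    `extractions` maps a label (hypothetical, e.g. a file name) to a dict
    with parallel 'Date', 'Description' and 'Amount' lists.
    """
    records = []
    for source, fields in extractions.items():
        dates = fields["Date"]
        descriptions = fields["Description"]
        amounts = fields["Amount"]
        # A length mismatch within one source suggests rows were dropped
        # or merged during extraction.
        if not (len(dates) == len(descriptions) == len(amounts)):
            raise ValueError(f"{source}: field lists differ in length")
        for d, desc, amt in zip(dates, descriptions, amounts):
            records.append(
                {"Source": source, "Date": d, "Description": desc, "Amount": amt}
            )
    return records


tagged = combine_per_statement(
    {
        "statement_a.pdf": {
            "Date": ["12/27"],
            "Description": ["AUTOMATIC PAYMENT - THANK YOU"],
            "Amount": ["-3,570.00"],
        },
        "statement_b.pdf": {
            "Date": ["12/24"],
            "Description": ["TST* BALDARAR ROSEVILLE MN"],
            "Amount": ["100.00"],
        },
    }
)
```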

Any guidance on the best way to structure this would be greatly appreciated!

bdaaes · February 13, 2025, 6:56pm

Update:
I am now using the “Run Code” node to implement this function, which is working.

When trying to clean the amounts and handle multi-line descriptions, I am running into this error:

Run Code Failed!
File "/workspace/index.py", line 27
return combined_transactions
^
IndentationError: unindent does not match any outer indentation level

I need to update the indentation on ‘return combined_transactions’ but it seems this field is immutable. Any ideas on how I could fix this?
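
For anyone hitting the same message: Python raises this error when a line is dedented to a column that does not match any enclosing block, for example a `return` indented by two spaces inside a function whose body uses four. A minimal sketch for reproducing and locating the problem outside the node is shown below. The helper name `find_indentation_error` and the sample sources are hypothetical and are not the code from this flow.

```python
def find_indentation_error(source):
    """Return 'line N: message' if the source has an indentation error, else None."""
    try:
        compile(source, "index.py", "exec")
    except IndentationError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


# The 'return' is indented by 2 spaces, which matches neither the
# function body (4 spaces) nor the loop body (8 spaces).
broken = (
    "def combine(rows):\n"
    "    combined_transactions = []\n"
    "    for row in rows:\n"
    "        combined_transactions.append(row)\n"
    "  return combined_transactions\n"
)

# Same code with the 'return' aligned to the function body.
fixed = broken.replace("  return", "    return")

problem = find_indentation_error(broken)
no_problem = find_indentation_error(fixed)
```

Retyping the indentation of the whole function with spaces only, rather than editing the one reported line, is usually the quickest fix.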

system · February 13, 2025, 7:57pm

This topic was automatically closed 60 minutes after the last reply. New replies are no longer allowed.

Wasay-Gumloop · February 13, 2025, 8:42pm

Hey @bdaaes - This topic was marked as solved by you but just making sure here that you’re unblocked on this. Let me know if you’re still facing any errors.

bdaaes · February 14, 2025, 4:31am

Hi Wasay-Gumloop, this has been resolved, but thanks for following up!

Wasay-Gumloop · February 14, 2025, 4:47am

Awesome, glad to hear that!
