Web scraping node - Return HTML

shawnbuilds · March 1, 2025, 4:20am

For the web scraping node, can we extract the actual HTML contents of the page? I need to scrape the img tags and button tags to find some links on the page. As far as I can tell, the current web scraping nodes only grab the text content.

Grabbing profile picture based on img tags on the page

Grabbing a URL from button tags on the page

Wasay-Gumloop · March 1, 2025, 5:36am

Hey @shawnbuilds - Yes, you can use the Web Agent Scraper node with the Scrape Source action.

Let me know if that works for you.

Shrikar · March 1, 2025, 6:25am

@shawnbuilds do you mind elaborating on what you are trying to do? If this is a therapist directory, they are almost certainly optimizing their SEO with a sitemap. So if you share the root URL, we can probably find a better way of extracting the same data with much simpler techniques. I am guessing you want to grab each therapist's image and URL?
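
To illustrate the sitemap idea, here is a minimal Python sketch that lists profile URLs from a sitemap without loading any pages. It assumes the directory publishes a standard sitemaps.org XML file; the sample content and the `/therapists/` path filter are made up for illustration, and the real site may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, not taken from any real site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/therapists/a</loc></url>
  <url><loc>https://example.com/therapists/b</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Keep only the URLs that look like profile pages (assumed path pattern).
profile_urls = [
    url for url in urls_from_sitemap(SAMPLE_SITEMAP) if "/therapists/" in url
]
print(profile_urls)
```

The resulting list could then be fed to a scraper one URL at a time, which avoids crawling listing pages just to discover profile links.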

shawnbuilds · March 1, 2025, 6:37am

Two types of URLs:

Clinic website: Laura Langen - Peak Resilience Counselling
Directory website: Laura Langen, Counsellor, Vancouver, BC, V6Z | Psychology Today

Essentially I'm trying to find their JaneApp booking link from a button on the page.

I'm also trying to grab the img URL of their profile picture.

Couldn’t seem to configure the Web Agent Scraper properly either as per @Wasay’s recommendation.
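
For reference, once the raw HTML is available (for example from the Scrape Source action), pulling out these two pieces does not need an LLM. Below is a rough Python sketch using only the standard library. It assumes the booking button is an `a` or `button` element whose `href` (or a `data-href` attribute, which is a guess) contains `janeapp.com`; the sample markup and URLs are invented for illustration and real pages will differ.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collect img src values and links that point at JaneApp."""

    def __init__(self):
        super().__init__()
        self.image_urls = []
        self.booking_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "img" and attr.get("src"):
            self.image_urls.append(attr["src"])
        elif tag in ("a", "button"):
            # Assumption: the URL lives in href, or in data-href for
            # script-driven buttons. Adjust to the page's real markup.
            url = attr.get("href") or attr.get("data-href") or ""
            if "janeapp.com" in url:
                self.booking_links.append(url)


# Hypothetical page fragment, not copied from either site in this thread.
SAMPLE_HTML = """
<div class="profile">
  <img src="https://example.com/photos/therapist.jpg" alt="Profile photo">
  <a class="btn" href="https://example-clinic.janeapp.com/">Book now</a>
  <a href="/contact">Contact</a>
</div>
"""

collector = LinkAndImageCollector()
collector.feed(SAMPLE_HTML)
print(collector.image_urls)
print(collector.booking_links)
```

On a real page the img list will usually include logos and icons too, so some extra filtering (by alt text, class, or position) would likely be needed to isolate the profile picture.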

shawnbuilds · March 1, 2025, 6:38am

Actually removing the “value” made it work

Still open to better ways to do this though, given that the Web Agent Scraper takes 10 tokens.

Shrikar · March 1, 2025, 8:19am

@Wasay-Gumloop something feels off here

I am not able to get the 12 elements that I see, even when extracting the scrape source. Am I missing something?

In the console I am able to get the 12 elements.

https://www.gumloop.com/pipeline?workbook_id=dhrQKKCiC6N6xbxmqaVhXk

Shrikar · March 1, 2025, 8:23am

I am able to do it in Python, but I don't understand why we can't do it with the scrape source: https://www.gumloop.com/pipeline?workbook_id=wj9uLit6jYjgPYwK8XaAU4. Ideally we should be able to traverse DOM elements without spending any LLM credits. I also think this might be a feature worth adding: don't run the LLM on the whole page source, especially for pages with heavy HTML, since that wastes tokens. Let the end user define the scope of the HTML elements, and then run the LLM on only those elements to extract data.
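
The scoping idea could look something like the Python sketch below. This is not the code from the linked workbook and not an existing Gumloop feature; it is only an illustration of narrowing the page source to user-chosen elements (here, by a CSS class) before anything is passed to an LLM. The `card` class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ScopedTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.current = []   # text pieces of the element being read
        self.matches = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.matches.append(" ".join(self.current))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.current.append(data.strip())


# Hypothetical listing page with navigation and footer noise.
SAMPLE_PAGE = """
<html><body>
<nav>Home About Contact</nav>
<div class="card"><h2>Therapist A</h2><img src="a.jpg"><p>Vancouver</p></div>
<div class="card"><h2>Therapist B</h2><p>Burnaby</p></div>
<footer>Lots of unrelated markup</footer>
</body></html>
"""

extractor = ScopedTextExtractor("card")
extractor.feed(SAMPLE_PAGE)
print(extractor.matches)
```

Only the text inside the matched elements would then go to the LLM step, so the navigation, footer, and other heavy markup never consume tokens. The depth counting assumes reasonably well-formed HTML; pages with unclosed tags would need a more forgiving parser.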

system · March 2, 2025, 3:01pm

This topic was automatically closed 60 minutes after the last reply. New replies are no longer allowed.
