Web scraping node - Return HTML

For the web scraping node, can we extract the actual HTML contents of the page? I need to scrape the img and button tags to find some links on the page. As far as I can tell, the current web scraping nodes only grab the text content.

Grabbing profile picture based on img tags on the page

Grabbing a URL from button tags on the page

Hey @shawnbuilds - Yes, you can use the Web Agent Scraper node with the Scrape Source action.

Let me know if that works for you.

@shawnbuilds do you mind elaborating on what you are trying to do? If this is a therapist directory, they are almost certainly optimizing their SEO with a sitemap. If you share the root URL, we can probably find a much simpler way of extracting the same data. I am guessing you want to grab every therapist's image and URL?

Two types of URLs:

Clinic website: Laura Langen - Peak Resilience Counselling
Directory website: Laura Langen, Counsellor, Vancouver, BC, V6Z | Psychology Today

Essentially I’m trying to find their JaneApp booking link from the button on the page.

Trying to grab the img URL of their profile.
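
For anyone who ends up with the raw HTML and wants to pull these two things out without an LLM, here is a minimal sketch using only Python's standard library. It is not how any Gumloop node works internally. The sample markup, the `example.com` URLs, and the assumption that the booking button is a link whose `href` contains `janeapp.com` are all hypothetical, so adjust them to the real page.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collect img src values and href values from <a>/<button> tags."""

    def __init__(self):
        super().__init__()
        self.image_urls = []
        self.link_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.image_urls.append(attributes["src"])
        elif tag in ("a", "button") and attributes.get("href"):
            self.link_urls.append(attributes["href"])


def extract_profile_data(html):
    """Return (all image URLs, links that look like JaneApp booking links)."""
    collector = LinkAndImageCollector()
    collector.feed(html)
    # Assumption: booking links can be recognised by their domain.
    booking_links = [url for url in collector.link_urls if "janeapp.com" in url]
    return collector.image_urls, booking_links


# Hypothetical markup standing in for a scraped profile page.
SAMPLE_HTML = """
<div class="profile">
  <img src="https://example.com/photos/therapist.jpg" alt="Profile photo">
  <a class="button" href="https://example.janeapp.com/">Book now</a>
  <a href="https://example.com/about">About</a>
</div>
"""

image_urls, booking_links = extract_profile_data(SAMPLE_HTML)
print(image_urls)
print(booking_links)
```

On a real profile page there will usually be more than one `img`, so the list would probably need filtering by `alt` text or a CSS class to find the profile picture.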

Couldn’t seem to configure the Web Agent Scraper properly either as per @Wasay’s recommendation.

Actually removing the “value” made it work :slight_smile:

Still open to better ways to do this though, given that the agent web scraper takes 10 tokens.

@Wasay-Gumloop something feels off here

I am not able to get the 12 elements that I see, even when extracting the scrape source. Am I missing something?

On the console I am able to get the 12 elements

https://www.gumloop.com/pipeline?workbook_id=dhrQKKCiC6N6xbxmqaVhXk

I am able to do it in Python, but I don’t understand why we can’t do it in the scrape source https://www.gumloop.com/pipeline?workbook_id=wj9uLit6jYjgPYwK8XaAU4. Ideally we should be able to traverse DOM elements without spending any LLM credits. I also think this might be a feature worth adding: don’t run the LLM on the whole page source, especially for pages with heavy HTML, since that wastes tokens. Let the end user define the scope of the HTML elements, and then run the LLM on only those elements to extract the data.
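
To illustrate the scoping idea, here is a rough standard-library sketch that keeps only the text inside elements with a given CSS class and drops everything else before anything is sent to an LLM. This is not the code from the linked workbook; the `result-card` class name and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ScopedTextExtractor(HTMLParser):
    """Collect the text of each element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # nesting depth inside the current match; 0 means outside
        self.current = []     # text fragments of the match being read
        self.matches = []     # one string per matched element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.matches.append(" ".join(self.current))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.current.append(data.strip())


# Hypothetical page: only the two cards are worth passing on.
SAMPLE_HTML = """
<html><body>
  <nav>Home | Find a therapist</nav>
  <div class="result-card"><h2>Therapist A</h2><p>Vancouver, BC</p></div>
  <div class="result-card"><h2>Therapist B</h2><img src="b.jpg"><p>Burnaby, BC</p></div>
  <footer>Lots of markup the LLM never needs to see</footer>
</body></html>
"""

extractor = ScopedTextExtractor("result-card")
extractor.feed(SAMPLE_HTML)
scoped_chunks = extractor.matches
print(scoped_chunks)
```

Only the strings in `scoped_chunks` would then go to the LLM step, which is where the token saving would come from. Note that this only sees what is in the HTML it is given; elements added later by JavaScript in the browser would not be present in a static page source.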
