Need help to extract data from a website with CAPTCHA

I’m trying to automate the extraction of NAPLAN test results for schools from the MySchool website.

For example, here’s a sample URL:
https://www.myschool.edu.au/school/45423/naplan/results

Current Approach

I’m using a Browser Replay with cookies method to capture the page content.

Problem

  • The captured screenshot shows only the loading page; it never progresses.
  • I suspect the CAPTCHA mechanism is preventing access, despite using valid cookies.
  • I’ve tried increasing the wait time in the recording and revalidating cookies, but no success.

Questions

  • Has anyone successfully bypassed this type of CAPTCHA using cookies?
  • Is there an alternative method to retrieve these data?
  • Any insights on handling CAPTCHAs in Gumloop?

Below is a screenshot of the automation flow and output for reference.

Output:

Workbook Link: https://www.gumloop.com/pipeline?workbook_id=djLB5D5zB89GD72T5RknCJ

Any help would be greatly appreciated! Thanks in advance.

Hey @VJB - Can you share what’s happening within the selected replay? Are you clicking accept and then proceeding to the page after that?

I’d say the browser extension input node would be a better option here:
Doc: https://docs.gumloop.com/nodes/browser_extension/browser_extension_input
Tutorial: https://www.loom.com/share/6b343be195ba4a55a66ce26894b303f9

Let me know if that works for you.

Hi Wasay,

Inside the current replay I’m just browsing the results page and scrolling down to the bottom, with enough wait time for the page to load. I don’t need to click accept because I’ve already accepted once.

The Browser Extension Input node works when run manually, no issues there. But I’d need to do it manually for each school. I want it automated so that I could just provide the links to the schools as inputs. Hope that makes sense.
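For what it’s worth, generating those input links programmatically is simple if the flow can accept a list of URLs. A minimal sketch, assuming the URL pattern from the sample link earlier in this thread (the extra school IDs below are placeholders, not real schools):

```python
# Hypothetical sketch: build MySchool NAPLAN results URLs from school IDs
# so they can be passed as a batch of inputs to an automated flow.
# URL pattern taken from the sample link in this thread.

def naplan_url(school_id: int) -> str:
    """Return the NAPLAN results URL for a given MySchool school ID."""
    return f"https://www.myschool.edu.au/school/{school_id}/naplan/results"

# 45423 is the school ID from the sample link; the others are placeholders.
school_ids = [45423, 10001, 10002]
urls = [naplan_url(sid) for sid in school_ids]

for url in urls:
    print(url)
```

This only builds the list of links; it doesn’t get around the terms/CAPTCHA page itself.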

I see, thank you for the info. Using the Browser Extension Input is the only option I can see right now due to the popup.

Can you share a few sample links of schools that you’d hope to provide as inputs so I can double-check and see if there’s a workaround?

I also don’t see an API option we could use to pull the data without scraping.

Here are a few sample links –

Really appreciate your help.

Thank you! Unfortunately, I don’t see a way to automatically bypass the terms page. Basically, what you see in incognito for these URLs is what the scraper is able to scrape.

The Browser Extension Input would be the most straightforward (although I understand not ideal) solution here.

Thanks Wasay,
I guess I have to rely on the manual method for now.
