Reddit Blocking Scraper Requests — How to Bypass for v.redd.it Video Extraction?

I literally just had this working yesterday, but it looks like they’ve made a recent update.

I’m building a workflow to extract direct .mp4 links from Reddit-hosted videos (e.g., https://v.redd.it/abc123). These videos are served through a DASHPlaylist.mpd file, which contains the signed video URLs in <BaseURL> tags.

However, when I try to scrape https://v.redd.it/{id}/DASHPlaylist.mpd using Gumloop, I often receive an AccessDenied XML error or get blocked entirely. I suspect this is due to Reddit’s bot detection or header requirements.

Has anyone successfully scraped Reddit .mpd files in Gumloop? Specifically:

  • Can I spoof User-Agent and Referer headers inside a standard content or source scraper node?
  • If not, can I pass that request to a custom Python node to fetch and return the valid .mp4 link? (I’ve put a rough sketch of what I mean just after this list.)
  • Any best practices for avoiding bot detection when working with Reddit media?
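
For reference, this is roughly what I had in mind for the custom Python node. It’s only a sketch: the function name, the header values, and the URL handling are all my own guesses at what Reddit might want, and I haven’t confirmed any of it gets past the block.

import re
import requests

def get_mp4_candidates(video_id: str) -> list[str]:
    """Fetch the DASH playlist for a v.redd.it video and pull out its <BaseURL> entries."""
    mpd_url = f"https://v.redd.it/{video_id}/DASHPlaylist.mpd"

    # Guessed browser-like headers; I don't know which ones (if any) Reddit actually checks.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
        ),
        "Referer": "https://www.reddit.com/",
        "Accept": "*/*",
    }

    resp = requests.get(mpd_url, headers=headers, timeout=15)
    resp.raise_for_status()

    # The playlist is namespaced XML; a simple regex over <BaseURL> is enough for a sketch.
    base_urls = re.findall(r"<BaseURL>(.*?)</BaseURL>", resp.text)

    # BaseURL values may be relative (e.g. "DASH_720.mp4") or absolute, so normalize them.
    return [u if u.startswith("http") else f"https://v.redd.it/{video_id}/{u}" for u in base_urls]

# e.g. get_mp4_candidates("abc123") for the example link above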

Thanks; would love to hear if anyone’s built a clean solution around this.

Hey @GUWLOOP! If you’re reporting an issue with a flow or an error in a run, please include the run link and make sure it’s shareable so we can take a look.

  1. Find your run link on the history page. Format: https://www.gumloop.com/pipeline?run_id={{your_run_id}}&workbook_id={{workbook_id}}

  2. Make it shareable by clicking “Share” → “Anyone with the link can view” in the top-left corner of the flow screen.

  3. Provide details about the issue—more context helps us troubleshoot faster.

You can find your run history here: https://www.gumloop.com/history

https://www.gumloop.com/pipeline?workbook_id=tjSEechGHq8JXY9ddSVS4p&tab=3

Hey @GUWLOOP – Yes, this should be possible. My main question is how you’re actually inputting these URLs. Is it manual, or do you already have a Google Sheet or a database of these URLs that you just want to download? If that’s the case, a simpler way than scraping and dealing with bot protection is to first upload the .mpd Reddit link to Drive, then read it back from the same Drive using the Google Drive file writer and Google Drive file reader nodes. That gives you the file object, which you can then use however you want – upload it somewhere else, send it on Slack, or attach it to an email. It all depends on what you’re trying to do and what your inputs are.

If you do want to go down the scraping route, you should also be able to do it with a run code node or a custom node.
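
As a very rough illustration of that route (plain Python only, ignoring the Gumloop-specific input/output wiring, with guessed headers and a placeholder URL), the code inside the node could look something like this:

import requests

def download_reddit_media(media_url: str) -> bytes:
    """Download a Reddit-hosted media file and return its raw bytes."""
    # Guessed browser-like headers; Reddit may require different or additional ones.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36",
        "Referer": "https://www.reddit.com/",
    }
    resp = requests.get(media_url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.content

# Placeholder URL - swap in the real media link produced by the earlier steps of your flow.
video_bytes = download_reddit_media("https://v.redd.it/abc123/DASH_720.mp4")
with open("reddit_video.mp4", "wb") as f:
    f.write(video_bytes)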

I’ve set up an example below that takes your Reddit video link and then sends it as a Slack message using the Drive approach I mentioned.

https://www.gumloop.com/pipeline?workbook_id=mKrNExEv7ZZnSnLoo3eAPE&tab=6

Thanks, Wasay, really appreciate the help.

To clarify, I’m not inputting URLs manually. I’m pulling Reddit post URLs from a Google Sheet. From there, I download the video file from the corresponding packaged-media.redd.it .mp4 link (with e= and s= parameters), and then write the link to that downloaded file back into the Sheet for reference.

I like your solution and it works well when the .mp4 link is hardcoded. Unfortunately, it breaks when the link is populated dynamically via the Reddit scraper. That’s where I’m running into issues.

This setup worked fine until recently. Reddit now seems to be blocking access to those media links at the network level. I’m getting a “blocked by network security” or Access Denied error, even though the link structure hasn’t changed.

For example, when I used your solution with a dynamically scraped .mp4 link, I got this error:

This XML file does not appear to have any style information associated with it. The document tree is shown below.
<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>HY7NTQTBWG5RP7P5</RequestId>
  <HostId>qpVKTn7IBfgBDh0Gnmcu92ncwJBzPHGFXb0BarcoEfUN598ylubYMWH5AnjcTGvGfJ7YXs83XC8kh8RyqQYoiFafJFBmF+bkkqGZIH2m6BQ=</HostId>
</Error>

I actually got it working again for a short window yesterday, so I’m guessing Reddit pushed a new bot-detection rule or patched something shortly after.

Is there any way to simulate a full browser environment (or pass the right headers/cookies) within a run code node to bypass this? Or do we need to rethink how we’re handling the fetch entirely?
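
The kind of thing I was imagining inside the run code node is below. The warm-up request and the header set are just guesses on my part, and I haven’t verified that any of it actually satisfies whatever check Reddit added:

import requests

def fetch_with_browser_like_session(media_url: str) -> bytes:
    """Fetch a Reddit media URL through a session that mimics a browser a bit more closely."""
    session = requests.Session()

    # Guessed header set; I don't know which of these (if any) Reddit's block actually looks at.
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
        ),
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.reddit.com/",
        "Origin": "https://www.reddit.com",
    })

    # Warm-up request so the session picks up whatever cookies Reddit hands anonymous visitors.
    session.get("https://www.reddit.com/", timeout=15)

    resp = session.get(media_url, timeout=30)
    resp.raise_for_status()
    return resp.content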

Let me know if it would help to share my data flow or Sheet setup; happy to dive deeper. Thanks again.