I literally just had this working yesterday, but it looks like they’ve made a recent update.
I’m building a workflow to extract direct .mp4 links from Reddit-hosted videos (e.g., https://v.redd.it/abc123). These videos are served through a DASHPlaylist.mpd file, which contains the signed video URLs in <BaseURL> tags.
However, when I try to scrape https://v.redd.it/{id}/DASHPlaylist.mpd using Gumloop, I often receive an AccessDenied XML error or get blocked entirely. I suspect this is due to Reddit’s bot detection or header requirements.
Has anyone successfully scraped Reddit .mpd files in Gumloop? Specifically:
Can I spoof User-Agent and Referer headers inside a standard content or source scraper node?
If not, can I pass that request to a custom Python node to fetch and return the valid .mp4 link? (A rough sketch of what I mean is below these questions.)
Any best practices for avoiding bot detection when working with Reddit media?
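For reference, here’s roughly what I have in mind for the Python-node route. This is an untested sketch, and I’m assuming the requests library is available in the Run Code environment:

```python
# Untested sketch: fetch the DASH playlist with browser-like headers.
# The header values are guesses; whether Reddit accepts them is exactly my question.
import requests

video_id = "abc123"  # placeholder id from a v.redd.it URL
mpd_url = f"https://v.redd.it/{video_id}/DASHPlaylist.mpd"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.reddit.com/",
}

resp = requests.get(mpd_url, headers=headers, timeout=30)
if resp.status_code != 200 or "AccessDenied" in resp.text:
    raise RuntimeError("Blocked: got AccessDenied instead of the playlist")
print(resp.text)  # should be the manifest containing the <BaseURL> tags
```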
Thanks; would love to hear if anyone’s built a clean solution around this.
Hey @GUWLOOP! If you’re reporting an issue with a flow or an error in a run, please include the run link and make sure it’s shareable so we can take a look.
Find your run link on the history page. Format: https://www.gumloop.com/pipeline?run_id={{your_run_id}}&workbook_id={{workbook_id}}
Make it shareable by clicking “Share” → “Anyone with the link can view” in the top-left corner of the flow screen.
Provide details about the issue; more context helps us troubleshoot faster.
Hey @GUWLOOP – Yes, this should be possible. My main question is how you’re actually inputting these URLs. Is it manual, or do you already have a Google Sheet or a database of URLs that you just want to download? If that’s the case, a simpler route than scraping and dealing with bot protection is to first upload the .mpd Reddit link to Drive, then read it back using the Google Drive file writer and Google Drive file reader nodes. That gives you the file object, which you can then use however you want: upload it somewhere else, send it on Slack, or attach it to an email. It all depends on what you’re trying to do and what your inputs are.
If you do want to go down the scraping route, you should also be able to do it with a run code node or a custom node.
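As a rough illustration of the Run Code route (a sketch, not something I’ve tested against Reddit; the header values and the parsing approach are assumptions on my part), something like this would fetch the playlist and pull out the <BaseURL> entries:

```python
# Sketch of a Run Code node body: fetch the DASH playlist and extract
# the <BaseURL> video URLs. Header values are assumptions, not verified.
import re
import requests

def get_video_urls(mpd_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Referer": "https://www.reddit.com/",
    }
    resp = requests.get(mpd_url, headers=headers, timeout=30)
    resp.raise_for_status()
    # A simple pattern match is enough for a sketch; a real node might
    # parse the manifest properly with xml.etree.ElementTree instead.
    return re.findall(r"<BaseURL>(.*?)</BaseURL>", resp.text)

print(get_video_urls("https://v.redd.it/abc123/DASHPlaylist.mpd"))
```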
I’ve set up an example below using your Reddit video link and then sent it as a Slack message using the Drive approach I mentioned.
To clarify, I’m not inputting URLs manually. I’m pulling Reddit post URLs from a Google Sheet. From there, I download the video file from the corresponding packaged-media.redd.it .mp4 link (with e= and s= parameters), and then write the link to that downloaded file back into the Sheet for reference.
I like your solution and it works well when the .mp4 link is hardcoded. Unfortunately, it breaks when the link is populated dynamically via the Reddit scraper. That’s where I’m running into issues.
This setup worked fine until recently. Reddit now seems to be blocking access to those media links at the network level. I’m getting a “blocked by network security” or Access Denied error, even though the link structure hasn’t changed.
For example, when I used your solution with a dynamically scraped .mp4 link, I got this error:
```xml
<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>HY7NTQTBWG5RP7P5</RequestId>
  <HostId>qpVKTn7IBfgBDh0Gnmcu92ncwJBzPHGFXb0BarcoEfUN598ylubYMWH5AnjcTGvGfJ7YXs83XC8kh8RyqQYoiFafJFBmF+bkkqGZIH2m6BQ=</HostId>
</Error>
```
I actually got it working again for a short window yesterday, so I’m guessing Reddit pushed a new bot-detection rule or patched something shortly after.
Is there any way to simulate a full browser environment (or pass the right headers/cookies) within a run code node to bypass this? Or do we need to rethink how we’re handling the fetch entirely?
Let me know if it would help to share my data flow or Sheet setup; happy to dive deeper. Thanks again.
Thank you for sharing. I see the issue now. It’s not related to bot protection or anything like that; the video link simply isn’t available in the posts for the AI to extract. If you look at the link it’s extracting, it’s the link to the post, not the link to the video.
Example of video link: https://v.redd.it/{id}/DASHPlaylist.mpd
Example of links extracted by AI in your run: https://v.redd.it/a8oznsc0lvcf1
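Since the id is already present in the extracted post link, one thing worth trying (a sketch; it assumes the trailing path segment of the extracted link is always the video id) is deriving the playlist URL from it:

```python
# Sketch: derive the DASHPlaylist.mpd URL from an extracted v.redd.it link.
# Assumes the trailing path segment is the video id, as in the example above.
def playlist_url(post_link):
    video_id = post_link.rstrip("/").split("/")[-1]
    return f"https://v.redd.it/{video_id}/DASHPlaylist.mpd"

print(playlist_url("https://v.redd.it/a8oznsc0lvcf1"))
# -> https://v.redd.it/a8oznsc0lvcf1/DASHPlaylist.mpd
```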
Wasay, thanks again for your assistance with this matter! I’ve figured out a workaround where I’m using Run Code to generate a list and then manually extracting the URL in a spreadsheet.
I see what you mean about the link, but even instructing the AI differently suggests that Reddit is blocking at some level, unless I’m misunderstanding. I also have these files directed to a shareable Google Drive folder, so that shouldn’t be an issue.
I understand what you mean now, and I appreciate your patience. It does seem like they’re blocking any requests to view the video in the browser, except for the original video you shared. I’m not sure there’s a reliable workaround for this, but if you want to explore bypassing bot protection or try downloading through the browser, you could consider using a Run Code node. For links that are blocked in this way, though, I don’t think the drive uploading method I mentioned earlier would work.
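If you do want to experiment with the browser route in a Run Code node, Playwright is one way to make the request from a real browser context. This is only a sketch; it assumes Playwright and its Chromium binary can be installed in the node’s environment, which I haven’t verified:

```python
# Sketch: fetch the playlist through a real headless browser with Playwright.
# Assumes `pip install playwright` and `playwright install chromium` have run.
from playwright.sync_api import sync_playwright

def fetch_via_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        response = page.goto(url)  # navigates with full browser headers/cookies
        body = response.text()
        browser.close()
        return body

print(fetch_via_browser("https://v.redd.it/abc123/DASHPlaylist.mpd"))
```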