Feed Retrieval Problems: 403 Errors

Here’s what you can email to publishers:

Cloudflare > Security > Web Application Firewall (WAF) > Custom Rules > Create Rule, then whitelist “NewsBlur” in the User-Agent “contains” field.
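For publishers who prefer the expression editor over the dropdowns, the same rule can be written directly as a custom-rule expression. This is a sketch of Cloudflare’s Rules language; the field name and operator below are standard, but verify them against your own dashboard, and set the rule action to “Skip” so matching requests bypass the WAF:

```
(http.user_agent contains "NewsBlur")
```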

Note that custom WAF rules are limited per account. So this won’t always work.

This would also allow malicious bots that spoof the NewsBlur user agent to get through.

It’s clear that NewsBlur is being detected as a malicious bot by Cloudflare’s heuristics across websites. Would it be better to find out from Cloudflare why the requests are being blocked in the first place, and then adjust the requests so NewsBlur’s IPs won’t be marked as malicious or blocked?

NewsBlur is in the Verified Bot program with Cloudflare, and I’ve reached out to them multiple times and they won’t give me any information on where NewsBlur stands.

My hope is that if enough of NewsBlur’s users make noise over on the Cloudflare forums, then we can finally reach somebody high up enough at Cloudflare to allow NewsBlur to fetch those feeds.

I spoke with a publisher this morning and he said that Cloudflare recently increased their “protection” on all of his sites as of a month ago, and that led to a large number of bots being blocked. He’s been whitelisting them individually because that’s the only way to get them in.

Is there any update on this?

I have an idea that I’m trying to find time to work on. It’ll move all of these 403 feeds to a new server in a different queue and allow them to bypass the Cloudflare denial. It’s going to take a bit of time to build, but I see what needs to happen now.


Awesome! Good luck!

https://www.ivpn.net/en/blog/index.xml

keeps timing out even though there doesn’t seem to be a problem with it

It may be unrelated, but you can use Open RSS for feeds that time out:

https://openrss.org/www.ivpn.net/en/blog/index.xml


I’m seeing the same problem with Stereogum, though a quick check of their IPs shows Amazon AWS stuff, not Cloudflare. I’ve reached out to them via their technical support contact.

Forgot to mention the site id: NewsBlur

Something must’ve gotten through on some level, as I see Daily Beast is working again.

How’s your project on this coming, Sam?

I just launched Related Stories/Sites about an hour ago, and that was a huge lift: Discover Feeds by samuelclay · Pull Request #1832 · samuelclay/NewsBlur · GitHub. I’ll be blogging about it soon, probably late next week, once it’s working. Now that that’s launched, I can work on a new feature.

I found a new project that will allow NewsBlur to act more like a browser to get around some problem feeds, so I’m hopeful that it will work. I’ve noticed that even in my local environment, Cloudflare will block NewsBlur, so they’re not even looking exclusively at IP address, which is disappointing.
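As an illustration of the “act more like a browser” idea (this is a sketch, not NewsBlur’s actual implementation; the header values are assumptions), a feed fetcher can send the same headers a desktop browser would, since bot heuristics weigh headers as well as IP address:

```python
import urllib.request

def browser_like_headers(feed_url: str) -> dict:
    """Build request headers resembling a desktop Chrome browser.

    The values are illustrative assumptions, not NewsBlur's real headers.
    """
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "application/rss+xml, application/atom+xml, "
                  "text/xml;q=0.9, */*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": feed_url,
    }

if __name__ == "__main__":
    url = "https://example.com/feed.xml"
    # Build the request; urllib.request.urlopen(req) would then fetch it
    # with these browser-like headers attached.
    req = urllib.request.Request(url, headers=browser_like_headers(url))
```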


Sounds complicated! Good work, thanks for the update!

Any new updates on this project?

I just pushed out an update that uses a new service called ScrapeNinja to get around the 403s. It should automatically re-fetch any forbidden feeds, but it may take a couple of days to get around to all of the broken feeds. It doesn’t work 100% of the time, but it mostly works. Let me know!
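The fallback logic described above can be sketched like this. It’s a simplification assuming a ScrapeNinja-style proxy as the second fetcher; the function names are hypothetical, not NewsBlur’s code:

```python
from typing import Callable, Tuple

# (HTTP status code, response body)
Response = Tuple[int, str]

def fetch_feed(url: str,
               fetch_direct: Callable[[str], Response],
               fetch_via_proxy: Callable[[str], Response]) -> Response:
    """Try a normal fetch first; on a 403, retry through the scraping proxy."""
    status, body = fetch_direct(url)
    if status == 403:
        # Cloudflare (or similar) refused the direct request; fall back to
        # the proxy service, which makes the request look like a real browser.
        status, body = fetch_via_proxy(url)
    return status, body

# Example with stub fetchers standing in for real HTTP calls:
blocked = lambda url: (403, "")
proxied = lambda url: (200, "<rss>...</rss>")
status, body = fetch_feed("https://example.com/feed.xml", blocked, proxied)
# status is 200 here because the proxy fallback succeeded.
```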


Nice…! So far I’ve seen 3 of the feeds that were having issues now working! Very nice!

Yeah, glad to hear it! I have no doubt many will come back to life; the real test is what happens in the next few days/weeks, since they’re prone to break again. But I’m now paying for the privilege of proxying those sites. I was all set to build this myself, but while researching it I came across a GitHub repo that does exactly what I need, and the README mentioned a company that hosts it, so I was sold.


This is the repo, fyi:

Still so far so good…very nice. Thank you!


Still seems to be working well…