Feature: Support for static non-RSS sites via FiveFilters or similar

Inkmist · May 7, 2017, 7:08pm

Hi Samuel,

It would be awesome if NB could support static sites.

The objective is not to crawl huge sites and get banned, but rather to notify users of new content on infrequently updated static sites that have no RSS feed.
This would obviously be a premium feature.

1-When no feed is detected at the link, NB could suggest feeds with similar domains, ordered by number of subscribers. (I think you already sort of do this, but it’s not clear.)

2-Then it should provide the option to monitor the site via custom CSS selectors.
Use an advanced menu for XPath or CSS selectors.
Or even use the original view with a visual selector (see: http://siteconfig.fivefilters.org/).

3-Cache a version and create a feed for all premium users
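To make steps 2 and 3 a bit more concrete, here is a rough sketch of the "cache and create a feed" part, using only the Python standard library. It assumes the selector step has already produced a list of items with a title and a link. The function names `find_new_items` and `build_rss` are made up for illustration, and the in-memory set stands in for whatever shared cache NB would really use.

```python
import xml.etree.ElementTree as ET


def find_new_items(items, seen_links):
    """Return the items whose link has not been seen yet, and record them.

    `seen_links` is a plain set here (an assumption); a real service would
    persist it so one cached copy can serve every subscriber.
    """
    new_items = []
    for item in items:
        if item["link"] not in seen_links:
            seen_links.add(item["link"])
            new_items.append(item)
    return new_items


def build_rss(site_title, site_url, items):
    """Serialize items as a minimal RSS 2.0 document and return it as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Generated feed for " + site_url
    for item in items:
        entry = ET.SubElement(channel, "item")
        ET.SubElement(entry, "title").text = item["title"]
        ET.SubElement(entry, "link").text = item["link"]
        # The link doubles as the guid so readers can de-duplicate entries.
        ET.SubElement(entry, "guid").text = item["link"]
    return ET.tostring(rss, encoding="unicode")
```

Each time the site is polled, only the items returned by `find_new_items` would need to be pushed out as new stories, which is why infrequently updated sites should be cheap to watch.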

4-Technical options:

4a- Set up a host with FiveFilters RSS tools.
It’s cheap, supported, and easily modifiable to save custom JSON to MongoDB.
If you agree I could even get you a free license or even provide a host. :-)

http://fivefilters.org/content-only/

4b- Write your own with BeautifulSoup or Scrapy.
(Overkill? But spiders could be useful and easier to scale.)
Not at all difficult (less than 200 lines).
I’ve done it.

Bottom line: JS content would be difficult, but supporting truly static, simple HTML would be a huge step.

https://docs.scrapy.org/
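For the simple static case, the extraction step could look something like the sketch below. It uses BeautifulSoup's `select()` with a user-supplied CSS selector. The `extract_items` helper and its parameters are hypothetical, not anything NB has today, and it assumes each matched element either is a link or contains one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_items(html, item_selector, base_url=""):
    """Return [{"title": ..., "link": ...}] for elements matching a CSS selector.

    Hypothetical helper: `item_selector` is the selector the user picked,
    and `base_url` is used to resolve relative links.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select(item_selector):
        # Use the element itself if it is a link, else its first <a href> child.
        anchor = node if node.name == "a" else node.find("a", href=True)
        if anchor is None or not anchor.get("href"):
            continue  # nothing to link to, so skip this match
        items.append({
            "title": anchor.get_text(strip=True),
            "link": urljoin(base_url, anchor["href"]),
        })
    return items
```

Fetching the page, rate limiting, and error handling are left out on purpose. Something like `extract_items(page_html, "li.post", base_url="http://example.com/")` would return one entry per matching list item.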

4c- Long term: for bonus points, use Selenium with the PhantomJS webdriver or advanced Scrapy (https://github.com/scrapy-plugins/scrapy-splash) to get around those increasingly popular JS frameworks.

5-An option for JS would be to select a library and let advanced users contribute spider code.
The crawler host would be segregated (maybe on the dev branch) and you would approve spiders for production.
Over time you could have trusted users contribute unsupervised.

Sorry if this was long, I just love NB.

Got a bit sad when you said RSS readers are declining…
NB is the most important app of my day.

Thanks for your time.

PS: Oh, that Twitter code update is great! It’s night and day. Wow.

samuelclay · February 23, 2018, 11:28pm

Hey, this is a great writeup. But it’s also a ton of work to implement. Google in fact used to do this and they scrapped the project because it was so finicky and unreliable.

But before we get into details, can you give me a few examples of URLs that you’d want to track in NewsBlur?