Aggressive filter for Duplicate items.

I know this has been asked many times in the past, but is there any movement on aggressive duplicate item removal? The new Google News feeds have severe issues with duplicate items. Can we de-duplicate based only on the headline, or by some other method? I really like your product otherwise!!

Give me a bit more context. Include screenshots of stories you wish were de-duped.

example:  https://imgur.com/KEebTKf

feed address: https://news.google.com/news/rss/headlines/section/topic/SCIENCE?ned=us&hl=en&gl=US

Thanks for even looking at this :)

Another example from today, although I’m sure you get the idea by now: https://imgur.com/g7zInOT

feed
https://news.google.com/news/rss/headlines/section/topic/TECHNOLOGY?ned=us&hl=en&gl=US

Are the stories empty? NewsBlur already has an aggressive de-duper on a per-feed basis, but it needs more than 100 characters in a story to check against.

With their old feed (which didn’t have many dupe issues) they would include the headline and a one-paragraph blurb describing the article. With the new feed they just seem to have the headline and not much else.

I also have duplicate items in my feeds. That would be a great feature to be able to filter these.

This would be really great! I also have these issues with Google News

Any update on this? Other reader sites such as Inoreader have the same issue. See this forum post for an example:

https://forum.inoreader.com/topic/12092-all-google-news-feeds-deprecated/?tab=comments#comment-28809

The end of the URL is different, and that may be what triggers the duplicate. Is there any way to filter on this to reduce duplicates?
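To make the idea concrete, here is a minimal sketch of what filtering on the URL could look like. It assumes the duplicates differ only in the query string or fragment at the end of the link, which I have not verified against the actual feeds. The names `normalize_url` and `dedupe_by_url` are made up for illustration and are not part of NewsBlur.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the query string and fragment so URLs that differ only at the end compare equal."""
    parts = urlsplit(url.strip())
    # Assumption: everything after the path is tracking noise, not part of the story identity.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe_by_url(stories):
    """Keep the first story seen for each normalized link (hypothetical story dicts with a 'link' key)."""
    seen = set()
    unique = []
    for story in stories:
        key = normalize_url(story["link"])
        if key in seen:
            continue  # same story reached through a slightly different URL
        seen.add(key)
        unique.append(story)
    return unique
```

This would be too aggressive for sites that identify the article in the query string (for example `?id=123`), so it would probably need to be opt-in per feed.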

NewsBlur is already using a pretty aggressive de-duplication heuristic. I sometimes lower it, but it ends up eating comic sites that publish new comics with the same title. In most cases, the feed is claiming it’s a different story and NewsBlur’s de-duplicator doesn’t have enough of a story (minimum of 100 characters) to make a determination whether it’s a dupe or not.

The code is available here for you to peruse: https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/models.py#L1909-L2009
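For readers who don't want to dig through the source, the following is a rough sketch of the kind of length-gated check described above. It is not the code at the link: the similarity measure (`difflib.SequenceMatcher`), the 0.9 threshold, and the function name are assumptions chosen for illustration. Only the 100-character minimum comes from this thread.

```python
from difflib import SequenceMatcher

MIN_CONTENT_LENGTH = 100     # from the thread: shorter stories cannot be compared
SIMILARITY_THRESHOLD = 0.9   # hypothetical value, picked for illustration only


def is_probable_duplicate(new_story: dict, existing_story: dict) -> bool:
    """Return True only when both stories have enough content and the content is nearly identical."""
    new_text = new_story.get("content", "")
    old_text = existing_story.get("content", "")
    # Headline-only stories (like the new Google News items) fall through here:
    # there is not enough text to decide, so they are treated as distinct.
    if len(new_text) <= MIN_CONTENT_LENGTH or len(old_text) <= MIN_CONTENT_LENGTH:
        return False
    ratio = SequenceMatcher(None, new_text, old_text, autojunk=False).ratio()
    return ratio >= SIMILARITY_THRESHOLD
```

The early return is the important part: with only a headline to go on, a stricter rule would also swallow comic feeds that reuse the same title for every strip.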

Well, it’s not working. If you wish I can send you examples.

Could a feature be added to exclude certain feeds from the de-duping? That way a user could exclude comic sites, which would otherwise lose many of their new posts.

Similar thread: De-duplicate similar news items

I’ve found another similar thread: Aggressive filter for Duplicate items.

I think some flexibility in handling (removing) similar or duplicated items would be helpful, e.g. matching on title, body, or links. In general I would rather have a higher false positive rate than a higher false negative rate, if given the option.
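As a sketch of what such an option might look like, here is a title-only de-duper with a per-feed opt-out, along the lines of the exclusion idea raised earlier in the thread. Everything in it (the function names, the `feed` and `title` keys, the normalization rules) is hypothetical and not taken from NewsBlur.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so near-identical headlines match."""
    lowered = title.casefold()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


def dedupe_by_title(stories, excluded_feeds=frozenset()):
    """Keep the first story per (feed, normalized title); feeds in excluded_feeds are never de-duped."""
    seen = set()
    kept = []
    for story in stories:
        if story["feed"] in excluded_feeds:
            kept.append(story)  # e.g. comic feeds that reuse one title for every strip
            continue
        key = (story["feed"], normalize_title(story["title"]))
        if key in seen:
            continue  # accepts false positives in exchange for fewer repeats
        seen.add(key)
        kept.append(story)
    return kept
```

Calling `dedupe_by_title(stories, excluded_feeds={"my-comic-feed"})` would de-duplicate the news feeds by headline while leaving the comic feed untouched.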