Aggressive filter for Duplicate items.

Thomas_Pemberton · November 10, 2017, 10:09pm

I know this has been asked for many times in the past, but is there any movement on aggressive duplicate item removal? The new Google News feeds have severe issues with duplicate items. Can we de-duplicate based only on the headline, or by some other method? I really like your product otherwise!

samuelclay · November 10, 2017, 10:16pm

Give me a bit more context. Include screenshots of stories you wish were de-duped.

Thomas_Pemberton · November 10, 2017, 10:29pm

example: https://imgur.com/KEebTKf

feed address: https://news.google.com/news/rss/headlines/section/topic/SCIENCE?ned=us&hl=en&gl=US

Thanks for even looking at this

Thomas_Pemberton · November 13, 2017, 6:04pm

Another example from today, although I’m sure you get the idea by now: https://imgur.com/g7zInOT

feed address: https://news.google.com/news/rss/headlines/section/topic/TECHNOLOGY?ned=us&hl=en&gl=US

samuelclay · November 13, 2017, 7:57pm

Are the stories empty? NewsBlur already has an aggressive de-duper on a per-feed basis, but it needs more than 100 characters in a story to check against.

Thomas_Pemberton · November 13, 2017, 8:12pm

With their old feed (which didn’t have many dupe issues), they would include the headline and a one-paragraph blurb describing the article. With the new feed, they just seem to have the headline and not much else.

Emil_Pop · January 15, 2018, 12:38pm

I also have duplicate items in my feeds. Being able to filter these would be a great feature.

Victor_Cunha · March 20, 2018, 4:50pm

This would be really great! I also have these issues with Google News

Thomas_Pemberton · March 29, 2018, 9:46pm

Any update on this? Other reader sites such as Inoreader have the same issue. See this forum post for an example:

https://forum.inoreader.com/topic/12092-all-google-news-feeds-deprecated/?tab=comments#comment-28809

The end of the URL is different, and that may be what triggers the duplicate. Is there any way to filter on this to reduce duplicates?
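To make the URL idea concrete, here is a minimal sketch of comparing story links after dropping the query string and fragment. The function name `normalize_story_url` and the example.com links are made up for illustration; this is not how NewsBlur or Inoreader actually compare links, and it assumes the differing "end of the URL" is a query string or fragment, which may not hold for every feed.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_story_url(url):
    """Return the URL without query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; drop everything after the path.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


# Two hypothetical links that differ only in their trailing parameters.
first = "https://example.com/story/123?ref=feed-a"
second = "https://example.com/story/123/?ref=feed-b#comments"

same_story = normalize_story_url(first) == normalize_story_url(second)
```

The risk with this approach is that some sites identify the story in the query string itself, so stripping it would merge stories that are genuinely different.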

samuelclay · March 30, 2018, 4:46am

NewsBlur is already using a pretty aggressive de-duplication heuristic. I sometimes lower it, but it ends up eating comic sites that publish new comics with the same title. In most cases, the feed is claiming it’s a different story and NewsBlur’s de-duplicator doesn’t have enough of a story (minimum of 100 characters) to make a determination whether it’s a dupe or not.

The code is available here for you to peruse: https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/models.py#L1909-L2009
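For readers who do not want to dig through the source, the following is a rough sketch of the kind of check described above: compare story content for similarity, but decline to decide when there is too little text. It is not the NewsBlur implementation. The function name, the use of `difflib`, and the 0.9 similarity threshold are all assumptions; only the 100-character minimum comes from this thread.

```python
from difflib import SequenceMatcher

MIN_CONTENT_LENGTH = 100     # minimum story length mentioned in this thread
SIMILARITY_THRESHOLD = 0.9   # assumed value, chosen only for illustration


def is_probable_duplicate(new_content, existing_content,
                          min_length=MIN_CONTENT_LENGTH,
                          threshold=SIMILARITY_THRESHOLD):
    """Return True or False, or None when there is too little text to decide."""
    if len(new_content) < min_length or len(existing_content) < min_length:
        # Headline-only stories land here: no determination can be made.
        return None
    matcher = SequenceMatcher(None, new_content, existing_content, autojunk=False)
    return matcher.ratio() >= threshold
```

With headline-only feeds, every comparison falls into the `None` branch, which matches the behavior described above: the stories are kept because there is nothing to compare.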

Emil_Pop · December 13, 2018, 3:58pm

Well, it’s not working. If you wish I can send you examples.

tinytankpu · February 8, 2019, 7:05pm

Could a feature be added to exclude certain feeds from the de-duping so a user could exclude comic sites that would otherwise eliminate many of their new posts?

drhouse · June 9, 2019, 1:32am

Similar thread: De-duplicate similar news items

drhouse · June 9, 2019, 1:32am

I’ve found another similar thread: Aggressive filter for Duplicate items.

B111 · June 5, 2020, 4:29am

I think some flexibility in handling (removing) similar or duplicated items would be helpful, e.g. matching on title, body, or links. In general, I would rather have more false positives than false negatives if given the option.
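As a sketch of what a title-only option might look like, the snippet below keeps the first story seen for each normalized title. The names `normalize_title` and `dedupe_by_title` and the dictionary shape of a story are hypothetical, not part of NewsBlur. It deliberately errs toward false positives, so it would also drop the comic-style feeds mentioned earlier that reuse the same title for new posts.

```python
import re


def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(cleaned.split())


def dedupe_by_title(stories):
    """Keep only the first story for each normalized title."""
    seen = set()
    unique = []
    for story in stories:
        key = normalize_title(story["title"])
        if key in seen:
            continue  # treated as a duplicate, even if the link differs
        seen.add(key)
        unique.append(story)
    return unique


# Hypothetical stories for illustration only.
stories = [
    {"title": "SpaceX launches rocket!", "link": "https://example.com/a"},
    {"title": "spacex  launches rocket", "link": "https://example.com/b"},
    {"title": "New battery breakthrough", "link": "https://example.com/c"},
]
unique_stories = dedupe_by_title(stories)
```

A per-feed switch for this kind of matching would let a user enable it for headline-only news feeds while leaving comic feeds untouched.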