I know this has been asked many times in the past, but is there any movement on aggressive duplicate item removal? The new Google News feeds have severe issues with duplicate items. Can we de-duplicate based only on the headline, or by some other method? I really like your product otherwise!!
Give me a bit more context. Include screenshots of stories you wish were de-duped.
Thanks for even looking at this
Another example from today, although I’m sure you get the idea by now https://imgur.com/g7zInOT
Are the stories empty? NewsBlur already has an aggressive de-duper on a per-feed basis, but it needs more than 100 characters in a story to check against.
With their old feed (which didn’t have many dupe issues), they would include the headline and a one-paragraph blurb describing the article. With the new feed, they just seem to have the headline and not much else:
I also have duplicate items in my feeds. That would be a great feature to be able to filter these.
This would be really great! I also have these issues with Google News
Any update on this? Other reader sites, such as Inoreader, have the same issue. See forum post example here:
The end of the URL is different, which may be what causes the duplicate. Is there any way to filter on this to reduce duplicates?
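To make the idea concrete, here is a minimal sketch of filtering on the URL. It assumes the part that differs is the query string or fragment (for example tracking parameters), which is a guess on my part and not something confirmed in this thread. The names `TRACKING_PARAMS`, `normalize_url`, and `dedupe_by_url`, and the `link` key on each story, are hypothetical and are not NewsBlur's API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as noise when comparing URLs.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalize_url(url):
    """Build a comparison key that ignores tracking parameters and fragments."""
    parts = urlsplit(url)
    # Keep only the query parameters that are not on the tracking list.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        urlencode(sorted(query)),  # sort so parameter order does not matter
        "",                        # drop the fragment
    ))


def dedupe_by_url(stories):
    """Keep the first story seen for each normalized URL."""
    seen = set()
    unique = []
    for story in stories:
        key = normalize_url(story["link"])
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique
```

This only helps if the duplicates really point at the same article. If the feed assigns each copy a completely different link, URL normalization alone will not catch them.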
NewsBlur is already using a pretty aggressive de-duplication heuristic. I sometimes lower it, but it ends up eating comic sites that publish new comics with the same title. In most cases, the feed is claiming it’s a different story and NewsBlur’s de-duplicator doesn’t have enough of a story (minimum of 100 characters) to make a determination whether it’s a dupe or not.
The code is available here for you to peruse: https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/models.py#L1909-L2009
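For readers who don't want to dig through the source, the behavior described above can be sketched roughly as follows. This is not the code at that link. It is a simplified illustration that assumes a plain string-similarity check from Python's `difflib`, and the 0.9 threshold is a made-up value. Only the 100-character minimum comes from this thread.

```python
from difflib import SequenceMatcher

MIN_CONTENT_LENGTH = 100     # from the thread: shorter stories are not compared
SIMILARITY_THRESHOLD = 0.9   # assumed value, for illustration only


def is_duplicate(new_content, existing_content):
    """Return True only when both stories are long enough and nearly identical."""
    if len(new_content) < MIN_CONTENT_LENGTH or len(existing_content) < MIN_CONTENT_LENGTH:
        # Not enough text to judge, so treat the stories as distinct.
        return False
    matcher = SequenceMatcher(None, new_content, existing_content, autojunk=False)
    return matcher.ratio() >= SIMILARITY_THRESHOLD
```

A feed that ships only a headline never reaches the similarity check, which is why headline-only duplicates get through under this kind of rule.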
Well, it’s not working. If you wish I can send you examples.
Could a feature be added to exclude certain feeds from de-duping, so a user could exempt comic sites that would otherwise lose many of their new posts?
Similar thread: De-duplicate similar news items
I’ve found another similar thread: Aggressive filter for Duplicate items.
I think some flexibility in handling (removing) similar or duplicated items would be helpful, e.g. based on title, body, or links. In general, I would accept a higher false-positive rate over a higher false-negative rate if given the option.
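Something like the following is what I have in mind for the title-based option, including the per-feed opt-out suggested earlier in the thread for comic sites. It is only a sketch: `title_key`, `dedupe_by_title`, `excluded_feed_ids`, and the `feed_id` and `title` keys are all hypothetical names, not anything NewsBlur exposes.

```python
import re


def title_key(title):
    """Lowercase a headline and collapse punctuation and whitespace.

    Note: this simple pattern keeps only ASCII letters and digits.
    """
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe_by_title(stories, excluded_feed_ids=frozenset()):
    """Drop stories whose normalized title was already seen in the same feed."""
    seen = set()
    kept = []
    for story in stories:
        if story["feed_id"] in excluded_feed_ids:
            # Feed opted out (e.g. a comic that reuses titles): keep everything.
            kept.append(story)
            continue
        key = (story["feed_id"], title_key(story["title"]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(story)
    return kept
```

Matching on the title alone is deliberately aggressive, so it will produce false positives for feeds that reuse headlines. The exclusion set is the escape hatch for those.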