Is there a way to train just part of an author's name?

I subscribe to several of the former Gawker sites, and they frequently cross-post on each other’s pages.

For example, I might get an io9 article in my io9 feed, but then it gets shared to Gizmodo and I get it in my Gizmodo feed as well.

As near as I can tell, the only difference between the two stories is in the Author field: in the example above, the Gizmodo copy would list the author as “John Smith on io9, shared by Jane Jones to Gizmodo.”

If I could train just the “on io9, shared by” part of the Author field, I’d be able to weed these out. Unfortunately, I don’t think the Author field can be trained in part. Is there a way I’m not seeing?

(I’m aware that I could just start blocking every iteration of these I see and eventually get most of them, but that feels like an inelegant game of Whack-A-Mole. I just thought I’d check to see if there was a better way first.)
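
To make the idea concrete, here is a minimal Python sketch of matching the shared-byline pattern rather than each full author string. It assumes the Author field always follows the “X on SiteA, shared by Y to SiteB” shape from the example above. The names `CROSS_POST`, `parse_cross_post`, and `is_cross_post` are hypothetical and not part of any reader's actual training feature.

```python
import re

# Hypothetical pattern for bylines shaped like
# "John Smith on io9, shared by Jane Jones to Gizmodo".
CROSS_POST = re.compile(
    r"^(?P<author>.+?) on (?P<origin>.+?), "
    r"shared by (?P<sharer>.+?) to (?P<target>.+)$"
)

def parse_cross_post(author_field):
    """Return the byline parts if the field looks like a cross-post, else None."""
    match = CROSS_POST.match(author_field.strip())
    return match.groupdict() if match else None

def is_cross_post(author_field):
    """True when the Author field matches the assumed cross-post shape."""
    return parse_cross_post(author_field) is not None
```

A single rule like this would cover every author and site combination, which is the point of training on only part of the field.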

I would love to see some ML around single-sourcing articles. There’s so much cannibalism going on these days that I’ve been considering unfollowing some of the more prominent offenders. I wonder if Sam’s got that kind of time; it would really be a killer feature, though. (For example, to downvote items whose original source I’ve already seen, or to promote the original source over a linked-list entry.)

I’ve wanted this functionality for a while, too. I follow a couple of news sites that include various wire content along with their local content. Their author fields might contain “J. Reporter, The Somewhere Gazette” or “M. Writer, The Associated Press” and often double/triple bylines like “A. Smith, D. Jones and T. Johnson, The Somewhere Gazette.” I tried training every individual permutation, and it was endless, frustrating whack-a-mole (because there are thousands of wire-service reporters). I want to thumbs-up anything with “Somewhere Gazette” and thumbs-down “Associated Press.”
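
As a rough illustration of what partial-author training could look like, the Python sketch below scores an Author field by summing the score of every fragment it contains. The `score_author` function and the `rules` dictionary are hypothetical, and the +1/-1 values simply stand in for thumbs-up and thumbs-down. This is not how any existing reader implements training.

```python
def score_author(author_field, rules):
    """Sum the scores of every rule whose fragment appears in the author field.

    Matching is a case-insensitive substring test, so one rule covers
    single, double, and triple bylines alike.
    """
    lowered = author_field.lower()
    return sum(
        score
        for fragment, score in rules.items()
        if fragment.lower() in lowered
    )

# Hypothetical rules: +1 stands in for thumbs-up, -1 for thumbs-down.
rules = {"Somewhere Gazette": 1, "Associated Press": -1}
```

With these two rules, every byline ending in “The Somewhere Gazette” scores positive and every Associated Press byline scores negative, with no need to train individual reporters.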

Any news on this? I would love to be able to cull the duplicates in some way.
