Questionable Content Feed missing items

It seems odd that there would be duplicates, as these look like UUIDs: 128-bit numbers that are very unlikely to collide. Could you show us which stories match the GUIDs? I'm also quite curious about the entries where multiple posts get collapsed into a single news story on NewsBlur; I think that may be causing some of your duplication issues. Perhaps you're getting the title from one post and the GUID from a different post because some malformed CDATA is making the XML parser gag and merge posts. Perhaps the parser is detecting end tags accidentally located inside the CDATA?
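As a hypothetical illustration of that failure mode (this is a constructed example, not the actual QC feed): if the first item's CDATA section is missing its terminator and a stray `]]>` appears in a later item, the CDATA runs past the end of one post and swallows the start of the next, so the parser sees one merged story instead of two.

```python
import xml.etree.ElementTree as ET

# Post A's CDATA never closes; the stray "]]>" in Post B ends it,
# so everything in between becomes a single <title>.
feed = """<channel>
<item><title><![CDATA[Post A</title></item>
<item><title>Post B]]></title></item>
</channel>"""

items = ET.fromstring(feed).findall("item")
print(len(items))                   # 1 item, not 2
print(items[0].find("title").text)  # both posts' text, merged
```

The document is still well-formed XML, so nothing errors: the two posts just quietly merge, which would explain a title from one post paired with a GUID from another.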

I tried reaching out to Jeph on Twitter, but we need someone with verified weight. He's producing his RSS feed by hand. There's a much better way; I'd do it for free. I could actually do it without him, but nobody would adopt it.

So it looks like it’s stopped working again; the last two updates aren’t showing up. I don’t know if this is a change from before, but the feed currently claims to be generated by “Feeder 2.3.8(1705); Mac OS X Version 10.8.4 (Build 12E55) http://reinventedsoftware.com/feeder”.

The current feed says it was last published on "Tue, 17 Sep 2013", so, since the guy produces the feed by hand (see other posts in this thread), it seems he just hasn't updated it in the last few days.

He’s done his back in, so no comic the last few days.
There’s a filler up today, but those are never in the feed.

There are still two items in the feed that aren't on NewsBlur yet, though, even though the filler isn't in the feed: 2535 and "New Shoes", whereas the third item in the feed, "Prince Albert", is the first on NewsBlur. Here's another oddity in the QC feed. It has a…

near the start of several of the posts. Some parsers might choke on that.

Since the feed is updated manually, I'm sure he just hasn't been updating it.

Ugh, time to write something.

Yeah, he updates the feed by hand. No idea *why* it’s this way, but it is what it is. That’s the root cause.

I'm having problems with QC now. I hadn't had much trouble until last week, but now it's pretty much completely broken. I'm using the same RSS URL as above, but it maps to a different NewsBlur URL - http://www.newsblur.com/site/774/ques…

It’s not picking up new stories, at the same time as repeatedly re-marking an older one as ‘unread’. Today the “unread” count in the main list of feeds is probably correct, but doesn’t match up with the actual list of stories (which contains none of the several entries since the 16th).

If the sidebar doesn't match up with the story list, that's got to be a NewsBlur bug, right? I've had a look through the QC feed and there doesn't appear to be a recent issue with duplicate GUIDs.

Thanks,
Will

I’m actually getting the most recent couple of QC strips returning to my Newsblur feed again and again. Enough, already! I’ve read those!

@tedder42: if you do it and post it here, at least I’d adopt it. Something reliable sounds better than something unreliable :slight_smile:

There’s still a bunch of stories that are being duped:

[Oct 21 17:00:52] —> [QC RSS*] Feed fetch in 6.06s
[Oct 21 17:00:52] —> [QC RSS*] Checking 100 new/updated against 95 stories
[Oct 21 17:01:10] —> [QC RSS] IntegrityError on updated story: 2514
[Oct 21 17:01:13] —> [QC RSS] IntegrityError on updated story: So Demure
[Oct 21 17:01:14] —> [QC RSS] IntegrityError on new story: F2D902B7-90E9-4C1B-8D45-04C008AE3845 - Tried to save duplicate unique keys (E11000 duplicate key error index: newsblur.stories.$story_hash_1 dup key: { : “774:ab98f3” })
[Oct 21 17:01:18] —> [QC RSS] IntegrityError on updated story: On The Scent
[Oct 21 17:01:21] —> [QC RSS] IntegrityError on updated story: 2458

If you check the comments above, you'll notice that the publisher maintains the RSS feed by hand, and NewsBlur checks every single GUID against every other GUID, so it's rather hard to verify by hand.

Note that I have two options. I can keep it as is and disallow multiple stories with the same GUID (a GUID's purpose is to globally identify a story, so reusing one is essentially the publisher saying that two stories are the same and one has been updated). Or I can allow dupes, and suddenly you'll see duplicates in a bunch of other, unrelated RSS feeds.

My competitors allow dupes, and that seems to be more acceptable, but I have no plans to change. Not only would allowing them be a major architectural change that will never happen, but de-duping is the right decision: it does exactly what the publisher is explicitly stating by using the same GUID on multiple stories.
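A minimal sketch of that de-duping policy, for anyone following along. The key format mirrors the "774:ab98f3" seen in the log above (feed id plus a short hash of the GUID), but NewsBlur's real scheme is an assumption here; the point is that reusing a GUID updates the stored story rather than creating a duplicate.

```python
import hashlib

def story_hash(feed_id, guid):
    # feed id plus a short hash of the publisher-supplied GUID
    return "%d:%s" % (feed_id, hashlib.sha1(guid.encode()).hexdigest()[:6])

stories = {}

def ingest(feed_id, guid, title):
    key = story_hash(feed_id, guid)
    status = "updated" if key in stories else "new"
    stories[key] = title  # same GUID means the publisher updated the story
    return status

print(ingest(774, "F2D902B7-90E9-4C1B-8D45-04C008AE3845", "2514"))  # new
print(ingest(774, "F2D902B7-90E9-4C1B-8D45-04C008AE3845", "2514"))  # updated
```

So a hand-edited feed that pastes the same GUID onto two different comics will always collapse them into one story under this policy, by design.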


I completely understand your handling of GUIDs. No one wants to implement a broken system because of broken implementations (I’m a software engineer for radio comms - that sort of battle is my life!).

I just took a dump of QCRSS.xml and checked all the GUIDs in it with a little Python script, and none of the GUIDs are duplicated. That obviously covers only recent history, so I can't check back further than that.
Are these GUID collisions against older, previously downloaded items?
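The script itself wasn't posted, but a check of this kind is only a few lines: count the `<guid>` values in the feed and report any that appear more than once. The feed URL in the comment is an assumption; point it at wherever QCRSS.xml actually lives.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def duplicate_guids(xml_text):
    """Return GUID values that appear more than once in an RSS document."""
    root = ET.fromstring(xml_text)
    counts = Counter(g.text for g in root.iter("guid"))
    return [guid for guid, n in counts.items() if n > 1]

# Live check (requires network access):
# from urllib.request import urlopen
# print(duplicate_guids(urlopen("http://www.questionablecontent.net/QCRSS.xml").read()))

sample = """<rss><channel>
<item><guid>A</guid></item>
<item><guid>B</guid></item>
<item><guid>A</guid></item>
</channel></rss>"""
print(duplicate_guids(sample))  # ['A']
```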

Is the above log several stories colliding on that one quoted GUID, or is that several stories colliding with some previously stored GUIDs?

Is it possible that the author has sorted out his GUID problems now (possibly by changing some GUIDs?), but you've still got some older values stored? How far back does your history go? The oldest entry in the XML is over a year old now; could previous history be flushed in some way? (Obviously I have little idea how your system works apart from what you've hinted at above.)

Wow that’s a lot of questions…
Anyway, grateful for the rapid response!

Will, Jeph manually creates the RSS, so it tends to be a problem if he copy/pastes an entry and then fixes it later.

From what Mr Clay has said above, surely that would register as an update to the old entry whose GUID it collides with (or as no change, because the error is noticed), and then, once he fixes the GUID, the entry should be processed properly and appear in the feed. There might be a delay, but it should get through.

I've validated the current set of GUIDs as unique, but we're still a few days behind in the list. That's why I'm a little confused by the log output just above: it appears to show collisions that I can't see.

Is the same GUID problem still ongoing? I checked for duplicate GUIDs again and everything looks unique, but I haven’t had any new entries since 24th Oct (and they’re in the feed pretty much daily).

Looks like Jeph knows about it. https://twitter.com/jephjacques/statu…

I should probably have actually looked at the XML!
The GUIDs are all unique, but that doesn't help us if none of the entries are in there, apart from the most recent, and that one's got the wrong date!

Thanks C Dave.

Honestly, I would take the tack that this is uncommon but the feeds are unlikely to fix themselves, and somehow mark the URL as "problematic" and have the parser ignore the GUIDs for those feeds. Then you get the stronger de-duping on everything except the feeds that are the exception to the rule.
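That suggestion might look something like this (a hedged sketch; the flag set, function names, and use of the link as a fallback key are all hypothetical): keep strict GUID de-duping by default, but for feeds flagged as problematic derive the story key from the link instead, so a hand-edited, reused GUID can't collapse distinct stories.

```python
import hashlib

# Hypothetical flag list; feeds whose hand-maintained GUIDs can't be trusted.
PROBLEMATIC_FEEDS = {"http://www.questionablecontent.net/QCRSS.xml"}

def story_key(feed_url, guid, link):
    basis = guid
    if feed_url in PROBLEMATIC_FEEDS:
        basis = link  # fall back to the link for flagged feeds
    return hashlib.sha1(basis.encode()).hexdigest()[:6]

# Same reused GUID, different links: distinct keys for a flagged feed.
k1 = story_key("http://www.questionablecontent.net/QCRSS.xml", "dup-guid", "http://example.com/2535")
k2 = story_key("http://www.questionablecontent.net/QCRSS.xml", "dup-guid", "http://example.com/2536")
print(k1 != k2)  # True
```

The trade-off is that a flagged feed loses update semantics entirely: editing a story's link would produce a "new" story rather than an update.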

Is this issue back? I am noticing a lot of missing posts.

All - Newest First:
2654
2650
2645
2538

Maybe we need to get the feeder devs involved?