Questionable Content Feed missing items

“If you check above in the comments, you’ll notice that the publisher maintains the RSS feed by hand.”

The feed appears to be generated by a Mac application called Feeder. I skimmed the RSS and the GUIDs appeared to be unique, and I doubt the publisher is just making them up; if he were, it would be much easier to use the permalink.

That's what I saw earlier. I haven't seen evidence of the claimed duplicate GUIDs in the feed, but I still get a lot of missing items.

The problem never went away.

I still think the duplicate GUIDs are caused by errors in parsing the feed, or by errors in the feed that break the parsing. The title and GUID sit at opposite ends of the CDATA section. If the end of the CDATA is missed, the parser could glob multiple posts together and associate a different title with the same GUID. I've seen evidence of this in NewsBlur posts from Questionable Content: you can occasionally see posts where different comics from different days appear in the same post.
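
Something like this would be a quick way to test that hypothesis locally. It's just a sketch with a made-up two-item feed (not the real QC feed), using feedparser as a stand-in for whatever parser NewsBlur actually uses:

```python
# Parse a well-formed copy and a copy whose first CDATA terminator is
# missing, and print what the parser recovers in each case.
import feedparser

GOOD = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>test</title>
<item><title>Comic 2625</title><guid>qc-2625</guid>
<description><![CDATA[<img src="2625.png">]]></description></item>
<item><title>Comic 2642</title><guid>qc-2642</guid>
<description><![CDATA[<img src="2642.png">]]></description></item>
</channel></rss>"""

# Same feed with the first "]]>" removed, simulating a missed CDATA end.
BROKEN = GOOD.replace("]]></description></item>", "</description></item>", 1)

for label, source in (("well-formed", GOOD), ("missing ]]>", BROKEN)):
    parsed = feedparser.parse(source)
    print(label, "->", len(parsed.entries), "entries")
    for entry in parsed.entries:
        print("   guid:", entry.get("id"), "| title:", entry.get("title"))
```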

You can see this right now if you scroll down to “Cullinary Aptitude” in the feed: comics 2642 and 2625 are both in the same post.

That's an interesting point. I looked closer, and it seems that Chrome's XML parser is doing something funny with some CDATA sections.

Comic 2652 ends up with the CDATA split:
cellpadding="0"

There are a few others split there as well. Not sure why it would do that; it's just 0x20 (a space).

Definitely no duplicated GUIDs visible now, and I can't see anything else wrong in the RSS, but NewsBlur is still skipping entries.
Samuel, if you see this, could you investigate please, as it appears to be a new issue.

He isn’t talking about the same GUID in the same XML file, but rather the same GUID tied to a different title in files polled on different days.

This happened to me today.

(Screenshot of what NewsBlur is currently showing me for the QC feed)

Today's post “No Stabbing Allowed” has a GUID that isn't matched by anything in yesterday's file (which I saved), and it's still not appearing.

Yep, which suggests it might be what I described: stuff getting split across multiple CDATA sections, with the parser not figuring out where the CDATA ends properly.

I'd love for someone to find NewsBlur's XML parser in the git repository, run the XML through it, and print out what it thinks the tree from that file looks like.

After poking around a bit, I found it. https://github.com/samuelclay/NewsBlu… << I’ll pull it down sometime this weekend and try it.

Yes, that’s what I’ve got. Everyone will see the same thing - the feed is only parsed once, for everyone.

Yeah, I had a look at that earlier. Technically the feed_fetcher is actually using 'feedparser_trunk.py', but the two files are identical. I've run it against the current contents of QCRSS.xml and got a GUID for every entry, and none are duplicates.
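
If anyone wants to reproduce that check against a saved copy of the feed, it's roughly this (assuming the file is saved locally as QCRSS.xml):

```python
import collections
import feedparser

parsed = feedparser.parse("QCRSS.xml")  # a locally saved copy of the QC feed
guids = [entry.get("id") for entry in parsed.entries]

print(len(guids), "entries,", len(set(guids)), "distinct GUIDs")
for guid, count in collections.Counter(guids).items():
    if guid is None or count > 1:
        print("suspect GUID:", repr(guid), "seen", count, "times")
```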

It's actually a bit subtler: it's Mongo that's throwing the error. The stories are keyed on a hash of the GUID, calculated as 'hashlib.sha1(guid).hexdigest()[:6]'. I've checked all of those too - no duplicates there either (for the 353 entries starting Mon 20th August).
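
For anyone repeating the collision check on the hashed keys rather than the raw GUIDs, this is roughly what I ran (a sketch; the .encode() is only there so the same hash runs under Python 3, whereas NewsBlur's code hashes the raw guid string):

```python
import collections
import hashlib
import feedparser

def story_hash(guid):
    # Same scheme as the story key quoted above: first 6 hex digits of SHA-1.
    return hashlib.sha1(guid.encode("utf-8")).hexdigest()[:6]

parsed = feedparser.parse("QCRSS.xml")
hashes = [story_hash(entry.get("id", "")) for entry in parsed.entries]

collisions = {h: n for h, n in collections.Counter(hashes).items() if n > 1}
print(len(hashes), "stories,", len(set(hashes)), "distinct hashes")
print("colliding hashes:", collisions or "none")
```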

My current guess, and it is a complete guess, is that there are older stories in the database than there are in the XML feed. Jeff made a few errors with the feed in the past; it's fine now, but the new feed content is colliding with the old. That's the only thing I can think of.

I wonder what it would take to wipe the feed out and reimport it?

Good news: I went in with a whacking stick and started clearing out debris from the feed fetcher's overly aggressive de-duplicator. What was happening was that because the stories were 98.9% the same in terms of content (they were; only a single digit differed between some of them), NewsBlur decided they were the same story.

I made the de-duplicator check to see if the two stories share at least 75% of their title and are published within a day of each other. That will probably do the trick, although keep an eye on some of your other feeds and let me know if anything looks amiss.
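
In rough terms, the new check boils down to something like this (a sketch only, not the exact code; the similarity measure here is just illustrative):

```python
from datetime import timedelta
from difflib import SequenceMatcher

def is_same_story(existing_title, existing_date, new_title, new_date):
    """Treat two stories as duplicates only if their titles are at least
    75% similar AND they were published within a day of each other."""
    title_similarity = SequenceMatcher(None, existing_title, new_title).ratio()
    published_close_together = abs(existing_date - new_date) <= timedelta(days=1)
    return title_similarity >= 0.75 and published_close_together
```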

This feed will return to normal. Unfortunately there are 7 stories I couldn’t get back into the archive, but future stories should work.


Thanks!

(Aside: surely there are a lot of feeds where only the URL changes by a single digit as counter/date URLs update?)

Thanks a lot! I’ll keep an eye on it and see how well it works.

That would also explain why I was seeing a whole lot of other comics with similar issues.