Questionable Content Feed missing items

“If you check above in the comments, you’ll notice that the publisher maintains the RSS feed by hand.”

The feed appears to be generated by a Mac application called Feeder. I skimmed the RSS and the GUIDs appeared to be unique, and I doubt the publisher is just making them up; if he were, it would be much easier to use the permalink.

That's what I saw earlier. I haven't seen evidence of the claimed duplicate GUIDs in the feed, but I still get a lot of missing items.

The problem never went away.

I still think the duplicate GUIDs are caused by errors in parsing the feed, or by errors in the feed that break the parsing. The title and GUID sit at opposite ends of the CDATA section. If the end of the CDATA is missed, the parser could glob multiple posts together and associate a different title with the same GUID. I've seen evidence of this in NewsBlur posts from Questionable Content: you can occasionally see posts where different comics from different days appear in the same post.
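
Something like this would be a quick way to test that hypothesis locally. It's just a sketch with a made-up two-item feed (not the real QC feed), using feedparser as a stand-in for whatever parser NewsBlur actually uses:

```python
# Parse a well-formed copy and a copy whose first CDATA terminator is
# missing, and print what the parser recovers in each case.
import feedparser

GOOD = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>test</title>
<item><title>Comic 2625</title><guid>qc-2625</guid>
<description><![CDATA[<img src="2625.png">]]></description></item>
<item><title>Comic 2642</title><guid>qc-2642</guid>
<description><![CDATA[<img src="2642.png">]]></description></item>
</channel></rss>"""

# Same feed with the first "]]>" removed, simulating a missed CDATA end.
BROKEN = GOOD.replace("]]></description></item>", "</description></item>", 1)

for label, source in (("well-formed", GOOD), ("missing ]]>", BROKEN)):
    parsed = feedparser.parse(source)
    print(label, "->", len(parsed.entries), "entries")
    for entry in parsed.entries:
        print("   guid:", entry.get("id"), "| title:", entry.get("title"))
```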

You can see this right now if you scroll down to “Cullinary Aptitude” in the feed: comics 2642 and 2625 are both in the same post.

That's an interesting point. I looked closer, and it seems that Chrome's XML parser is doing something funny with some CDATA sections.

Comic 2652 ends up with the CDATA split:
cellpadding="0"

There are a few others split there as well. Not sure why it would do that; it's just 0x20 (a space).

Definitely no duplicated GUIDs visible now, and I can't see anything else wrong in the RSS, but NewsBlur is still skipping entries.
Samuel, if you see this, could you investigate please, as it appears to be a new issue.

He isn’t talking about the same GUID in the same XML file, but rather the same GUID tied to a different title in files polled on different days.

This happened to me today.

(Screenshot of what NewsBlur is currently showing me for the QC feed)

Today's post “No Stabbing Allowed” has a GUID that isn't matched by anything in yesterday's file (which I saved), and it's still not appearing.

Yep, which suggests it might be what I described: stuff getting split across multiple CDATA sections, with the parser not figuring out where the CDATA ends properly.

I'd love for someone to find NewsBlur's XML parser in the git repository, run the XML through it, and print out what it thinks the tree from that file looks like.

After poking around a bit, I found it. https://github.com/samuelclay/NewsBlu… << I’ll pull it down sometime this weekend and try it.

Yes, that’s what I’ve got. Everyone will see the same thing - the feed is only parsed once, for everyone.

Yeah, I had a look at that earlier. Technically the feed_fetcher is actually using 'feedparser_trunk.py', but the two files are identical. I've run it against the current contents of QCRSS.xml and got a GUID for every entry, and none are duplicates.
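
If anyone wants to reproduce that check against a saved copy of the feed, it's roughly this (assuming the file is saved locally as QCRSS.xml):

```python
import collections
import feedparser

parsed = feedparser.parse("QCRSS.xml")  # a locally saved copy of the QC feed
guids = [entry.get("id") for entry in parsed.entries]

print(len(guids), "entries,", len(set(guids)), "distinct GUIDs")
for guid, count in collections.Counter(guids).items():
    if guid is None or count > 1:
        print("suspect GUID:", repr(guid), "seen", count, "times")
```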

It's actually a bit subtler: it's Mongo that's throwing the error. The stories are keyed on a hash of the GUID, calculated as 'hashlib.sha1(guid).hexdigest()[:6]'. I've checked all of those too - no duplicates there either (for the 353 entries starting Mon 20th August).
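
For anyone repeating the collision check on the hashed keys rather than the raw GUIDs, this is roughly what I ran (a sketch; the .encode() is only there so the same hash runs under Python 3, whereas NewsBlur's code hashes the raw guid string):

```python
import collections
import hashlib
import feedparser

def story_hash(guid):
    # Same scheme as the story key quoted above: first 6 hex digits of SHA-1.
    return hashlib.sha1(guid.encode("utf-8")).hexdigest()[:6]

parsed = feedparser.parse("QCRSS.xml")
hashes = [story_hash(entry.get("id", "")) for entry in parsed.entries]

collisions = {h: n for h, n in collections.Counter(hashes).items() if n > 1}
print(len(hashes), "stories,", len(set(hashes)), "distinct hashes")
print("colliding hashes:", collisions or "none")
```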

My current guess, and it is a complete guess, is that there are older stories in the database than there are in the XML feed. Jeff made a few errors with the feed in the past; it's fine now, but the new feed content is colliding with the old. That's the only thing I can think of.

I wonder what it would take to wipe the feed out and reimport it?

Good news: I went in with a whacking stick and started clearing out debris from the feed fetcher's overly aggressive de-duplicator. What was happening was that because the stories were 98.9% the same in terms of content (they were; only a single digit differed between some of them), NewsBlur decided they were the same story.

I made the de-duplicator check to see if the two stories share at least 75% of their title and are published within a day of each other. That will probably do the trick, although keep an eye on some of your other feeds and let me know if anything looks amiss.
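
In rough terms, the new check boils down to something like this (a sketch only, not the exact code; the similarity measure here is just illustrative):

```python
from datetime import timedelta
from difflib import SequenceMatcher

def is_same_story(existing_title, existing_date, new_title, new_date):
    """Treat two stories as duplicates only if their titles are at least
    75% similar AND they were published within a day of each other."""
    title_similarity = SequenceMatcher(None, existing_title, new_title).ratio()
    published_close_together = abs(existing_date - new_date) <= timedelta(days=1)
    return title_similarity >= 0.75 and published_close_together
```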

This feed will return to normal. Unfortunately there are 7 stories I couldn’t get back into the archive, but future stories should work.


Thanks!

(Aside: surely there are a lot of feeds where only the URL changes by a single digit as counter/date URLs update?)

Thanks a lot! I’ll keep an eye on it and see how well it works.

That would also explain why I was seeing a whole lot of other comics with similar issues.