Oh no! There was an error!

I’ve been getting “Oh no! There was an error!” on four out of every five feed refreshes for the last 48 hours. What’s going on?


Let me take a look. What’s your username?

Thanks! lelandpaul

I went through a bunch of feeds, clicking before previous feeds had loaded, going through pages and pages, and everything worked perfectly. What browser/OS are you using? Can you describe anything more about how the error pops up?

Chrome 16.0.912.63 on Mac OS 10.7.2. Trying to load more on the Everything page (feed view) – either when first switching to it or reloading via the hotkey – will load until the progress bar is just over half-way, then give the error.

Potentially related: The iPhone app was refusing to load more stories in the Everything view yesterday, without giving an error. (Latest version/iOS.)

Bingo, taking a look. (Looks like a parse error)

It’s the Game Shelf feed. There’s an emoji character in there that is tripping up the browser’s native JSON parser. Ugh, there’s so little I can do about this. I would have to re-encode the emoji so as to either turn it into an HTML entity (an &#…; numeric reference) or the UTF-8 bytes that it translates to.

Mark it as read to get it out of the Everything feed, but yeah, it’s a problem and I could not figure out a solution for the hour I spent on it.
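The HTML-entity route mentioned above could look something like this. A minimal sketch, assuming a hypothetical helper (`escape_astral` is not NewsBlur’s actual code): replace any character outside the Basic Multilingual Plane with a numeric character reference before serializing.

```python
# Hypothetical helper: replace any non-BMP (astral) character with an
# HTML numeric character reference such as &#x1F468; so it survives
# a strict JSON/HTML pipeline unchanged.
def escape_astral(text):
    return ''.join(
        '&#x{:X};'.format(ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )

print(escape_astral(u'family: \U0001F468\U0001F469'))
# family: &#x1F468;&#x1F469;
```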

Here’s the JavaScript that will throw an error:

var c = '{"t": "\U0001f468\U0001f469\U0001f43a\U0001f468\U0001f468\U0001f469\U0001f473\U0001f43a\U0001f468\U0001f469\U0001f469"}';

> Error: “JSON Parse error: Invalid escape character U”
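The same malformed escape trips any strict JSON parser, not just the browser’s. A quick check with Python’s standard `json` module (used here as a stand-in for the browser parser): JSON only defines `\uXXXX` with exactly four hex digits, so `\U` is rejected as an invalid escape.

```python
import json

# The capital-U escape from the feed, as a raw string so the backslash
# reaches the parser literally:
bad = r'{"t": "\U0001f468"}'

try:
    json.loads(bad)
    print('parsed OK')
except ValueError as err:
    # Strict parsers reject \U: it is not part of the JSON grammar.
    print('parse failed:', err)
```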

Grr, unicode.

Yikes. Thanks for looking into this; sorry it was such a frustrating experience. FWIW I love Newsblur – wouldn’t have bothered to report this if I didn’t!

According to json.org the way to represent Unicode is with a lower-case 'u'. So instead, try

var c = '{"t": "\u0001f468\u0001f469\u0001f43a\u0001f468\u0001f468\u0001f469\u0001f473\u0001f43a\u0001f468\u0001f469\u0001f469"}';

It works for me. I get an error with U and no error with u.


The problem is that I don’t get to choose the capital \U’s. That’s how the feed is saved. I’m not able to go rewriting the Unicode characters just to change a character. Something more fundamental is happening, but I’m not sure what. And another user just found this issue: http://getsatisfaction.com/newsblur/t…

OK, all fixed. Please let me know if this fixes the issue for you. The solution was to temporarily switch my JSON encoder on the backend from cjson to simplejson. Unfortunately, cjson is 10-200x faster than simplejson, but we’re talking such tiny millisecond numbers that it’s not worth it.

Yes, it’s fixed!

I can even tell you what the problem was. Sorry about the “wall of text” 🙂

This has nothing to do with how the feed is saved; it’s about how you take some Unicode text from XML (correctly) and send it to the browser as JSON (buggily).

Observe that the codes involved were all greater than 0xFFFF, like 0x0001F468. The “\u” escape in most languages (including JSON) allows only a 4-hex-digit sequence after it. Languages like Python and C# use “\U” to represent symbols outside of the Basic Multilingual Plane. The BMP includes pretty much all popular languages’ characters, so it’s very rare to see chars like that.

However, json.org only defines “\u”, and only allows 4 hex digits with it, without explaining how to encode the remaining Unicode code points. Most languages supporting Unicode, JavaScript included, use a trade-off that many programmers don’t know about: they actually store the string encoded as UTF-16, so that most codepoints fit into 16 bits. This, however, requires that everything outside the BMP must be escaped using surrogate pairs. In JSON, there is no special escape for this; surrogate pairs are represented “literally”. So the correct way to encode 0x0001F468 is \uD83D\uDC68, as weird as that might seem.
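The surrogate-pair math above can be worked through by hand. This is the standard UTF-16 algorithm (nothing NewsBlur-specific): subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
# Standard UTF-16 surrogate-pair computation for U+1F468:
cp = 0x1F468
v = cp - 0x10000          # 0x0F468, a 20-bit value
hi = 0xD800 + (v >> 10)   # high surrogate from the top 10 bits
lo = 0xDC00 + (v & 0x3FF) # low surrogate from the bottom 10 bits

print(hex(hi), hex(lo))   # 0xd83d 0xdc68
```

Which is exactly why `\uD83D\uDC68` is the correct JSON spelling of U+1F468.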

Some languages support the “\U” shortcut, so that when you say “\U0001F468” they actually put “\uD83D\uDC68” into your string, but that excludes JSON. You can even see this in Python: len(u"\U0001F468") is *two*, because this actually means len(u"\uD83D\uDC68"). Similarly, u"\U0001F468" == u"\uD83D\uDC68" is true.
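You can see a strict parser do the reassembly, too. Using Python’s stdlib `json` module as the strict parser: it accepts the surrogate-pair escape and joins it back into the single astral code point.

```python
import json

# The surrogate-pair escape is valid JSON; the parser reassembles it
# into one code point:
s = json.loads(r'"\ud83d\udc68"')

print(s == '\U0001F468')  # True
print(len(s))             # 1 in Python 3 (2 on a narrow Python 2 build)
```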

cjson, unfortunately, seems to do the opposite of what it should. It takes a string that’s already UTF-16, and then reverses the encoding, taking “\uD83D\uDC68” and outputting it as “\U0001F468”. This is simply wrong; it’s a bug in cjson. Funny how this bug can be fixed by *deleting* some code from it… There’s an “if” clause labelled with “Map UTF-16 surrogate pairs to Unicode \UXXXXXXXX escapes” in its source.

The browsers are right to fail on seeing something that’s not in the standard; “be lenient on input” is a paradigm that’s long outdated; see what mess it got us into with various internet “standards”. Be strict on input, fail if it’s wrong. That’s the way to do things.

So - report the bug to cjson and don’t use it, or fix and recompile it yourself, or check if the input has words in the range 0xD800-0xDFFF and fall back to a non-buggy encoder.
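The “fall back to a non-buggy encoder” option can be sketched with the stdlib `json` module standing in for simplejson (the `safe_dumps` name is made up): with `ensure_ascii=True` it escapes astral characters as a surrogate pair, which every browser’s JSON parser accepts.

```python
import json

# Hypothetical wrapper around a spec-compliant encoder: non-BMP
# characters come out as surrogate-pair \uXXXX escapes, never \U.
def safe_dumps(obj):
    return json.dumps(obj, ensure_ascii=True)

print(safe_dumps({'t': '\U0001F468'}))
# {"t": "\ud83d\udc68"}
```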

P.S. If you ask on StackOverflow.com you’ll probably get similarly detailed answers 🙂


Roman, thanks for that detailed write-up! Since this issue came up in three different support tickets, I decided to investigate thoroughly. I finally learned that in Python, if I indexed into a string holding one of these 8-hex-digit code points, I would get the surrogate-pair code points that the browser could read:

>>> s = u'🎅'
>>> s
u'\U0001f385'
>>> s[0], s[1]
(u'\ud83c', u'\udf85')

This led me to the json decoder. Kind of a shame, since cjson hasn’t been touched in a few years.