Encoded URLs being improperly unencoded

One of the blogs I follow through NewsBlur is the SCOTUSBlog, and recently I’ve had some issues with it: some of its encoded URLs don’t pass through NewsBlur properly.

Here’s an example: there was a story a couple of days ago with the URL http://www.scotusblog.com/2012/04/argument-preview-the-%e2%80%9coutside-salesman%e2%80%9d-exception-to-the-flsa%e2%80%99s-overtime-pay-requirement/ . Note the encoded characters (e.g. %e2%80%9c)… if you copy that URL into the browser, it works. The story is still on the front page of the blog, and you can see that the link works there too, in an “a” tag.

But the link from NewsBlur was http://www.scotusblog.com/2012/04/argument-preview-the-â

Shit. I just fixed this last week. It’s ridiculously complicated, because some URLs are already encoded, some are not, and I have to be able to figure them all out.

I think I’m missing something… There’s only a certain subset of characters allowed in a URL – hence the reason for encoding them. So it seems to me that if you see a character not in that subset, you should %-encode it; otherwise you should leave URLs alone. I don’t think NewsBlur should ever be in the position of decoding encoded characters if the URL in the feed already contains them.

At least that’s how it seems to me.
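To make that rule concrete, here is a minimal Python 3 sketch of "encode only what isn't allowed, leave everything else alone". The function name `encode_unsafe_only` and the exact safe-character list are my own assumptions for illustration, not NewsBlur's code. It also assumes that any raw non-ASCII characters should be encoded as UTF-8, which is a choice the feed itself may not guarantee.

```python
from urllib.parse import quote

# Assumption: characters allowed to appear literally in a URL
# (reserved + unreserved), plus "%" so existing escapes are left untouched.
_SAFE = "%/:=&?~#+!$,;'@()*[]"


def encode_unsafe_only(url: str) -> str:
    """Percent-encode only characters that are not allowed in a URL.

    Hypothetical helper: existing %XX escapes are never decoded or
    re-encoded; raw characters outside the safe set are encoded as UTF-8.
    """
    return quote(url, safe=_SAFE)


already_encoded = (
    "http://www.scotusblog.com/2012/04/argument-preview-the-"
    "%e2%80%9coutside-salesman%e2%80%9d-exception-to-the-"
    "flsa%e2%80%99s-overtime-pay-requirement/"
)
# An already-encoded URL passes through unchanged.
print(encode_unsafe_only(already_encoded) == already_encoded)
```

Because "%" is treated as safe, running the helper twice gives the same result as running it once. The trade-off is that a stray "%" that is not part of a valid escape is also left alone.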

By decoding them you are making assumptions about character sets, which the URL spec says is a no-no:

Where the local naming scheme uses octet values which are not allowed in the URL, these shall be represented in the URL by a percent sign “%” followed by two hexadecimal digits (0-9, A-F) giving the value for that octet. This specification makes no assumptions or requirements about the character sets, if any, referred to be the (decoded) octets a URL. Character codes other than those allowed by the syntax shall not be used unencoded in a URL.
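A short Python 3 sketch of why that matters: the escapes only give you octets, and turning those octets into characters requires picking a character set. The two charsets below are my own picks for illustration; I don't know which one (if any) NewsBlur assumed.

```python
from urllib.parse import unquote_to_bytes

# Escapes taken from the SCOTUSblog URL above.
escaped = "%e2%80%9coutside-salesman%e2%80%9d"

# Undoing the percent-encoding yields raw octets, not characters.
raw = unquote_to_bytes(escaped)

# Interpreting the octets requires guessing a character set.
as_utf8 = raw.decode("utf-8")      # curly quotes around the phrase
as_latin1 = raw.decode("latin-1")  # "â" followed by two control characters

print(repr(as_utf8))
print(repr(as_latin1))
```

The Latin-1 guess produces a string that begins with "â", which at least resembles the broken link I saw, though that is only a guess at the cause.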

But… as I said, I’m sure I’m missing something, so I’m curious as to what. :)

This is now fixed, but be on the lookout for other story permalinks that are broken. Argh, broken encoding (both URLs and text/Unicode) causes me no end of headaches.

Awesome! It was even retroactive – that entry I highlighted above works now! Thank you! You’ve not yet made me regret giving you money. :)

Glad to hear that. It was because I was using urllib.urlquote() in the view controller, so the permalink was stored correctly in the DB all along; it was just being massaged before being sent to the client. I’m too embarrassed to link to the GitHub commit.
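For anyone curious about the general hazard, here is a rough Python 3 illustration of what quoting an already-encoded permalink on the way out can do. It uses `urllib.parse.quote` as a stand-in, and the example URL is made up, so treat it as a sketch of the class of bug rather than a reproduction of the actual view code.

```python
from urllib.parse import quote

# Hypothetical permalink, stored correctly (already percent-encoded).
stored = "http://example.com/the-%e2%80%9coutside/"

# Quoting it again with the default safe set escapes the "%" signs
# (and the ":" after the scheme), so the link no longer matches the original.
requoted = quote(stored)

print(requoted)
# -> http%3A//example.com/the-%25e2%2580%259coutside/
```

The stored value is untouched in this scenario, which is consistent with the fix being retroactive: once the extra massaging step is removed, old stories get their original permalinks back.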