Lots of HTML crap in LiveJournal feeds

http://maxkatz.livejournal.com
http://zyalt.livejournal.com/

Under each article I see lots of raw HTML code, tags and all.

Looks like the feeds are messed up. That’s the data NewsBlur’s getting.

Excellent! Of course NewsBlur is getting it! :slight_smile:

What I am asking is to find the exact spot in the XML parsing where the crap becomes visible.

And filter it! (Oh please, please, be nice and filter it, I beg you :slight_smile:)

Look, all the LiveJournal crap is contained between these tags (I hope Get Satisfaction doesn't mess it up):


<!-- TAGS MESSY TAGS -->

I believe the cause is that stupid HTML comment tag LiveJournal puts in.

Here is a full feed item as an example (I bet Get Satisfaction can deal with it):

```
<item>
  <guid ispermalink="true">http://dolboeb.livejournal.com/2536894.html</guid>
  <pubdate>Thu, 04 Jul 2013 07:00:19 GMT</pubdate>
  <title>Quarantasette</title>
  <link>http://dolboeb.livejournal.com/2536894.html</link>
  <description>
Доброе утро, дорогие мои читатели.<br>Привет вам из дождливой Венеции, от её мёртвых дожей и вечно живых комаров. Так выглядит балкон, на котором я встретил первый день 48-го года своей жизни:<br><img src="http://ic.pics.livejournal.com/dolboeb/53631/629048/629048_original.jpg" border="0" vspace="5" width="900" height="565" alt="Вид с балкона на гостиницу Ruzzini"><br>Спасибо всем, кто поздравил, и тем, кто только собирается.<br>Жизнь прекрасна и удивительна, имею я вам доложить с утра пораньше.<br>Даже если что-то в жизни складывается не так, всегда стоит помнить о лучшем, что у нас есть: о наших детях, родителях, друзьях, любимых, о тех, с кем посчастливилось оказаться рядом.<br>Ура, товарищи.<br>И спасибо ещё раз.<br><div class="lj-like"><!-- <div class="lj-like-item lj-like-item-repost"> <a href="?url=http://dolboeb.livejournal.com/2536894.html" data-url="http://dolboeb.livejournal.com/2536894.html" >repost</a> </div> <div class="lj-like-item lj-like-item-facebook"> <fb:like href="http://dolboeb.livejournal.com/2536894.html" send="false" layout="button_count" width="100" show_faces="false" font="" action="recommend"> </fb:like> </div> <div class="lj-like-item lj-like-item-twitter"> <a href="http://twitter.com/share" class="twitter-share-button" data-url="http://dolboeb.livejournal.com/2536894.html" data-text="Quarantasette" data-count="horizontal" data-lang="ru" data-hashtags="">Tweet</a> </div> <div class="lj-like-item lj-like-item-google"> <g:plusone size="medium" href="http://dolboeb.livejournal.com/2536894.html"> </g:plusone> </div> <div class="lj-like-item lj-like-item-tumblr"> <a class="tumblr-share-button js-lj-share-entry" target="_blank" data-service="tumblr"
data-text="%D0%94%D0%BE%D0%B1%D1%80%D0%BE%D0%B5%20%D1%83%D1%82%D1%80%D0%BE,%20%D0%B4%D0%BE%D1%80%D0%BE%D0%B3%D0%B8%D0%B5%20%D0%BC%D0%BE%D0%B8%20%D1%87%D0%B8%D1%82%D0%B0%D1%82%D0%B5%D0%BB%D0%B8.%20%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%20%D0%B2%D0%B0%D0%BC%20%D0%B8%D0%B7%20%D0%B4%D0%BE%D0%B6%D0%B4%D0%BB%D0%B8%D0%B2%D0%BE%D0%B9%20%D0%92%D0%B5%D0%BD%D0%B5%D1%86%D0%B8%D0%B8,%20%D0%BE%D1%82%20%D0%B5%D1%91%20%D0%BC%D1%91%D1%80%D1%82%D0%B2%D1%8B%D1%85%20%D0%B4%D0%BE%D0%B6%D0%B5%D0%B9%20%D0%B8%20%D0%B2%D0%B5%D1%87%D0%BD%D0%BE%20%D0%B6%D0%B8%D0%B2%D1%8B%D1%85%20%D0%BA%D0%BE%D0%BC%D0%B0%D1%80%D0%BE%D0%B2.%20%D0%A2%D0%B0%D0%BA%20%D0%B2%D1%8B%D0%B3%D0%BB%D1%8F%D0%B4%D0%B8%D1%82%20%D0%B1%D0%B0%D0%BB%D0%BA%D0%BE%D0%BD,%20%D0%BD%D0%B0%20%D0%BA%D0%BE%D1%82%D0%BE%D1%80%D0%BE%D0%BC%20%D1%8F%20%D0%B2%D1%81%D1%82%D1%80%D0%B5%D1%82%D0%B8%D0%BB%20%D0%BF%D0%B5%D1%80%D0%B2%D1%8B%D0%B9%20%D0%B4%D0%B5%D0%BD%D1%8C%2048-%D0%B3%D0%BE%20%D0%B3%D0%BE%D0%B4%D0%B0%20%D1%81%D0%B2%D0%BE%D0%B5%D0%B9%20%D0%B6%D0%B8%D0%B7%D0%BD%D0%B8%3A%20%20%20%D0%A1%D0%BF%D0%B0%D1%81%D0%B8%D0%B1%D0%BE%20%D0%B2%D1%81%D0%B5%D0%BC,%20%D0%BA%D1%82%D0%BE%20%D0%BF%D0%BE%D0%B7%D0%B4%D1%80%D0%B0%D0%B2%D0%B8%D0%BB,%20%D0%B8%20%D1%82%D0%B5%D0%BC,%20%D0%BA%D1%82%D0%BE%20%D1%82%D0%BE%D0%BB%D1%8C%D0%BA%D0%BE%20%D1%81%D0%BE%D0%B1%D0%B8%D1%80%D0%B0%D0%B5%D1%82%D1%81%D1%8F.%E2%80%A6" data-url="http%3A%2F%2Fdolboeb.livejournal.com%2F2536894.html" data-title="Quarantasette" href="http://tumblr.com/share/link" style="display:inline-block; overflow:hidden; width:62px; height:20px; background:url('http://platform.tumblr.com/v1/share_2.png') top left no-repeat transparent;" ></a> </div> --></div>
  </description>
  <comments>http://dolboeb.livejournal.com/2536894.html</comments>
  <category>жизнь</category>
  <category>календарь</category>
  <music>הצעדה — הבילויים</music>
  <title type="plain">הצעדה — הבילויים</title>
  <security>public</security>
  <reply-count>51</reply-count>
</item>
```
 
So what I am asking is to remove the code between these tags when you parse the LiveJournal feeds.

I hope one little condition and one little regular expression can help.
Please.

Thank you in advance. :)
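
To show what I mean by "one little condition and one little regular expression", here is a rough sketch in Python. It is only an illustration: `strip_lj_like` and `LJ_LIKE_RE` are made-up names, not anything from NewsBlur's code. It assumes the entry body is already unescaped HTML and that the junk always sits in a commented-out block inside `<div class="lj-like">`, as in the item above.

```python
import re

# Matches the lj-like wrapper div when its only content is an HTML comment.
# DOTALL lets the comment body span several lines.
LJ_LIKE_RE = re.compile(
    r'<div class="lj-like">\s*<!--.*?-->\s*</div>',
    re.DOTALL,
)


def strip_lj_like(html: str, feed_url: str) -> str:
    """Drop LiveJournal's commented-out share widgets from an entry body.

    Both the function name and the feed_url check are hypothetical; a real
    reader would hook this in wherever it cleans up entry content.
    """
    # The "one little condition": leave other feeds alone.
    if "livejournal.com" not in feed_url:
        return html
    # The "one little regular expression": cut out the whole wrapper.
    return LJ_LIKE_RE.sub("", html)


if __name__ == "__main__":
    body = (
        'Hello<br><div class="lj-like"><!-- '
        '<div class="lj-like-item">repost</div> --></div> bye'
    )
    print(strip_lj_like(body, "http://dolboeb.livejournal.com/data/rss"))
```

A regular expression is fragile against HTML in general, so this only targets the one wrapper seen in the example and leaves everything else in the entry untouched.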

NewsBlur is the only RSS reader I know of that has problems with LiveJournal. Even if LJ provides incorrect RSS, can NewsBlur adapt?
Thanks.