RSS + Forums… pt2!

Well fuck, we’re doing a part 2. For those not familiar, I have a now-drafted (aka unpublished) article that went into detail about my forum-based RSS news reader.

The project has some legal issues, which I’ve thankfully not been pulled up on yet, but I figured fuck it, I’ll deal with that when I get to it.

Want to check the spoils of my labour and read news articles in the comfort of a forum? Well you can’t. It’s only for me at the moment. But you can read about it after the break!


Background

Some background here: when I was at PacMags, I started getting into forex trading. I personally like trading based on news and events, as I find it a lot more reliable than technical analysis. Unfortunately it requires digesting a large number of news articles, and I only have so much time in my life to be reading the news.

News apps fail me, with trash articles taking such a large precedence in the modern moron’s life. So yeah, thanks for ruining the internet, fucktards.

To remedy this I started creating my own news reader. The goal is to have it store all articles from RSS news feeds, visit the links and grab each article’s main text.

This has two benefits: I can introduce filters to keep moronic articles out of my life (FINALLY), and I can easily remove all those stupid video-only “articles” that are taking over modern journalism.

The Setup

Well this is a little difficult to post about from an iPad.

Basically I have a server running a Python script in cron. Mr Python Script loops over a bunch of RSS feeds, compiles a forum post for each article, connects to MySQL and inserts the article across two databases (plus some custom columns).
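
The feed-to-post half of that loop can be sketched roughly like this. It is a minimal sketch, not the actual script: it uses Python 3's standard library, an inline sample feed instead of a real URL, and made-up names (SAMPLE_FEED, BLOCKED_WORDS, parse_items, is_wanted, compile_post). The MySQL insert is left out because the table layout isn't described here.

```python
import xml.etree.ElementTree as ET

# Inline sample so the sketch runs without touching the network.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Markets</title>
<item>
  <title>Rates on hold</title>
  <link>http://example.com/rates</link>
  <description>Central bank holds rates.</description>
</item>
<item>
  <title>Video: you won't believe this</title>
  <link>http://example.com/video</link>
  <description></description>
</item>
</channel></rss>"""

# Hypothetical filter list; anything matching is kept out of the forum.
BLOCKED_WORDS = ("video:",)


def parse_items(feed_xml):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        }
        for item in root.iter("item")
    ]


def is_wanted(item):
    """Reject items whose title contains a blocked word."""
    title = item["title"].lower()
    return not any(word in title for word in BLOCKED_WORDS)


def compile_post(item, channel):
    """Build the fields a forum post would need (assumed shape)."""
    body = "%s\n\nSource: %s" % (item["summary"], item["link"])
    return {"channel": channel, "title": item["title"], "content": body}
```

Something like `[compile_post(i, "Forex") for i in parse_items(SAMPLE_FEED) if is_wanted(i)]` would then give the posts ready to insert, with the video-only item dropped.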

esoTalk then sits on my web server. It runs in a very vanilla fashion, but it displays all these articles in their respective channels (fancy word for categories).

How it works

To be written… when it’s finished.

Challenges

I wanted to document the challenges here in case it helps other like-minded idiots who are also trying to re-invent a certain circular object.

Problem Resolution
Problem: BeautifulSoup won’t work against a BeautifulSoup variable.
Resolution: I haven’t resolved this and probably won’t be bothered to until the end of the project. Basically, I can’t load a link, use BeautifulSoup to pull the main <article>, then use BeautifulSoup to pull the <p>aragraphs from the <article>.

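For what it's worth, in bs4 the object returned by find() is a Tag, and a Tag can be searched directly with find_all(), so it shouldn't need to go back through the BeautifulSoup() constructor. A minimal sketch, assuming bs4 is installed and that re-parsing the <article> element was the cause (a guess; the notes above don't say what the error was). SAMPLE_HTML and extract_paragraphs are made-up names.

```python
from bs4 import BeautifulSoup

# Inline sample page; the <nav> paragraph is junk we want to skip.
SAMPLE_HTML = """
<html><body>
  <nav><p>Menu junk</p></nav>
  <article>
    <h1>Rates on hold</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </article>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every <p> inside the first <article>."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")  # a Tag, or None if the page has none
    if article is None:
        return []
    # Search the Tag directly instead of re-parsing it.
    return [p.get_text(strip=True) for p in article.find_all("p")]
```

Pages without an <article> element return an empty list here, so the caller can decide whether to fall back to something else.
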
Problem: When returning the number of rows from an SQL query and checking it’s >0, Python sooks about unread data in the cursor.
Resolution: Load your cursor like this: cursor = cnx.cursor(buffered=True). I couldn’t be bothered learning why this resolved it.

Problem: urllib2 won’t handle 301 return codes.. or any other code I don’t think (haven’t tested).
Resolution: This is a tricky one. Some news places are publishing their 2LDs then running a 301 to redirect to a subdomain (i.e. jmb.id.au 301s to www.jmb.id.au, which is cumbersome and very old school). Like seriously, who the fuck is not publishing their pages on their 2LDs? Fucking plebs man. The resolution is just as cumbersome and requires its own article.

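As a point of comparison, Python 3's urllib.request follows 301s by default and reports the final address through geturl(). The sketch below shows that against a throwaway local server so it runs offline. It is not the urllib2 workaround referred to above; RedirectingHandler, resolve_url and demo are made-up names, and the proxy-less opener is only there so the loopback request isn't sent through a system proxy.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for a news site that 301s /old to /new."""

    def do_GET(self):
        if self.path == "/old":
            target = "http://127.0.0.1:%d/new" % self.server.server_port
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            body = b"<article><p>hello</p></article>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def resolve_url(url, opener=None, timeout=10):
    """Follow any redirects and return (final_url, status, body)."""
    opener = opener or urllib.request.build_opener()
    with opener.open(url, timeout=timeout) as resp:
        return resp.geturl(), resp.status, resp.read()


def demo():
    """Start the fake site on a free port and fetch the redirecting URL."""
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Empty ProxyHandler: ignore any proxy settings for this local request.
    no_proxy = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        return base, resolve_url(base + "/old", opener=no_proxy)
    finally:
        server.shutdown()
        server.server_close()
```

The final URL that comes back is the sort of value that could be stored and matched on, rather than the address the feed originally published.
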
Problem: For testing I have been working off two feeds. This reduces the inconsistency between sites, which is fantastic. Clearly you don’t want to download the same article from the same source constantly, and news websites typically use new titles for different articles. Unfortunately I received a 404 on an article and decided to check whether the article actually did exist and maybe this source had just changed the URL but didn’t set up a 301. This led me to realize that some journalists are selling their articles to multiple agencies, or a single agency reports on multiple portals.
Resolution: I’ll need to create a new column in the conversations table to match on. If I use the URL after following 301s, I’ll be able to match based on the true source instead of just the title, which will allow me to download double-ups from multiple sources. Unfortunately I will need to find a way to compare articles when they’re found and liken them to other articles. Anything with a likeness of over 90% should be posted as a reply in the original article’s conversation. This means I’ll be able to bundle multiple sources under the one article, condensing discussion to a single thread.
Going forward, I’ll be able to adapt such a process with the new URL column for article re-checking, allowing me to check for revisions on a thread. I might set up another new column that controls the refresh rate. I’ll use a cron-based method of declaring how often links should be re-checked for changes and have a script to handle that. The scripts will constantly re-check posts for changes, and maybe have another “dead” column I can use to prevent rechecks of dead articles… I don’t know. I need a real resolution here but it’s time to go have dinner!
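
One way the "likeness of over 90%" check could look, as a rough sketch only. It assumes the standard library's difflib.SequenceMatcher is good enough as the comparison, which is a choice made for this example rather than anything settled above; likeness, find_original and the dict of known articles are made-up names.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.90  # the "over 90%" likeness mentioned above


def normalise(text):
    """Lower-case and collapse whitespace so layout differences don't count."""
    return " ".join(text.lower().split())


def likeness(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two article texts."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def find_original(new_text, known_articles, threshold=SIMILARITY_THRESHOLD):
    """Return the id of the best match above the threshold, else None.

    known_articles is assumed to map a conversation id to article text.
    """
    best_id, best_score = None, threshold
    for article_id, text in known_articles.items():
        score = likeness(new_text, text)
        if score > best_score:
            best_id, best_score = article_id, score
    return best_id
```

If find_original returns an id, the new article would go in as a reply on that conversation; if it returns None, it starts a new one. SequenceMatcher gets slow on long texts, so comparing against every stored article probably won't scale without narrowing the candidates first (by date, say).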

Other notes in the project

  • I’m writing this stub whilst staying in the city for the weekend. My family have paid for a hotel room for me and it has a balcony (thanks Meriton). Unfortunately for me and my acrophobia, we’re on the 31st floor, and the entire time I’ve written this I’ve stared out of the balcony and sweated over how high up I am. I saw a helicopter buzz past and even though he was super high up, it really reinforced exactly how high up I am and I froze for a good five minutes. Not to mention we have a glass barrier for the side of the balcony, so this entire time I’ve just been expecting to see a body fly past, for the balcony to break off, for the building to fall over, all those irrational and unnecessary thoughts your fears bring to your mind and enforce as reality. How the fuck do people live up this high? Fuckin’ daredevils.
  • For what is an extremely simple project in theory, it’s turning into a bit of a debacle. Chrome really needs to introduce a developer mode; however, using curl in Termius has proven significantly more beneficial, as I have been able to debug return code issues faster. That still doesn’t take away from the fact that I’m battling with 301s, inconsistent formatting, people refusing to get with standards, fucking paywalls (like.. seriously?) and the dwindling number of people publishing RSS feeds.
  • As a general musing, I should really adapt the site to publish to a Facebook page except it should link directly to the news article. But how would I select what’s published?
  • Seriously, I’m super fucking high up. Knees weak, palms sweaty. Mum’s spaghetti.