Getting RSS feeds for news websites that don’t provide them

This is about a fun little project I wrote a few months ago. It is completely unrelated to the things I usually write about, but I thought it might be useful for others too.

When I moved to Greece last year, I found that few news websites offered local news in English and also had an RSS feed. Following local news alongside global news seems like a good way to know what is happening around you.
The only useful website I found was Ekathimerini. Two others seemed worth following, The Press Project (a crowd-funded project) and To Vima, but neither has an RSS feed (or at least not for its English version). Of course the real solution to this problem is to learn Greek, which I’m doing now, but it’s going to take a while until I can understand news articles without too much effort.

So what did I do? I wrote a small web service that parses the HTML of those websites and returns an RSS feed based on it. It also updates regularly in the background and keeps some history of items. You can find it here: html-rss-proxy. The resulting RSS feeds seem to work very well in Liferea and Newsblur at least.
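To make the idea more concrete, here is a minimal sketch of the two non-scraping parts: merging freshly scraped items into a bounded history, and rendering that history as RSS 2.0. It uses only `base` and builds the XML by hand. The `Item` type and the functions `mergeHistory`, `renderItem` and `renderFeed` are names made up for this illustration and are not taken from html-rss-proxy.

```haskell
import Data.Function (on)
import Data.List (nubBy)

-- | One article as extracted from a site (hypothetical type).
data Item = Item
  { itemTitle :: String
  , itemLink  :: String
  , itemDate  :: String  -- assumed to be an RFC 822 date string already
  } deriving (Eq, Show)

-- | Escape the characters that are special in XML text.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '"' = "&quot;"
    esc c   = [c]

-- | Put fresh items in front of the stored history, drop duplicates
-- by link (the first occurrence wins) and keep at most n items.
mergeHistory :: Int -> [Item] -> [Item] -> [Item]
mergeHistory n fresh old =
  take n (nubBy ((==) `on` itemLink) (fresh ++ old))

-- | Render a single <item> element, using the link as the guid.
renderItem :: Item -> String
renderItem i = concat
  [ "<item>"
  , "<title>", escapeXml (itemTitle i), "</title>"
  , "<link>", escapeXml (itemLink i), "</link>"
  , "<guid>", escapeXml (itemLink i), "</guid>"
  , "<pubDate>", escapeXml (itemDate i), "</pubDate>"
  , "</item>"
  ]

-- | Render a complete RSS 2.0 document for a channel title and link.
renderFeed :: String -> String -> [Item] -> String
renderFeed title link items = concat
  [ "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
  , "<rss version=\"2.0\"><channel>"
  , "<title>", escapeXml title, "</title>"
  , "<link>", escapeXml link, "</link>"
  , "<description>", escapeXml title, "</description>"
  , concatMap renderItem items
  , "</channel></rss>"
  ]
```

A background thread would call something like `mergeHistory 100 scraped stored` after each fetch, and the web handler would only ever call `renderFeed` on the stored list, so a slow or broken upstream site never blocks a feed reader.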

Since then it has also been extended with another news website, and in general it’s rather simple to add new ones to the code. You just need to figure out how to extract the relevant information from the HTML of a website, convert that into code like here, and wrap it up in the general interface that the other parts of the code expect, like here.
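As a rough illustration of what such a per-site interface can look like, here is a sketch in which each site is a record holding a name, a URL and a pure parsing function. Everything in it is hypothetical: the `Site` record, `exampleSite`, `lookupSite` and the `example.com` URLs are invented for this example, and `anchors` is a deliberately naive string scanner standing in for a real HTML parser such as html-conduit.

```haskell
import Data.List (find, isPrefixOf)

-- | A scraped article (hypothetical, reduced to title and link).
data Item = Item
  { itemTitle :: String
  , itemLink  :: String
  } deriving (Eq, Show)

-- | Everything the rest of the service needs to know about one site.
data Site = Site
  { siteName  :: String            -- selects the feed, e.g. in the URL path
  , siteUrl   :: String            -- page to fetch
  , siteParse :: String -> [Item]  -- pure: page body to items
  }

-- | Return the text following the first occurrence of the needle.
breakOn :: String -> String -> Maybe String
breakOn needle = go
  where
    go [] = Nothing
    go str@(_ : xs)
      | needle `isPrefixOf` str = Just (drop (length needle) str)
      | otherwise               = go xs

-- | Naive stand-in for an HTML parser: collect (href, text) pairs
-- of every <a href="...">text</a> in the input.
anchors :: String -> [(String, String)]
anchors s = case breakOn "<a href=\"" s of
  Nothing   -> []
  Just rest ->
    let (href, afterHref) = break (== '"') rest
        afterTag          = drop 1 (dropWhile (/= '>') afterHref)
        (txt, remaining)  = break (== '<') afterTag
    in (href, txt) : anchors remaining

-- | A made-up site whose articles are links starting with /news/.
exampleSite :: Site
exampleSite = Site
  { siteName  = "example"
  , siteUrl   = "http://example.com/news"
  , siteParse = \body ->
      [ Item t ("http://example.com" ++ h)
      | (h, t) <- anchors body
      , "/news/" `isPrefixOf` h
      ]
  }

-- | Adding a site means adding one entry to this list.
sites :: [Site]
sites = [exampleSite]

lookupSite :: String -> Maybe Site
lookupSite name = find ((== name) . siteName) sites
```

Keeping `siteParse` pure means the fetching, scheduling and history code never needs to know anything about a particular website, and a parser can be tried out on a saved copy of a page without any network access.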

If you add some website yourself, feel free to send me a pull request and I’ll merge it!

Technical bits

On the technical side, this seems to be one of the most stable pieces of software I have ever written. It has never crashed or otherwise failed since I started running it, and fortunately I haven’t had to update the HTML parsing code yet because of website changes.

It’s written in Haskell, using the Scotty web framework, the Cereal serialization library for storing the history of past articles, http-conduit for fetching the websites, and html-conduit for parsing the HTML. Overall it was a very pleasant experience, thanks to the language being very convenient to write and preventing most silly mistakes at compile time, and thanks to the high quality of the libraries.

The only part I’m not yet too happy about is the actual HTML parsing, which seems too verbose and redundant. I might port it to another library at a later time, maybe xml-html-conduit-lens.

Update

After saying that I don’t like the HTML parsing, I have now reimplemented it around xml-html-conduit-lens. The result is much shorter code that resembles the structure of the HTML, as you can see here for example.

Considering that people always say that lens is so complicated, and that this is more than simple getters, I have to say it went rather painlessly. Only the compiler errors you get when the types don’t match are a bit tricky to understand at first.

New job (and company)

I have a new job! After working for Collabora for 5+ years, I decided that it’s time for a change and quit my contract there in early September. These have been great years, working with so many brilliant people on Free Software, but it’s time to move on!

So for the future, it’s actually more than just a new job. Together with fellow GStreamer hackers Tim-Philipp Müller and Jan Schmidt, we founded a new company: Centricular Ltd. We are going to provide consultancy services around Free Software, initially (but not exclusively) with a focus on GStreamer, Multimedia and Graphics. Technically, not much will change in the kind of work I’m doing compared to the past, but this time we will handle everything ourselves so we can better serve the needs of our customers and be more flexible. Hopefully this also provides an even better work environment within a group of equals and gives us more time to contribute to GStreamer and other Free Software projects.

Check our website for a list of Open Source technologies we cover and services we will offer. Obviously this list is not complete and we will try to broaden it over time, so if you need someone for anything interesting that is not listed there, just ask.

We will also be at the GStreamer Conference in Edinburgh next week.

Great times ahead! 🙂

New Blog

Soo… after about 10 years I’m having a blog again. I hope I can keep it updated a bit more often than in the past 🙂

Here I’m planning to write about various topics that seem worthwhile, including Free Software, coding, work-related things, food, life in general or whatever else comes to my mind.