This is about a fun little project I wrote a few months ago, completely unrelated to the things I usually write about, but I thought it might be useful for others too.
When I moved to Greece last year, I ran into the problem that there are not many news websites that provide local news in English and also have an RSS feed. And having local news, next to the global news about what happens all over the world, seems like a good idea so you know what is happening close around you.
The only useful website I found was Ekathimerini. There were two others that seemed worth having, The Press Project (a crowd-funded project) and To Vima, but neither has an RSS feed (at least not for their English versions). Of course the real solution to this problem is to learn Greek, which I’m doing now, but it’s going to take a while until I can understand news articles without too much effort.
So what did I do? I wrote a small web service that parses the HTML of those websites and generates an RSS feed from it, while updating regularly in the background and keeping some history of items. You can find it here: html-rss-proxy. The resulting RSS feeds seem to work very well in Liferea and Newsblur at least.
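Just to illustrate the background updating and the item history (this is not the actual code; the Item type, fetchCurrentItems and the refresh interval are made up for the example), such a refresh loop could look roughly like this, with the history kept in an MVar:

```haskell
module FeedUpdater where

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, modifyMVar_)
import Control.Monad (forever, void)
import Data.List (nubBy)
import Data.Text (Text)

-- Hypothetical item type, only for illustration.
data Item = Item { itemLink :: Text, itemTitle :: Text }

-- Placeholder for "download the page and scrape it".
fetchCurrentItems :: IO [Item]
fetchCurrentItems = pure []

-- Refresh every 30 minutes and merge the new items into the history,
-- so articles stay in the feed after they drop off the front page.
startUpdater :: MVar [Item] -> IO ()
startUpdater history = void . forkIO . forever $ do
  new <- fetchCurrentItems
  modifyMVar_ history $ \old ->
    pure (take 200 (nubBy (\a b -> itemLink a == itemLink b) (new ++ old)))
  threadDelay (30 * 60 * 1000000)  -- microseconds
```

The real service additionally serializes that history to disk, so old items survive restarts.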
Since then it has also been extended with another news website, and generally it’s rather simple to add new ones to the code. You just need to figure out how to extract the relevant information from the HTML of the website, convert that into code like here, and wrap it up in the general interface that the other parts of the code expect, like here.
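To give an idea of what such an extractor looks like (this is not the code from the repository; the Item type, the extractItems name and the article/link selectors are invented for the example), scraping titles and links with html-conduit comes down to something like this:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module ExampleScraper where

import           Control.Monad        ((>=>))
import qualified Data.ByteString.Lazy as BL
import           Data.Text            (Text)
import qualified Data.Text            as T
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor

-- Hypothetical item type; the real project has its own interface
-- that per-site extractors have to fit into.
data Item = Item
  { itemTitle :: Text
  , itemLink  :: Text
  } deriving (Show)

-- Assume the articles are <article class="news-item"> elements that each
-- wrap an <a> with the title and link; for a real site you have to look
-- at its markup and pick the right elements and attributes.
extractItems :: BL.ByteString -> [Item]
extractItems page =
  let cursor   = fromDocument (parseLBS page)
      articles = cursor $// element "article" >=> attributeIs "class" "news-item"
      links    = concatMap ($// element "a") articles
  in  [ Item (T.strip (T.concat (a $// content))) href
      | a    <- links
      , href <- attribute "href" a
      ]
```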
If you add some website yourself, feel free to send me a pull request and I’ll merge it!
Technical bits
On the technical side, this seems to be one of the most stable pieces of software I’ve ever written. It has never crashed or otherwise failed since I started running it, and fortunately I also haven’t had to update the HTML parsing code yet because of website changes.
It’s written in Haskell, using the Scotty web framework, the Cereal serialization library for storing the history of past articles, http-conduit for fetching the websites, and html-conduit for parsing the HTML. Overall it was a very pleasant experience, thanks to the language being very convenient to write and preventing most silly mistakes at compile time, and to the high quality of the libraries.
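As a sketch of how the Scotty and http-conduit parts fit together (not the actual code: the port, the route, the URL and the htmlToRss stub are made up, and the real service updates feeds in the background rather than fetching on every request), the serving side looks roughly like this:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Lazy   as BL
import qualified Data.Text.Lazy         as TL
import           Network.HTTP.Simple    (getResponseBody, httpLBS, parseRequest)
import           Web.Scotty

-- Hypothetical stand-in for the real HTML-to-RSS conversion.
htmlToRss :: BL.ByteString -> TL.Text
htmlToRss _page =
  "<rss version=\"2.0\"><channel><title>stub</title></channel></rss>"

main :: IO ()
main = scotty 3000 $
  get "/feed/example" $ do
    -- Fetch the news page on demand; the real service refreshes it
    -- regularly in the background instead.
    req  <- liftIO (parseRequest "https://www.example.com/news")
    page <- getResponseBody <$> httpLBS req
    setHeader "Content-Type" "application/rss+xml; charset=utf-8"
    text (htmlToRss page)
```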
The only part I’m not too happy about yet is the actual HTML parsing; it seems too verbose and redundant. I might port it to another library at a later time, maybe xml-html-conduit-lens.
Update
After saying that I didn’t like the HTML parsing, I have now actually reimplemented it around xml-html-conduit-lens. The result is much shorter code that resembles the structure of the HTML, as you can see here for example.
Considering that people always say lens is so complicated, and this is more than simple getters, I have to say it went rather painlessly. Only the compiler errors when the types don’t match are a bit tricky to understand at first.
Cool that you’re using Haskell for this. Seems like a great fit.
About parsing and scraping HTML, I don’t know how you feel about XSLT/XPath but that seemed the most natural approach to me when I started working on tweeper[1,2].
Ciao,
Antonio
[1] https://git.ao2.it/tweeper.git/
[2] https://ao2.it/en/tags/xslt
With xml-html-conduit-lens it would basically come down to something like that, but with (IMHO) a saner syntax, since it’s integrated into the language and you get lots of sanity checks at compile time. XSLT generally seems a bit verbose for what I’m trying to do.
I’ve updated the parsing now; you might be interested in taking a short look at those changes. That’s what I was thinking of before (see also the “Update” paragraph in the text above) 🙂