I consume a lot of content via Miniflux, an RSS/Atom feed reader. Many news sites and services provide some content via this method, but it is often incomplete: news sites want you to visit their pages to generate advertising dollars, so the feed frequently carries only a snippet.
Miniflux adds the ability to fetch the original article, “scrape” the source for content, and display that inside the article instead. This gives you a nice, mostly uncluttered view of the content you want. The main problem: most websites have odd layouts and weird ways of inserting content.
For example, NPR posts quotes in their articles via:
<div class="bucketwrap pullquote ...">
  <aside aria-label="pullquote" ...>
    <div class="bucket" ...>
      <p>Nathan is such a froody dude.</p>
    </div>
  </aside>
</div>
You know, instead of:
<blockquote>
  <p>Nathan is such a froody dude</p>
  <p><cite>Nathan</cite></p>
</blockquote>
OR, if you’re feeling fancy:
<figure>
  <blockquote>
    <p>Nathan is such a froody dude</p>
    <p><cite>Nathan</cite></p>
  </blockquote>
</figure>
Miniflux does an okay job parsing this mess. In fact, the reason it does so well is that it already ships with some predefined, site-specific rules. We can, however, do much better than that.
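To make the idea of “cleaning up” concrete, here's a small sketch in Python using BeautifulSoup (this is not how Miniflux does it internally; it's an illustration, and the class names are taken from the NPR snippet above) that rewrites the pullquote wrapper into a plain blockquote:

```python
from bs4 import BeautifulSoup

# NPR-style pullquote markup, simplified from the example above.
npr_html = """
<div class="bucketwrap pullquote">
  <aside aria-label="pullquote">
    <div class="bucket">
      <p>Nathan is such a froody dude.</p>
    </div>
  </aside>
</div>
"""

soup = BeautifulSoup(npr_html, "html.parser")

# Replace each pullquote wrapper with a plain <blockquote>
# holding the same paragraphs.
for wrapper in soup.select("div.pullquote"):
    blockquote = soup.new_tag("blockquote")
    for p in wrapper.find_all("p"):
        blockquote.append(p.extract())
    wrapper.replace_with(blockquote)

print(soup)
```

The same paragraphs survive, but three layers of site-specific wrapper markup are gone.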
While I know that this can be a cat-and-mouse game between a scraper and the site being scraped, I can at least make my best effort to clean up the junk that comprises an average article. To that end, I've begun cataloging the various scraper rules that I use to clean up articles. It's by no means a perfect list of cleanup rules, but it makes articles much better to read.
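For context, a Miniflux scraper rule is just a CSS selector entered in a feed's settings (Feed Settings → Scraper Rules); Miniflux extracts whatever the selector matches and shows that as the article body. A hypothetical rule for a site that wraps its article body in an id'd container might look like this (the selector here is an assumption, not a verified rule; check the site's actual markup):

```
div#storytext
```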
If you're reading this in some far-flung future and have a rule you'd like to share with me, and, by extension, the world, send me an email with the scraping rule and the site, and I'll add it.