Newspaper lets anyone do the kind of article extraction that Instapaper and Pocket do.
Newspaper is a Python 2 library for extracting & curating articles from the web.
It wants to change the way people handle article extraction with a new, more precise layer of abstraction.
Besides “read later” services, there’s a growing number of APIs that provide article extraction as a service, like diffbot and embed.ly. Those services are great, but it’s nice that Newspaper is open source and hackable.
For instance, when I first checked out Newspaper, it only had plain text article extraction. Sometimes, though, I want the original markup of the article with some sanitization. It helps to have the paragraphs, links, and headers accurately represent the article. So I forked the project and made some changes, and the maintainer, codelucas, was responsive and worked with me to get my changes merged in.
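To make "original markup with some sanitization" concrete, here is a minimal sketch using only the Python standard library. It is not how Newspaper implements the feature. The names `ALLOWED_TAGS`, `ArticleSanitizer`, and `sanitize_article_html` are hypothetical, and the tag whitelist is an assumption about what counts as article structure.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tags that carry the structure of an article.
ALLOWED_TAGS = {
    "p", "a", "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "blockquote", "em", "strong",
}
# Tags whose text content is dropped entirely.
DROPPED_WITH_CONTENT = {"script", "style"}


class ArticleSanitizer(HTMLParser):
    """Keeps whitelisted tags, strips all attributes except safe hrefs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            href = dict(attrs).get("href") if tag == "a" else None
            # Keep only http(s) and site-relative links.
            if href and href.startswith(("http://", "https://", "/")):
                self.parts.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(escape(data, quote=False))


def sanitize_article_html(html):
    """Return html reduced to whitelisted tags and escaped text."""
    parser = ArticleSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling `sanitize_article_html` on a fragment of a page keeps the headers, paragraphs, and links while dropping wrapper elements, inline event handlers, styles, and scripts. A real extractor also has to find which part of the page is the article in the first place, which this sketch does not attempt.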
If you want a place to start working on article extraction, Newspaper looks like a good bet.