Newspaper delivers Instapaper-style article extraction

Newspaper lets anyone do article extraction like Instapaper and Pocket.

Newspaper is a Python 2 library for extracting & curating articles from the web.
It wants to change the way people handle article extraction with a new, more precise layer of abstraction.

Besides “read later” services, there’s a growing number of APIs that provide article extraction as a service, such as Diffbot. Those services are great, but it’s nice that Newspaper is open source and hackable.

For instance, when I first checked out Newspaper, it only had plain-text article extraction. Sometimes, though, I want the original markup of the article with some sanitization. It helps to have the paragraphs, links, and headers accurately represent the article. So I forked the project, made some changes, and the maintainer, codelucas, was responsive and worked with me to get my changes merged in.
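To make "original markup with some sanitization" concrete, here's a rough sketch using only Python's standard library. This is not Newspaper's implementation; the names `ALLOWED_TAGS`, `MarkupSanitizer`, and `sanitize` are hypothetical, and the allow-list is just an assumption about what "paragraphs, links, and headers" might cover. The idea is to keep those tags, drop every other tag and attribute, and discard script and style content entirely.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: the only tags kept in the sanitized output.
ALLOWED_TAGS = {"p", "a", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose text content is thrown away along with the tag itself.
DROPPED_CONTENT = {"script", "style"}
# Assumption: only these href prefixes are considered safe to keep.
SAFE_HREF_PREFIXES = ("http://", "https://", "/")


class MarkupSanitizer(HTMLParser):
    """Collects allowed tags and escaped text; ignores everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href and href.startswith(SAFE_HREF_PREFIXES):
                # Keep only the href attribute on links.
                self.parts.append('<a href="%s">' % escape(href, quote=True))
            else:
                # All other attributes (class, style, onclick, ...) are dropped.
                self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROPPED_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so stray "<" or "&" cannot become markup.
            self.parts.append(escape(data, quote=False))


def sanitize(html):
    """Return html reduced to paragraphs, links, and headers."""
    parser = MarkupSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With this sketch, `sanitize('<div><p>Hi <b>there</b></p></div>')` returns `'<p>Hi there</p>'`: the wrapper `div` and the `b` tag are dropped, while the paragraph and its text survive.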

If you want a place to start working on article extraction, Newspaper looks like a good bet.

Prism – command line and Ruby library parser for Microformats

Prism from Mark Wunsch is a cool Ruby library and command line tool for parsing Microformats. It even supports vCard export:

twitter_contacts = Prism.find '', :hcard
me = twitter_contacts.first
me.fn
#=> "Mark Wunsch"
me.n.family_name
#=> "Wunsch"
me.url
#=> ""
File.open('mark.vcf','w') {|f| f.write me.to_vcard }
## Add me to your address book!

Or if the command line is your bag:

$: prism --hcard > ~/Desktop/me.vcf

or using STDIN and cURL:

$: curl | prism --hcard > ~/Desktop/me.vcf

Prism also includes a DSL for parsing your own POSH formats:

class Navigation < Prism::POSH::Base
    search {|document| document.css('ul#navigation') }
    # Search a Nokogiri document for nodes of a certain type

    validate {|node| node.matches?('ul#navigation') }
    # Validate that a node is the right element we want

    has_many :items do
        search {|doc| doc.css('li') }
    end
    # has_many and has_one define properties, which themselves inherit from
    # Prism::POSH::Base, so you can do :has_one, :has_many, :search, :extract, etc.
end

[Source on GitHub] [Docs] [Mark Wunsch on Twitter]