Eager to get started? This page gives a good introduction to how to get started with newspaper. It assumes you already have newspaper installed. If you do not, head over to the Installation section.
Building a news source
Source objects are an abstraction of online news media websites like CNN or ESPN. You can initialize them in two different ways.
Source will extract its categories, feeds, articles, brand, and description for you.
You may also provide configuration parameters, such as browser_user_agent, seamlessly. Navigate to the advanced section for details.
>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')
>>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')
However, if needed, you may also play with the lower-level Source object as described in the advanced section.
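As a rough sketch of what that lower-level usage looks like (assuming the Source constructor accepts the url directly and exposes the same build() and size() behavior shown below):

>>> import newspaper
>>> cnn_source = newspaper.Source('http://cnn.com')
>>> cnn_source.build()  # download and parse categories, feeds, and articles
>>> print(cnn_source.size())

newspaper.build() is essentially a convenience wrapper around this sequence; see the advanced section for the exact details.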
Every news source has a set of recent articles.
The following examples assume that a news source has been initialized and built.
>>> for article in cnn_paper.articles:
...     print(article.url)
u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
...

>>> print(cnn_paper.size())  # cnn has 3100 articles
3100
By default, newspaper caches all previously extracted articles and eliminates any article which it has already extracted.
This feature exists to prevent duplicate articles and to increase extraction speed.
>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
1030

>>> cbs_paper = newspaper.build('http://cbs.com')
>>> cbs_paper.size()
2
The return value of cbs_paper.size() changes from 1030 to 2 because when we first crawled CBS we found 1030 articles. On our second crawl, however, we eliminated all articles which had already been crawled. This means 2 new articles have been published since our first extraction.
You may opt out of this feature with the memoize_articles parameter. You may also pass in the lower-level Config objects as covered in the advanced section.
>>> import newspaper
>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030

>>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
>>> cbs_paper.size()
1030
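The same opt-out can also be expressed through a Config object. This is a sketch, assuming the Config attributes mirror the keyword arguments used above and that build() accepts a config keyword (the user-agent string is a placeholder):

>>> import newspaper
>>> config = newspaper.Config()
>>> config.memoize_articles = False
>>> config.browser_user_agent = 'MyNewsBot/1.0'  # hypothetical UA string
>>> cbs_paper = newspaper.build('http://cbs.com', config=config)
>>> print(cbs_paper.size())

Consult the advanced section for the full list of Config attributes.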
Extracting Source categories
>>> for category in cnn_paper.category_urls():
...     print(category)
u'http://lifestyle.cnn.com'
u'http://cnn.com/world'
u'http://tech.cnn.com'
...
Extracting Source feeds
>>> for feed_url in cnn_paper.feed_urls():
...     print(feed_url)
u'http://rss.cnn.com/rss/cnn_crime.rss'
u'http://rss.cnn.com/rss/cnn_tech.rss'
...
Extracting Source brand & description
>>> print(cnn_paper.brand)
u'cnn'

>>> print(cnn_paper.description)
u'CNN.com delivers the latest breaking news and information on the latest...'
Article objects are abstractions of news articles. For example, a news Source would be CNN, while a news Article would be a specific CNN article. You may reference an Article from an existing news Source or initialize one by itself.
Referencing it from a Source.

>>> first_article = cnn_paper.articles[0]
Initializing an Article by itself.

>>> from newspaper import Article
>>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')
Note the familiar language= named parameter above. All the config parameters described for Source objects also apply to Article objects! Source and Article objects have a very similar API.
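For instance, a Config object can be handed to an Article just as it can to a Source. This is a sketch, assuming Article accepts a config keyword and that browser_user_agent is a Config attribute (the user-agent string is a placeholder):

>>> from newspaper import Article, Config
>>> config = Config()
>>> config.browser_user_agent = 'MyNewsBot/1.0'  # hypothetical UA string
>>> first_article = Article(url="http://www.lemonde.fr/...", config=config)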
There are endless possibilities for how we can manipulate and build articles.
Downloading an Article
We begin by calling download() on an article. If you are interested in how to quickly download articles concurrently with multi-threading, check out the advanced section.
>>> first_article = cnn_paper.articles[0]
>>> first_article.download()
>>> print(first_article.html)
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> print(cnn_paper.articles[1].html)  # fails, not downloaded yet
u''
Parsing an Article
You may also extract meaningful content from the HTML, like authors and body text.
You must have called download() on an article before calling parse().
>>> first_article.parse()
>>> print(first_article.text)
u'Three sisters who were imprisoned for possibly...'

>>> print(first_article.top_image)
u'http://some.cdn.com/3424hfd4565sdfgdg436/'

>>> print(first_article.authors)
[u'Eliott C. McLaughlin', u'Some CoAuthor']

>>> print(first_article.title)
u'Police: 3 sisters imprisoned in Tucson home'

>>> print(first_article.images)
['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

>>> print(first_article.movies)
['url_to_youtube_link_1', ...]  # youtube, vimeo, etc.
Performing NLP on an Article
Finally, you may extract natural language properties from the text.
You must have called both download() and parse() on the article before calling nlp(). As of the current build, nlp() features only work on western languages.
>>> first_article.nlp()
>>> print(first_article.summary)
u'...imprisoned for possibly a constant barrage...'

>>> print(first_article.keywords)
[u'music', u'Tucson', ...]

>>> print(cnn_paper.articles[1].nlp())  # fails, not downloaded yet
Traceback (...
ArticleException: You must parse an article before you try to..
nlp() is expensive, as is parse(); make sure you actually need them before calling them on all of your articles! In some cases, if you just need URLs, even download() is not necessary.
Here are random but hopefully useful features!
hot() returns a list of the top trending terms on Google using a public API. popular_urls() returns a list of popular news source URLs, in case you need help choosing a news source!
>>> import newspaper
>>> newspaper.hot()
['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ...]

>>> newspaper.popular_urls()
['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ...]

>>> newspaper.languages()
Your available languages are:
input code      full name
  ar            Arabic
  de            German
  en            English
  es            Spanish
  fr            French
  he            Hebrew
  it            Italian
  ko            Korean
  no            Norwegian
  pt            Portuguese
  sv            Swedish
  zh            Chinese