Advanced

This section of the docs shows how to do some useful but advanced things with newspaper.

Multi-threading article downloads

Downloading articles one at a time is slow. But hammering a single news source like cnn.com with tons of threads or with async I/O will trigger rate limiting, and doing that is also very inconsiderate to the site.

We solve this problem by allocating 1-2 threads per news source, which greatly speeds up the download time while remaining respectful.

>>> import newspaper
>>> from newspaper import news_pool

>>> slate_paper = newspaper.build('http://slate.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')
>>> espn_paper = newspaper.build('http://espn.com')

>>> papers = [slate_paper, tc_paper, espn_paper]
>>> news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
>>> news_pool.join()

At this point, you can safely assume that download() has been
called on every single article for all 3 sources.

>>> print(slate_paper.articles[10].html)
u'<html> ...'
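
Note that the pool only performs the downloads; parsing is still up to you. A minimal follow-up might look like this (slicing to the first few articles is purely illustrative):

>>> for article in slate_paper.articles[:5]:
...     article.parse()
...     print(article.title)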

Keeping the HTML of the main article body

Keeping the HTML of just an article's body text is helpful because it allows you to retain some of the semantic information in the HTML. It also helps if you end up displaying the extracted article somewhere.

Here is how to do so:

>>> from newspaper import Article

>>> a = Article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html',
...             keep_article_html=True)

>>> a.download()
>>> a.parse()

>>> a.article_html
u'<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunda...'

The lxml DOM object and the top_node (the chunk of the DOM that contains our article) are also cached in case users would like to use them.

Access them after calling parse() with:

>>> a.download()
>>> a.parse()
>>> a.clean_dom
<lxml object ...  >

>>> a.clean_top_node
<lxml object ...  >
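
Both are ordinary lxml nodes, so the usual lxml tooling applies to them. For instance, one way to serialize the cached top node back to markup (a small sketch, assuming lxml is importable, which newspaper already requires):

>>> from lxml import etree
>>> etree.tostring(a.clean_top_node)
b'<div> ...'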

Adding new languages

First, please reference this file and read from the highlighted line all the way down to the end of the file.

https://github.com/codelucas/newspaper/blob/master/newspaper/text.py#L57

One aspect of our text extraction algorithm revolves around counting the number of stopwords present in a text. Stopwords are some of the most common, short function words in a language, such as the, is, at, which, and on.

Reference this line to see it in action: https://github.com/codelucas/newspaper/blob/master/newspaper/extractors.py#L668

For Latin languages the process is fairly basic. We first provide a list of stopwords in stopwords-<language-code>.txt form. We then take some input text and tokenize it into words by splitting on whitespace. After some bookkeeping we count the number of stopwords present.
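
Conceptually, the Latin-language path boils down to something like the sketch below. This is not the actual implementation (that lives in newspaper/text.py), and the file path assumes you are running from a checkout of the repository:

>>> stopwords = set(open('newspaper/resources/text/stopwords-en.txt').read().split())
>>> tokens = 'the cat sat on the mat'.split()   # tokenize by splitting on whitespace
>>> len([w for w in tokens if w in stopwords])  # 'the' (twice) and 'on' are stopwords
3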

For non-Latin languages, as you may have noticed in the code above, we need to tokenize the words differently: splitting on whitespace simply won't work for languages like Chinese or Arabic. For Chinese we use the open source library jieba to split the text into words. For Arabic we use a special nltk tokenizer to do the same job.
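
As a rough illustration of the difference, here is what word splitting with jieba looks like (jieba must be installed separately; the sample sentence simply means 'I love natural language processing', and the exact segmentation may vary with the dictionary version):

>>> import jieba
>>> list(jieba.cut('我爱自然语言处理'))
['我', '爱', '自然语言', '处理']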

So, to add full text extraction to a new (non-latin) language, we need:

1. Push up a stopwords file in the format of stopwords-<2-char-language-code>.txt in newspaper/resources/text/.

2. Provide a way of splitting/tokenizing text in that language into words. Refer to the existing Chinese, Arabic, and English handling in newspaper/text.py for examples.

For latin languages:

1. Push up a stopwords file in the format of stopwords-<2-char-language-code>.txt in newspaper/resources/text/ and we are done!

Finally, add the new language to the list of available languages in the following files:

  • README.rst
  • docs/index.rst
  • docs/user_guide/quickstart.rst
  • newspaper/utils.py

Explicitly building a news source

Instead of using the newspaper.build(..) API, we can take one step lower into newspaper's Source API.

>>> from newspaper import Source
>>> cnn_paper = Source('http://cnn.com')

>>> print(cnn_paper.size()) # no articles, we have not built the source
0

>>> cnn_paper.build()
>>> print(cnn_paper.size())
3100

Note the build() method above. You may go lower level and de-abstract it for absolute control over how your sources are constructed.

>>> cnn_paper = Source('http://cnn.com')
>>> cnn_paper.download()
>>> cnn_paper.parse()
>>> cnn_paper.set_categories()
>>> cnn_paper.download_categories()
>>> cnn_paper.parse_categories()
>>> cnn_paper.set_feeds()
>>> cnn_paper.download_feeds()
>>> cnn_paper.generate_articles()

>>> print(cnn_paper.size())
3100

And voila, we have mimicked the build() method. In the above sequence, every method depends on the one above it. Stop whenever you wish.
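
For example, if all you need is the list of category URLs, you can stop right after set_categories() and inspect the result with category_urls() (the output will obviously vary):

>>> cnn_paper = Source('http://cnn.com')
>>> cnn_paper.download()
>>> cnn_paper.parse()
>>> cnn_paper.set_categories()
>>> cnn_paper.category_urls()
['http://cnn.com/world', 'http://money.cnn.com', ...]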

Parameters and Configurations

Newspaper provides two APIs for configuring your Article and Source objects. One is via named parameter passing (recommended) and the other is via Config objects.

Here are some named parameter passing examples:

>>> import newspaper
>>> from newspaper import Article, Source

>>> cnn = newspaper.build('http://cnn.com', language='en', memoize_articles=False)

>>> article = Article(url='http://cnn.com/french/...', language='fr', fetch_images=False)

>>> cnn = Source(url='http://latino.cnn.com/...', language='es', request_timeout=10,
...              number_threads=20)

Here are some examples of how Config objects are passed.

>>> import newspaper
>>> from newspaper import Config, Article, Source

>>> config = Config()
>>> config.memoize_articles = False

>>> cbs_paper = newspaper.build('http://cbs.com', config=config)

>>> article_1 = Article(url='http://espn/2013/09/...', config=config)

>>> cbs_paper = Source('http://cbs.com', config=config)

Here is a full list of the configuration options:

keep_article_html, default False, “set to True if you want to preserve html of body text”

http_success_only, default True, “set to False to capture non 2XX responses as well”

MIN_WORD_COUNT, default 300, “num of word tokens in article text”

MIN_SENT_COUNT, default 7, “num of sentence tokens”

MAX_TITLE, default 200, “num of chars in article title”

MAX_TEXT, default 100000, “num of chars in article text”

MAX_KEYWORDS, default 35, “num of keywords in article”

MAX_AUTHORS, default 10, “num of author names in article”

MAX_SUMMARY, default 5000, “num of chars of the summary”

MAX_SUMMARY_SENT, default 5, “num of sentences in summary”

MAX_FILE_MEMO, default 20000, “max number of article urls we memoize for each news source”

memoize_articles, default True, “cache and save articles run after run”

fetch_images, default True, “set this to false if you don’t care about getting images”

follow_meta_refresh, default False, “follows a redirect url in a meta refresh html tag”

image_dimension_ration, default 16/9.0, “max ratio for height/width, we ignore if greater”

language, default ‘en’, “run newspaper.languages() to see available options.”

browser_user_agent, default ‘newspaper/%s’ % __version__

request_timeout, default 7

number_threads, default 10, “number of threads to use when multi-threading”

verbose, default False, “turn this on when debugging”

You may notice other config options in the newspaper/configuration.py file; however, they are private, so please do not toggle them.
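
As a quick illustration, here is how a few of the options listed above might be combined on a single Config object and handed to an Article (the values are arbitrary and the URL is just a placeholder):

>>> from newspaper import Config, Article

>>> config = Config()
>>> config.keep_article_html = True
>>> config.MIN_WORD_COUNT = 150
>>> config.request_timeout = 10
>>> config.browser_user_agent = 'my-crawler/0.1'

>>> article = Article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html', config=config)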

Caching

TODO

Specifications

Here, we will define exactly how newspaper handles a lot of the data extraction.

TODO