This section of the docs shows how to do some useful but advanced things with newspaper.
Multi-threading article downloads
Downloading articles one at a time is slow. But spamming a single news source like cnn.com with tons of threads or with async I/O will get you rate limited, and it is also very inconsiderate.
We solve this problem by allocating 1-2 threads per news source, which greatly speeds up downloads while staying respectful.
>>> import newspaper
>>> from newspaper import news_pool
>>> slate_paper = newspaper.build('http://slate.com')
>>> tc_paper = newspaper.build('http://techcrunch.com')
>>> espn_paper = newspaper.build('http://espn.com')
>>> papers = [slate_paper, tc_paper, espn_paper]
>>> news_pool.set(papers, threads_per_source=2)  # (3*2) = 6 threads total
>>> news_pool.join()

At this point, you can safely assume that download() has been called on every single article for all 3 sources.

>>> print(slate_paper.articles[0].html)
u'<html> ...'
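The idea of capping threads per source can be sketched without newspaper at all. The block below is a minimal illustration, not how news_pool is implemented: download_grouped and fake_download are hypothetical names, and fake_download stands in for a real HTTP fetch so the sketch runs offline.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def fake_download(url):
    # Hypothetical stand-in for a real HTTP fetch, so this runs offline.
    return "<html>%s</html>" % url


def download_grouped(urls, threads_per_source=2, fetch=fake_download):
    """Fetch urls using at most threads_per_source workers per domain."""
    # Group URLs by host so each news source gets its own small pool.
    by_source = defaultdict(list)
    for url in urls:
        by_source[urlparse(url).netloc].append(url)

    def drain(source_urls):
        # The per-source pool caps concurrency against a single host.
        with ThreadPoolExecutor(max_workers=threads_per_source) as pool:
            return list(pool.map(fetch, source_urls))

    results = {}
    # The outer pool lets different sources proceed in parallel.
    with ThreadPoolExecutor(max_workers=len(by_source) or 1) as outer:
        pages_per_source = outer.map(drain, by_source.values())
        for source, pages in zip(by_source, pages_per_source):
            results[source] = pages
    return results
```

With three sources and threads_per_source=2, at most six fetches run at once, and no single host ever sees more than two concurrent requests.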
Keeping HTML of main body article
Keeping the HTML of just an article’s body text is helpful because it allows you to retain some of the semantic information in the HTML. It also helps if you end up displaying the extracted article somewhere.
Here is how to do so:
>>> from newspaper import Article
>>> a = Article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html',
...             keep_article_html=True)
>>> a.download()
>>> a.parse()
>>> a.article_html
u'<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunda...'
The lxml DOM object and top_node (the chunk of the DOM that contains our ‘Article’) are also cached in case users would like to use them.
Access them after calling parse() with:
>>> a.download()
>>> a.parse()
>>> a.clean_dom
<lxml object ... >
>>> a.clean_top_node
<lxml object ... >
Adding new languages
First, please reference this file and read from the highlighted line all the way down to the end of the file.
One aspect of our text extraction algorithm revolves around counting the number of stopwords present in a text. Stopwords are some of the most common, short function words in a language, such as the, is, at, which, and on.
Reference this line to see it in action: https://github.com/codelucas/newspaper/blob/master/newspaper/extractors.py#L668
For Latin languages, it is pretty basic. We first provide a list of
stopwords in stopwords-<language-code>.txt form. We then take some input text and
tokenize it into words by splitting on white space. After that we perform some
bookkeeping and then proceed to count the number of stopwords present.
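The counting step described above can be sketched in a few lines. This is an illustration only, not the code in extractors.py: count_stopwords is a hypothetical helper, the punctuation stripping is an assumed form of the "bookkeeping", and EN_STOPWORDS is a tiny sample rather than a real stopwords file.

```python
def count_stopwords(text, stopwords):
    """Return (stopword_count, total_words) for whitespace-delimited text."""
    # Split on whitespace, then strip surrounding punctuation and lowercase
    # each token so that "The," matches the stopword "the".
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    hits = sum(1 for w in words if w in stopwords)
    return hits, len(words)


# Tiny illustrative list; a real stopwords file is much longer.
EN_STOPWORDS = {"the", "is", "at", "which", "and", "on"}

hits, total = count_stopwords("The cat is on the mat, which is red.", EN_STOPWORDS)
print(hits, total)  # 6 9
```

A high share of stopwords suggests running prose rather than navigation links or boilerplate, which is why the count is useful as an extraction signal.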
For non-Latin languages, as you may have noticed in the code above, we need to tokenize the words in a different way, because splitting by whitespace simply won’t work for languages like Chinese or Arabic. For Chinese we use an open source library called jieba to split the text into words. For Arabic we use a special nltk tokenizer to do the same job.
So, to add full text extraction for a new (non-Latin) language, we need to:
1. Push up a stopwords file in the format of stopwords-<language-code>.txt in newspaper/resources/text/.
2. Provide a way of splitting/tokenizing text in that language into words. Here are some examples for Chinese, Arabic, English
For Latin languages:
1. Push up a stopwords file in the format of stopwords-<language-code>.txt in
newspaper/resources/text/, and we are done!
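The second requirement for non-Latin languages, a language-specific tokenizer, can be pictured as a lookup from language code to a splitting function. The sketch below is hypothetical throughout: TOKENIZERS and tokenize are not newspaper names, and split_into_characters is a crude stand-in for a real segmenter such as jieba, which returns multi-character words.

```python
def split_on_whitespace(text):
    # Default for languages whose words are separated by spaces.
    return text.split()


def split_into_characters(text):
    # Crude stand-in for a real word segmenter: every non-space character
    # becomes its own token. A library like jieba does far better.
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: language code -> tokenizer function.
TOKENIZERS = {"zh": split_into_characters}


def tokenize(text, language):
    """Tokenize text with the language's tokenizer, else split on whitespace."""
    return TOKENIZERS.get(language, split_on_whitespace)(text)
```

Under this layout, adding a language means registering one function, while languages without an entry fall back to whitespace splitting.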
Explicitly building a news source
Instead of using the
newspaper.build(..) API, we can go one level lower and work with a Source object directly:
>>> from newspaper import Source
>>> cnn_paper = Source('http://cnn.com')
>>> print(cnn_paper.size())  # no articles, we have not built the source
0
>>> cnn_paper.build()
>>> print(cnn_paper.size())
3100
Note the build() method above. You may go lower level and de-abstract it
for absolute control over how your sources are constructed.
>>> cnn_paper = Source('http://cnn.com')
>>> cnn_paper.download()
>>> cnn_paper.parse()
>>> cnn_paper.set_categories()
>>> cnn_paper.download_categories()
>>> cnn_paper.parse_categories()
>>> cnn_paper.set_feeds()
>>> cnn_paper.download_feeds()
>>> cnn_paper.generate_articles()
>>> print(cnn_paper.size())
3100
And voila, we have mimicked the
build() method. In the above sequence,
every method depends on the method above it. Stop whenever you wish.
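Because each step depends on the one before it, "stopping whenever you wish" amounts to running a prefix of the sequence. The helper below is a small sketch of that idea: build_until and BUILD_STEPS are hypothetical names, and the step names are simply the methods listed in the sequence above.

```python
# Method names taken, in order, from the manual build sequence above.
BUILD_STEPS = (
    "download",
    "parse",
    "set_categories",
    "download_categories",
    "parse_categories",
    "set_feeds",
    "download_feeds",
    "generate_articles",
)


def build_until(source, last_step):
    """Call the build steps on source in order, stopping after last_step."""
    if last_step not in BUILD_STEPS:
        raise ValueError("unknown build step: %r" % (last_step,))
    stop = BUILD_STEPS.index(last_step) + 1
    for name in BUILD_STEPS[:stop]:
        # Each step relies on state produced by the steps before it.
        getattr(source, name)()
    return source
```

For example, build_until(cnn_paper, "parse_categories") would run the first five steps and skip feeds and article generation.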
Parameters and Configurations
Newspaper provides two APIs for users to configure their Article and
Source objects. One is via named parameter passing (recommended) and
the other is via Config objects.
Here are some named parameter passing examples:
>>> import newspaper
>>> from newspaper import Article, Source
>>> cnn = newspaper.build('http://cnn.com', language='en', memoize_articles=False)
>>> article = Article(url='http://cnn.com/french/...', language='fr', fetch_images=False)
>>> cnn = Source(url='http://latino.cnn.com/...', language='es', request_timeout=10, number_threads=20)
Here are some examples of how Config objects are passed.
>>> import newspaper
>>> from newspaper import Config, Article, Source
>>> config = Config()
>>> config.memoize_articles = False
>>> cbs_paper = newspaper.build('http://cbs.com', config=config)
>>> article_1 = Article(url='http://espn/2013/09/...', config=config)
>>> cbs_paper = Source('http://cbs.com', config=config)
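One way to picture how the two styles relate: named parameters act like per-call overrides on top of a config's defaults. The sketch below illustrates that mental model only. SketchConfig and apply_overrides are hypothetical and are not newspaper's Config class or its internals; the defaults are copied from the option list in this section.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical stand-in holding a few documented options and defaults.
    language: str = "en"
    memoize_articles: bool = True
    fetch_images: bool = True
    request_timeout: int = 7
    number_threads: int = 10


def apply_overrides(config, **overrides):
    """Return a copy of config with the named options replaced."""
    known = {f.name for f in fields(config)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        # Reject misspelled option names instead of silently ignoring them.
        raise TypeError("unknown option(s): " + ", ".join(unknown))
    return replace(config, **overrides)
```

Here apply_overrides(SketchConfig(), language="es", request_timeout=10) yields a config that differs from the defaults only in those two fields, and the original object is left unchanged.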
Here is a full list of the configuration options:
keep_article_html, default False, “set to True if you want to preserve html of body text”
http_success_only, default True, “set to False to capture non 2XX responses as well”
MIN_WORD_COUNT, default 300, “num of word tokens in article text”
MIN_SENT_COUNT, default 7, “num of sentence tokens”
MAX_TITLE, default 200, “num of chars in article title”
MAX_TEXT, default 100000, “num of chars in article text”
MAX_KEYWORDS, default 35, “num of keywords in article”
MAX_AUTHORS, default 10, “num of author names in article”
MAX_SUMMARY, default 5000, “num of chars of the summary”
MAX_SUMMARY_SENT, default 5, “num of sentences in summary”
MAX_FILE_MEMO, default 20000, “max number of urls we cache for each news source”
memoize_articles, default True, “cache and save articles run after run”
fetch_images, default True, “set this to false if you don’t care about getting images”
follow_meta_refresh, default False, “follows a redirect url in a meta refresh html tag”
image_dimension_ration, default 16/9.0, “max ratio for height/width, we ignore if greater”
language, default ‘en’, “run newspaper.languages() to see available options.”
browser_user_agent, default ‘newspaper/%s’ % __version__
request_timeout, default 7
number_threads, default 10, “number of threads when multithreading”
verbose, default False, “turn this on when debugging”
You may notice other config options in the source code;
however, they are private, so please do not toggle them.
Here, we will define exactly how newspaper handles a lot of the data extraction.