.. _quickstart:

Quickstart
==========

Eager to get started? This page gives a good introduction to getting started
with newspaper. It assumes you already have newspaper installed. If you do not,
head over to the :ref:`Installation ` section.

Building a news source
----------------------

Source objects are an abstraction of online news media websites like CNN or ESPN.
You can initialize them in two *different* ways.

Building a ``Source`` will extract its categories, feeds, articles, brand, and
description for you.

You may also provide configuration parameters such as ``language`` and
``browser_user_agent``. Navigate to the :ref:`advanced ` section for details.

.. code-block:: pycon

    >>> import newspaper
    >>> cnn_paper = newspaper.build('http://cnn.com')

    >>> sina_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

However, if needed, you may also work with the lower level ``Source`` object as
described in the :ref:`advanced ` section.

Extracting articles
-------------------

Every news source has a set of *recent* articles.

The following examples assume that a news source has been initialized and built.

.. code-block:: pycon

    >>> for article in cnn_paper.articles:
    ...     print(article.url)

    u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
    u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
    ...

    >>> print(cnn_paper.size())  # cnn has 3100 articles
    3100

Article caching
---------------

By default, newspaper caches all previously extracted articles and **eliminates
any article which it has already extracted**.

This feature exists to prevent duplicate articles and to increase extraction speed.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    2

The return value of ``cbs_paper.size()`` changes from 1030 to 2 because when we
first crawled cbs we found 1030 articles.
However, on our second crawl, we eliminate all articles which have already been
crawled. This means **2** new articles have been published since our first
extraction.

You may opt out of this feature with the ``memoize_articles`` parameter.

You may also pass in the lower level ``Config`` objects as covered in the
:ref:`advanced ` section.

.. code-block:: pycon

    >>> import newspaper

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

Extracting Source categories
----------------------------

.. code-block:: pycon

    >>> for category in cnn_paper.category_urls():
    ...     print(category)

    u'http://lifestyle.cnn.com'
    u'http://cnn.com/world'
    u'http://tech.cnn.com'
    ...

Extracting Source feeds
-----------------------

.. code-block:: pycon

    >>> for feed_url in cnn_paper.feed_urls():
    ...     print(feed_url)

    u'http://rss.cnn.com/rss/cnn_crime.rss'
    u'http://rss.cnn.com/rss/cnn_tech.rss'
    ...

Extracting Source brand & description
-------------------------------------

.. code-block:: pycon

    >>> print(cnn_paper.brand)
    u'cnn'

    >>> print(cnn_paper.description)
    u'CNN.com delivers the latest breaking news and information on the latest...'

News Articles
-------------

Article objects are abstractions of news articles. For example, a news ``Source``
would be CNN while a news ``Article`` would be a specific CNN article.
You may reference an ``Article`` from an existing news ``Source`` or initialize
one by itself.

Referencing it from a ``Source``:

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

Initializing an ``Article`` by itself:

.. code-block:: pycon

    >>> from newspaper import Article
    >>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')

Note the similar ``language=`` named parameter above. All the config parameters
described for ``Source`` objects also apply to ``Article`` objects!
**Source and Article objects have a very similar API**.

There are endless possibilities for how we can manipulate and build articles.

Downloading an Article
----------------------

We begin by calling ``download()`` on an article. If you are interested in how to
quickly download articles concurrently with multi-threading, check out the
:ref:`advanced ` section.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print(first_article.html)
    u'