[Python] Get links to every article of a blog

16 December 2013 python, scrapy

If you’re like me, you read a lot of blogs. You probably have a handful of favorite blogs whose every article you read carefully.

Reading articles as they are posted is straightforward: you just need to subscribe to the RSS feed or visit the site periodically. But reading articles written a long time ago is a pain. Blogs usually use pagination to limit the number of articles per page, which makes it hard to know whether you have read them all.

Since I can’t stand the idea of possibly missing one article of Fabulous Adventures in Coding, I decided to write a web crawler that gathers the links to all the articles on a single page.

1. Install scrapy

There are many web crawling frameworks available. I’m going to use Scrapy in this article.

The Scrapy documentation contains an installation guide, but it did not work on my system (Ubuntu 13.10).

I ended up doing this:

# Install python development libs
sudo apt-get install python-setuptools python-dev libxml2-dev libxslt-dev libz-dev

# Install Scrapy
sudo easy_install scrapy

There are a few warnings that can be safely ignored.

2. Create a project

Scrapy contains a command line tool to create a project:

scrapy startproject BlogRipper

The next step is to define the type of data that is going to be extracted by scrapy. This is done by creating a class derived from Item.

Edit BlogRipper/items.py and replace its contents with:

from scrapy.item import Item, Field

class ArticleItem(Item):
    title = Field()
    url = Field()

This is extremely straightforward: an ArticleItem contains two fields, title and url.

3. Create a spider

A Scrapy project must contain at least one spider. A spider is a class that is responsible for analyzing the pages of one website.

Once again a command line tool is going to help:

scrapy genspider -t basic ericlippert blogs.msdn.com

Note: -t basic tells Scrapy to create a BaseSpider; the default is to create a CrawlSpider, which is great but not efficient for what I want to do.

Now, we can modify the spider. Edit BlogRipper/spiders/ericlippert.py:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.http import Request
from scrapy.spider import BaseSpider
from BlogRipper.items import ArticleItem

class EriclippertSpider(BaseSpider):
    name = 'ericlippert'
    allowed_domains = ['blogs.msdn.com']
    start_urls = ['http://blogs.msdn.com/b/ericlippert/']
    article_extractor = SgmlLinkExtractor(allow=r'/archive/\d{4}/\d{2}/\d{2}/')
    next_page_extractor = SgmlLinkExtractor(
                allow=r'/default.aspx\?PageIndex=\d+$', 
                restrict_xpaths="//a[@class='selected']/following-sibling::a[@class='page']")

    def parse(self, response):
        # Yield one item per article link found on the current page.
        articles = self.article_extractor.extract_links(response)
        for link in articles:
            item = ArticleItem()
            item['title'] = link.text
            item['url'] = link.url
            yield item
        # Follow only the first matching link, i.e. the next page.
        next_pages = self.next_page_extractor.extract_links(response)
        if next_pages:
            yield Request(next_pages[0].url)

4. Understand how it works

The code defines a spider class derived from BaseSpider.

The BaseSpider requires:

a name
a list of allowed_domains
a list of start_urls
a parse() method that will be called to analyze each scraped page

In the parse() implementation I need two kinds of links:

the links to every article on the page (25 per page on that blog)
the link to the next page (only one per page)

These links are extracted by two SgmlLinkExtractor instances.

The article_extractor will look for links like:

/archive/2012/11/29/a-new-fabulous-adventure.aspx
/archive/2012/11/09/it-s-still-essential.aspx
...

The next_page_extractor will look for links like:

/default.aspx?PageIndex=2
/default.aspx?PageIndex=3
...

Since we want to scrape the pages in order, we need to get the link to the next page only. This is done by the XPath constraint on next_page_extractor:

//a[@class='selected']/following-sibling::a[@class='page']

You need to look at the page’s source to understand this xpath.
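If you don’t have the page’s source at hand, here is a rough illustration of what the expression selects. The pager markup below is an assumption (the real page may be structured differently), and since the standard library’s ElementTree does not support the following-sibling axis, the selection is emulated with a plain loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the real page may differ.
PAGER = """
<div class="pager">
  <a class="page" href="/b/ericlippert/default.aspx?PageIndex=1">1</a>
  <a class="selected" href="/b/ericlippert/default.aspx?PageIndex=2">2</a>
  <a class="page" href="/b/ericlippert/default.aspx?PageIndex=3">3</a>
  <a class="page" href="/b/ericlippert/default.aspx?PageIndex=4">4</a>
</div>
"""

def following_page_links(pager):
    """Emulate //a[@class='selected']/following-sibling::a[@class='page']."""
    links = []
    seen_selected = False
    for child in pager:  # children are visited in document order
        if child.tag != 'a':
            continue
        if seen_selected and child.get('class') == 'page':
            links.append(child.get('href'))
        elif child.get('class') == 'selected':
            seen_selected = True
    return links

next_pages = following_page_links(ET.fromstring(PAGER))
print(next_pages[0])
```

The expression matches every page link located after the selected one, which is why the spider only follows next_pages[0].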

Because it uses yield, parse() is a generator that produces Item and Request objects. Each yielded Item is added to the scraped items. Each yielded Request triggers a parse() on the new page.
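The way items and requests flow can be mimicked without Scrapy. The sketch below is a toy model, not how Scrapy is implemented: FAKE_SITE, the Request stand-in and the crawl() loop are all invented for the illustration.

```python
from collections import deque

# Hypothetical three-page blog: URL -> (article titles, next page URL or None).
FAKE_SITE = {
    'page1': (['A', 'B'], 'page2'),
    'page2': (['C', 'D'], 'page3'),
    'page3': (['E'], None),
}

class Request(object):
    """Stand-in for scrapy.http.Request: it only remembers the URL."""
    def __init__(self, url):
        self.url = url

def parse(url):
    """Yield one item per article, then a request for the next page."""
    titles, next_url = FAKE_SITE[url]
    for title in titles:
        yield {'title': title, 'url': url}
    if next_url:
        yield Request(next_url)

def crawl(start_url):
    """Toy engine: collect items and schedule a parse() for each request."""
    items = []
    pending = deque([start_url])
    while pending:
        url = pending.popleft()
        for result in parse(url):
            if isinstance(result, Request):
                pending.append(result.url)
            else:
                items.append(result)
    return items

items = crawl('page1')
print([item['title'] for item in items])  # ['A', 'B', 'C', 'D', 'E']
```

Since each page yields at most one request, the pages are visited one after the other and the items come out in page order.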

5. Create a custom exporter

We could already run our spider with

scrapy crawl ericlippert

But the scraped links will not be saved anywhere.

Scrapy already contains exporters for JSON, XML, etc. But I want the output to be readable from GitHub. So I’ll create an exporter for the Markdown format.

Create a file BlogRipper/exporters.py:

from scrapy.contrib.exporter import BaseItemExporter

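class MarkdownItemExporter(BaseItemExporter):
    """Export items as a numbered Markdown list of links."""

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.count = 0  # number of items written so far

    def export_item(self, item):
        # Write one list entry per item: "N. [title](url)"
        self.count += 1
        self.file.write("%d. [%s](%s)\n" %
         (self.count, item['title'].encode('utf-8'), item['url']))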

And now we just need to register that exporter in settings.py:

FEED_EXPORTERS = {
    'md': 'BlogRipper.exporters.MarkdownItemExporter',
}
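To preview the kind of output the exporter writes, the same format string can be exercised in plain Python. This is a sketch for Python 3 using an in-memory buffer, so the encode() call is left out; the two titles are made up, only the paths come from the examples above.

```python
import io

def markdown_line(count, title, url):
    """Format one entry the same way export_item() does."""
    return "%d. [%s](%s)\n" % (count, title, url)

# Hypothetical titles; the paths are the sample article links shown earlier.
articles = [
    ('A new fabulous adventure',
     '/archive/2012/11/29/a-new-fabulous-adventure.aspx'),
    ('It is still essential',
     '/archive/2012/11/09/it-s-still-essential.aspx'),
]

out = io.StringIO()
for count, (title, url) in enumerate(articles, start=1):
    out.write(markdown_line(count, title, url))

print(out.getvalue())
```

Each item becomes one entry of a numbered Markdown list, which GitHub renders as clickable links.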

6. Run!

Everything is ready. Just type:

scrapy crawl ericlippert -t md -o ericlippert.md

Here is the resulting log (with LOG_LEVEL='INFO'):

13:58:05 [scrapy] INFO: Scrapy 0.20.2 started (bot: BlogRipper)
13:58:05 [ericlippert] INFO: Spider opened
13:58:34 [ericlippert] INFO: Closing spider (finished)
13:58:34 [ericlippert] INFO: Stored md feed (783 items) in: ericlippert.md
13:58:34 [ericlippert] INFO: Dumping Scrapy stats:
	{'downloader/request_bytes': 12500,
	 'downloader/request_count': 32,
	 'downloader/request_method_count/GET': 32,
	 'downloader/response_bytes': 740052,
	 'downloader/response_count': 32,
	 'downloader/response_status_count/200': 32,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2013, 12, 13, 12, 58, 34, 83028),
	 'item_scraped_count': 783,
	 'log_count/INFO': 4,
	 'request_depth_max': 31,
	 'response_received_count': 32,
	 'scheduler/dequeued': 32,
	 'scheduler/dequeued/memory': 32,
	 'scheduler/enqueued': 32,
	 'scheduler/enqueued/memory': 32,
	 'start_time': datetime.datetime(2013, 12, 13, 12, 58, 5, 805472)}
13:58:34 [ericlippert] INFO: Spider closed (finished)
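The stats can be cross-checked against the 25 articles per page mentioned earlier. The quick calculation below assumes that every page except the last one contributes exactly 25 article links, which I have not verified page by page.

```python
ARTICLES_PER_PAGE = 25  # page size of that blog, as stated earlier
items_scraped = 783     # 'item_scraped_count' in the stats
requests_made = 32      # 'downloader/request_count' in the stats

# Integer ceiling division: number of pages needed to hold all the items.
expected_pages = (items_scraped + ARTICLES_PER_PAGE - 1) // ARTICLES_PER_PAGE
last_page_items = items_scraped - (expected_pages - 1) * ARTICLES_PER_PAGE

print(expected_pages, last_page_items)
```

Under that assumption, 31 full pages plus a last page of 8 articles give 32 pages, which agrees with the 32 requests and with a request_depth_max of 31 (the first page is at depth 0).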

7. Results

Here are the results for several blogs:

The complete project is available on GitHub: https://github.com/bblanchon/BlogRipper/