[Python] Get links to every article of a blog
If you’re like me, you read a lot of blogs. You probably have a bunch of favorite blogs where you carefully read every single article.
Reading articles as they are posted is straightforward: you just subscribe to the RSS feed or visit the site periodically. But if you want to read articles written a long time ago, it’s a pain. Blogs usually use pagination to limit the number of articles per page, which makes it hard to know whether you have read them all.
Since I can’t stand the idea of possibly missing an article of Fabulous Adventures in Coding, I decided to write a web crawler to collect the links to all the articles on a single page.
1. Install scrapy
There are many web crawling frameworks available. I’m going to use Scrapy in this article.
The Scrapy documentation contains an installation guide, but it did not work on my system (Ubuntu 13.10).
I ended up doing this:
# Install python development libs
sudo apt-get install python-setuptools python-dev libxml2-dev libxslt-dev libz-dev
# Install Scrapy
sudo easy_install scrapy
There are a few warnings that can be safely ignored.
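To double-check that the installation succeeded, asking Scrapy for its version should work (the exact output depends on your setup):
scrapy version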
2. Create a project
Scrapy provides a command line tool to create a project:
scrapy startproject BlogRipper
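This should create a project skeleton that looks roughly like the following (the exact files may vary slightly between Scrapy versions):
BlogRipper/
    scrapy.cfg
    BlogRipper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py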
The next step is to define the type of data that Scrapy is going to extract. This is done by creating a class derived from Item.
Edit BlogRipper/items.py and replace its content with:
from scrapy.item import Item, Field

class ArticleItem(Item):
    title = Field()
    url = Field()
This is extremely straightforward: an ArticleItem contains two fields, title and url.
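As a quick illustration (this snippet is not part of the project, just a sketch of how Items behave), an Item can be built from keyword arguments and accessed like a dictionary:
from BlogRipper.items import ArticleItem

item = ArticleItem(title='Some title', url='http://example.com/some-article')
print item['title']   # prints: Some title
print dict(item)      # converts to a plain dict containing both fields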
3. Create a spider
A Scrapy project must contain at least one spider. A spider is a class responsible for analyzing the pages of one website.
Once again a command line tool is going to help:
scrapy genspider -t basic ericlippert blogs.msdn.com
Note: -t basic tells Scrapy to create a BaseSpider; the default template creates a CrawlSpider, which is great but not efficient for what I want to do.
Now, we can modify the spider. Edit BlogRipper/spiders/ericlippert.py:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.spider import BaseSpider
from BlogRipper.items import ArticleItem

class EriclippertSpider(BaseSpider):
    name = 'ericlippert'
    allowed_domains = ['blogs.msdn.com']
    start_urls = ['http://blogs.msdn.com/b/ericlippert/']

    # Links to individual articles, e.g. /archive/2012/11/29/....aspx
    article_extractor = SgmlLinkExtractor(allow=r'/archive/\d{4}/\d{2}/\d{2}/')

    # Link to the next page of the article list
    next_page_extractor = SgmlLinkExtractor(
        allow=r'/default.aspx\?PageIndex=\d+$',
        restrict_xpaths="//a[@class='selected']/following-sibling::a[@class='page']")

    def parse(self, response):
        # Yield one item per article link found on the page
        articles = self.article_extractor.extract_links(response)
        for link in articles:
            item = ArticleItem()
            item['title'] = link.text
            item['url'] = link.url
            yield item
        # Follow the link to the next page, if any
        next_pages = self.next_page_extractor.extract_links(response)
        if next_pages:
            yield Request(next_pages[0].url)
4. Understand how it works
The code defines a spider class derived from BaseSpider. A BaseSpider requires the following (a minimal sketch follows the list):
- a name
- a list of allowed_domains
- a list of start_urls
- a parse() method that will be called to analyze each scraped page
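This sketch is not part of the BlogRipper project; MinimalSpider and example.com are placeholders, and it only shows the four required pieces:
from scrapy.spider import BaseSpider

class MinimalSpider(BaseSpider):
    name = 'minimal'                        # identifier used by `scrapy crawl`
    allowed_domains = ['example.com']       # requests to other domains are filtered out
    start_urls = ['http://example.com/']    # the crawl starts by downloading these pages

    def parse(self, response):
        # called for each downloaded page; yield Items and/or Requests here
        pass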
In the parse() implementation, I need two kinds of links:
- the links to every article on the page (25 per page on that blog)
- the link to the next page (only one per page)
These links are extracted by two SgmlLinkExtractors.
The article_extractor will look for links like:
/archive/2012/11/29/a-new-fabulous-adventure.aspx
/archive/2012/11/09/it-s-still-essential.aspx
...
The next_page_extractor will look for links like:
/default.aspx?PageIndex=2
/default.aspx?PageIndex=3
...
Since we want to scrape the pages in order, we need to extract the link to the next page only. This is done by the XPath constraint (restrict_xpaths) on next_page_extractor:
//a[@class='selected']/following-sibling::a[@class='page']
You need to look at the page’s source to understand this XPath: it selects the page links that come right after the currently selected page number.
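A convenient way to experiment with these extractors and with the XPath is Scrapy’s interactive shell, for example scrapy shell http://blogs.msdn.com/b/ericlippert/, which exposes the downloaded page as a response variable. A rough sketch of what you could type in that shell:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

extractor = SgmlLinkExtractor(allow=r'/archive/\d{4}/\d{2}/\d{2}/')
for link in extractor.extract_links(response)[:5]:
    print link.text, link.url   # each Link carries the anchor text and the absolute URL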
The parse() method returns an iterable containing both Item and Request objects. Each yielded Item is added to the scraped items. Each yielded Request triggers a parse() call on the new page.
5. Create a custom exporter
We could already run our spider with:
scrapy crawl ericlippert
But the scraped links would not be saved anywhere.
Scrapy already contains exporters for JSON, XML, etc. But I want the output to be readable on GitHub, so I’ll create an exporter for the Markdown format.
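As a side note (not needed for the rest of this article), those built-in feed exporters require no code at all; exporting to JSON, for instance, should just be a matter of:
scrapy crawl ericlippert -o ericlippert.json -t json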
Create a file BlogRipper/exporters.py:
from scrapy.contrib.exporter import BaseItemExporter

class MarkdownItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.count = 0

    def export_item(self, item):
        # Write one numbered Markdown link per item
        self.count += 1
        self.file.write("%d. [%s](%s)\n" %
            (self.count, item['title'].encode('utf-8'), item['url']))
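Each exported item becomes one numbered Markdown link. Reusing the first article URL shown earlier (the title here is invented for illustration), a line of output looks like:
1. [A new fabulous adventure](http://blogs.msdn.com/b/ericlippert/archive/2012/11/29/a-new-fabulous-adventure.aspx)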
And now we just need to register that exporter in settings.py:
FEED_EXPORTERS = {
    'md': 'BlogRipper.exporters.MarkdownItemExporter',
}
6. Run!
Everything is ready; the -t md option selects our Markdown exporter (via FEED_EXPORTERS) and -o sets the output file:
scrapy crawl ericlippert -t md -o ericlippert.md
Here is the resulting log (with LOG_LEVEL='INFO'):
13:58:05 [scrapy] INFO: Scrapy 0.20.2 started (bot: BlogRipper)
13:58:05 [ericlippert] INFO: Spider opened
13:58:34 [ericlippert] INFO: Closing spider (finished)
13:58:34 [ericlippert] INFO: Stored md feed (783 items) in: ericlippert.md
13:58:34 [ericlippert] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12500,
'downloader/request_count': 32,
'downloader/request_method_count/GET': 32,
'downloader/response_bytes': 740052,
'downloader/response_count': 32,
'downloader/response_status_count/200': 32,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 12, 13, 12, 58, 34, 83028),
'item_scraped_count': 783,
'log_count/INFO': 4,
'request_depth_max': 31,
'response_received_count': 32,
'scheduler/dequeued': 32,
'scheduler/dequeued/memory': 32,
'scheduler/enqueued': 32,
'scheduler/enqueued/memory': 32,
'start_time': datetime.datetime(2013, 12, 13, 12, 58, 5, 805472)}
13:58:34 [ericlippert] INFO: Spider closed (finished)
7. Results
I ran the spider on several blogs; the results, together with the complete project, are available on GitHub: https://github.com/bblanchon/BlogRipper/