by Pravin Paratey

Writing a spider in 10 mins using Scrapy

I came across Scrapy a few days ago and have grown to really love it. This tutorial shows how to write a simple spider with Scrapy to scrape product data from the Paul Smith website — all in 10 minutes.

This tutorial is available on [github](http://github.com/pravin/scrapy-tutorial). Please refer to it for code updates that accommodate periodic changes to the Paul Smith website.

Let's begin

  1. [Download](http://scrapy.org/download/) and install scrapy and its dependencies.
  2. Once that's done, open your terminal and type `python scrapy-ctl.py startproject paul_smith`. This creates a new Scrapy project.
  3. Navigate to `~/paul_smith/paul_smith/spiders` and create the file `paul_smith.py` with the following contents:

    from scrapy.spider import BaseSpider
    
    class PaulSmithSpider(BaseSpider):
      # The spider's unique name; also used to invoke it from the command line
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
      
      def parse(self, response):
        # For now, just dump the fetched page to a file
        open('paulsmith.html', 'wb').write(response.body)
        
    SPIDER = PaulSmithSpider()
    
  4. To run the spider, go to `~/paul_smith` and type `python scrapy-ctl.py crawl paulsmith.co.uk` on the command line. This fetches the page and saves it to `paulsmith.html`.
  5. The next step is to parse the contents of the page. Open the saved page in your favourite editor and look for a pattern around the items we want to capture. You can see that each `<div class="product-group-1">` contains the required information. We are going to modify our code like so:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    
    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
      
      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each product lives inside a <div class="product-group-1"> block
        sites = hxs.select('//div[@class="product-group-1"]')
        for site in sites:
          print site.extract()
    
    SPIDER = PaulSmithSpider()
    

    You can read more on XPath Selectors [here](http://doc.scrapy.org/topics/selectors.html).
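    To see what an expression like `//div[@class="product-group-1"]` actually matches, here is a minimal standalone sketch. It uses Python's standard-library `xml.etree.ElementTree` (which supports a small XPath subset) instead of Scrapy, and a made-up, well-formed HTML fragment standing in for the real category page:

    ```python
    import xml.etree.ElementTree as ET

    # Hypothetical fragment mimicking the structure of the category page
    html = """
    <html><body>
      <div class="product-group-1">
        <a href="/jeans/slim-fit.html"><img src="/img/slim-fit.jpg" /></a>
        <h3 class="desc">Slim Fit Jeans</h3>
        <p class="price price-GBP">100.00</p>
      </div>
      <div class="sidebar">not a product</div>
    </body></html>
    """

    root = ET.fromstring(html)
    # Mirrors //div[@class="product-group-1"] from the spider above:
    # select only divs whose class attribute is exactly "product-group-1"
    products = root.findall(".//div[@class='product-group-1']")
    for div in products:
        print(div.find('h3').text)
    ```

    The attribute predicate is what keeps the sidebar `div` out of the results — only the product blocks match.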

  6. Finally, looking at the HTML again, we can extract the title, link, image source and sale price like so:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import random
    
    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]
      
      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product-group-1"]')
        # Shuffle so the sample output below shows a few random items
        random.shuffle(sites)
        for site in sites:
          # Each select().extract() returns a list of matching strings
          title = site.select('h3[@class="desc"]/text()').extract()
          hlink = site.select('a/@href').extract()
          price = site.select('p[@class="price price-GBP"]/text()').extract()
          image = site.select('a/img/@src').extract()
    
          print title, hlink, image, price
    
    SPIDER = PaulSmithSpider()
    

    You can save this data to your datastore in whatever way you wish.
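    As one example, here is a minimal sketch of writing the results to a CSV file using only the standard-library `csv` module. The field values are hypothetical; the point is that `extract()` returns a *list* of strings (possibly empty if the XPath matched nothing), so each field is flattened to its first element before writing:

    ```python
    import csv

    # Hypothetical rows as the spider would collect them: each field is
    # a list of strings, possibly empty when the XPath found no match
    scraped = [
        (["Slim Fit Jeans"], ["/jeans/slim-fit.html"], ["85.00"], ["/img/slim.jpg"]),
        (["Straight Leg Jeans"], ["/jeans/straight.html"], [], ["/img/straight.jpg"]),
    ]

    def first(values, default=""):
        # extract() returns a list; keep the first match or fall back to a default
        return values[0] if values else default

    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "link", "price", "image"])
        for title, hlink, price, image in scraped:
            writer.writerow([first(title), first(hlink), first(price), first(image)])
    ```

    The same flattening step applies whatever datastore you choose — a database insert would take `first(title)` and friends just as the CSV writer does.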

  7. The output for 3 randomly chosen items scraped with the above code is shown below.

Output

Source code