How to scrape JSON response with Scrapy using the SelectJmes processor

Inspiration

Well, I’ve just started using Scrapy as a framework for my data scraping projects, and one of my first challenges – amongst others – was to extract specific data from a JSON response.

At first I searched the web for best practices, but I could hardly find any articles on this common issue. So I had to come up with my own approach, and that’s what I’ll share with you in this post.

Findings

As I wanted to populate my Scrapy Item using the built-in ItemLoader, I first browsed the official Scrapy Item Loaders documentation, and guess what, I found something interesting and potentially useful called SelectJmes, a processor that queries a value from its input using the provided JMESPath expression.

It does require you to install jmespath and to know how to construct JMESPath expressions, but it also makes extracting data more flexible and robust. To find out more, you should visit the official JMESPath tutorial.
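
Just to get a feel for the query language before we wire it into Scrapy, here is a minimal sketch of how an expression is evaluated against plain Python data with the jmespath package (the sample data is a trimmed-down version of the user object we will scrape later in this post):

import jmespath

# a trimmed-down user object, similar to what the API returns
user = {
    "id": 1,
    "name": "Leanne Graham",
    "address": {"city": "Gwenborough", "zipcode": "92998-3874"},
    "company": {"name": "Romaguera-Crona"},
}

# a dotted path walks into nested objects
print(jmespath.search('company.name', user))  # Romaguera-Crona

# a multiselect-list collects several values into a Python list
print(jmespath.search('address.["zipcode", "city"]', user))  # ['92998-3874', 'Gwenborough']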

Okay, let’s set up our project and give it a chance, play around a bit and see how we can utilize it.

Setting up

If you want to practice along instead of just reading, you will need the following tools:

  • Python 3.x – I’m using 3.5.2 for this tutorial
  • Scrapy – version 1.4.0 at the moment
  • jmespath – a nice query language for JSON data
  • virtualenv [optional] – virtual environment wrapper for Python applications

Otherwise you can grab the complete code from here: https://github.com/roboostify/jmes-scrapy.

First of all, I encourage you to use virtualenv whenever you can: it helps you create a separate environment that you can tailor to your needs, and no root privileges are needed to run your scripts within a virtual environment.

Let’s open up your favourite terminal, create a project folder and cd into it.

mkdir jmes-scrapy
cd jmes-scrapy/

Create a virtual environment [optional]

Now, if you don’t like the idea of using a virtual environment you can skip the following; otherwise:

sudo apt-get install virtualenv
virtualenv venv -p python3
source venv/bin/activate

I’m using Ubuntu 16.04 and it comes with Python 2.7.12 as the default interpreter, so to tell virtualenv what kind of Python environment it should set up I need the -p python3 flag, where python3 is the name of the Python 3.x executable.

As you can observe, a new directory called venv is created with all the neat stuff needed to run your Python scripts inside.

With the source venv/bin/activate command you can activate the virtual environment. You can test that the desired version got set up by checking the version of the Python interpreter, like python --version.
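
Inside the activated environment this should report the Python 3 version we asked for, something like:

python --version
# Python 3.5.2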

To deactivate the environment just type deactivate.

That’s it, this way you get a separate playground for your project where you can mess around.

Install the dependencies

For our project we will need Scrapy for the data scraping, and jmespath to extract the data from JSON as we discussed earlier. Let’s install them through pip:

pip install Scrapy
pip install jmespath
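
You can also quickly check that Scrapy landed in the environment by asking for its version:

scrapy version
# Scrapy 1.4.0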

Fortunately we only have a few dependencies, so we can continue with setting up a Scrapy project. But first we need to find something to scrape, preferably something that sends us a JSON response.

What to scrape?

Well, I browsed the list of public JSON APIs and I found this gem JSONPlaceholder.

It seems really neat and simple, so we can use it for our purposes – and I hope it will prove to be future-proof as well.

You can choose between several endpoints, but in this tutorial I’m gonna be using the one you can find at jsonplaceholder.typicode.com/users.

It will provide us a list of JSON objects like this one:

  {
    "id": 1,
    "name": "Leanne Graham",
    "username": "Bret",
    "email": "Sincere@april.biz",
    "address": {
      "street": "Kulas Light",
      "suite": "Apt. 556",
      "city": "Gwenborough",
      "zipcode": "92998-3874",
      "geo": {
        "lat": "-37.3159",
        "lng": "81.1496"
      }
    },
    "phone": "1-770-736-8031 x56442",
    "website": "hildegard.org",
    "company": {
      "name": "Romaguera-Crona",
      "catchPhrase": "Multi-layered client-server neural-net",
      "bs": "harness real-time e-markets"
    }
  },

We don’t have to scrape all of these fields but only a few of them:

  • id
  • name
  • email
  • address
  • phone
  • company name

Set up a Scrapy project

Okay, we’ve made real progress so far. Let’s continue with the scraper project setup:

scrapy startproject jmes_scraper
cd jmes_scraper/
scrapy genspider user jsonplaceholder.typicode.com

This way Scrapy sets up a project skeleton for us that we can easily extend. We also added a spider and, for the sake of simplicity, named it user, as it will be scraping some user data.

Now you can take a look around at what’s inside your project at the moment, but don’t forget to come back, as the interesting parts are still ahead!
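
By the way, if you peek into spiders/user.py right after running genspider, the skeleton should look roughly like this (the exact boilerplate may differ slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class UserSpider(scrapy.Spider):
    name = 'user'
    allowed_domains = ['jsonplaceholder.typicode.com']
    start_urls = ['http://jsonplaceholder.typicode.com/']

    def parse(self, response):
        pass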

Before implementing our scraping logic, let’s first define the structure of our Item. To do this, open the items.py file and replace its contents with:

import scrapy


class UserItem(scrapy.Item):
    """User item definition for jsonplaceholder /users endpoint."""

    user_id = scrapy.Field()
    name = scrapy.Field()
    email = scrapy.Field()
    address = scrapy.Field()
    phone = scrapy.Field()
    company = scrapy.Field()
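
As a quick aside, a scrapy.Item behaves a lot like a dict, so you can already poke at it in a Python shell. A short hypothetical session (the exact KeyError message may vary by Scrapy version):

>>> from jmes_scraper.items import UserItem
>>> item = UserItem(name='Leanne Graham', email='Sincere@april.biz')
>>> item['name']
'Leanne Graham'
>>> item['nickname'] = 'Bret'  # fields not declared on the Item are rejected
Traceback (most recent call last):
    ...
KeyError: 'UserItem does not support field: nickname'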

Fine, we have created an item for our users, now we can move on and create the spider to scrape them.

Navigate to our spider at spiders/user.py, open it up and extend it like this:

import json

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose, SelectJmes

from jmes_scraper.items import UserItem

class UserSpider(scrapy.Spider):
    """Spider to scrape `http://jsonplaceholder.typicode.com/users`."""
    name = 'user'
    allowed_domains = ['jsonplaceholder.typicode.com']
    start_urls = ['http://jsonplaceholder.typicode.com/users/']

    # dictionary to map UserItem fields to Jmes query paths
    jmes_paths = {
        'user_id': 'id',
        'name': 'name',
        'email': 'email',
        'address': 'address.["zipcode", "city", "street", "suite"]',
        'phone': 'phone',
        'company': 'company.name',
    }

    def parse(self, response):
        """Main parse method."""
        jsonresponse = json.loads(response.body_as_unicode())

        for user in jsonresponse:

            loader = ItemLoader(item=UserItem())  # create an ItemLoader to populate a UserItem
            loader.default_input_processor = MapCompose(str)  # apply str conversion on each value
            loader.default_output_processor = Join(' ')

            for (field, path) in self.jmes_paths.items():
                loader.add_value(field, SelectJmes(path)(user))

            yield loader.load_item()

Well at first sight it’s not too verbose, but actually quite a few things are going on. Let’s break it down:

import json

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose, SelectJmes

from jmes_scraper.items import UserItem

Obviously we have our imports at the beginning. We import scrapy and its ItemLoader, the Join, MapCompose and SelectJmes processors, our previously defined UserItem and, last but not least, the built-in json module, as we will need it to convert the response body into a Python list.

    # dictionary to map UserItem fields to Jmes query paths
    jmes_paths = {
        'user_id': 'id',
        'name': 'name',
        'email': 'email',
        'address': 'address.["zipcode", "city", "street", "suite"]',
        'phone': 'phone',
        'company': 'company.name',
    }

Skipping the class boilerplate that Scrapy generated for us, we have this jmes_paths dict. It’s a mapping that pairs each UserItem field with a JMESPath expression, so we can iterate through it in a loop and pass the pairs to our ItemLoader.
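
To make the address expression less mysterious, here is what SelectJmes yields when applied to the first user of the sample response (a multiselect-list simply returns the selected values as a Python list):

from scrapy.loader.processors import SelectJmes

# one element of the parsed JSON response, trimmed to the relevant fields
user = {
    "id": 1,
    "address": {"street": "Kulas Light", "suite": "Apt. 556",
                "city": "Gwenborough", "zipcode": "92998-3874"},
    "company": {"name": "Romaguera-Crona"},
}

print(SelectJmes('address.["zipcode", "city", "street", "suite"]')(user))
# ['92998-3874', 'Gwenborough', 'Kulas Light', 'Apt. 556']

print(SelectJmes('company.name')(user))
# Romaguera-Crona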

Okay let’s get to the parse method:

jsonresponse = json.loads(response.body_as_unicode())

for user in jsonresponse:

    loader = ItemLoader(item=UserItem())  # create an ItemLoader to populate a UserItem
    loader.default_input_processor = MapCompose(str)  # apply str conversion on each value
    loader.default_output_processor = Join(' ')

    for (field, path) in self.jmes_paths.items():
        loader.add_value(field, SelectJmes(path)(user))

    yield loader.load_item()

First, using the body_as_unicode() method, we convert the body of the response to a string, as its original type is bytes, which cannot be processed by the json module.

Next, the json.loads method converts our JSON string into a Python object, in this case a list holding the dicts of our users.

Then we iterate over the list of user dictionaries of our jsonresponse.
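
In isolation, that conversion looks roughly like the snippet below (body_as_unicode() decodes using the response’s declared encoding; I’m simply assuming UTF-8 here):

import json

raw_body = b'[{"id": 1, "name": "Leanne Graham"}]'  # response.body is bytes

text = raw_body.decode('utf-8')  # roughly what response.body_as_unicode() gives us
users = json.loads(text)         # a Python list holding one dict per user

print(type(users))       # <class 'list'>
print(users[0]['name'])  # Leanne Graham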

Within the loop an ItemLoader is being instantiated using the UserItem as its model.

Here it’s important to note that the loader wraps every value it receives in a list, because processors always expect an iterable as their input.
That’s why two defaults are set on the loader: an input and an output processor. In this case the input processor converts every collected value to a string, and the output processor joins those values using a blank space as separator.
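
If you would like to see these two processors on their own, here is a tiny sketch of what they do with a list of values:

from scrapy.loader.processors import Join, MapCompose

to_str = MapCompose(str)  # input processor: applies str() to each collected value
join = Join(' ')          # output processor: joins the collected values into one string

values = to_str([1, 'Gwenborough', -37.3159])
print(values)        # ['1', 'Gwenborough', '-37.3159']

print(join(values))  # 1 Gwenborough -37.3159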

After these settings, another for loop takes the key-value pairs from the jmes_paths dictionary and extracts the corresponding values from the user dictionary by applying the SelectJmes processor to it.
Then the loader.add_value method passes each queried value to the item.

Finally the loader.load_item() method returns the populated item that we can yield.
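
Putting it all together, for the first user of the sample response the loaded item should end up looking roughly like this (note that every value is now a string, and the four address parts are joined by spaces in the order given by the JMESPath expression):

{
    'user_id': '1',
    'name': 'Leanne Graham',
    'email': 'Sincere@april.biz',
    'address': '92998-3874 Gwenborough Kulas Light Apt. 556',
    'phone': '1-770-736-8031 x56442',
    'company': 'Romaguera-Crona',
}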

Run the spider

To run the spider, go to your terminal, navigate to the root directory of your project and start it like this:

scrapy crawl user -o users.csv

I also appended the -o flag to the command, so at the end of the crawl you will find a users.csv file in the directory where you executed the command, populated with the data of our UserItems.
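
By the way, Scrapy’s feed exporter picks the output format from the file extension, so if you would rather keep the data as JSON you can simply run:

scrapy crawl user -o users.json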

That’s it, we’re done!

Final words

Well, you’ve now seen one approach to scraping JSON endpoints, utilizing the built-in SelectJmes processor together with a Scrapy Item and ItemLoader.

I’ve only touched the tip of the iceberg here, but I’ve found this solution useful in my projects so far. Please feel free to share your thoughts and ideas to help improve it for others!