Tuesday, June 30, 2015

Moved blog to: http://eliasdorneles.github.io

The content of this blog is now at: http://eliasdorneles.github.io.

Those of you who use a feed reader, the new address is: http://eliasdorneles.github.io/feeds/all.atom.xml

See you there!

I have migrated the content of this blog to: http://eliasdorneles.github.io.

For those who use a feed reader, the new address is: http://eliasdorneles.github.io/feeds/all.atom.xml

See you there!

Wednesday, January 28, 2015

Useful small scripts for your ~/bin


So you are a command line geek, you do your shell-scripts and one-liners using bash to get information you need.

You use cut for handling things delimited by some character like comma or space, but sometimes you have a list of stuff separated by a varied amount of spaces, like this:

$ cat myfile.txt
Amelia       555-5553     amelia.zodiacusque@gmail.com
Julie        555-6699     julie.perscrutabor@skeeve.com

This form of vertical alignment is cool because you can skim through any column very quickly, but it makes your life a little bit harder when processing it in the command line. Well, maybe you're smiling to yourself because you already know about awk, and you use it all the time for this kind of stuff doing something like:

$ cat myfile.txt | awk '{ print $3 }'

That's all fine, awk is a great tool to grok, but typing all that just to get a field out is a bit annoying. So I've written a little script to make that even easier: just download it and put it somewhere in your $PATH. After that, the next time you find yourself in this situation you'll do:

$ cat myfile.txt | fields 1
Amelia
Julie
$ cat myfile.txt | fields 2
555-5553
555-6699
$ cat myfile.txt | fields 2 3
555-5553 amelia.zodiacusque@gmail.com
555-6699 julie.perscrutabor@skeeve.com
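
The fields script itself is not reproduced in this post. To give a rough idea of what such a tool does, here is a minimal Python 3 sketch. The function name select_fields, the 1-based indexes and the decision to silently skip missing fields are all assumptions modelled on the awk one-liner, not the actual script:

```python
def select_fields(line, indexes):
    """Return the requested 1-based fields of `line`, joined by one space.

    Hypothetical helper: indexes that fall outside the line are skipped.
    """
    parts = line.split()  # splits on any run of whitespace, like awk does
    return " ".join(parts[i - 1] for i in indexes if 0 < i <= len(parts))


lines = [
    "Amelia       555-5553     amelia.zodiacusque@gmail.com",
    "Julie        555-6699     julie.perscrutabor@skeeve.com",
]
for line in lines:
    print(select_fields(line, [2, 3]))
```

A real command-line version would read the lines from stdin and take the indexes from its arguments.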

Another thing you may find yourself reaching for awk to do is summing numbers. Often when you need a summation you will also need some more stuff, and then it might be time to start writing a Python/Ruby/Perl script instead of hacking on the command line. But I've found myself wanting just a simple sum often enough to justify putting the little awk one-liner into a script of its own:

$ cat ~/bin/total_sum
# Report sum of numbers fed to the stdin

awk '{ total+=$1 } END { print total }'
$ echo -e "1\n2\n3\n4"
1
2
3
4
$ echo -e "1\n2\n3\n4" | total_sum
10

If you find yourself having to count digits from a long number representing the size in bytes of some big file, you're not alone.

I got tired of this and wrote a Python script to humanize these numbers (download here), shamelessly stealing the naturalsize function from the humanize Python library.

Look how pleasant it is to use it:

$ humanize 32432
32.4 kB
$ echo -e "10\n1200\n54356\n3123342\n3294384948" > some_file.txt
$ cat some_file.txt
$ cat some_file.txt | humanize
10 Bytes
1.2 kB
54.4 kB
3.1 MB
3.3 GB
$ cat some_file.txt | humanize --binary
10 Bytes
1.2 KiB
53.1 KiB
3.0 MiB
3.1 GiB
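
If you are curious about how the formatting works, here is a simplified Python 3 sketch in the spirit of naturalsize. It is not the humanize library's code, only an illustration of the idea: pick the largest unit that keeps the number readable, using base 1000 by default and base 1024 for the binary suffixes.

```python
def naturalsize(value, binary=False):
    """Format a byte count for humans (simplified sketch, not the library's code)."""
    base = 1024 if binary else 1000
    suffixes = ("KiB", "MiB", "GiB", "TiB") if binary else ("kB", "MB", "GB", "TB")
    size = float(value)
    if size == 1:
        return "1 Byte"
    if size < base:
        return "%d Bytes" % size
    for exponent, suffix in enumerate(suffixes, start=2):
        unit = base ** exponent
        if size < unit:
            break
    # `unit` is one step above the chosen suffix, hence the multiplication by base
    return "%.1f %s" % (base * size / unit, suffix)


for number in (10, 1200, 54356, 3123342, 3294384948):
    print(naturalsize(number), "|", naturalsize(number, binary=True))
```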

That's all I had, folks! :)

Sunday, January 18, 2015

Things I Learned from Destroy All Software - Season 1

So, a few weeks ago I purchased and watched the first season of the Destroy All Software screencasts (from Gary Bernhardt), and it was awesome. I'd say there are different kinds of things to learn from it, depending on your personal interests and experience.

Here are my notes for some things I found useful and want to remember for later.

About Git:

You can use the  git rev-list HEAD  command to get a list of commits in the current branch. This is useful for writing scripts to report something about each commit in the repo. You can, for example, check the evolution of line counts over the course of the project (or any other statistic like number of tests, number of files, etc). You can also run the tests for every commit in the history:

git rev-list HEAD | while read rev; do
   git checkout "$rev" && git clean -fd && make test;
done

Gary has a script ready for this: run-command-on-git-revisions

Git tracks everything that happens in the local repo, even stuff that is not shared when you push (like when you rewrite history doing a rebase). You can use  git reflog  to see the history of local changes and  git reset --hard REFLOG_ENTRY  to go back to where you were.

About design:

Avoiding nil is good because it makes your code more predictable and tracebacks more understandable.

When adding tests to a suite, it's important to pay attention to your stubs. If they are getting complicated, the design can probably be improved in a way that will yield better tests and better production code.

When adding extra functions to a 3rd party API, it's sometimes tempting to monkey-patch the library for a small change. It's usually better to use a wrapper instead, because later you will probably need more changes, and those will be easier if your production code is already using a wrapper.
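
To make the wrapper idea concrete, here is a small Python 3 sketch. Everything in it is made up for illustration: ThirdPartyClient stands in for a library class you don't control, and fetch_value is the extra function you might otherwise have been tempted to monkey-patch in.

```python
class ThirdPartyClient:
    """Stand-in for a library class we do not control (hypothetical)."""

    def fetch(self, key):
        return {"key": key, "value": key.upper()}


class ClientWrapper:
    """Our own entry point: production code talks to this, not to the library."""

    def __init__(self, client):
        self._client = client

    def fetch(self, key):
        return self._client.fetch(key)

    def fetch_value(self, key):
        # The extra behaviour lives here instead of being patched into the library.
        return self.fetch(key)["value"]


wrapper = ClientWrapper(ThirdPartyClient())
print(wrapper.fetch_value("abc"))
```

The library class stays untouched, and any later change has a single obvious place to go.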

Isolated tests are good because they run faster and encourage better design and code clarity. This is something I've read about repeatedly in the past and also applied to some of the code I've written myself, so it's not that new. But it was great to watch someone applying this to several different code bases, showing some interesting paths -- it inspired me to apply a bit more of this in my work.

About Unix:

To bring a background process to foreground, besides fg, you can also use %N where N is the job number reported by the command jobs.

You can use the output of a command as if it were a filename, using <(COMMAND), like so:

$ diff <(echo) <(date)
> Sat Jan 17 18:47:48 BRST 2015

The shell runs the commands and passes the program file descriptors with the proper contents, which you can verify by doing:

$ echo <(ls -l) <(ls -l)
/dev/fd/63 /dev/fd/62

This is useful for commands like diff that need more than one input file (and therefore can't just use the stdin), to use it with arbitrary stuff generated from other commands.

I found this especially helpful when I want to refactor a script: before I make changes, I store its original output in a file, and then while refactoring I keep comparing the output of the script with the file to make sure I'm not breaking anything:

diff <(./my_script) original.txt

It can also be useful to see the differences between two versions of the same web application, or between two sites:

diff <(curl -s http://somesite.com)  <(curl -s http://anothersite.com)

About vim:

It's nice to see people using vim well, because I can compare with my own habits and see things where I can improve. Gary seems to strive to keep a tight feedback loop in everything he's doing, so seeing him using vim and the shell to build things in this fashion is pretty cool. I see it as a nice validation for the choice of tools.

I found myself using splits a bit more after watching the screencasts. I'm more used to vim tabs, mostly because I'm not very comfortable with stuff happening outside my view. This is probably something I can get better at if I grow some tolerance for it.

Apart from this, I stole a bunch of vim functions and ideas from Gary's dotfiles, which I also share in my dotfiles.

Sunday, October 12, 2014

The Visual Display of Quantitative Information, or How to Make Better Graphs

The Visual Display of Quantitative Information, by Edward Tufte, is a beautiful book. It's not just well-written, it's really beautiful: you feel like every inch of the book was planned and designed with great care. The book makes the case for better data graphics and shows you several examples of great graphics (some of them published centuries ago), plus some bad examples and how to improve them.

After reading it, I feel much more prepared to create graphs and choose better visualizations for the different kinds of data that may end up on my lap. Also, while reading the book, I got a lot of ideas for new things to try. I don't know if I'll ever be able to actually implement them, but it's been refreshing anyway.

I've put down a few notes from the book just to whet your appetite, so here you are.

Good graphics tell a story

Data graphics are not about the aesthetic sensibility of the artist who created them, nor about making boring data a bit more fun. Great data graphics tell you a story about something, communicate complex ideas clearly, and make you wonder about the data on display. There is no need for distracting decorations.

That's also why good graphics are often about multivariate and complex data, bringing new ways to look at it, enabling you to make comparisons and reason about it.


Good graphics don't lie

It's all about conveying precise information, so any trick that distracts the viewer from the truth is a bad idea. It's not much different from the written word, after all.

Therefore, when constructing data graphics, make the physical representation of numbers on paper or on the screen directly proportional to the quantities they represent. Do not bend the rules in a way that may distort the data and induce erroneous comparisons. Do not use 2D graphics for 1D data. Finally, do not quote data out of context: show the full history of the measurements and take inflation into account when showing money over time.

Above all else show the data

Data-ink ratio is the proportion of a graphic's ink devoted to the non-redundant display of data-information. Namely, it represents the parts of a graphic that cannot be erased without loss of information.

The process of creating a great data graphic involves maximizing the data-ink ratio, reducing all the non-relevant information. The folks at DarkHorse Analytics have done a good demonstration of this in their neat Remove to Improve slideshows (read more about it on their blog).

Friendly data graphics are accessible

And this does not mean that you should "dumb down" the graphic to make it more accessible, but that you need to have the viewer in mind while constructing it.

Therefore, spell out words instead of using abbreviations; annotate the graphic with helpful little messages instead of requiring elaborate legends; use colors in a way that color-deficient people can also make sense of the graphic (tip: use red-blue instead of red-green for contrast); and use clear, precise and modest typefaces, in upper-and-lower case and with serifs.

Liked it? Buy the book and read it, it's worth it.

Thanks Paul, for the great book recommendation. =)

Saturday, August 30, 2014

Web Scraping with Scrapy - first steps

Imagine you want to extract content from the Web that isn't all in only one page: you need a way to navigate through the site to get to the pages that contain the useful information. For example, maybe you want to get the latest "big questions" articles of the Mental Floss website, but only those in the Origins and Fact Check categories.


If you have an interest in Python and web scraping, you may have already played with the nice requests library to get content of pages from the Web. Maybe you have toyed around using BeautifulSoup or lxml to make the content extraction easier. Well, now we are going to show you how to use the Scrapy framework, which has all these functionalities and many more, so that solving the sort of problem we introduced above is a walk in the park.

It is worth noting that Scrapy tries not only to solve the content extraction (called scraping), but also the navigation to the relevant pages for the extraction (called crawling). To achieve that, a core concept in the framework is the Spider -- in practice, a Python object with a few special features, for which you write the code and the framework is responsible for triggering it.

Just so that you have an idea of what it looks like, take a peek at the little program below, which uses Scrapy to extract some information (link, title and number of views) from a YouTube channel. Don't worry about understanding this code yet; we're just showing it here so that you get a feel for what code using Scrapy looks like. By the end of this tutorial, you'll be able to understand and write programs like this one. =)

import scrapy
from scrapy.contrib.loader import ItemLoader

class YoutubeVideo(scrapy.Item):
   link = scrapy.Field()
   title = scrapy.Field()
   views = scrapy.Field()

class YoutubeChannelLister(scrapy.Spider):
   name = 'youtube-channel-lister'
   youtube_channel = 'LongboardUK'
   start_urls = ['https://www.youtube.com/user/%s/videos' % youtube_channel]

   def parse(self, response):
       for sel in response.css("ul#channels-browse-content-grid > li"):
           loader = ItemLoader(YoutubeVideo(), selector=sel)

           loader.add_xpath('link', './/h3/a/@href')
           loader.add_xpath('title', './/h3/a/text()')
           loader.add_xpath('views', ".//ul/li[1]/text()")

           yield loader.load_item()

Before we talk more about Scrapy, make sure you have the latest version installed using the command (depending on your environment, you may need to use sudo or the --user option for pip install):

pip install --upgrade scrapy

Note: depending on your Python environment, the installation may be a bit tricky because of the dependency on Twisted. If you use Windows, check out the specific instructions in the official installation guide. If you use a Debian-based Linux distro, you may want to use the official Scrapy APT repository.

To be able to follow this tutorial, you'll need Scrapy version 0.24 or above. You can check your installed Scrapy version using the command:

python -c 'import scrapy; print("%s.%s.%s" % scrapy.version_info)'

The output of this command in the environment we used for this tutorial is like this:

$ python -c 'import scrapy; print("%s.%s.%s" % scrapy.version_info)'

The anatomy of a spider

A Scrapy spider is responsible for defining how to follow the links "navigating" through a website (that's the so-called crawling part) and how to extract the information from the pages into Python data structures.

To define a minimal spider, create a class extending scrapy.Spider and give it a name using the name attribute:

import scrapy

class MinimalSpider(scrapy.Spider):
   """The smallest Scrapy-Spider in the world!"""
   name = 'minimal'

Put this in a file with the name minimal.py and run your spider to check if everything is okay, using the command:

scrapy runspider minimal.py

If everything is fine, you'll see on the screen some log messages marked as INFO and DEBUG. If there is any message marked as ERROR, it means that something is wrong and you need to check your spider code for errors.

The life of a spider starts with the generation of HTTP requests (Request objects) that set the framework engine in motion. The part of the spider responsible for this is the start_requests() method, which returns an iterable with the first requests to be made for the spider.

Adding this element to our minimal spider, we have:

import scrapy

class MinimalSpider(scrapy.Spider):
   """The smallest Scrapy-Spider of the world, maybe"""
   name = 'minimal'

   def start_requests(self):
       return [scrapy.Request(url)
               for url in ['http://www.google.com', 'http://www.yahoo.com']]

The start_requests() method must return an iterable of scrapy.Request objects, which represent an HTTP request to be made by the framework (these contain data like URL, parameters, cookies, etc) and define a function to be called when the request is complete -- a callback.

Note: if you are familiar with implementing AJAX in JavaScript, this way of working, dispatching requests and registering callbacks, may sound familiar.

In our example, we return a simple list of requests to the Google and Yahoo websites, but the start_requests() method could also be implemented as a Python generator.

If you have tried to execute the example as it is now, you may have noticed that something is still missing, because Scrapy will show two messages marked as ERROR, complaining that a method was not implemented:

 File "/home/elias/.virtualenvs/scrapy/local/lib/python2.7/site-packages/scrapy/spider.py", line 56, in parse
   raise NotImplementedError

This happens because, as we didn't register a callback for the Request objects, Scrapy tried to call the default callback, which is the parse() method of the Spider object. Let's add this method to our minimal spider, so that we can execute it:

import scrapy

class MinimalSpider(scrapy.Spider):
   """The 2nd smallest Scrapy-Spider of the world!"""
   name = 'minimal'

   def start_requests(self):
       return (scrapy.Request(url)
               for url in ['http://www.google.com', 'http://www.yahoo.com'])

   def parse(self, response):
       self.log('GETTING URL: %s' % response.url)

Now, when you execute it using the command: scrapy runspider minimal.py you should see something like this in the output:

2014-07-26 15:39:56-0300 [minimal] DEBUG: Crawled (200) <GET http://www.google.com.br/?gfe_rd=cr&ei=_PXTU8f6N4mc8Aas1YDABA> (referer: None)
2014-07-26 15:39:56-0300 [minimal] DEBUG: GETTING URL: http://www.google.com.br/?gfe_rd=cr&ei=_PXTU8f6N4mc8Aas1YDABA
2014-07-26 15:39:57-0300 [minimal] DEBUG: Redirecting (302) to <GET https://br.yahoo.com/?p=us> from <GET https://www.yahoo.com/>
2014-07-26 15:39:58-0300 [minimal] DEBUG: Crawled (200) <GET https://br.yahoo.com/?p=us> (referer: None)
2014-07-26 15:39:58-0300 [minimal] DEBUG: GETTING URL: https://br.yahoo.com/?p=us

To make our code even cleaner, we can take advantage of the default implementation of start_requests(): if you don't define it, Scrapy will create requests for a list of URLs in the attribute named start_urls -- the same kind of thing we're doing above. So, we'll keep the same functionality and reduce the code, using:

import scrapy

class MinimalSpider(scrapy.Spider):
   """The smallest Scrapy-Spider in the world!"""
   name = 'minimal'
   start_urls = [
       'http://www.google.com',
       'http://www.yahoo.com',
   ]

   def parse(self, response):
       self.log('GETTING URL: %s' % response.url)

Like in the parse() method shown above, every callback gets the content of the HTTP response as an argument (in a Response object). So, inside this callback, where we already have the content of the page, that's where we'll do the information extraction, i.e., the data scraping itself.


Callbacks, Requests & Items
Functions registered as callbacks for the requests can return an iterable of objects, in which every object can be:

  • an instance of a subclass of scrapy.Item, which you define to contain the data to be collected from the page
  • an object of type scrapy.Request representing yet another request to be made (possibly registering another callback)

With this mechanism of requests and callbacks that may generate new requests (with new callbacks), you can program the navigation through a site by generating requests for the links to be followed, until you get to the pages that contain the items you're interested in. For example, for a spider that needs to extract products from the website of an online store by navigating through its categories, you could use a structure like the following:

import scrapy

class SkeletonSpider(scrapy.Spider):
   name = 'spider-mummy'
   start_urls = ['http://www.some-online-webstore.com']

   def parse(self, response):
       for c in [...]:
           url_category = ...
           yield scrapy.Request(url_category, self.parse_category_page)

   def parse_category_page(self, response):
       for p in [...]:
           url_product = ...
           yield scrapy.Request(url_product, self.parse_product)

   def parse_product(self, response):
       ...  # extract and yield the item with the product data here

In the above structure, the default callback -- parse() method -- handles the response of the first request to the online store website and generates new requests for the pages of the categories, registering another callback to handle them -- the parse_category_page() method. This last method does something similar, generating the requests for the product pages, this time registering a callback that extracts the item objects with the product data.

Why do I need to define classes for the items?

Scrapy proposes that you create a few classes that represent the items you intend to extract from the pages. For example, if you want to extract the prices and details of products from an online store, you could use a class like the following:

import scrapy

class Product(scrapy.Item):
   description = scrapy.Field()
   price = scrapy.Field()
   brand = scrapy.Field()
   category = scrapy.Field()

As you can see, the item classes are just subclasses of scrapy.Item to which you add the desired fields (instances of the class scrapy.Field). You can then use an instance of this class as if it were a Python dictionary:

>>> p = Product()
>>> p['price'] = 13
>>> print p
{'price': 13}

The biggest difference from a traditional dictionary is that an Item by default does not allow you to assign a value to a key that was not declared as a field:

>>> p['silly_walk'] = 54
KeyError: 'Product does not support field: silly_walk'
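
To see why a dictionary-like object might refuse unknown keys, here is a tiny Python 3 imitation. It is not how Scrapy implements Item; StrictItem and its fields tuple are hypothetical names used only to show the idea of checking every key against a declared list.

```python
class StrictItem(dict):
    """Dictionary that only accepts the keys listed in `fields` (illustrative only)."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        super().__setitem__(key, value)


class Product(StrictItem):
    fields = ("description", "price", "brand", "category")


p = Product()
p["price"] = 13
print(p)

try:
    p["silly_walk"] = 54
except KeyError as error:
    print("rejected:", error)
```

A typo in a field name fails loudly at the point of assignment instead of producing an item with a silently misspelled key.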

The advantage of defining classes for items is that it lets you use other features of the framework that work with these classes. For example, you can use the feed exports mechanism to export the collected items to JSON, CSV, XML, etc. You can also exploit the item pipeline feature, which allows you to plug in other processing on top of the collected items (things like validating the extracted data, removing duplicated items, storing in a database, etc).

Now, let's do some scraping!
To do the scraping itself, i.e., extracting the data from the page, it's nice if you know XPath, a language created for doing queries in XML content which is core to the selectors mechanism of the framework. If you don't know XPath, you can use CSS selectors in Scrapy just as well. We encourage you to learn some XPath nevertheless, because it allows for expressions much more powerful than just CSS (in fact, the CSS functionality in Scrapy works by converting your CSS expressions to XPath expressions). We'll put some links to useful resources about these at the end of the article.

So, you can test the result of XPath or CSS expressions for a page using the Scrapy shell. Run the command:

scrapy shell http://stackoverflow.com

This command makes a request to the given URL and opens a Python shell (or IPython, if you have it installed), making some objects available for you to explore. The most important object is the variable response, which contains the response of the HTTP request and corresponds to the response argument received by the callbacks.


>>> response.url
>>> response.headers
{'Cache-Control': 'public, no-cache="Set-Cookie", max-age=49',
'Content-Type': 'text/html; charset=utf-8',
'Date': 'Sat, 09 Aug 2014 03:47:31 GMT',
'Expires': 'Sat, 09 Aug 2014 03:48:20 GMT',
'Last-Modified': 'Sat, 09 Aug 2014 03:47:20 GMT',
'Set-Cookie': 'prov=5a8741f7-7ee3-4993-b723-72142d48696c; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly',
'Vary': '*',
'X-Frame-Options': 'SAMEORIGIN'}

You can use the xpath() and css() methods of the response object to query the HTML content in the response:

>>> response.xpath('//title') # gets the title via XPath
[<Selector xpath='//title' data=u'<title>Stack Overflow</title>'>]
>>> response.css('title') # gets the title via CSS
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Stack Overflow</title>'>]
>>> len(response.css('div')) # counts the number of div elements

The result of calling one of these methods is a list object containing selector objects resulting from the query. This list object has an extract() method which extracts the HTML content from all the selectors together. The selectors, on the other hand, besides having their own extract() method to extract their content, also have xpath() and css() methods that you can use to do new queries in the scope of each selector.

Take a look at the examples below, run in the same Scrapy shell, which should help clear things up a little bit.

Extracts HTML content from <title> element, calling the extract() method from the selector list (note that the result is a Python list):

>>> response.xpath('//title').extract()
[u'<title>Stack Overflow</title>']

Stores the first selector of the result in a variable and calls the extract() method on the selector (see how the result now is just a string):

>>> title_sel = response.xpath('//title')[0]
>>> title_sel.extract()
u'<title>Stack Overflow</title>'

Applies the XPath expression text() to get the text content of the selector, and calls the extract() method from the resulting list:

>>> title_sel.xpath('text()').extract()
[u'Stack Overflow']

Prints the extraction of the first selector resulting from the XPath expression text() applied to the selector in the variable title_sel:

>>> print title_sel.xpath('text()')[0].extract()
Stack Overflow

Well, once you have a good grip on this way of working with selectors, the simple way to extract an item is just to create an instance of the desired Item class and fill it with the values obtained using this selectors API.

Here, take a look at the code of a spider using this technique to get the most frequently asked questions of StackOverflow:

import scrapy
import urlparse

class Question(scrapy.Item):
   link = scrapy.Field()
   title = scrapy.Field()
   excerpt = scrapy.Field()
   tags = scrapy.Field()

class StackoverflowTopQuestionsSpider(scrapy.Spider):
   name = 'so-top-questions'

   def __init__(self, tag=None):
       questions_url = 'http://stackoverflow.com/questions'
       if tag:
           questions_url += '/tagged/%s' % tag

       self.start_urls = [questions_url + '?sort=frequent']

   def parse(self, response):
       build_full_url = lambda link: urlparse.urljoin(response.url, link)

       for qsel in response.css("#questions > div"):
           it = Question()

           it['link'] = build_full_url(
               qsel.css('.summary h3 > a').xpath('@href')[0].extract())
           it['title'] = qsel.css('.summary h3 > a::text')[0].extract()
           it['tags'] = qsel.css('a.post-tag::text').extract()
           it['excerpt'] = qsel.css('div.excerpt::text')[0].extract()

           yield it

As you can see, the spider defines an Item class named Question, and uses the Selectors API to iterate through the HTML elements of the questions (obtained with the CSS selector #questions > div) and generate a Question object for each one of these elements, filling in all the fields (link, title, tags and question excerpt).

There are two interesting things worth noticing in the extraction done in the parse() callback: the first one is that we use a pseudo-selector ::text to get the text content of the elements, avoiding the HTML tags. The second is how we use the function urlparse.urljoin() to combine the URL of the request with the content of the href attribute, making sure that the result of this will be a correct absolute URL.
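
A quick way to convince yourself of what urljoin() does is to try it outside the spider. The sketch below is written for Python 3, where the function lives in urllib.parse instead of the urlparse module used in this article; the href values are made-up examples.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

page_url = "http://stackoverflow.com/questions?sort=frequent"

# A site-relative href replaces the path and query of the page URL.
print(urljoin(page_url, "/questions/12345/some-title"))

# An href that is already absolute is returned unchanged.
print(urljoin(page_url, "http://example.com/elsewhere"))
```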

Put this code in a file named top_asked_so_questions.py and run it using the command:

scrapy runspider top_asked_so_questions.py -o questions.json

If everything went well, Scrapy will show the scraped items on the screen and also write a file named questions.json containing them. At the end of the output, you should see some stats, including the count of scraped items:

2014-08-02 14:27:37-0300 [so-top-questions] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 242,
'downloader/request_count': 1,
'item_scraped_count': 50,
'log_count/DEBUG': 53,
'log_count/INFO': 8,
'start_time': datetime.datetime(2014, 8, 2, 17, 27, 36, 912002)}
2014-08-02 14:27:37-0300 [so-top-questions] INFO: Spider closed (finished)

Note: if you run this twice in a row, you need to remove the output file questions.json before each run. By default Scrapy appends to the output file instead of overwriting it, which leaves the JSON file unusable. This behavior exists for historical reasons: it made sense for spiders that used the JSON Lines format (the previous default), and it may change in the future.
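
The reason appending breaks a JSON file but not a JSON Lines file can be shown with the standard json module alone. This Python 3 sketch simulates the two runs with made-up data; it is an assumption that the appended file looks like two arrays back to back, which is the simplest way to picture it.

```python
import json

first_run = json.dumps([{"title": "a"}])
second_run = json.dumps([{"title": "b"}])

# Appending a second JSON array to the same file gives two top-level
# values back to back, which is not a valid JSON document.
appended = first_run + "\n" + second_run
try:
    json.loads(appended)
    appended_is_valid = True
except ValueError:
    appended_is_valid = False
print("appended JSON parses:", appended_is_valid)

# In JSON Lines every line is its own document, so appending is harmless.
json_lines = "\n".join(json.dumps(item) for item in [{"title": "a"}, {"title": "b"}])
items = [json.loads(line) for line in json_lines.splitlines()]
print(items)
```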

Arachnoid arguments

You may have noticed that the class for this spider has a constructor accepting an optional argument called tag. We can specify a value for this argument for the spider to get the frequently asked questions with the python tag, using the -a option:

scrapy runspider top_asked_so_questions.py -o python-questions.json -a tag=python

Using this little trick you can write generic spiders, so that you just pass some parameters and get a different result. For example, you may write one spider for several sites that have the same HTML structure, making the URL of the site a parameter. Or, a spider for a blog in which the parameters define a time range of the posts and comments to extract.
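
Since each -a name=value pair reaches the spider constructor as a keyword argument holding a string, the URL logic is easy to try in isolation. Here is the constructor's URL building pulled out into a plain Python function; the function name questions_url is hypothetical, the logic is the same as in the spider above.

```python
def questions_url(tag=None):
    """Build the start URL the same way the spider's constructor does."""
    url = 'http://stackoverflow.com/questions'
    if tag:
        url += '/tagged/%s' % tag
    return url + '?sort=frequent'


print(questions_url())              # what you get without -a
print(questions_url(tag='python'))  # what you get with -a tag=python
```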

Putting it all together
In the previous sections, you saw how to do web crawling with Scrapy, navigating through the pages of a site using the mechanism of requests and callbacks. You also saw how to use the Selector API to extract the content of a page into items and execute the spider using the command scrapy runspider.

Now, we shall put it all together in a spider that solves the problem we presented in the introduction: let's scrape the latest "big questions" articles from mentalfloss.com, offering an option to specify the category (Origins, The Body, Fact Check, etc). This way, if you just run the spider, it should scrape all the articles in the blog; if you pass in a category, it should scrape only the articles of that subject.

Note: Before writing a spider, it's useful to explore the pages of the site a little using the browser's inspection capabilities and the scrapy shell, so that you can see how the site is structured and try a few CSS or XPath expressions in the shell. There are also some browser extensions that allow you to test XPath expressions directly in a page: XPath Helper for Chrome and XPath Checker for Firefox. Discovering the best way to extract the content of a site using XPath or CSS is more of an art than a science, so we won't try to explain much here, but it's worth telling you that you learn a lot after a little experience.

Have a look at the final code of the spider:

import scrapy
import urlparse

class Article(scrapy.Item):
   title = scrapy.Field()
   content = scrapy.Field()
   link = scrapy.Field()
   author = scrapy.Field()
   date = scrapy.Field()

class MentalFlossArticles(scrapy.Spider):
   name = 'mentalfloss-articles'

   def __init__(self, category=None):
       articles_url = 'http://mentalfloss.com/big-questions'

       if category:
           articles_url += '/' + category

       self.start_urls = [articles_url]

   def parse(self, response):
       """Gets the page with the article list,
       finds the article links and generates
       requests for each article page
       """
       article_links = response.xpath(
           ...  # XPath expression selecting the article links
       ).extract()

       for link in article_links:
           article_url = urlparse.urljoin(
               response.url, link)
           yield scrapy.Request(article_url,
                                self.extract_article)

   def extract_article(self, response):
       """Gets the article page and extracts
       an item with the article data
       """
       article = Article()
       css = lambda s: response.css(s).extract()

       article['link'] = response.url
       article['title'] = css("h1.title > span::text")[0]
       article['date'] = css('.date-display-single::text')[0]

       article['content'] = " ".join(
           css('#content-content p::text'))

       article['author'] = css(
           " a::text")[0]

       yield article

Just like before, you can run the spider with:

scrapy runspider mentalfloss.py -o articles-all.json

And to get the articles from each section, you can use commands like:

scrapy runspider mentalfloss.py -o articles-origins.json -a category=origins

scrapy runspider mentalfloss.py -o articles-fact-check.json -a category=fact-check

The code for this spider has a very similar structure to the previous one, with its argument handling and everything.

The main difference is that in this one, the first callback (the parse() method) generates other requests for the article pages, which are handled by the second callback: the extract_article() method, which scrapes the article data.

The content extraction also does a little bit more work. We created a css() helper function to abbreviate calling response.css(<selector>).extract() and used that to get the result of our selectors to fill the Article item. Note also how we take advantage of Python's feature of concatenating literal strings on the CSS selector for the author field, to break it in two lines.
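
In case the literal concatenation trick is new to you, here it is in isolation (Python 3 shown, though it behaves the same in Python 2). The selector text is a made-up example, not the one used by the spider:

```python
# Adjacent string literals are joined into a single string, so a long
# selector can be split across lines inside a pair of parentheses.
selector = (
    "div.article-meta"  # hypothetical first half of a selector
    " a::text"          # the leading space keeps the two parts separate
)
print(selector)
```

Mind the space at the start of the second literal: without it the two halves would be glued into a different selector.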

Final words
If you made it this far, congratulations! Here is a trophy for you:

[image: trophy]

Now that you have learned to write Scrapy spiders and are therefore able to download the whole Internet to your home PC, try not to get banned by the website hosts out there! :)

Visit the official documentation for Scrapy. There is a lot of good stuff there, like the tutorial teaching you how to create complete Scrapy projects, frequently asked questions, tips for doing huge crawls, how to debug a spider, tips on how to avoid being banned and a lot more.

UPDATED: removed -t json from commands, unnecessary since Scrapy 0.24 (thanks, Mikhail!)

UPDATED: added note about Scrapy default behavior of appending to output file (thanks again, Mikhail!)

Useful resources: