Scraping data with Capybara
When you would like to retrieve data from a website, you can use an API: an Application Programming Interface. This is a URL that emits the data in a parseable format like JSON or XML. When the website doesn’t have an API, however, how do you get the data?
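To make the API case concrete, here is a minimal sketch using only Ruby's standard library. The payload and its field names (headlines, title, url) are invented for illustration; a real API documents its own shape, and the text would arrive over HTTP rather than from a constant.

```ruby
require 'json'

# Hypothetical response body; a real one would arrive over HTTP.
SAMPLE_PAYLOAD = '{"headlines": [{"title": "Snow expected", "url": "/snow"}, ' \
                 '{"title": "Council votes", "url": "/council"}]}'

# Turn the JSON text into plain Ruby hashes and arrays, then pick out the titles.
def headline_titles(json_text)
  JSON.parse(json_text).fetch('headlines').map { |story| story.fetch('title') }
end

puts headline_titles(SAMPLE_PAYLOAD)
```

No HTML has to be picked apart here, which is why an API is the first thing to look for.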
Parsing HTML Yourself
Parsing HTML with regular expressions is a losing game. Instead, we’ll use the industry-standard Nokogiri. In many cases, using Nokogiri directly is all you need: grab the HTML content, get the HTML nodes you want, do your thing.
Installing Nokogiri
Install Nokogiri from your terminal:
gem install nokogiri
It includes some C code, so it might take a little while to install (it will say building native extensions). If there are problems with the gem install, your development environment isn’t properly configured. For instance, you might have an incompatible version of GCC or be missing some header libraries. Read through Nokogiri’s installation instructions if this is the case.
An HTML-Nokogiri Experiment
This is how you’d get the list of breaking news links from the Denver Post. Drop this code into a Ruby file and execute it:
If you don’t have pry, install it with:
gem install pry
What Just Happened
The program made a URI object to parse our URI (you can think of a URI as being the same thing as a URL). Then it made a GET request to that URI to get the page’s body. It gave the body to Nokogiri::HTML, which parsed the HTML and gave us back a Nokogiri document, an object we can use to interact with the HTML. In this case, we used "css" to give it a CSS selector that finds all links inside of list elements. We also stuck a pry at the bottom so that we can play with those objects if we like.
Maybe we’d like to see what we can do with a link. We’ll use Pry’s ls -v to list out all the interesting things on it.
We see the link has its own css method, so we could run another query from within this node. We can also see the href attribute; it looks like we get attributes by using bracket notation, like a hash.
This is a great way to play around with a new library and learn about it. You can check out the documentation to learn more about what you can do with Nokogiri, but this isn’t a lesson about Nokogiri, so let’s get on to Capybara!
Beyond HTML
Sometimes, though, websites want you to do pesky things like fill in forms and follow redirects. Websites check your cookies and store session data, and Nokogiri isn’t optimized to handle that sort of thing.
For this, you’d want a tool that understands these things. Mechanize is a common library for this, but it just parses HTML like we did with Nokogiri, and then makes more HTTP requests when we "click" on a link or "submit" a form. It’s great for this purpose… except that many webpages require JavaScript to run correctly, and Mechanize, like Nokogiri, doesn’t have a JavaScript engine in it; it’s just reflecting on the static structure of the page.
Phantom.js, Capybara and Poltergeist
To really interact with the page, we’d need to be in a browser. This is what Phantom.js is: a browser like Opera or Firefox, but one specifically intended to be run by programs. It doesn’t pop up a window like the browser you use; it does all the work without ever displaying the pages.
Capybara provides an interface for interacting with websites from Ruby. It is specifically intended for testing, but all the functionality it provides is useful for scraping, too. It leaves the specifics of how to talk to the website to a "driver". This allows you to use it with numerous tools, such as rack-test, which hooks it into your Rack layer, allowing you to navigate your website without ever loading a server.
The driver that knows how to talk to Phantom.js is called Poltergeist. So we’ll use Capybara to navigate the web, click links, and so forth, like Mechanize, but we’ll have it use Poltergeist so it does this in Phantom.js, and we can interact with JavaScript.
Installation
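Install the Capybara and Poltergeist gems from your terminal. Poltergeist drives a separately installed Phantom.js, so that program needs to be available on your PATH as well.
gem install capybara
gem install poltergeist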
Setup
To set these things up, we need to:
- require the gems in order to have the code available
- configure Poltergeist to not blow up on JavaScript errors (aka every website w/ js)
- then tell Capybara to use Poltergeist as its driver
- then get an object from Capybara that we can use to navigate the web (we’ll name ours "browser")
Let’s save the code below in setup-capybara.rb
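A sketch of what that file could contain, following the four steps listed above. It assumes the poltergeist gem and a Phantom.js binary are installed. One deliberate choice: browser is defined as a top-level method rather than a local variable, because a local variable would not be visible after the file is required.

```ruby
# setup-capybara.rb
require 'capybara'
require 'capybara/poltergeist'

# Register a Poltergeist driver that does not raise on the page's JavaScript errors.
Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end

# Make it the driver Capybara uses by default.
Capybara.default_driver = :poltergeist

# The object we navigate with. Defined as a method so it is still reachable
# from pry after this file has been required.
def browser
  Capybara.current_session
end
```

Started with pry -r ./setup-capybara.rb, the prompt then has browser available for calls such as browser.visit.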
And then we’ll load it up in pry and open Capybara’s documentation with it.
And now let’s head over to the Denver Post and try getting the information we previously had.
Looks like there are some good methods available to us: methods to click_on things, fill_in forms (I would assume), and a go_back method that’s probably like the back button in our browser. Oh, current_url looks like it should tell us where we are. Let’s try that.
Yep! Oh, and that all looks promising. Let’s try to use it like we did with Nokogiri’s css.
Great, and now what can we do with a link?
Hmm, a text method and a value method; both of those look promising. Oh, and brackets, too. Let’s try them out.
Okay, so value isn’t interesting, but text is, and we seem to be able to access the element’s attributes with brackets like we could do with Nokogiri.
Let’s try clicking that link. Notice in the methods it got from Capybara::Node::Actions that we have a click_link method, and down on the element, we have a click method. Let’s try those out.
Okay, so we followed the link! Oh, but I forgot to print the links on the last page. Remember that go_back method? Let’s try that out now.
Great, we’re back. Now let’s print the text and href.
So, it gives me back a Capybara::Result, but notice that it includes Enumerable. I know how to deal with Enumerable!
Great, we did it! The real power of Capybara becomes apparent when you need to fill in forms, follow redirects, be authenticated, upload files and all that jazz.
What About Filling In Forms and Logging In?
Capybara is able to do more than request a page and look at its links. It can fill in forms, too! When you talk to a website, the website sets a header in the HTTP response called a cookie. The browser then sends this back to the site whenever you interact with it. This allows the site to set the cookie to be something that allows them to identify you. Which, in turn, means they can do things like authenticate that you are the user you say you are, and then store that in a cookie. If you didn’t send the cookie back, it would look like you weren’t logged in. So for us to crawl sites, we often need to remember the cookies the site has set, and then re-submit them for all future requests.
If we call our current browsing process a "session", then we will want to remember cookies for this session. Fortunately, Phantom.js takes care of this for us: it keeps track of the cookies the website has sent, for the duration of our session, and continues to re-submit them. This means that when we fill in a form to log into a website, we will continue to be logged in. You can configure Capybara, or other web crawling tools, to log into your accounts and perform tasks for you, collect data for you, whatever it happens to be.
Let’s try it out!
Let’s try submitting that form. I took a look at Capybara’s readme, under the sections about clicking links and buttons, and the section about interacting with forms, and then used my web inspector to determine which things I wanted to fill in and click on.
Give It A Try
Our form won’t require a session, but the process is the same. We’re going to give you a list of ISBNs that identify books about Ruby (for a liberal definition of "Ruby"). Your task is to go to amazon.com, fill the ISBN into the form, submit it, and then extract the data from the resulting page into a format of your choice.
You have completed the task when you can show me a text file with all of the data. You can verify your data against this list.
Here are the ISBNs:
Here is an example. You should be able to extract the title, isbn10, isbn13, author, binding, publisher, and published date.
You Know What You Want To Do… But How Do You Do It?
Our pry method works really well, and is usually what I turn to first. But, docs are good too. The Capybara readme has some great examples. And there is also API documentation available.
Things to consider:
- How do you want to store the data?
- What happens if you get 90% finished with the scraping, and then it blows up for some reason? Maybe you should persist the results ASAP, so you can skip work you’ve already done when you fix the error and rerun it.
- What are things that could break your scraper?
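One way to approach the second question is to append each result to a file the moment it is scraped, and to skip anything already in that file on the next run. The sketch below uses only the standard library; the file format (one JSON object per line), the key and data field names, and the helper names are all assumptions made for the example, and the block stands in for the real scraping step.

```ruby
require 'json'
require 'set'
require 'tmpdir'

# Collect the keys that an earlier run already wrote to the results file.
def already_done(path)
  return Set.new unless File.exist?(path)
  File.foreach(path).map { |line| JSON.parse(line).fetch('key') }.to_set
end

# Scrape each key via the given block, persisting every record immediately.
def scrape_all(keys, path)
  done = already_done(path)
  File.open(path, 'a') do |file|
    keys.each do |key|
      next if done.include?(key)          # finished in a previous run
      record = yield(key)                 # the real scraping happens in the block
      file.puts(JSON.generate('key' => key, 'data' => record))
      file.flush                          # hand the line to the OS before moving on
    end
  end
end

# Demo in a throwaway directory: the second run only does the new work.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'results.jsonl')
  scrape_all(%w[first second], path)       { |key| { 'length' => key.length } }
  scrape_all(%w[first second third], path) { |key| puts "scraping #{key}"; { 'length' => key.length } }
end
```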
Finished and bored?
- Got all the data? Nice. How about reading it in and generating an HTML page that lists all the books?
- Pull down other useful information, as well. How about the image and the relevant links?