When you would like to retrieve data from a website, you can often use an API (Application Programming Interface). This is a URL that returns the data in a parseable format like JSON or XML. When the website doesn’t have an API, however, how do you get the data?
Parsing HTML Yourself
Parsing HTML with regular expressions is a losing game. Instead, we’ll use the industry-standard Nokogiri. In many cases, using Nokogiri directly is all you need: grab the HTML content, get the HTML nodes you want, do your thing.
Install Nokogiri from your terminal with gem install nokogiri.
It includes some C code, so it might take a little while to install (it will say building native extensions). If there are problems with the gem install, your development environment isn’t properly configured. For instance, you might have an incompatible version of GCC or be missing some header libraries. Read through Nokogiri’s installation instructions if this is the case.
An HTML-Nokogiri Experiment
This is how you’d get the list of breaking news links from the Denver Post. Drop this code into a Ruby file and execute it:
If you don’t have pry, install it with gem install pry.
What Just Happened
The program made a URI object to parse our URI (you can think of a URI as being the same thing as a URL).
Then it made a GET request to that URI to get the page’s body.
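Those two steps can be sketched with nothing but Ruby’s standard library. The address below is an assumption for illustration, and FRONT_PAGE and fetch_body are names made up for this sketch. The request lives in a method so that loading the file doesn’t touch the network until you call it.

```ruby
require 'uri'
require 'net/http'

# Assumed address, for illustration only.
FRONT_PAGE = URI('http://www.denverpost.com/')

# A URI object splits the address into its parts.
puts FRONT_PAGE.scheme # http
puts FRONT_PAGE.host   # www.denverpost.com
puts FRONT_PAGE.path   # /

# Performs a GET request and returns the response body as a String.
# Defined but not called here, so running this file makes no request.
def fetch_body(uri)
  Net::HTTP.get(uri)
end
```

Calling fetch_body(FRONT_PAGE) would return the page’s HTML as one big string, which is what gets handed to the parser next.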
It gives the body to Nokogiri::HTML, which parses the HTML and gives us back a Nokogiri document, an object we can use to interact with the HTML.
In this case, we use css to give it a CSS selector that will find all links inside of list elements.
We also stuck a pry at the bottom so that we can play with those objects if we like.
Maybe we’d like to see what we can do with a link. We’ll use Pry’s ls -v to list out all the interesting things on it:
We see the link has its own css method, so we could run another query from within this node.
And we can see the href attribute; it looks like we get attributes by using bracket notation, like a hash.
Poking at objects in pry like this is a great way to play around with a new library and learn about it.
You can check out the documentation to learn more about what you can do with Nokogiri, but this isn’t a lesson about Nokogiri, so let’s get on to Capybara!
Sometimes, though, websites want you to do pesky things like fill in forms and follow redirects. Websites check your cookies and store session data, and Nokogiri isn’t optimized to handle that sort of thing.
PhantomJS, Capybara and Poltergeist
To really interact with the page, we need a browser. That is what PhantomJS is: a browser like Opera or Firefox, but specifically intended to be run by programs. It doesn’t pop up a window like the browser you use; it does all the work without ever displaying the pages.
Capybara provides an interface for interacting with websites from Ruby. It is specifically intended for testing, but all the functionality it provides is useful for scraping, too. It leaves the specifics of how to talk to the website to a "driver". This allows you to use it with numerous tools, such as rack-test, which hooks it into your Rack layer, allowing you to navigate your website without ever starting a server. Poltergeist is the driver that lets Capybara control PhantomJS.
Let’s save the code below in a file.
And then we’ll load it up in pry and open Capybara’s documentation with it.
And now let’s head over to the Denver Post and try getting the information we previously had.
Looks like there are some good methods available to us: methods to fill_in forms (I would assume), and a go_back method that’s probably like the back button in the browser. Oh, current_url looks like it should tell us where we are. Let’s try that.
Yep! Oh, and that all method looks promising. Let’s try using it like we did with Nokogiri’s css method.
Great, and now what can we do with a link?
There’s a text method and a value method; both of those look promising.
Oh, and brackets, too. Let’s try them out.
Okay, so value isn’t interesting, but text is, and we seem to be able to access the element’s attributes with brackets like we could do with Nokogiri.
Let’s try clicking that link. Notice in the methods it got from Capybara::Node::Actions that there is one for clicking links, and down on the element itself we have a click method. Let’s try those out.
Okay, so we followed the link! Oh, but I forgot to print the links on the last page. Remember that go_back method? Let’s try that out now.
Great, we’re back. Now let’s print the text and href:
So, it gives me back a Capybara::Result, but notice that it includes Enumerable. I know how to deal with Enumerable!
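Why does including Enumerable matter? Any object that defines each and includes the module gets map, select and friends for free. The sketch below uses hypothetical stand-ins, not Capybara classes: Link mimics an element with a text method and bracket access, and LinkList mimics a result collection.

```ruby
# Hypothetical stand-ins for illustration; neither is part of Capybara.
# A Struct member can be read as link.text or as link[:href].
Link = Struct.new(:text, :href)

class LinkList
  include Enumerable

  def initialize(links)
    @links = links
  end

  # Enumerable builds map, select, first and the rest on top of each.
  def each(&block)
    @links.each(&block)
  end
end

RESULTS = LinkList.new([
  Link.new('First story', '/story-1'),
  Link.new('Second story', '/story-2')
])

RESULTS.map { |link| [link.text, link[:href]] }.each do |text, href|
  puts "#{text}: #{href}"
end
```

The same map call, with the same block, is the kind of thing you can run on a collection of Capybara elements.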
Great, we did it! The real power of Capybara becomes apparent when you need to fill in forms, follow redirects, be authenticated, upload files and all that jazz.
What About Filling In Forms and Logging In?
Capybara is able to do more than request a page and look at its links. It can fill in forms, too! When you talk to a website, the website can set a header in the HTTP response called a cookie. The browser then sends this back to the site whenever you interact with it. This lets the site set the cookie to something that identifies you, which in turn means it can authenticate that you are the user you say you are and then store that in a cookie. If you didn’t send the cookie back, it would look like you weren’t logged in. So for us to crawl sites, we often need to remember the cookies the site has set, and then re-submit them for all future requests.
If we call our current browsing process a "session", then we will want to remember cookies for this session. Fortunately, PhantomJS takes care of this for us: it keeps track of the cookies the website has sent for the duration of our session and continues to re-submit them. This means that when we fill in a form to log into a website, we will continue to be logged in. You can configure Capybara, or other web crawling tools, to log into your accounts and perform tasks for you, collect data for you, whatever it happens to be.
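The bookkeeping a browser does here can be sketched in a few lines: remember what arrives in Set-Cookie headers, and send it back in a Cookie header. CookieJar is a hypothetical class written for this illustration, and the cookie values are made up. Real cookie handling also tracks paths, domains and expiry, which this sketch ignores.

```ruby
# A deliberately small cookie jar: it keeps name=value pairs from
# Set-Cookie response headers and replays them in a Cookie request header.
# Attributes such as Path, Expires and Secure are ignored in this sketch.
class CookieJar
  def initialize
    @cookies = {}
  end

  # header_value looks like "session_id=abc123; Path=/; HttpOnly"
  def store(header_value)
    pair = header_value.split(';').first
    name, value = pair.split('=', 2)
    @cookies[name.strip] = value.to_s.strip
  end

  # Value to send in the Cookie header of the next request.
  def to_header
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = CookieJar.new
jar.store('session_id=abc123; Path=/; HttpOnly')
jar.store('theme=dark')
puts jar.to_header # session_id=abc123; theme=dark
```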
Let’s try it out!
Let’s try submitting that form. I took a look at Capybara’s readme, under the sections about clicking links and buttons, and the section about interacting with forms, and then used my web inspector to determine which things I wanted to fill in and click on.
Give It A Try
Our form won’t require a session, but the process is the same. We’re going to give you a list of ISBNs that identify books about Ruby (for a liberal definition of "Ruby"). Your task is to go to amazon.com, enter each ISBN into the search form, submit it, and then extract the data from the resulting page into a format of your choice.
You have completed the task when you can show me a text file with all of the data. You can verify your data against this list.
Here are the ISBNs:
Here is an example. You should be able to extract the title, isbn10, isbn13, author, binding, publisher, and published date.
You Know What You Want To Do… But How Do You Do It?
Things to consider:
- How do you want to store the data?
- What happens if you get 90% finished with the scraping, and then it blows up for some reason? Maybe you should persist the results ASAP, so you can skip work you’ve already done when you fix the error and rerun it.
- What are things that could break your scraper?
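One possible answer to the first two questions is to keep results in a JSON file and save after every book. In the sketch below, scrape_all is a hypothetical helper, the ISBNs are placeholders, and the block stands in for your real scraping code.

```ruby
require 'json'
require 'tmpdir'

# Scrapes each ISBN that is not already in the results file and saves
# after every book, so a crash loses at most the one in progress.
# The block does the actual scraping and must return a Hash.
def scrape_all(isbns, path)
  results = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  isbns.each do |isbn|
    next if results.key?(isbn) # already done on a previous run

    results[isbn] = yield(isbn)
    File.write(path, JSON.pretty_generate(results))
  end
  results
end

# Demo with placeholder ISBNs and a fake scraper, in a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'books.json')
  fake_scraper = ->(isbn) { { 'title' => "Placeholder for #{isbn}" } }

  scrape_all(%w[isbn-a isbn-b], path, &fake_scraper)

  # Second run: isbn-a and isbn-b are skipped, only isbn-c is scraped.
  scrape_all(%w[isbn-a isbn-b isbn-c], path) do |isbn|
    puts "scraping #{isbn}"
    fake_scraper.call(isbn)
  end
end
```

Writing the whole file after each book is wasteful for large jobs, but for a short list of ISBNs it keeps the code simple and makes reruns cheap.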
Finished and bored?
- Got all the data? Nice. How about reading it in and generating an HTML page that lists all the books?
- Pull down other useful information as well. How about the image and the relevant links?
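For the HTML page, one option is ERB from the standard library. This is only a sketch: BOOK_LIST_TEMPLATE and render_book_list are made-up names, the book record is a placeholder rather than real data, and it assumes each book is a Hash with 'title' and 'isbn13' keys.

```ruby
require 'erb'

# trim_mode '-' lets <%- and -%> swallow the whitespace around loop tags.
BOOK_LIST_TEMPLATE = ERB.new(<<~HTML, trim_mode: '-')
  <ul>
  <%- books.each do |book| -%>
    <li><%= ERB::Util.html_escape(book['title']) %> (ISBN-13: <%= ERB::Util.html_escape(book['isbn13']) %>)</li>
  <%- end -%>
  </ul>
HTML

# Renders one list item per book, escaping anything that came from the web.
def render_book_list(books)
  BOOK_LIST_TEMPLATE.result_with_hash(books: books)
end

# Placeholder record, not real book data.
puts render_book_list([{ 'title' => 'Placeholder Title & Co', 'isbn13' => '0000000000000' }])
```

Escaping matters here because scraped titles can contain characters like & or <, which would otherwise break the page.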