As programmers, the essence of your job is to remove tedium from the world.
One way to do that is to write little programs (often called scripts if they’re one-off or ad hoc) to take care of things for you. What’s the difference between a script and an application? Typically an application is used by an end user, while a script is used by the developer or another internal user.
Typical tasks would be:
- generate a PDF invoice every two weeks and email it to your employer
- generate a "hall of fame" list based on a text file of names and upload it to the right place on an FTP server
- automatically scan a given web page for some text and do stuff when a particular phrase is detected (see the sci-fi novel Daemon for an interesting take on this).
- grab the source HTML for a web page and pull out the data that you care about, be it images, CSS, URLs, or just the text.
Web Scripts & HTML
You’re web developers, so the majority of your work is going to be centered on the web. We love it when applications provide beautiful, RESTful APIs and well-designed wrapper gems to access them.
But the rest of the time, you’ll have to get the data you want like a browser: making requests, getting HTML responses, and picking through the wall of text for the data.
Parsing HTML Yourself
Almost every developer has tried reading or parsing HTML with regular expressions. Just don’t do it: there are too many things that can go wrong.
In Ruby, the typical tool for parsing HTML or XML is nokogiri. In many cases, using nokogiri directly is all you need. Grab the HTML content, get the HTML nodes you want, do your thing.
Install nokogiri from your terminal:
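```
gem install nokogiri
```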
It includes some C code, so it might take a little while to install (you’ll see it say “building native extensions”). If the gem install fails, your development environment isn’t properly configured. For instance, you might have an incompatible version of GCC or be missing some header libraries. Read through Nokogiri’s installation instructions if that’s the case.
An HTML-Nokogiri Experiment
This is how you’d get the list of the breaking news links from the Denver Post. Drop the code into a Ruby script and execute it.
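A minimal sketch (the a selector grabs every link on the page; a real script would use a more specific CSS selector to narrow it to just the breaking-news block):

```ruby
require 'open-uri'
require 'nokogiri'

# fetch the front page and parse it into a Nokogiri document
page = Nokogiri::HTML(open("http://www.denverpost.com"))

# print the text and href of every link on the page
page.css("a").each do |link|
  puts link.text.strip
  puts link['href']
end
```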
What Just Happened
The script called the open method, which is defined by the open-uri library. This method returns an I/O (input/output) object which reads the data from the specified URL. For instance, you can try this in IRB:
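(The return values shown here are illustrative; open-uri may hand back a StringIO or a Tempfile depending on the size of the response.)

```ruby
require 'open-uri'
# => true

io = open("http://www.denverpost.com")
# => #<StringIO:0x007f8e1c0b2d80>

io.read[0..60]
# => "<!DOCTYPE html>\n<html lang=\"en\">..."
```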
That I/O object was then passed to Nokogiri::HTML. HTML is a class method on the Nokogiri class. Running the experiment so far in IRB:
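(Again, the return values are illustrative.)

```ruby
require 'open-uri'
# => true
require 'nokogiri'
# => true

page = Nokogiri::HTML(open("http://www.denverpost.com"))
# => #<Nokogiri::HTML::Document name="document" children=[...]>

page.class
# => Nokogiri::HTML::Document
```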
Then you can poke around on that document object. Use the .css method to find all elements matching the CSS selector you specify:
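(The count below will differ whenever the front page changes.)

```ruby
page.css("a")
# => [#<Nokogiri::XML::Element name="a" ...>, #<Nokogiri::XML::Element name="a" ...>, ...]

page.css("a").count
# => 770
```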
There are 770 links on the front page. What is the object that comes back from page.css("a")? We called count on it, presuming that it acts like an Array… or something similar.
What is a Nokogiri::XML::NodeSet? We don’t care. As long as it behaves like an Array, we know how to work with it:
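For instance, printing the text of every link:

```ruby
page.css("a").each do |link|
  puts link.text
end
```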
The each call goes through each element of the collection and executes the block. What methods can you call on each of those links?
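A few useful ones (return values illustrative):

```ruby
link = page.css("a").first

link.class            # => Nokogiri::XML::Element
link.text             # => "Breaking News"
link['href']          # => "/breakingnews"
link.attributes.keys  # => ["href", "class"]
```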
Scripting Beyond HTML
Sometimes, though, websites want you to do pesky things like fill in forms and follow redirects. Websites check your cookies and store session data, and nokogiri isn’t optimized to handle that sort of thing.
This is where mechanize comes in. It makes it really easy to do things like fill out forms and click buttons.
Mechanize depends on nokogiri, so if you managed to get that installed above, you should have no problem getting it working.
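Install it from your terminal, just like before:

```
gem install mechanize
```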
Trying It Out
We’ll do the same work as above, only using mechanize rather than using nokogiri directly:
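(This sketch again grabs every link, rather than just the breaking-news section.)

```ruby
require 'mechanize'

agent = Mechanize.new
page  = agent.get("http://www.denverpost.com")

# mechanize wraps each anchor tag in a Mechanize::Page::Link object
page.links.each do |link|
  puts link.text
  puts link.href
end

# the underlying nokogiri document is still available if you want it
puts page.search("a").count
```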
It’s not very different. The page object that mechanize returns gives us direct access to all the nokogiri stuff, and also provides some nice shortcuts and a more intuitive interface.
The real power of mechanize becomes apparent when you need to fill in forms, follow redirects, be authenticated, upload files and all that jazz.
What About Filling In Forms and Logging In?
Mechanize is able to do more than request a page and look at its links. It can click links and fill in forms, too. When you talk to a website, the site sets a cookie: a header in the HTTP response that your browser stores and sends back with every subsequent request. Because the site decides what goes into the cookie, it can use it to identify you; that, in turn, means it can authenticate that you are the user you say you are and remember that fact between requests. If you didn’t send the cookie back, it would look like you weren’t logged in. So for us to crawl sites, we often need to remember the cookies the site has set, and then re-submit them with all future requests.

If we call our current browsing process a "session", then we will want to remember cookies for this session. Fortunately, Mechanize takes care of this for us: it keeps track of the cookies the website has sent for the duration of our session and keeps re-submitting them. This means that when we fill in a form to log into a website, we stay logged in. You can point Mechanize (or other web crawling tools) at your accounts to log in, perform tasks for you, collect data for you, whatever it happens to be.
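Here’s a sketch of what logging in might look like. The URL, form action, and field names are all hypothetical; you’d find the real ones in the login page’s HTML:

```ruby
require 'mechanize'

agent      = Mechanize.new
login_page = agent.get("http://example.com/login")    # hypothetical URL

form = login_page.form_with(action: "/sessions")      # hypothetical form action
form["username"] = "my_user"                          # hypothetical field names
form["password"] = "my_password"

agent.submit(form)   # mechanize stores the cookie the site sends back

# later requests through the same agent carry that cookie, so we stay logged in
agent.get("http://example.com/account")
```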
Give It A Try
Our form won’t require a session, but the process is the same. We’re going to give you a list of ISBNs that identify books about Ruby (for a liberal definition of "Ruby"). Your task is to go to isbnsearch.org, fill each ISBN into the form, submit it, and then extract the data from the resulting page into a format of your choice.
You have completed the task when you can show me a text file with all of the data. You can verify your data against this list.
Here are the ISBNs:
Here is an example. You should be able to extract the title, isbn10, isbn13, author, binding, publisher, and published date.
You Know What You Want To Do… But How Do You Do It?
For reference, you can look at the script I used to retrieve this data:
But let’s open up pry and play around with Mechanize a little bit.
I recommend loading pry and playing around with either the site you’re going to parse or another site. This allows you to try things out with much less effort. Here are some examples of playing around with the Denver Post page:
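(Condensed here; the return values are illustrative and will differ as the page changes.)

```ruby
require 'mechanize'
# => true

agent = Mechanize.new
page  = agent.get("http://www.denverpost.com")
# => #<Mechanize::Page{url "http://www.denverpost.com/"} ...>

page.links.count
# => 770

# what other finder methods does page give us?
page.methods.grep(/_with/)
# => [:link_with, :links_with, :form_with, :forms_with, :frame_with, :frames_with, ...]

page.links_with(text: /Broncos/).map(&:href)
# => ["/broncos", "/broncos/schedule", ...]

sports_page = page.link_with(text: "Sports").click
# => #<Mechanize::Page{url "http://www.denverpost.com/sports"} ...>

sports_page.uri.to_s
# => "http://www.denverpost.com/sports"
```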
Use this information to complete the task. If you find yourself needing something not covered here, consider how we used pry just now to find out about the links_with methods. Can you use this tool to answer your questions, too?
Things to consider:
- How do you want to store the data?
- What happens if you get 90% finished with the scraping, and then it blows up for some reason? Maybe you should persist the results ASAP, so you can skip work you’ve already done when you fix the error and rerun it.
- What are things that could break your scraper?
Finished and bored?
- Given that we’ve seen both form and link have search methods that suffix _with to them, we might hypothesize that there are other such methods. Can you use pry to confirm or deny this hypothesis?
- Got all the data? Nice. How about reading it in and generating an HTML page that lists all the books?
- Pull down other useful information, as well. How about the image and the relevant links.