Web scraping with Ruby   Oct 29, 2014

Web scraping is a way to fetch a small piece of content from a page on the web and do something with it. The Nokogiri gem for Ruby makes this easy, and the ActionMailer gem makes it just as easy to email the scraped content to yourself. If you run the program from this post using cron or some other task scheduler, you could be receiving Canon SLR lens deals right in your inbox!

Get the gems

Firstly, you'll need to install the Nokogiri and ActionMailer gems. This post deals specifically with Gmail accounts, so the mail setup described here is specific to Gmail, which needs an app-specific password. For other SMTP email servers you may be able to use your regular password.

The SimpleMailer class below sets up the SMTP server settings, including the app-specific password on line 8 and your Gmail account name on line 7. If you're interested, you can read more about ActionMailer.

The camera_price_buster.rb file is a demonstration of how you can use Nokogiri to scrape the content of a webpage and then utilise SimpleMailer to send the result to an email address. In this case, I'm scraping the Canon SLR Lenses page from camerapricebuster.co.uk to fetch any items that are marked as currently being at their lowest price ever. This is indicated by the presence of the green image on the row in the lens list (see screenshot).


[Screenshot: the lens list on camerapricebuster.co.uk. The green images highlight the items with lowest-ever prices.]

Get the code


I use an XPath query to pick out the rows containing this image into an array, then loop through the array and extract the pertinent information: lens name, price, and the URL of the price comparison for that lens on camerapricebuster.co.uk.

XPath strings

Originally I used an explicit path from the document root. The version with the // prefix, however, still locates the data I need if the table is moved around in the page structure, so in this instance it's the better choice.

Armed with the XPath for each row containing the lens details, I add each lens's details to the string variable body and finally include that as the body of the email I send. You could easily put this on a cron job to run daily and get notified by email of the day's deals. The resulting output that gets emailed is something like that shown below:

Found 9 items with lowest-ever prices

Lens:	Canon EF 16 35mm f4L IS USM Lens
Price:	£840.97
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-16-35mm-f4L-IS-USM-Lens

Lens:	Canon EF 24mm f2.8 IS USM Lens
Price:	£412.20
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-24mm-f2.8-IS-USM-Lens

Lens:	Canon EF 24 105mm f3.5 5.6 IS STM Lens
Price:	£431.10
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-24-105mm-f3.5-5.6-IS-STM-Lens

Lens:	Canon EF 50mm f1.2L Lens
Price:	£1031.40
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-50mm-f1.2L-Lens

Lens:	Canon EF 50mm f2.5 Macro Lens
Price:	£182.70
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-50mm-f2.5-Macro-Lens

Lens:	Canon EF 85mm f1.2 L USM II Lens
Price:	£1394.10
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-85mm-f1.2-L-USM-II-Lens

Lens:	Canon EF 100mm f2.8 Macro USM Lens
Price:	£346.50
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-100mm-f2.8-Macro-USM-Lens

Lens:	Canon TSE 24mm Mk II f3.5L Lens
Price:	£1331.10
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-TSE-24mm-Mk-II-f3.5L-Lens

Lens:	Canon CN E 50mm T1.3 L F Cine Lens
Price:	£3358.00
URL:	http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-CN-E-50mm-T1.3-L-F-Cine-Lens
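Output in the format above could be assembled with a small helper like the one sketched here. The helper name build_body and the shape of each deal (a hash with :name, :price and :url keys) are assumptions, and it joins an array of entries rather than appending to a string variable.

```ruby
# Hypothetical helper: formats scraped deals into the plain-text email body.
# Each deal is assumed to be a hash with :name, :price and :url keys.
def build_body(deals)
  header = "Found #{deals.length} items with lowest-ever prices"
  entries = deals.map do |deal|
    "Lens:\t#{deal[:name]}\nPrice:\t#{deal[:price]}\nURL:\t#{deal[:url]}"
  end
  # A blank line separates the header and each entry.
  ([header] + entries).join("\n\n") + "\n"
end

sample = [
  {
    name:  'Canon EF 50mm f1.2L Lens',
    price: '£1031.40',
    url:   'http://www.camerapricebuster.co.uk/Canon/Canon-EF-lenses/Canon-EF-50mm-f1.2L-Lens'
  }
]
puts build_body(sample)
```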

If the structure of the webpage changes, then the script may break. Ultimately there is always a risk with web scraping that a change to the source page will mean your scraping logic fails, but that's just something you must accept since you're not in control of the source data.
