How To Scrape The Web With Ruby

Sean
5 min read · Oct 5, 2021

Let's start this blog at the end… why should I learn how to scrape the web?

Where APIs fall short, web scraping offers a solution.

Although APIs deliver a neatly formatted package of information on request, they often fall short when a website offers custom information you're particularly interested in that the API doesn't expose.

You're probably wondering what kinds of use cases there could possibly be. Here's a list of some of the most common things Ruby and Nokogiri enable you to accomplish:

1. Price comparison & monitoring

Prices change daily and deals vary. Some businesses list their products on multiple platforms, and scraping them can help you execute a deal in real time as a middleman.

2. Social media sentiment analysis

Some social media platforms have built-in APIs, like Twitter's developer portal, that allow you to access some of their data, but there are times when web scraping can provide better insight in real time.

For more information on the Twitter API, see https://developer.twitter.com/en

3. Machine learning

Aggregating large amounts of data can help machine learning models evolve and improve.

4. Lead Generation

Web scraping software built ethically, for example against Google business listings or the Yellow Pages, can gather information like emails and phone numbers for outreach.

Honorable Mentions:

Sports betting odds
Hotel Pricing
Travel Pricing
Real Estate

So what is web scraping?

Ever wondered why a certain web page looks the way it does? Developers have a tool to metaphorically "pop the hood" and get a look at the HTML and other inner workings of a website. This started with Firefox offering developer tools, and soon after all the other browsers offered their own versions.

You can pick any site, but for our example I decided to navigate to https://cryptocurrencyjobs.co/ and see what was under the hood. By right-clicking on the page and selecting "Inspect," a new panel should open on the right side of the page.

What you're looking at in the developer tools is an HTML representation of what's appearing on the left, i.e. on the website itself. If you click the arrow icon at the top left of the developer tools, you can point it at any element on the page and find exactly where that element lives in the HTML document, which we'll call the Doc. This is key to scraping the website effectively, since some elements may be nested deep within other elements in the Doc.

For my own curiosity, and for the purposes of this demonstration, the next sections will walk through using Ruby on Rails to create a database populated, via web scraping, with jobs available in crypto.

First Steps

Assuming you have Rails installed, we're going to be working from the command line at first. Our first steps include:

Creating the API

rails new crypto_jobs --api

Generating a model for the table of data that we can gather from the website we’ll be scraping

rails g model Crypto

This will generate all the necessary files for us to set up a table that will ultimately house our scraped crypto job info. For this simple demonstration we'll only be targeting the job's title and a link to a page with more information. So the migration will look something like this:

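As a minimal sketch, the generated migration in db/migrate/ can be edited to look something like this (the title and link column names are our choice, and the version number in brackets depends on your Rails install):

class CreateCryptos < ActiveRecord::Migration[6.1]
  def change
    create_table :cryptos do |t|
      t.string :title   # the job's title
      t.string :link    # URL to the page with more information

      t.timestamps
    end
  end
end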

Finally, we're going to want to migrate our table over to the database using:

rails db:migrate
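If the migration ran successfully, db/schema.rb should now contain something along these lines (the version timestamp here is purely illustrative):

ActiveRecord::Schema.define(version: 2021_10_05_000000) do
  create_table "cryptos", force: :cascade do |t|
    t.string "title"
    t.string "link"
    t.datetime "created_at", precision: 6, null: false
    t.datetime "updated_at", precision: 6, null: false
  end
end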

Setting Up Nokogiri

Now that we have our database set up, we're going to need to add Nokogiri to our Gemfile so we can take advantage of its parsing capabilities.
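That usually amounts to one line in the Gemfile (plus pry, which we'll lean on later for debugging), followed by a bundle install:

# Gemfile
gem 'nokogiri'   # HTML/XML parser we'll use for scraping
gem 'pry'        # optional, but handy for pausing inside the scraper

bundle install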

Coding The Model To Actually Do The Web Scraping

Now that we're set up, let's navigate back to app/models and get to the fun part! We'll need our code to be able to do a few things, so let's break it down:

  1. Fetch the URL of the page that lists the jobs
  2. Navigate to each job listing node
  3. Create an instance of the Crypto job, including its title, link, etc.

In order to navigate around the web and gather this kind of information, we're going to require a few things at the top of our model. Create and navigate to app/models/scraper.rb, then add the following at the top of the file:

require 'nokogiri'
require 'open-uri'
require 'pry'

Open-uri lets us open a website and read its HTML content, which we'll then hand to Nokogiri to turn into nested nodes that we can navigate through and extract information from. Finally, we'll add a binding.pry so that we can run the scraper, drop into our terminal, and inspect the values returned.
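Putting those requires together, a first pass at app/models/scraper.rb might look something like this (the class and method names here are just one way to organize it):

require 'nokogiri'
require 'open-uri'
require 'pry'

class Scraper
  def self.scrape_jobs
    # Fetch the live page and parse its HTML into a Nokogiri document
    doc = Nokogiri::HTML(URI.open('https://cryptocurrencyjobs.co/'))

    # Pause execution so we can explore `doc` from the terminal
    binding.pry
  end
end

Scraper.scrape_jobs

(On older Rubies, open(...) works in place of URI.open.)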

Run -> ruby app/models/scraper.rb

From here we can see the value of our doc variable in the pry session.

Up until this point, everything we've done has been standard for setting up any kind of web scraping application using Ruby & Nokogiri. We have our values in nested nodes, and now it's just a matter of identifying and iterating through them.

Using CSS Selectors to Grab Our Data

After refactoring our code, we can create a variable that grabs elements using a CSS selector. For our purposes we're looking for a class called 'ais-Hits-item', which we found using our developer tools (recall from earlier that they're opened by right-clicking the page and inspecting an element on the Doc).
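As a sketch of that refactor, iterating over the matching nodes and saving each one could look something like this. The inner selectors for the title and link are hypothetical placeholders; use the arrow tool in your developer tools to find where they actually live:

require 'nokogiri'
require 'open-uri'

class Scraper
  def self.scrape_jobs
    doc = Nokogiri::HTML(URI.open('https://cryptocurrencyjobs.co/'))

    # Every job listing on the page sits inside an element
    # with the class 'ais-Hits-item'
    doc.css('.ais-Hits-item').each do |job_node|
      # Placeholder selectors: inspect the real markup first
      title  = job_node.css('h2').text.strip
      anchor = job_node.at_css('a')
      link   = anchor ? anchor['href'] : nil

      # Save a row in our cryptos table
      Crypto.create(title: title, link: link)
    end
  end
end

Because this version touches the Crypto model, it needs the Rails environment loaded, so run it with rails runner Scraper.scrape_jobs (or from rails console) rather than plain ruby.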

From here it's all about targeting specific elements within the actual page. Next time we'll take a look at how to put it all together!
