Ruby is a high-level, interpreted, general-purpose, object-oriented programming language. It was created by Yukihiro “Matz” Matsumoto of Japan in 1993.
Ruby has a simple syntax that is easy to understand. It is a powerful programming language with a rich ecosystem of libraries that can be used for web scraping, such as HTTParty, Mechanize, and Nokogiri. This makes it an excellent choice for extracting data from Google.
Web scraping is the process of extracting valuable data from websites or other sources. It is used for various tasks such as data mining, price monitoring, lead generation, and SEO.
In this blog, we will learn to scrape Google Search Results using Ruby and its libraries. We will also discuss why Ruby can be a preferred choice for scraping Google, the advantages of extracting search engine results, and finally, why the Google Official Search API may not be the best option for this task.
Why Ruby for Scraping Google Search Results?
Ruby is quite popular for web scraping. Its ability to handle complex scraping tasks by launching multiple threads to pull data from different parts of a website in parallel makes it an ideal choice for the job.
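For instance, here is a minimal sketch of fetching several pages in parallel with Ruby threads (the URLs are just placeholders, not from the tutorial):

require "httparty"

urls = ["https://example.com/page1", "https://example.com/page2"]
# Launch one thread per URL so the pages are fetched concurrently.
threads = urls.map { |url| Thread.new { HTTParty.get(url) } }
# Thread#value waits for each thread to finish and returns its result.
responses = threads.map(&:value)
puts responses.map(&:code)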
Ruby can parse both HTML and XML documents, and it provides a rich set of libraries that help developers automate the scraping process.
Overall, Ruby is a high-performance language with good community support. Whether you want to scrape Google or any other website, Ruby offers plenty of features and libraries to get you started.
Let’s start Scraping Google Search Results using Ruby
In this post, we will code a scraper to extract the first 10 Google Search Results with Ruby using Nokogiri and HTTParty. The returned data will consist of the title, link, and description of each organic result. You can use this data for a variety of purposes, such as SERP monitoring, rank tracking, and keyword tracking.
Google Search Results Scraping can be divided into two processes:
- Extracting HTML from the target URL.
- Parsing the extracted raw HTML to get the required data.
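Sketched with the two gems introduced below, that split looks like this:

require "httparty"
require "nokogiri"

# Step 1: extract the raw HTML from the target URL.
raw_html = HTTParty.get("https://www.google.com/search?q=ruby&gl=us").body
# Step 2: parse the raw HTML so it can be queried for the required data.
doc = Nokogiri::HTML(raw_html)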
Requirements:
To scrape Google Search Results, we will be working with these two libraries:
- HTTParty – Used to make HTTP requests and fetch the required data.
- Nokogiri – Used to parse HTML and XML documents.
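Both gems can be installed with RubyGems:

gem install httparty nokogiri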
Set-Up:
If you have not already installed Ruby, I recommend watching these videos to set it up before starting the tutorial.
Process:
So, now we can get started with our project. We will pass this URL as the parameter to scrape the search results, where `q` is the search query and `gl` sets the country of the results:
https://www.google.com/search?q=ruby&gl=us
Let us first require the dependencies we have installed and are going to use in the tutorial.
require "nokogiri"
require "httparty"
Then, we will define a `scraper` method to extract the required information from Google.
def scraper
  url = "https://www.google.com/search?q=ruby&gl=us"
  headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
  }
  unparsed_page = HTTParty.get(url, headers: headers)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  results = []
Step-by-step explanation:
- First, we set the URL we want to scrape.
- Then, we set the User-Agent header, which identifies the type of device or browser making the request.
- After that, we made an HTTP request to the URL with the help of HTTParty, passing the User Agent as the header.
- In the next line, we used Nokogiri to parse the extracted HTML, and then we initialized a `results` array to store the scraped data.
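As a side note, it is worth checking that the request actually succeeded before parsing; HTTParty responses expose the HTTP status code, so a simple guard (not part of the original code) could look like this:

# Optional guard: stop early if Google did not return a normal page
# (e.g., a 429 when you are being rate-limited).
raise "Request failed with status #{unparsed_page.code}" unless unparsed_page.code == 200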
Now, we will search the HTML for the tags that contain the relevant data.
If you inspect the Google webpage, you will see that all the organic results are inside a `div` container with the class name `g`.
So, we will select all the divs with `g` as the class name.
parsed_page.css("div.g")
Then, we will loop through each of the selected divs.
parsed_page.css("div.g").each do |result|
link = result.css(".yuRUbf > a").first
link_href = link.nil? ? "" : link["href"]
result_hash = {
title: result.css("h3").text,
link: link_href,
snippet: result.css(".VwiC3b").text
}
results << result_hash
end
puts results
end
scraper
In the above code, the second line ensures that we scrape the first anchor tag within the class `yuRUbf`.
After that, we check whether the link is nil. If it is, we store an empty string in `link_href`; if not, we initialize it with the URL present in the `href` attribute of the link.
If you inspect the Google webpage again, you will find that the title lives inside an `h3` tag, the link under the selector `.yuRUbf > a`, and the snippet under the class `.VwiC3b`.
Keep in mind that these class names are generated by Google and change from time to time, so you may need to re-inspect the page and update the selectors.
After executing the code without any error, your results should look like this:
{
:title=>"Ruby Programming Language",
:link=>"https://www.ruby-lang.org/en/",
:snippet=>"A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write."
}
{
:title=>"Ruby - Wikipedia",
:link=>"https://en.wikipedia.org/wiki/Ruby",
:snippet=>"A ruby is a pinkish red to blood-red colored gemstone, a variety of the mineral corundum (aluminium oxide). Ruby is one of the most popular traditional ..."
}
{
:title=>"Ruby: #1 Virtual Receptionist & Live Chat Solution for Small ...",
:link=>"https://www.ruby.com/",
:snippet=>"14000+ small businesses trust the virtual receptionists at Ruby to create meaningful connections over the phone and through live chat, 24/7."
}
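If you want to keep this data for later use (rank tracking, for example), one simple option is to have `scraper` return the `results` array instead of printing it, then dump it to a JSON file. A minimal sketch:

require "json"

# Assumes the last line of scraper was changed from `puts results` to `results`,
# so the method returns the scraped array instead of printing it.
File.write("serp_results.json", JSON.pretty_generate(scraper))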
But if you move forward with this method, Google may block your IP quickly. You can use a random User Agent for each request to avoid blocking to some extent. Let me show you how to do this:
Initialize an array of User Agents:
user_agents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
]
Then, select a random User Agent each time you make a request.
random_user_agent = user_agents.sample
The `sample` method returns a random element from the array.
And then pass it as the header when you fetch the HTML.
unparsed_page = HTTParty.get(url, headers: {
  "User-Agent": random_user_agent
})
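Putting it together, here is a minimal sketch of a helper that retries with a fresh random User Agent and a short back-off whenever Google returns a non-200 response (the helper name and retry count are illustrative choices, not part of the original code):

require "httparty"

def fetch_with_retry(url, user_agents, max_attempts = 3)
  max_attempts.times do
    response = HTTParty.get(url, headers: { "User-Agent": user_agents.sample })
    return response if response.code == 200
    sleep 2 # back off briefly before retrying with another User Agent
  end
  nil # give up after max_attempts failed requests
end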
So, this is how you can scrape Google Search Results with Ruby.
If you are looking for a more sophisticated and maintenance-free solution, then you can try the Google SERP API for scraping Google Search Results.
Benefits of Scraping Google Search Results
There are various advantages of scraping Google Search Results:
Rank Tracking – It can be used to track your website’s position in the search results, which helps you stay informed and make decisions accordingly.
Scalable – Scraping Google Search Results allows you to gather a significant amount of data that can be used for various purposes, such as keyword tracking and rank tracking.
Price Tracking – Scraping Google Search Results can help you stay well informed about the pricing of the products sold by your competitors.
Lead Generation – If you want to gather contact information about potential clients, scraping Google Search Results can be a great decision.
Real-time data – Scraping Google Search Results gives you access to real-time data, so you can stay up to date with the latest information.
Inexpensive – Most businesses can’t afford the official Google Search API, as it can dent an already tight budget, but scraping Google Search Results solves this problem as well.
Problems with the Official Google Search API
There are a few reasons why businesses don’t use the official Google Search API:
Expensive: The official Google Search API costs $5 per 1,000 requests, making it one of the most expensive options on the market.
Limited Access: The API exposes only a limited amount of data, so businesses turn to the web scrapers available on the market, which give them complete control over the scraped results.
Complex Setup: Users with no technical knowledge can find it very difficult to set up the API.
Using Google Search API to Scrape Google
Serpdog is the most trusted Google Scraper API on the market. Thanks to its massive pool of 10M+ residential proxies, users not only benefit from blockage-free scraping but also experience the robust speed of our SERP API. Serpdog supports all kinds of featured snippets and extracts every piece of information available on the search page.
Get started with scraping Google by signing up on Serpdog.
After registering successfully, embed the API key in the code below, and you will be able to scrape Search Engine Results Pages using our API.
require 'net/http'
require 'json'

params = {
  :api_key => "APIKEY",
  :q => "coffee",
  :gl => "us"
}

uri = URI('https://api.serpdog.io/search')
uri.query = URI.encode_www_form(params)
website_content = Net::HTTP.get(uri)
print(website_content)
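Since the API responds with JSON (which is why `json` is required above), you can parse the body into a Ruby hash; note that the `organic_results` key used here is an assumption about the response shape, so inspect the raw payload to confirm:

data = JSON.parse(website_content)
puts data["organic_results"] # assumed key; check the actual response structure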
Conclusion
In this tutorial, we learned to scrape Google Search Results using Ruby. Please do not hesitate to message me if I missed something. If you think we can help with your custom scraping projects, feel free to contact us.
Follow me on Twitter. Thanks for reading!