...


Web Scraping With Python - A Complete Guide

Python is the most popular language for web scraping. The flexibility it offers while extracting data from websites is one of the main reasons it is a preferred choice for data extraction. It also has several high-performance libraries, such as BeautifulSoup and Selenium, which can be used to build powerful and efficient scrapers.

Most people reading this article have probably heard the terms “data extraction” or “web scraping.” If you have not come across them yet, don’t worry: this article is intended for every kind of developer, whether you have just started with web scraping or want to learn more about it.

Web Scraping is the process of extracting a specific set of information from websites in the form of text, videos, images, and links. In today’s world, web scraping is an important skill to learn, as it can be used for a variety of purposes, such as lead generation, price monitoring, SERP monitoring, etc.

In this tutorial, we will learn web scraping with Python and also explore some of the high-performance libraries that can be used to create an efficient and powerful scraper.

HTTP headers hold great importance when scraping a website. Passing headers with the HTTP request affects not only the response but also the speed of the request. So, before starting with the core tutorial, let us learn about HTTP headers and their types in depth.

Why Python for Web Scraping?

There are many reasons why developers choose Python for web scraping over any other language:

Simple Syntax — Python is one of the simplest programming languages to understand. Even beginners can understand and write scraping scripts due to the clear and easy-to-read syntax.

Extreme Performance — Python provides many powerful libraries for web scraping, such as Requests, Beautiful Soup, Scrapy, Selenium, etc. These libraries can be used for making high-performance and robust scrapers.

Adaptability — Python provides libraries suited to a variety of situations. You can use Requests for making simple HTTP requests and, at the other end of the spectrum, Selenium for scraping dynamically rendered content.

HTTP Headers

In this section, we will learn about headers and their importance in web scraping. I will also explain the different types of headers in detail. Let’s get started!

Headers provide essential meta-information about the request and response, such as the content type, user agent, and content length. Each header is represented as a text string, with the name and value separated by a colon.

Headers have a significant impact on web scraping. Passing correctly optimized headers not only improves the quality of the returned data but also reduces response times. Website owners often deploy anti-bot technology to protect their sites from scraping bots; passing appropriate headers with the HTTP request can help you bypass these mechanisms and prevent your IPs from getting blocked.
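As a quick illustration, here is a minimal sketch (using the Requests library, which we cover later in this guide) of how such headers can be attached to a request; the header values shown are just examples, not requirements:

import requests

# Illustrative browser-like headers; the exact values are examples
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

response = requests.get("https://en.wikipedia.org/wiki/Web_scraping", headers=headers)
print(response.status_code)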

Headers can be classified into four types:

  • Request Headers
  • Response Headers
  • Representation Headers
  • Payload Headers

Let us learn each of them in detail.

Request Headers

The headers sent by the client when requesting data from the server are known as request headers. They also help the server identify the request sender, or client, using the information they carry.

Here are some examples of the request headers.

  • authority: en.wikipedia.org
  • method: GET
  • accept-language: en-US, en;q=0.9
  • accept-encoding: gzip, deflate, br
  • upgrade-insecure-requests: 1
  • user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4869.91 Safari/537.36

The user agent indicates the type of software or application used to send the request to the server.

The Accept-Language header tells the server about the desired language for the response. The Accept-Encoding header is a request header sent by the client that indicates the content encoding it can understand.

Note: Not all headers sent in a request are request headers. For example, the Content-Type header is not a request header but a representation header.

Response Headers

The headers sent by the server to the client in response to a request are known as response headers. They are not related to the content of the message itself; instead, they convey instructions and additional information from the server to the client.

Here are some examples of the response headers.

  • content-length: 35408
  • content-type: text/html
  • date: Thu, 13 Apr 2023 14:09:10 GMT
  • server: ATS/9.1.4
  • cache-control: private, s-maxage=0, max-age=0, must-revalidate

The Date header indicates when the response was sent to the client. The Server header tells the client which server software returned the response, and the Content-Length header indicates the length of the content returned by the server.

Note: The Content-Type header is a representation header.

Representation Headers

The headers that describe how the resource in the HTTP response body is represented are known as representation headers. The data can be transferred in several formats, such as JSON, XML, HTML, etc.

Here are some examples of the representation headers.

  • content-encoding: gzip
  • content-length: 35408
  • content-type: text/html

The Content-Encoding header informs the client about the encoding of the HTTP response body.

Payload Headers

The headers that describe the payload information needed to transport and reconstruct the original resource representation are known as payload headers.

Here are some examples of payload headers.

  • content-length: 35408
  • content-range: bytes 200-1000/67589
  • trailer: Expires

The Content-Range header indicates the position of a partial message within the full message body.
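To see payload headers in action, here is a small sketch (assuming the Requests library and the httpbin.org test service) that asks the server for only part of a resource and prints the Content-Range header of the partial response:

import requests

# Ask the server for only the first 100 bytes of the resource
resp = requests.get("https://httpbin.org/range/1024", headers={"Range": "bytes=0-99"})

print(resp.status_code)                    # 206 Partial Content if ranges are supported
print(resp.headers.get("Content-Range"))   # e.g. bytes 0-99/1024
print(resp.headers.get("Content-Length"))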

That completes the Headers section. There are more headers we could discuss, but covering them all would make this blog too long and stray from the main topic.

If you want to know more about headers, you can read this MDN documentation.

Web Scraping Libraries in Python

The top web scraping libraries in Python are:

  1. Requests
  2. HTTPX
  3. Beautiful Soup
  4. Scrapy
  5. Selenium
  6. Playwright

Requests

Requests is the most downloaded HTTP library on PyPI, pulling in around 30 million downloads every week. Everyone uses it, from programming beginners to experts.

This library will help us to make an HTTP request on the website server and pull the precious HTML data out of it. It supports all types of requests (GET, POST, DELETE, etc.) and follows a straightforward approach while handling cookies, sessions, and redirects.
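For example, here is a minimal sketch of cookie and redirect handling with a requests Session; the URLs are httpbin.org test endpoints used purely for illustration:

import requests

with requests.Session() as session:
    # The session stores cookies set by the server and sends them on later requests
    session.get("https://httpbin.org/cookies/set/session_id/12345")
    resp = session.get("https://httpbin.org/cookies")
    print(resp.json())  # {'cookies': {'session_id': '12345'}}

    # Redirects are followed automatically; resp.history records the intermediate hops
    resp = session.get("https://httpbin.org/redirect/2")
    print(resp.status_code, [r.status_code for r in resp.history])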

Let’s discuss how we can scrape Google Search Results using this library. 

pip install requests

After installing this library in your project folder, import it into your code like this: 

import requests

And then, we will use this library to create a scraping bot that will mimic an organic user.

url = "https://www.google.com/search?q=python+tutorial&gl=us"
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.2816.203 Safari/537.36"}
resp = requests.get(url, headers=headers).text
print(resp)

First, we declared our target URL. Then, we initialized the header with the User Agent. After that, we made a GET request on the URL and printed the response in the terminal.

Our response should look like this:

Scraped Google Results

Yeah, it is unreadable. Don’t worry! We will parse this extracted HTML using the Beautiful Soup library, which we will cover shortly.

HTTPX

HTTPX is a fully featured HTTP client for Python 3 that provides both sync and async APIs. It is considered more modern than Requests thanks to its support for features like the HTTP/2 protocol, connection pooling, request and response streaming, and much more.

This library will also help us make HTTP requests and extract the HTML content. You may ask: what is the difference between Requests and HTTPX? The answer lies in the extra features HTTPX offers over Requests, which we will cover in a bit.

Let us build a scraper with the help of HTTPX. This time, instead of Google Search results, we will fetch a hotel page from Booking.com, which we will parse in the next section.

Install this library:

pip install httpx

Then, we will import the library and let the scraper do its work.

import httpx
import asyncio

url = "https://www.booking.com/hotel/in/casa-saligao.html?checkin=2023-11-24&checkout=2023-11-28&group_adults=2&group_children=0&no_rooms=1&selected_currency=USD"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.2816.203 Safari/537.36"}


async def fetch_data():
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        response.raise_for_status()
        return response.text

response_text = asyncio.run(fetch_data())
print(response_text)

Step-by-step explanation:

  1. First, we defined an asynchronous function, fetch_data.
  2. Our function will then use httpx.AsyncClient() as a context manager to handle the HTTP requests.
  3. Then, we used the await keyword with the GET request to wait till the response arrives.
  4. Finally, we ran the coroutine with asyncio.run() and printed the returned HTML.

Advantages of Using HTTPX:

  1. HTTPX provides efficient handling of requests and responses thanks to its support for both sync and async APIs.
  2. It supports the HTTP/2 protocol, which is more modern and efficient than the HTTP/1.1 protocol used by Requests (see the sketch after this list).
  3. It uses connection pooling, an efficient way to reuse existing connections to a server.
  4. It allows for the streaming of requests and responses.
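As a quick illustration of the first two points, here is a hedged sketch of HTTPX's synchronous client with HTTP/2 enabled (HTTP/2 support assumes the optional extra is installed via pip install httpx[http2]):

import httpx

# Synchronous client with HTTP/2 enabled; requires: pip install httpx[http2]
with httpx.Client(http2=True, headers={"User-Agent": "Mozilla/5.0"}) as client:
    response = client.get("https://www.google.com/search?q=python+tutorial&gl=us")
    print(response.http_version)  # e.g. "HTTP/2"
    print(response.status_code)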

BeautifulSoup

BeautifulSoup, also known as BS4, is a web parsing library used for parsing HTML and XML documents. It can easily navigate within the HTML structure and allows us to extract data from HTML elements and their attributes.

Let’s discuss how we can parse the HTML document we extracted in the previous section using this library.

First, install this library in your project folder.

pip install beautifulsoup4 

And then import it into your file.

from bs4 import BeautifulSoup

We will extract the following data points from the target page: 

  1. Hotel Name
  2. Address
  3. Rating and Total Reviews
  4. Facilities

Let us first declare our BeautifulSoup instance to perform this operation.

soup = BeautifulSoup(response_text, 'html.parser')

Extracting the Hotel Name

Let us inspect the HTML and locate the tag for the Hotel Name.

Inspecting Hotel Name

In the above image, we can conclude that the name is contained inside the h2 tag with class pp-header__title.

Add the following code to your file to extract the Hotel Name.

name = soup.find("h2", class_="pp-header__title").text

Extracting Address

A similar process can be followed to scrape the address.

As you can see, the address is inside the span tag with the class hp_address_subtitle. So, it can be extracted with the following code.

address = soup.find("span", class_="hp_address_subtitle").text

Extracting the Hotel Rating and Reviews

Scroll down to the reviews section of the page and then inspect them.

Inspecting Hotel Ratings

So, the rating is stored under the div tag with class d86cee9b25 and the review information is stored inside the span with the class d935416c47.

The following code will extract the review and rating information.

rating = soup.find("div", class_="d86cee9b25").text
reviews = soup.find("span", class_="d935416c47").text.replace("·", "").strip()

Extracting the Hotel Facilities

Extracting Hotel Facilities will be a little bit of a long process.

Inspecting Hotel Facilities

The div tag with the attribute data-testid="property-most-popular-facilities-wrapper" contains a list of all the facilities.

Let us first select these list items using BS4, and then we will loop over each of them to extract the text content.

fac_elements = soup.select("div[data-testid='property-most-popular-facilities-wrapper'] ul li")

Then, declare an array to store the elements and run a loop.

facilities = []

for el in fac_elements:
    facilities.append(el.get_text().strip())

If you print this array, you will notice that each facility appears twice. We will split the array in half to avoid the duplicate data.

middle_index = len(facilities) // 2
facilities = facilities[:middle_index]

Then you can print all these data points to verify if you are getting the desired data. 

Extracted Hotel Details

Advantages of using Beautiful Soup:

  1. Easy to use and beginner-friendly.
  2. Makes navigating and searching the parsed DOM straightforward.
  3. Works on all HTML and XML documents.
  4. Easier to debug.
  5. Allows data extraction from HTML and XML in various flexible ways (see the sketch after this list).
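As a sketch of that flexibility, the hotel name we scraped earlier can be reached in several equivalent ways, reusing the soup object and class name from the sections above:

# Three equivalent ways to reach the same element with Beautiful Soup
name_via_find = soup.find("h2", class_="pp-header__title").text
name_via_select = soup.select_one("h2.pp-header__title").get_text()
name_via_attrs = soup.find("h2", attrs={"class": "pp-header__title"}).text

print(name_via_find.strip() == name_via_select.strip() == name_via_attrs.strip())  # True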

Scrapy

Scrapy is a fast and powerful open-source web crawling framework used to extract data from websites. Developed by Zyte.com, it is easy to use and is designed for creating scalable and flexible Python web scraping projects.

Let’s design a scraper to scrape all the titles of the posts listed on this Hacker News (news.ycombinator.com) page.

YCombinator

Let us first install Scrapy.

pip install scrapy

Then we will create our project folder using scrapy.

scrapy startproject scraper

This will create a folder named “scraper” in your project folder. 

Folder structure

After running the above command in your terminal, Scrapy also prints two suggested follow-up commands for creating your first spider.

Use these commands to go inside the scraper folder and generate a spider.

cd scraper
scrapy genspider ycombinator news.ycombinator.com

This will create a file named ycombinator.py inside the spiders folder. It will automatically generate a class and a parse method for us in that file.

import scrapy


class YcombinatorSpider(scrapy.Spider):
    name = "ycombinator"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ["http://news.ycombinator.com/"]

    def parse(self, response):
        pass

Now, let’s find the tag of the required element we want to scrape. 

Inspecting YCombinator

As you can see, all of our post titles can be selected with .titleline > a.

This makes our code look like this: 

import scrapy
from bs4 import BeautifulSoup

class YcombinatorSpider(scrapy.Spider):
    name = "ycombinator"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ["https://news.ycombinator.com/"]

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4690.0 Safari/537.36'  # Replace with the desired User-Agent value
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        for el in soup.select(".athing"):
            obj = {}
            try:
                obj["titles"] = el.select_one(".titleline > a").text
            except:
                obj["titles"] = None
            yield obj



if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(YcombinatorSpider)
    process.start()

Run this code in your terminal, and the results should look like this:

Parsed titles from YCombinator

Advantages of using Scrapy:

  1. It is designed to be highly scalable.
  2. It is much faster than plain HTTP libraries.
  3. It comes with built-in support for error handling, cookie management, feed exports, and much more (see the example after this list).
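As an example of that built-in support, Scrapy can export the yielded items to a file without any extra code. Run the standard feed-export command from inside the project folder:

scrapy crawl ycombinator -o titles.json

The -o flag writes the scraped objects to titles.json; JSON Lines and CSV outputs are supported out of the box as well.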

Selenium

In the above sections, we have studied some great frameworks, but none of them is suitable for scraping SPAs (single-page applications) or dynamically rendered content. This is where Selenium comes into play!

Selenium is a Python library used for browser automation and for testing web applications. It is a powerful tool that can perform various tasks like clicking buttons, navigating through pages, infinite scrolling, and much more. Not only does it support multiple languages, but also multiple browsers.

Let us scrape the list of books from this website using Selenium. 

Books Results

First, install Selenium in your project folder.

pip install selenium

Now, download the Chrome driver from this link. Make sure to install the Chrome driver version that matches your Chrome browser version.

Next, we will import all the libraries we will be using further in this section.

from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from bs4 import BeautifulSoup

After that, we will set the path where our Chrome Driver is located. 

SERVICE_PATH = r"E:\chromedriver.exe"
l = []
obj = {}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
}

url = "https://books.toscrape.com"

In the first line, we set the path where our Chrome Driver is located. Then, we declared a list and object to store the scraped data. After that, we set headers to User Agent and the URL to our target link.

So, let us now navigate to our target URL.

service = Service(SERVICE_PATH)
driver = webdriver.Chrome(service=service)
driver.get(url)

In the above code, we used the web driver to use the Chrome Driver present at the given path to navigate to the target URL.

So, after navigating to the target URL, we will ask the web driver for the HTML content of the page and then close it.

resp = driver.page_source
driver.close()

Closing the web driver is important, as it consumes a lot of CPU while running, which drives up scraping costs.
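If your scraping code can raise exceptions midway, a safer pattern is to close the browser in a finally block; here is a minimal sketch of the same flow, reusing the service and url variables defined above:

driver = webdriver.Chrome(service=service)
try:
    driver.get(url)
    resp = driver.page_source
finally:
    # quit() closes all browser windows and ends the ChromeDriver process
    driver.quit()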

And then, we will create an instance of BeautifulSoup to parse the HTML.

soup = BeautifulSoup(resp, 'html.parser')

Let us now search for the classes of elements we want to scrape.

Inspecting books.toscrape

If you inspect the HTML, you will find that every product is inside an article tag with the class product_pod. Going deeper, you will find the title inside the h3 tag, the link to the book in the a tag, the price under the class price_color, and so on.

After searching for all the tags, our parser should look like this: 

for el in soup.select("article.product_pod"):
    obj["title"] = el.select_one("h3").text
    obj["link"] = el.select_one("a")["href"]
    obj["price"] = el.select_one(".price_color").text
    obj["stock_availability"] = el.select_one(".instock").text.strip()
    l.append(obj)
    obj = {}

print(l)

This will give you information about twenty books present on the web page. 

[
 {
  'title': 'A Light in the ...',
  'link': 'catalogue/a-light-in-the-attic_1000/index.html',
  'price': '£51.77',
  'stock_availability': 'In stock'
 },
 {
  'title': 'Tipping the Velvet',
  'link': 'catalogue/tipping-the-velvet_999/index.html',
  'price': '£53.74',
  'stock_availability': 'In stock'
 },
 {
  'title': 'Soumission',
  'link': 'catalogue/soumission_998/index.html',
  'price': '£50.10',
  'stock_availability': 'In stock'
  },
.....

This is how you can use Selenium for completing web scraping tasks.

Advantages of using Selenium

  1. Selenium allows us to perform various tasks like clicking elements on the web page, scrolling the page, navigating between pages, and much more (see the sketch after this list).
  2. It can also be used for taking screenshots of web pages.
  3. It supports multiple programming languages.
  4. It can be used to find bugs at an early stage of testing.
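Here is a hedged sketch of a few of these interactions; the CSS selector used for the click is a hypothetical placeholder, not one taken from the books website above, and a recent Selenium version is assumed so the driver can be located automatically:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com")

# Click an element (hypothetical selector, for illustration only)
driver.find_element(By.CSS_SELECTOR, "ul.nav a").click()

# Scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Take a screenshot of the current page
driver.save_screenshot("page.png")

driver.quit()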

Disadvantages of using Selenium

  1. It is slower in execution.
  2. It has a high CPU usage.
  3. It can increase scraping costs.

Read More: Scrape Zillow Using Python 

Playwright

Playwright in Python can be used to automate browsers like Chromium, Firefox, and WebKit with a single API. It was originally developed by Microsoft as an open-source Node.js library. Like Selenium, it can also perform tasks like clicking buttons, navigating through pages, infinite scrolling, and much more.

Let us scrape the same list of books we extracted in the above section using Playwright.

First, install Playwright.

pip install playwright

Now, import the BeautifulSoup library to parse the HTML and playwright.sync_api module so we can control the Playwright service.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

After that, we will create a function to extract the data from the web page.

def getData():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://books.toscrape.com")
        page_source = page.content()
        browser.close()

Step-by-step explanation:

  1. Inside the function, we started Playwright through its sync API as a context manager and launched a Chromium browser.
  2. Then, we opened a new page and navigated to the target URL.
  3. Next, we extracted the page content, i.e., the HTML.
  4. Finally, we performed the most important step: closing the browser.

Then, still inside the getData() function, use BeautifulSoup to extract the required data from the web page.

    soup = BeautifulSoup(page_source, 'html.parser')
    books = soup.select("article.product_pod")

    data = []

    for book in books:
        title = book.select_one("h3").text
        link = book.select_one("a")["href"]
        price = book.select_one(".price_color").text
        stock_availability = book.select_one(".instock").text.strip()

        data.append({
            "title": title,
            "link": link,
            "price": price,
            "stock_availability": stock_availability,
        })

    print(data)



getData()

This will return the same data as we got in the above section.
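Playwright also ships an async API. Here is a hedged sketch of the same fetch written asynchronously, with the parsing step omitted for brevity:

import asyncio
from playwright.async_api import async_playwright

async def get_page_source():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com")
        page_source = await page.content()
        await browser.close()
        return page_source

html = asyncio.run(get_page_source())
print(len(html))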

Recap

In the above sections, we learned about some great web scraping libraries in Python. Requests and Beautiful Soup can be an excellent combination for extracting data at scale, but this requires you to handle proxy rotation and IP blocks on your end.

But if you want to scrape websites like Google or Amazon at scale, you can consider using our Google Search API and Python Web Scraping API, which handle proxy rotation and blocking on their end using a massive pool of 10M+ residential proxies.

Moreover, Serpdog APIs return results in both HTML and JSON format, so our customers don’t have to spend their valuable time dealing with complex HTML structures.

I have also prepared the following sections for beginners who want to know more about web scraping in Python:

Regular Expressions

Regex, or regular expressions, describes sequences of characters used to match patterns in a body of text. Regular expressions can be used to manipulate text strings and are available in virtually every programming language, including Python, JavaScript, C++, and more.

This makes regex one of the most helpful tools for filtering specific pieces of data out of raw HTML; all it requires is the correct pattern.

Learning regex can be difficult at first, but once you understand the concept, it becomes a very useful tool.

Let us discuss how to verify a 10-digit phone number using the regex pattern.

First, we will import Python’s built-in regex module, re.

import re

Then, define the pattern:

pattern = r'^\d{3}[-\s]?\d{3}[-\s]?\d{4}$'

This regex pattern also checks for any optional dashes or spaces within the phone number.

Then, we will take input from the user and check whether the entered number is a correct 10-digit phone number.

phone_number = input("Enter a 10-digit phone number: ")

if re.match(pattern, phone_number):
    print("Valid phone number.")
else:
    print("Invalid phone number.")

Run this program in your terminal and try to enter any number.

Validating a phone number

Let me try a different number also.

Validating a phone number

This method is quite useful when you want to scrape phone numbers from a web page and also need to check whether each extracted number is valid.
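As a hedged sketch of that idea, re.findall() can pull every matching number out of a page's text; the HTML snippet below is made up purely for illustration:

import re

# A made-up HTML snippet standing in for scraped page content
html = "<p>Sales: 402-555-0192</p><p>Support: 402 555 0147</p><p>Fax: 4025550111</p>"

# Same pattern as above, but without the ^...$ anchors so it can match inside larger text
phone_pattern = r'\d{3}[-\s]?\d{3}[-\s]?\d{4}'

print(re.findall(phone_pattern, html))
# ['402-555-0192', '402 555 0147', '4025550111']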

Urllib3

Urllib3 is a powerful third-party Python library used for making HTTP requests. It is easy to use, allows you to make all types of HTTP requests (GET, POST, DELETE, etc.), and comes with features such as connection pooling, proxy support, and SSL/TLS verification.

For comparison, the built-in urllib package (with which urllib3 is often confused) is split into several modules:

  • request — used to open and read URLs.
  • error — contains the exception classes for exceptions raised by the request module.
  • parse — used to parse URLs.
  • robotparser — used to parse robots.txt files.

Urllib3 itself is organized around the PoolManager, which manages HTTP connections to a server and reuses them across requests.

Let us try to make a GET request using Urllib3. It is quite simple.

import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://httpbin.org/headers')
print(response.data)

In the above code, after importing the library, we created an instance of the pool manager responsible for managing the HTTP connections, and then we made an HTTP request with the help of that instance on our target URL.

GET Request with Urllib3

As discussed above, we can also use it to send POST requests. 

import urllib3
http = urllib3.PoolManager()
response = http.request('POST', 'http://httpbin.org/post', fields = {"Blog": "Web Scraping With Python"})
print(response.data)

We sent form data through the fields parameter to the server, which echoes it back in the response, confirming a successful request.

POST Request with Urllib3

This library is not widely used for web scraping tasks, but it does have some advantages worth discussing.

Advantages of using Urllib3:

  1. Supports connection pooling.
  2. Has proxy support (see the sketch after this list).
  3. Supports SSL/TLS verification.
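For the proxy support mentioned in point 2, urllib3 provides a ProxyManager that routes every request through a proxy. A minimal sketch follows; the proxy address is a placeholder you would replace with a real one:

import urllib3

# Placeholder proxy URL; replace it with a real proxy before running
proxy = urllib3.ProxyManager("http://203.0.113.10:8080")

response = proxy.request("GET", "http://httpbin.org/ip")
print(response.data)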

Disadvantages of using Urllib3:

  1. Has limited features.
  2. Cookie support is poor.

Socket

Socket is a built-in Python module used for low-level network communication between two or more computers over a network.

Here is how you can make an HTTP GET request using Socket by opening a TCP Socket. 

import socket
HOST = 'www.google.com'
PORT = 80 
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
sock.connect(server_address)
sock.send(b"GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n")
response = sock.recv(4096)
sock.close()
print(response.decode())

Step-by-step explanation:

  1. After importing the library, we set the HOST URL and the port.
  2. In the next line, we passed two parameters to the socket constructor: socket family and type.
  3. Then, we initialized the server address with the address of the web server.
  4. After that, we used the connect method with the server address as a parameter to establish a connection with the web server.
  5. Then we made a GET request with the bytes object as a parameter.
  6. Now, we get the response from the server using the sock.recv with 4096 as the buffer size.
  7. Finally, we closed the socket and printed the response received from the server.
Scraped Data by Socket

This is how you can create a basic web scraper using Socket.

MechanicalSoup

MechanicalSoup is a Python library that combines the Beautiful Soup and Requests libraries. It has various features, such as interacting with web forms, handling redirects, and managing cookies.

Let’s discuss how we can scrape this website using MechanicalSoup.

First, install this library by running this command in your project terminal.

pip install MechanicalSoup

Now, import this library into your project file.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

We created an instance of StatefulBrowser and assigned it to the variable browser, which can now be used to interact with web pages.

url= "https://en.wikipedia.org/wiki/Amazon_(company)"
browser.open(url)
print(browser.get_current_page())

browser.open() opens the web page in the session, and browser.get_current_page() returns the parsed HTML (a BeautifulSoup object) of the current page.

Scraped Data from Wikipedia

If you want to learn more about this library, you can refer to this docs page.
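Since MechanicalSoup is built for form handling, here is a hedged sketch of filling in and submitting Wikipedia's search form; the form selector and input name are taken from the page's current HTML and may change:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://en.wikipedia.org/wiki/Main_Page")

# Select the search form, fill in the query field, and submit it
browser.select_form('form#searchform')
browser["search"] = "Web scraping"
browser.submit_selected()

print(browser.get_url())  # URL of the page reached after submitting the form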

Conclusion

In this tutorial, we learned web scraping with Python using various libraries like Requests, Beautiful Soup, Scrapy, Selenium, and Playwright. I hope this tutorial gave you a complete overview of web scraping in Python.

Please do not hesitate to message me if I missed something. If you think we can complete your custom scraping projects, feel free to contact us. Follow me on Twitter. Thanks for reading!

Additional Resources

  1. Web Scraping With JavaScript and Node JS — An Ultimate Guide
  2. Scrape Google Play Store
  3. Web Scraping Google Maps
  4. Scrape Google Shopping
  5. Scrape Google Scholar