
Scraping Google News Using Python (2024 Updated)

Web Scraping Google News Using Python

Access to real-time news search results is crucial for businesses. The latest headlines and updates about a brand help them stay informed about public and media sentiment, protect their reputation, and gain a competitive edge in the market.

Python, one of the most popular languages for web scraping, can be used to build automated scrapers that extract valuable data from the internet for purposes such as data analysis, SEO monitoring, and news and media monitoring. In this tutorial, we will use it to scrape data from Google News results.

What’s the Purpose of Scraping Google News? 

Access to Google News Data can provide you with several benefits, including:

Brand Monitoring — News results help you monitor how the media and the public perceive your brand. They also let you keep a check on any issue or negative publicity about your company that could affect your business.

Keeps You Updated — News results keep you updated about current political events occurring worldwide. They also help you keep track of the latest advancements in your areas of interest.

Market Research — Google News results can help you study historical trends in your industry, and the data can also be used for research purposes such as consumer sentiment and competitor analysis.

Competitor Analysis — You can use the news data to monitor your competitors' latest developments and product launches. You can also study their media strategy to identify gaps in your own approach to media marketing.

Building Awareness — News data can also be used to build public awareness of particular topics such as political science, general knowledge, and economics.

Let’s Start Scraping

In this blog post, we'll create a Python script to extract the first 100 Google News results, including the title, description, link, source, and date.

Requirements

We will be installing these two libraries for this project: 

  1. Beautiful Soup — Used for parsing the raw HTML data.
  2. Requests — Used for making HTTP requests.

You can install both libraries by running the following commands in your terminal:

pip install requests
pip install beautifulsoup4

Process:

Before starting, I assume you have already set up a Python project on your device. Open the project file in your code editor and import the following libraries, which we will use throughout this tutorial (json ships with Python, so only the two libraries above need to be installed).

import json
import requests
from bs4 import BeautifulSoup

Now, let’s create a function to scrape the Google News Results: 

def getNewsData():
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
    }
    response = requests.get(
        "https://www.google.com/search?q=amazon&gl=us&tbm=nws&num=100", headers=headers
    )
    soup = BeautifulSoup(response.content, "html.parser")
    news_results = []

First, we set a User-Agent header, which helps our scraping bot look like an organic visitor to Google. Then we make an HTTP request to the target URL using the requests library we imported above and store the returned HTML in the response variable. In the last line, we create a BeautifulSoup instance to parse the HTML data.
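Google sometimes returns a non-200 status or a block page instead of results. As a small addition of my own (not part of the original script), you can check the status code right after the request and stop early:

    # Optional, not in the original tutorial: stop early if Google blocked the request
    if response.status_code != 200:
        print("Request failed with status code:", response.status_code)
        return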

Inspecting Google News Results

Let us now find the tags in the HTML that contain the required data.

If you inspect the HTML, you will find that every news result is contained inside a div tag with the class SoaBEf. So, we will loop over every div container with the class SoaBEf to get the required data.

    for el in soup.select("div.SoaBEf"):
        news_results.append(
            {
                # the fields (title, link, snippet, date, source) will be added here in the next steps
            }
        )

    print(json.dumps(news_results, indent=2))

getNewsData()

Next, we will locate the tags and classes for each entity we need to scrape in this tutorial.

Scraping News Title

Let’s find the headline of the news article by inspecting it.

Scraping News Title

As you can see in the above image, the title is under the div container with the class MBeuO.

Add the following code in the append block to get the news title.

                "title": el.select_one("div.MBeuO").get_text(),

Scraping News Source and Link

Similarly, we can extract the News source and link from the HTML.

Scraping News Source and Link

The news link is present as the value of the href attribute of the anchor tag, and the source name is contained inside a span within the div tag with the class NUnG9d. You can scrape both the source and the link using the following code.

                "source": el.select_one(".NUnG9d span").get_text(),
                "link": el.find("a")["href"],

Scraping News Description and Date

The news description is stored inside the div tag with class GI74Re, and the date is present inside the div tag with the class LfVVr.

Scraping News Description and Date

Copy the following code to extract both of them.

                "snippet": el.select_one(".GI74Re").get_text(),
                "date": el.select_one(".LfVVr").get_text(),

Finally, we are done with extracting all the entities.
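Note that Google changes these class names from time to time, so any of the selectors above can stop matching without warning. As a defensive option of my own (not part of the original script), you could use a small helper that returns an empty string instead of raising an error when a selector is missing:

def safe_text(el, selector):
    # Return the element's text, or an empty string if the selector no longer matches
    node = el.select_one(selector)
    return node.get_text() if node else ""

Inside the append block you would then write, for example, "title": safe_text(el, "div.MBeuO") instead of calling get_text() directly.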

Complete Code

So, this is how you can scrape Google News results. If you want to extract more information from the HTML, you can adjust the code accordingly. Here is the complete code for reference:

import json
import requests
from bs4 import BeautifulSoup

def getNewsData():
    # Identify the request as a regular browser visit
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
    }
    # Request the Google News results page (tbm=nws) for the query
    response = requests.get(
        "https://www.google.com/search?q=us+stock+markets&gl=us&tbm=nws&num=100", headers=headers
    )
    # Parse the raw HTML
    soup = BeautifulSoup(response.content, "html.parser")
    news_results = []

    # Each news result lives inside a div with the class SoaBEf
    for el in soup.select("div.SoaBEf"):
        news_results.append(
            {
                "link": el.find("a")["href"],
                "title": el.select_one("div.MBeuO").get_text(),
                "snippet": el.select_one(".GI74Re").get_text(),
                "date": el.select_one(".LfVVr").get_text(),
                "source": el.select_one(".NUnG9d span").get_text()
            }
        )

    print(json.dumps(news_results, indent=2))

getNewsData()

Okay, so let us now run this code in our terminal to see the results:

Google News Results

However, there is a problem. We don't want to copy these results from the terminal by hand every time and paste them into a file for safekeeping. It would be much easier to store the data in a CSV file through an automated process.

First, we need to import the CSV library into our program.

import csv

Then, you can replace the print line with the following code.

    with open("news_data.csv", "w", newline="") as csv_file:
        fieldnames = ["link", "title", "snippet", "date", "source"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(news_results)
 
print("Data saved to news_data.csv")

This code will allow us to save the scraped data into a CSV file with columns link, title, snippet, date, and source.
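If you want to confirm the file was written correctly without opening it, a quick check (my own addition, not part of the tutorial code) is to read the rows back with csv.DictReader:

import csv  # already imported above

# Read the file back and count the saved rows
with open("news_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(len(rows), "rows saved to news_data.csv")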

Google News Results in CSV

Hurray🥳🥳!!! We have successfully scraped the news data. Let us look at another method to help you scrape the news results without getting blocked.

Using Google News Scraper

Scraping news results can be difficult for an individual because Google frequently changes its HTML structure. You also need a large pool of residential proxies and User Agents to avoid being blocked by Google and keep scraping smoothly.
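For illustration, a very basic way to vary the User-Agent and route requests through a proxy with the requests library might look like the sketch below (the User-Agent strings and the proxy address are placeholder example values, not a complete anti-blocking setup):

import random
import requests

# A small pool of example User-Agent strings (illustrative values only)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}

# Placeholder proxy address; replace it with a proxy you actually have access to
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

response = requests.get(
    "https://www.google.com/search?q=amazon&gl=us&tbm=nws&num=100",
    headers=headers,
    proxies=proxies,
)

Even with rotation in place, scraping Google at scale still takes ongoing maintenance.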

What if you were provided with a simple and streamlined solution to scrape Google News Results?

Yes, you heard right! Our Google News API is powered by a network of 10M+ proxies and allows you to scrape Google News results at scale without any fear of blockage.

Google News API

We also offer 100 free requests on the first sign-up.

Serpdog Dashboard

After registering on our website, you will get an API key. Embed this API key in the code below, and you will be able to scrape Google News results at a much faster speed.

import requests

payload = {'api_key': 'APIKEY', 'q': 'football', 'gl': 'us'}
resp = requests.get('https://api.serpdog.io/news', params=payload)
print(resp.text)
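Because the API returns structured JSON, you can also parse the response instead of printing the raw text (the exact field names in the response are not covered in this tutorial):

# Parse the JSON body into Python objects
data = resp.json()
print(data)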

Conclusion:

In this article, we discussed two ways of scraping Google News data using Python. Data collectors who want an independent scraper and need to keep a certain amount of flexibility while collecting data can use Python to interact with the web page directly.

Otherwise, Google News API is a simple solution that can quickly extract and clean the raw data obtained from the web page and present it in structured JSON format.

We also learned how this extracted data can be used for various purposes, including brand monitoring and competitor analysis.

Feel free to message me anything you need clarification on. Follow me on Twitter. Thanks for reading!

Additional Resources

  1. How to scrape Google Organic Search Results using Node JS?
  2. Scrape Google Images Results
  3. Scrape Google Shopping Results
  4. Scrape Google Maps Reviews

Frequently Asked Questions