Scrape Google Scholar Using Python

Google Scholar is a useful data source for businesses that need quality, research-based content from across the web.

In this tutorial, we will learn to scrape Google Scholar results using Python and the requests and BeautifulSoup libraries.

Let’s start scraping Google Scholar using Python

In this section, we will write a basic Python script to scrape Google Scholar results, but first let us set up the requirements for this project.

Set-Up

If you have not already installed Python on your device, please consider these videos:

  1. How to install Python on Windows?
  2. How to install Python on MacOS?

Alternatively, you can install Python directly from its official website.

Requirements

To scrape Google Scholar Results, we will be using these two Python libraries:

  1. Beautiful Soup — Used for parsing the raw HTML data.
  2. Requests — Used for making HTTP requests.

Run the commands below in your project terminal to install the libraries.

pip install requests
pip install beautifulsoup4

Google Scholar Organic Results

In this section, we will scrape the organic results from Google Scholar.

Open this link on your desktop. We will scrape the title, link, ID, displayed link, snippet, cited-by count, and versions count from those results.

Here is the complete code to scrape the Google Scholar organic results:

import requests
from bs4 import BeautifulSoup

def getScholarData():
    try:
        url = "https://www.google.com/scholar?q=Quantum+Physics&hl=en"
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        scholar_results = []

        # Each organic result sits inside a ".gs_ri" container.
        # Note: [0] raises IndexError when a selector matches nothing,
        # which ends the whole run through the except block below.
        for el in soup.select(".gs_ri"):
            scholar_results.append({
                "title": el.select(".gs_rt")[0].text,
                "title_link": el.select(".gs_rt a")[0]["href"],
                "id": el.select(".gs_rt a")[0]["id"],
                "displayed_link": el.select(".gs_a")[0].text,
                "snippet": el.select(".gs_rs")[0].text.replace("\n", ""),
                "cited_by_count": el.select(".gs_nph+ a")[0].text,
                "cited_link": "https://scholar.google.com" + el.select(".gs_nph+ a")[0]["href"],
                "versions_count": el.select("a~ a+ .gs_nph")[0].text,
                # Build the versions link only when the versions text is non-empty.
                "versions_link": "https://scholar.google.com" + el.select("a~ a+ .gs_nph")[0]["href"] if el.select("a~ a+ .gs_nph")[0].text else "",
            })

        # Drop keys whose value is empty or None.
        for i in range(len(scholar_results)):
            scholar_results[i] = {key: value for key, value in scholar_results[i].items() if value != "" and value is not None}

        print(scholar_results)

    except Exception as e:
        print(e)

getScholarData()

You can use this CSS Selectors Gadget to find the selectors for the elements you want in the HTML results.

This will return the following data from the web page:

[
 {
  'title': '[BOOK][B] Quantum physics',
  'title_link': 'https://books.google.com/books?hl=en&lr=&id=qFtQiVmjWUEC&oi=fnd&pg=PA28&dq=Quantum+Physics&ots=tuFPLNEOmu&sig=tRYUJyK8VECC_j1Lwpbl6bjC5ag',
  'id': 'cU32d0ZoSA0J',
  'displayed_link': 'S Gasiorowicz - 2007 - books.google.com',
  'snippet': '… the Old Quantum Theory. This material appears in every textbook on modern physics in one … the creation of quantum mechanics, and it highlights the differences between classical and …',
  'cited_by_count': 'Cited by 1400',
  'cited_link': 'https://scholar.google.com/scholar?cites=957129572685860209&as_sdt=2005&sciodt=0,5&hl=en',
  'versions_count': 'All 10 versions',
  'versions_link': 'https://scholar.google.com/scholar?cluster=957129572685860209&hl=en&as_sdt=0,5'
 },
 {
  'title': '[BOOK][B] Quantum physics',
  'title_link': 'https://books.google.com/books?hl=en&lr=&id=hBlZu4M51IMC&oi=fnd&pg=PA1&dq=Quantum+Physics&ots=uEsu7Kytk-&sig=54fJyjI177Zm2gC7HZcgM_iCEJU',
  'id': 'ENaNHF5MoQsJ',
  'displayed_link': 'M Le Bellac - 2011 - books.google.com',
  'snippet': '… allow the fundamentals of quantum mechanics to be tested … in quantum physics. It will also allow us to give a small sample of the current ideas on the notion of measurement in quantum …',
  'cited_by_count': 'Cited by 211',
  'cited_link': 'https://scholar.google.com/scholar?cites=838034972757317136&as_sdt=2005&sciodt=0,5&hl=en',
  'versions_count': 'All 4 versions',
  'versions_link': 'https://scholar.google.com/scholar?cluster=838034972757317136&hl=en&as_sdt=0,5'
 },
 ....
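
The cited_by_count and versions_count fields come back as display strings such as 'Cited by 1400' and 'All 10 versions'. If you need them as numbers, a small helper can pull out the digits. The sketch below is one possible way to do that; extract_count is a hypothetical helper name, not part of the scraper above, and it assumes the count is the first run of digits in the string.

```python
import re

def extract_count(label):
    """Return the first integer found in a label such as 'Cited by 1400', or None."""
    match = re.search(r"\d+", label or "")
    return int(match.group()) if match else None

# Sample values copied from the output above.
result = {"cited_by_count": "Cited by 1400", "versions_count": "All 10 versions"}

cited_by = extract_count(result.get("cited_by_count"))
versions = extract_count(result.get("versions_count"))
print(cited_by, versions)  # 1400 10
```

Returning None for a missing or empty label keeps the helper safe to use on results where the scraper dropped the key.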

Google Scholar Cite Results

Next, we will use the IDs we got from scraping the organic results to scrape the cite results.

import requests
from bs4 import BeautifulSoup
def getData():
    try:
        url = "https://scholar.google.com/scholar?q=info:cU32d0ZoSA0J:scholar.google.com&output=cite"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        cite_results = []
        for el in soup.select("#gs_citt tr"):
            cite_results.append({
                "title": el.select_one(".gs_cith").text.strip(),
                "snippet": el.select_one(".gs_citr").text.strip()
            })
        links = []
        for el in soup.select("#gs_citi .gs_citi"):
            links.append({
                "name": el.text.strip(),
                "link": el.get("href")
            })
        print(cite_results)
        print(links)
    except Exception as e:
        print(e)
getData()

If you look at the URL, the value after info: is the ID of the first organic result from the previous section.
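
To request the cite results for any other organic result, the same URL can be built from its ID. This is a minimal sketch that follows the URL pattern used in the code above; build_cite_url is a hypothetical helper name introduced here for illustration.

```python
def build_cite_url(result_id):
    """Build the cite URL for one organic result ID."""
    return (
        "https://scholar.google.com/scholar"
        f"?q=info:{result_id}:scholar.google.com&output=cite"
    )

# ID of the first organic result from the previous section.
print(build_cite_url("cU32d0ZoSA0J"))
```

You could loop over the 'id' values in scholar_results and pass each one through this helper before making the request.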

Our result should look like this:

  [
    {
        'title': 'MLA', 'snippet': 'Gasiorowicz, Stephen. Quantum physics. John Wiley & Sons, 2007.'
    },
    {
        'title': 'APA', 'snippet': 'Gasiorowicz, S. (2007). Quantum physics. John Wiley & Sons.'
    },
    {
        'title': 'Chicago', 'snippet': 'Gasiorowicz, Stephen. Quantum physics. John Wiley & Sons, 2007.'
    },
    {
        'title': 'Harvard', 'snippet': 'Gasiorowicz, S., 2007. Quantum physics. John Wiley & Sons.'
    },
    {
        'title': 'Vancouver', 'snippet': 'Gasiorowicz S. Quantum physics. John Wiley & Sons; 2007 Jan 29.'
    }
  ]
  [
    {
        'name': 'BibTeX',
        'link': 'https://scholar.googleusercontent.com/scholar.bib?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=4&ct=citation&cd=-1&hl=en'
    },
    {
        'name': 'EndNote',
        'link': 'https://scholar.googleusercontent.com/scholar.enw?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=3&ct=citation&cd=-1&hl=en' 
    },
    {
        'name': 'RefMan',
        'link': 'https://scholar.googleusercontent.com/scholar.ris?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=2&ct=citation&cd=-1&hl=en' 
    },
    { 
        'name': 'RefWorks',
        'link': 'https://scholar.googleusercontent.com/scholar.rfw?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=1&ct=citation&cd=-1&hl=en' 
    }
   ]

Google Scholar Authors

Now, let’s scrape the profiles of the authors who have published quality content on Quantum Physics.

Here is our code:

import requests
from bs4 import BeautifulSoup

def getScholarProfiles():
    try:
        url = "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Quantum+Physics"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        scholar_profiles = []
        for el in soup.select('.gsc_1usr'):
            profile = {
                'name': el.select_one('.gs_ai_name').get_text(),
                'name_link': 'https://scholar.google.com' + el.select_one('.gs_ai_name a')['href'],
                'position': el.select_one('.gs_ai_aff').get_text(),
                'email': el.select_one('.gs_ai_eml').get_text(),
                'departments': el.select_one('.gs_ai_int').get_text(),
                'cited_by_count': el.select_one('.gs_ai_cby').get_text().split(' ')[2]
            }
            scholar_profiles.append({k: v for k, v in profile.items() if v})
        
        print(scholar_profiles)
    except Exception as e:
        print(e)

getScholarProfiles()

Our results should look like this:

[
  {
    'name': 'Georg Kresse',
    'name_link': 'https://scholar.google.com/citations?hl=en&user=Pn8ouvAAAAAJ',
    'position': 'University of Vienna, Faculty of Physics, Professor for Computational Quantum Mechanics',
    'email': 'Verified email at univie.ac.at',
    'departments': 'density functional theory first principles calculations many body theory condensed matter physics materials science ',
    'cited_by_count': '345869'
  },
  {
    'name': 'Manuel Proissl',
    'name_link': 'https://scholar.google.com/citations?hl=en&user=ikHSFIkAAAAJ',
    'position': 'Data Science Leader, Quantum Computing Technologist, Physicist',
    'email': 'Verified email at accenture.com',
    'departments': 'quantum computing machine learning deep learning causal inference ',
    'cited_by_count': '82357'
  },
    ....
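
Because cited_by_count is scraped as a string, sorting the profiles on it directly would compare text rather than numbers ('82357' sorts above '345869'). The sketch below shows one way to rank profiles numerically. The third profile is a hypothetical entry added only to show how a missing count could be handled, and citation_count is an illustrative helper name.

```python
profiles = [
    {"name": "Manuel Proissl", "cited_by_count": "82357"},
    {"name": "Georg Kresse", "cited_by_count": "345869"},
    {"name": "No Count Example"},  # hypothetical profile without a count
]

def citation_count(profile):
    """Convert the scraped count to an int, treating a missing value as 0."""
    return int(profile.get("cited_by_count", "0"))

# Highest citation count first.
ranked = sorted(profiles, key=citation_count, reverse=True)
print([p["name"] for p in ranked])
```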

Google Scholar Author Profile

Let us now create a scraper to extract data from a Google Scholar author profile.
First, we will extract the author's main details, then we will move on to their published content.

import requests
from bs4 import BeautifulSoup

def getAuthorProfileData():
    try:
        url = "https://scholar.google.com/citations?hl=en&user=Pn8ouvAAAAAJ"
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.text, 'html.parser')
        author_results = {}
        author_results['name'] = soup.select_one("#gsc_prf_in").get_text()
        author_results['position'] = soup.select_one("#gsc_prf_inw+ .gsc_prf_il").text
        author_results['email'] = soup.select_one("#gsc_prf_ivh").text
        author_results['published_content'] = soup.select_one("#gsc_prf_int").text
        print(author_results)
    except Exception as e:
        print(e)

getAuthorProfileData()

Our result should look like this:

{
  'name': 'Georg Kresse',
  'position': 'University of Vienna, Faculty of Physics, Professor for Computational Quantum Mechanics',
  'email': 'Verified email at univie.ac.at - Homepage',
  'published_content': 'density functional theoryfirst principles calculationsmany body theorycondensed matter physicsmaterials science'
}

Then, we will scrape the content published by the author.

Google Scholar Author Published Content

Here is our code:

# This snippet goes inside getAuthorProfileData(), after soup is created.
articles = []
for el in soup.select("#gsc_a_b .gsc_a_t"):
    article = {
        'title': el.select_one(".gsc_a_at").text,
        'link': "https://scholar.google.com" + el.select_one(".gsc_a_at")['href'],
        'authors': el.select_one(".gsc_a_at+ .gs_gray").text,
        'publication': el.select_one(".gs_gray+ .gs_gray").text
    }
    articles.append(article)
# Drop empty fields from each article.
for i in range(len(articles)):
    articles[i] = {k: v for k, v in articles[i].items() if v}

And the results should look like this:

[
  {
    'title': 'Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set',
    'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Pn8ouvAAAAAJ&citation_for_view=Pn8ouvAAAAAJ:a3BOlSfXSfwC',
    'authors': 'G Kresse, J Furthmüller',
    'publication': 'Physical review B 54 (16), 11169, 1996'
  },
  {
    'title': 'From ultrasoft pseudopotentials to the projector augmented-wave method',
    'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Pn8ouvAAAAAJ&citation_for_view=Pn8ouvAAAAAJ:F9fV5C73w3QC',
    'authors': 'G Kresse, D Joubert',
    'publication': 'Physical review b 59 (3), 1758, 1999'
  },
  ....

Now, we will scrape the Cited By table from the author profile, which covers the citations, h-index, and i10-index, both overall and since 2017.

Google Scholar Author Cited By Results

Here is the code:

# This snippet goes inside getAuthorProfileData(), after soup is created.
# Row 1 of the table holds citations, row 2 the h-index, row 3 the i10-index.
cited_by = {}
cited_by['table'] = []
cited_by['table'].append({})
cited_by['table'][0]['citations'] = {}
cited_by['table'][0]['citations']['all'] = soup.select_one("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text
cited_by['table'][0]['citations']['since_2017'] = soup.select_one("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text
cited_by['table'].append({})
cited_by['table'][1]['h_index'] = {}
cited_by['table'][1]['h_index']['all'] = soup.select_one("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text
cited_by['table'][1]['h_index']['since_2017'] = soup.select_one("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text
cited_by['table'].append({})
cited_by['table'][2]['i_index'] = {}
cited_by['table'][2]['i_index']['all'] = soup.select_one("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text
cited_by['table'][2]['i_index']['since_2017'] = soup.select_one("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text

Here are the results:

[
  { 'citations': { 'all': '345869', 'since_2017': '168297' } },
  { 'h_index': { 'all': '132', 'since_2017': '82' } },
  { 'i_index': { 'all': '367', 'since_2017': '276' } }
]
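
The table arrives as a list of single-key dictionaries with string values, which can be awkward to query. As a sketch of one possible post-processing step, the hypothetical helper flatten_cited_by below merges the rows into one dictionary and casts the counts to integers. It assumes every value is a plain digit string, as in the results above.

```python
# Sample rows copied from the results above.
table = [
    {"citations": {"all": "345869", "since_2017": "168297"}},
    {"h_index": {"all": "132", "since_2017": "82"}},
    {"i_index": {"all": "367", "since_2017": "276"}},
]

def flatten_cited_by(rows):
    """Merge the one-key rows into a single dict and cast the counts to int."""
    flat = {}
    for row in rows:
        for metric, values in row.items():
            flat[metric] = {period: int(count) for period, count in values.items()}
    return flat

metrics = flatten_cited_by(table)
print(metrics["h_index"]["all"])  # 132
```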

Here is the complete code to scrape the full profile of an author:

import requests
from bs4 import BeautifulSoup

def getAuthorProfileData():
    try:
        url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        author_results = {}
        articles = []
        author_results['name'] = soup.select_one("#gsc_prf_in").text
        author_results['position'] = soup.select_one("#gsc_prf_inw+ .gsc_prf_il").text
        author_results['email'] = soup.select_one("#gsc_prf_ivh").text
        author_results['departments'] = soup.select_one("#gsc_prf_int").text
        for el in soup.select("#gsc_a_b .gsc_a_t"):
            article = {
                'title': el.select_one(".gsc_a_at").text,
                'link': "https://scholar.google.com" + el.select_one(".gsc_a_at")['href'],
                'authors': el.select_one(".gsc_a_at+ .gs_gray").text,
                'publication': el.select_one(".gs_gray+ .gs_gray").text
            }
            articles.append(article)
        for i in range(len(articles)):
            articles[i] = {k: v for k, v in articles[i].items() if v and v != ""}
        cited_by = {}
        cited_by['table'] = []
        cited_by['table'].append({})
        cited_by['table'][0]['citations'] = {}
        cited_by['table'][0]['citations']['all'] = soup.select_one("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text
        cited_by['table'][0]['citations']['since_2017'] = soup.select_one("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text
        cited_by['table'].append({})
        cited_by['table'][1]['h_index'] = {}
        cited_by['table'][1]['h_index']['all'] = soup.select_one("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text
        cited_by['table'][1]['h_index']['since_2017'] = soup.select_one("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text
        cited_by['table'].append({})
        cited_by['table'][2]['i_index'] = {}
        cited_by['table'][2]['i_index']['all'] = soup.select_one("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text
        cited_by['table'][2]['i_index']['since_2017'] = soup.select_one("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text
        print(author_results)
        print(articles)
        print(cited_by['table'])
    except Exception as e:
        print(e)

getAuthorProfileData()
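
The function above only prints what it scrapes. If you want to keep the articles, one option is to write them out as CSV. The sketch below uses Python's built-in csv module on two sample articles taken from the earlier output; articles_to_csv is a hypothetical helper name, and the column order is simply the key order used in this tutorial.

```python
import csv
import io

# Sample articles in the shape produced by the scraper above.
articles = [
    {
        "title": "Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set",
        "authors": "G Kresse, J Furthmüller",
        "publication": "Physical review B 54 (16), 11169, 1996",
    },
    {
        "title": "From ultrasoft pseudopotentials to the projector augmented-wave method",
        "authors": "G Kresse, D Joubert",
        "publication": "Physical review b 59 (3), 1758, 1999",
    },
]

def articles_to_csv(rows, fieldnames=("title", "link", "authors", "publication")):
    """Serialise article dicts to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # The scraper drops empty fields, so fill the gaps with "".
        writer.writerow({name: row.get(name, "") for name in fieldnames})
    return buffer.getvalue()

csv_text = articles_to_csv(articles)
print(csv_text)
```

To save the result to disk, you could write csv_text to a file opened with newline="" and encoding="utf-8".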

Using Serpdog’s Google Scholar API for scraping Scholar Data

Scraping Google Scholar directly can be difficult because Google frequently blocks automated requests. The scraper also has to be maintained as the page's HTML structure changes.
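
If you stay with your own scraper, one common way to cope with occasional blocked requests is to retry with an increasing delay. The sketch below is a generic pattern, not something Google documents or guarantees will work. The names fetch_with_retries and requests_fetch, the retry count, and the delays are all assumptions chosen for illustration. The demo uses a stand-in fetch function, so it runs without network access.

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTTP 200 or the retries run out.

    fetch must return a (status_code, body) tuple. The delay doubles after
    each failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < retries:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Giving up on {url}: last status was {status}")

def requests_fetch(url):
    """Adapter around requests.get, imported lazily so the demo needs no network."""
    import requests
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    return response.status_code, response.text

# Demo with a stand-in fetch that fails twice, then succeeds.
responses = iter([(429, ""), (503, ""), (200, "<html>ok</html>")])
delays = []
body = fetch_with_retries(lambda url: next(responses), "https://example.com", sleep=delays.append)
print(body, delays)  # <html>ok</html> [2.0, 4.0]
```

In a real run you would call fetch_with_retries(requests_fetch, url) and pass the returned HTML to BeautifulSoup as before.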

Suppose you were provided with a straightforward and efficient solution to scrape Google Scholar Results, wouldn’t that be an excellent choice?

Yes, you heard right! Our Google Scholar API allows businesses to scrape educational content from Google Scholar at scale using our powerful API infrastructure which is powered by a massive pool of 10M+ residential proxies.

We also offer 100 free requests on the first sign-up.

After registering on our website, you will get an API key. Embed this API key in the code below to scrape Google Scholar results at a much faster speed.

import requests
# requests URL-encodes the parameters, so pass the query with a plain space.
payload = {'api_key': 'APIKEY', 'q': 'quantum physics'}
resp = requests.get('https://api.serpdog.io/scholar', params=payload)
print(resp.text)

Conclusion

This tutorial showed how to scrape Google Scholar using Python. Feel free to message me if I missed something or if you need clarification on anything. Follow me on Twitter. Thanks for reading!
