...

How to Scrape Google Scholar Data

Google Scholar is an excellent source of research articles and study materials published by educators and researchers around the world. It is a search engine designed specifically for scholarly literature: it collects and indexes academic research from a wide range of sources, making it a valuable tool for students, teachers, and anyone interested in scholarly articles.

However, comprehensive research requires you to look for countless articles, which can be time-consuming if done manually.


In this tutorial, we will learn how you can automate this process and scrape Google Scholar results using Node JS.

Why Scrape Google Scholar?

Scraping Google Scholar can provide you with various benefits:

Research-based purposes – Scraping Google Scholar gives you access to the vast body of educational material available on the internet, which you can then use for research.

Content Access – Google Scholar data gives you access to quality content published by leading scientists around the world.

Multidisciplinary – Google Scholar encompasses a wide range of academic articles, from science and technology to humanities and arts.

Requirements for scraping Google Scholar:

Web Parsing with CSS selectors

Searching through raw HTML for the right tags is both difficult and time-consuming. It is easier to use the CSS Selector Gadget to pick the right selectors for your scraper.

This gadget helps you come up with the right CSS selector for your needs. Here is the link to the tutorial, which will teach you how to use it to select the best CSS selectors.

User Agents

The User-Agent header identifies the application, operating system, vendor, and version of the requesting user agent. Sending a browser-like User-Agent helps your request to Google look like a visit from a real user.
You can also rotate User Agents; read more about this in this article: How to fake and rotate User Agents using Python 3.

If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.
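As a rough illustration of rotating User Agents in Node JS, the sketch below cycles through a small pool in round-robin order. The helper name createUserAgentRotator and the pool itself are assumptions made for this example (the two strings are the ones used in the scrapers in this tutorial); it is one possible approach, not a guaranteed way to avoid blocks.

```javascript
// Hypothetical pool of User-Agent strings; replace it with values you maintain yourself.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
];

// Returns a function that hands out the User-Agent strings one after another,
// starting again from the first one when the pool is exhausted.
function createUserAgentRotator(userAgents) {
  if (userAgents.length === 0) {
    throw new Error("At least one User-Agent string is required");
  }
  let index = 0;
  return () => {
    const userAgent = userAgents[index];
    index = (index + 1) % userAgents.length;
    return userAgent;
  };
}

const nextUserAgent = createUserAgentRotator(USER_AGENTS);

// Build a fresh headers object for each request.
const headers = { "User-Agent": nextUserAgent() };
console.log(headers);
```

The returned headers object can be passed to the .headers() call of each request, so consecutive requests do not all share the same User-Agent.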

Install Libraries

Before we begin, install these libraries so we can move forward and prepare our scraper.

1. Unirest JS
2. Cheerio JS

Or you can type the below commands in your project terminal to install the libraries:

npm i unirest
npm i cheerio

To fetch the HTML, we will use Unirest JS, and to parse it, we will use Cheerio JS.

Scraping Google Scholar Organic Results

Google Scholar Organic Results are the organic search results, including scholarly articles, research papers, theses, and other academic materials relevant to the user’s search query.

In this section, we will cover the following data points from the Google Scholar Organic Results.

  1. Title and title link
  2. ID
  3. Displayed Link
  4. Snippet
  5. Cited-By Count and Cited-by Link
  6. Version count and version link

Here is the complete code to scrape the Google Scholar Organic Results 👇🏻:

const cheerio = require("cheerio");
const unirest = require("unirest");

const getScholarData = async () => {
    try {
        const url = "https://www.google.com/scholar?q=IIT+MUMBAI&hl=en";

        // Awaiting the request keeps any error raised in this function inside the try/catch.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const scholar_results = [];

        // Every organic result is wrapped in a .gs_ri element.
        $(".gs_ri").each((i, el) => {
            scholar_results.push({
                title: $(el).find(".gs_rt").text(),
                title_link: $(el).find(".gs_rt a").attr("href"),
                id: $(el).find(".gs_rt a").attr("id"),
                displayed_link: $(el).find(".gs_a").text(),
                snippet: $(el).find(".gs_rs").text().replace("\n", ""),
                cited_by_count: $(el).find(".gs_nph+ a").text(),
                cited_link: "https://scholar.google.com" + $(el).find(".gs_nph+ a").attr("href"),
                versions_count: $(el).find("a~ a+ .gs_nph").text(),
                versions_link: $(el).find("a~ a+ .gs_nph").text() ? "https://scholar.google.com" + $(el).find("a~ a+ .gs_nph").attr("href") : "",
            });
        });

        // Drop fields that came back empty or undefined.
        for (let i = 0; i < scholar_results.length; i++) {
            Object.keys(scholar_results[i]).forEach((key) => {
                if (scholar_results[i][key] === "" || scholar_results[i][key] === undefined) {
                    delete scholar_results[i][key];
                }
            });
        }

        console.log(scholar_results);
    } catch (e) {
        console.log(e);
    }
};

getScholarData();

Our results should look like this 👇🏻:

[
        {
            title: 'Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study.',
            title_link: 'https://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=22295984&AN=108373670&h=bqlRj0gjNNQoSuJb5zZxtrAWRoe7e4cT7cfMNTEYxWbUdYAXdv0An55XKjithW%2FT3A9v3vC8m87cvR3EXu%2BdkA%3D%3D&crl=c',
            id: 'TPhPjzP8H_MJ',
            displayed_link: 'SK Gupta, S Sharma - International Journal of Information …, 2015 - search.ebscohost.com',
            snippet: "The rapid advancement in information technology has changed the resources and services of a library. Now day's libraries are not confined only to print resources and traditional library …",
            cited_by_count: 'Cited by 19',
            cited_link: 'https://scholar.google.com/scholar?cites=17518998373872433228&as_sdt=2005&sciodt=0,5&hl=en',
            versions_count: 'All 5 versions',
            versions_link: 'https://scholar.google.com/scholar?cluster=17518998373872433228&hl=en&as_sdt=0,5'
        },
        {
            title: '[PDF][PDF] Design of Solar powered vehicle. project III, Industrial Design Center, IIT Mumbai',
            title_link: 'https://dsource.in/sites/default/files/case-study/solar-powered-rickshaw/introduction/file/solar-powered-rickshaw.pdf',
            id: '_w_nBYVUe8AJ',
            displayed_link: 'UA Athavankar, SR Singh - 2016 - dsource.in',
            snippet: 'The greatest problem that faces the world today is Global warming. It is more apparent here in India than anywhere else, specially Rajasthan where temperatures over the last few years …',
            cited_by_count: 'Cited by 2',
            cited_link: 'https://scholar.google.com/scholar?cites=13869772407723986943&as_sdt=2005&sciodt=0,5&hl=en'
        },
        ....
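The cited_by_count and versions_count fields above are labels such as 'Cited by 19' rather than numbers. If you want numeric values, a small helper along these lines could extract them. This is a minimal sketch: parseCount is a hypothetical name, and it assumes the label contains a single plain run of digits, as in the sample output.

```javascript
// Extracts the first run of digits from labels such as "Cited by 19" or "All 5 versions".
// Returns null when the label is missing or contains no digits.
function parseCount(label) {
  if (typeof label !== "string") return null;
  const match = label.match(/\d+/);
  return match ? Number(match[0]) : null;
}

// Sample values taken from the first result shown above.
const result = { cited_by_count: "Cited by 19", versions_count: "All 5 versions" };

console.log(parseCount(result.cited_by_count)); // 19
console.log(parseCount(result.versions_count)); // 5
console.log(parseCount(undefined)); // null
```

Returning null for a missing label keeps results without a versions link, like the second result above, distinguishable from results with a count of zero.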

Scraping Google Scholar Profiles

We will cover the following data points from the Google Scholar Profiles results.

  1. Author’s Name
  2. Link
  3. Position
  4. Department in the organization
  5. Email
  6. Cited-by count

Here is our code 👇🏻:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getScholarProfiles = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=IIT+MUMBAI";

        // Awaiting the request keeps any error raised in this function inside the try/catch.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const scholar_profiles = [];

        // Every author card is wrapped in a .gsc_1usr element.
        $(".gsc_1usr").each((i, el) => {
            scholar_profiles.push({
                name: $(el).find(".gs_ai_name").text(),
                name_link: "https://scholar.google.com" + $(el).find(".gs_ai_name a").attr("href"),
                position: $(el).find(".gs_ai_aff").text(),
                email: $(el).find(".gs_ai_eml").text(),
                departments: $(el).find(".gs_ai_int").text(),
                // The label reads "Cited by N", so the count is the third word.
                cited_by_count: $(el).find(".gs_ai_cby").text().split(" ")[2],
            });
        });

        // Drop fields that came back empty or undefined.
        for (let i = 0; i < scholar_profiles.length; i++) {
            Object.keys(scholar_profiles[i]).forEach((key) => {
                if (scholar_profiles[i][key] === "" || scholar_profiles[i][key] === undefined) {
                    delete scholar_profiles[i][key];
                }
            });
        }

        console.log(scholar_profiles);
    } catch (e) {
        console.log(e);
    }
};

getScholarProfiles();

Our results should look like this 👇🏻:

[
    {
        name: 'Piyali Banerjee',
        name_link: 'https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ',
        position: 'Postdoctoral Researcher in Physics, IIT Bombay',
        email: 'Verified email at iitb.ac.in',
        departments: 'Experimental High Energy Physics Phenomenology ',
        cited_by_count: '230769'
    },
    {
        name: 'Archana Pai',
        name_link: 'https://scholar.google.com/citations?hl=en&user=2Dw4Y9AAAAAJ',
        position: 'IIT Bombay',
        email: 'Verified email at phy.iitb.ac.in',
        departments: 'Gravitational Wave Astronomy Statistical Signal Processing Multimessenger astronomy ',
        cited_by_count: '70703'
    },
    {
        name: 'Krithi Ramamritham',
        name_link: 'https://scholar.google.com/citations?hl=en&user=LFLG5pcAAAAJ',
        position: 'Sai University, Chennai, India (retired from IIT Bombay)',
        email: 'Verified email at iitb.ac.in',
        departments: 'databases real-time systems ICT based  solutions for society ',
        cited_by_count: '23765'
    },
    ....
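Because cited_by_count comes back as a string, sorting the profiles on it directly would compare text rather than numbers. The following sketch shows one way to rank profiles by citations; sortByCitations is a hypothetical helper, and treating a missing count as 0 is an assumption made for this example.

```javascript
// Returns a new array of profiles ordered from most to least cited.
// cited_by_count is a string in the scraped data, so it is parsed before comparing.
// Profiles without a count are treated as having 0 citations (an assumption).
function sortByCitations(profiles) {
  const toNumber = (value) => {
    const parsed = parseInt(value, 10);
    return Number.isNaN(parsed) ? 0 : parsed;
  };
  // Copy the array first so the caller's data is not reordered in place.
  return [...profiles].sort(
    (a, b) => toNumber(b.cited_by_count) - toNumber(a.cited_by_count)
  );
}

// Sample values taken from the output above, deliberately out of order.
const profiles = [
  { name: "Krithi Ramamritham", cited_by_count: "23765" },
  { name: "Piyali Banerjee", cited_by_count: "230769" },
  { name: "Archana Pai", cited_by_count: "70703" },
];

console.log(sortByCitations(profiles).map((profile) => profile.name));
```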

Scraping Google Scholar Cite Results

We will cover the following data points from the Google Scholar Cite Results.

  1. Title
  2. Snippet
  3. Name
  4. Link

The below block of code will scrape the cite result of an organic scholar search result.

const cheerio = require("cheerio");
const unirest = require("unirest");

const getData = async () => {
    try {
        const url =
            "https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite";

        // Awaiting the request keeps any error raised in this function inside the try/catch.
        const response = await unirest.get(url);
        const $ = cheerio.load(response.body);

        // Each table row holds one citation style and its formatted citation.
        const cite_results = [];
        $("#gs_citt tr").each((i, el) => {
            cite_results.push({
                title: $(el).find(".gs_cith").text(),
                snippet: $(el).find(".gs_citr").text(),
            });
        });

        // Export links such as BibTeX and EndNote.
        const links = [];
        $("#gs_citi .gs_citi").each((i, el) => {
            links.push({
                name: $(el).text(),
                link: $(el).attr("href"),
            });
        });

        console.log(cite_results);
        console.log(links);
    } catch (e) {
        console.log(e);
    }
};

getData();

If you look at the target URL, the string after info: is simply the ID we got from scraping the Google Scholar Organic Results.
Our result should look like this 👇🏻:

[
    {
        title: 'MLA',
        snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5.1 (2015).'
    },
    {
        title: 'APA',
        snippet: 'Gupta, S. K., & Sharma, S. (2015). Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
    },
    {
        title: 'Chicago',
        snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5, no. 1 (2015).'
    },
    {
        title: 'Harvard',
        snippet: 'Gupta, S.K. and Sharma, S., 2015. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
    },
    {
        title: 'Vancouver',
        snippet: 'Gupta SK, Sharma S. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology. 2015 Jan 1;5(1).'
    }
  ]
  [
    {
        name: 'BibTeX',
        link: 'https://scholar.googleusercontent.com/scholar.bib?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=4&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'EndNote',
        link: 'https://scholar.googleusercontent.com/scholar.enw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=3&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'RefMan',
        link: 'https://scholar.googleusercontent.com/scholar.ris?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=2&ct=citation&cd=-1&hl=en'
    },
    {
        name: 'RefWorks',
        link: 'https://scholar.googleusercontent.com/scholar.rfw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=1&ct=citation&cd=-1&hl=en'
    }
  ]
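Since the cite URL only differs by the result ID, it can be built from the id field returned by the organic results scraper. Here is a minimal sketch of that step; buildCiteUrl is a hypothetical helper name, and the URL pattern is simply the one used in the code above.

```javascript
// Builds the cite URL for one organic result.
// resultId is the "id" field returned by the organic results scraper.
function buildCiteUrl(resultId) {
  // encodeURIComponent is a precaution in case an ID contains reserved characters.
  const encodedId = encodeURIComponent(resultId);
  return `https://scholar.google.com/scholar?q=info:${encodedId}:scholar.google.com&output=cite`;
}

console.log(buildCiteUrl("TPhPjzP8H_MJ"));
// https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite
```

With a helper like this, you could loop over the organic results and request the cite page for each one in turn.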

Scraping Google Scholar Author Profile

Google Scholar Author Profile is a public profile that showcases an author’s academic research, publications, and citations on Google Scholar. These profiles allow researchers to present their scholarly achievements and make them more discoverable.

In this section, we will cover the following data points:

  1. Name
  2. Position
  3. Email
  4. Departments

Here is the code 👇🏻:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getAuthorProfileData = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";

        // Awaiting the request keeps any error raised in this function inside the try/catch.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        });

        const $ = cheerio.load(response.body);
        const author_results = {};

        author_results.name = $("#gsc_prf_in").text();
        author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
        author_results.email = $("#gsc_prf_ivh").text();
        author_results.departments = $("#gsc_prf_int").text();

        console.log(author_results);
    } catch (e) {
        console.log(e);
    }
};

getAuthorProfileData();

Our result should look like this 👇🏻:

{
    name: 'Piyali Banerjee',
    position: 'Postdoctoral Researcher in Physics, IIT Bombay',
    email: 'Verified email at iitb.ac.in',
    departments: 'Experimental High Energy PhysicsPhenomenology'
  }

Now we will scrape the articles written by the author from their profile.

let articles = [];

$(".gsc_a_t").each((i, el) => {
    articles.push({
        title: $(el).find(".gsc_a_at").text(),
        // The href is read from the .gsc_a_at element, as in the full code later in this section.
        link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
        authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
        publication: $(el).find(".gs_gray+ .gs_gray").text()
    });
});

// Drop fields that came back empty or undefined.
for (let i = 0; i < articles.length; i++) {
    Object.keys(articles[i]).forEach((key) => {
        if (articles[i][key] === "" || articles[i][key] === undefined) {
            delete articles[i][key];
        }
    });
}

And the results should look like this:

[
  {
    title: 'Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:u5HHmVD_uO8C',
    authors: 'G Aad, T Abajyan, B Abbott, J Abdallah, SA Khalek, AA Abdelalim, ...',
    publication: 'Physics Letters B 716 (1), 1-29, 2012'
  },
  {
    title: 'The ATLAS simulation infrastructure',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:d1gkVwhDpl0C',
    authors: 'G Aad, B Abbott, J Abdallah, AA Abdelalim, A Abdesselam, B Abi, ...',
    publication: 'The European Physical Journal C 70 (3), 823-874, 2010'
  },
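The loop that deletes empty fields is repeated in every scraper in this tutorial. One way to avoid the repetition is to move it into a small helper, sketched below. The name removeEmptyFields is an assumption for this example, and unlike the original loop it returns a new object instead of deleting keys in place.

```javascript
// Returns a copy of the record without fields whose value is "" or undefined.
// The input object is left untouched.
function removeEmptyFields(record) {
  return Object.fromEntries(
    Object.entries(record).filter(
      ([, value]) => value !== "" && value !== undefined
    )
  );
}

// A made-up record with two empty fields, for illustration only.
const article = {
  title: "The ATLAS simulation infrastructure",
  link: undefined,
  authors: "",
  publication: "The European Physical Journal C 70 (3), 823-874, 2010",
};

console.log(removeEmptyFields(article));
```

With this helper, each cleanup loop could be replaced by a single call such as articles.map(removeEmptyFields).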

Now, we will scrape the Cited By table from the Google Scholar Author Profile, which covers the citation count, h-index, and i10-index, both overall and since 2017.

Here is the code 👇🏻:

let cited_by = {};
cited_by.table = [];
cited_by.table[0] = {};
cited_by.table[0].citations = {};
cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
cited_by.table[1] = {};
cited_by.table[1].h_index = {};
cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
cited_by.table[2] = {};
cited_by.table[2].i_index = {};
cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();

And the result for it will look like this 👇🏻:

{
      [
        { citations: { all: '230769', since_2017: '105070' } },
        { h_index: { all: '185', since_2017: '133' } },
        { i_index: { all: '1154', since_2017: '706' } }
      ]
    }
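The repeated assignments for the Cited By table can also be expressed as a loop. The sketch below assumes the six .gsc_rsb_std cells have already been read into a flat array of strings in document order (citations, then h-index, then i10-index, each as an "all" value followed by a "since 2017" value). That ordering and the helper name buildCitedByTable are assumptions for this example, not something confirmed by the page itself.

```javascript
// Builds the Cited By table from a flat list of six cell texts.
// Assumed order: [citations all, citations since 2017,
//                 h-index all, h-index since 2017,
//                 i10-index all, i10-index since 2017]
function buildCitedByTable(cells) {
  const metrics = ["citations", "h_index", "i_index"];
  return metrics.map((metric, row) => ({
    [metric]: {
      all: cells[row * 2],
      since_2017: cells[row * 2 + 1],
    },
  }));
}

// Sample values taken from the output above.
const cells = ["230769", "105070", "185", "133", "1154", "706"];

console.log(buildCitedByTable(cells));
```

With Cheerio, the cells array could be collected with something like $(".gsc_rsb_std").map((i, el) => $(el).text()).get(), provided the page lists the cells in the assumed order.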

Here is the full code to scrape the complete Google Scholar Author Profile page 👇🏻:

const cheerio = require("cheerio");
const unirest = require("unirest");

const getAuthorProfileData = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";

        // Awaiting the request keeps any error raised in this function inside the try/catch.
        const response = await unirest.get(url).headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        });

        const $ = cheerio.load(response.body);

        // Basic profile details.
        const author_results = {};
        author_results.name = $("#gsc_prf_in").text();
        author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
        author_results.email = $("#gsc_prf_ivh").text();
        author_results.departments = $("#gsc_prf_int").text();

        // Articles listed on the profile. This must be an array for push() to work.
        const articles = [];
        $("#gsc_a_b .gsc_a_t").each((i, el) => {
            articles.push({
                title: $(el).find(".gsc_a_at").text(),
                link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
                authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
                publication: $(el).find(".gs_gray+ .gs_gray").text(),
            });
        });

        // Drop fields that came back empty or undefined.
        for (let i = 0; i < articles.length; i++) {
            Object.keys(articles[i]).forEach((key) => {
                if (articles[i][key] === "" || articles[i][key] === undefined) {
                    delete articles[i][key];
                }
            });
        }

        // Cited By table: citations, h-index, and i10-index.
        const cited_by = {};
        cited_by.table = [];
        cited_by.table[0] = {};
        cited_by.table[0].citations = {};
        cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
        cited_by.table[1] = {};
        cited_by.table[1].h_index = {};
        cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
        cited_by.table[2] = {};
        cited_by.table[2].i_index = {};
        cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
        cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();

        console.log(author_results);
        console.log(articles);
        console.log(cited_by.table);
    } catch (e) {
        console.log(e);
    }
};

getAuthorProfileData();

Using Google Scholar API

Scraping Google Scholar can be difficult because Google frequently blocks scrapers. You also have to keep updating the scraper as the HTML structure changes.

Wouldn’t it be better to have a straightforward and efficient way to scrape Google Scholar results?

Our Google Scholar API allows businesses to scrape educational content from Google Scholar at scale using our powerful API infrastructure, which is powered by a massive pool of 10M+ residential proxies.


We also offer 100 free requests on the first sign-up.

After registering on our website, you will get an API Key. Embed this API Key in the code below, and you will be able to scrape Google Scholar Results at a much faster speed. The example uses Axios, which you can install with npm i axios.

const axios = require('axios');

axios.get('https://api.serpdog.io/scholar?api_key=APIKEY&q=physics')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });
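If the search query contains spaces or special characters, it is safer to build the request URL with URLSearchParams than to paste the query into the string by hand. The sketch below shows this under a few assumptions: buildApiUrl is a hypothetical helper, and api_key and q are the only parameters used because they are the only ones that appear in the request above.

```javascript
// Builds the API request URL with properly encoded query parameters.
// Only api_key and q are used, matching the request shown above.
function buildApiUrl(apiKey, query) {
  const params = new URLSearchParams({ api_key: apiKey, q: query });
  return `https://api.serpdog.io/scholar?${params.toString()}`;
}

// "APIKEY" is a placeholder, exactly as in the example above.
console.log(buildApiUrl("APIKEY", "quantum physics"));
// https://api.serpdog.io/scholar?api_key=APIKEY&q=quantum+physics
```

The resulting string can be passed straight to axios.get in place of the hard-coded URL.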

Conclusion:

In this tutorial, we learned to scrape Google Scholar Results using Node JS. Feel free to message me if I missed something or if you need clarification on anything. Follow me on Twitter. Thanks for reading!

Additional Resources

1. How to scrape Google Organic Search Results using Node JS?
2. Scrape Google Images Results
3. Scrape Google News Results
4. Scrape Google Maps Reviews