
Web Scraping With Node JS – An Ultimate Guide

JavaScript has now become one of the most preferred languages for web scraping. Its ability to extract data from SPAs (Single Page Applications) has significantly boosted its popularity, and libraries like Puppeteer and Cheerio make it easy for developers to automate their scraping tasks.

In this blog, we will explore several web scraping libraries in JavaScript, their advantages and disadvantages, and determine the best among them.

Web Scraping With Node JS

Before starting to learn web scraping with Node JS, let us learn some basics of web scraping.

What is Web Scraping?

Web Scraping is the process of extracting data from one or more websites: you make HTTP requests to a website's server to fetch the raw HTML of a particular web page and then convert that HTML into the desired format.

There are various uses of Web Scraping:

SEO – Web Scraping can be used to scrape Google Search Results for objectives like SERP monitoring, keyword tracking, and rank tracking.

News Monitoring – Web Scraping enables access to a large number of articles from various media agencies, which can be used to keep track of current news and global events.

Lead Generation – Web Scraping allows businesses to know more about their potential customers, including contact details, job positions, locations, etc.

Price Comparison – Web Scraping can be used to gather product pricing from multiple online retailers for price comparison. For example – you can extract the pricing of a particular product from Amazon.com and Walmart.com, and then you can compare the prices to choose the more affordable retailer.

Best Web Scraping Libraries in Node JS

The best web scraping libraries present in Node JS are:

  1. Unirest
  2. Axios
  3. SuperAgent
  4. Cheerio
  5. Puppeteer
  6. Playwright
  7. Nightmare

Let us start discussing these various web scraping libraries one by one.

HTTP Clients

HTTP client libraries interact with website servers by sending requests and retrieving responses. They let your JavaScript code, whether it runs in a web browser or on a server, communicate with an external server over the HTTP protocol.

Here are some features associated with HTTP client libraries in JavaScript:

Sending HTTP Requests – These libraries allow you to send various types of HTTP requests, such as GET, POST, PUT, DELETE, and PATCH, to the website server.

Managing HTTP Responses – HTTP client libraries in JavaScript can manage the response, including handling errors and parsing JSON response data.

Asynchronous Requests – JavaScript’s asynchronous nature allows developers to fire multiple requests concurrently without blocking the main thread, as the sketch below illustrates.
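
To make the last point concrete, here is a minimal sketch of firing several requests concurrently. It assumes Node 18 or newer (which ships a global fetch); the URLs are just example pages used elsewhere in this article.

    // A minimal sketch of concurrent requests (assumes Node 18+ with a global fetch).
    // The URLs below are example pages, not part of the original article's code.
    const urls = [
      "https://books.toscrape.com/",
      "https://books.toscrape.com/catalogue/page-2.html",
    ];

    const fetchAll = async () => {
      try {
        // Promise.all sends all requests at once and waits for every response.
        const responses = await Promise.all(urls.map((url) => fetch(url)));
        const bodies = await Promise.all(responses.map((res) => res.text()));
        bodies.forEach((html, i) => console.log(urls[i], html.length, "bytes"));
      } catch (e) {
        console.log(e);
      }
    };

    fetchAll();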

Unirest

Unirest is a lightweight HTTP request library available in multiple languages, built and maintained by Kong. It supports the common HTTP methods like GET, POST, DELETE, and HEAD, and it integrates easily into your applications, making it a good choice for simple use cases.

Unirest is one of the most popular JavaScript libraries for extracting valuable data from the internet.

Let us take an example of how we can do it. Before starting, I am assuming that you have already set up your Node JS project with a working directory.

First, install Unirest JS by running the following command in your project terminal.

npm i unirest 

Now, using Unirest, we will request the target URL and print the response body (the Reddit endpoint below returns JSON rather than HTML).

 
    const unirest = require("unirest");

    const getData = async () => {
      try {
        const response = await unirest.get("https://www.reddit.com/r/programming.json");
        console.log(response.body); // JSON data
      } catch (e) {
        console.log(e);
      }
    };

    getData();

This is how you can create a basic scraper with Unirest.

Advantages:

  1. All HTTP methods are supported, including GET, POST, DELETE, etc.
  2. It is fast and can handle a large volume of requests without any hassle.
  3. It makes file transfers to a server much simpler (see the sketch below).
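
To illustrate the third point, a rough sketch of a multipart upload with Unirest might look like the following. The endpoint is httpbin.org (a public echo service) and the file path is a hypothetical placeholder.

    const unirest = require("unirest");

    // A rough sketch of a multipart upload with Unirest.
    // The file path below is a hypothetical placeholder; httpbin.org simply echoes the request.
    const uploadFile = async () => {
      try {
        const response = await unirest
          .post("https://httpbin.org/post")
          .field("author", "Serpdog")       // ordinary form field
          .attach("file", "./report.pdf");  // file attachment (placeholder path)
        console.log(response.status);
      } catch (e) {
        console.log(e);
      }
    };

    uploadFile();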

Axios

Axios is a promise-based HTTP client for both Node JS and browsers. Axios is widely used among the developer community because of its wide range of methods, simplicity, and active maintenance. It also supports features such as request cancellation and automatic JSON data transformation.

Install the Axios library by running the following command in your terminal.

npm i axios

Making an HTTP request with Axios is quite simple.

    const axios = require("axios");

    const getData = async () => {
      try {
        const response = await axios.get("https://books.toscrape.com/");
        console.log(response.data); // HTML
      } catch (e) {
        console.log(e);
      }
    };

    getData();

Advantages:

  1. It can intercept an HTTP request and modify it (see the sketch below).
  2. It has large community support and is actively maintained by its creators, making it a reliable option for making HTTP requests.
  3. It can transform request and response data.
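
As a quick illustration of the first point, here is a minimal sketch of a request interceptor that stamps a header onto every outgoing request; the User-Agent string is only an example.

    const axios = require("axios");

    // A minimal sketch of a request interceptor that adds a header to every
    // outgoing request (the User-Agent string below is only an example).
    axios.interceptors.request.use((config) => {
      config.headers["User-Agent"] = "Mozilla/5.0 (compatible; MyScraper/1.0)";
      return config;
    });

    axios
      .get("https://books.toscrape.com/")
      .then((response) => console.log(response.status))
      .catch((e) => console.log(e));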

SuperAgent

SuperAgent is another lightweight HTTP client library for both Node JS and the browser. It supports many high-level HTTP client features, offers an API similar to Axios, and supports both promise and async/await syntax for handling responses.

You can install SuperAgent by running the following command.

npm i superagent

You can make an HTTP request using async/await with SuperAgent like this:

 
    const superagent = require("superagent");

    const getData = async () => {
      try {
        const response = await superagent.get("https://books.toscrape.com/");
        console.log(response.text); // HTML
      } catch (e) {
        console.log(e);
      }
    };

    getData();
                                    

Advantages:

  1. SuperAgent can be easily extended via various plugins (see the sketch below).
  2. It works in both the browser and Node.
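
To illustrate the first point, a SuperAgent plugin is simply a function that receives the request object and can modify it. Here is a minimal sketch; the header name is just an example.

    const superagent = require("superagent");

    // A minimal sketch of a SuperAgent plugin: a function that receives the
    // request and can modify it (the header below is only an example).
    const withCustomHeader = (req) => {
      req.set("X-Example-Header", "demo");
      return req;
    };

    superagent
      .get("https://books.toscrape.com/")
      .use(withCustomHeader)
      .then((response) => console.log(response.status))
      .catch((e) => console.log(e));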

Disadvantages:

  1. Fewer features as compared to other HTTP client libraries like Axios.
  2. Detailed documentation is not available.

Web Parsing Libraries

HTML parsing libraries are used to filter out the required data from a raw HTML or XML document. There are various web parsing libraries available in JavaScript, including Cheerio, JSONPath, and html-parse-stringify2. In the following section, we will discuss Cheerio, the most popular web parsing library in JavaScript.

Cheerio

Cheerio is a lightweight web parsing library based on the powerful API of jQuery that can be used to parse and extract data from HTML and XML documents.

Cheerio is blazingly fast at parsing, manipulating, and rendering HTML because it works with a simple, consistent DOM model. It is not a web browser: it cannot produce visual rendering, apply CSS, or execute JavaScript. For scraping SPAs (Single Page Applications) we need full browser automation tools like Puppeteer and Playwright, which we will discuss in a bit.

Let us scrape the title of the book in the below image. 

[Image: Inspecting the book title]

First, we will install the Cheerio library.

npm i cheerio

Then, we can extract the title by running the below code.

    const unirest = require("unirest");
    const cheerio = require("cheerio");

    const getData = async () => {
      try {
        const response = await unirest.get("https://books.toscrape.com/catalogue/sharp-objects_997/index.html");
        const $ = cheerio.load(response.body);
        console.log("Book Title: " + $("h1").text()); // "Book Title: Sharp Objects"
      } catch (e) {
        console.log(e);
      }
    };

    getData();

The process is quite similar to what we did in the Unirest section, with one difference: we load the extracted HTML into a Cheerio instance and then use the CSS selector of the title to extract the required data.
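
Cheerio is not limited to a single element either. Here is a minimal sketch of looping over every book card on the listing page; the selectors follow the same books.toscrape.com markup used elsewhere in this article.

    const unirest = require("unirest");
    const cheerio = require("cheerio");

    // A sketch of extracting every book title and price from the listing page,
    // reusing the selectors seen elsewhere in this article.
    const getBooks = async () => {
      try {
        const response = await unirest.get("https://books.toscrape.com/");
        const $ = cheerio.load(response.body);
        const books = [];
        $("article.product_pod").each((i, el) => {
          books.push({
            title: $(el).find("h3 a").attr("title"),
            price: $(el).find("p.price_color").text(),
          });
        });
        console.log(books);
      } catch (e) {
        console.log(e);
      }
    };

    getBooks();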

Advantages:

  1. It is one of the fastest web parsing libraries available.
  2. Cheerio comes with a simple, jQuery-like syntax, which lets developers scrape web pages easily.
  3. Cheerio can be combined with HTTP client libraries like Unirest and Axios, which makes an excellent combo for scraping a website.

Disadvantages:

  1. It cannot execute Javascript.

Headless Browsers

Web development has become more advanced, and developers now use JavaScript frameworks on the front end to load content dynamically. However, content rendered by JavaScript is not accessible with a simple HTTP GET request, which only retrieves the static part of the HTML. The only way to scrape such dynamic content is with headless browsers.

Browsers that can operate without any Graphical User Interface are known as Headless Browsers. They can be controlled programmatically to perform tasks like submitting forms, clicking buttons, infinite scrolling, etc.

Here are some features associated with Headless Browser:

JavaScript Execution – Headless Browsers can execute JavaScript and extract data from websites that rely heavily on client-side scripting. They can also be used to trigger events and load more content through AJAX calls on the website.

Captcha Solving – An ordinary GET request cannot solve a CAPTCHA, which lowers the success rate of web scrapers. Headless Browsers, however, can be integrated with third-party CAPTCHA-solving services to bypass the challenges websites put up as a security measure.

Handling Authentication – Headless Browsers can handle authentication mechanisms like Basic Authentication and Digest Authentication, allowing developers to scrape websites that require logging in to access the data (see the sketch after this list).

Advanced Navigation – Headless Browsers can travel back and forth between web pages, open new tabs, and manage multiple browser windows. This is useful for scraping websites that actively try to prevent bots from extracting their data.
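
To make the authentication point concrete, here is a minimal sketch that uses Puppeteer (covered in the next section) to open a page protected by Basic Authentication; the credentials and URL are placeholders pointing at httpbin.org's test endpoint.

    const puppeteer = require("puppeteer");

    // A sketch of scraping a page behind HTTP Basic Authentication.
    // The credentials and URL below are placeholders (httpbin.org test endpoint).
    const getProtectedPage = async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.authenticate({ username: "user", password: "pass" });
      await page.goto("https://httpbin.org/basic-auth/user/pass");
      console.log(await page.content());
      await browser.close();
    };

    getProtectedPage();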

Let us discuss those libraries that can assist us in scraping the dynamically rendered content.

Puppeteer

Puppeteer is a Node JS library developed by Google that provides a high-level API to control Chrome or Chromium browsers.

Features associated with Puppeteer JS:

  1. Puppeteer gives you fine-grained control over Chrome.
  2. It can generate screenshots and PDFs of web pages (see the sketch after this list).
  3. It can scrape web pages that use JavaScript to load their content dynamically.
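
The second point is straightforward in practice. Here is a minimal sketch that captures both a screenshot and a PDF of a page; the output file names are arbitrary examples.

    const puppeteer = require("puppeteer");

    // A sketch of capturing a screenshot and a PDF of a page.
    // The output file names are arbitrary examples; page.pdf() requires headless mode,
    // which is the default for puppeteer.launch().
    const capture = async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://books.toscrape.com/index.html");
      await page.screenshot({ path: "books.png", fullPage: true });
      await page.pdf({ path: "books.pdf", format: "A4" });
      await browser.close();
    };

    capture();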

Let us scrape all the book titles and their links on this website.
But first, we will install the Puppeteer library.

npm i puppeteer

Now, we will prepare a script to scrape the required information. 

[Image: Inspecting the Stock Availability]

 
Write the below code in your js file.

    const puppeteer = require("puppeteer");

    const getData = async () => {
      const browser = await puppeteer.launch({
        headless: false,
      });
      const page = await browser.newPage();
      await page.goto("https://books.toscrape.com/index.html", {
        waitUntil: "domcontentloaded",
      });

Step-by-step explanation:

  1. First, we imported Puppeteer and wrapped the code in an async function (getData) so that we can use await; the function is closed at the end of the script.
  2. Then, we launched the browser with headless mode set to false, which allows us to see exactly what is happening.
  3. Next, we created a new page in the browser.
  4. Finally, we navigated to our target URL and waited until the HTML was loaded.

Now, we will parse the HTML.

      let data = await page.evaluate(() => {
        return Array.from(document.querySelectorAll("article h3")).map((el) => {
          return {
            title: el.querySelector("a").getAttribute("title"),
            link: el.querySelector("a").getAttribute("href"),
          };
        });
      });

The page.evaluate() method executes JavaScript within the context of the current page. document.querySelectorAll() then selects every element matching the article h3 selector, while document.querySelector() works the same way but returns only a single HTML element.

Great! Now, we will print the data and close the browser.

      console.log(data);
      await browser.close();
    };
    getData();

This will give you 20 titles and links to the books present on the web page.

Advantages:

  1. We can perform various activities on the web page, like clicking on the buttons and links, navigating between the pages, scrolling the web page, etc.
  2. It can be used to take screenshots of web pages.
  3. The evaluate() function in Puppeteer lets you execute JavaScript in the page context.
  4. You don’t need an external driver to run the tests.

Disadvantages:

  1. It is resource-intensive and requires high CPU usage to run.
  2. It currently supports only Chrome and Chromium-based browsers.

Playwright

Playwright is a test automation framework for automating web browsers like Chrome, Firefox, and WebKit (the engine behind Safari) with an API similar to Puppeteer. It was developed at Microsoft by the team that previously worked on Puppeteer. Like Puppeteer, Playwright can run in both headless and headed modes, making it suitable for a wide range of uses, from automating tasks to web scraping and crawling.

Major Differences between Playwright and Puppeteer

  1. Playwright is compatible with Chromium, Firefox, and WebKit, while Puppeteer only supports Chromium-based browsers (see the sketch after this list).
  2. Playwright provides a wider range of options for controlling the browser in headless mode.
  3. Puppeteer is limited to JavaScript, while Playwright also supports languages such as Python, Java, and .NET (C#).
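
To illustrate the first difference, switching engines in Playwright only changes the launcher you call. Here is a minimal sketch, assuming Playwright is installed (next step) along with its browser binaries (`npx playwright install`).

    const { chromium, firefox, webkit } = require("playwright");

    // A sketch of running the same navigation on all three engines.
    // Assumes the browser binaries were installed via `npx playwright install`.
    const run = async () => {
      for (const browserType of [chromium, firefox, webkit]) {
        const browser = await browserType.launch({ headless: true });
        const page = await browser.newPage();
        await page.goto("https://books.toscrape.com/");
        console.log(browserType.name(), await page.title());
        await browser.close();
      }
    };

    run();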

Let us install Playwright now.

npm i playwright

We will now prepare a basic script to scrape the prices and stock availability from the same website that we used in the Puppeteer section. 

[Image: Inspecting the Stock Availability]

The syntax is quite similar to Puppeteer.

    const playwright = require("playwright");

    const getData = async () => {
      const browser = await playwright.chromium.launch({ headless: false });
      const context = await browser.newContext();
      const page = await context.newPage();
      await page.goto("https://books.toscrape.com/index.html");

The newContext() method creates a new, isolated browser context.
Now, we will prepare our parser.

      let articles = await page.$$("article");

      let data = [];
      for (let article of articles) {
        data.push({
          price: await article.$eval("p.price_color", (el) => el.textContent),
          availability: await article.$eval("p.availability", (el) => el.textContent),
        });
      }

Finally, we will print the data and close our browser.

      console.log(data);
      await browser.close();
    };
    getData();

Advantages:

  1. It supports multiple languages like Python, Java, .NET, and JavaScript.
  2. It is one of the fastest browser automation libraries available.
  3. It supports multiple browser engines, Chromium, Firefox, and WebKit, under a single API.
  4. Its documentation is well-written, which makes it easy for developers to learn and use.

Nightmare JS

Nightmare is a high-level web automation library designed to automate browsing, web scraping, and other related tasks. It uses Electron (similar to PhantomJS, but roughly twice as fast) to drive a headless browser, making it efficient and easy to use. It is predominantly used for UI testing and crawling.

It can mimic user actions such as navigating to a website, clicking a button or a link, and typing, with an API that reads as a simple chain of steps.
Install Nightmare JS by running the following command.

npm i nightmare

Now, we will search for the results of “Serpdog” on duckduckgo.com.

    const Nightmare = require('nightmare')
    const nightmare = Nightmare()

    nightmare
      .goto('https://duckduckgo.com')
      .type('#search_form_input_homepage', 'Serpdog')
      .click('#search_button_homepage')
      .wait('.nrn-react-div')
      .evaluate(() => {
        return Array.from(document.querySelectorAll('.nrn-react-div')).map((el) => {
          return {
            title: el.querySelector("h2").innerText.replace("\n", ""),
            link: el.querySelector("h2 a").href,
          }
        })
      })
      .end()
      .then((data) => {
        console.log(data)
      })
      .catch((error) => {
        console.error('Search failed:', error)
      })

In the above code, we first declared an instance of Nightmare and navigated to the DuckDuckGo search page.

Then, we used the type() method to type “Serpdog” in the search field and submitted the form by clicking the search button with the click() method. We make the scraper wait until the search results have loaded, and then we extract the search results from the page with the help of their CSS selectors.

Advantages:

  1. It is faster than Puppeteer.
  2. Fewer resources are needed to run the program.

Disadvantages:

  1. It doesn’t have community support as strong as Puppeteer’s. Also, some unresolved issues exist in Electron, which could allow a malicious website to execute code on your computer.

Other libraries

In this section, we will discuss some alternatives to the previously discussed libraries.

Node Fetch

Node Fetch is a lightweight library that brings the Fetch API to Node JS, allowing you to make HTTP requests efficiently in a Node JS environment.

Features:

  1. It allows the use of promises and async functions.
  2. It implements the Fetch API functionality in Node JS.
  3. Simple API that is maintained regularly, and is easy to use and understand.

You can install Node Fetch by running the following command.

npm i node-fetch

Here is how you can use Node Fetch for web scraping.

    // Note: node-fetch v3 is ESM-only; to use require(), install v2 (npm i node-fetch@2).
    const fetch = require("node-fetch");

    const getData = async () => {
      const response = await fetch("https://en.wikipedia.org/wiki/JavaScript");
      const body = await response.text();
      console.log(body);
    };

    getData();
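
node-fetch handles other HTTP methods just as easily. Here is a minimal sketch of a JSON POST request; the endpoint is httpbin.org, a public echo service used purely for illustration.

    const fetch = require("node-fetch");

    // A sketch of a JSON POST request (httpbin.org simply echoes the payload back).
    const postData = async () => {
      const response = await fetch("https://httpbin.org/post", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query: "web scraping" }),
      });
      const json = await response.json();
      console.log(json);
    };

    postData();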

Osmosis

Osmosis is a web scraping library used for parsing HTML and XML documents and extracting data from web pages.

Features:

  1. It has no large dependencies like jQuery and Cheerio.
  2. It has a clean promise-like interface.
  3. Fast parsing and small memory footprint.

Advantages:

  1. It supports retries and redirects limits.
  2. Supports single and multiple proxies.
  3. Supports form submission, session cookies, etc. (see the sketch below).
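
Here is a minimal sketch of Osmosis’s chainable interface, reusing the books.toscrape.com selectors from earlier; treat it as a rough outline and check the library’s README for the exact API, since Osmosis has not been actively maintained for some time.

    const osmosis = require("osmosis");

    // A rough sketch of Osmosis's chainable API (verify against the library's README).
    // The selectors mirror the books.toscrape.com markup used earlier in this article.
    osmosis
      .get("https://books.toscrape.com/")
      .find("article.product_pod")
      .set({
        title: "h3 a@title",   // @title reads the element's title attribute
        price: "p.price_color",
      })
      .data((book) => console.log(book))
      .error(console.log);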

Is Node.js a Suitable Choice for Web Scraping Purposes?

Scalable and highly efficient libraries like Axios and Unirest make Node JS a preferred choice in the web scraping community. The ease of extracting data from dynamically rendered content also makes it a strong option for web scraping tasks.

Moreover, the community support for Node JS on platforms like StackOverflow, Discord, and Reddit is commendable and offers solutions to almost every problem you may face in your web scraping journey.

Let us discuss some advantages of using Node JS for web scraping:

Highly Scalable – Node JS can handle large amounts of data without trouble, which makes it a highly scalable choice for web scraping.

Simple Syntax – Node JS has a simple syntax, making it easy to learn for beginners.

Vast Community Support – Node JS has a large community of active developers who can help you out of a problem or guide you through an issue, which will help you progress in your web scraping journey.

Conclusion

In this tutorial, we learned about various libraries in Node JS that can be used for web scraping, along with their advantages and disadvantages.

If you think we can complete your web scraping tasks and help you collect data, feel free to contact us.

I hope this tutorial gave you a complete overview of web scraping with Node JS and JavaScript. Please do not hesitate to message me if I missed something. Follow me on Twitter. Thanks for reading!

Additional Resources

I have prepared a complete list of blogs on scraping Google with Node JS, which can give you an idea of how to gather data from advanced websites like Google.

  1. Scrape Google Maps Reviews
  2. Scrape Google Shopping Results
  3. Scrape Google Scholar Results
  4. Web Scraping Google News Results