
Web Scraping with JavaScript and Node.js (2025 Guide)


Web scraping has become crucial for data collection. Whatever kind of data you need, you should have the skills to collect it from any target website.

Web scraping has endless use cases, making it a “must-have” skill. You can use it to collect travel data or product pricing from e-commerce websites like Amazon, Walmart, etc.

In this article, we will show you how it can be done with JavaScript.

This tutorial will be divided into two sections:

  • In the first section, we will scrape a website with a plain GET request and then parse the data using Cheerio.
  • In the second section, we will scrape another website that only loads its data after JavaScript execution.


 

Prerequisites

I am assuming that you have already installed Node.js on your machine. If not, you can do so from here.

For this tutorial, we will mainly need three Node.js libraries.

  1. Unirest – For making HTTP requests to the website we are going to scrape.
  2. Cheerio – To parse the data we get back from those requests.
  3. Puppeteer – A Node.js library that will be used to control and automate headless Chrome or Chromium browsers. We will learn more about this later.

Before we install these libraries, let’s create a folder to keep our JavaScript files.

 

mkdir nodejs_tutorial
cd nodejs_tutorial
npm init

 

npm init will initialize a new Node.js project. This command will create a package.json file. Now, let’s install all of these libraries.

 

npm i unirest cheerio puppeteer

 

This step will install all the libraries in your project. Create a .js file inside this folder with any name you like. I am using first_section.js.

Now, we are all set to create a web scraper. So, for the first section, we are going to scrape this page from books.toscrape.com.

 

Step-by-step tutorial on web scraping in JavaScript or Node.js with Cheerio

 

Our first job is to make a GET request to our host website using unirest. Let’s test our setup first by making a GET request. If we get a response code of 200, we can proceed with parsing the data.

 
//first_section.js
const unirest = require('unirest');
const cheerio = require('cheerio');
async function scraper(){
  let target_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
  let data = await unirest.get(target_url)
  return {status:data.status}
}
scraper().then((data) => {
  console.log(data)
}).catch((err) => {
  console.log(err)
})

 

The code is pretty simple, but let me explain it step by step.

  • The unirest module is imported to make HTTP requests and retrieve the HTML content of a web page.
  • The cheerio module is imported to parse and manipulate the HTML content using jQuery-like syntax.
  • The scraper function is an async function, which means it can use the await keyword to pause execution and wait for promises to resolve.
  • Inside the function, a target URL (https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html) is assigned to the target_url variable.
  • The unirest.get() method is used to make an HTTP GET request to the target_url.
  • The await keyword is used to wait for the request to complete and retrieve the response object.
  • The HTML content of the response can be accessed through data.body (we will use this in the parsing step).
  • The scraper function returns an object with the status code of the HTTP response (data.status).

Invoking the scraper function:

  • The scraper function is called asynchronously using scraper().then() syntax.
  • The resolved data from the function is logged into the console.
  • Any errors that occur during execution are caught and logged into the console.

I hope you have got an idea of how this code is actually working. Now, let’s run this and see what status we get. You can run the code using the below command.

 
node first_section.js

 

When I run it, I get a 200 status code.

{ status: 200 }

 

It means my code is ready, and I can proceed ahead with the parsing process.

What are we going to scrape?

It is always good to decide in advance exactly what information you want to extract from the target page.

We are going to scrape five data elements from this page:

  1. Product Title
  2. Product Price
  3. In stock with quantity
  4. Product rating
  5. Product image.
 

We will start by making a GET request to this website with our HTTP agent unirest and once the data has been downloaded from the website, we can use Cheerio to parse the required data.

With the help of Cheerio’s .find() function, we are going to locate each element and extract its text value.

Before making the request, we are going to analyze the page and find the location of each element inside the DOM.

Quick Tip – One should always do this exercise to identify the location of each element.

We are going to do this using the Chrome developer tools, which can be accessed by right-clicking on the target element and then clicking Inspect. This is the most common method; you might already know it.

Identifying the location of each element

Let’s start by searching for the product title.


The title is stored inside an h1 tag, so it can be easily extracted using Cheerio.

const $ = cheerio.load(data.body);
obj["title"]=$('h1').text()

 

  • The cheerio.load() function is called, passing in data.body as the HTML content to be loaded.
  • This creates a Cheerio instance, conventionally assigned to the variable $, which represents the loaded HTML and allows us to query and manipulate it using familiar jQuery-like syntax.
  • Since the HTML structure has an <h1> element, the code uses $('h1') to select all <h1> elements in the loaded HTML. In our case, there is only one.
  • .text() is then called on the selection to extract the text content of the matched <h1> element.
  • The extracted title is assigned to the obj["title"] property.

Find the price tag of the product.

 


 

This price data can be found inside the p tag with class name price_color.

obj["price_of_product"]=$('p.price_color').text().trim()

Finding the stock data tag

 


Stock data can be found inside p tag with the class name instock.

 
obj["stock_data"]=$('p.instock').text().trim()

Finding the star rating tag


 

Here, the star rating is stored as part of the class attribute. So, we will first find the element by the class name star-rating, and then we will read the value of its class attribute using the .attr() function provided by Cheerio.

 

obj["rating"]=$('p.star-rating').attr('class').split(" ")[1]
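Why the split? The class attribute holds two names, for example "star-rating Three", and the second word is the rating. A quick sketch (the class string below is a sample value of what .attr('class') returns):

```javascript
// The element's class attribute contains the base class plus the rating word,
// e.g. class="star-rating Three". Splitting on the space isolates the rating.
const classAttr = 'star-rating Three'; // sample value of .attr('class')
const rating = classAttr.split(' ')[1];
console.log(rating); // "Three"
```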

Finding the image tag

 


 

The image is stored inside an img tag, which is located inside the div tag with id product_gallery.

obj["image"]="https://books.toscrape.com"+$('div#product_gallery').find('img').attr('src').replace("../..","")

By adding “https://books.toscrape.com” as a prefix, we are completing the URL.
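The replace-and-prefix step on its own looks like this (the src value is a made-up sample in the same shape as the site’s relative paths):

```javascript
// Relative image paths look like "../../media/cache/<id>/book.jpg".
// Stripping the leading "../.." and prefixing the site origin gives a full URL.
const src = '../../media/cache/fe/72/sample.jpg'; // made-up sample src value
const full = 'https://books.toscrape.com' + src.replace('../..', '');
console.log(full); // "https://books.toscrape.com/media/cache/fe/72/sample.jpg"
```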

Everything is ready; now, let’s run the code and see what data we get.

 

 

As you can see, we were able to extract all the data we were looking for from the target website using unirest and cheerio.

Complete Code

You can extract many other things from the page, but my main motive was to show you how the combination of any HTTP agent (Unirest, Axios, etc.) and Cheerio can make web scraping super simple.

The code will look like this.

 
const unirest = require('unirest');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
  let data = await unirest.get(target_url)
  const $ = cheerio.load(data.body);
  obj["title"]=$('h1').text()
  obj["price_of_product"]=$('p.price_color').text().trim()
  obj["stock_data"]=$('p.instock').text().trim()
  obj["rating"]=$('p.star-rating').attr('class').split(" ")[1]
  obj["image"]="https://books.toscrape.com"+$('div#product_gallery').find('img').attr('src').replace("../..","")
  arr.push(obj)
  return {status:data.status,data:arr}
}
scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

Web Scraping with JavaScript and Puppeteer

What is Puppeteer?

Puppeteer is a Node.js library developed by Google that provides a high-level API to control headless versions of the Chrome or Chromium web browsers.

It is widely used for automating and interacting with web pages, making it a popular choice for web scraping, automated testing, browser automation, and other web-related tasks.

 

Why do we need a headless browser to scrape a website?

  1. Rendering JavaScript– Many modern websites rely heavily on JavaScript to load and display content dynamically. Traditional web scrapers may not execute JavaScript, resulting in incomplete or inaccurate data extraction. Headless browsers can fully render and execute JavaScript, ensuring that the scraped data reflects what a human user would see when visiting the site.
  2. Handling User Interactions– Some websites require user interactions, such as clicking buttons, filling out forms, or scrolling, to access the data of interest. Headless browsers can automate these interactions, enabling you to programmatically navigate and interact with web pages as needed.
  3. CAPTCHAs and Bot Detection– Many websites employ CAPTCHAs and anti-bot mechanisms to prevent automated scraping. Headless browsers can be used to solve CAPTCHAs and mimic human-like behavior, helping you bypass bot detection measures.
  4. Screenshots and PDF Generation– Headless browsers can capture screenshots or generate PDFs of web pages, which can be valuable for archiving or documenting web content. You can read more about generating PDFs with a headless Chrome browser here.

 

How does it work?

1. Installation:

  • Start by installing Puppeteer in your Node.js project using npm or yarn.
  • You can do this with the following command:
npm install puppeteer

2. Import Puppeteer:

  • In your Node.js script, import the Puppeteer library by requiring it at the beginning of your script.
const puppeteer = require('puppeteer');

3. Launching a Headless Browser:

  • Use Puppeteer’s puppeteer.launch() method to start a headless Chrome or Chromium browser instance.
  • Headless means that the browser runs without a graphical user interface (GUI), making it more suitable for automated tasks.
  • You can customize browser options during the launch, such as specifying the executable path or starting with a clean user profile.

4. Creating a New Page:

  • After launching the browser, you can create a new page using the browser.newPage() method.
  • This page object represents the tab or window in the browser where actions will be performed.

5. Navigating to a Web Page:

  • Use the page.goto() method to navigate to a specific URL.
  • Puppeteer will load the web page and wait for it to be fully loaded before proceeding.

6. Interacting with the Page:

  • Puppeteer allows you to interact with the loaded web page, including:
  • Clicking on elements.
  • Typing text into input fields.
  • Extracting data from the page’s DOM (Document Object Model).
  • Taking screenshots.
  • Generating PDFs.
  • Evaluating JavaScript code on the page.
  • These interactions can be scripted to perform a wide range of actions.

7. Handling Events:

  • You can listen for various events on the page, such as network requests, responses, console messages, and more.
  • This allows you to capture and handle events as needed during your automation.

8. Closing the Browser:

  • When your tasks are complete, you should close the browser using browser.close() to free up system resources.
  • Alternatively, you can keep the browser open for multiple operations if needed.

9. Error Handling:

  • It’s important to implement error handling in your Puppeteer scripts to gracefully handle any unexpected issues.
  • This includes handling exceptions, network errors, and timeouts.

I think this much information is enough for now on Puppeteer, and I know you are eager to build a web scraper with it. Let’s build one.

Collect dynamic data with Puppeteer

We have selected Facebook because it loads its data through JavaScript execution. We are going to scrape this page using Puppeteer.

As usual, we should test our setup before starting with the scraping and parsing process.

 

Downloading the raw data

We will write a code that will open the browser and then open the Facebook page that we want to scrape. Then, it will close the browser once the page is loaded completely.

 
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://www.facebook.com/nyrestaurantcatskill/'
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  let crop = await page.goto(target_url, {waitUntil: 'domcontentloaded'});
  // Give dynamic content a moment to load (page.waitFor was removed in newer
  // Puppeteer versions; a plain timeout promise does the same job)
  await new Promise((resolve) => setTimeout(resolve, 2000))
  let data = await page.content();
  await browser.close();
  return {status:crop.status(),data:data}
}
scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

This code snippet demonstrates an asynchronous function named scraper that uses Puppeteer for automating web browsers to scrape data from a specific Facebook page.

Let’s break down the code step by step:

  1. The function scraper is declared as an asynchronous function. It means that it can use the await keyword to wait for asynchronous operations to complete.
  2. Two variables, obj and arr, are initialized as empty objects and arrays, respectively. These variables are not used in the provided code snippet.
  3. The target_url variable holds the URL of the Facebook page you want to scrape. In this case, it is set to 'https://www.facebook.com/nyrestaurantcatskill/'.
  4. puppeteer.launch({headless:false}) launches a Puppeteer browser instance with the headless option set to false. This means that the browser will have a visible UI when it opens. If you set headless to true, the browser will run in the background without a visible interface.
  5. browser.newPage() creates a new browser tab (page) and assigns it to the page variable.
  6. page.setViewport({ width: 1280, height: 800 }) sets the viewport size of the page to 1280 pixels width and 800 pixels height. This simulates the screen size for the scraping operation.
  7. page.goto(target_url, {waitUntil: 'domcontentloaded'}) navigates the page to the specified target_url. The {waitUntil: 'domcontentloaded'} option makes the function wait until the DOM content of the page is fully loaded before proceeding.
  8. The crop variable stores the result of the page.goto operation, which is a response object containing information about the page load status.
  9. page.content() retrieves the HTML content of the page as a string and assigns it to the data variable.
  10. The script then pauses for 2000 milliseconds (2 seconds). This can be useful to wait for dynamic content or animations to load on the page. Note that the page.waitFor() helper has been removed in recent Puppeteer versions; awaiting a plain setTimeout promise achieves the same pause.
  11. browser.close() closes the Puppeteer browser instance.
  12. The function returns an object with two properties: status and data. The status property contains the status code of the crop response object, indicating whether the page load was successful or not. The data property holds the HTML content of the page.

Once you run this code, you should see a Chrome window open and load the target page.

If you see this, then your setup is ready, and we can proceed with data parsing using Cheerio.

What are we going to scrape?


 

We are going to scrape these five data elements from the page.

  • Address
  • Phone number
  • Email address
  • Website
  • Rating

Now, as usual, we are going to first analyze their locations inside the DOM, with the help of the Chrome dev tools. Then, using Cheerio, we are going to parse each of them.

 

Identifying the location of each element

Let’s start with the address first and find out its location.

 

Once you inspect it, you will find that all the information we want to scrape is stored inside this div tag with the class name xieb3on. Inside it, we have two more div tags, of which we are interested in the second one because that is where the information lives.

 

 

Let’s find this out first.

$('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
  if(i===1){
    // extraction will go here
  }
})

We have set a condition so that the code runs only when the index i is 1. With this, we have made our intention clear: we are only interested in the second div block. Now, the question is how to extract the address from it. Well, that has become very easy now.

The address can be found inside the first div tag with the class x1heor9g. This div tag is inside the ul tag.

$('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
  if(i===1){
    obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
  }
})

Let’s find the email, website, phone number, and the rating

All four of these elements are hidden inside div tags with the class xu06os2. All four of these div tags are also inside the same ul tag as the address.

 

$(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value =  $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }
      })
arr.push(obj)
obj={}

 

  1. .find('div.xu06os2') is used to find all <div> elements with the class xu06os2 that are descendants of the previously selected <ul> elements.
  2. .each((o,p) => { ... }) iterates over each of the matched <div> elements, executing the provided callback function for each element.
  3. let value = $(p).text().trim() extracts the text content of the current <div> element (p) and trims any leading or trailing whitespace.
  4. The subsequent if conditions check the extracted value for specific patterns using the .includes() method:

a. If the value includes the “+” character, it is assumed to be a phone number, and it is assigned to the obj["phone"] property.

b. If the value includes the word “Rating”, it is assumed to be a rating value, and it is assigned to the obj["rating"] property.

c. If the value includes the “@” character, it is assumed to be an email address, and it is assigned to the obj["email"] property.

d. If the value includes the “.com” substring, it is assumed to be a website URL, and it is assigned to the obj["website"] property.

5. arr.push(obj) appends the current obj object to the arr array.

6. obj={} reassigns an empty object to the obj variable, resetting it for the next iteration.
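Pulled out of the scraper, that classification logic is easy to test on its own. A sketch with made-up sample strings (not real data from the page):

```javascript
// Standalone version of the classification logic, so it can be exercised
// without loading the page. The sample strings below are made up.
function classify(values) {
  const obj = {};
  for (const value of values) {
    if (value.includes('+')) obj.phone = value;
    else if (value.includes('Rating')) obj.rating = value;
    else if (value.includes('@')) obj.email = value;       // checked before ".com"
    else if (value.includes('.com')) obj.website = value;
  }
  return obj;
}

console.log(classify([
  '+1 518-555-0100',            // made-up phone number
  'Rating · 4.5 (10 reviews)',  // made-up rating string
  'info@example.com',
  'example.com',
]));
```

Note that the “@” check runs before the “.com” check, which is why an email address is not misclassified as a website.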

Complete Code

Let’s take a look at the complete code along with the response it returns after execution.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scraper(){
  let obj={}
  let arr=[]
  let target_url = 'https://www.facebook.com/nyrestaurantcatskill/'
  const browser = await puppeteer.launch({headless:false});
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  let crop = await page.goto(target_url, {waitUntil: 'domcontentloaded'});
  // give dynamic content a moment to load
  await new Promise((resolve) => setTimeout(resolve, 2000))
  let data = await page.content();
  await browser.close();
  const $ = cheerio.load(data);
  $('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
    if(i===1){
      obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
      $(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value = $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }
      })
      arr.push(obj)
      obj={}
    }
  })
  return {status:crop.status(),data:arr}
}
scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

Let’s run it and see the response.

We successfully scraped a dynamic website using JavaScript rendering through Puppeteer. If you’re looking to deepen your understanding of Puppeteer, don’t miss our detailed guide on web scraping with Puppeteer, a powerful tool for handling JavaScript-heavy sites.

Scraping without getting blocked with Scrapingdog

Scrapingdog is a web scraping API that uses a new proxy on every new request. Once you start scraping any website at scale, you will face two challenges.

  • Puppeteer will consume too much CPU. Your machine will get super slow.
  • Your IP will be banned in no time.

 


 

With Scrapingdog, you can resolve both issues very easily. It uses headless Chrome browsers to render websites, and every request goes through a new IP.

 
const cheerio = require('cheerio');
const unirest = require('unirest');


async function scraper(){
  let obj={}
  let arr=[]
  let api_key='your-api-key'
  let target_url = 'https://www.facebook.com/nyrestaurantcatskill/'
  let data = await unirest.get(`https://api.scrapingdog.com/scrape?api_key=${api_key}&url=${encodeURIComponent(target_url)}`)
  const $ = cheerio.load(data.body)

  $('div.xieb3on').first().find('div.x1swvt13').each((i,el) => {
    if(i===1){
      obj["address"] = $(el).find('ul').find('div.x1heor9g').first().text().trim()
      $(el).find('ul').find('div.xu06os2').each((o,p) => {
        let value =  $(p).text().trim()
        if(value.includes("+")){
          obj["phone"]=value
        }else if(value.includes("Rating")){
          obj["rating"]=value
        }else if(value.includes("@")){
          obj["email"]=value
        }else if(value.includes(".com")){
          obj["website"]=value
        }
      })
      arr.push(obj)
      obj={}
    }
  })
  return {data:arr}
}
scraper().then((data) => {
  console.log(data.data)
}).catch((err) => {
  console.log(err)
})

As you can see, you just have to make a simple GET request, and your job is done. Scrapingdog will handle everything from headless Chrome to retries for you.

You can start with 1,000 free credits to see how Scrapingdog works for your use case.

Why Should You Prefer JavaScript for Web Scraping?

JavaScript, and consequently Node.js, is designed to be non-blocking and asynchronous.

In the web scraping context, this means that Node.js can initiate tasks (such as making HTTP requests) and continue executing other code without waiting for those tasks to complete.

This non-blocking nature allows Node.js to efficiently manage multiple operations concurrently.

Node.js uses an event-driven architecture. When an asynchronous operation, such as a network request or file read, is completed, it triggers an event.

The event loop (discussed in the next section) listens for these events and dispatches callback functions to handle them. This event-driven model ensures that Node.js can manage multiple tasks simultaneously without getting blocked.

With its event-driven and non-blocking architecture, Node.js can easily handle concurrency in web scraping.

It can initiate multiple HTTP requests to different websites concurrently, manage the responses as they arrive, and process them as needed.

This concurrency is essential for scraping large volumes of data efficiently.
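A small sketch of that concurrency, with timers standing in for HTTP requests: three “requests” started together finish in roughly the time of the slowest one, not the sum of all three.

```javascript
// Timers stand in for I/O-bound work such as HTTP requests to different sites.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

async function main() {
  const start = Date.now();
  // All three "requests" start at once; Promise.all waits for all of them.
  const results = await Promise.all([
    delay(100, 'site A'),
    delay(150, 'site B'),
    delay(120, 'site C'),
  ]);
  const elapsed = Date.now() - start;
  console.log(results.join(', '), `in ~${elapsed}ms`); // ~150ms, not ~370ms
  return { results, elapsed };
}

const done = main();
```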

What is the Event Loop in Javascript?

The event loop in JavaScript is a key component that allows Node.js to efficiently manage large-scale web scraping tasks.

Unlike languages like C or C++, which use multiple threads to perform tasks concurrently, JavaScript operates on a single-threaded model and handles tasks asynchronously through its event loop.


A key distinction in JavaScript is that it doesn’t run code in parallel but rather concurrently.

Confused?

Multiple tasks may start at the same time (like API calls or file reads), but JavaScript processes their results one at a time using the event loop.

It doesn’t execute all functions simultaneously, but it gives the illusion of multitasking by juggling asynchronous operations efficiently.

How does Event Loop work?

Here’s the step-by-step procedure:

  • The event loop continuously checks if the call stack is empty.
  • If the call stack is empty, the event loop checks the event queue for pending events.
  • If there are events in the queue, the event loop dequeues an event and executes its associated callback function.
  • The callback function may perform asynchronous tasks, such as reading a file or making a network request.
  • When the asynchronous task is completed, its callback function is placed in the event queue.
  • The event loop continues to process events from the queue as long as events are waiting to be executed.
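The ordering the steps above describe can be observed directly: synchronous code runs first, then microtasks (promise callbacks), then macrotasks (timer callbacks).

```javascript
// Records the order in which the event loop runs different kinds of work.
const order = [];

order.push('sync 1');

setTimeout(() => {
  // Timer callbacks (macrotasks) run after the call stack is empty
  // and after pending microtasks have been processed.
  order.push('timer (macrotask)');
  console.log(order.join(' -> '));
}, 0);

Promise.resolve().then(() => order.push('promise (microtask)'));

order.push('sync 2');
// Logs: sync 1 -> sync 2 -> promise (microtask) -> timer (macrotask)
```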

Conclusion

JavaScript is a powerful language for web scraping, and this tutorial proves it. While you might need to dig into some basics to get started, it offers speed and reliability once you do.

In this tutorial, we explored how JavaScript can be used for scraping both static and dynamic websites. When dealing with dynamic content, web scraping with Playwright is a powerful alternative to Puppeteer, offering great flexibility and reliability.

I just wanted to give you an idea of how JavaScript can be used for web scraping, and I hope this tutorial has given you some clarity.

Happy Scraping !!

Additional Resources

My name is Manthan Koolwal, and I am the founder of scrapingdog.com. I love creating scrapers and seamless data pipelines.
Manthan Koolwal
