A steady flow of event data helps marketers, event management companies, and researchers track events happening around the world.
In this tutorial, we will learn how to scrape Google Events Results using Node JS with Puppeteer JS.
Let’s start Scraping Google Events
Tracking events or occasions celebrated at a particular location becomes much more manageable when you can access Google Events results in bulk. In this section, I will show you how Node JS can be used to scrape Google Events results.
Let’s start our project by completing the requirements.
Web Parsing with CSS selectors
Hunting for the right tags in raw HTML is both difficult and time-consuming. It is better to use the CSS Selector Gadget to pick the right tags and make your web scraping journey easier.
This gadget can help you come up with the perfect CSS selector for your needs. Here is the link to the tutorial, which will teach you how to use this gadget to select the best CSS selectors according to your requirements.
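For example, once the gadget highlights a selector, you can test it straight in your browser's DevTools console. A quick sketch (the class names li.PaEvOc and .YOGjf are the ones used later in this tutorial; Google rotates such class names over time, so verify them first):

// Collect one title per event card using the selector the gadget suggested
const titles = Array.from(
  document.querySelectorAll("li.PaEvOc .YOGjf")
).map((el) => el.textContent);
console.log(titles);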
User Agents
The User-Agent header identifies the application, operating system, vendor, and version of the requesting user agent, which helps your requests to Google pass as visits from a real user.
You can also rotate User Agents; read more about this in this article: How to fake and rotate User Agents using Python 3.
If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.
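In Puppeteer, for instance, you can set a custom User-Agent on a page before navigating. A minimal sketch (the UA string below is just an example; substitute any recent browser UA, and assume a page object like the one created later in this tutorial):

// Make the automated browser announce itself as a regular Chrome build
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
);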
Install Libraries
Before we begin, install the library we need so we can prepare our scraper. Type the below command in your project terminal:
npm i puppeteer
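This installs Puppeteer along with a bundled Chromium build, so you can then pull it into your script:

const puppeteer = require("puppeteer"); // Chromium is downloaded automatically on install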
Process
Open the URL in your browser so we can start the scraping process.
https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in
Our query is "Events in Delhi". Then we have the Google Search parameter for displaying event results, ibp=htl;events. After that, hl is the language parameter; you can set any language. I have used English here since most readers will find it easiest to follow. Then we have the geolocation parameter gl, which takes a different value for each country. For example, for the USA it would be gl=us.
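If you want to scrape other queries or locations, it can help to build this URL programmatically. A small sketch (note that ibp=htl;events contains a semicolon, which URLSearchParams would percent-encode, so we assemble the string manually):

const buildEventsUrl = (query, hl = "en", gl = "in") =>
  `https://www.google.com/search?q=${encodeURIComponent(query)}&ibp=htl;events&hl=${hl}&gl=${gl}`;

console.log(buildEventsUrl("events in delhi"));
// → https://www.google.com/search?q=events%20in%20delhi&ibp=htl;events&hl=en&gl=in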
We will use the Puppeteer Infinite Scrolling Method to scrape the Google Events Search Results. So, let us start preparing our scraper.
First, let us create a primary function that will launch the browser and navigate to the target URL.
const getEventsData = async () => {
  // Launch a visible (non-headless) Chromium instance
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--disable-setuid-sandbox", "--no-sandbox"],
  });
  const [page] = await browser.pages();
  await page.goto(
    "https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in",
    {
      waitUntil: "domcontentloaded",
      timeout: 60000,
    }
  );
  await page.waitForTimeout(5000);
  let data = await scrollPage(page, ".UbEfxe", 20);
  console.log(data);
  await browser.close();
};
Step-by-step explanation:
puppeteer.launch()
– This will launch the Chromium browser with the options set in our code. In our case, we launch the browser in non-headless mode.
browser.pages()
– This returns the pages already open in the browser; destructuring the first entry gives us the default tab (an alternative using browser.newPage() is sketched after this list).
page.goto()
– This will navigate the page to the specified target URL.
page.waitForTimeout()
– This makes the page wait for 5 seconds before doing further operations.
scrollPage()
– At last, we call our infinite scroller with the page, the CSS selector of the scrolling container div, and the number of items we want as parameters.
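By the way, instead of reusing the default tab from browser.pages(), you could open a fresh tab with browser.newPage() and attach custom headers to every request it makes. A small sketch:

// Alternative setup: a new tab that sends an Accept-Language header with each request
const page = await browser.newPage();
await page.setExtraHTTPHeaders({ "Accept-Language": "en-US,en;q=0.9" });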
Now, let us prepare the infinite scroller.
const scrollPage = async (page, scrollContainer, itemTargetCount) => {
  let items = [];
  let previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (itemTargetCount > items.length) {
    items = await extractItems(page);
    // Scroll the results container to its bottom
    await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    // Wait until the container grows, i.e. new results have loaded
    await page.waitForFunction(`document.querySelector("${scrollContainer}").scrollHeight > ${previousHeight}`);
    previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    await page.waitForTimeout(2000);
  }
  return items;
};
Step-by-step explanation:
previousHeight
– The scroll height of the container before scrolling.
extractItems()
– Our function that parses the scraped HTML.
- In the next step, we scroll the container down to a height equal to its scrollHeight.
- Finally, we wait for the container's height to grow larger than previousHeight, which means new results have loaded.
And, at last, we will talk about our parser.
const extractItems = async (page) => {
  let events_results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
      return {
        title: el.querySelector(".YOGjf")?.textContent,
        timings: el.querySelector(".cEZxRc")?.textContent,
        date: el.querySelector(".gsrt")?.textContent,
        address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
          return el.textContent;
        }),
        link: el.querySelector(".zTH3xc")?.getAttribute("href"),
        thumbnail: el.querySelector(".wA1Bge")?.getAttribute("src"),
        location_link: el.querySelector(".ozQmAd")
          ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url")
          : "",
        tickets: Array.from(el.querySelectorAll('.RLN0we[jsname="CzizI"] div[data-domain]')).map((el) => {
          return {
            source: el?.getAttribute("data-domain"),
            link: el.querySelector(".SKIyM")?.getAttribute("href"),
          };
        }),
        venue_name: el.querySelector(".RVclrc")?.textContent,
        venue_rating: el.querySelector(".UIHjI")?.textContent,
        venue_reviews: el.querySelector(".z5jxId")?.textContent,
        venue_link: el.querySelector(".pzNwRe a")
          ? "https://www.google.com" + el.querySelector(".pzNwRe a").getAttribute("href")
          : "",
      };
    });
  });
  // Remove keys whose values are undefined, empty strings, or empty arrays
  for (let i = 0; i < events_results.length; i++) {
    Object.keys(events_results[i]).forEach((key) =>
      events_results[i][key] === undefined ||
      events_results[i][key] === "" ||
      events_results[i][key].length === 0
        ? delete events_results[i][key]
        : {}
    );
  }
  return events_results;
};
Step-by-step explanation:
document.querySelectorAll()
– This will return all the elements that match the specified CSS selector. In our case, the parent selector is li.PaEvOc.
getAttribute()
– This will return the value of the specified attribute on the selected element.
textContent
– It returns the text content inside the selected HTML element.
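One thing the parser relies on throughout is the optional chaining operator (?.): querySelector() returns null when nothing matches, and reading .textContent off null would throw and abort the whole map() callback. For example:

const title = el.querySelector(".YOGjf")?.textContent; // a string, or undefined if the element is missing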
All the event data lives under the parent tag li.PaEvOc.
return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
The below pieces of data can be scraped easily with the help of the CSS Selector Gadget and some basic parsing skills.
title: el.querySelector(".YOGjf")?.textContent,
timings: el.querySelector(".cEZxRc")?.textContent,
date: el.querySelector(".gsrt")?.textContent,
address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
return el.textContent
}),
link: el.querySelector(".zTH3xc")?.getAttribute("href"),
thumbnail: el.querySelector('.wA1Bge')?.getAttribute("src"),
The address property's container holds more than one element, so we store those elements in an array. A similar approach is used for scraping tickets.
I have also prepended an extra string, https://www.google.com, to the location_link because the scraped URL is relative and would otherwise be incomplete.
location_link: el.querySelector(".ozQmAd") ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url") : "",
The ternary operator says: if the element exists, scrape its value with the target selector; otherwise, leave the field blank.
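If you find the repeated querySelector call hard to read, here is an equivalent sketch with a temporary variable:

// Same logic as the ternary above, written out step by step
const dataUrl = el.querySelector(".ozQmAd")?.getAttribute("data-url");
const location_link = dataUrl ? "https://www.google.com" + dataUrl : "";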
Following the same approach, you can scrape the venue details as well. Here is the complete code:
const puppeteer = require("puppeteer");

const extractItems = async (page) => {
  let events_results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("li.PaEvOc")).map((el) => {
      return {
        title: el.querySelector(".YOGjf")?.textContent,
        timings: el.querySelector(".cEZxRc")?.textContent,
        date: el.querySelector(".gsrt")?.textContent,
        address: Array.from(el.querySelectorAll(".zvDXNd")).map((el) => {
          return el.textContent;
        }),
        link: el.querySelector(".zTH3xc")?.getAttribute("href"),
        thumbnail: el.querySelector(".wA1Bge")?.getAttribute("src"),
        location_link: el.querySelector(".ozQmAd")
          ? "https://www.google.com" + el.querySelector(".ozQmAd")?.getAttribute("data-url")
          : "",
        tickets: Array.from(el.querySelectorAll('.RLN0we[jsname="CzizI"] div[data-domain]')).map((el) => {
          return {
            source: el?.getAttribute("data-domain"),
            link: el.querySelector(".SKIyM")?.getAttribute("href"),
          };
        }),
        venue_name: el.querySelector(".RVclrc")?.textContent,
        venue_rating: el.querySelector(".UIHjI")?.textContent,
        venue_reviews: el.querySelector(".z5jxId")?.textContent,
        venue_link: el.querySelector(".pzNwRe a")
          ? "https://www.google.com" + el.querySelector(".pzNwRe a").getAttribute("href")
          : "",
      };
    });
  });
  // Remove keys whose values are undefined, empty strings, or empty arrays
  for (let i = 0; i < events_results.length; i++) {
    Object.keys(events_results[i]).forEach((key) =>
      events_results[i][key] === undefined ||
      events_results[i][key] === "" ||
      events_results[i][key].length === 0
        ? delete events_results[i][key]
        : {}
    );
  }
  return events_results;
};

const scrollPage = async (page, scrollContainer, itemTargetCount) => {
  let items = [];
  let previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (itemTargetCount > items.length) {
    items = await extractItems(page);
    await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    await page.waitForFunction(`document.querySelector("${scrollContainer}").scrollHeight > ${previousHeight}`);
    previousHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    await page.waitForTimeout(2000);
  }
  return items;
};

const getEventsData = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--disable-setuid-sandbox", "--no-sandbox"],
  });
  const [page] = await browser.pages();
  await page.goto("https://www.google.com/search?q=events+in+delhi&ibp=htl;events&hl=en&gl=in", {
    waitUntil: "domcontentloaded",
    timeout: 60000,
  });
  await page.waitForTimeout(5000);
  let data = await scrollPage(page, ".UbEfxe", 20);
  console.log(data);
  await browser.close();
};

getEventsData();
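If you would rather persist the results than just print them, you can swap the console.log(data) line for a file write. A minimal sketch using Node's built-in fs module:

const fs = require("fs");
// Inside getEventsData(), after scrolling finishes:
let data = await scrollPage(page, ".UbEfxe", 20);
fs.writeFileSync("events.json", JSON.stringify(data, null, 2)); // pretty-printed JSON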
Results:
Our result should look like this 👇🏻:
{
title: 'Armaan Malik',
timings: 'Sun, 7–10 pm',
date: '27Nov',
address: [
'DLF Avenue Saket, A4, Press Enclave Marg, Saket District Centre, District Centre, Sector 6, Pushp Vihar',
'New Delhi, Delhi'
],
link: 'https://insider.in/steppinout-presents-armaan-malik-next-2-you-india-tour-delhi-nov27-2022/event',
thumbnail: 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRHJ2mFEDeFEz-J7OqksfK1TBg_HTNtwKYPnscewHm1gQ&s=10',
location_link: 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x390ce1f4d9f62005:0x3aee569514ba9326?sa=X&hl=en&gl=in',
tickets: [
{
source: 'Songkick.com',
link: 'http://www.songkick.com/concerts/40751584-armaan-malik-at-dlf-avenue-saket?utm_medium=organic&utm_source=microformat'
},
{
source: 'Insider.in',
link: 'https://insider.in/steppinout-presents-armaan-malik-next-2-you-india-tour-delhi-nov27-2022/event'
}
],
venue_name: 'DLF Avenue Saket',
venue_rating: '4.4',
venue_reviews: '39,064 reviews',
venue_link: 'https://www.google.com/search?hl=en&gl=in&q=DLF+Avenue+Saket&ludocid=4246426696954843942&ibp=gwp%3B0,7'
}
......
Conclusion
In this tutorial, we learned to scrape Google Events Results using Node JS. Feel free to message me if I missed something. Follow me on Twitter. Thanks for reading!