Questions & Answers

Web scraping with Node.js & Puppeteer - opening tabs in the same browser instance

Here's my script. I have a .csv file that contains URLs. For each URL, I scrape the text of the element matching the selector ".sc-kIKDeO.hIofms". But when I run my script, it opens a new browser window for each URL, which is slow and inefficient. I need to open a tab in the same browser instance for each URL instead. I have tried different things but still can't get it to work. Could someone help me, please?

const puppeteer = require('puppeteer');
const fs = require('fs');
const urlList = fs.readFileSync('sample.csv', 'utf-8').split('\n');

urlList.forEach(async (url) => {
    const browser = await puppeteer.launch({ executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe', headless:true, arg: ['--incognito'] });
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', request => {
        if (request.resourceType() === 'image' || request.resourceType() === 'stylesheet' || request.resourceType() === 'font' /*|| request.resourceType() === 'script' && !request.url().includes('.sc-kIKDeO.hIofms')*/) {
            request.abort();
        } else {
            request.continue();
        }
    });
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });
    try {
        await page.waitForSelector('.sc-kIKDeO.hIofms')
        const text = await page.evaluate(() => document.querySelector('.sc-kIKDeO.hIofms').textContent);
        console.log(text);
    } catch (error) {
        console.log('error:${error}');
    }
    await browser.close();
});

Goal: scrape data from several URLs by opening each URL in a separate tab of the same browser instance.

Answers (1):

You just need to create the browser before the loop. Because you create the browser inside the loop, a new one is launched for every URL. You only need one browser instance, on which you open new tabs (called pages in Puppeteer).

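Move this line before the loop (the option must be spelled "args"; the original "arg" key is not a recognized launch option, so "--incognito" was never passed):

const browser = await puppeteer.launch({ executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe', headless:true, args: ['--incognito'] });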

and this line after the loop:

await browser.close();

And wrap everything in an anonymous async function, because in a CommonJS script "await" can only be used inside an async function.

It should look like this:

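const puppeteer = require('puppeteer');
const fs = require('fs');

// One URL per line; trim whitespace (including '\r' from Windows line endings) and skip blank lines.
const urlList = fs.readFileSync('sample.csv', 'utf-8')
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

const SELECTOR = '.sc-kIKDeO.hIofms';
const BLOCKED_TYPES = ['image', 'stylesheet', 'font'];

(async () => {
    // Launch a single browser; every URL gets its own tab (page) in it.
    const browser = await puppeteer.launch({
        executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
        headless: true,
        args: ['--incognito'],
    });

    // forEach does not wait for async callbacks, so browser.close() would run
    // before any page had finished. Collect the promises with map() and await them all.
    await Promise.all(urlList.map(async (url) => {
        const page = await browser.newPage();
        try {
            // Skip images, stylesheets and fonts to speed up loading.
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                if (BLOCKED_TYPES.includes(request.resourceType())) {
                    request.abort();
                } else {
                    request.continue();
                }
            });
            await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });
            await page.waitForSelector(SELECTOR);
            const text = await page.evaluate((sel) => document.querySelector(sel).textContent, SELECTOR);
            console.log(text);
        } catch (error) {
            // Backticks are required for ${...} interpolation.
            console.log(`error: ${error}`);
        } finally {
            await page.close(); // close the tab once it is no longer needed
        }
    }));

    // Every tab is done at this point, so the shared browser can be closed.
    await browser.close();
})();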
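
A side note for long URL lists: starting a tab for every URL at the same time can use a lot of memory. One possible way around that is to cap how many tabs run at once. The sketch below is only an illustration under that assumption. The helper names parseUrlList and runWithLimit are made up for this example and are not part of Puppeteer, and the worker is left generic so the snippet runs without a browser.

```javascript
// Hypothetical helper: turn the raw CSV text into a clean list of URLs.
function parseUrlList(csvText) {
  return csvText
    .split(/\r?\n/)                      // accept both CRLF and LF line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);  // drop blank lines, e.g. after a trailing newline
}

// Hypothetical helper: call worker(item, index) for every item, with at most
// `limit` calls in flight at once. Results keep the order of `items`.
// The worker should handle its own errors; a rejection aborts the whole run.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner keeps pulling the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

With Puppeteer, the worker would be the per-URL function (open a page on the shared browser, read the text, close the page), called for example as runWithLimit(parseUrlList(csvText), 5, scrapeOne). Here scrapeOne and the limit of 5 are placeholders to adapt to your machine and target site.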