I’d recommend installing Puppeteer with npm, as it will also include a stable, up-to-date Chromium version that is guaranteed to work with the library.
Run this command in your project root directory:
npm i puppeteer --save
Note: This might take a while as Puppeteer will need to download and install Chromium in the background.
Okay, now that we are all set and configured, let the fun begin!
Let’s start our Puppeteer tutorial with a basic example. We’ll write a script that will cause our headless browser to take a screenshot of a website of our choice.
Create a new file in your project directory named screenshot.js and open it in your favorite code editor.
First, let’s import the Puppeteer library in your script:
const puppeteer = require('puppeteer');
Next up, let’s take the URL from command-line arguments:
const url = process.argv[2];

if (!url) {
    throw "Please provide a URL as the first argument";
}
Now, we need to keep in mind that Puppeteer is a promise-based library: It performs asynchronous calls to the headless Chrome instance under the hood. Let’s keep the code clean by using async/await. For that, we first need to define an async function and put all the Puppeteer code in there:
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });

    browser.close();
}

run();
Altogether, the final code looks like this:
const puppeteer = require('puppeteer');

const url = process.argv[2];

if (!url) {
    throw "Please provide a URL as the first argument";
}

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });

    browser.close();
}

run();
You can run it by executing the following command in the root directory of your project:
node screenshot.js https://github.com
Wait a second, and boom! Our headless browser just created a file named screenshot.png, and you can see the GitHub homepage rendered in it. Great, we have a working Chrome web scraper!
Let’s stop for a minute and explore what happens in our run() function above.
First, we launch a new headless browser instance, then we open a new page (tab) and navigate to the URL provided in the command-line argument. Lastly, we use Puppeteer’s built-in method for taking a screenshot, and we only need to provide the path where it should be saved. We also need to make sure to close the headless browser after we are done with our automation.
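By the way, page.screenshot() accepts a few more options than just path. For instance, if you’d like the screenshot to cover the whole scrollable page rather than just the visible viewport, you can pass the fullPage flag; this is a small optional variation on the call above, not something our script requires:

await page.screenshot({
    path: 'screenshot.png',
    fullPage: true  // capture the entire scrollable page, not just the viewport
});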
Now that we’ve covered the basics, let’s move on to something a bit more complex.
For the next part of our Puppeteer tutorial, let’s say we want to scrape the newest articles from Hacker News.
Create a new file named ycombinator-scraper.js and paste in the following code snippet:
const puppeteer = require('puppeteer');

function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");

            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.storylink');
                items.forEach((item) => {
                    results.push({
                        url: item.getAttribute('href'),
                        text: item.innerText,
                    });
                });
                return results;
            });

            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run().then(console.log).catch(console.error);
Okay, there’s a bit more going on here compared with the previous example.
The first thing you might notice is that the run() function now returns a promise, so the async prefix has moved to the promise function’s definition.
We’ve also wrapped all of our code in a try-catch block so that we can handle any errors that cause our promise to be rejected.
And finally, we’re using Puppeteer’s built-in method called evaluate(). This method lets us run custom JavaScript code as if we were executing it in the DevTools console. Anything returned from that function gets resolved by the promise. This method is very handy when it comes to scraping information or performing custom actions.
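To get a feel for evaluate(), here is a minimal, standalone sketch (not part of our scraper) that simply reads the page title inside the browser context and hands the result back to our Node script:

// Runs inside the page; the returned value is serialized back to Node
const title = await page.evaluate(() => document.title);
console.log(title); // e.g. "Hacker News" when we're on the front page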
The code passed to the evaluate() method is pretty basic JavaScript that builds an array of objects, each having url and text fields that represent the story URLs we see on https://news.ycombinator.com/.
The output of the script looks something like this (trimmed for readability; originally there are 30 entries):
[ { url: 'https://www.nature.com/articles/d41586-018-05469-3',
    text: 'Bias detectives: the researchers striving to make algorithms fair' },
  { url: 'https://mino-games.workable.com/jobs/415887',
    text: 'Mino Games Is Hiring Programmers in Montreal' },
  { url: 'http://srobb.net/pf.html',
    text: 'A Beginner\'s Guide to Firewalling with pf' },
  // ...
  { url: 'https://tools.ietf.org/html/rfc8439',
    text: 'ChaCha20 and Poly1305 for IETF Protocols' } ]
Pretty neat, I’d say!
Okay, let’s move forward. We only had 30 items returned, while there are many more available—they are just on other pages. We need to click on the “More” button to load the next page of results.
Let’s modify our script a bit to add support for pagination:
const puppeteer = require('puppeteer');

function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");

            let currentPage = 1;
            let urls = [];

            while (currentPage <= pagesToScrape) {
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.storylink');
                    items.forEach((item) => {
                        results.push({
                            url: item.getAttribute('href'),
                            text: item.innerText,
                        });
                    });
                    return results;
                });

                urls = urls.concat(newUrls);
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }

            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run(5).then(console.log).catch(console.error);
Let’s review what we did here:
1. We added a single argument called pagesToScrape to our main run() function. We’ll use this to limit how many pages our script will scrape.
2. There is one more new variable named currentPage, which represents which page of results we are currently looking at. It’s set to 1 initially. We also wrapped our evaluate() function in a while loop, so that it keeps running as long as currentPage is less than or equal to pagesToScrape.
3. We added the block for moving to a new page and waiting for the page to load before restarting the while loop.
You’ll notice that we used the page.click() method to have the headless browser click on the “More” button. We also used the waitForSelector() method to make sure our logic is paused until the page contents are loaded. Both of those are high-level Puppeteer API methods ready to use out of the box.
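As a side note, if waiting for a selector ever proves unreliable (for example, because the same selector already exists on the old page while the new one is still loading), a common alternative is to wait for the navigation itself. A minimal sketch, not part of our final script:

// Start listening for the navigation before triggering it with the click
await Promise.all([
    page.waitForNavigation(),
    page.click('a.morelink')
]);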
One of the problems you’ll probably encounter during scraping with Puppeteer is waiting for a page to load. Hacker News has a relatively simple structure and it was fairly easy to wait for its page load completion. For more complex use cases, Puppeteer offers a wide range of built-in functionality, which you can explore in the API documentation on GitHub.
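For instance, page.goto() accepts a waitUntil option that delays resolution until the network has gone (mostly) quiet, and page.waitForFunction() lets you wait for an arbitrary condition evaluated inside the page. A rough sketch of both, assuming the same url and page variables as in our earlier snippets:

// Resolve goto() only once there have been no more than 2 network connections for 500 ms
await page.goto(url, { waitUntil: 'networkidle2' });

// Or wait until our own condition becomes true inside the page
await page.waitForFunction(() => document.querySelectorAll('a.storylink').length > 0);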
This is all pretty cool, but our Puppeteer tutorial hasn’t covered optimization yet. Let’s see how we can make Puppeteer run faster.