How I’ve been Using Puppeteer

Joel Ramos
May 11, 2020 · 10 min read


Hey all 👋

Today I thought I would go over how I’ve been using Puppeteer’s… maybe less obvious features.

If you’ve made it here, you probably already know that Puppeteer is a Node.js library for automating Chrome / Chromium. And I do use it to navigate to pages, click things, and take screenshots.

However!

Puppeteer gives you access to a lot more of the browser, like the console, the browser requests, redirect chains, the TLS certificate, etc. These are the Puppeteer powers I’ve been using more lately, rather than just to imitate user behaviors.

If you’re looking to get started with Puppeteer, maybe take a look at my other post.

What follows will be more about looking under the hood of the browser and verifying that things are working as we expect.

Under the Hood 🔧

As I mentioned at the start, I find myself using Puppeteer for stuff that Selenium WebDriver can’t do.

For example, maybe we want to ensure that a certain response has a certain header when users navigate to a page.

Or, maybe you want to check that a page is rendered on the server, and so certain HTTP requests should not be present when you navigate directly to a url.

Accessing this information is relatively simple to do with Puppeteer, so let’s do it!

Inspecting HTTP Responses 💌

So there are actually two approaches to inspecting responses.

  1. You can grab the response returned by the call to page.goto() and inspect that.
  2. You can also intercept all the requests the browser makes as you navigate around. You can then use JavaScript logic to pluck out the ones you want, grab values from them, or return a mocked response instead, etc.

The first one is simpler, so let’s start with that.

If we store the result of the call to page.goto() in a variable, we can access the functions of the Response object we got back.

// index.js
...
// navigate to a page
const response = await page.goto('https://www.google.com')
console.log('Response headers: ', response.headers())
...

When you run the script with node index.js this time, you should see the response headers logged to the console.

Gif: running the script and seeing the response headers logged to the terminal. (The headless option was removed for the gif so that the browser window would not display.)

🎉

Note: you could just make a request without a browser to inspect the response, but it can be useful to inspect the browser requests while a UI test runs.
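For example, here’s a quick sketch of doing exactly that with Node’s built-in https module (no Puppeteer involved):

// without a browser: log just the response status and headers
const https = require('https')

https.get('https://www.google.com', res => {
  console.log('Status:', res.statusCode)
  console.log('Headers:', res.headers)
  res.resume() // drain the body; we only care about the headers here
})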

Back in Puppeteer: now that we have access to the headers, we can verify that a particular header is present in the response, or that its value is what we expect. And we don’t need any extra tools! 😅
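For instance, a minimal header check could look something like this (a sketch, using Node’s built-in assert module; the content-type assertion and URL are just illustrations):

// sketch: assert on a response header after navigating
const puppeteer = require('puppeteer')
const assert = require('assert');

(async () => {
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    const response = await page.goto('https://www.google.com')
    // response.headers() returns an object with lower-cased header names
    const headers = response.headers()
    assert.ok(headers['content-type'].includes('text/html'), 'expected an HTML content-type')
    console.log('Header check passed ✅')
  } finally {
    await browser.close()
  }
})()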

Let’s look at how to do #2 now.

If the approach from #1 was like tossing a request to the server and catching the response, this approach is more like listening in on the browser’s conversation.

For #2, we need to tell Puppeteer that any time the Page gets a response from a request the webpage sent, we wanna know about it. So, we’ll add a call to page.on() and tell it we’re interested in responses.

Here’s the whole thing. Notes below.

const puppeteer = require('puppeteer');

(async () => {
  // Puppeteer stuff
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    // create a string of repeating '=' equal signs
    // for separation when logging to the console
    const separator = Array(156).join('=')
    // Tell Puppeteer what to do with intercepted responses
    page.on('response', async response => {
      // Ignore OPTIONS requests
      if (response.request().method() !== 'POST') return
      if (response.url().includes('/graphql')) {
        console.log('\n 🚀 We got one!: ', response.url())
        const data = await response.json()
        console.log(separator)
        console.table(data.data.ships)
        console.log('\nHeaders: \n', response.headers())
        console.log()
        console.log(separator)
      }
    })
    // navigate to a page
    const pageUrl = 'https://spacex-ships.now.sh'
    await page.goto(pageUrl, {waitUntil: 'networkidle0'})
  } finally {
    console.log('Closing the browser...')
    await browser.close()
  }
})();

Note: for this example, I wasn’t sure about the ethical implications of hitting someone’s production website and logging data from their requests, so I ended up making a small React site using create-react-app that makes a single request for Ships from the SpaceX GraphQL API.

You can find the code for the SpaceX Ships site on GitLab: https://gitlab.com/joel.mariano.ramos/spacex-ships

Let’s take this apart and go through it:

  1. page.on() takes two parameters. The first is the name of the event to listen for. The second is the function you want to execute when the event occurs.
    I.e., page.on('event to listen for', functionToCall(event))
    Thus, our version was more like this:
    page.on('response', function(response) { // stuff } )
    We used the arrow-function syntax for the function, so:
    page.on('response', response => { // stuff } )
    And then since we wanted to grab the data from the response, which is an asynchronous operation, we needed to use await. So our function also needed to be an async function.
    page.on('response', async (response) => { await response.json() } )
  2. I know my app makes a request to https://api.spacex.land/graphql/ when the page loads. So inside of our “callback” function, I have a check to see, if(response.url().includes('/graphql')) and if so, grab the response data and log it. I also log the response headers, just to show that we can.
  3. I’ve also passed the {waitUntil: 'networkidle0'} options object to page.goto() to tell Puppeteer to wait for all the page requests to finish before continuing.

Mid-way through all this I decided to start putting these examples in their own folders, lol. So I’m going to run this one with node example3/index.js

When the script runs, console.table() logs the Ships data returned by the SpaceX GraphQL API. Since the data comes back as JSON, each column in the table is a field from each ship object in the response.

The JSON for the first ship result from the API looks like this (it’s the same GO Ms Tree object whose shape we’ll reuse for our mock data later):
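{
  active: true,
  class: 15252765,
  name: 'GO Ms Tree',
  image: 'https://i.imgur.com/MtEgYbY.jpg',
  id: 'GOMSTREE',
  year_built: 2015,
  type: 'High Speed Craft'
}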

And we can see that this matches the first Ship displayed on the page.

Inspecting outgoing requests is a similar process. Instead of listening for the 'response' event, you listen for the 'request' event in the page.on() function.
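A minimal sketch of that (registered before the page.goto() call, just like the response listener):

// sketch: log outgoing /graphql requests as the page makes them
page.on('request', request => {
  if (request.url().includes('/graphql')) {
    console.log('Outgoing request:', request.method(), request.url())
    console.log('Request headers:', request.headers())
  }
})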

Altering Outgoing Requests 🛠

We can also add our own request headers to be sent with browser requests. This is useful if you need to pass feature flags, or certain headers to verify the right things are happening at edge servers, etc.

Note that there’s actually a page.setUserAgent() function specifically for setting the user-agent.
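For example (the user-agent string here is just a placeholder):

// set a custom user-agent for this page, before navigating
await page.setUserAgent('Mozilla/5.0 (compatible; puppeteer-example/1.0)')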

Here we’ll add the headers to the Page object and we’ll make sure to do so before navigating.

// example4/index.js
...
await page.setExtraHTTPHeaders({
  'x-test-header': 'much wow!'
})
...
// navigate to a page
const pageUrl = 'https://spacex-ships.now.sh'
const response = await page.goto(pageUrl, {waitUntil: 'networkidle2'})
console.log(separator)
console.log(response.request().headers())
console.log(separator)
...

Now when we run our script, we can see our x-test-header among the request headers logged to the console.

Additionally, we can avoid hitting the API altogether and just return our own response instead with mock data.

For this, we need the following:

  1. We need to setRequestInterception(true)
  2. Then we need to listen for 'request' events like we did for responses earlier.
  3. Once we’ve intercepted a target request, we need to tell Puppeteer to return our own response rather than sending the request to the API.

For #1, we just add one line, and this can be done once we have a Page object.

...
const page = await browser.newPage()
// Enable request interception
await page.setRequestInterception(true)

For #2, we’ll use page.on('request') to intercept /graphql POST requests, and tell Puppeteer to send any other request on its way.

First let’s define our mock data. From the earlier examples we know the shape of the response JSON looks like this:

{
  data: {
    ships: [
      {
        active: true,
        class: 15252765,
        name: 'GO Ms Tree',
        image: 'https://i.imgur.com/MtEgYbY.jpg',
        id: 'GOMSTREE',
        year_built: 2015,
        type: 'High Speed Craft'
      },
      ...
    ]
  }
}

Thus, if we want our response to succeed it needs to be in the same shape and with the same types.

const mockData = JSON.stringify({
  data: {
    ships: [{
      active: true,
      class: 12345678,
      name: 'Puppeteer is not a SpaceX Ship!!',
      image: 'https://bit.ly/2xP93Pd',
      id: 'PUPPETEER',
      year_built: 2017,
      type: 'Browser Automation Tool'
    }]
  }
})

For our purposes, we return just one mock “ship”. Note that we JSON.stringify() the JavaScript object before passing it along as the response body.

For #3, we can now tell Puppeteer to respond with our mockData when the page tries to request the ships data from the SpaceX API.

page.on('request', async request => {
  const isGraphQL = request.url().includes('/graphql')
  const isPOST = request.method() === 'POST'
  if (isGraphQL && isPOST) {
    console.log('\n 🚀 We got one!: ', request.url())
    await request.respond({
      status: 200,
      contentType: 'application/json',
      body: mockData
    })
  } else {
    request.continue()
  }
})

The else is important here because we called setRequestInterception(true). This means Puppeteer will hold every request until you tell it what to do with it.

Thus, we not only tell Puppeteer to send our mock data back for /graphql requests; when Puppeteer intercepts a request that we’re not interested in, we tell Puppeteer to send it on its way with request.continue().
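As an aside, responding and continuing aren’t the only options: request.abort() drops an intercepted request entirely. A small sketch (blocking image requests here is just an illustration, not something our example needs):

// sketch: with interception enabled, block images and let everything else through
page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort()
  } else {
    request.continue()
  }
})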

Let’s also have Puppeteer take a screenshot after the page loads so we can verify our mockData got rendered.

await page.screenshot({path: './puppeteer-example5.png'})

The full code example looks like this:

const puppeteer = require('puppeteer');

(async () => {
  // Puppeteer stuff
  const browser = await puppeteer.launch({headless: false})
  try {
    const page = await browser.newPage()
    // Step 1: Enable request interception
    await page.setRequestInterception(true)
    // Step 2: define mock data
    const mockData = JSON.stringify({
      data: {
        ships: [{
          active: true,
          class: 12345678,
          name: 'Puppeteer is not a SpaceX Ship!!',
          image: 'https://bit.ly/2xP93Pd',
          id: 'PUPPETEER',
          year_built: 2017,
          type: 'Browser Automation Tool'
        }]
      }
    })
    // Step 3: Intercept GQL requests and return mockData
    page.on('request', async request => {
      const isGraphQL = request.url().includes('/graphql')
      const isPOST = request.method() === 'POST'
      if (isGraphQL && isPOST) {
        console.log('\n 🚀 We got one!: ', request.url())
        await request.respond({
          status: 200,
          contentType: 'application/json',
          body: mockData
        })
      } else {
        request.continue()
      }
    })
    // navigate to a page
    const pageUrl = 'https://spacex-ships.now.sh'
    await page.goto(pageUrl, {waitUntil: 'networkidle0'})
    await page.waitForSelector('img')
    await page.screenshot({path: './puppeteer-example5.png'})
  } finally {
    console.log('Closing the browser...\n')
    await browser.close()
  }
})();

And let’s check our screenshot to see what we got!

Screenshot taken by Puppeteer

🎉

Note: I’ve noticed some weirdness between headless and headful modes. It seems like my page doesn’t want to render the mocked response in headless mode.

I’ve also seen some issues on the Puppeteer GitHub regarding differences between headless and headful modes. So just keep an eye out. In this case, the screenshots were always capturing the loading spinner no matter how long I told Puppeteer to wait.

Anyway… onward! What else can we look at?

Certificate Details

Browsers nowadays warn you when you’re navigating to a site that does not have a valid certificate. This can happen if the page is served over HTTP and not HTTPS (i.e., does not have a cert and thus is not encrypting the web traffic), or if the site has a cert but it’s not signed by a known Certificate Authority (CA).

Most sites that you browse should have a valid certificate that you can view by clicking the lock 🔒 icon in your browser’s address bar, usually to the left of the URL.

Certificate for https://spacex-ships.now.sh

I’ve deployed the site via Vercel (formerly Zeit), and it looks like they’ve used Let’s Encrypt as the certificate authority for *.now.sh domain names.

Puppeteer can let us inspect some of this information, so let’s update our script.

We can get rid of a lot of what was there for inspecting network requests, and instead grab the response from the navigation to the page. From there, we can grab the securityDetails.

// example6/index.js
...
const page = await browser.newPage()
// navigate to a page
const pageUrl = 'https://spacex-ships.now.sh'
const response = await page.goto(pageUrl, {waitUntil: 'networkidle0'})
const cert = await response.securityDetails()
console.log(cert)

...

This will yield:

Certificate information for https://spacex-ships.now.sh

We can also log those timestamps as actual Date objects if we want.

const cert = await response.securityDetails()
console.log({
  ...cert,
  _validFrom: new Date(cert._validFrom * 1000),
  _validTo: new Date(cert._validTo * 1000)
})

Conclusion

At this point you should be able to navigate your way around Puppeteer. Listening for console messages is very similar to listening for network requests: you listen for 'console' events rather than 'request' or 'response' events, and then filter for the messages you care about. There are a number of other events you can listen for, so be sure to check out the Puppeteer documentation.
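A quick sketch of what that might look like:

// sketch: relay the page's console messages to the Node terminal
page.on('console', message => {
  console.log(`[browser ${message.type()}] ${message.text()}`)
})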

I’m also keeping my eye on Playwright. I read somewhere that the folks who made Puppeteer left Google, joined Microsoft, and built Playwright with almost the same API as Puppeteer, and it can drive WebKit (Safari’s browser engine) in addition to Chromium and Firefox. And it looks like version 1.0 has been released!

Anyway, hope this helps! Let me know if you have questions 🙂
