Scraping Single Page Apps

I do a lot of scraping with PHP. Typically it's really easy: the HTML is rendered in a consistent format, and templates are used to make all the pages the same.

There's a current trend of moving towards JavaScript-rendered pages (either back-end or front-end). When the rendering happens in the browser, traditional means of scraping just don't work, because the HTML returned by the server doesn't yet contain the content.

Instead, you need a browser to render the page first, and then do your normal extraction.

Google's Chrome browser, plus Puppeteer, a Node.js library for driving Chrome headlessly, to the rescue!

To set it all up on a CentOS 8 environment, you'll need to do the following:

Install Google Chrome

sh -c 'echo -e "[google-chrome]\nname=google-chrome - 64-bit\nbaseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64\nenabled=1\ngpgcheck=1\ngpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub" >> /etc/yum.repos.d/google-chrome.repo'
yum update
yum install google-chrome-stable

Install Puppeteer (a Node.js module)

yum install nodejs
npm install -g puppeteer --unsafe-perm=true

Install PHP

yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
yum install https://rpms.remirepo.net/enterprise/remi-release-8.rpm
yum install yum-utils

dnf module reset php
dnf module install php:remi-7.4
yum -y install php php-pecl-memcache php-pecl-memcached php-pecl-mysql php-fpm php-opcache httpd php-gd rsync

Install the Chrome-PHP library via Composer (this assumes Composer is already installed; run the command from your project directory)

composer require helloiamlukas/chrome-php

You can now run the following PHP code to get your fully-rendered page.

<?php
include_once("vendor/autoload.php");
use ChromeHeadless\ChromeHeadless;
// Render the page in headless Chrome and return the final HTML as a string.
$html = ChromeHeadless::url("https://www.google.com/")->getHtml();
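
Once you have the rendered HTML, the "normal extraction" step works as it would for any static page. Here's a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. It isn't part of Chrome-PHP, and it assumes the DOM extension is enabled (the php-xml package on CentOS). The sample markup, the "product" class name and the XPath query are all made up for illustration, and a fixed string stands in for the rendered $html so the example runs without Chrome.

```php
<?php
// Stand-in for the string returned by ChromeHeadless::url(...)->getHtml().
// The markup below is hypothetical sample data.
$html = <<<HTML
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="product"><a href="/p/1">Widget</a></li>
    <li class="product"><a href="/p/2">Gadget</a></li>
  </ul>
</body></html>
HTML;

// Real-world markup is rarely valid, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the link text and href of every product entry.
$products = [];
foreach ($xpath->query('//li[@class="product"]/a') as $link) {
    $products[] = [
        'name' => trim($link->textContent),
        'url'  => $link->getAttribute('href'),
    ];
}

print_r($products);
```

In a real scraper you'd replace the heredoc with the $html from Chrome and change the XPath query to match the site's markup.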

Waiting until the page is fully rendered

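For some pages, you'll need to let the Chrome browser fully load the page and render all of its components before grabbing the HTML.
To make that happen, amend the const response = await page.goto line in the Puppeteer script that Chrome-PHP runs, adding a waitUntil option. With networkidle2, Puppeteer treats navigation as finished once there have been no more than 2 open network connections for at least 500 ms: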

const response = await page.goto(options.url, {'waitUntil':'networkidle2'});

Enjoy :)

Want to get in touch? mail@adsar.co.uk