Playwright content fetcher

rm76 edited this page May 28, 2024 · 23 revisions

Fetching content using Playwright

You can configure changedetection.io to fetch pages through the excellent and very fast Playwright backend (see https://docs.browserless.io/docker/docker-quickstart). Without it, pages are fetched with a plain built-in fetcher that does not run JavaScript.

The official hosted version also comes with one preconfigured Chrome browser (and you can add more!). See https://changedetection.io

See docker-compose.yml for more examples.

Docker Compose based

In docker-compose.yml uncomment PLAYWRIGHT_DRIVER_URL under environment, and the playwright-chrome section under services.
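To locate the lines mentioned above before editing, something like the following may help. It is only a sketch: `find_playwright_lines` is a hypothetical helper name, and the default file name assumes you run it from the folder that holds docker-compose.yml.

```shell
#!/bin/bash
# Sketch: list (with line numbers) the lines that mention the Playwright
# driver URL or the playwright-chrome service, so you know what to uncomment.
# find_playwright_lines is a hypothetical helper, not part of changedetection.io.
find_playwright_lines() {
  grep -n -E 'PLAYWRIGHT_DRIVER_URL|playwright-chrome' "${1:-docker-compose.yml}"
}
```

Call it as `find_playwright_lines` (or pass another path as the first argument), then remove the leading `#` from the lines it reports.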

Docker based

docker run -d --name browserless \
   -e "DEFAULT_LAUNCH_ARGS=[\"--window-size=1920,1080\"]" \
   --rm -p 3000:3000 \
   --shm-size="2g" \
   dgtlmoon/sockpuppetbrowser:latest
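Before pointing changedetection.io at the container, it can be useful to confirm that something is actually listening on the published port. The sketch below is one way to do that with bash's built-in /dev/tcp redirection. `wait_for_port` is a hypothetical helper name, and it only checks that the TCP port accepts a connection, not that the browser itself is healthy.

```shell
#!/bin/bash
# Sketch: wait until a TCP port accepts connections (relies on bash's /dev/tcp).
# wait_for_port is a hypothetical helper, not part of changedetection.io.
# Usage: wait_for_port HOST PORT [TRIES]   (one attempt per second, default 10)
wait_for_port() {
  local host="$1"
  local port="$2"
  local tries="${3:-10}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # The subshell opens the connection and closes it again when it exits.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For the container started above, `wait_for_port 127.0.0.1 3000 && echo "browser is reachable"` would be a typical call.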

Pip install based

This assumes Playwright is installed and run on the same server as changedetection.io. If it runs on a different server, adjust the changedetection.io variables accordingly and make sure the firewall ports are open. The process below was tested and works on Debian 11.

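Add the NodeSource Node.js repo (this guide was written for Node.js 16; note that setup_lts.x installs the current LTS release)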

curl -fsSL https://deb.nodesource.com/setup_lts.x | sudo -E bash -

Install the dependencies

sudo apt install python3-dev python3-pip nodejs build-essential ca-certificates curl dumb-init ffmpeg fontconfig fonts-freefont-ttf fonts-gfs-neohellenic fonts-indic fonts-ipafont-gothic fonts-kacst fonts-liberation fonts-noto-cjk fonts-noto-color-emoji fonts-roboto fonts-thai-tlwg fonts-ubuntu fonts-wqy-zenhei gconf-service git libappindicator1 libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm-dev libgbm1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 locales lsb-release msttcorefonts pdftk unzip wget xdg-utils xvfb

Install playwright via Pip (especially if you get the error "No module named 'playwright'")

python3 -m pip install playwright

Clone the browserless git repo to a folder of your choice (e.g. /opt/browserless)

git clone https://github.com/browserless/chrome /opt/browserless

cd into the folder you cloned the repo into, then run:

npm install
npm run build
npm prune production

Systemd service configs (/etc/systemd/system/)

Example browserless.service:

[Unit]
Description=browserless service
After=network.target

[Service]
Environment=APP_DIR=/opt/browserless
Environment=PLAYWRIGHT_BROWSERS_PATH=/opt/browserless
Environment=CONNECTION_TIMEOUT=60000
Environment=HOST=127.0.0.1
Environment=LANG="C.UTF-8"
Environment=NODE_ENV=production
Environment=PORT=3000
Environment=WORKSPACE_DIR=/opt/browserless/workspace
WorkingDirectory=/opt/browserless
ExecStart=/opt/browserless/start.sh
SyslogIdentifier=browserless

[Install]
WantedBy=default.target

Example changedetection.service

[Unit]
Description=changedetection.io service
After=network.target browserless.service
Wants=browserless.service

[Service]
Environment=PLAYWRIGHT_DRIVER_URL=ws://127.0.0.1:3000/?stealth=1&--disable-web-security=true
ExecStart=/usr/local/bin/changedetection.io -d /opt/change-detection -p 80
SyslogIdentifier=change-detection

[Install]
WantedBy=default.target
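The PLAYWRIGHT_DRIVER_URL value in the unit above is a WebSocket address followed by query options joined with `&`. If the browser runs on a different server, only the host and port change. The sketch below just assembles that string; `build_driver_url` is a hypothetical helper name, and 10.0.0.5 in the usage note is a made-up address.

```shell
#!/bin/bash
# Sketch: assemble a PLAYWRIGHT_DRIVER_URL value from host, port and options.
# build_driver_url is a hypothetical helper; it only joins strings and does
# not check that the options are understood by the browser backend.
build_driver_url() {
  local host="$1"
  local port="$2"
  shift 2
  local query=""
  local opt
  for opt in "$@"; do
    if [ -z "$query" ]; then
      query="?$opt"        # first option starts the query string
    else
      query="$query&$opt"  # later options are joined with &
    fi
  done
  printf 'ws://%s:%s/%s\n' "$host" "$port" "$query"
}
```

`build_driver_url 127.0.0.1 3000 stealth=1 --disable-web-security=true` prints the same value used in the unit file above, and `build_driver_url 10.0.0.5 3000` prints a bare URL for a remote host.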

Enable services:

systemctl enable browserless.service
systemctl enable changedetection.service

Manual control:

systemctl start [service]
systemctl stop [service]
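After creating or editing unit files, systemd has to re-read them (`systemctl daemon-reload`) before the enable and start commands see the changes. A minimal sketch that does the reload and then enables and starts each unit in one go is shown below. `run` and `enable_units` are hypothetical helper names, and `DRY_RUN` is an assumption added here so the commands can be previewed without touching the system.

```shell
#!/bin/bash
# Sketch: reload systemd, then enable and start the given units.
# run and enable_units are hypothetical helpers. Set DRY_RUN=1 to print the
# commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

enable_units() {
  local unit
  run systemctl daemon-reload
  for unit in "$@"; do
    # --now enables the unit for boot and starts it immediately.
    run systemctl enable --now "$unit"
  done
}
```

As root, `enable_units browserless.service changedetection.service` would cover both units from this guide; with `DRY_RUN=1` set it only prints what it would do.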

Playwright memory leak

There seems to be a memory leak in Playwright (https://github.com/microsoft/playwright/issues/6319) and, as yet, there does not seem to be a solution. Memory use can easily grow from 200 MB to several gigabytes. Restarting the service is very fast and is so far the best way to mitigate this.

Run the following from crontab every x minutes:

#!/bin/bash
# Check if >240Mb (RSS is reported in KB) and kill
# @todo - you need to find a way to restart :)
# ps -C only matches the executable name, so match the full command line instead
ps -e -o pid= -o rss= -o args= | awk '$2 > 240000 && /changedetection\.py -d \/datastore/ {print $1}' | while read -r pid
do
  kill -9 "$pid"
  # add your restart line here
  # or use docker restart changedetection.io
done

If you followed the Pip install guide:

Create a file named restart-changedetection.sh with your favorite text editor, copy/paste the script below (edit it if you need to), save it to any folder you want (e.g. /opt) and chmod it to 755:

#!/bin/bash
# Check if >240Mb (column 6 of "ps u" is RSS in KB), kill and restart the service
ps -C changedetection u | grep -v PID | awk '$6 > 240000 {print $2}' | while read -r pid
do
  kill -9 "$pid"
  systemctl restart changedetection.service
done

Use crontab to run it every few minutes: run crontab -e, add a line like the following at the bottom and save:

*/5 * * * * /opt/restart-changedetection.sh >/dev/null 2>&1

The line above runs the script every 5 minutes (e.g. 02:10, 02:15, 02:20...) and discards all output.
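The memory check used by both scripts above can also be separated from the kill/restart step, which makes the threshold easy to try out on sample data and restarts the service only once even if several processes match. This is a sketch rather than a drop-in replacement: `pids_over_limit` and `restart_if_over_limit` are hypothetical helper names, and the process and unit names are the ones assumed in the Pip install section.

```shell
#!/bin/bash
# Sketch: print the PIDs whose RSS (in KB) exceeds a limit.
# Reads "PID RSS" pairs on stdin; pids_over_limit is a hypothetical helper.
pids_over_limit() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

# Restart the service once if any matching process is over the limit.
# Assumes the process name "changedetection" and the unit name
# changedetection.service from the Pip install section.
restart_if_over_limit() {
  local over
  over=$(ps -C changedetection -o pid= -o rss= | pids_over_limit "$1")
  if [ -n "$over" ]; then
    systemctl restart changedetection.service
  fi
}
```

A cron-driven script would then only need to call `restart_if_over_limit 240000`.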