Andrew B. Collier / @datawookie

What’s New in Firefox 147?

2026-01-24 Firefox browser web scraping

Firefox 147 is a fairly quiet release for web scraping, but there are a few changes that affect how automated browsing behaves, how servers see your requests and how Firefox behaves on Linux.

Read More →

What’s New in Firefox 146?

2025-12-10 Firefox browser web scraping

Firefox version 146 is out! What updates are relevant to web scraping (specifically features that apply to Linux users)?

Read More →

Chrome 143 includes a long list of updates. This post focuses on the changes that are relevant to web scraping in a dynamic, browser-driven environment (the kinds of sites where you rely on Playwright, Puppeteer or Selenium to execute JavaScript, interact with the DOM and extract rendered content). If you’re using simple HTTP requests to scrape static pages these changes won’t affect you. But for anyone running large-scale automation or dealing with modern, JavaScript-heavy websites, a few of the updates in Chrome 143 are worth paying attention to.

Read More →

Camoufox without a Context Manager

2025-12-02 Camoufox Playwright

A common pattern when working with Camoufox in particular and Playwright in general is to use a context manager to handle the browser lifecycle. It’s clean, safe and ensures everything is torn down properly when your script completes. However, sometimes it’s not possible to wrestle all of your code into a context manager. There’s an alternative though.

Read More →

Installing Tor Browser on Ubuntu

2025-12-01 Tor Ubuntu

It’s been a while since I last used the Tor Browser. So long, in fact, that it’s not even installed on my machine. Let’s fix that.

Read More →

What’s New in Firefox 145?

2025-11-22 Firefox web scraping

There’s a skulk of updates in the new 145 version of Firefox. I’ll only be looking at changes that are relevant to web scraping.

Read More →

Evaluating a New Job Market Data Feed

2025-11-10 R Python

I’ve recently been given early access to a service that provides data on job listings published by a wide range of companies. The dataset offers a near real-time view of hiring activity, broken down at the company level. This is a potentially valuable signal for tracking labour market trends, gauging corporate growth or powering job intelligence tools.

Read More →

Downloading from SharePoint

2025-10-25 web scraping SharePoint

Downloading content from SharePoint can be tricky. It might appear that a Microsoft login is required. You might attempt to automate the login process but run into other challenges.

If you are lucky though it might not be all that hard.

For the purpose of illustration I’ll document the process I went through for downloading a CSV document. The URL for the document is stored as URL in a module called const.py.

Read More →

Accelerating BeautifulSoup Encoding Detection

2025-10-18 BeautifulSoup web scraping Python

I’ve noticed that some of my scraper tests are significantly slower than others. On closer examination I discovered that most of the delay is being incurred when BeautifulSoup is parsing HTML. And a significant proportion of that time is spent checking character encoding.

Read More →

Cycles in Lynx Numbers

2025-10-16 R

The paper The Ten-Year Cycle in Numbers of the Lynx in Canada by Charles Elton and Mary Nicholson is a foundational work in population ecology and wildlife dynamics. They documented the regular, roughly ten-year population cycle of the Canadian Lynx (Lynx canadensis), observed in fur trade records from the Hudson’s Bay Company (HBC).

Read More →

What’s happening on the Framebuffer?

2025-07-13

Read More →

Camoufox in Docker

2025-06-11 Camoufox Docker

My scrapers often run in a serverless environment. If I’m using Camoufox then that needs to be baked into my Docker image too.

Read More →

Playwright Browser Footprint

2025-06-06 Playwright web scraping

An elephant footprint in muddy ground with footprints of smaller animals.

Playwright launches a browser. And browsers can be resource hungry beasts.

I often run Playwright on small, resource constrained virtual machines or in a serverless environment. These normally don’t have a lot of memory or disk space. Running out of either of these resources will cause Playwright (and potentially other processes) to fall over.

Is it possible to prune Playwright so that it plays better in a resource constrained environment? Let’s see.

Read More →

Hasler Statistics

2025-05-05 kayak R

An image of three kayak paddlers in triangle formation viewed from in front of the foremost paddler.

The distances of Hasler kayak races for various divisions are nominally 4, 8 and 12 miles. However, the actual distances vary to some degree from one race venue to another. This makes it difficult to compare race times across different races. Using data from Paddle UK I attempt to estimate the actual distances.

Read More →

Get Cookies from Chrome or Firefox

2025-05-03 Selenium Playwright cookies web scraping

Sometimes you’ll want to initiate a Selenium or Playwright session with an existing set of cookies. My approach to this is to retrieve those cookies using a browser and save them to a file so that I can easily load them into my script.

Read More →

Headless Browser Hacks

2025-05-02 Selenium Playwright web scraping framebuffer anti-bot

Sometimes a site will work fine with Selenium or Playwright. Until you try headless mode… Then it might fling up some anti-both mechanism. Or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.

These are the approaches that have worked for me, in order of increasing complexity.

Read More →

Webshare Proxies

2025-04-03 proxy web scraping

The front desk in a stylish hotel. There's a lamp and a vase of yellow flowers.

I previously looked at the NetNut proxies. This post reviews the Webshare proxy service.

Read More →

Test a Playwright Web Scraper

2025-04-02 web scraping testing Playwright Web Scraper Testing

A crash test dummy sitting at a computer in a cosy study.

In the previous post we considered a few approaches to testing a Selenium web scraper. Now we’ll do the same for web scrapers using Playwright.

Read More →

Test a Selenium Web Scraper

2025-03-31 web scraping testing Selenium Selenium Wire Web Scraper Testing

In previous posts we considered a few approaches for testing scrapers targeting static sites. Sometimes you won’t be able to get away with these static tools and you’ll be forced to use browser automation. In this post I’ll look at some options for testing a Selenium web scraper.

Read More →

Iterating over a Paginated List of Links

2025-02-25 web scraping

A grader working on a construction site.

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
traversing a list of products on Amazon and then retrieving reviews for each product; or
navigating the list of items on BBC News and retrieving the full articles.

Read More →

Handling HTML Entities and Unicode

2025-02-08 web scraping HTML Unicode

A robot hand using a paint scraper to remove paint from a wall.

What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.

Read More →

Scraping JSON-LD Data

2025-02-07 web scraping

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.

Read More →

Test a Web Scraper using Patching

2025-01-30 web scraping testing pytest Web Scraper Testing

The previous post in this series considered the mocking capabilities in the unittest package. Now we’ll look at what it offers for patching.

Read More →

Test a Web Scraper using Mocking

2025-01-29 web scraping testing pytest Web Scraper Testing

Previous posts in this series used the responses and vcr packages to mock HTTP responses. Now we’re going to look at the capabilities for mocking in the unittest package, which is part of the Python Standard Library. Relative to responses and vcr this functionality is rather low-level. There’s more work required, but as a result there’s potential for greater control.

Read More →

Test a Web Scraper using VCR

2025-01-28 web scraping testing Web Scraper Testing

In the previous post I used the responses package to mock HTTP responses, producing tests that were quick and stable. Now I’ll look at an alternative approach to mocking using VCR.py.

Read More →

Test a Web Scraper using Responses

2025-01-27 web scraping testing Web Scraper Testing

As mentioned in the introduction to web scraper testing, unit tests should be self-contained and not involve direct access to the target website. The responses package allows you to easily mock the responses returned by a website, so it’s well suited to the job. The package is stable and well documented.

Read More →

Web Scraper Testing

2025-01-26 web scraping testing Web Scraper Testing

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.

Read More →

Zyte API Sessions

2025-01-25 web scraping Zyte

The Zyte API implements session management, which makes it possible to emulate a browser session when interacting with a site via the API.

Read More →

Zyte API Cookie Management

2025-01-22 web scraping cookies Zyte

In a previous post I looked at various ways to use the Zyte API to retrieve web content. Now I’m going to delve into options for managing cookies via the Zyte API.

Read More →

Web Scraping with the Zyte API

2025-01-14 web scraping Zyte

Zyte is a data extraction platform, useful for web scraping and data processing at scale. It’s intended to simplify data collection and, based on my experience certainly does!

Read More →

Installing CPLEX

2024-12-11 optimisation

Quick notes on the process of installing the CPLEX optimiser.

Read More →

Installing MOSEK

2024-12-11 optimisation

Quick notes on the process of installing the MOSEK optimiser.

Read More →

Optimisation with Pyomo

2024-12-10 Python optimisation

Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve, and analyse a wide range of optimisation problems, including Linear Programming (LP) and Mixed-Integer Programming (MIP), nonlinear programming (NLP), and differential equations.

📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.

Read More →

Optimisation with CVXPY

2024-12-09 Python optimisation

CVXPY is a powerful, Open Source optimization modelling library for Python. It provides an interface for defining, solving, and analysing a wide range of convex optimization problems, including Linear Programming (LP), Quadratic Programming (QP), Second-Order Cone Programming (SOCP), and Semidefinite Programming (SDP).

Read More →

Optimisation with SciPy

2024-12-08 Python SciPy optimisation

A concrete water tank and wind pump surrounded by arid scenery and Quiver Trees typical of the Karoo.

SciPy is a general-purpose scientific computing library for Python, with an optimize module for optimisation.

Read More →

Global versus Sequential Optimisation

2024-12-07 optimisation

We will be considering two types of optimisation problems: sequential optimisation and global optimisation. These approaches can be applied to the same problem but will generally yield distinctly different results. Depending on your objective one or the other might be the best fit for your problem.

Read More →

Optimisation Reference Problem

2024-12-06 optimisation

A concrete water tank surrounded by arid scenery typical of the Karoo.

I’m evaluating optimisation systems for application to a large scale solar energy optimisation project. My primary concerns are with efficiency, flexibility and usability. Ideally I’d like to evaluate all of them on a single, well defined problem. And, furthermore, that problem should at least resemble the solar energy project.

Read More →

Scraping and Not Modified Responses

2024-12-03 Python web scraping status: 304

A gloved hand holding a paint scraper that is being used to scrape flakey cream coloured paint off a wall.

In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real world example. It’s a rather deep dive into something that’s actually quite simple. However, I find it helpful for my understanding to pick things apart and understand how all of the components fit together.

Read More →

NetNut Proxies

2024-11-12 proxy

In this post I’ll be testing the proxy service provided by NetNut. For a bit of context take a look at my What is a Proxy? post.

Read More →

What is a Proxy?

2024-11-11 proxy

A proxy is a server or software that acts as an intermediary between a client (often a web browser) and one or more servers, typically on the internet. Proxies are used for a variety of purposes, including improving security, enhancing privacy, managing network traffic, and bypassing restrictions.

Read More →

Migrating from GitLab Pages to Vercel

2024-11-10 GitLab Vercel {blogdown}

A bird with a white body and dark cap against a sky with scattered clouds.

I recently migrated this blog from GitLab Pages to Vercel. There were two main reasons for the move:

The blog was taking too long to build on GitLab Pages, which hindered efficient updates and added unnecessary delays to my workflow. Admittedly, this was partially my own doing since my build process was far too complicated.
I want to have greater control over redirects (specifically the ability to redirect URLs that didn’t end in a slash to ones that did, which was apparently important for SEO purposes).

Read More →

Scraping the NYSE Composite Index

2024-10-25 R web scraping trading

A composite image including elements of financial charts, flags, Wall Street and the NYSE.

For a side project I needed to scrape data for the NYSE Composite Index going back as far as possible.

Read More →

Asset Price Data

2024-10-24 R {alpacar}

In a previous post I looked at retrieving a list of assets from the Alpaca API using the {alpacar} R package. Now we’ll explore how to retrieve historical and current price data.

Read More →

Listing Alpaca Assets

2024-10-21 R {alpacar}

How to list assets available to trade via the Alpaca API using the {alpacar} R package.

Read More →

Authenticate with Alpaca API

2024-10-13 R {alpacar} trading

The {alpacar} package for R is a wrapper around the Alpaca API. API documentation can be found here. In this introductory post I show how to install and load the package, then authenticate with the API and retrieve account information.

Read More →

Earnings Calendar

2024-10-12 BASH web scraping requests Python trading

An abstract image with charts and symbols with an overall economic or financial theme.

A few days ago I wrote about a scraper for gathering economic calendar data. Well, I’m back again to write about another aspect of the same project: acquiring earnings calendar data.

Read More →

Caching & Avoiding Duplication

2024-10-10 web scraping requests HTTP Python

An image of a computer on a wooden desk. On either side of the computer is a table lamp. The computer screen has an image of two cows.

Avoiding data duplication is a persistent challenge with acquiring data from websites or APIs. You can try to brute force it: pull the data again and then compare it locally to establish whether it’s fresh or stale. But there are other approaches that, if supported, can make this a lot simpler.

Read More →

Downloading Files with Selenium

2024-10-05 web scraping Selenium Python

A picture in a style similar to that of Constable depicting cargo being offloaded from a sailing ship.

If you use Selenium for browser automation then at some stage you are likely to need to download a file by clicking a button or link on a website. Sometimes this just works. Other times it doesn’t.

Read More →

Economic Calendar

2024-10-02 web scraping Playwright Python R trading

I needed an offline copy of an economic calendar with all of the major international economic events. After grubbing around the internet I found the Economic Calendar on Myfxbook which had everything that I needed.

Read More →

Your Life in Weeks

2024-08-13 R {ggplot2}

An hourglass in a landscape imitating style of Paul Cézanne.

A few months ago I listened to an episode on the Founder’s Journal podcast that reviewed an essay, The Opportunity Cost of Everything, by Jack Raines. If you haven’t read it, then I suggest you invest 10 minutes in doing so. It will be time well spent.

Read More →