5 worst bugs I've seen on production #2: the infinite crawler

This surfaced after a normal rollout to a new client. Our scraper had been stable for months, but suddenly machines were busy and the load average crept up, with no obvious errors. The crawler’s logic was simple: find the “next” button by selector, read its href, and follow that URL to get the next page.
What is it?
A web crawler is a program that automatically follows links to fetch pages. If a loop sends it back to the start and it keeps following links, it can get stuck in an infinite loop. See: Web crawler (Wikipedia).
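To make the failure mode concrete, here is a minimal sketch of a naive crawl loop (extractNextUrl is a hypothetical helper, not our actual code). With no memory of what it has already visited, a single link pointing back to an earlier page traps it forever:

// Naive crawler sketch: follow "next" links with no cycle detection.
// extractNextUrl is hypothetical; it pulls the next-page href out of the HTML.
declare function extractNextUrl(html: string, baseUrl: string): string | null

async function naiveCrawl(startUrl: string): Promise<void> {
  let url: string | null = startUrl
  while (url !== null) {
    const res = await fetch(url)
    const html = await res.text()
    // ...save the page somewhere...
    url = extractNextUrl(html, url) // blindly trust the "next" link
    // If any page links back to an earlier one, this loop never terminates.
  }
}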
Problem
We identified the next page by querying the “next” button and following its link, roughly like this:
// Advance by following whatever button.next points at; stop only when it disappears.
const next = document.querySelector('button.next')
if (!next) {
  done = true                        // no next button: assume we hit the last page
} else {
  url = next.getAttribute('href')    // otherwise follow its href blindly
}
On this site, the last page didn’t remove the button. Instead, the same selector matched a “back to start” button that pointed to page 1. So when we reached the end, we jumped back to the beginning and continued forever.
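To illustrate with hypothetical markup (not the client's actual HTML): on the last page the pager still contains a button.next element, it just points somewhere else.

// Hypothetical last-page markup: the "back to start" control reuses the .next class.
const lastPageHtml = `
  <nav class="pager">
    <a href="/article?page=49">Previous</a>
    <button class="next" href="/article?page=1">Back to start</button>
  </nav>
`
// In a browser (or headless browser) context, the same selector happily matches it:
const doc = new DOMParser().parseFromString(lastPageHtml, 'text/html')
const next = doc.querySelector('button.next')
console.log(next?.getAttribute('href')) // "/article?page=1", and the lap starts over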
Impact
Workers stayed busy. Each lap took about a second, so graphs looked healthy. The database told the truth: thousands of “pages” saved for an article that had ~50 real pages.
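A check along these lines is what made it obvious (the table, columns, and pg client here are assumptions for illustration; our actual names differed):

import { Client } from 'pg'

// Hypothetical schema: pages(article_id, url, content_hash, crawled_at).
// Articles with thousands of saved rows but only ~50 distinct hashes are the crawler lapping.
async function findSuspectArticles(db: Client, cap = 200): Promise<void> {
  const { rows } = await db.query(
    `SELECT article_id,
            COUNT(*)                     AS saved_pages,
            COUNT(DISTINCT content_hash) AS distinct_pages
       FROM pages
      GROUP BY article_id
     HAVING COUNT(*) > $1`,
    [cap],
  )
  for (const row of rows) {
    console.warn('suspect article', row)
  }
}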
Signals that gave it away:
- Repeated URL pattern (page query flickering between last and 1)
- Duplicate content hashes for consecutive pages
- Page count exceeding a sensible cap
- Load average trending up without matching throughput gain
The root cause, in short: a selector along the lines of $('button.next') matched the “back to start” button on the last page, which sent the crawler to page 1 and started the whole lap over.
Solution
We added multiple signals before advancing: URL patterns, page counters, and content hashes to detect repeats. We set a hard page cap and made writes idempotent so re‑processing wouldn’t create new rows. We also tightened the selector logic to require a forward page number in the URL.
A safer crawl loop looked like this (pseudocode):
const seenUrls = new Set<string>()
const seenHashes = new Set<string>()
let page = 1
const maxPages = 200

while (true) {
  const html = await (await fetch(url)).text()
  const hash = stableContentHash(html)

  // Stop on any loop signal: URL already seen, content already seen, or cap hit.
  if (seenUrls.has(url) || seenHashes.has(hash) || page > maxPages) {
    log.warn('stop: loop detected or cap hit', { url, page })
    break
  }

  seenUrls.add(url)
  seenHashes.add(hash)
  await upsertPage({ url, html }) // idempotent write: re-processing can't create new rows

  const nextEl = selectNextButton()
  const nextUrl = nextEl?.getAttribute('href')
  if (!nextUrl) break                      // no next button: genuine last page
  if (!isForwardLink(url, nextUrl)) break  // ensure the page number actually increases
  url = toAbsoluteUrl(nextUrl)
  page += 1
}
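The loop leans on a few helpers; here is one way two of them could look (sketches under our assumptions, not the exact production code). upsertPage itself just needs a unique key on url so that re-processing overwrites instead of inserting.

import { createHash } from 'node:crypto'

// Hash the content after stripping volatile bits so two fetches of the same
// page produce the same hash. The normalization rules here are illustrative.
function stableContentHash(html: string): string {
  const normalized = html
    .replace(/\s+/g, ' ')                 // collapse whitespace
    .replace(/\d{2}:\d{2}(:\d{2})?/g, '') // drop clock-like noise
  return createHash('sha256').update(normalized).digest('hex')
}

// Only advance when the next URL's page number is strictly greater than the
// current one. Assumes a ?page=N query parameter; adjust for other patterns.
function isForwardLink(currentUrl: string, nextUrl: string): boolean {
  const pageOf = (u: string): number | null => {
    const raw = new URL(u, 'https://example.com').searchParams.get('page')
    return raw === null ? null : Number(raw)
  }
  const current = pageOf(currentUrl)
  const next = pageOf(nextUrl)
  if (current === null || next === null || Number.isNaN(next)) {
    return false // unknown pattern: stop rather than risk a loop
  }
  return next > current
}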
Lesson learned: add clear info logs (which page you’re crawling and which loop iteration you’re on), keep database access handy to spot anomalies, and know your system’s flow. This bug was silent: the machines crawled forever while the load quietly climbed. Watching load-average trends helps too; an anomaly lets you pinpoint the day, and from there the commit, that changed behavior and narrow the search fast.
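On the logging point, even a one-line structured log per iteration would have exposed the pattern immediately (the exact field names are illustrative):

// One info line per crawl step: page index, current URL, and the decision taken.
// A log tail that reads "...page 50... page 1... page 50..." gives the loop away.
function logCrawlStep(page: number, url: string, decision: 'advance' | 'stop', reason?: string): void {
  console.info(JSON.stringify({ event: 'crawl_step', page, url, decision, reason }))
}

// e.g. inside the loop:
// logCrawlStep(page, url, 'advance')
// logCrawlStep(page, url, 'stop', 'page cap reached')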
Prevention checklist:
- Hard stop: enforce a strict page cap
- Loop detection: track seen URLs and content hashes
- Forward‑only: validate the next URL actually advances
- Idempotency: upsert writes to avoid duplication (see the sketch after this list)
- Telemetry: log page index, next URL, and decision reason
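For the idempotency item, the write can be a single upsert keyed on the page URL (a sketch assuming Postgres and the pg client; table and column names are illustrative):

import { Client } from 'pg'

declare const db: Client // assumed shared connection, configured elsewhere

// Requires a unique index on pages(url). Re-crawling the same URL then updates
// the existing row instead of inserting a new one, so even a runaway crawler
// cannot multiply rows.
async function upsertPage({ url, html }: { url: string; html: string }): Promise<void> {
  await db.query(
    `INSERT INTO pages (url, html, crawled_at)
     VALUES ($1, $2, NOW())
     ON CONFLICT (url) DO UPDATE
       SET html = EXCLUDED.html,
           crawled_at = EXCLUDED.crawled_at`,
    [url, html],
  )
}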