Building Price Stalker's scraping pipeline

When I started building Price Stalker, I assumed scraping would be the easy part: fetch the page, find the price, done. It turned out to be the part that taught me the most about production software.

One tool doesn’t fit all storefronts

E-commerce sites fall into three rough categories. Static pages parse fine with Jsoup, which is fast and cheap. JavaScript-heavy storefronts render prices client-side, so those targets get Selenium and a real browser. For crawl-style collection across category pages, Scrapy does the fan-out. Forcing one tool to handle all three cases gave me the worst of all worlds, so the pipeline picks a strategy per site.
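To make "picks a strategy per site" concrete, here is a minimal sketch of one way to route by site category. Every name in it (StrategyRouter, SiteKind, PriceFetcher, stub) is hypothetical and chosen for illustration; the stub fetchers stand in for real Jsoup, Selenium, or Scrapy integrations, which are not shown.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; this is an illustration, not Price Stalker's actual code.
final class StrategyRouter {

    // The three rough storefront categories.
    enum SiteKind { STATIC_HTML, JS_RENDERED, CATEGORY_CRAWL }

    // One fetcher per category. A real one would wrap an HTML parser,
    // a browser driver, or a crawl job; here it only returns raw price text.
    interface PriceFetcher {
        String name();
        Optional<String> fetchRawPrice(String url);
    }

    private final Map<SiteKind, PriceFetcher> fetchers = new EnumMap<>(SiteKind.class);

    void register(SiteKind kind, PriceFetcher fetcher) {
        fetchers.put(kind, fetcher);
    }

    // Fail loudly when a category has no fetcher instead of silently skipping the site.
    PriceFetcher fetcherFor(SiteKind kind) {
        PriceFetcher fetcher = fetchers.get(kind);
        if (fetcher == null) {
            throw new IllegalStateException("No fetcher registered for " + kind);
        }
        return fetcher;
    }

    // Stub fetcher that returns a canned value, standing in for a real tool.
    static PriceFetcher stub(String name, String cannedPrice) {
        return new PriceFetcher() {
            @Override public String name() { return name; }
            @Override public Optional<String> fetchRawPrice(String url) {
                return Optional.ofNullable(cannedPrice);
            }
        };
    }

    public static void main(String[] args) {
        StrategyRouter router = new StrategyRouter();
        router.register(SiteKind.STATIC_HTML, stub("static-html", "$19.99"));
        router.register(SiteKind.JS_RENDERED, stub("headless-browser", "$24.50"));
        PriceFetcher fetcher = router.fetcherFor(SiteKind.STATIC_HTML);
        System.out.println(fetcher.name() + " -> " + fetcher.fetchRawPrice("https://example.com/item"));
    }
}
```

Keeping the category on the site record, rather than guessing it at fetch time, means a storefront that moves to client-side rendering is a one-field change instead of a code change.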

Scheduling is the easy part — trust is the hard part

Spring’s task scheduling drives the runs:

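// Spring cron uses six fields (second minute hour day-of-month month day-of-week),
// so this fires at the top of every sixth hour: 00:00, 06:00, 12:00, 18:00.
@Scheduled(cron = "0 0 */6 * * *")
public void scrapeTrackedProducts() {
    // Only enqueue here: one job per active product, with the work done downstream of the queue.
    for (TrackedProduct product : trackedProductRepository.findAllActive()) {
        scrapeQueue.publish(new ScrapeJob(product));
    }
}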

The real lesson was what happens after the cron fires. Markup drifts. Sites rate-limit. A selector that worked for weeks suddenly matches nothing. A scraper that fails loudly is fine; one that silently returns empty results poisons your price history. The fix: defensive parsing, empty-result alerting, and treating scrapers as production services with health signals — not scripts.
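Defensive parsing and empty-result alerting can be sketched roughly as follows. This assumes US-style price text (comma thousands separators, dot decimals) and an arbitrary alert threshold; ScrapeHealth, parsePrice, and record are hypothetical names, and a real setup would feed the alert into whatever monitoring is already in place.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: strict price parsing plus a counter for consecutive empty results.
final class ScrapeHealth {

    // Assumes US-style amounts such as "19.99" or "1,299.00".
    private static final Pattern AMOUNT =
            Pattern.compile("\\d+(?:,\\d{3})*(?:\\.\\d{1,2})?");

    // Returns empty instead of a guess when the text holds no positive amount.
    static Optional<BigDecimal> parsePrice(String rawText) {
        if (rawText == null || rawText.trim().isEmpty()) {
            return Optional.empty();
        }
        Matcher matcher = AMOUNT.matcher(rawText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        BigDecimal value = new BigDecimal(matcher.group().replace(",", ""));
        // A zero price is treated as a parse failure, not a bargain.
        return value.signum() > 0 ? Optional.of(value) : Optional.empty();
    }

    private final int alertThreshold;
    private int consecutiveEmpty = 0;

    ScrapeHealth(int alertThreshold) {
        this.alertThreshold = alertThreshold;
    }

    // Records one scrape outcome; returns true when the caller should raise an alert.
    boolean record(Optional<BigDecimal> result) {
        if (result.isPresent()) {
            consecutiveEmpty = 0;
            return false;
        }
        consecutiveEmpty++;
        return consecutiveEmpty >= alertThreshold;
    }

    public static void main(String[] args) {
        ScrapeHealth health = new ScrapeHealth(2);
        for (String raw : new String[] {"$1,299.00", "Out of stock", ""}) {
            Optional<BigDecimal> price = parsePrice(raw);
            System.out.println("'" + raw + "' -> " + price + ", alert=" + health.record(price));
        }
    }
}
```

The point of returning an empty Optional is that a missing price never gets written to the history as a number. The counter turns a run of empties into a signal someone sees, rather than a gap someone discovers weeks later.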

Why a queue sits in the middle

Scrape jobs are published to RabbitMQ rather than calling the notification logic directly. Runs are bursty, downstream work (diffing price history, sending alert emails) is not, and a queue between them absorbs the difference. Retries and back-pressure then come from the broker's acknowledgements and consumer prefetch limits instead of hand-rolled code.
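RabbitMQ needs a running broker and a client library, so the sketch below swaps in a bounded in-memory queue from java.util.concurrent to show the two properties in isolation. It is a stand-in, not RabbitMQ code, and BoundedJobQueue, Job, publish, and processNext are hypothetical names. A full queue pushes back on the producer, and a failed job is re-enqueued until it reaches an attempt limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for a broker queue; illustrates the idea only.
final class BoundedJobQueue {

    // A job plus the number of attempts already made on it.
    static final class Job {
        final String productId;
        final int attempts;

        Job(String productId, int attempts) {
            this.productId = productId;
            this.attempts = attempts;
        }
    }

    private final BlockingQueue<Job> queue;
    private final int maxAttempts;

    BoundedJobQueue(int capacity, int maxAttempts) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxAttempts = maxAttempts;
    }

    // Non-blocking publish: false means the queue is full, which is the back-pressure signal.
    boolean publish(String productId) {
        return queue.offer(new Job(productId, 0));
    }

    // Runs the handler on the next job. Returns true only if a job was handled successfully.
    // A failed job goes back on the queue until it has used maxAttempts, then it is dropped
    // (a real broker setup would typically route it to a dead-letter queue instead).
    boolean processNext(Predicate<String> handler) {
        Job job = queue.poll();
        if (job == null) {
            return false;
        }
        if (handler.test(job.productId)) {
            return true;
        }
        int attempts = job.attempts + 1;
        if (attempts < maxAttempts) {
            queue.offer(new Job(job.productId, attempts));
        }
        return false;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedJobQueue jobs = new BoundedJobQueue(2, 2);
        System.out.println("publish a: " + jobs.publish("a"));
        System.out.println("publish b: " + jobs.publish("b"));
        System.out.println("publish c: " + jobs.publish("c"));
        System.out.println("handled: " + jobs.processNext(id -> id.equals("a")));
    }
}
```

With a real broker the same two behaviours are configuration rather than code, which is most of the argument for putting one in the middle.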

If you’re building anything similar, my advice is simple: design for the day the scrape returns garbage, because that day is coming.