How We Vibe-Coded a Large-Scale Prospect Enrichment Pipeline in a Weekend

• vibe-coding, AI, scraping, pipeline

The Business Problem

My employer generates leads for SMEs — painters, plumbers, roofers, installers. Their CRM holds hundreds of thousands of prospect records, each with a company name, an email address, and not much else.

The gap: no address data. No street, no city, no postcode, no phone number. For a lead generation business that routes jobs by geographic area, this is a real operational problem. Better location data means better campaign segmentation and cleaner cross-referencing against business registries.

The obvious fix — buy enriched data — costs thousands per batch and goes stale fast. The smarter fix: derive it ourselves from the data we already have.

Every prospect has an email. Every business email has a domain. Every domain (usually) has a website. And most SME websites list their address, phone number, and opening hours right on the homepage or contact page.

So we built a scraper.


The Goal

Given a CRM export, produce an enriched CSV with 21 new columns, all derived from the domain alone, in a resumable batch job that can run overnight on a cloud server.


How It Was Built (Vibe-Coded with Claude)

We didn't start with a spec and build to it. We started with a question — "how do we make this scraper succeed more?" — and iterated from there with Claude Code as the primary developer.

The Core Pipeline

[Figure: comparison of Dutch and Belgian postcode formats, showing potential ambiguities]
[Figure: sequential pipeline showing domain extraction → scraper worker pool → postcode]

Four steps wired together in scripts/enrichment/enrich_prospects.js:

  1. Domain extraction — parse each email's domain, skip 30+ free-mail providers (gmail, hotmail, and regional ISPs), deduplicate. Hundreds of thousands of records collapse to a much smaller set of unique business domains.

  2. Scraper worker pool — fetch homepage + contact pages concurrently (configurable workers, default 5). Extract address data via schema.org JSON-LD first (structured markup that sites declare explicitly), then fall back to postcode regex patterns. Respects robots.txt, uses a polite Bot/1.0 User-Agent, enforces 200ms inter-request delay.

  3. Postcode resolver — validate every scraped postcode against in-memory reference tables (NL: 460k entries, BE: 2,500+ entries from Geopostcodes CSVs). Dutch postcodes are 1234 AB, Belgian are 1234 — the 4-digit-only Belgian format is genuinely ambiguous (is it a year? a phone fragment?), so we cross-validate against the reference table before accepting it.

  4. Writers — CSV writer appends the 21 enriched columns to the original rows. SQLite writer (better-sqlite3) persists to data/results.db for multi-run aggregation and the GUI to query.
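Step 1 above can be illustrated with a minimal sketch. The names domainOf, uniqueBusinessDomains and FREE_MAIL_DOMAINS are hypothetical, and the provider list is shortened for illustration; the real list has 30+ entries.

```javascript
// Hypothetical, shortened list: the real pipeline skips 30+ providers.
const FREE_MAIL_DOMAINS = new Set(['gmail.com', 'hotmail.com', 'outlook.com', 'yahoo.com']);

// Return the lower-cased domain of an email address, or null if it is malformed.
function domainOf(email) {
  const value = String(email ?? '').trim().toLowerCase();
  const at = value.lastIndexOf('@');
  if (at <= 0 || at === value.length - 1) return null;
  const domain = value.slice(at + 1);
  return domain.includes('.') ? domain : null;
}

// Collapse a list of emails into the unique business domains worth scraping.
function uniqueBusinessDomains(emails) {
  const domains = new Set();
  for (const email of emails) {
    const domain = domainOf(email);
    if (domain && !FREE_MAIL_DOMAINS.has(domain)) domains.add(domain);
  }
  return [...domains];
}

const sample = ['Info@Acme-Schilders.nl', 'jan@gmail.com', 'sales@acme-schilders.nl', 'not-an-email'];
console.log(uniqueBusinessDomains(sample)); // logs [ 'acme-schilders.nl' ]
```

Using a Set for both the skip list and the result keeps the whole pass linear in the number of records, which matters when the export has hundreds of thousands of rows.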

What We Iterated On

Exponential backoff: websites return 429 or 503 when we hit them too fast. We added fetchWithBackoff, which waits 1s → 2s → 4s → 8s between attempts on rate-limited responses, for up to 4 retries. The drop rate from bot-blocking fell noticeably.
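A rough sketch of such a retry helper, assuming a doubling schedule that starts at one second and the global fetch available in Node.js 18+ (the pipeline itself uses node-fetch). The options object, backoffDelay and the injectable fetchFn are illustrative assumptions, not the project's actual signature.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s with the defaults.
function backoffDelay(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

// Retry only on rate-limit style responses (429/503); return anything else as-is.
// `fetchFn` is injectable so the logic can be exercised without network access.
async function fetchWithBackoff(url, { retries = 4, baseMs = 1000, fetchFn = fetch } = {}) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchFn(url);
    const rateLimited = res.status === 429 || res.status === 503;
    // Give up after `retries` retries and hand the last response to the caller.
    if (!rateLimited || attempt >= retries) return res;
    await sleep(backoffDelay(attempt, baseMs));
  }
}
```

A production version might also honour a Retry-After header when the server sends one, and add some random jitter so that concurrent workers do not retry in lockstep.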

Fuzzy postcode matching — typos in website data like 4811AB when the real postcode is 4811AA. We generate all 1-character substitution variants of the scraped postcode and check each against the reference map. If a variant exists, we return the canonical version, not the typo.
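The variant generation could look roughly like this for Dutch postcodes. It is a sketch under assumptions: the reference table is modelled as a Set of normalised codes, and the function names are hypothetical.

```javascript
const DIGITS = '0123456789';
const LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

// "4811 ab" -> "4811AB"
function normalisePostcode(raw) {
  return String(raw).replace(/\s+/g, '').toUpperCase();
}

// Yield every code that differs from `postcode` in exactly one position,
// keeping digits in the first four slots and letters in the last two.
function* substitutionVariants(postcode) {
  for (let i = 0; i < postcode.length; i++) {
    const alphabet = i < 4 ? DIGITS : LETTERS;
    for (const ch of alphabet) {
      if (ch !== postcode[i]) yield postcode.slice(0, i) + ch + postcode.slice(i + 1);
    }
  }
}

// Return the canonical postcode, or null when nothing in the reference matches.
function resolveDutchPostcode(raw, reference) {
  const code = normalisePostcode(raw);
  if (!/^\d{4}[A-Z]{2}$/.test(code)) return null;
  if (reference.has(code)) return { postcode: code, fuzzy: false };
  for (const variant of substitutionVariants(code)) {
    if (reference.has(variant)) return { postcode: variant, fuzzy: true };
  }
  return null;
}

const reference = new Set(['4811AA', '1012JS']); // stand-in for the 460k-entry table
console.log(resolveDutchPostcode('4811 ab', reference)); // { postcode: '4811AA', fuzzy: true }
```

Each 6-character code has 86 single-substitution variants (4 × 9 digit swaps plus 2 × 25 letter swaps), so the lookup stays cheap. This sketch returns the first variant it finds; against a dense reference table several variants may be valid, so a stricter version could reject ambiguous cases or lower the confidence for fuzzy hits.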

Employee count + founding year — schema.org numberOfEmployees and foundingDate fields are present on a surprising number of SME sites (often auto-generated by their CMS). We extract these and surface them as enriched columns.
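As an illustration, pulling these two fields out of an already-parsed JSON-LD node might look like the sketch below. In schema.org, numberOfEmployees is a QuantitativeValue, but sites also emit bare numbers or strings, so the sketch accepts both. The function names and the output shape are assumptions.

```javascript
// Parse a leading integer, returning null when there is none.
function toInt(value) {
  const n = Number.parseInt(value, 10);
  return Number.isNaN(n) ? null : n;
}

// `node` is one parsed JSON-LD object (e.g. an Organization or LocalBusiness).
function extractCompanyFacts(node) {
  const raw = node.numberOfEmployees;
  // QuantitativeValue carries `value`, or a `minValue`/`maxValue` range.
  const employeeCount = raw !== null && typeof raw === 'object'
    ? toInt(raw.value ?? raw.minValue)
    : toInt(raw);
  // foundingDate is an ISO 8601 date; only the year is surfaced here.
  const year = /^\d{4}/.exec(String(node.foundingDate ?? ''));
  return { employeeCount, foundingYear: year ? Number(year[0]) : null };
}

console.log(extractCompanyFacts({
  '@type': 'LocalBusiness',
  numberOfEmployees: { '@type': 'QuantitativeValue', value: 12 },
  foundingDate: '2004-05-01',
})); // { employeeCount: 12, foundingYear: 2004 }
```

For a range such as minValue 10 / maxValue 20, this sketch reports the lower bound, which is one of several reasonable choices.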

Cross-source validation — we also import structured data from national business registries. When both the scraper and a registry agree on the city name for a domain, confidence upgrades from medium to high.
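The upgrade rule can be sketched as a small pure function. The record shape ({ city, confidence }) and the names cityKey and upgradeConfidence are assumptions for illustration; the normalisation step is there because registries and websites rarely spell city names identically.

```javascript
// Reduce a city name to a comparison key: strip accents, case and punctuation.
function cityKey(name) {
  return String(name ?? '')
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .toLowerCase()
    .replace(/[^a-z0-9]/g, '');
}

// Upgrade medium -> high only when scraper and registry agree on the city.
function upgradeConfidence(scraped, registry) {
  const a = cityKey(scraped.city);
  const b = cityKey(registry?.city);
  const agree = a !== '' && a === b;
  return scraped.confidence === 'medium' && agree
    ? { ...scraped, confidence: 'high' }
    : scraped;
}

const scraped = { city: "'s-Hertogenbosch", confidence: 'medium' };
console.log(upgradeConfidence(scraped, { city: 'S HERTOGENBOSCH' }).confidence); // high
```

Returning a new object instead of mutating the input keeps the scraped record intact, which helps when comparing results across runs.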

Multi-source adapters — beyond scraping, we built importers for Dutch and Belgian business registries, the French SIRENE API, UK Companies House, and Google Places.

DNS pre-filter — bulk DNS resolution before the main run pre-marks dead domains in the checkpoint, saving hours of scraper time on domains that don't resolve.
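A pre-filter along these lines could be sketched with Node's built-in resolver. The function name, the concurrency default and the injectable lookup are assumptions; writing the result into the checkpoint is left to the caller.

```javascript
// Error codes that mean "no such domain" or "no A record", as opposed to a
// transient failure such as a timeout.
const DEAD_CODES = new Set(['ENOTFOUND', 'ENODATA']);

// Resolve domains with a small worker pool and return the set of dead ones.
// `lookup` is injectable for testing; by default it uses dns.resolve4.
async function findDeadDomains(domains, { concurrency = 20, lookup } = {}) {
  if (!lookup) {
    // Dynamic import keeps the sketch usable from ES modules and CommonJS alike.
    const dns = await import('node:dns/promises');
    lookup = (domain) => dns.resolve4(domain);
  }
  const queue = [...new Set(domains)];
  const dead = new Set();

  async function worker() {
    while (queue.length > 0) {
      const domain = queue.shift();
      try {
        await lookup(domain);
      } catch (err) {
        // Timeouts and server failures are not proof of death: leave those
        // domains for the scraper to try.
        if (DEAD_CODES.has(err.code)) dead.add(domain);
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return dead;
}
```

One limitation of this sketch: resolve4 only asks for A records, so an IPv6-only site would be marked dead. A more careful version could fall back to resolve6 before giving up on a domain.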

The Electron GUI

[Figure: diagram of the GUI's architecture showing the Live Run, Results, History, and Map tabs]

The scraper runs headless on a remote server, but we wanted to watch it in real time from the laptop. We built a small Electron app that connects via SSE (Server-Sent Events) to the remote pipeline and shows the run as it happens.

The GUI connects to whichever server you point it at — localhost for dev, a remote server for production runs.
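For readers unfamiliar with SSE, the server side is little more than a long-lived HTTP response with a specific content type and a simple text framing. The sketch below shows that framing; the event name, the payload fields and the helper names are hypothetical, and res stands for a Node http.ServerResponse.

```javascript
// Serialise one Server-Sent Event: an optional event name, a data line,
// and a blank line that terminates the message.
function formatSseEvent(event, payload) {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Turn an HTTP response into an SSE stream and return a `send` function.
function openSseStream(res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  return (event, payload) => res.write(formatSseEvent(event, payload));
}

// Hypothetical progress event as the pipeline might emit it.
console.log(formatSseEvent('progress', { done: 10, total: 100 }));
```

On the Electron side, the renderer can subscribe with the browser's built-in EventSource: create it with the server URL and add a listener for the 'progress' event. EventSource reconnects on its own after a dropped connection, which suits an overnight run watched from a laptop.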


What We Learned

Scraping success rates vary widely. A meaningful share of records will hit bot-blocking (Cloudflare, CAPTCHA walls), JS-heavy single-page apps with no crawlable content, or sites with no contact page at all. Plan for it — checkpoint your progress, make the pipeline resumable, and measure confidence rather than treating all output as equal quality.

Schema.org is underrated. A meaningful percentage of SME websites — especially those built on WordPress with an SEO plugin, or Wix/Squarespace — declare their address in JSON-LD <script> blocks in the <head>. This gives us clean, structured data without regex gymnastics.
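To show how little code this takes, here is a dependency-free sketch. It uses a regular expression to find the script blocks, whereas the pipeline uses cheerio for HTML parsing; the function names and the returned field names are assumptions.

```javascript
// Collect every parsed JSON-LD node from an HTML document.
function extractJsonLd(html) {
  const nodes = [];
  const re = /<script\b[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const [, body] of html.matchAll(re)) {
    try {
      const parsed = JSON.parse(body);
      // A block may hold one object, an array of objects, or an @graph wrapper.
      const items = Array.isArray(parsed) ? parsed : (parsed['@graph'] ?? [parsed]);
      nodes.push(...items);
    } catch {
      // Malformed JSON-LD is common in the wild: skip the block, keep going.
    }
  }
  return nodes;
}

// Return the first PostalAddress-like object that carries a postcode.
function findPostalAddress(nodes) {
  for (const node of nodes) {
    const address = node?.address;
    if (address && typeof address === 'object' && address.postalCode) {
      return {
        street: address.streetAddress ?? null,
        city: address.addressLocality ?? null,
        postcode: address.postalCode,
        country: address.addressCountry ?? null,
      };
    }
  }
  return null;
}
```

A regex is enough for a sketch because JSON-LD lives in well-delimited script tags, but a real HTML parser copes better with unusual attribute quoting and nested markup.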

Belgian postcodes are a trap. A 4-digit number on a .nl domain site could be a year, a product code, a phone fragment, or an actual Belgian postcode. Always cross-validate against the reference table.
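A minimal sketch of that cross-validation, assuming the reference table is loaded as a Map from postcode to city. The function name and the tiny sample table are hypothetical.

```javascript
// Find 4-digit tokens that are real Belgian postcodes according to `reference`.
// The lookahead skips Dutch-style codes such as "4811 AB".
function findBelgianPostcodes(text, reference) {
  const hits = new Map();
  for (const [candidate] of text.matchAll(/\b\d{4}\b(?!\s?[A-Z]{2}\b)/g)) {
    const city = reference.get(candidate);
    if (city !== undefined) hits.set(candidate, city);
  }
  return [...hits].map(([postcode, city]) => ({ postcode, city }));
}

const beReference = new Map([['9000', 'Gent'], ['1000', 'Brussel']]); // stand-in table
const text = 'Kerkstraat 12, 9000 Gent. Sinds 1987. Tel 0475 12 34 56. Filiaal: 4811 AB Breda';
console.log(findBelgianPostcodes(text, beReference)); // [ { postcode: '9000', city: 'Gent' } ]
```

A table lookup alone does not remove every false positive, because some valid postcodes also look like years (2000 is Antwerpen). A stricter variant could require the reference city name to appear near the number before accepting it.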

better-sqlite3 is excellent for this workload. Synchronous API, WAL mode, fast upserts. No async complexity, no connection pooling. For a single-writer pipeline it's perfect.

Windows packaging with native modules is annoying. better-sqlite3 is a native Node.js addon compiled against a specific Node ABI. Electron ships its own Node.js with a different ABI than system Node. Every time you package the app on Windows you need to run @electron/rebuild against the root node_modules first, or the packaged app throws ABI mismatch errors at runtime. We added this to the build script.

Vibe-coding works for pipeline tools. The pipeline went from "basic scraper" to "multi-source, confidence-scored, resumable, GUI-monitored, geocoded" through a series of natural conversations — describe the problem, get a working implementation, observe what breaks, iterate. No upfront architecture document, no story pointing. The tradeoff: you need to understand the code well enough to sanity-check what's produced, or you'll ship subtle bugs (e.g. the Belgian postcode false-positive problem wasn't in the initial version).


Stack

Layer        Technology
Runtime      Node.js 20, ES modules
Scraping     node-fetch, cheerio, custom schema.org extractor
Phone        libphonenumber-js (E.164 normalisation + mobile/landline/voip classification)
Storage      better-sqlite3 (WAL mode)
Geocoding    OpenStreetMap Nominatim (strict 1 req/sec)
GUI          Electron 29, Leaflet.js, MarkerCluster
Tile layer   OpenStreetMap
Packaging    electron-builder + @electron/rebuild
Hosting      Cloud server
AI pair      Claude Code (Sonnet 4.6)

Built in-house, March 2026.