Vibe-Coded Prospect Enrichment Pipeline

The Business Problem

My employer generates leads for SMEs — painters, plumbers, roofers, installers. Their CRM holds hundreds of thousands of prospect records, each with a company name, an email address, and not much else.

The gap: no address data. No street, no city, no postcode, no phone number. For a lead generation business that routes jobs by geographic area, this is a real operational problem. Better location data means better campaign segmentation and cleaner cross-referencing against business registries.

The obvious fix — buy enriched data — costs thousands per batch and goes stale fast. The smarter fix: derive it ourselves from the data we already have.

Every prospect has an email. Every business email has a domain. Every domain (usually) has a website. And most SME websites list their address, phone number, and opening hours right on the homepage or contact page.

So we built a scraper.

The Goal

Given a CRM export, produce an enriched CSV with 21 new columns:

enriched_domain, enriched_website
enriched_company_name
enriched_street, enriched_city, enriched_postcode, enriched_country
enriched_phone, enriched_phone_type, enriched_whatsapp
enriched_email
enriched_social_linkedin, enriched_social_facebook
enriched_vat_number, enriched_kvk_number
enriched_business_type
enriched_opening_hours, enriched_website_language
enriched_employee_count, enriched_founding_year
enriched_confidence (high / medium / low / failed)
enriched_source

All of this derived from the domain alone, in a resumable batch job that could run overnight on a cloud server.

How It Was Built (Vibe-Coded with Claude)

We didn't start with a spec and build to it. We started with a question — "how do we make this scraper succeed more?" — and iterated from there with Claude Code as the primary developer.

The Core Pipeline

Four steps wired together in scripts/enrichment/enrich_prospects.js:

Domain extraction — parse each email's domain, skip 30+ free-mail providers (gmail, hotmail, and regional ISPs), deduplicate. Hundreds of thousands of records collapse to a much smaller set of unique business domains.
Scraper worker pool — fetch homepage + contact pages concurrently (configurable workers, default 5). Extract address data via schema.org JSON-LD first (structured markup that sites declare explicitly), then fall back to postcode regex patterns. Respects robots.txt, uses a polite Bot/1.0 User-Agent, enforces 200ms inter-request delay.
Postcode resolver — validate every scraped postcode against in-memory reference tables (NL: 460k entries, BE: 2,500+ entries from Geopostcodes CSVs). Dutch postcodes are 1234 AB, Belgian are 1234 — the 4-digit-only Belgian format is genuinely ambiguous (is it a year? a phone fragment?), so we cross-validate against the reference table before accepting it.
Writers — CSV writer appends the 21 enriched columns to the original rows. SQLite writer (better-sqlite3) persists to data/results.db for multi-run aggregation and the GUI to query.

What We Iterated On

Exponential backoff — websites return 429/503 when we hit them too fast. We added fetchWithBackoff: 1s → 2s → 4s → 8s → 32s delays on rate-limited responses, up to 4 retries. Drop rate from bot-blocking fell noticeably.

Fuzzy postcode matching — typos in website data like 4811AB when the real postcode is 4811AA. We generate all 1-character substitution variants of the scraped postcode and check each against the reference map. If a variant exists, we return the canonical version, not the typo.

Employee count + founding year — schema.org numberOfEmployees and foundingDate fields are present on a surprising number of SME sites (often auto-generated by their CMS). We extract these and surface them as enriched columns.

Cross-source validation — we also import structured data from national business registries. When both the scraper and a registry agree on the city name for a domain, confidence upgrades from medium to high.

Multi-source adapters — beyond scraping, we built importers for Dutch and Belgian business registries, the French SIRENE API, UK Companies House, and Google Places.

DNS pre-filter — bulk DNS resolution before the main run pre-marks dead domains in the checkpoint, saving hours of scraper time on domains that don't resolve.

The Electron GUI

The scraper runs headless on a remote server, but we wanted to watch it in real time from the laptop. We built a small Electron app that connects via SSE (Server-Sent Events) to the remote pipeline and shows:

Live Run tab — progress bar, speed (rolling 2-min average), ETA, confidence breakdown, live domain feed
Results tab — queryable view of the enrichment database, filterable by confidence
History tab — per-run summary table (records processed, confidence breakdown, runtime)
Map tab — Leaflet.js + OpenStreetMap, clustered CircleMarkers colored by confidence (green=high, amber=medium, grey=low, red=failed), filterable by country and business type. Powered by Nominatim geocoding.

The GUI connects to whichever server you point it at — localhost for dev, a remote server for production runs.

What We Learned

Scraping success rates vary widely. A meaningful share of records will hit bot-blocking (Cloudflare, CAPTCHA walls), JS-heavy single-page apps with no crawlable content, or sites with no contact page at all. Plan for it — checkpoint your progress, make the pipeline resumable, and measure confidence rather than treating all output as equal quality.

Schema.org is underrated. A meaningful percentage of SME websites — especially those built on WordPress with an SEO plugin, or Wix/Squarespace — declare their address in JSON-LD <script> blocks in the <head>. This gives us clean, structured data without regex gymnastics.

Belgian postcodes are a trap. A 4-digit number on a .nl domain site could be a year, a product code, a phone fragment, or an actual Belgian postcode. Always cross-validate against the reference table.

better-sqlite3 is excellent for this workload. Synchronous API, WAL mode, fast upserts. No async complexity, no connection pooling. For a single-writer pipeline it's perfect.

Windows packaging with native modules is annoying. better-sqlite3 is a native Node.js addon compiled against a specific Node ABI. Electron ships its own Node.js with a different ABI than system Node. Every time you package the app on Windows you need to run @electron/rebuild against the root node_modules first, or the packaged app throws ABI mismatch errors at runtime. We added this to the build script.

Vibe-coding works for pipeline tools. The pipeline went from "basic scraper" to "multi-source, confidence-scored, resumable, GUI-monitored, geocoded" through a series of natural conversations — describe the problem, get a working implementation, observe what breaks, iterate. No upfront architecture document, no story pointing. The tradeoff: you need to understand the code well enough to sanity-check what's produced, or you'll ship subtle bugs (e.g. the Belgian postcode false-positive problem wasn't in the initial version).

Stack

Layer	Technology
Runtime	Node.js 20, ES modules
Scraping	`node-fetch`, `cheerio`, custom schema.org extractor
Phone	`libphonenumber-js` (E.164 normalisation + mobile/landline/voip classification)
Storage	`better-sqlite3` (WAL mode)
Geocoding	OpenStreetMap Nominatim (strict 1 req/sec)
GUI	Electron 29, Leaflet.js, MarkerCluster
Tile layer	OpenStreetMap
Packaging	`electron-builder` + `@electron/rebuild`
Hosting	Cloud server
AI pair	Claude Code (Sonnet 4.6)

Built in-house, March 2026.