Vibe-Coded Prospect Enrichment Pipeline
Flagship 2026
Internal Tooling

Vibe-Coded Prospect Enrichment Pipeline

We needed address data for hundreds of thousands of SME prospects. Instead of buying enriched data, we built a scraper — and vibe-coded the whole thing with Claude Code in a weekend.

100k+ Records
6+ Sources
Weekend Build time
Vibe-CodingClaude CodeWeb ScrapingNode.jsElectronSQLiteData PipelineAI

The Business Problem

My employer generates leads for SMEs — painters, plumbers, roofers, installers. Their CRM holds hundreds of thousands of prospect records, each with a company name, an email address, and not much else.

The gap: no address data. No street, no city, no postcode, no phone number. For a lead generation business that routes jobs by geographic area, this is a real operational problem. Better location data means better campaign segmentation and cleaner cross-referencing against business registries.

The obvious fix — buy enriched data — costs thousands per batch and goes stale fast. The smarter fix: derive it ourselves from the data we already have.

Every prospect has an email. Every business email has a domain. Every domain (usually) has a website. And most SME websites list their address, phone number, and opening hours right on the homepage or contact page.

So we built a scraper.


The Goal

Given a CRM export, produce an enriched CSV with 21 new columns:

All of this derived from the domain alone, in a resumable batch job that could run overnight on a cloud server.


How It Was Built (Vibe-Coded with Claude)

We didn't start with a spec and build to it. We started with a question — "how do we make this scraper succeed more?" — and iterated from there with Claude Code as the primary developer.

The Core Pipeline

Four steps wired together in scripts/enrichment/enrich_prospects.js:

  1. Domain extraction — parse each email's domain, skip 30+ free-mail providers (gmail, hotmail, and regional ISPs), deduplicate. Hundreds of thousands of records collapse to a much smaller set of unique business domains.

  2. Scraper worker pool — fetch homepage + contact pages concurrently (configurable workers, default 5). Extract address data via schema.org JSON-LD first (structured markup that sites declare explicitly), then fall back to postcode regex patterns. Respects robots.txt, uses a polite Bot/1.0 User-Agent, enforces 200ms inter-request delay.

  3. Postcode resolver — validate every scraped postcode against in-memory reference tables (NL: 460k entries, BE: 2,500+ entries from Geopostcodes CSVs). Dutch postcodes are 1234 AB, Belgian are 1234 — the 4-digit-only Belgian format is genuinely ambiguous (is it a year? a phone fragment?), so we cross-validate against the reference table before accepting it.

  4. Writers — CSV writer appends the 21 enriched columns to the original rows. SQLite writer (better-sqlite3) persists to data/results.db for multi-run aggregation and the GUI to query.

What We Iterated On

Exponential backoff — websites return 429/503 when we hit them too fast. We added fetchWithBackoff: 1s → 2s → 4s → 8s → 32s delays on rate-limited responses, up to 4 retries. Drop rate from bot-blocking fell noticeably.

Fuzzy postcode matching — typos in website data like 4811AB when the real postcode is 4811AA. We generate all 1-character substitution variants of the scraped postcode and check each against the reference map. If a variant exists, we return the canonical version, not the typo.

Employee count + founding year — schema.org numberOfEmployees and foundingDate fields are present on a surprising number of SME sites (often auto-generated by their CMS). We extract these and surface them as enriched columns.

Cross-source validation — we also import structured data from national business registries. When both the scraper and a registry agree on the city name for a domain, confidence upgrades from medium to high.

Multi-source adapters — beyond scraping, we built importers for Dutch and Belgian business registries, the French SIRENE API, UK Companies House, and Google Places.

DNS pre-filter — bulk DNS resolution before the main run pre-marks dead domains in the checkpoint, saving hours of scraper time on domains that don't resolve.

The Electron GUI

The scraper runs headless on a remote server, but we wanted to watch it in real time from the laptop. We built a small Electron app that connects via SSE (Server-Sent Events) to the remote pipeline and shows:

The GUI connects to whichever server you point it at — localhost for dev, a remote server for production runs.


What We Learned

Scraping success rates vary widely. A meaningful share of records will hit bot-blocking (Cloudflare, CAPTCHA walls), JS-heavy single-page apps with no crawlable content, or sites with no contact page at all. Plan for it — checkpoint your progress, make the pipeline resumable, and measure confidence rather than treating all output as equal quality.

Schema.org is underrated. A meaningful percentage of SME websites — especially those built on WordPress with an SEO plugin, or Wix/Squarespace — declare their address in JSON-LD <script> blocks in the <head>. This gives us clean, structured data without regex gymnastics.

Belgian postcodes are a trap. A 4-digit number on a .nl domain site could be a year, a product code, a phone fragment, or an actual Belgian postcode. Always cross-validate against the reference table.

better-sqlite3 is excellent for this workload. Synchronous API, WAL mode, fast upserts. No async complexity, no connection pooling. For a single-writer pipeline it's perfect.

Windows packaging with native modules is annoying. better-sqlite3 is a native Node.js addon compiled against a specific Node ABI. Electron ships its own Node.js with a different ABI than system Node. Every time you package the app on Windows you need to run @electron/rebuild against the root node_modules first, or the packaged app throws ABI mismatch errors at runtime. We added this to the build script.

Vibe-coding works for pipeline tools. The pipeline went from "basic scraper" to "multi-source, confidence-scored, resumable, GUI-monitored, geocoded" through a series of natural conversations — describe the problem, get a working implementation, observe what breaks, iterate. No upfront architecture document, no story pointing. The tradeoff: you need to understand the code well enough to sanity-check what's produced, or you'll ship subtle bugs (e.g. the Belgian postcode false-positive problem wasn't in the initial version).


Stack

Layer Technology
Runtime Node.js 20, ES modules
Scraping node-fetch, cheerio, custom schema.org extractor
Phone libphonenumber-js (E.164 normalisation + mobile/landline/voip classification)
Storage better-sqlite3 (WAL mode)
Geocoding OpenStreetMap Nominatim (strict 1 req/sec)
GUI Electron 29, Leaflet.js, MarkerCluster
Tile layer OpenStreetMap
Packaging electron-builder + @electron/rebuild
Hosting Cloud server
AI pair Claude Code (Sonnet 4.6)

Built in-house, March 2026.