ProFix data-acquisition pipeline — how we add contractors

Public documentation of how ProFix Directory discovers, dedups, enriches, verifies, queues, and publishes Ohio home-services contractors. Six pipeline steps, every source named, refresh cadence and cost per source, manual review gate, and the explicit list of what ProFix does not scrape. For journalists, replicators, partner integrations, and state agency-relations contacts.

Public documentation · 6 pipeline steps · Manual review gate · CC BY 4.0
TL;DR
  • ProFix runs a six-step pipeline — discovery, dedup, enrichment, verification, queue, publish — and gates every contractor through manual review before publication.
  • Discovery is registry-first: OCILB, ODA, ODH, State Fire Marshal, and the Ohio Secretary of State business search are the canonical identity sources. Google Places is used for conditional photo/rating/hours enrichment only after a state-registry match.
  • The full source registry lives at /sources and /api/sources.json; the per-profile license evidence lives at /api/license-evidence.json. Methodology is at /methodology and /verification.

Sources

Every data source ProFix pulls from is listed below with refresh cadence, cost per refresh, and license terms. The machine-readable companion is at /api/sources.json and the homeowner-facing index is at /sources. The full editorial argument for why ProFix swapped paid Google Places scrapes for state-agency direct pulls lives at /research/free-public-records-moat-2026.

Source | Refresh cadence | Cost per refresh | License terms | Fields used
Ohio eLicense Center (OCILB) | Daily for status delta, weekly for new entries | $0 (public records under ORC 149.43) | Ohio public records; redistribution permitted under state public-records law | License number, status, expiration, disciplinary history, trade category, DBA
Ohio Department of Agriculture: Pesticide Applicators | Quarterly bulk pull, rolling additions | $0 (public roster; CSV exports via public-records desk) | Ohio public records | Applicator name, license number, category, expiration, disciplinary record
Ohio Department of Health: Lead Hazard Abatement | Quarterly bulk pull | $0 (public roster) | Ohio public records | Contractor/inspector name, license number, certification class, expiration
Ohio Department of Health: Private Water Systems | Quarterly bulk pull | $0 (public roster) | Ohio public records | Contractor name, license number, drilling/pumping endorsement, expiration
Ohio State Fire Marshal: Fire Protection Contractors | Annual renewal cycle, rolling additions | $0 (public roster) | Ohio public records | Contractor name, license number, certification class, NICET cross-walk, expiration
Ohio Secretary of State business search | Daily for filings + status delta | $0 (public records) | Ohio public records | Entity name, entity number, registered agent, filing date, status
Google Places API | Quarterly enrichment after registry confirmation | ~$0.07–$0.10 per record at scale (Text Search + Place Details) | Google Maps Platform Terms of Service; display + attribution rules apply | Photos (reference URLs), aggregate rating, review count, hours of operation
Ohio county building-department permit feeds | Daily for live counties (Lucas, Hancock); other counties variable | $0 (public records) | Per-county public-records posture | Permit number, issued date, address, contractor of record, project class
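
As a rough worked example of the Google Places cost line in the table above, the sketch below multiplies a record count by the quoted per-record range. The helper name and the 5,000-record batch size are illustrative assumptions, not ProFix figures.

```typescript
// Per-record cost range quoted in the Sources table (USD).
const PLACES_COST_PER_RECORD = { low: 0.07, high: 0.10 };

// Hypothetical helper: estimated spend for one enrichment pass.
function estimatePlacesCost(recordCount: number): { low: number; high: number } {
  // Round to whole cents so floating-point noise does not leak into the result.
  const toCents = (x: number): number => Math.round(x * 100) / 100;
  return {
    low: toCents(recordCount * PLACES_COST_PER_RECORD.low),
    high: toCents(recordCount * PLACES_COST_PER_RECORD.high),
  };
}

// 5,000 records is an assumed batch size, used only for illustration.
const estimate = estimatePlacesCost(5000);
console.log(`Estimated pass cost: $${estimate.low} to $${estimate.high}`);
```

Under those assumptions one quarterly pass lands between $350 and $500, which is the spend the registry-first ordering is meant to keep bounded.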

Pipeline steps

The six steps below are the production pipeline. Each step feeds the next; no step is optional. Every step is documented internally with the same prose published here.

  1. Discovery

    Every contractor record enters the pipeline through a state registry — OCILB, ODA, ODH, SFM, or the Ohio Secretary of State business search. The licensed-trade record is the canonical identity. Google Places is used for discovery only when supplementing breadth in non-licensed categories (roofing, concrete, tree service, appliance repair, restoration) where the substitute-verification stack documented at /research/ohio-licensing-moat-2026 applies. State-agency direct pulls are documented in /research/free-public-records-moat-2026.

  2. Dedup

    Each candidate record runs through three dedup keys — normalised business name, license number (when present), and slug. Slug normalisation collapses whitespace, lowercases, strips suffixes (LLC, Inc., Co.), and reconciles common alternates (Heating & Cooling vs Heating and Cooling). Records that match an existing slug are merged with conflict resolution favouring the state-registry source over the commercial source. The /research/directory-data-quality-2026 audit documents the failure modes — ghost businesses, lapsed registrations, duplicate slugs — the dedup pass catches.

  3. Enrichment

    Once a contractor has been confirmed in a state registry, Google Places is queried to attach photos, aggregated rating, review count, and hours of operation. Records that do not surface in a state registry are not enriched — Places does not get to invent contractors the state has not licensed. Enrichment runs quarterly rather than on every refresh to control API spend; the /research/free-public-records-moat-2026 piece documents the cost-and-trust-signal trade-off.

  4. Verification

    Every licensed-trade record is cross-checked against the originating state registry — the OCILB eLicense Center for plumbing/HVAC/electrical/hydronics, ODA for pesticide applicators, ODH for lead abatement / well drilling / septic, the State Fire Marshal for fire protection, and the Secretary of State for the business entity. The verification step writes the live status (active, expired, suspended, revoked) into /api/license-evidence.json. The methodology is documented at /methodology and /verification; homeowners can run the same lookup via /verify.

  5. Queue

    Verified records land in data/queue/*.json as draft records pending human review. This is a deliberate gate, not a vestige: every contractor gets a manual check by a human reviewer before publication. The queue carries the state-registry evidence, the Places enrichment payload (if any), the dedup-decision log, and the verification status. The reviewer signs off in the queue file with a timestamp before the record is promoted to production.

  6. Publish

    Once the manual review gate signs off, the record promotes from data/queue into the live dataset and onto profixdirectory.com. The publication step writes JSON-LD onto the /pro/{slug} page, mirrors the record into /api/pros.json and /api/all.json, attaches the state-registry evidence to /api/license-evidence.json, and emits a verification-delta entry in /api/verification-feed.json. Sitemap entries refresh on the next build. The contractor's slug becomes part of the public catalog at /pros and the trade hub at /trades/{trade}.
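
The dedup step (step 2) is the most algorithmic part of the pipeline, so a minimal sketch may help. It assumes the four normalisation rules named in that step and a field-by-field merge where registry values win. The function names, the `Candidate` shape, and the sample business names are hypothetical, and the real suffix and alternate lists are likely longer.

```typescript
// Legal suffixes named in step 2; a production list is probably longer.
const LEGAL_SUFFIXES = ["llc", "inc", "co"];

// Hypothetical helper: derive a dedup slug from a business name.
function normaliseSlug(name: string): string {
  const cleaned = name
    .toLowerCase()
    .replace(/&/g, " and ")        // reconcile "Heating & Cooling" with "Heating and Cooling"
    .replace(/[^a-z0-9\s]/g, " "); // drop punctuation such as "," and "."
  // Splitting on runs of whitespace collapses repeated spaces.
  const words = cleaned.split(/\s+/).filter((w) => w.length > 0);
  // Strip trailing legal suffixes, but never strip the whole name.
  while (words.length > 1) {
    const last = words[words.length - 1];
    if (last === undefined || LEGAL_SUFFIXES.indexOf(last) < 0) break;
    words.pop();
  }
  return words.join("-");
}

type SourceKind = "registry" | "commercial";

// Assumed shape of a candidate record at the dedup stage.
interface Candidate {
  slug: string;
  source: SourceKind;
  fields: Record<string, string>;
}

// On a slug match, the state-registry record is preferred over the commercial one.
// If neither record is from a registry, the second argument is preferred.
function mergeCandidates(a: Candidate, b: Candidate): Candidate {
  const preferred = a.source === "registry" ? a : b;
  const other = preferred === a ? b : a;
  return {
    slug: preferred.slug,
    source: preferred.source,
    // The later spread wins, so preferred fields override conflicting ones.
    fields: { ...other.fields, ...preferred.fields },
  };
}

// Made-up names: both variants normalise to the same slug.
const slugA = normaliseSlug("Acme Heating & Cooling, LLC");
const slugB = normaliseSlug("ACME  Heating and Cooling Inc.");
console.log(slugA, slugA === slugB);
```

Fields that only the commercial source carries (for example hours) survive the merge; only conflicting fields are overwritten by the registry value.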

Manual review gate

Every queued contractor is reviewed by a human before publication. The reviewer confirms three things: that the state-registry evidence matches the business entity claimed; that the Places enrichment (if any) attaches to the correct entity rather than a same-named neighbour; and that the trade classification matches the contractor's actual scope of work, not just the registry's category code. Sign-off writes a timestamped reviewer-id into the queue file before promotion. Records that fail manual review either get re-queued for additional registry evidence or get rejected with a reason code recorded in the queue.
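
The queue record and the sign-off described above could be modelled roughly as follows. This is a sketch under stated assumptions: the field names, the `ReviewChecks` shape, and both function names are hypothetical, since the actual data/queue/*.json schema is not published on this page.

```typescript
type LicenseStatus = "active" | "expired" | "suspended" | "revoked";

// Assumed shape of one queue record; field names are illustrative only.
interface QueueRecord {
  slug: string;
  registryEvidence: { source: string; licenseNumber: string };
  placesEnrichment?: { rating: number; reviewCount: number };
  dedupLog: string[];
  verificationStatus: LicenseStatus;
  review?: { reviewerId: string; signedOffAt: string };
  rejection?: { reasonCode: string };
}

// The three confirmations the reviewer makes, expressed as booleans.
interface ReviewChecks {
  registryMatchesEntity: boolean;
  placesAttachedToCorrectEntity: boolean; // true when there is no Places payload
  tradeMatchesScope: boolean;
}

// Stamp a reviewer id and timestamp only when all three checks pass.
function signOff(
  record: QueueRecord,
  checks: ReviewChecks,
  reviewerId: string,
  now: Date,
): QueueRecord {
  const passed =
    checks.registryMatchesEntity &&
    checks.placesAttachedToCorrectEntity &&
    checks.tradeMatchesScope;
  if (!passed) {
    throw new Error(`review failed for ${record.slug}: re-queue or reject instead`);
  }
  return { ...record, review: { reviewerId, signedOffAt: now.toISOString() } };
}

// A record may promote only with a sign-off and without a rejection.
function canPromote(record: QueueRecord): boolean {
  return record.review !== undefined && record.rejection === undefined;
}

// Made-up draft record for illustration.
const draft: QueueRecord = {
  slug: "example-plumbing",
  registryEvidence: { source: "OCILB", licenseNumber: "00000" },
  dedupLog: ["no existing slug match"],
  verificationStatus: "active",
};
const signed = signOff(
  draft,
  { registryMatchesEntity: true, placesAttachedToCorrectEntity: true, tradeMatchesScope: true },
  "reviewer-1",
  new Date("2026-01-15T12:00:00Z"),
);
console.log(canPromote(draft), canPromote(signed));
```

Keeping the sign-off inside the record itself means the publish step can refuse any file that lacks it, without consulting a separate approval store.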

The manual gate is a deliberate cost. Fully automated publication would be faster and cheaper, but the directory-data-quality audit at /research/directory-data-quality-2026 documents the failure modes (ghost businesses, dead phones, lapsed licences, duplicate slugs) that the gate catches before they reach homeowners. The ProFix Editorial Team treats the manual gate as a load-bearing trust signal, not a temporary measure.

Refresh cadence

ProFix runs four refresh tiers. The lists below state which surfaces refresh daily, weekly, monthly, or quarterly. Where aspirational targets differ from current production, they are documented at /methodology.

Daily

OCILB status deltas (active/expired/suspended/revoked). Ohio Secretary of State filings + status changes. Live county permit feeds (Lucas, Hancock). Outage-status document at /api/outage-status.

Weekly

New OCILB entries (new license issuances). New OCILB disciplinary actions. New permit pulls aggregated into the per-trade and per-county leaderboards at /permits-leaderboard. Trust-score recomputation per profile.

Monthly

Hugging Face dataset republication. ODA / ODH / SFM bulk roster refresh. Verification-feed compaction. /api/quality-stats.json and /api/coverage-stats.json snapshot.

Quarterly

Google Places enrichment refresh (photos, ratings, hours). Cost-guide repricing. Per-source provenance audit against /sources and /api/sources.json.
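
One way to picture the four tiers above is as a small schedule table with a due check. This is only a sketch: the interval lengths in days and the job keys are assumptions, because the page names the tiers but not exact intervals or job identifiers.

```typescript
type Tier = "daily" | "weekly" | "monthly" | "quarterly";

// Assumed interval lengths in days for each tier.
const TIER_DAYS: Record<Tier, number> = { daily: 1, weekly: 7, monthly: 30, quarterly: 90 };

// A few jobs from the lists above; the keys are hypothetical labels.
const JOB_TIER: Record<string, Tier> = {
  "ocilb-status-delta": "daily",
  "ocilb-new-entries": "weekly",
  "verification-feed-compaction": "monthly",
  "places-enrichment": "quarterly",
};

const DAY_MS = 24 * 60 * 60 * 1000;

// A job is due once its tier's interval has elapsed since the last run.
function isDue(job: string, lastRun: Date, now: Date): boolean {
  const tier = JOB_TIER[job];
  if (tier === undefined) {
    throw new Error(`unknown job: ${job}`);
  }
  return now.getTime() - lastRun.getTime() >= TIER_DAYS[tier] * DAY_MS;
}

// Eight days after the last run, the weekly job is due but the quarterly one is not.
const lastRun = new Date("2026-01-01T00:00:00Z");
const now = new Date("2026-01-09T00:00:00Z");
console.log(isDue("ocilb-new-entries", lastRun, now), isDue("places-enrichment", lastRun, now));
```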

What we don't scrape

The list below is the explicit set of surfaces ProFix declines to ingest. Each item is a deliberate editorial decision rather than an engineering omission. Where ProFix links to a third party (Yelp, BBB, Angi profiles) but does not aggregate the underlying content, the link is the integration point — not the data import.

  • Yelp review text. ProFix links to public profile pages but does not republish review prose. Yelp's terms of service and the FTC's 2024 fake-reviews rule make wholesale ingestion of review text a serious legal and editorial risk.
  • BBB complaint narratives — ProFix links to BBB profile pages but does not aggregate the underlying complaint text. The /vs/bbb comparison page documents the editorial reasoning.
  • Contractor complaint records from the Ohio Attorney General — ProFix links to the AG's consumer-protection enforcement pages but does not republish complainant or respondent narratives.
  • Homeowner-submitted lead form payloads — these never enter the public dataset and never leave the routing pipeline. /api/lead-feed.json publishes only aggregate metrics with zero PII.
  • Photos scraped from third-party sites without explicit licence. ProFix uses Google Places photo references (with attribution per the Maps Platform terms) and accepts contractor-submitted photos with explicit consent. We do not scrape Yelp, Angi, Thumbtack, HomeAdvisor, or Nextdoor photography.
  • Phone numbers cross-walked to personal devices. ProFix carries the business contact line as published in the state registry or the contractor's own listing; we do not look up personal mobile numbers via reverse-lookup or data-broker services.
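
The photo rule in the list above lends itself to a small policy check. The sketch below is an assumption about how such a rule might be encoded; the `PhotoSource` tags and the function name are hypothetical and do not describe ProFix's actual code.

```typescript
// Hypothetical provenance tags for a candidate photo.
type PhotoSource =
  | { kind: "google-places"; attribution: string }
  | { kind: "contractor-submitted"; consent: boolean }
  | { kind: "third-party"; site: string };

// Accept only the two photo sources the list above allows.
function isPhotoAllowed(photo: PhotoSource): boolean {
  switch (photo.kind) {
    case "google-places":
      // Places photo references need attribution per the Maps Platform terms.
      return photo.attribution.trim().length > 0;
    case "contractor-submitted":
      // Contractor uploads need explicit consent.
      return photo.consent;
    case "third-party":
      // Photos from other sites are never ingested.
      return false;
  }
}

console.log(isPhotoAllowed({ kind: "third-party", site: "yelp.com" }));
```

Modelling the source as a tagged union means a new photo origin cannot be added without the compiler forcing an explicit allow-or-deny decision in the switch.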

Tools in /tools/

The pipeline above is implemented as a set of TypeScript scripts under tools/ in the ProFix repository. Each tool is the canonical authority for one source or one pipeline step. The registry-pulls workstream and the enrichment workstream run in parallel; the queue and publish steps are coordinated from a single reviewer console.

tools/scrape-toledo.ts

Original Toledo metro discovery scraper. Combines Google Places Text Search with Hancock/Lucas county building-department queries.

tools/scrape-metro.ts

Parameterised metro scraper — Cleveland, Cincinnati, Columbus, Dayton, Findlay, Toledo, and the rural-county batches. Drives statewide Places enrichment after a registry-tier seed.

tools/scrape-ocilb.ts

OCILB eLicense Center direct pull. The canonical identity source for plumbing, HVAC, electrical, and hydronics contractors. Writes license_number + status + expiration into the queue. (Built by the registry-pulls workstream — referenced here, not authored here.)

tools/scrape-oda-pest.ts

Ohio Department of Agriculture pesticide-applicator roster pull. Categories: general pest, termite, lawn, wood-destroying organism. (Built by the registry-pulls workstream — referenced here, not authored here.)

tools/scrape-odh-water-well.ts

ODH Private Water Systems roster pull — well drillers and pump installers. (Built by the registry-pulls workstream — referenced here, not authored here.)

tools/scrape-sfm-fire-protection.ts

Ohio State Fire Marshal Bureau of Fire Prevention roster pull — sprinkler, alarm, suppression, NICET cross-walk. (Built by the registry-pulls workstream — referenced here, not authored here.)

How to contribute a missing source

Spotted a public registry ProFix should pull from but doesn't? An Ohio agency with a roster we missed? A county building department with a downloadable permit feed? Use the contact form at /contact and include the source URL, the refresh cadence you observed, and the public-records license posture. The ProFix Editorial Team reviews source-addition requests on a weekly cycle.

Journalists and state agency-relations contacts: the methodology is documented under CC BY 4.0, so reproducing the pipeline in another state requires no permission — only attribution. The portable playbook lives in /research/free-public-records-moat-2026.

License

This methodology page is published under CC BY 4.0. The underlying state-registry records remain Ohio public records under ORC 149.43; Google Places enrichment data remains subject to the Google Maps Platform Terms of Service. Cite ProFix Directory for the editorial assembly and the upstream agency for the underlying records.

Related ProFix surfaces: /methodology, /verification, /sources, /trust-score, and /data-sources. Machine-readable feeds: /api/sources.json, /api/license-evidence.json, /api/verification-feed.json.
