ProFix Editorial Team

About the ProFix dataset — what it is, how it grows, who uses it

A plain-English orientation for the dataset behind ProFix Directory. What's in it, where it comes from, how it grows, who uses it today, what it can't tell you, and how to cite it.

What's in the dataset

Each row is one public Ohio home-services contractor. The dataset spans every county in the state and covers the 19 ProFix trades — plumbing, HVAC, electrical, gas, appliance repair, roofing, concrete, tree service, restoration, lead abatement, fire protection, water-well, septic, tech repair, pest control, landscaping, painting, foundation repair, and garage doors. Trade mix is heaviest on plumbing, HVAC, electrical, and roofing — the four trades that drive the most homeowner search volume in Ohio.

The columns published in the open snapshot are deliberately conservative: identity fields, location fields, the trade vector, license metadata when an Ohio roster publishes a number, public rating signals, and the verification tier. ProFix surfaces additional editorial context on the live site — permit history, trust scores, evidence pages, reviews — but the published dataset is scoped to what other directories cannot reproduce without our work.

Field (type): notes

  • slug (string): Stable ProFix profile slug. Use /pro/{slug} for the public profile and /pro/{slug}/evidence for the source trail.
  • name (string): Public business name as displayed on the directory.
  • city + county + zip (string): Ohio location fields; county cross-walks to /county/{slug}.
  • trades (TradeSlug[]): One or more of the 19 ProFix trades; see /trades for the canonical list.
  • license_number + license_status (string | null): Public license number when an Ohio roster publishes one (OCILB, ODH, SFM, ODA). Status is recomputed against the registry on every refresh.
  • verification_tier ("license-linked" | "verified-profile" | "directory-listing"): Evidence tier. License-linked means a public license number is attached and traceable; verified-profile means normal public-profile signals were confirmed; directory-listing is lighter.
  • rating + review_count (number | null): Public star rating + review count when available. Not re-emitted as schema.org AggregateRating.
  • verified_at (string (YYYY-MM-DD) | null): Date ProFix last verified or re-enriched the record. Powers the /api/recently-verified.json feed.

The canonical list of every column — including the long-tail fields not surfaced above — lives at /sources and at the dataset card on Hugging Face. The site's full JSON catalogue at /api/all.json mirrors the same schema.
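
For readers loading the snapshot in Python, the fields listed above can be written down as a typed record with a light sanity check. This is a minimal sketch, not an official ProFix schema: the names ProRecord and check_record are hypothetical, the split of the paired fields into separate keys (city, county, zip, and so on) is an assumption, and the canonical column list remains the one at /sources.

```python
from datetime import date
from typing import List, Literal, Optional, TypedDict

VerificationTier = Literal["license-linked", "verified-profile", "directory-listing"]
TIERS = ("license-linked", "verified-profile", "directory-listing")


class ProRecord(TypedDict):
    """Hypothetical typed view of one row; key names mirror the field list."""
    slug: str
    name: str
    city: str
    county: str
    zip: str
    trades: List[str]
    license_number: Optional[str]
    license_status: Optional[str]
    verification_tier: VerificationTier
    rating: Optional[float]
    review_count: Optional[int]
    verified_at: Optional[str]  # YYYY-MM-DD or None


def check_record(row: dict) -> List[str]:
    """Return a list of problems found in a row; an empty list means it looks sane."""
    problems = []
    for field in ("slug", "name", "city", "county", "zip"):
        if not isinstance(row.get(field), str) or not row[field]:
            problems.append(f"{field}: expected a non-empty string")
    if not isinstance(row.get("trades"), list) or not row["trades"]:
        problems.append("trades: expected at least one trade slug")
    tier = row.get("verification_tier")
    if tier not in TIERS:
        problems.append("verification_tier: unknown tier")
    # Assumption drawn from the tier description: license-linked implies a number.
    if tier == "license-linked" and not row.get("license_number"):
        problems.append("license_number: license-linked rows should carry a number")
    verified_at = row.get("verified_at")
    if verified_at is not None:
        try:
            date.fromisoformat(verified_at)
        except (TypeError, ValueError):
            problems.append("verified_at: expected YYYY-MM-DD or null")
    return problems
```

Running check_record over every row before analysis is a cheap way to notice a snapshot whose shape differs from what this sketch assumes.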

How it grows

ProFix Directory grows through a six-step pipeline — discover, dedupe, enrich, verify, queue, publish. Each step is documented at /docs/scrape-pipeline with the named source, the refresh cadence, the cost, and the license terms. The pipeline is engineered so that no record ships without a traceable source URL on its evidence page.

  1. Discover. State and county registries are polled at known cadences — OCILB eLicense Center for plumbing/HVAC/electrical/hydronics, Ohio Department of Health for lead-abatement and water-well and septic, the State Fire Marshal for fire-protection, the Ohio Department of Agriculture for pest control, the Ohio Secretary of State business search for entity-level checks, and the county permit portals (Lucas, Cuyahoga, Franklin, Hamilton first; statewide expansion in progress).
  2. Dedupe. Records are normalised on phone + address + license number when present, with a fuzzy match across business names. Duplicates merge to the longest, most-verified record; the merge trail is preserved on the evidence page.
  3. Enrich. Google Places fills in public fields — current phone, current address, business hours, photos, and ratings — for records that already have an Ohio identity hook. ProFix does not buy commercial-database matches and does not scrape behind login walls.
  4. Verify. Every record is re-checked against the upstream registries on a published cadence. License-status drift, dead phones, ghost businesses, and address changes are caught here; the most-recent verification date ships in the verified_at column.
  5. Queue. New and changed records pass through a manual review gate before publication. The queue is intentionally small enough that one editor can clear it; the gate exists to catch the things automation cannot — vexatious listings, doxxing risk, and edge cases where the registry data conflicts with itself.
  6. Publish. Cleared records flow to the live site, the JSON + CSV feeds, the Hugging Face monthly snapshot, and the per-pro pages. The newsroom changelog at /newsroom documents every meaningful refresh.
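
The dedupe step is the most algorithmic part of the pipeline, so here is a rough illustration of how such a pass could look. It is a sketch under stated assumptions, not the ProFix implementation: the phone and address keys, the 0.85 name-similarity threshold, the tier ranking, and all function names are hypothetical, and "longest" is read here as "most fields filled in".

```python
import re
from difflib import SequenceMatcher

# Assumed ordering of evidence tiers, strongest first.
TIER_RANK = {"license-linked": 2, "verified-profile": 1, "directory-listing": 0}


def normalise_phone(phone):
    """Keep the last ten digits so formatting differences do not matter."""
    digits = re.sub(r"\D", "", phone or "")
    return digits[-10:] if len(digits) >= 10 else ""


def normalise_text(value):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", (value or "").lower()).strip()


def same_business(a, b, name_threshold=0.85):
    """Match on an identical license number, or on phone + address + similar name."""
    lic_a, lic_b = a.get("license_number"), b.get("license_number")
    if lic_a and lic_b and lic_a == lic_b:
        return True
    phone_a = normalise_phone(a.get("phone"))
    same_phone = phone_a != "" and phone_a == normalise_phone(b.get("phone"))
    addr_a = normalise_text(a.get("address"))
    same_address = addr_a != "" and addr_a == normalise_text(b.get("address"))
    name_score = SequenceMatcher(
        None, normalise_text(a.get("name")), normalise_text(b.get("name"))
    ).ratio()
    return same_phone and same_address and name_score >= name_threshold


def pick_survivor(a, b):
    """Prefer the higher tier, then the record with more populated fields."""
    def score(row):
        filled = sum(1 for v in row.values() if v not in (None, "", []))
        return (TIER_RANK.get(row.get("verification_tier"), -1), filled)
    return a if score(a) >= score(b) else b


def dedupe(rows):
    """Pairwise pass; quadratic, which is fine for a sketch but not for a full state."""
    kept = []
    for row in rows:
        for i, existing in enumerate(kept):
            if same_business(existing, row):
                kept[i] = pick_survivor(existing, row)
                break
        else:
            kept.append(row)
    return kept
```

A production pass would more likely block on normalised phone or ZIP before comparing names, and would record which rows were merged so the trail can be shown on the evidence page.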

The pipeline is deliberately public. We document it because we want the methodology audited and because we want other directory builders to be able to copy it — the playbook is at /research/free-public-records-moat-2026.

Who's using it

The dataset is consumed by the people and systems below. The /partners hub catalogues current integrations and the /case-studies index will catalogue published outcomes as they ship.

  • Academic research. Local-economy studies, contractor-supply analyses, license-enforcement evaluations. The county and trade columns are designed to drop into geographic regressions without further wrangling.
  • Journalism. Statewide stories on permit activity, license-board enforcement, and storm-response capacity. The verification-deltas feed and permit leaderboards make data-driven local stories tractable on deadline.
  • Civic tech. Building-department dashboards, county-permitting transparency tools, contractor-verification widgets for municipal websites. The /widgets catalogue ships the JS equivalent.
  • AI engineering. Retrieval-augmented generation for home-services Q&A, agent benchmarks for local-business recommendation, model evals on grounded vs. hallucinated answers. The MCP server at /api/mcp is the live equivalent.
  • Partner integrations. Newsletters, HOA portals, real-estate platforms, smart-home apps that surface verified Ohio contractors without scraping the site.
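
As an illustration of the research use case, the county and trade columns can be turned into a county-by-trade count table in a few lines. This is a minimal sketch: county_trade_counts is a hypothetical helper, and the county and trade values in the usage example are placeholders rather than confirmed slugs.

```python
from collections import Counter


def county_trade_counts(rows):
    """Count contractors per (county, trade) pair.

    A contractor listed under several trades is counted once in each of them,
    and a trade repeated within one row is counted only once.
    """
    counts = Counter()
    for row in rows:
        for trade in set(row.get("trades") or []):
            counts[(row["county"], trade)] += 1
    return counts


sample = [
    {"county": "Lucas", "trades": ["plumbing", "hvac"]},
    {"county": "Lucas", "trades": ["plumbing"]},
    {"county": "Franklin", "trades": ["roofing"]},
]
counts = county_trade_counts(sample)
print(counts[("Lucas", "plumbing")])  # prints 2
```

Because multi-trade contractors appear in more than one cell, the cell totals can exceed the row count; that is worth stating in any analysis built on this table.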

If you're a partner shipping something built on ProFix data, the /partners page is the canonical entry point and the /case-studies index is the place to ask about being featured.

How to cite

The dataset is published under the Creative Commons Attribution 4.0 International license. The full plain-English breakdown (what counts as attribution, how to indicate changes, what no-additional-restrictions means in practice) lives on /open-data, alongside APA, MLA, and BibTeX citation templates.

When you cite, include the snapshot month (e.g. 2026-05), the row count after filtering, and the filters applied. That makes reproducibility tractable for the next person who reads your work.
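
One way to keep those three details together is to compute them in the same step that applies the filters. The sketch below is an assumption-laden illustration, not one of the citation templates on /open-data: describe_subset is a hypothetical helper, it only supports exact-match filters, and the wording of the note is invented for the example.

```python
import re


def describe_subset(rows, snapshot_month, filters):
    """Apply exact-match filters and return (kept_rows, provenance_note)."""
    if not re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", snapshot_month):
        raise ValueError("snapshot_month must look like YYYY-MM, e.g. 2026-05")
    kept = [
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    ]
    # Sort the filters so the same subset always yields the same note.
    filter_text = ", ".join(f"{k}={v}" for k, v in sorted(filters.items())) or "none"
    note = (
        f"ProFix dataset, snapshot {snapshot_month}; "
        f"rows after filtering: {len(kept)}; filters: {filter_text}"
    )
    return kept, note
```

The returned note can be pasted next to whichever citation template you use, so the snapshot month, row count, and filters travel with the reference.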

Limitations + caveats

Open data is only useful when its limitations are visible. Read these before publishing analysis. The list below is short on purpose; the full taxonomy of failure modes is documented in the research article at /research/directory-data-quality-2026.

  • Permit-pull data is currently live for Lucas, Cuyahoga, Franklin, and Hamilton counties. The remaining 84 counties' permit feeds are in progress; expect statewide parity later in 2026. Treat the absence of permit history outside those four counties as missing data, not as evidence that a contractor has pulled no permits.
  • Ratings and review counts come from public listings (primarily Google Places). They are surfaced as evidence, not re-emitted as schema.org AggregateRating, and they should be treated as point-in-time snapshots — the FTC's 2024 fake-reviews rule and Google's review policies mean star data is noisier than it looks.
  • Lat/lng are public business coordinates, not field-technician GPS. Don't use them for dispatch routing or as a proxy for the address a homeowner should send the contractor.
  • License-number coverage varies by trade. Ohio's OCILB licenses four trades — plumbing, HVAC, electrical, hydronics — so those rows have the highest license-linked rate. Roofers, appliance-repair techs, tree-service crews, concrete contractors, and several other categories are not state-licensed in Ohio; those rows rely on the verified-profile tier and substitute trust signals such as permit history.
  • Dead phones and ghost businesses are an industry-wide problem. The directory-data-quality research article quantifies the residual rate, names the remediation cadence, and explains the methodology. We are honest about the error bar.
  • Spanish translations currently cover the highest-traffic homeowner pages plus the per-trade buyer's guides. The dataset itself is English-first; trade and specialty labels are codified in English even where the homeowner UI is bilingual.
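
The permit caveat is the easiest one to get wrong in analysis code, because a missing feed and a zero count look identical unless they are kept apart. The following sketch shows one way to do that. It assumes a hypothetical permit_counts mapping from slug to permit count (permit history is not part of the open snapshot), and it guesses that county values may or may not carry a " County" suffix.

```python
# Counties with live permit-pull data, per the caveat above.
PERMIT_COUNTIES = {"lucas", "cuyahoga", "franklin", "hamilton"}


def permit_count_or_missing(row, permit_counts):
    """Return the permit count, or None when the county has no live permit feed.

    None means "unknown"; 0 means "covered county, no permits on record".
    """
    county = (row.get("county") or "").strip().lower()
    if county.endswith(" county"):
        county = county[: -len(" county")]
    if county not in PERMIT_COUNTIES:
        return None
    return permit_counts.get(row["slug"], 0)
```

Keeping None distinct from 0 lets downstream code drop or impute the uncovered counties explicitly instead of silently treating them as inactive.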

Corrections are welcome and acted on quickly. Send the profile slug, the offending field, and a public source to /contact; the ProFix Editorial Team turns around clean corrections within 48 hours.

Get involved

There are three good ways to engage with ProFix data:

  • Use it. Download the open snapshot from /open-data, hit the JSON feeds, or wire the MCP server into your agent. No API key, no rate-limit gate.
  • Improve it. File corrections, suggest new public sources we should be pulling, or send us your replication of the pipeline in another state. Inbox is at /contact.
  • Partner on it. If you're a publisher, integrator, or civic-tech team that wants a more structured arrangement than the open license, the /partners page is the entry point.

Related

  • /metrics — public stats dashboard with the math behind every headline number.
  • /open-data — the dataset itself, with citation templates and load_dataset snippets.
  • /sources — every external source we use, with license, refresh cadence, fields used, and explicit "what we don't pull" lists.
  • /developer — agent-native developer hub with copy-paste curl, JavaScript, Python, and MCP snippets.