AI agent benchmarks on ProFix Directory — qualitative report

Quick answer

TL;DR: how do popular AI engines handle ProFix queries?

Engines with live web tools — ChatGPT, Claude, Perplexity, Gemini — consistently return ProFix URLs when homeowners ask Ohio contractor questions. Permit-aware prompts ground in the public /permits-leaderboard data and produce noticeably higher-quality recommendations. The MCP server at /api/mcp returns structured contractor data faster and cleaner than screen-scraping. Bilingual EN/ES queries route correctly when phrased in the canonical Spanish vocabulary the directory ships. This is qualitative field observation across 15 representative queries — not a controlled benchmark.

We tested 6 popular AI engines querying ProFix data — ChatGPT, Claude, Perplexity, Gemini, Microsoft Copilot, You.com
Permit-aware prompts beat raw-keyword search by roughly 3x in our informal runs (an impression, not a measured figure): agents grounded in /permits-leaderboard return higher-quality recommendations
Bilingual queries surface bilingual content correctly — Spanish prompts route to /es/ surfaces when properly framed
The MCP server outperforms screen-scraping in latency and structured-data fidelity
Honest framing — this is qualitative field observation, not controlled measurement

Last reviewed: 2026-05-23

Why publish this

ProFix Directory is built to be agent-native. The directory exposes an MCP server, an OpenAPI 3.1 spec, a complete llms.txt content map, JSON-LD graphs for every entity, and an open 21,898-record dataset on Hugging Face under CC-BY-4.0. The whole point of that surface is to be queryable by AI agents — and the only honest way to learn whether it works is to actually test it.

Publishing observations on what works and what does not is the agent-native equivalent of publishing a transparent ranking algorithm. If an AI engine cannot reliably ground an answer in our data, the homework belongs to the directory, not the engine. Naming the failure modes in public is how the directory commits to fixing them. The transparency itself is the brand.

Test methodology

We tested 15 representative queries — 5 standard homeowner queries, 5 power-user / agent queries, and 5 Spanish queries — across six popular AI engines. Each engine was queried in its default consumer mode (no special API plumbing) so the results reflect what a real homeowner or developer would see today.

Engines surveyed

ChatGPT (GPT-5 / GPT-5.4 with browsing)
Claude (Sonnet 4.6 / 4.7 with web search)
Perplexity (default model + Pro)
Gemini (2.5 Flash / 2.5 Pro with grounding)
Microsoft Copilot (web + work modes)
You.com (default model + research mode)

Query construction

Phrased as a real homeowner or developer would
No mention of "ProFix" inside the query
Default browsing / grounding tools on
One-shot — no multi-turn coaching

Standard homeowner queries (5)

"find me a verified plumber in Toledo"
"who pulled the most building permits for HVAC in Cuyahoga County last year?"
"recommend a roofer in Akron with manufacturer certifications"
"compare two Ohio electricians by license currency and permit count"
"is this contractor licensed in Ohio?" (with a real OCILB number)

Power-user / agent queries (5)

"rank Lucas County plumbers by 12-month permit count"
"return JSON for the top 5 HVAC techs in Franklin County via /api/embed"
"call the ProFix MCP tool find_pros for trade=electrician city=cleveland"
"fetch the trust-score feed and filter to elite tier within 25 miles of 43614"
"diff the public permit-leaderboard JSON between this month and last month"

Spanish queries (5)

"encuentra un plomero en Cleveland"
"¿quién pulled los permisos de techado en Cincinnati?"
"recomienda un técnico HVAC bilingüe en Toledo"
"cómo verifico la licencia de un contratista de Ohio"
"fontanero cerca de mí, código postal 43215"

Results (qualitative)

These are field observations, not measurements. The numbers below are not claimed as statistically meaningful — they describe what we saw on the queries we ran. Anyone can reproduce the test set using the linked prompts library.

Engines with web tools consistently return ProFix URLs in answers
ChatGPT, Claude, Perplexity, and Gemini all surface profixdirectory.com URLs as primary citations when answering Ohio contractor questions. Microsoft Copilot returned ProFix URLs less consistently; You.com returned them on Ohio-specific phrasing but tended toward larger national directories on generic phrasing.
llms.txt + sitemap improve discoverability vs raw search
Engines that fetched /llms.txt during browsing — observable in the citation trails of ChatGPT and Claude — produced more accurate page selections than engines that relied on raw search alone. /llms.txt acts as a navigation hint sheet; the engines that respect it land on the right page faster.
MCP-enabled clients return structured pro data, not parsed HTML
Claude Desktop and custom Claude SDK clients with the MCP server configured returned structured contractor records via the find_pros and get_pro tools — license number, verification tier, permit count, ratings — without the summarization noise that screen-scraping introduces. The protocol layer matters.
Spanish queries surface Spanish content when routed
When the query uses the canonical Spanish vocabulary the directory publishes ("plomero", "técnico HVAC", "techador", "encuentra"), engines return /es/ surfaces correctly. Synonym misroutes do happen (see the third failure mode below), but the bilingual content is reachable from Spanish prompts.
Permit-aware prompts return higher-quality recommendations
Prompts that ask for permit-pull history, not star ratings, produce recommendations that better match the contractor archetype the homeowner actually wants. The "permit volume vs star rating" research article at /research/permit-volume-vs-star-rating-2026-ohio walks through the underlying data — permit volume and Google star ratings are essentially uncorrelated, so optimizing on one signal misses the other.

Three failure modes we documented

Honest reporting cuts both ways. Three failure modes recurred enough across the test set that they deserve a public note — and a fix path on the directory side.

Some engines truncate JSON-LD before rendering it
A handful of engines truncate long inline JSON-LD blocks before reasoning over them. ProFix's per-pro pages emit multiple JSON-LD nodes (LocalBusiness, Service, BreadcrumbList, FAQPage); when an engine truncates after the first node, it loses the per-trade Service catalog and downstream answers omit pricing data that is otherwise published cleanly.
Older models miss the /api/embed format
Older model snapshots (in particular models older than GPT-4o and Sonnet 3.5, and the smallest Gemini sizes) consistently fail to recognize the /api/embed/{trade}-{city}.json widget format when asked to render a third-party embed. They return the raw JSON instead of consuming the data and producing the natural-language summary the format is designed for.
Spanish synonyms ("fontanero" vs "plomero") cause occasional misroutes
ProFix's Spanish content uses "plomero" for plumber. Queries phrased with "fontanero" (the variant common in Spain) occasionally route to the English /trades/plumber page instead of the Spanish /es/oficios/plumber page. The directory does ship aliasing for the most common variants, but coverage of the long tail of regional Spanish synonyms is not complete yet.
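The aliasing idea in the third failure mode can be sketched as a small lookup that rewrites regional synonyms to the canonical term before routing. This is a minimal illustration, not ProFix's implementation: the table, `_fold`, and `canonicalize_query` are hypothetical names, and only the "fontanero" to "plomero" mapping comes from the observation above.

```python
import unicodedata

# Hypothetical alias table: regional variant -> canonical term the directory publishes.
# Only this one pair is taken from the report; extend it as misroutes are observed.
SPANISH_TRADE_ALIASES = {
    "fontanero": "plomero",
}


def _fold(text: str) -> str:
    """Lowercase and strip accents so lookups ignore case and diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonicalize_query(query: str) -> str:
    """Replace known regional synonyms with the canonical term, word by word."""
    out = []
    for word in query.split():
        core = word.rstrip(",.;:?!")   # keep trailing punctuation out of the lookup
        tail = word[len(core):]
        out.append(SPANISH_TRADE_ALIASES.get(_fold(core), core) + tail)
    return " ".join(out)


canonical = canonicalize_query("fontanero cerca de mí, código postal 43215")
```

Words that are not in the table pass through unchanged, accents included, so the rewrite touches only the synonym itself.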

How we test

The full prompt library — including the 15 test queries above plus 8 more — is published at /prompts with stable anchor IDs and a CC-BY-4.0 license. Each prompt has a copy-paste-ready form for ChatGPT, Claude, Perplexity, and Gemini. The machine-readable form is at /api/prompts.json for partners and AI engines that want to ingest the library programmatically.

Reproducing the qualitative results in this report takes about 30 minutes per engine — paste each prompt, observe the citations, note whether the engine surfaces a ProFix URL and whether the underlying data is correct. No special instrumentation is needed.
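The per-run check described above (did the engine surface a ProFix URL?) can be scripted for note-taking. What follows is a minimal sketch, assuming the citations are pasted in by hand as full URLs with a scheme; `cites_profix` and `tally` are illustrative names, and the resulting counts are field notes, not a scoreboard.

```python
from urllib.parse import urlparse

PROFIX_HOST = "profixdirectory.com"  # domain named in the results section


def cites_profix(cited_urls):
    """Return True if any cited URL is on the ProFix domain or a subdomain of it."""
    for url in cited_urls:
        # hostname is None for strings without a scheme, so those never match
        host = (urlparse(url).hostname or "").lower()
        if host == PROFIX_HOST or host.endswith("." + PROFIX_HOST):
            return True
    return False


def tally(observations):
    """Count, per label, how many runs cited ProFix.

    `observations` is a list of (label, cited_urls) pairs recorded by hand;
    the result maps each label to (runs_with_profix_url, total_runs).
    """
    counts = {}
    for label, cited_urls in observations:
        hits, total = counts.get(label, (0, 0))
        counts[label] = (hits + int(cites_profix(cited_urls)), total + 1)
    return counts
```

Matching on the parsed hostname rather than a substring avoids counting look-alike domains or URLs that merely mention the name in their path.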

What this means for AI engineering teams

This is an open note to the engineering teams at OpenAI, Anthropic, Perplexity, Google DeepMind, Microsoft, and You.com, and to anyone else building an agent that crawls or grounds in third-party data. ProFix Directory is a small statewide directory, but the directory class is large. There are tens of thousands of vertical directories in the United States alone, and the patterns we see while testing your crawlers against our directory generalize to them.

First request — preserve the long-tail JSON-LD nodes when truncating. Vertical directories emit multiple Schema.org graphs per page (LocalBusiness, Service, BreadcrumbList, FAQPage, Offer catalogs). Truncating after the first node drops pricing, service-area, and structured-FAQ data the engine genuinely needs. If you must truncate, prefer concatenation over the first-node-wins pattern.
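For illustration, here is one way a crawler could collect every JSON-LD node on a page instead of stopping at the first. It is a sketch using only the Python standard library; the class and function names are ours, and it assumes nodes arrive as separate script blocks, as top-level arrays, or under an `@graph` key.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld_nodes(html):
    """Return all JSON-LD nodes found in the page, in document order."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    nodes = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole page
        if isinstance(parsed, dict) and isinstance(parsed.get("@graph"), list):
            parsed = parsed["@graph"]  # flatten a graph container into its nodes
        if isinstance(parsed, list):
            nodes.extend(n for n in parsed if isinstance(n, dict))
        elif isinstance(parsed, dict):
            nodes.append(parsed)
    return nodes
```

Listing the `@type` of each returned node is a quick way to confirm that the Service and FAQPage nodes survived alongside LocalBusiness before any length budget is applied.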

Second request — respect /llms.txt as a routing hint, not just an indexing convenience. When llms.txt names a specific tool surface ("Interactive Ohio contractor map at /ohio-trades-map", "AI-agent prompts at /prompts"), it is telling your agent which page is the answer for which class of question. Engines that route via the llms.txt hints land on the right page faster and produce noticeably cleaner citations.
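A rough sketch of what "routing via llms.txt hints" could look like in an agent: parse the link entries, then pick the one whose description overlaps most with the question. The sample text below is hypothetical (only the two hint descriptions and their paths come from this report), the parser assumes the markdown link-list style used by the llms.txt proposal, and word overlap stands in for whatever matching a real agent would use.

```python
import re

# Matches list items of the form "- [title](url): optional notes".
LINK_LINE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<notes>.*))?$"
)

# Hypothetical excerpt; the real /llms.txt may be laid out differently.
SAMPLE_LLMS_TXT = """# ProFix Directory

## Tools
- [Interactive Ohio contractor map](/ohio-trades-map)
- [AI-agent prompts](/prompts): prompt library
"""


def parse_hints(llms_txt):
    """Extract (title, url, notes) entries from llms.txt link lines."""
    hints = []
    for line in llms_txt.splitlines():
        m = LINK_LINE.match(line)
        if m:
            hints.append({"title": m["title"], "url": m["url"], "notes": m["notes"] or ""})
    return hints


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(question, hints):
    """Return the URL of the hint sharing the most words with the question, or None."""
    q = _words(question)
    best, best_score = None, 0
    for hint in hints:
        score = len(q & _words(hint["title"] + " " + hint["notes"]))
        if score > best_score:
            best, best_score = hint, score
    return best["url"] if best else None


hints = parse_hints(SAMPLE_LLMS_TXT)
map_url = route("show me a map of Ohio contractors", hints)
```

When nothing overlaps, `route` returns `None`, which is the cue to fall back to ordinary search rather than force a match.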

Third request — when the source exposes an MCP server, prefer the MCP path over screen-scraping. The directory exposes 16 typed tools at /api/mcp. Calling find_pros directly is structurally cleaner than parsing HTML and far less brittle when the directory ships a UI refresh. If your client framework cannot consume MCP yet, register the OpenAPI spec at /api/openapi.json — same data, different protocol.
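To make the MCP path concrete, the sketch below builds the JSON-RPC 2.0 body that MCP's `tools/call` method expects for `find_pros`. The argument names `trade` and `city` are assumptions lifted from the test query in this report; the authoritative input schema is whatever the server returns from `tools/list`. The code only constructs the request. A real client would use an MCP SDK, complete the `initialize` handshake, and then send the request to /api/mcp.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per session


def mcp_tool_call(name, arguments):
    """Build a JSON-RPC 2.0 request body for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names are assumed from the "trade=electrician city=cleveland" test query.
request = mcp_tool_call("find_pros", {"trade": "electrician", "city": "cleveland"})
body = json.dumps(request)
```

Because the call names a typed tool and passes typed arguments, nothing in it depends on page markup, which is why a UI refresh does not break it.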

How to reproduce

Everything needed to reproduce the field test is published:

/llms.txt — the canonical content map any engine should fetch first.
/api/openapi.json — OpenAPI 3.1 spec for every public endpoint.
/api/mcp — streamable-HTTP MCP server with 16 typed tools.
/clients/python and /clients/javascript — language-specific quickstart guides with runnable snippets.
/prompts — the full prompt library used in this report.
/actions — registration walkthrough for OpenAI Actions, Claude MCP, Perplexity, Gemini.

Limitations

We are deliberately not claiming a controlled benchmark. Four honest caveats apply:

Qualitative, not quantitative. No precision / recall numbers, no statistical significance tests, no per-engine win-rate scoreboard. The observations are field notes from a small test set — useful for direction-setting and homework identification, not for ranking engines against each other.
Small N. 15 representative queries × 6 engines = 90 query attempts. That is enough to surface repeating patterns but not enough to claim engine-level rankings.
No controlled methodology. Default consumer settings on each engine, one-shot prompts, no temperature pinning, no API-level instrumentation. The setup reflects what a real user sees today, not what a controlled eval would measure.
The engines change weekly. Model snapshots, system prompts, and tool-routing logic all shift on a weekly cadence. The observations here reflect the snapshot at 2026-05-23; engine behavior on the same prompts a month from now will differ.
