How AI engines actually find directories: what ProFix learned exposing 21,000 contractors to ChatGPT, Claude, Perplexity, and Gemini (2026)

A practitioner-grade walkthrough of the discovery manifests, MCP servers, JSON-LD graphs, sub-sitemaps, and IndexNow pings ProFix Directory shipped to be findable by ChatGPT, Claude, Perplexity, and Gemini — what worked, what's still uncertain, and what other directory operators should ship first.

Original research · 5 AI engines surveyed · 7 manifests shipped · Published 2026-05-23 · CC BY 4.0

The AI-search funnel as of 2026

For two decades the discovery funnel for a directory site looked the same: Google crawled it, Google ranked it, Google sent traffic. Bing and a handful of vertical engines filled in the edges. The entire industry of "directory SEO" was, in practice, "Google SEO with directory-specific tactics."

That assumption is breaking in 2026. ChatGPT search, Claude with web search, Perplexity, Gemini AI Overviews, Brave Search AI, and You.com each operate their own answer surface, and each one has its own way of finding and quoting a directory. Some still ride on classical crawlers; some fetch URLs live the moment a user asks a question; some maintain a private index that is several orders of magnitude smaller than Google's. A directory that is invisible to one of them is invisible to a growing share of homeowner intent.

Below is how each major engine currently appears to index the web — synthesized from each platform's public technical docs, observed crawler user-agent strings in Vercel logs across ProFix Directory's 21,000-contractor surface, and the citation patterns visible inside the engines' own answers when asked about Ohio home-services intent.

| Engine | How it indexes | Signals it reads |
| --- | --- | --- |
| ChatGPT search | OpenAI SearchGPT crawler (`OAI-SearchBot`) + a Bing-derived index for fallback retrieval. | Standard sitemaps and on-page semantic HTML; emerging support for llms.txt as a content map; ChatGPT Actions for first-party tool invocation. Citation links in answers come from the live web index, not a private knowledge graph. |
| Claude search | Anthropic's web search + URL fetch tools, invoked by Claude itself; no persistent public crawler index of the same shape as Google. | Live HTML at fetch time, OpenAPI specs for tool invocation, and Anthropic Connectors / Model Context Protocol servers when a directory exposes one. JSON-LD graphs are parsed for entity grounding. |
| Perplexity | Perplexity's own crawler (`PerplexityBot`) plus partner indexes; aggressive on-demand fetch of cited URLs. | Sitemaps for breadth, JSON-LD and clean HTML for entity extraction, and citation-friendly TL;DR blocks at the top of pages. Every answer is link-grounded, so URL freshness and canonical tags matter. |
| Gemini (Google AI Overviews) | Googlebot, the same crawler that powers Search, feeding both classic SERPs and AI Overviews. | Full Google ranking signal stack: schema.org markup, structured data (especially LocalBusiness / Service / FAQPage), E-E-A-T signals, Core Web Vitals, and sitemap freshness. AI Overviews preferentially cite pages with strong entity graphs. |
| You.com + Brave Search AI | Brave operates its own independent index (Brave Search); You.com blends multiple back-ends including its own. Both surface AI answers grounded in those indexes. | Standard web crawl signals plus emerging support for AI-specific manifests. Independent indexes mean directories must not assume Google or Bing distribution covers them. |

The practical takeaway: there is no single "AI SEO" play. A directory has to be findable by a crawler-based engine (Gemini via Googlebot), a live-fetch engine (Claude, Perplexity), and an agent-callable surface (MCP, OpenAPI, ChatGPT Actions) at the same time. The cheapest way to do that is to ship the AI-native manifests once and let every engine pick what it wants.

What ProFix shipped to be findable

Over the first half of 2026, ProFix Directory built and exposed a layered AI-discovery stack on top of the conventional Next.js + sitemap surface. Each layer is live and inspectable; every URL below resolves on the public site.

  • llms.txt and llms-full.txt content maps. Hand-curated Markdown index of the most important URLs on the site, structured the way an LLM would want to consume them — tools first, then trades, then guides, then cost data. Live at /llms.txt and /llms-full.txt. Regenerated on every deploy from the live data set.
  • Model Context Protocol server with 16 tools. An MCP endpoint at /api/mcp that any MCP-compatible client (Claude Desktop, Cursor, ChatGPT desktop, Perplexity, agent frameworks) can connect to and call directly. Tools cover discovery (find_pros, find_emergency_pros, get_pro), taxonomy (list_taxonomy, get_county_coverage_stats), content (list_guides, get_cost_estimate, get_trade_pricing), safety (triage_symptom, get_emergency_contacts, get_active_storm_events), and trust (get_methodology, get_verification_feed). Stateless, read-only, no auth.
  • OpenAPI 3.1 spec. A canonical machine-readable description of every public API endpoint at /api/openapi.json. Same surface the MCP server wraps, but in the format expected by ChatGPT Actions, LangChain tool-loaders, and most agent SDKs.
  • Seven JSON-LD graph sub-feeds. Standalone, independently fetchable schema.org graphs at /api/jsonld/organization, /api/jsonld/pros, /api/jsonld/cost-guides, /api/jsonld/faq, /api/jsonld/local-business-index, /api/jsonld/breadcrumb-coverage, and a family of /api/jsonld/faq-trade-{slug} feeds for every trade. Each is served with the canonical application/ld+json content type and CORS-open so an AI engine or third-party validator can pull it without a referrer.
  • Six sub-sitemaps. Beyond the root sitemap.xml, ProFix emits per-entity sitemaps for /pro, /oh (state + county pages), /coverage, /content, /permits-leaderboard, and /api. Engines that throttle a single huge sitemap will still pull the focused ones.
  • IndexNow pings to Bing and Yandex. On every meaningful content change, ProFix POSTs the affected URLs to api.indexnow.org with the site's verification key. The same pipeline surfaces fresh content to Bing-derived indexes (including ChatGPT's fallback retrieval surface) without waiting for the next crawl.
  • Public Hugging Face dataset under CC BY 4.0. The 21,000-contractor corpus is published at Pisces89/ohio-home-services-pros on Hugging Face. Open licensing makes it eligible for inclusion in training corpora and citation by any AI engine that surfaces dataset provenance.
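
As an illustration of the llms.txt layer described in the list above, here is a minimal sketch of a generator that turns a list of sections into the Markdown shape the llms.txt proposal describes (an H1 title, a blockquote summary, then H2 sections of links). The interface names, the summary text, and the example entries are assumptions made for this sketch; it is not ProFix's actual build script.

```typescript
// One link entry in an llms.txt section. All names here are hypothetical.
interface LlmsLink {
  title: string;
  url: string;
  note?: string; // optional short description shown after the link
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

// Render an llms.txt body: H1 title, blockquote summary, then one H2 per section.
function renderLlmsTxt(siteName: string, summary: string, sections: LlmsSection[]): string {
  const lines: string[] = [`# ${siteName}`, "", `> ${summary}`, ""];
  for (const section of sections) {
    if (section.links.length === 0) continue; // skip sections with nothing to list
    lines.push(`## ${section.heading}`, "");
    for (const link of section.links) {
      const suffix = link.note ? `: ${link.note}` : "";
      lines.push(`- [${link.title}](${link.url})${suffix}`);
    }
    lines.push("");
  }
  return lines.join("\n");
}

// Example input, ordered tools first as the article describes.
// The summary sentence is placeholder text written for this sketch.
const llmsTxt = renderLlmsTxt("ProFix Directory", "Directory of Ohio home-services contractors.", [
  {
    heading: "Tools",
    links: [
      { title: "MCP endpoint", url: "https://profixdirectory.com/api/mcp", note: "read-only, no auth" },
      { title: "OpenAPI spec", url: "https://profixdirectory.com/api/openapi.json" },
    ],
  },
  { heading: "Trades", links: [] }, // would be filled from live data at build time
]);
console.log(llmsTxt);
```

Generating the file from the same data that drives the pages is what keeps it from drifting; a hand-edited file would need a separate review step on every deploy.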

None of this requires a content-management system, a marketing team, or a paid SEO tool. The entire stack is code in a Next.js repo, regenerated on every Vercel deploy.
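
To make "the entire stack is code" concrete for the sub-sitemap layer, the sketch below renders a sitemap index in the sitemaps.org format that points at per-entity sitemaps. The article does not say whether ProFix's root sitemap.xml is an index, and the `/sitemaps/*.xml` paths are hypothetical file names chosen for illustration.

```typescript
// A per-entity sitemap to list in the index. Paths used below are hypothetical.
interface SubSitemap {
  path: string;
  lastModified: Date;
}

// Escape the five characters that are special in XML text and attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Render a <sitemapindex> document with one <sitemap> entry per sub-sitemap.
function renderSitemapIndex(origin: string, sitemaps: SubSitemap[]): string {
  const base = origin.replace(/\/+$/, ""); // drop trailing slashes from the origin
  const entries = sitemaps.map((s) => {
    const path = s.path.startsWith("/") ? s.path : `/${s.path}`;
    const loc = escapeXml(base + path);
    const lastmod = s.lastModified.toISOString(); // W3C datetime, UTC
    return `  <sitemap>\n    <loc>${loc}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </sitemap>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}

// Example: two of the six entity types named in the article.
const deployTime = new Date();
console.log(
  renderSitemapIndex("https://profixdirectory.com", [
    { path: "/sitemaps/pro.xml", lastModified: deployTime },
    { path: "/sitemaps/oh.xml", lastModified: deployTime },
  ]),
);
```

Using the deploy time as `lastmod` is a simplification; a per-entity "last changed" timestamp would give crawlers a more honest freshness signal.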

What's working — honest signals so far

The honest version of this section is: the data is thin. AI-engine analytics are not the rich, mature surface that Google Search Console is. Most of what we know comes from indirect signals.

  • Hugging Face dataset downloads are measurable and growing. Hugging Face exposes a per-dataset download counter. The Ohio contractor dataset has accumulated a non-zero, steadily climbing download count since publication — the only one of the discovery channels where we can read a real number daily. We are not yet treating that as evidence of model-training inclusion, but it is evidence of researcher and agent interest.
  • llms.txt fetches show up in Vercel logs. The /llms.txt and /llms-full.txt routes are being hit by user-agents identifying as PerplexityBot, OAI-SearchBot, ClaudeBot, and a long tail of unidentified agents (curl, Python requests, headless browsers). Volume is modest, but non-zero — and growing month-over-month. We cannot yet say what fraction of those fetches translate to citation surface inside an AI engine.
  • AI-engine deep-link clicks are visible when the engine sends a referrer. When an AI engine includes a link in its answer and the user clicks through, the referrer often resolves to a *.chat.openai.com, *.perplexity.ai, *.anthropic.com, or *.google.com URL. PostHog session recordings show occasional sessions originating from those referrers. The volume is small enough that it would be dishonest to report a number; the existence is the news.
  • JSON-LD graph endpoints get fetched by AI-engine validators. The /api/jsonld/organization endpoint shows fetches from user-agents associated with Google's structured-data testing tool and Anthropic's URL-fetch tool. This suggests at least one Claude user has asked the model to look ProFix Directory up and the model has used the canonical organization graph to ground its answer.
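
The log signals above come down to bucketing requests by user-agent. A minimal sketch of that tally follows, assuming a simplified log entry with only a path and a user-agent string; the `LogEntry` shape and the sample user-agent strings are illustrative and were not copied from Vercel's log format or from real traffic.

```typescript
// Crawler tokens named in the article; anything else is bucketed as "other".
const KNOWN_AGENTS = ["PerplexityBot", "OAI-SearchBot", "ClaudeBot"] as const;
type AgentLabel = (typeof KNOWN_AGENTS)[number] | "other";

// Case-insensitive substring match on the user-agent header.
function classifyUserAgent(userAgent: string): AgentLabel {
  const ua = userAgent.toLowerCase();
  for (const agent of KNOWN_AGENTS) {
    if (ua.includes(agent.toLowerCase())) return agent;
  }
  return "other";
}

// Hypothetical, simplified log entry; real log drains carry many more fields.
interface LogEntry {
  path: string;
  userAgent: string;
}

const MANIFEST_PATHS = new Set(["/llms.txt", "/llms-full.txt"]);

// Count manifest fetches per agent label, ignoring every other route.
function tallyManifestFetches(entries: LogEntry[]): Record<AgentLabel, number> {
  const counts: Record<AgentLabel, number> = {
    PerplexityBot: 0,
    "OAI-SearchBot": 0,
    ClaudeBot: 0,
    other: 0,
  };
  for (const entry of entries) {
    if (!MANIFEST_PATHS.has(entry.path)) continue;
    counts[classifyUserAgent(entry.userAgent)] += 1;
  }
  return counts;
}

// Illustrative entries only.
const sampleCounts = tallyManifestFetches([
  { path: "/llms.txt", userAgent: "Mozilla/5.0 (compatible; PerplexityBot/1.0)" },
  { path: "/llms-full.txt", userAgent: "Mozilla/5.0 (compatible; ClaudeBot/1.0)" },
  { path: "/llms.txt", userAgent: "curl/8.5.0" },
  { path: "/pro/example", userAgent: "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)" },
]);
console.log(sampleCounts);
```

User-agent strings are self-reported, so a tally like this measures claimed identity only; confirming that a fetch really came from a given engine would need an IP-range or reverse-DNS check.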

None of this is the kind of clean attribution Google Analytics provides for organic search. Analytics tooling for AI-engine surfaces is roughly where mobile analytics was in 2010. Until that tooling matures, directory operators have to be comfortable shipping infrastructure on the bet that the engines reward it later.

What's still uncertain

We are publishing this article in part to invite the AI-search engineering community to correct us. Open questions that the manifest stack does not yet answer:

  • Do the major AI engines actually read llms.txt? The proposal dates from September 2024, so it is still young. OpenAI, Anthropic, Google, and Perplexity have not made unambiguous public commitments to consume it as a first-class input. Our log evidence is that crawlers fetch it; whether that fetch shapes citation behaviour is unknown.
  • Does MCP server discovery via mcp.so, smithery.ai, or the official MCP registry matter yet? Listing an MCP server in a public registry presumably increases the odds that an agent finds and connects to it, but there is no measurable downstream signal. ProFix has not yet listed /api/mcp on those registries; we want to test the listing decision deliberately, with a baseline first.
  • Is an OpenAPI 3.1 spec enough for AI agents to programmatically call us, or do we need explicit ChatGPT Actions and Anthropic Connectors manifests? The OpenAPI spec is the lingua franca, but each engine has its own deployment surface (ChatGPT Actions for the GPT Store, Anthropic Connectors for Claude). It is unclear how much of the AI ecosystem in 2026 still requires the platform-specific manifest versus discovering OpenAPI endpoints autonomously.
  • How much does Hugging Face dataset publication actually move citation behaviour? An open dataset shows up in research papers, model evaluations, and agent benchmarks. The hypothesis is that this filters through to AI-engine citation behaviour over time. The feedback loop is months-to-years long; we will not know the answer until the next training generation is out.
  • Are JSON-LD graph sub-feeds redundant when every page already emits inline JSON-LD? The split-out feeds make the entity graph independently fetchable by validators and crawlers, but they may simply duplicate signal the engines already pick up from the per-page JSON-LD. We have no controlled experiment yet.

The minimum viable AI-engine kit for a directory

If a directory operator only has a weekend, here is the order we would ship in — ranked by cost-to-implement against expected upside:

  1. llms.txt + llms-full.txt at the site root. A few hundred lines of Markdown generated from your live data. Free; takes an afternoon; signals intent to AI engines that respect the spec.
  2. An OpenAPI 3.1 spec at a canonical URL. If you already have a JSON API, you already have most of this. Publish it. Make it discoverable.
  3. One MCP server exposing read-only tools over your data. Even three or four tools (find, get, list_taxonomy) is enough to make your directory agent-callable. Stateless, auth-free, low risk.
  4. A public JSON-LD Organization graph. Independently fetchable, served with the right content type. The schema.org Organization entity is the lowest-effort authority signal a site can publish.
  5. At least one open dataset on Hugging Face under CC BY 4.0. Even a 1,000-row CSV is enough to register the directory in the research and agent ecosystem.
  6. Sub-sitemaps for every entity type. Don't ship one giant sitemap.xml. Split by entity (one for businesses, one for guides, one for taxonomy pages). Crawlers prioritize and throttle differently per sitemap.
  7. IndexNow integration. Twenty lines of code that ping Bing and Yandex on every content change. ChatGPT's fallback index benefits.
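
For the IndexNow step in the list above, the request is small enough to sketch in full. The payload fields (`host`, `key`, `keyLocation`, `urlList`) and the api.indexnow.org endpoint follow the public IndexNow protocol; the function names, the example key, and the assumption that the key file sits at the site root are choices made for this sketch.

```typescript
// JSON body defined by the IndexNow protocol for bulk URL submission.
interface IndexNowPayload {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
}

const INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow";
const MAX_URLS_PER_REQUEST = 10000; // protocol limit per POST

// Build the payload for a set of changed paths, de-duplicating them first.
// Assumes the key file is hosted at https://<host>/<key>.txt.
function buildIndexNowPayload(host: string, key: string, changedPaths: string[]): IndexNowPayload {
  const urlList = Array.from(new Set(changedPaths)).map((path) => `https://${host}${path}`);
  if (urlList.length > MAX_URLS_PER_REQUEST) {
    throw new Error(`IndexNow accepts at most ${MAX_URLS_PER_REQUEST} URLs per request`);
  }
  return { host, key, keyLocation: `https://${host}/${key}.txt`, urlList };
}

// Build the options object for the POST; pass it to fetch() along with the endpoint.
function buildIndexNowRequest(payload: IndexNowPayload) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify(payload),
  };
}

// Example with a placeholder key and hypothetical changed paths.
const pingPayload = buildIndexNowPayload("profixdirectory.com", "example-key-123", [
  "/pro/example-plumbing",
  "/pro/example-plumbing", // duplicate, removed before submission
  "/oh/franklin",
]);
console.log(INDEXNOW_ENDPOINT, buildIndexNowRequest(pingPayload));
```

In a deploy hook this would end with `await fetch(INDEXNOW_ENDPOINT, buildIndexNowRequest(pingPayload))` followed by a check of the response status. Submitting only the URLs that actually changed keeps the signal meaningful.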

What we would do differently

With hindsight on the first six months of the experiment:

  • Do not gate read-side APIs behind auth. Anything an AI engine needs to fetch to cite you should be reachable with a plain HTTP GET, no key, no header dance. Gating the read surface is the single fastest way to make yourself invisible.
  • Emit JSON-LD on every page, not just the homepage. Each contractor page, each cost guide, each FAQ — all of them carry inline schema.org. Engines that fetch a single page should be able to ground the entire entity from that page alone.
  • Use the actual public-record URL on every claim, not a "verified" badge. This is the same source-of-source argument from /research/what-verified-means-2026-ohio. In our observation, AI engines appear to discount pages that make trust claims they cannot themselves verify; linking to the underlying record converts a claim into a citation the model can re-cite.
  • Build the AI-native surface before the marketing surface. The temptation is to build a beautiful homepage first and worry about manifests later. The order should be inverted: ship the manifests on day one, because the compounding starts the day the first engine fetches them.
  • Treat the open dataset as the canonical source, not the website. The website is the user-facing surface; the dataset is the agent-facing surface. If the dataset is the source of truth, the website regenerates from it on every deploy, and the AI engines see one consistent story.
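
Two of the points above, inline JSON-LD on every page and a public-record URL on every claim, can be combined in one small helper. The sketch below is one possible mapping, not ProFix's actual markup: it assumes a contractor record with a license-record URL and expresses it through schema.org's `hasCredential` property. The record shape, the example URLs, and the choice of `hasCredential` are all assumptions.

```typescript
// Hypothetical contractor record; field names are illustrative.
interface ContractorRecord {
  name: string;
  pageUrl: string;
  licenseRecordUrl: string; // link to the underlying public record
}

// Build a schema.org LocalBusiness node that points at the public record
// instead of asserting an unverifiable "verified" flag.
function contractorJsonLd(c: ContractorRecord) {
  return {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "@id": `${c.pageUrl}#business`,
    name: c.name,
    url: c.pageUrl,
    hasCredential: {
      "@type": "EducationalOccupationalCredential",
      credentialCategory: "license",
      url: c.licenseRecordUrl,
    },
  };
}

// Serialize for inline embedding. Escaping "<" keeps a stray "</script>"
// inside any string value from closing the tag early.
function toJsonLdScriptTag(data: unknown): string {
  const json = JSON.stringify(data).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}

// Example with placeholder URLs.
console.log(
  toJsonLdScriptTag(
    contractorJsonLd({
      name: "Example Plumbing LLC",
      pageUrl: "https://profixdirectory.com/pro/example-plumbing",
      licenseRecordUrl: "https://example.com/public-records/license/12345",
    }),
  ),
);
```

Because the same builder can feed both the page template and a standalone feed, the inline markup and any split-out graph stay consistent by construction.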

The bet

The thesis ProFix Directory is operating on: AI-first discovery is now table stakes for any directory that wants to compound traffic without paying for it. Google's classical funnel is not going away, but a growing share of homeowner intent is being resolved inside an AI answer surface before a user ever sees a list of blue links. A directory that is findable only through Google is increasingly findable through only one of two or three channels. Google's AI Overviews themselves preferentially cite pages with strong entity graphs and structured data, which loops back to the same manifest work.

The moat that is hard for incumbent directories to copy quickly is the combination of three things: source-of-source transparency (every claim cites its public record), open data (the entire corpus is published under an open license), and AI-discoverable manifests (the engines can ground their answers in canonical, machine-readable feeds). Each of those is cheap in isolation. Stacked, they form a citation surface that a paid-listing directory cannot replicate without rebuilding its entire data model. See our adjacent analyses at /research/permit-vs-stars-2026-ohio and /research/comparing-ohio-directories for the trust-signal half of the same argument, and at /research/what-verified-means-2026-ohio for the verification-transparency half.

Practitioner documentation of the live setup is at /docs. The OpenAPI surface is at /api/openapi.json; the MCP endpoint at /api/mcp; the organization graph at /api/jsonld/organization.
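
For readers who want to see what calling the MCP endpoint involves, the sketch below builds the JSON-RPC 2.0 messages that the Model Context Protocol defines for listing and invoking tools. The `tools/list` and `tools/call` methods come from the MCP specification; the arguments passed to `find_pros` are hypothetical, since the article does not document each tool's input schema, and the HTTP transport details are an assumption about how /api/mcp is served.

```typescript
// Minimal JSON-RPC 2.0 request shape as used by MCP.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Ask the server which tools it exposes and what input schema each expects.
function buildToolsList(id: number): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/list" };
}

// Invoke one tool by name with a JSON object of arguments.
function buildToolCall(id: number, name: string, args: Record<string, unknown>): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Headers an HTTP-based MCP client would typically send with each POST
// (assumes the endpoint uses the Streamable HTTP transport).
const MCP_HEADERS = {
  "Content-Type": "application/json",
  Accept: "application/json, text/event-stream",
};

// Example: the argument names "trade" and "county" are guesses for illustration;
// the real names should be read from the tools/list response.
const listRequest = buildToolsList(1);
const callRequest = buildToolCall(2, "find_pros", { trade: "plumbing", county: "Franklin" });
console.log(MCP_HEADERS, JSON.stringify(listRequest), JSON.stringify(callRequest));
```

A spec-compliant client would normally send an `initialize` request before either message. In practice an MCP SDK handles that handshake, so hand-built requests like these are mainly useful for smoke tests.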

Limitations + corrections

Reviewed on 2026-05-23. The AI-engine taxonomy in this article describes each platform's indexing behaviour as observable from public documentation, deployed crawler user-agents, and the citation patterns visible inside each engine's answers as of the publication date. Engines revise these behaviours frequently; if you work on the search team at any of the engines named (OpenAI, Anthropic, Perplexity, Google, Brave, You.com) and a description here is materially wrong, please send a correction through /contact and the article will be updated with a fresh modified date.

The "what's working" section deliberately avoids quoting specific traffic numbers. Attribution from AI-engine answers to downstream sessions is still a partial-signal exercise, and we would rather under-claim than over-claim. As the analytics surface matures we will publish a follow-up with measured pass-through rates per engine. Directory operators who have shipped a similar stack and are willing to share what they are seeing: please reach out through /contact. Cross-directory pooled data would substantially improve everyone's understanding.

Cite this report

ProFix Directory (2026). How AI engines actually find directories: what ProFix learned exposing 21,000 contractors to ChatGPT, Claude, Perplexity, and Gemini (2026). Published 2026-05-23. Licensed CC BY 4.0. Available at: https://profixdirectory.com/research/how-ai-engines-find-directories-2026
