Installation
$ npx skills add lionkiii/claude-seo-skills --skill seo-robots-ai

Summary
Fetch and analyze a site's robots.txt file specifically for AI crawler access policies, checking against a registry of AI bots (GPTBot, ClaudeBot, PerplexityBot, etc.) and generating an access matrix with an AI openness score. The agent can execute this workflow end-to-end, then recommend robots.txt changes aligned with the user's content protection or visibility goals.
SKILL.MD
AI Crawler Robots.txt Audit
Analyzes a site's robots.txt specifically for AI crawler access policies.
Complements /seo-technical (which does a broad robots.txt check) with
deep AI-specific analysis.
@skills/seo/references/ai-crawlers-guide.md
AI Crawler Registry
| Bot Name | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT web search |
| OAI-SearchBot | OpenAI | ChatGPT search only (not training) |
| ChatGPT-User | OpenAI | ChatGPT browsing (real-time) |
| ClaudeBot | Anthropic | Training data collection |
| anthropic-ai | Anthropic | Anthropic web crawler |
| PerplexityBot | Perplexity | AI search engine |
| Google-Extended | Google | Gemini / AI training (not Search) |
| Bytespider | ByteDance | TikTok / AI training |
| CCBot | Common Crawl | Open dataset used by many AI models |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | AI model training |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI browsing agent |
| Amazonbot | Amazon | Alexa / AI training |
| Diffbot | Diffbot | AI knowledge graph |
| ImagesiftBot | Hive | AI image training |
| Omgili | Webz.io | AI data feeds |
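If an implementation holds this registry in code, one possible shape is sketched below, assuming Python. The `category` tags are illustrative labels added to support the training-vs-search split used in the Recommendations section; they are not official designations by the bots' operators.

```python
# Illustrative registry; "category" tags are rough assumptions used by the
# later sketches to separate search/browsing bots from training-only bots.
AI_CRAWLERS: dict[str, dict[str, str]] = {
    "GPTBot":             {"owner": "OpenAI",       "category": "training"},
    "OAI-SearchBot":      {"owner": "OpenAI",       "category": "search"},
    "ChatGPT-User":       {"owner": "OpenAI",       "category": "browsing"},
    "ClaudeBot":          {"owner": "Anthropic",    "category": "training"},
    "anthropic-ai":       {"owner": "Anthropic",    "category": "training"},
    "PerplexityBot":      {"owner": "Perplexity",   "category": "search"},
    "Google-Extended":    {"owner": "Google",       "category": "training"},
    "Bytespider":         {"owner": "ByteDance",    "category": "training"},
    "CCBot":              {"owner": "Common Crawl", "category": "training"},
    "Applebot-Extended":  {"owner": "Apple",        "category": "training"},
    "cohere-ai":          {"owner": "Cohere",       "category": "training"},
    "FacebookBot":        {"owner": "Meta",         "category": "training"},
    "Meta-ExternalAgent": {"owner": "Meta",         "category": "browsing"},
    "Amazonbot":          {"owner": "Amazon",       "category": "training"},
    "Diffbot":            {"owner": "Diffbot",      "category": "training"},
    "ImagesiftBot":       {"owner": "Hive",         "category": "training"},
    "Omgili":             {"owner": "Webz.io",      "category": "training"},
}
```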
Inputs
url: The website URL to audit (the skill fetches /robots.txt from the site root)
- Normalize to domain root: `example.com/page` → `https://example.com/robots.txt` (see the sketch below)
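A minimal normalization sketch, assuming a Python implementation; the function name is illustrative.

```python
from urllib.parse import urlparse

def robots_url(raw: str) -> str:
    """Normalize a page URL or bare domain to its robots.txt URL."""
    if "://" not in raw:            # bare domains lack a scheme; assume https
        raw = "https://" + raw
    parsed = urlparse(raw)
    return f"{parsed.scheme}://{parsed.netloc}/robots.txt"

assert robots_url("example.com/page") == "https://example.com/robots.txt"
```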
Execution
1. Fetch robots.txt: WebFetch `<domain>/robots.txt`
   - If 404 → report "No robots.txt found — all crawlers allowed by default"
   - If 200 → proceed to parse
2. Parse User-agent blocks: Extract all `User-agent` directives and their associated `Allow`/`Disallow` rules (see the parsing sketch after this list).
3. Check each AI crawler: For each bot in the registry, determine access:
   - Allowed — no specific block, or explicit `Allow: /`
   - Blocked — `Disallow: /` for this User-agent
   - Partial — some paths blocked, others allowed (list specifics)
   - Inherited — falls under `User-agent: *` rules (note this)
4. Check wildcard rules: If `User-agent: *` has `Disallow: /`, note that ALL bots (including AI) are blocked unless explicitly allowed.
5. Check for ai.txt: WebFetch `<domain>/ai.txt` — an emerging standard for AI-specific crawler policies. Report if found and summarize contents.
6. Check for llms.txt: WebFetch `<domain>/llms.txt` — report if found (cross-reference with `/seo llms-txt` for a full audit).
7. Analyze crawl-delay: Note any `Crawl-delay` directives that affect AI bots specifically or via wildcard.
8. Check sitemap declaration: Note whether a `Sitemap:` directive is present (it helps AI crawlers discover content).
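For steps 2-4, a minimal parsing sketch, assuming a Python implementation. Function names are illustrative, and the group matching is deliberately simplified (exact User-agent token with a `*` fallback); real robots.txt matching is case-insensitive and uses longest-match path rules.

```python
def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each User-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (p.strip() for p in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:                      # a rule block ended; start new group
                agents, seen_rule = [], False
            agents.append(value)
            groups.setdefault(value, [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:               # consecutive User-agents share rules
                groups[agent].append((key, value))
    return groups

def access_status(bot: str, groups: dict[str, list[tuple[str, str]]]) -> str:
    """Classify one crawler as Allowed / Blocked / Partial, noting inheritance."""
    if bot in groups:
        rules, inherited = groups[bot], False
    else:                                      # step 4: fall back to wildcard group
        rules, inherited = groups.get("*", []), True
    disallowed = [path for d, path in rules if d == "disallow" and path]
    if "/" in disallowed:
        status = "Blocked"
    elif disallowed:
        status = "Partial"
    else:
        status = "Allowed"                     # includes empty "Disallow:" lines
    return f"{status} (inherited)" if inherited else status

sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
groups = parse_groups(sample)
print(access_status("GPTBot", groups))      # -> Blocked
print(access_status("ClaudeBot", groups))   # -> Allowed (inherited)
```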
Output Format
## AI Crawler Audit: [domain]
### Crawler Access Matrix
| Crawler | Owner | Status | Rule Source | Details |
|---|---|---|---|---|
| GPTBot | OpenAI | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ClaudeBot | Anthropic | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| PerplexityBot | Perplexity | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| Google-Extended | Google | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ... | ... | ... | ... | ... |
### AI Openness Score: X/10
Scoring (a mechanical sketch follows the rubric):
- 10/10 = All AI crawlers allowed, ai.txt present, llms.txt present
- 7-9 = Most crawlers allowed, some minor gaps
- 4-6 = Mixed policy — some allowed, some blocked
- 1-3 = Most AI crawlers blocked
- 0/10 = All AI crawlers blocked (or blanket Disallow: /)
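The rubric above is qualitative. If a mechanical approximation helps, one possible weighting is sketched below; the 8-point access base plus 1 point each for ai.txt and llms.txt is an assumption that merely matches the 10/10 line, not part of the spec.

```python
def openness_score(statuses: dict[str, str],
                   has_ai_txt: bool, has_llms_txt: bool) -> int:
    """Approximate the 0-10 rubric: up to 8 points for crawler access,
    plus 1 each for ai.txt and llms.txt. Weights are an assumption."""
    total = len(statuses) or 1
    allowed = sum(s.startswith("Allowed") for s in statuses.values())
    partial = sum(s.startswith("Partial") for s in statuses.values())
    base = 8 * (allowed + 0.5 * partial) / total
    return round(min(base + has_ai_txt + has_llms_txt, 10))
```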
### Key Findings
- **AI crawlers explicitly blocked**: [count] of [total]
- **AI crawlers explicitly allowed**: [count]
- **Falling under wildcard rules**: [count]
- **ai.txt present**: Yes/No
- **llms.txt present**: Yes/No
- **Sitemap declared**: Yes/No
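These counts fall out of the per-bot status map from the parsing sketch; a hypothetical aggregation (names reused from that sketch):

```python
def key_findings(statuses: dict[str, str]) -> dict[str, int]:
    """Tally Key Findings counts from access_status() outputs (sketch names)."""
    values = statuses.values()
    return {
        "explicitly_blocked": sum(s == "Blocked" for s in values),
        "explicitly_allowed": sum(s == "Allowed" for s in values),
        "wildcard_inherited": sum(s.endswith("(inherited)") for s in values),
        "total": len(statuses),
    }
```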
### Recommendations
Based on the site's apparent goals:
**If goal is maximum AI visibility:**
- [Specific recommendations to allow AI crawlers]
- [Suggest llms.txt creation if missing]
**If goal is AI protection:**
- [Note any crawlers not yet blocked]
- [Suggest ai.txt adoption]
**If goal is selective access:**
- [Recommend allowing search-focused bots: OAI-SearchBot, PerplexityBot]
- [Block training-only bots: CCBot, Bytespider]
- [Distinguish training vs search crawlers]
### Industry Context
Note how the site's policy compares to common patterns:
- Most major publishers block training bots but allow search bots
- Most SaaS companies allow all AI crawlers for visibility
- E-commerce sites typically allow all crawlers
- Media/news sites increasingly block training-only bots
### robots.txt Snippets
If the user wants to implement changes, provide ready-to-paste robots.txt
blocks for their chosen strategy:
**Allow all AI crawlers:**
```
# AI Crawlers — Allowed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```

**Block training, allow search:**

```
# AI Search — Allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training — Blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```
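If the agent assembles these snippets programmatically, one option is to derive them from the registry sketch above; the `category` tags there are illustrative assumptions, so treat this as a sketch rather than a canonical generator.

```python
def build_robots(registry: dict[str, dict[str, str]],
                 allow_categories: set[str]) -> str:
    """Emit one Allow/Disallow block per bot based on its category tag."""
    lines: list[str] = []
    for bot, meta in registry.items():
        verb = "Allow" if meta["category"] in allow_categories else "Disallow"
        lines += [f"User-agent: {bot}", f"{verb}: /", ""]
    return "\n".join(lines)

# "Block training, allow search" strategy from the snippets above:
print(build_robots(AI_CRAWLERS, allow_categories={"search", "browsing"}))
```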