Installation
$ npx skills add lionkiii/claude-seo-skills --skill seo-robots-ai

Summary
Fetch and analyze a site's robots.txt file specifically for AI crawler access policies, checking against a registry of AI bots (GPTBot, ClaudeBot, PerplexityBot, etc.) and generating an access matrix with an AI openness score. The agent can execute this workflow end-to-end, then recommend robots.txt changes aligned with the user's content protection or visibility goals.
SKILL.MD
AI Crawler Robots.txt Audit
Analyzes a site's robots.txt specifically for AI crawler access policies.
Complements /seo-technical (which does a broad robots.txt check) with
deep AI-specific analysis.
@skills/seo/references/ai-crawlers-guide.md
AI Crawler Registry
| Bot Name | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT web search |
| OAI-SearchBot | OpenAI | ChatGPT search only (not training) |
| ChatGPT-User | OpenAI | ChatGPT browsing (real-time) |
| ClaudeBot | Anthropic | Training data collection |
| anthropic-ai | Anthropic | Anthropic web crawler |
| PerplexityBot | Perplexity | AI search engine |
| Google-Extended | Google | Gemini / AI training (not Search) |
| Bytespider | ByteDance | TikTok / AI training |
| CCBot | Common Crawl | Open dataset used by many AI models |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | AI model training |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI browsing agent |
| Amazonbot | Amazon | Alexa / AI training |
| Diffbot | Diffbot | AI knowledge graph |
| ImagesiftBot | Hive | AI image training |
| Omgili | Webz.io | AI data feeds |
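If an implementation holds this registry in code, one possible shape is sketched below, assuming Python. The `category` tags are illustrative labels added to support the training-vs-search split used in the Recommendations section; they are not official designations by the bots' operators.

```python
# Illustrative registry; "category" tags are rough assumptions used by the
# later sketches to separate search/browsing bots from training-only bots.
AI_CRAWLERS: dict[str, dict[str, str]] = {
    "GPTBot":             {"owner": "OpenAI",       "category": "training"},
    "OAI-SearchBot":      {"owner": "OpenAI",       "category": "search"},
    "ChatGPT-User":       {"owner": "OpenAI",       "category": "browsing"},
    "ClaudeBot":          {"owner": "Anthropic",    "category": "training"},
    "anthropic-ai":       {"owner": "Anthropic",    "category": "training"},
    "PerplexityBot":      {"owner": "Perplexity",   "category": "search"},
    "Google-Extended":    {"owner": "Google",       "category": "training"},
    "Bytespider":         {"owner": "ByteDance",    "category": "training"},
    "CCBot":              {"owner": "Common Crawl", "category": "training"},
    "Applebot-Extended":  {"owner": "Apple",        "category": "training"},
    "cohere-ai":          {"owner": "Cohere",       "category": "training"},
    "FacebookBot":        {"owner": "Meta",         "category": "training"},
    "Meta-ExternalAgent": {"owner": "Meta",         "category": "browsing"},
    "Amazonbot":          {"owner": "Amazon",       "category": "training"},
    "Diffbot":            {"owner": "Diffbot",      "category": "training"},
    "ImagesiftBot":       {"owner": "Hive",         "category": "training"},
    "Omgili":             {"owner": "Webz.io",      "category": "training"},
}
```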
Inputs
url: The website URL to audit (the skill fetches /robots.txt from the site root)
- Normalize to domain root: `example.com/page` → `https://example.com/robots.txt` (see the sketch below)
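A minimal normalization sketch, assuming a Python implementation; the function name is illustrative.

```python
from urllib.parse import urlparse

def robots_url(raw: str) -> str:
    """Normalize a page URL or bare domain to its robots.txt URL."""
    if "://" not in raw:            # bare domains lack a scheme; assume https
        raw = "https://" + raw
    parsed = urlparse(raw)
    return f"{parsed.scheme}://{parsed.netloc}/robots.txt"

assert robots_url("example.com/page") == "https://example.com/robots.txt"
```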
Execution
1. Fetch robots.txt: WebFetch `<domain>/robots.txt`
   - If 404 → report "No robots.txt found — all crawlers allowed by default"
   - If 200 → proceed to parse
2. Parse User-agent blocks: Extract all `User-agent` directives and their associated `Allow`/`Disallow` rules (see the parsing sketch after this list).
3. Check each AI crawler: For each bot in the registry, determine access:
   - Allowed — no specific block, or explicit `Allow: /`
   - Blocked — `Disallow: /` for this User-agent
   - Partial — some paths blocked, others allowed (list specifics)
   - Inherited — falls under `User-agent: *` rules (note this)
4. Check wildcard rules: If `User-agent: *` has `Disallow: /`, note that ALL bots (including AI) are blocked unless explicitly allowed.
5. Check for ai.txt: WebFetch `<domain>/ai.txt` — an emerging standard for AI-specific crawler policies. Report if found and summarize contents.
6. Check for llms.txt: WebFetch `<domain>/llms.txt` — report if found (cross-reference with `/seo llms-txt` for a full audit).
7. Analyze crawl-delay: Note any `Crawl-delay` directives that affect AI bots specifically or via wildcard.
8. Check sitemap declaration: Note whether a `Sitemap:` directive is present (it helps AI crawlers discover content).
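For steps 2-4, a minimal parsing sketch, assuming a Python implementation. Function names are illustrative, and the group matching is deliberately simplified (exact User-agent token with a `*` fallback); real robots.txt matching is case-insensitive and uses longest-match path rules.

```python
def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each User-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (p.strip() for p in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:                      # a rule block ended; start new group
                agents, seen_rule = [], False
            agents.append(value)
            groups.setdefault(value, [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:               # consecutive User-agents share rules
                groups[agent].append((key, value))
    return groups

def access_status(bot: str, groups: dict[str, list[tuple[str, str]]]) -> str:
    """Classify one crawler as Allowed / Blocked / Partial, noting inheritance."""
    if bot in groups:
        rules, inherited = groups[bot], False
    else:                                      # step 4: fall back to wildcard group
        rules, inherited = groups.get("*", []), True
    disallowed = [path for d, path in rules if d == "disallow" and path]
    if "/" in disallowed:
        status = "Blocked"
    elif disallowed:
        status = "Partial"
    else:
        status = "Allowed"                     # includes empty "Disallow:" lines
    return f"{status} (inherited)" if inherited else status

sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
groups = parse_groups(sample)
print(access_status("GPTBot", groups))      # -> Blocked
print(access_status("ClaudeBot", groups))   # -> Allowed (inherited)
```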
Output Format
## AI Crawler Audit: [domain]
### Crawler Access Matrix
| Crawler | Owner | Status | Rule Source | Details |
|---|---|---|---|---|
| GPTBot | OpenAI | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ClaudeBot | Anthropic | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| PerplexityBot | Perplexity | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| Google-Extended | Google | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ... | ... | ... | ... | ... |
### AI Openness Score: X/10
Scoring (a mechanical sketch follows the rubric):
- 10/10 = All AI crawlers allowed, ai.txt present, llms.txt present
- 7-9 = Most crawlers allowed, some minor gaps
- 4-6 = Mixed policy — some allowed, some blocked
- 1-3 = Most AI crawlers blocked
- 0/10 = All AI crawlers blocked (or blanket Disallow: /)
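The rubric above is qualitative. If a mechanical approximation helps, one possible weighting is sketched below; the 8-point access base plus 1 point each for ai.txt and llms.txt is an assumption that merely matches the 10/10 line, not part of the spec.

```python
def openness_score(statuses: dict[str, str],
                   has_ai_txt: bool, has_llms_txt: bool) -> int:
    """Approximate the 0-10 rubric: up to 8 points for crawler access,
    plus 1 each for ai.txt and llms.txt. Weights are an assumption."""
    total = len(statuses) or 1
    allowed = sum(s.startswith("Allowed") for s in statuses.values())
    partial = sum(s.startswith("Partial") for s in statuses.values())
    base = 8 * (allowed + 0.5 * partial) / total
    return round(min(base + has_ai_txt + has_llms_txt, 10))
```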
### Key Findings
- **AI crawlers explicitly blocked**: [count] of [total]
- **AI crawlers explicitly allowed**: [count]
- **Falling under wildcard rules**: [count]
- **ai.txt present**: Yes/No
- **llms.txt present**: Yes/No
- **Sitemap declared**: Yes/No
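These counts fall out of the per-bot status map from the parsing sketch; a hypothetical aggregation (names reused from that sketch):

```python
def key_findings(statuses: dict[str, str]) -> dict[str, int]:
    """Tally Key Findings counts from access_status() outputs (sketch names)."""
    values = statuses.values()
    return {
        "explicitly_blocked": sum(s == "Blocked" for s in values),
        "explicitly_allowed": sum(s == "Allowed" for s in values),
        "wildcard_inherited": sum(s.endswith("(inherited)") for s in values),
        "total": len(statuses),
    }
```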
### Recommendations
Based on the site's apparent goals:
**If goal is maximum AI visibility:**
- [Specific recommendations to allow AI crawlers]
- [Suggest llms.txt creation if missing]
**If goal is AI protection:**
- [Note any crawlers not yet blocked]
- [Suggest ai.txt adoption]
**If goal is selective access:**
- [Recommend allowing search-focused bots: OAI-SearchBot, PerplexityBot]
- [Block training-only bots: CCBot, Bytespider]
- [Distinguish training vs search crawlers]
### Industry Context
Note how the site's policy compares to common patterns:
- Most major publishers block training bots but allow search bots
- Most SaaS companies allow all AI crawlers for visibility
- E-commerce sites typically allow all crawlers
- Media/news sites increasingly block training-only bots
### robots.txt Snippets
If the user wants to implement changes, provide ready-to-paste robots.txt
blocks for their chosen strategy:
**Allow all AI crawlers:**
```
# AI Crawlers — Allowed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```

**Block training, allow search:**

```
# AI Search — Allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training — Blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```
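If the agent assembles these snippets programmatically, one option is to derive them from the registry sketch above; the `category` tags there are illustrative assumptions, so treat this as a sketch rather than a canonical generator.

```python
def build_robots(registry: dict[str, dict[str, str]],
                 allow_categories: set[str]) -> str:
    """Emit one Allow/Disallow block per bot based on its category tag."""
    lines: list[str] = []
    for bot, meta in registry.items():
        verb = "Allow" if meta["category"] in allow_categories else "Disallow"
        lines += [f"User-agent: {bot}", f"{verb}: /", ""]
    return "\n".join(lines)

# "Block training, allow search" strategy from the snippets above:
print(build_robots(AI_CRAWLERS, allow_categories={"search", "browsing"}))
```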