2026 Edition

Robots.txt Generator 2026

The most comprehensive robots.txt builder online. Control AI crawlers, block bad bots, apply CMS-specific rules for WordPress, Shopify, Wix, and more. Generate, preview, and download your file instantly.

Import Existing File

Already have a robots.txt? Upload it to populate all settings automatically and see exactly what you have.

Drag and drop your robots.txt here, or click to browse
.txt files only • max 512 KB • never uploaded to any server
1. Site Setup

Enter your site URL and select your platform. Your URL is used to auto-build your sitemap path and is never stored.

2. Global Rules

These apply to all crawlers (User-agent: *). Select common patterns to disallow.
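As an illustration, global rules rendered under User-agent: * might look like this (the paths are hypothetical examples, not recommendations):

```txt
# Rules every crawler inherits
User-agent: *
Disallow: /search/        # example: internal search result pages
Disallow: /tmp/           # example: throwaway path
Disallow: /*?sessionid=   # example: parameterized duplicate URLs
```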

3. CMS Patterns

Platform-specific paths pre-selected based on your chosen CMS. Toggle any individually.

Note: /wp-admin/admin-ajax.php is auto-allowed when admin is blocked.
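A sketch of the pairing this note describes; under longest-match precedence, the more specific Allow wins over the broader Disallow:

```txt
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php   # keeps front-end AJAX requests crawlable
```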

No CMS-specific patterns exist for static HTML or custom platforms. Use Global Rules and the custom area below instead.

4. E-Commerce Rules

Block transactional and user-specific pages that offer no SEO value and waste crawl budget.

5. Good Bots

Optionally add explicit Allow rules for major search engines. Useful after restrictive global rules. Note: Allow is part of the Robots Exclusion Protocol (RFC 9309) and is supported by Google and Bing, but smaller crawlers may ignore it.
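For example, a restrictive default followed by explicit re-allows might look like this (whether you want a pattern this strict depends on your site):

```txt
# Default: block everything
User-agent: *
Disallow: /

# Re-open the site for major search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```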

6. AI Crawlers

Control access for AI and LLM crawlers individually. Each bot can be allowed, blocked, or left unspecified (inherits global rules).
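A sketch of the three states, using bots from the list below:

```txt
# Blocked: opts out of model training
User-agent: GPTBot
Disallow: /

# Allowed: stays visible in AI search
User-agent: OAI-SearchBot
Allow: /

# ChatGPT-User has no group of its own here,
# so it falls back to the User-agent: * rules
```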

OpenAI
GPTBot ChatGPT training + search
OAI-SearchBot OpenAI search indexing
ChatGPT-User User-initiated fetches
Anthropic
ClaudeBot Claude AI training
Claude-User User-initiated Claude fetches
Claude-SearchBot Claude search indexing
Perplexity AI
PerplexityBot Perplexity search indexing
Perplexity-User User query retrieval
Warning: Perplexity has been documented bypassing robots.txt blocks with undeclared, generic user-agent strings, so robots.txt may not fully stop it.
Google AI
Google-Extended Gemini AI training opt-out (formerly Bard)
Apple
Applebot Standard Apple crawler
Applebot-Extended Apple AI training opt-out
Meta / Facebook
FacebookBot Meta AI crawler
meta-externalagent Meta external agent
ByteDance / TikTok
Bytespider ByteDance LLM training
Other AI Crawlers
CCBot Common Crawl (AI training data)
Diffbot Structured data aggregator
cohere-ai Cohere AI training
anthropic-ai Alternative Anthropic identifier
7. Bad Bots

Block SEO scrapers and spam crawlers that waste bandwidth and scrape your content for competitor intelligence tools.

8. Sitemap & Advanced

Add your sitemap URL and configure optional advanced directives.

Live Preview
Learn More

Robots.txt Explained

Everything you need to know about robots.txt in 2026, from basic directives to AI crawler controls.

What is robots.txt?

A plain-text file placed at the root of your domain (yoursite.com/robots.txt) that tells crawlers which pages they can or cannot access. Well-behaved crawlers fetch this file before crawling any page on your site.

  • User-agent: specifies which bot the rules apply to (* = all bots)
  • Disallow: tells bots not to crawl a path
  • Allow: overrides a Disallow for a more specific sub-path (longest match wins)
  • Sitemap: points crawlers to your XML sitemap
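Putting the four directives together (yoursite.com and the paths are placeholders):

```txt
User-agent: *                 # the rules below apply to all bots
Disallow: /private/           # do not crawl this path
Allow: /private/press-kit/    # except this more specific sub-path
Sitemap: https://yoursite.com/sitemap.xml
```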

AI Crawlers in 2026

AI crawlers have exploded in 2026. GPTBot traffic grew 305% year-over-year. Each major AI company now runs three separate bots: one for training, one for user-initiated fetches, and one for search indexing. This lets you allow AI search visibility while blocking training data scraping.

Note: Perplexity has documented compliance issues, including the use of undeclared user-agent strings. robots.txt alone may not fully stop it.

Common Mistakes

  • Blocking CSS and JS files (prevents proper rendering by Google)
  • Using Disallow: / for all bots and forgetting to re-allow search engines
  • Relying on robots.txt to protect sensitive content (it is advisory only)
  • Exceeding the 500 KiB (512 KB) file size limit (Google silently ignores anything past it)
  • Using Crawl-delay expecting Google to respect it (it does not)

robots.txt vs llms.txt

robots.txt controls whether bots can crawl and index your pages. It is the established standard, respected by all major search engines and AI crawlers.

llms.txt is an emerging (not yet standardized) Markdown file at yoursite.com/llms.txt that guides AI assistants on how to interpret, summarize, and cite your content. Claude (Anthropic) officially supports it. Google does not yet.

FAQ

Frequently Asked Questions

What does a robots.txt file do, and do I need one?

A robots.txt file instructs web crawlers which pages or sections of your site to avoid. Without one, all bots can crawl everything. A well-configured robots.txt protects sensitive admin pages, prevents duplicate content issues from search parameters, and conserves your crawl budget for the pages that matter.
How do I block AI crawlers like GPTBot?

Use the AI Crawlers section above. Each bot can be individually allowed, blocked, or left unspecified. Blocking a bot adds a User-agent: GPTBot group followed by Disallow: /. OpenAI, Anthropic, and Google all formally respect robots.txt. The "Block Training Only" preset allows search/answer bots while blocking training crawlers.
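Under a "Block Training Only" style configuration, the generated output might resemble the following (the exact preset contents may differ):

```txt
# Training crawlers blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Search/answer bots left open
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /
```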
Should I use Crawl-delay?

For most sites, no. Google has ignored Crawl-delay for years. Bing and Yandex still support it. Modern bots automatically detect server stress via HTTP 429 responses. If you have a genuinely resource-limited server, use Google Search Console to set crawl rate limits directly instead.
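If you do target Bing or Yandex with it, the directive looks like this (the 10-second value is only an example):

```txt
User-agent: Bingbot
Crawl-delay: 10   # ask for ~10 seconds between requests; Google ignores this
```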
Is there a file size limit for robots.txt?

Google enforces a hard limit of 500 KiB (512,000 bytes). Any content past this point is silently ignored, with no warning or error. This generator shows your file size in real time so you can stay well under the limit. Most sites should never get close to it.
What is llms.txt, and should I have one?

llms.txt is a Markdown file at yoursite.com/llms.txt that provides guidance to AI assistants about how to interpret your content. It is not yet a formal web standard and Google does not support it. Anthropic (Claude) officially endorses it. If AI visibility matters for your site, it is worth adding alongside your robots.txt.
Select "WordPress" from the Platform/CMS dropdown above. The tool will pre-check the standard WordPress paths: /wp-admin/ (with /wp-admin/admin-ajax.php auto-allowed), /wp-includes/, /wp-json/, /xmlrpc.php, /trackback/, and /feed/. You can toggle each individually and add custom paths.
Can robots.txt protect private or sensitive content?

robots.txt is advisory, not a security mechanism. Legitimate crawlers (Google, Bing, major AI companies) respect it. Malicious scrapers and bad actors may not. For genuinely sensitive content, use password protection, server-level access controls, or authentication. Never rely on robots.txt alone to protect private data.
Can I import an existing robots.txt file?

Yes. Use the "Import Existing File" section at the top of the form. Drag and drop your robots.txt or click to browse. The tool validates the file client-side (only .txt files up to 512 KB are accepted, content is checked for valid directives), then parses it and populates all the form fields automatically. Your file is never sent to any server.