robots.txt is a file at the root of your domain that tells web crawlers which pages they can visit. It follows the Robots Exclusion Protocol — it's a courtesy, not a technical barrier.

How do I block AI crawlers in robots.txt?

Add one group per crawler: `User-agent: GPTBot` followed by `Disallow: /` for OpenAI, `User-agent: Claude-Web` followed by `Disallow: /` for Anthropic, and `User-agent: CCBot` followed by `Disallow: /` for Common Crawl. Check the result at https://subs.gr/tools/robots.

llms.txt is an unofficial standard (2024) for making a site's content understandable to AI systems. It lives at /llms.txt in Markdown format and contains a structured summary of the site and its pages.

robots.txt and llms.txt — Guide to Crawling & AI Indexing

How robots.txt works, what Allow/Disallow rules mean, how to control AI crawlers, and what the new llms.txt is for.

What Is robots.txt?

The robots.txt file sits at the root of your domain (e.g. https://example.com/robots.txt) and tells web crawlers which pages they're allowed to visit and which they're not. It follows the Robots Exclusion Protocol (REP).

Important: robots.txt is a courtesy, not prevention — a malicious crawler can ignore it. For real protection you need authentication, a firewall or password protection.
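
As a small illustration of the location rule, the sketch below derives the robots.txt URL for any page URL. The function name `robots_url` is hypothetical; it relies only on Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
```

Because the file is looked up per host, a subdomain such as blog.example.com needs its own robots.txt.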

robots.txt Structure

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
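
To see how a crawler might read this example, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `SomeBot` is made up for illustration, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "SomeBot" (hypothetical) has no group of its own, so the "*" group applies.
print(parser.can_fetch("SomeBot", "https://example.com/admin/users"))    # False
print(parser.can_fetch("SomeBot", "https://example.com/public/page"))    # True

# Googlebot has its own group, so only that group's rules apply to it.
print(parser.can_fetch("Googlebot", "https://example.com/no-google/x"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # True

# Sitemap lines are independent of any User-agent group (Python 3.8+).
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Note the last Googlebot check: a crawler that matches a named group ignores the `*` group entirely, so rules meant for everyone must be repeated in each named group.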

Core Directives

Directive	What it does
`User-agent`	Specifies which crawler the rule applies to (`*` = all)
`Disallow`	Blocks access to the path and its subpaths
`Allow`	Permits access — used for exceptions within a Disallow block
`Sitemap`	Points to the XML sitemap location — helps crawlers discover new content faster
`Crawl-delay`	Delay (seconds) between requests — supported by Bing, not by Google

Allow vs Disallow — Priority

When a path matches both an Allow and a Disallow rule, the more specific rule (the one with the longer matching path) wins, regardless of the order in which the rules appear:

Disallow: /images/
Allow: /images/public/

# Result:
# /images/private.jpg → BLOCKED
# /images/public/logo.png → ALLOWED
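
The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: the function name `is_allowed` and the rule-tuple format are assumptions, and wildcards (`*`, `$`) are not handled. It is written by hand because Python's built-in `urllib.robotparser` applies the first matching rule instead of the longest one.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to (kind, prefix) rules.

    kind is "allow" or "disallow"; prefix is a plain path prefix.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        # A longer prefix wins; on a tie, prefer the Allow rule.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [("disallow", "/images/"), ("allow", "/images/public/")]

print(is_allowed("/images/private.jpg", RULES))      # False
print(is_allowed("/images/public/logo.png", RULES))  # True
print(is_allowed("/about", RULES))                   # True
```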

Common User-Agents

User-agent	Crawler
`Googlebot`	Google web crawler
`Bingbot`	Microsoft Bing
`GPTBot`	OpenAI training crawler
`ClaudeBot`	Anthropic web crawler (older lists use `Claude-Web`)
`Applebot`	Apple Siri / Spotlight
`*`	All crawlers

Check your domain's robots.txt and llms.txt instantly:

→ Robots & LLMs Checker

What Is llms.txt?

llms.txt is a new (2024) and still unofficial standard aimed at making a site's content understandable to AI systems (LLMs). It lives at /llms.txt and contains a structured Markdown summary of the site.
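
As a rough sketch of what such a file can look like, the Python below assembles a minimal llms.txt. The layout (a title heading, a quoted one-line summary, then sections of links) follows the public llms.txt proposal as I understand it; the function name `build_llms_txt`, the site name and every URL are made up for illustration.

```python
def build_llms_txt(site_name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Build llms.txt content from section name -> [(title, url, note)]."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data, for illustration only.
LLMS_TXT = build_llms_txt(
    "Example Site",
    "Guides and tools for site owners.",
    {
        "Guides": [
            ("Robots guide", "https://example.com/guides/robots",
             "How robots.txt works"),
        ],
    },
)
print(LLMS_TXT)
```

The result would be served as plain text at /llms.txt, next to robots.txt at the site root.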

Separately, many site owners use robots.txt to control which AI crawlers can access their content:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

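# ClaudeBot is the name Anthropic documents; Claude-Web appears in older lists
User-agent: ClaudeBot
User-agent: Claude-Web
Disallow: /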

User-agent: CCBot
Disallow: /

# Still allow Google to index normally
User-agent: Googlebot
Allow: /
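
One way to sanity-check rules like these before deploying is to run them through Python's standard `urllib.robotparser`. This is a sketch under assumptions: the helper name `allowed_agents` and the test URL are hypothetical, and a real crawler's own parser may behave differently.

```python
from urllib.robotparser import RobotFileParser

AI_BLOCK = """\
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent to whether it may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    AI_BLOCK,
    ["GPTBot", "Claude-Web", "CCBot", "Googlebot", "Bingbot"],
    "https://example.com/article",
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Bingbot comes out as allowed because no group names it and there is no `*` group. A blocklist like this only affects the crawlers it lists by name.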

Common robots.txt Mistakes

Blocking Googlebot from CSS/JS

Blocking Googlebot from loading CSS or JavaScript means it can't properly "see" your pages, which can hurt how they are rendered and ranked. Avoid rules like this:

# WRONG — blocks CSS/JS from Googlebot
User-agent: Googlebot
Disallow: /wp-content/
Disallow: /assets/

Accidentally Blocking the Entire Site

# WRONG — blocks everything
User-agent: *
Disallow: /

# OK during construction, BUT:
# Make sure to remove this when you go live!
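
Both mistakes above can be caught with a small pre-launch check. The sketch below uses Python's standard `urllib.robotparser`; the helper name `blocked_urls` and the sample URLs are hypothetical and should be replaced with pages and assets your site actually needs crawled.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, agent: str, urls: list[str]) -> list[str]:
    """Return the URLs from urls that agent may NOT fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical URLs a crawler needs in order to render the site.
MUST_BE_CRAWLABLE = [
    "https://example.com/",
    "https://example.com/assets/site.css",
    "https://example.com/wp-content/themes/demo/app.js",
]

LEFTOVER_FROM_STAGING = "User-agent: *\nDisallow: /\n"
ASSET_BLOCK = "User-agent: Googlebot\nDisallow: /wp-content/\nDisallow: /assets/\n"

# A non-empty list means something important is blocked.
print(blocked_urls(LEFTOVER_FROM_STAGING, "Googlebot", MUST_BE_CRAWLABLE))
print(blocked_urls(ASSET_BLOCK, "Googlebot", MUST_BE_CRAWLABLE))
```

Running a check like this in a deploy pipeline is one option for making sure a staging robots.txt never reaches production unnoticed.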

Keep in Mind

Disallow: /page does NOT prevent indexing if other pages link to it
For reliable index exclusion, use a noindex meta tag or an X-Robots-Tag HTTP header, and leave the page crawlable so crawlers can actually see the directive
robots.txt is cached by crawlers — changes may take time to take effect
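
To check whether a page actually carries a noindex signal, something like the following Python sketch could be used. The function name `has_noindex` is hypothetical, and the check is simplified: it ignores crawler-specific forms such as `googlebot: noindex` in the header or `<meta name="googlebot">` tags.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """True if the X-Robots-Tag header or a robots meta tag says noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    meta = _RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, ""))         # True
print(has_noindex({}, '<meta name="robots" content="noindex">'))      # True
print(has_noindex({}, "<p>No directive here</p>"))                    # False
```

The header form is useful for non-HTML files such as PDFs, which have no place for a meta tag.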

Frequently Asked Questions

If I Disallow a page, will Google still see it?

It won't crawl it, but it may still index the URL if it finds links pointing to it from other pages, just without knowing the content. For reliable index exclusion, use <meta name="robots" content="noindex"> and do not Disallow the page, because a crawler has to fetch the page to see the tag.

Is robots.txt mandatory?

No — if it doesn't exist, crawlers assume they can access everything. But it's recommended to always have one, even if it only contains a Sitemap directive to help crawlers discover your content.
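
How a crawler reacts to a missing file can be sketched as a mapping from the HTTP status of /robots.txt to a policy. The mapping below follows my reading of RFC 9309 (a 4xx response means the file is unavailable and crawling is unrestricted, a 5xx response means the crawler should assume everything is disallowed). The function name and return labels are hypothetical, and individual crawlers deviate from this in practice.

```python
def robots_policy_for_status(status: int) -> str:
    """Map the HTTP status of a /robots.txt request to a crawl policy."""
    if 200 <= status < 300:
        return "parse"         # file exists: read and apply its rules
    if 400 <= status < 500:
        return "allow_all"     # e.g. 404: no file, so nothing is restricted
    if 500 <= status < 600:
        return "disallow_all"  # server error: assume nothing may be crawled
    return "other"             # e.g. redirects: follow them, then re-evaluate


for code in (200, 404, 503):
    print(code, robots_policy_for_status(code))
# 200 parse
# 404 allow_all
# 503 disallow_all
```

This is one more reason to serve a real robots.txt: a server that errors on the request can pause crawling of the whole site.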

Should I block AI crawlers?

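It depends. If you want to keep your content out of AI training sets, block training crawlers such as GPTBot, CCBot and Claude-Web. If you want to appear in AI search results (ChatGPT Search, Perplexity), leave the crawlers that power those products unblocked. Training and search crawlers often use different user-agents, so check each vendor's documentation before blocking.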

What is Crawl-delay and should I use it?

Crawl-delay asks crawlers to wait X seconds between requests to reduce server load. Bing respects it. Google ignores it: Googlebot adjusts its crawl rate automatically based on how your server responds, and the legacy crawl-rate limiter in Google Search Console was retired in early 2024.
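
For crawlers that do honor the directive, reading it could look like the sketch below, which uses `crawl_delay()` from Python's standard `urllib.robotparser`. The 10-second value, the `SomeBot` name and the helper `delay_for` are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(agent: str, default: float = 0.0) -> float:
    """Seconds to wait between requests for agent, or default if unset."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


print(delay_for("Bingbot"))  # 10.0
print(delay_for("SomeBot"))  # 0.0 (the "*" group sets no Crawl-delay)
```

A polite crawler would call `time.sleep(delay_for(agent))` between consecutive requests to the same host.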
