robots.txt and llms.txt — Guide to Crawling & AI Indexing

How robots.txt works, what the Allow and Disallow rules mean, what the new llms.txt file is for, and how to control AI crawlers.

What Is robots.txt?

The robots.txt file sits at the root of your domain (e.g. https://example.com/robots.txt) and tells web crawlers which pages they're allowed to visit and which they're not. It follows the Robots Exclusion Protocol (REP).
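
Because the file always lives at the root of the host, its location can be derived from any page URL. A minimal Python sketch of that rule (the helper name `robots_url` is made up for this example):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
```

Note that the host is part of the result, so a subdomain such as shop.example.com is covered by its own robots.txt, not by the one on example.com.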

Important: robots.txt is a courtesy, not an access control. A malicious crawler can simply ignore it. For real protection, use authentication (for example password protection) or a firewall.

robots.txt Structure

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
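
To see how a crawler might read this file, here is a small sketch using Python's standard-library `urllib.robotparser`. The bot name `SomeBot` and the tested paths are invented for illustration. One detail worth noticing: a crawler that finds a group with its own name uses only that group, so Googlebot is not bound by the `*` rules in this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parsed rules whether `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("SomeBot", "/admin/settings"))     # False (falls under the * group)
print(allowed("SomeBot", "/public/index.html"))  # True
print(allowed("Googlebot", "/no-google/page"))   # False (its own group)
print(allowed("Googlebot", "/admin/settings"))   # True (the * group is not applied)
print(parser.site_maps())                        # ['https://example.com/sitemap.xml']
```

If Googlebot should also stay out of /admin/, the Disallow line has to be repeated inside the Googlebot group. `site_maps()` requires Python 3.8 or newer.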

Core Directives

Directive     What it does
User-agent    Specifies which crawler the rule applies to (* = all)
Disallow      Blocks access to the path and its subpaths
Allow         Permits access; used for exceptions within a Disallow block
Sitemap       Points to the XML sitemap location; helps crawlers discover new content faster
Crawl-delay   Delay (seconds) between requests; supported by Bing, not by Google

Allow vs Disallow — Priority

When a path matches both an Allow and a Disallow rule, the more specific rule wins, meaning the one with the longer matching path. If both rules are equally long, the Allow rule should be used:

Disallow: /images/
Allow: /images/public/

# Result:
# /images/private.jpg → BLOCKED
# /images/public/logo.png → ALLOWED
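
The precedence rule can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: it assumes plain prefix rules (no `*` or `$` wildcards), and the function name `is_allowed` is invented here. It is written by hand because Python's `urllib.robotparser` evaluates rules in file order rather than by longest match.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match check. rules holds (directive, path_prefix) pairs."""
    best_len = -1
    best_allow = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rule
        allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow is preferred.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow


rules = [("Disallow", "/images/"), ("Allow", "/images/public/")]
print(is_allowed("/images/private.jpg", rules))      # False
print(is_allowed("/images/public/logo.png", rules))  # True
```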

Common User-Agents

User-agent    Crawler
Googlebot     Google web crawler
Bingbot       Microsoft Bing
GPTBot        OpenAI training crawler
Claude-Web    Anthropic web crawler
Applebot      Apple Siri / Spotlight
*             All crawlers

What Is llms.txt?

llms.txt is a new (2024) and still unofficial proposal aimed at making a site's content easier for AI systems (LLMs) to understand. It lives at /llms.txt and contains a structured Markdown summary of the site. Unlike robots.txt, it does not grant or deny crawler access.
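
As a rough illustration of what such a file might contain, the Python sketch below renders a minimal llms.txt. The layout (a title heading, a short quoted summary, then sections of annotated links) follows the public proposal as commonly described and should be treated as an assumption; the helper `render_llms_txt`, the site name, and every URL are hypothetical.

```python
def render_llms_txt(title: str, summary: str,
                    sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render a minimal llms.txt body as Markdown text."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            # One annotated link per line.
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


print(render_llms_txt(
    "Example Site",
    "Guides about crawling and indexing.",
    {"Docs": [("robots.txt guide", "https://example.com/robots-guide.md",
               "How Allow and Disallow rules work")]},
))
```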

Separately, many site owners use robots.txt to control which AI crawlers can access their content:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

# Still allow Google to index normally
User-agent: Googlebot
Allow: /
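
A block list like this can be generated and sanity-checked with a short Python sketch. The helper name `build_block_rules` and the test URL are invented for the example, and the agent list simply mirrors the one above.

```python
from urllib.robotparser import RobotFileParser

AI_TRAINING_AGENTS = ["GPTBot", "Claude-Web", "CCBot"]


def build_block_rules(agents: list[str]) -> str:
    """Return robots.txt text with one 'Disallow: /' group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(groups)


robots_txt = build_block_rules(AI_TRAINING_AGENTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

The second check shows that a crawler with no matching group (and no `*` group) is allowed by default, so the explicit Googlebot group is a readability aid and not a requirement.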

Common robots.txt Mistakes

Blocking Googlebot from CSS/JS

Blocking Googlebot from loading CSS or JavaScript means it can't properly render ("see") your pages, which can hurt ranking. Never do:

# WRONG — blocks CSS/JS from Googlebot
User-agent: Googlebot
Disallow: /wp-content/
Disallow: /assets/

Accidentally Blocking the Entire Site

# WRONG — blocks everything
User-agent: *
Disallow: /

# OK during construction, BUT:
# Make sure to remove this when you go live!
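
One way to catch a leftover site-wide block is a small check in the deployment pipeline. The sketch below is an assumption about how such a check could look, using Python's standard-library parser; the function name `blocks_entire_site` and the placeholder agent `AnyBot` are made up.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, agent: str = "AnyBot") -> bool:
    """Return True if `agent` is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # An agent without its own group falls back to the * group, if any.
    return not parser.can_fetch(agent, "/")


print(blocks_entire_site("User-agent: *\nDisallow: /\n"))        # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/\n"))  # False
```

Failing the build when this returns True for the production file is a cheap safeguard against shipping the construction-time rules.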

Keep in Mind

  • Disallow: /page does NOT prevent indexing if backlinks to the page exist
  • Use a noindex meta tag or X-Robots-Tag HTTP header for reliable index exclusion, and leave the page crawlable so the crawler can actually see the directive
  • robots.txt is cached by crawlers — changes may take time to take effect
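
To verify that a page really carries a noindex signal, both places can be checked. The following Python sketch works on an already fetched response (HTML text plus a header dictionary), so it makes no network calls; the names `has_noindex` and `_RobotsMetaParser` are invented for this example.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """Return True if the header or the meta tag asks for noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page, {}))                                # True
print(has_noindex("<p>hi</p>", {"X-Robots-Tag": "noindex"}))  # True
print(has_noindex("<p>hi</p>", {}))                         # False
```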

Frequently Asked Questions

If I Disallow a page, will Google still see it?
It won't crawl the page, but it may still index the URL if it finds links pointing to it from other pages, just without knowing the content. For reliable index exclusion, use <meta name="robots" content="noindex"> and do not Disallow the page, because the tag is only seen when the page is crawled.
Is robots.txt mandatory?
No — if it doesn't exist, crawlers assume they can access everything. But it's recommended to always have one, even if it only contains a Sitemap directive to help crawlers discover your content.
Should I block AI crawlers?
It depends. If you want to keep your content from being used for AI training, block GPTBot, CCBot, Claude-Web and similar crawlers. If you want to appear in AI search results (ChatGPT Search, Perplexity), let the corresponding search crawlers, such as OAI-SearchBot and PerplexityBot, crawl you.
What is Crawl-delay and should I use it?
Crawl-delay asks crawlers to wait X seconds between requests to reduce server load. Bing respects it. Google ignores it, and the crawl rate limiter in Google Search Console was retired in early 2024; to slow Googlebot down temporarily, Google's guidance is to return 503 or 429 responses.
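
For anyone writing a crawler of their own, the requested delay can be read with Python's standard-library parser. This is a minimal sketch; the 10-second value and the /search/ path are made-up example data.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("Bingbot"))    # 10
print(parser.crawl_delay("Googlebot"))  # None (no group sets a delay for it)
```

A polite crawler would sleep for the returned number of seconds between requests and fall back to its own default when the result is None.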
