robots.txt and llms.txt — Guide to Crawling & AI Indexing
How robots.txt works, what Allow and Disallow rules mean, what the new llms.txt file is for, and how to control AI crawlers.
What Is robots.txt?
The robots.txt file sits at the root of your domain (e.g. https://example.com/robots.txt) and tells web crawlers which pages they're allowed to visit and which they're not. It follows the Robots Exclusion Protocol (REP).
Important: robots.txt is advisory, not an access control. A malicious crawler can simply ignore it. For real protection, put the content behind authentication or block requests at the firewall.
robots.txt Structure
    # Example robots.txt
    User-agent: *
    Disallow: /admin/
    Disallow: /private/
    Allow: /public/

    User-agent: Googlebot
    Disallow: /no-google/

    Sitemap: https://example.com/sitemap.xml
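One way to sanity-check a file like this is to feed it to Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: it embeds the example rules as a string instead of fetching them, and `SomeBot` is a made-up crawler name. It also shows a detail that is easy to miss: a crawler with its own group (Googlebot here) follows only that group, not the rules listed under `*`.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, embedded as a string (nothing is fetched).
ROBOTS_TXT = """\
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


# "SomeBot" is a hypothetical crawler with no group of its own,
# so it falls back to the rules under "User-agent: *".
print(allowed("SomeBot", "/admin/settings"))
print(allowed("SomeBot", "/public/page.html"))

# Googlebot has its own group, so only /no-google/ is off limits for it.
print(allowed("Googlebot", "/no-google/x"))
print(allowed("Googlebot", "/admin/settings"))

# Sitemap lines are collected separately (Python 3.8+).
print(parser.site_maps())
```

If Googlebot should also stay out of /admin/, the Disallow line has to be repeated inside the Googlebot group.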
Core Directives
| Directive | What it does |
|---|---|
| User-agent | Specifies which crawler the rule applies to (* = all) |
| Disallow | Blocks access to the path and its subpaths |
| Allow | Permits access; used for exceptions within a Disallow block |
| Sitemap | Points to the XML sitemap location, which helps crawlers discover new content faster |
| Crawl-delay | Delay in seconds between requests; supported by Bing, ignored by Google |
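As a small illustration of how a crawler might read the Crawl-delay directive, here is a sketch using Python's standard-library `urllib.robotparser`. The rules and the `/search/` path are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask Bingbot to wait 10 seconds between requests.
rules = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the delay for the matching group,
# or None when no group applies to the given crawler.
bing_delay = parser.crawl_delay("Bingbot")
other_delay = parser.crawl_delay("Googlebot")

print(bing_delay, other_delay)
```

A polite crawler would sleep for the returned number of seconds between requests; whether a given crawler honors the value is up to that crawler.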
Allow vs Disallow — Priority
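When a path matches both an Allow and a Disallow rule, the most specific rule wins, meaning the one with the longest matching path. If both rules are equally long, Google applies the less restrictive one (Allow). For example: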
    User-agent: *
    Disallow: /images/
    Allow: /images/public/

    # Result:
    #   /images/private.jpg → BLOCKED
    #   /images/public/logo.png → ALLOWED
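The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: it handles plain path prefixes only (no `*` or `$` wildcards), and the function name `is_allowed` is invented for this example. Python's built-in `urllib.robotparser` applies rules in file order instead, which is why the comparison is written out by hand here.

```python
from typing import Iterable, Tuple


def is_allowed(rules: Iterable[Tuple[str, str]], path: str) -> bool:
    """Return True if `path` may be crawled under `rules`.

    `rules` holds (directive, prefix) pairs for a single user-agent group,
    where directive is "allow" or "disallow". The longest matching prefix
    wins; on a tie, Allow wins.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not match.
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and allow
        if longer or tie_for_allow:
            best_len, best_allow = len(prefix), allow
    return best_allow


rules = [("disallow", "/images/"), ("allow", "/images/public/")]
print(is_allowed(rules, "/images/private.jpg"))      # blocked
print(is_allowed(rules, "/images/public/logo.png"))  # allowed
```

Because the comparison is by length, the order of the two lines in the file does not change the result.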
Common User-Agents
| User-agent | Crawler |
|---|---|
| Googlebot | Google web crawler |
| Bingbot | Microsoft Bing |
| GPTBot | OpenAI training crawler |
| Claude-Web | Anthropic web crawler |
| Applebot | Apple Siri / Spotlight |
| * | All crawlers |
What Is llms.txt?
llms.txt is a proposed convention, introduced in 2024 and still unofficial, for making a site's content easier for AI systems (LLMs) to understand. The file lives at /llms.txt and contains a structured Markdown summary of the site. It describes content; it does not grant or deny crawler access.
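To give a feel for what such a file might contain, here is a minimal Python sketch that assembles one. It assumes the layout described in the llms.txt proposal as commonly summarized: an H1 title, a blockquote summary, then H2 sections listing links. The function name `build_llms_txt`, the site name, and all URLs are made up for illustration.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, str]  # (link title, URL, short note)


def build_llms_txt(title: str, summary: str, sections: Dict[str, List[Link]]) -> str:
    """Assemble a Markdown summary in the assumed llms.txt layout."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical site content, for illustration only.
llms_txt = build_llms_txt(
    title="Example Site",
    summary="Guides about crawling and indexing.",
    sections={
        "Docs": [
            ("Crawling guide", "https://example.com/docs/crawling.md",
             "How robots.txt rules are evaluated"),
        ],
    },
)
print(llms_txt)
```

The resulting text would be served as a static file at /llms.txt.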
Separately, many site owners use robots.txt to control which AI crawlers can access their content:
    # Block AI training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Still allow Google to index normally
    User-agent: Googlebot
    Allow: /
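If the list of blocked crawlers changes often, the file can be generated instead of edited by hand. The following is a minimal sketch under that assumption: `build_robots_txt` is a hypothetical helper, and the result is checked with Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the example above.
AI_CRAWLERS = ["GPTBot", "Claude-Web", "CCBot"]


def build_robots_txt(blocked_agents):
    """Emit one 'Disallow: /' group per blocked crawler, then allow Googlebot."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    groups.append("User-agent: Googlebot\nAllow: /\n")
    return "\n".join(groups)  # a blank line separates the groups


robots_txt = build_robots_txt(AI_CRAWLERS)

# Verify the generated rules before publishing them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = {agent: not parser.can_fetch(agent, "https://example.com/")
           for agent in AI_CRAWLERS}
google_ok = parser.can_fetch("Googlebot", "https://example.com/any/page")

print(robots_txt)
print(blocked, google_ok)
```

This only confirms what the file says. Whether a crawler obeys it is still up to the crawler.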
Common robots.txt Mistakes
Blocking Googlebot from CSS/JS
If Googlebot cannot load your CSS or JavaScript, it cannot render your pages the way visitors see them, and that can hurt ranking. Avoid rules like these:
    # WRONG: blocks CSS/JS from Googlebot
    User-agent: Googlebot
    Disallow: /wp-content/
    Disallow: /assets/
Accidentally Blocking the Entire Site
    # WRONG: blocks everything
    User-agent: *
    Disallow: /

    # OK during construction, BUT:
    # Make sure to remove this when you go live!
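A leftover `Disallow: /` is easy to catch with an automated check before going live. The sketch below is a rough heuristic, not a complete audit: the hypothetical helper `blocks_entire_site` only tests whether the homepage path `/` is fetchable for the given crawler, using Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if `agent` is not allowed to fetch the homepage ("/")."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


under_construction = "User-agent: *\nDisallow: /\n"
live_site = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(under_construction))  # the mistake shown above
print(blocks_entire_site(live_site))           # only /admin/ is blocked
```

Run as part of a deployment step, a check like this could fail the build when the production robots.txt still blocks the homepage.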
Keep in Mind
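- `Disallow: /page` does NOT prevent indexing if backlinks to the page exist
- Use a `noindex` meta tag (`<meta name="robots" content="noindex">`) or an `X-Robots-Tag: noindex` HTTP header for reliable index exclusion. The page must stay crawlable, because a crawler that is blocked by robots.txt never sees the directive
- robots.txt is cached by crawlers, so changes may take time to take effect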