robots.txt — Complete Guide (Concepts, Usage & Recommendations)
Audience: Developers, PMs, and SEO maintainers
Applies to: Drupal, Symfony, static sites, APIs, staging & production
1) What robots.txt is (and isn’t)
- What it is: A plain-text file at the site root (e.g., https://example.com/robots.txt) that tells well-behaved crawlers which URLs they may and may not crawl.
- What it isn’t:
  - A security control. It does not hide private content.
  - A de-indexer for already indexed URLs (use noindex via meta tag or HTTP header).
  - A rate limiter (some engines ignore Crawl-delay; use server/serverless throttling).
Rule of thumb: Use robots.txt to shape crawl paths; use meta tags/headers and canonicals to shape indexing; use authentication to protect private areas.
2) Syntax quick reference
A robots.txt file is parsed top-to-bottom in groups of:
User-agent: <identifier or *>
Allow: <path>
Disallow: <path>
Sitemap: <absolute-URL-to-sitemap>
2.1 User-agent
- User-agent: * applies to all bots.
- One or more specific agents can have their own blocks (e.g., User-agent: Googlebot).
2.2 Allow / Disallow
- Paths are relative to the site root.
- Longest match wins when Allow and Disallow conflict (see the worked example below).
- Wildcards: * matches any sequence of characters; $ anchors the end of the URL.
- Examples:
  - Disallow: /admin/ → blocks any URL starting with /admin/
  - Disallow: /*?*utm_ → blocks URLs containing query params like utm_...
  - Allow: /public/ combined with Disallow: / → only /public/ is allowed
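A quick worked example of the longest-match rule and the wildcards (the paths are made up for illustration):
User-agent: *
Disallow: /docs/
Allow: /docs/public/
Disallow: /*.pdf$
# /docs/internal/plan.html -> blocked (matches Disallow: /docs/)
# /docs/public/intro.html  -> allowed (Allow: /docs/public/ is the longer match)
# /downloads/brochure.pdf  -> blocked (* wildcard plus $ end anchor)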
2.3 Sitemaps
- Use absolute URLs, one per line if multiple:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
2.4 Crawl-delay (⚠️)
- Not a web standard; Google ignores it. Some engines may honor it, many don’t. Prefer server-side rate limits if bandwidth is a concern (see the sketch below).
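If crawler load is a real concern, a web-server limit is more dependable than Crawl-delay. A minimal Nginx sketch, assuming you want to throttle anything that self-identifies as a bot (zone name, rate, and the User-Agent pattern are examples to adapt):
# http {} context: key requests by client IP only when the UA looks like a crawler
map $http_user_agent $crawler_key {
    default                 "";
    ~*(bot|crawler|spider)  $binary_remote_addr;
}
# requests with an empty key are not rate-limited
limit_req_zone $crawler_key zone=crawlers:10m rate=1r/s;

server {
    location / {
        limit_req zone=crawlers burst=5 nodelay;
        # ... usual proxy/fastcgi configuration ...
    }
}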
3) Choosing a policy: decision tree
- Is the site public marketing/docs? → Allow all, block noisy/duplicate paths, provide sitemap(s).
- Is it a web app (login-gated)? → Disallow everything. Keep authentication as the real barrier.
- Is it staging/preview? → Disallow everything and enforce HTTP Basic Auth.
- Is it an API? → Disallow everything; optionally allow /docs/ only.
- Do you want SEO tools’ crawlers (Ahrefs/Semrush/etc.)? → Allow only if you use them; otherwise disallow to save bandwidth.
- AI/open-corpus crawlers (GPTBot/CCBot/Amazonbot)? → Decide explicitly. Default to opt-out if data use is sensitive.
4) Templates (copy/paste)
Replace example.com with your own domains. Keep only the blocks you need.
4.1 Open SEO (marketing/blog)
User-agent: *
Allow: /
# Internal / transactional
Disallow: /admin/
Disallow: /user/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
# Duplicates / facets / tracking params
Disallow: /search/
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*utm_
Disallow: /*?*fbclid=
Disallow: /*?*gclid=
# Social preview bots (make sure they can fetch OG/Twitter cards)
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /
Sitemap: https://example.com/sitemap.xml
4.2 Drupal overlay (append to 4.1 when site is Drupal)
# Drupal internals (non-canonical content and scaffolding)
User-agent: *
Disallow: /core/
Disallow: /modules/
Disallow: /profiles/
Disallow: /themes/
Disallow: /vendor/
Disallow: /CHANGELOG.txt
Disallow: /install.php
Disallow: /update.php
Disallow: /filter/tips/
Disallow: /comment/reply/
# When "Clean URLs" fail:
Disallow: /?q=*
4.3 Symfony overlay
# Common debug/profiler endpoints (shouldn't be public)
User-agent: *
Disallow: /_wdt/
Disallow: /_profiler/
Disallow: /*?*debug=
Sitemap: https://example.com/sitemap.xml
4.4 Apps/landing pages shared in chats (lock most, allow previews)
User-agent: *
Disallow: /
# Allow only social link previews to render OG tags
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /
4.5 Staging / preview
User-agent: *
Disallow: /
Also protect with HTTP Basic Auth to prevent accidental indexing.
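A minimal Nginx sketch of the Basic Auth part (the realm text and htpasswd path are placeholders); robots.txt is deliberately left unauthenticated so crawlers can still read the Disallow: /:
server {
    server_name staging.example.com;

    # Whole-host Basic Auth
    auth_basic           "Restricted staging";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location = /robots.txt {
        auth_basic off;           # crawlers may fetch the Disallow: / file
        default_type text/plain;
        try_files $uri =404;
    }

    location / {
        # ... usual app configuration ...
    }
}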
4.6 API surface
User-agent: *
Disallow: /
# Optionally allow docs only:
# Allow: /docs/
4.7 AI & open-corpus bots (pick a stance)
Opt-out
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
Allow a specific one, block others
User-agent: GPTBot
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
5) Bot groups & default stance (quick matrix)
| Group | Examples (UA) | Default | Notes |
|---|---|---|---|
| Core search | Googlebot, bingbot, DuckDuckBot, Applebot, YandexBot, Baiduspider, PetalBot | Allow | If geo-targeting is narrow, you may disallow engines for markets you don’t serve. |
| Social preview | facebookexternalhit/Facebot, Twitterbot, LinkedInBot, WhatsApp | Allow | Needed for OG/Twitter Card previews. |
| Media | Googlebot-Image, Googlebot-Video, Bing image crawler | Allow (if visuals matter) | Disallow if bandwidth is tight or media is private. |
| SEO tools | AhrefsBot, SemrushBot, MJ12bot, DotBot | Conditional | Allow if you use them; otherwise disallow to reduce crawl. |
| AI/corpus | GPTBot, CCBot, Amazonbot | Decide explicitly | Default to disallow if data use is sensitive. |
| Scrapers/unknown | (various) | Disallow | Use WAF/rate limits; robots.txt won’t stop bad actors. |
6) Indexing vs Crawling: the common pitfalls
- Blocking via robots ≠ noindex. If a URL is Disallowed, Google may still index it (without content) if it is linked elsewhere.
- To prevent indexing, use:
  - Meta tag: <meta name="robots" content="noindex, nofollow">
  - HTTP header: X-Robots-Tag: noindex, nofollow (useful for non-HTML resources; server sketch below)
- Only apply noindex on pages you actually serve; don’t block them in robots.txt if you need the tag to be seen.
- Canonical URLs still matter. Use <link rel="canonical" ...> to consolidate variants (e.g., with/without parameters).
7) Environment strategy (recommended)
- Production: open by default, block only noisy/duplicate/internal paths; expose sitemaps.
- Staging/QA/Preview: Disallow: / + HTTP Basic Auth.
- Dev (local): no need for robots.txt; don’t expose it publicly.
8) File location and delivery
- Must be reachable at https://<host>/robots.txt.
- If you have multiple hosts (www, country sites, sub-apps), each host should serve its own file.
- For headless/CDN setups, ensure routing rules deliver the plain-text file at the root (see the sketch below).
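A minimal Nginx sketch of the per-host idea (host names and file paths are examples); each vhost maps /robots.txt to its own file:
# www: marketing policy (templates 4.1/4.2)
server {
    server_name www.example.com;
    location = /robots.txt { alias /var/www/robots/www.robots.txt; }
}

# app: lock everything down (templates 4.4/4.5)
server {
    server_name app.example.com;
    location = /robots.txt { alias /var/www/robots/app.robots.txt; }
}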
9) Server snippets (optional but handy)
Apache
# Serve robots.txt as text/plain
<Files "robots.txt">
ForceType text/plain
</Files>
Nginx
location = /robots.txt {
default_type text/plain;
try_files $uri =404;
}
10) Verification & monitoring
Fetch and inspect
curl -i https://example.com/robots.txt
Check top bots in Apache access logs (current UTC day)
# Adjust path to your logs; field positions may vary with your LogFormat
awk -v d="$(date -u +"%d/%b/%Y")" '$4 ~ d { for (i=1;i<=NF;i++) if ($i ~ /bot|Bot|crawler|spider/) print $i }' /var/log/apache2/access.log* \
| sed 's/"//g' | sort | uniq -c | sort -nr | head
Search console tools
- Google Search Console → robots.txt report & URL Inspection
- Bing Webmaster Tools → robots tester
Health checklist
- ✅ robots.txt is reachable (200 OK, text/plain).
- ✅ Sitemap: lines use absolute URLs and return 200 OK.
- ✅ No accidental Disallow: / in production.
- ✅ Staging/preview hosts are password-protected.
11) Drupal & Symfony specifics
Drupal
- Many core paths don’t need crawl access (see template 4.2).
- Public files under /sites/default/files/ are fine to crawl; rely on canonicals.
- Views with exposed filters can generate faceted URLs; block noisy params (sort, filter, page) and use canonicals/pagination (see the sketch below).
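A hedged example of what "block noisy params" can look like; the names below are Views' default exposed-sort and pager parameters, but verify them against your real facet/filter URLs before committing:
# Views exposed sort & pager parameters (verify the names in your own URLs)
Disallow: /*?*sort_by=
Disallow: /*?*sort_order=
Disallow: /*?*page=
Note that blocking page= also keeps crawlers off deep pager pages, so keep it only if your sitemap lists that content directly.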
Symfony
- Ensure /_wdt and /_profiler are never public (see the sketch below).
- If you serve from /public, framework internals aren’t web-exposed; still block debug params and duplicate patterns.
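The real fix is to keep the web profiler out of non-dev environments; a web-server guard is a cheap second layer. A minimal Nginx sketch:
# Defense in depth: never serve profiler/debug-toolbar routes publicly
location ~ ^/(_profiler|_wdt)(/|$) {
    return 404;
}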
12) Playbook for common scenarios
A) Marketing site with blog and products
- Start from the Open SEO template (4.1).
- Add Allow rules for critical media if needed.
- Keep Sitemap: entries updated post-deploy via a CI job or cron (a check script is sketched below).
- Optionally allow Ahrefs/Semrush if you use them.
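One way to keep that honest is a small post-deploy check in CI or cron. A sketch in shell (the host URL is a placeholder):
#!/usr/bin/env bash
# Post-deploy sanity check for robots.txt and sitemaps
set -euo pipefail
HOST="https://www.example.com"

# 1) robots.txt is reachable and served as text/plain
curl -fsS -D- -o /tmp/robots.txt "$HOST/robots.txt" | grep -qi '^content-type: text/plain'

# 2) production must not block the whole site
if grep -qE '^[[:space:]]*Disallow:[[:space:]]*/[[:space:]]*$' /tmp/robots.txt; then
    echo "ERROR: robots.txt contains a bare 'Disallow: /'" >&2
    exit 1
fi

# 3) every Sitemap: URL must return 200
tr -d '\r' < /tmp/robots.txt | awk 'tolower($1) == "sitemap:" {print $2}' | while read -r url; do
    curl -fsS -o /dev/null "$url"
done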
B) Web app/dashboard
- Disallow: /
- Ensure authenticated routes aren’t publicly linkable.
- If you share marketing pages under the same host, move them to a separate host (e.g., app.example.com vs www.example.com) with its own robots.txt.
C) API
- Disallow: /
- Optionally Allow: /docs/ only.
- Document rate limits in the docs, not via robots.txt.
D) Staging
- Disallow: / + HTTP Basic Auth.
- Block by IP if feasible.
E) Opting out of AI datasets
- Add the opt-out block for GPTBot/CCBot/Amazonbot.
- Document rationale in repo (compliance/governance).
13) FAQ
Q: Will Disallow remove pages already in Google?
A: No. Use noindex header/meta and allow crawling until they drop, or remove/redirect.
Q: Can I use robots.txt to hide secrets?
A: No. Use authentication/authorization.
Q: Multiple User-agent groups: which one applies?
A: The most specific user-agent group that matches the crawler applies; within that group, longest path match decides Allow vs Disallow.
14) Maintenance checklist (monthly)
- Confirm robots.txt returns 200 and the correct content type.
- Validate sitemaps (URLs, freshness, status codes).
- Review new routes/filters and add disallows if they create duplicates.
- Audit logs for unexpected bots; add Disallow or server-side mitigations if needed.
- Revisit AI/SEO crawler policy as business needs change.
15) Quick templates for your projects (fill & commit)
Create one robots.txt per host and commit it to each repo, or serve it via the web server/CDN.
- ProtonSystems marketing (Drupal/Symfony): 4.1 + 4.2/4.3 + opt-out AI.
- Zutritto & other dashboards: 4.5 (staging) → 4.4 or full Disallow in prod (depending on public landing pages).
- APIs: 4.6 with optional /docs/ allow.
- All staging/previews: 4.5 + HTTP Basic Auth.
Changelog (keep this section in your wiki)
2025-11-04 — Initial version; includes Drupal/Symfony overlays, AI bot policy options, and operational checklists.