Documentation

A practical, opinionated reference tailored to Drupal/Symfony projects, intended to be dropped into repo wikis/Notion as-is.

robots.txt — Complete Guide (Concepts, Usage & Recommendations)

Audience: Developers, PMs, and SEO maintainers
Applies to: Drupal, Symfony, static sites, APIs, staging & production


1) What robots.txt is (and isn’t)

  • What it is: A plain-text file at the site root (e.g., https://example.com/robots.txt) that tells well-behaved crawlers which URLs they may or may not crawl.

  • What it isn’t:

    • A security control. It does not hide private content.
    • A de-indexer for already indexed URLs (use noindex via meta or headers).
    • A rate limiter (some engines ignore Crawl-delay; use server/serverless throttling).

Rule of thumb: Use robots.txt to shape crawl paths; use meta tags/headers and canonicals to shape indexing; use authentication to protect private areas.


2) Syntax quick reference

A robots.txt file is parsed top-to-bottom in groups of:

User-agent: <identifier or *>
Allow: <path>
Disallow: <path>
Sitemap: <absolute-URL-to-sitemap>

2.1 User-agent

  • User-agent: * applies to all bots.
  • One or more specific agents can have their own blocks (e.g., User-agent: Googlebot).

2.2 Allow / Disallow

  • Paths are relative to site root.

  • Longest match wins when Allow and Disallow conflict (see the example after this list).

  • Wildcards:

    • * matches any sequence
    • $ anchors end of URL
  • Examples:

    • Disallow: /admin/ → blocks any URL starting with /admin/
    • Disallow: /*?*utm_ → blocks URLs containing query params like utm_...
    • Allow: /public/ with Disallow: / → only /public/ allowed
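
A minimal sketch of how the longest-match rule plays out (paths are illustrative):

User-agent: *
Disallow: /
Allow: /public/

# /public/page.html  -> crawlable: Allow: /public/ is the longer (more specific) match
# /private/page.html -> blocked by Disallow: /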

2.3 Sitemaps

  • Use absolute URLs, one per line if multiple:

    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-news.xml
    

2.4 Crawl-delay (⚠️)

  • Not part of the robots.txt standard; Google ignores it, and while some engines may honor it, many don’t. Prefer server-side rate limiting if bandwidth is a concern (sketch below).
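
If crawl pressure is a real concern, throttle at the web server instead. A minimal Nginx sketch, assuming you only want to slow clients whose User-Agent looks bot-like (zone name, regex, and rate are placeholders to tune):

# nginx.conf, http{} context
map $http_user_agent $bot_limit_key {
  default                "";                    # empty key = no limit applied
  ~*(bot|crawler|spider) $binary_remote_addr;   # rate-limit per IP for bot-like UAs
}
limit_req_zone $bot_limit_key zone=bots:10m rate=1r/s;

# server{} or location{} context
limit_req zone=bots burst=5;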

3) Choosing a policy: decision tree

  1. Is the site public marketing/docs? → Allow all, block noisy/duplicate paths, provide sitemap(s).

  2. Is it a web app (login-gated)? → Disallow everything. Keep authentication as the real barrier.

  3. Is it staging/preview? → Disallow everything and enforce HTTP Basic Auth.

  4. Is it an API? → Disallow everything; optionally allow /docs/ only.

  5. Do you want SEO tools’ crawlers (Ahrefs/Semrush/etc.)? → Allow only if you use them; otherwise disallow to save bandwidth.

  6. AI/open-corpus crawlers (GPTBot/CCBot/Amazonbot)? → Decide explicitly. Default to opt-out if data use is sensitive.


4) Templates (copy/paste)

Replace example.com. Keep only the blocks you need.

4.1 Open SEO (marketing/blog)

User-agent: *
Allow: /

# Internal / transactional
Disallow: /admin/
Disallow: /user/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/

# Duplicates / facets / tracking params
Disallow: /search/
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*utm_
Disallow: /*?*fbclid=
Disallow: /*?*gclid=

# Social preview bots (make sure they can fetch OG/Twitter cards)
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /

Sitemap: https://example.com/sitemap.xml

4.2 Drupal overlay (append to 4.1 when site is Drupal)

# Drupal internals (non-canonical content and scaffolding)
Disallow: /core/
Disallow: /modules/
Disallow: /profiles/
Disallow: /themes/
Disallow: /vendor/
Disallow: /CHANGELOG.txt
Disallow: /install.php
Disallow: /update.php
Disallow: /filter/tips/
Disallow: /comment/reply/

# When "Clean URLs" fail:
Disallow: /?q=*
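
Note: Drupal core’s stock robots.txt re-allows CSS/JS under these directories so crawlers can render pages properly. If you block /core/ and /themes/ wholesale, consider keeping equivalent Allow rules (a sketch; adjust to how your assets are actually served, e.g., aggregation under /sites/default/files/):

Allow: /core/*.css$
Allow: /core/*.js$
Allow: /themes/*.css$
Allow: /themes/*.js$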

4.3 Symfony overlay

# Common debug/profiler endpoints (shouldn't be public)
User-agent: *
Disallow: /_wdt/
Disallow: /_profiler/
Disallow: /*?*debug=

Sitemap: https://example.com/sitemap.xml

4.4 Apps/landing pages shared in chats (lock most, allow previews)

User-agent: *
Disallow: /

# Allow only social link previews to render OG tags
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /

4.5 Staging / preview

User-agent: *
Disallow: /

Also protect with HTTP Basic Auth to prevent accidental indexing.
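
A minimal Nginx sketch for the Basic Auth layer (realm and file path are placeholders):

# staging vhost: require credentials for everything
auth_basic "Staging";
auth_basic_user_file /etc/nginx/.htpasswd;   # e.g., created with: htpasswd -c /etc/nginx/.htpasswd deploy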

4.6 API surface

User-agent: *
Disallow: /

# Optionally allow docs only:
# Allow: /docs/

4.7 AI & open-corpus bots (pick a stance)

Opt-out

User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /

Allow a specific one, block others

User-agent: GPTBot
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /

5) Bot groups & default stance (quick matrix)

| Group | Examples (UA) | Default | Notes |
|---|---|---|---|
| Core search | Googlebot, bingbot, DuckDuckBot, Applebot, YandexBot, Baiduspider, PetalBot | Allow | If your market is geographically narrow, you may allow only the engines relevant to it and block the rest. |
| Social preview | facebookexternalhit/Facebot, Twitterbot, LinkedInBot, WhatsApp | Allow | Needed for OG/Twitter Card previews. |
| Media | Googlebot-Image, Googlebot-Video, Bing media crawlers | Allow (if visuals matter) | Disable if bandwidth is tight or media is private. |
| SEO tools | AhrefsBot, SemrushBot, MJ12bot, DotBot | Conditional | Allow if you use them; otherwise disallow to reduce crawl load. |
| AI/corpus | GPTBot, CCBot, Amazonbot | Decide explicitly | Default to disallow if data use is sensitive. |
| Scrapers/unknown | (various) | Disallow | Use WAF/rate limits; robots.txt won’t stop bad actors. |

6) Indexing vs Crawling: the common pitfalls

  • Blocking via robots.txt ≠ noindex. If a URL is Disallowed, Google may still index it (without reading its content) when it is linked from elsewhere.

    • To prevent indexing, use:

      • Meta tag: <meta name="robots" content="noindex, nofollow">
      • HTTP header: X-Robots-Tag: noindex, nofollow (useful for non-HTML; see the Nginx sketch after this list)
    • Only apply noindex to pages crawlers can actually fetch; don’t Disallow them in robots.txt, or the tag/header will never be seen.

  • Canonical URLs still matter. Use <link rel="canonical"...> to consolidate variants (e.g., with/without parameters).
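
A minimal Nginx sketch for the header approach, assuming PDFs are the non-HTML content you want kept out of the index (the pattern and scope are placeholders):

# PDFs can't carry a meta robots tag; send the directive as an HTTP header instead
location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}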


7) Environment strategy (recommended)

  • Production: open by default, block only noisy/duplicate/internal paths; expose sitemaps.
  • Staging/QA/Preview: Disallow: / + HTTP Basic Auth.
  • Dev (local): no need for robots.txt; don’t expose publicly.

8) File location and delivery

  • Must be reachable at: https://<host>/robots.txt
  • If you have multiple hosts (www, country sites, sub-apps), each host should serve its own file.
  • For headless/CDN setups, ensure routing rules deliver the plain text file at root.
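
If there is no document root to drop a file into (headless frontends, proxied apps), the web server or edge can emit robots.txt directly. A minimal Nginx sketch; replace the body with whichever template from section 4 applies to that host (see section 9 for the file-based variant):

location = /robots.txt {
  default_type text/plain;
  return 200 "User-agent: *\nDisallow: /\n";
}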

9) Server snippets (optional but handy)

Apache

# Serve robots.txt as text/plain
<Files "robots.txt">
  ForceType text/plain
</Files>

Nginx

location = /robots.txt {
  default_type text/plain;
  try_files $uri =404;
}

10) Verification & monitoring

Fetch and inspect

curl -i https://example.com/robots.txt

Check top bots in Apache access logs (24h)

# Adjust path to your logs; field positions may vary with your LogFormat
awk -v d="$(date -u +"%d/%b/%Y")" '$4 ~ d { for (i=1;i<=NF;i++) if ($i ~ /bot|Bot|crawler|spider/) print $i }' /var/log/apache2/access.log* \
  | sed 's/"//g' | sort | uniq -c | sort -nr | head

Search console tools

  • Google Search Console → robots.txt report & URL Inspection
  • Bing Webmaster Tools → robots tester

Health checklist

  • ✅ robots.txt is reachable (200 OK, text/plain).
  • ✅ Sitemap: lines use absolute URLs and return 200 OK.
  • ✅ No accidental Disallow: / in production (see the guard commands below).
  • ✅ Staging/preview hosts are password-protected.
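
A small post-deploy/CI guard for the checklist above (sketch; HOST is a placeholder). The first command should succeed; the second should find nothing on production:

HOST="https://example.com"

# reachable and served as text/plain
curl -sfI "$HOST/robots.txt" | grep -i "content-type: text/plain"

# must print NOTHING on production (a match means a blanket Disallow slipped through)
curl -sf "$HOST/robots.txt" | grep -x "Disallow: /"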

11) Drupal & Symfony specifics

Drupal

  • Many core paths don’t need crawl access (see template 4.2).
  • Public files under /sites/default/files/ are fine to crawl; rely on canonicals.
  • Views with exposed filters can generate faceted URLs; block noisy params (sort, filter, page) and use canonicals/pagination.
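
A sketch for typical Views exposed-sort/pager parameters; the exact names depend on your exposed filter configuration, so check real URLs before committing:

# Append to the Drupal overlay (4.2) if exposed filters create crawlable duplicates
Disallow: /*?*sort_by=
Disallow: /*?*sort_order=
Disallow: /*?*items_per_page=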

Symfony

  • Ensure /_wdt and /_profiler are never public (see the quick check after this list).
  • If you serve from /public, framework internals aren’t web-exposed; still block debug params and duplicate patterns.
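
A quick reachability check for the profiler route (the URL is a placeholder; anything other than 404/403 on production deserves a look):

# should NOT return 200 on production; the profiler must only exist in the dev environment
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/_profiler/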

12) Playbook for common scenarios

A) Marketing site with blog and products

  • Start from Open SEO template.
  • Add Allow rules for critical media if needed.
  • Keep Sitemap: updated post-deploy (CI job or cron).
  • Optionally allow Ahrefs/Semrush if you use them.

B) Web app/dashboard

  • Disallow: /
  • Ensure authenticated routes aren’t publicly linkable.
  • If you share marketing pages under the same host, move them to a separate host (e.g., app.example.com vs www.example.com) with its own robots.txt.

C) API

  • Disallow: /
  • Optionally Allow: /docs/ only.
  • Document rate limits on the docs, not via robots.

D) Staging

  • Disallow: / + HTTP Basic Auth.
  • Block by IP if feasible.
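
A minimal Nginx sketch for the IP restriction (the CIDR is a placeholder for your office/VPN range):

# staging vhost: only the listed network gets in
allow 203.0.113.0/24;
deny  all;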

E) Opting out of AI datasets

  • Add the opt-out block for GPTBot/CCBot/Amazonbot.
  • Document rationale in repo (compliance/governance).

13) FAQ

Q: Will Disallow remove pages already in Google? A: No. Use noindex header/meta and allow crawling until they drop, or remove/redirect.

Q: Can I use robots.txt to hide secrets? A: No. Use authentication/authorization.

Q: Multiple User-agent groups: which one applies? A: The most specific user-agent group that matches the crawler applies; within that group, longest path match decides Allow vs Disallow.
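
For example:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/

# Googlebot obeys only its own group: /drafts/ is blocked for it, but /private/ is not.
# Every other crawler falls back to the * group.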


14) Maintenance checklist (monthly)

  • Confirm robots.txt returns 200 and correct content type.
  • Validate sitemaps (URLs, freshness, status codes).
  • Review new routes/filters and add disallows if they create duplicates.
  • Audit logs for unexpected bots; add Disallow or server-side mitigations if needed.
  • Revisit AI/SEO crawler policy as business needs change.

15) Quick templates for your projects (fill & commit)

Create robots.txt per host and commit to each repo or serve via web server/CDN.

  • ProtonSystems marketing (Drupal/Symfony): 4.1 + 4.2/4.3 + opt-out AI.
  • Zutritto & other dashboards: 4.5 (staging) → 4.4 or full Disallow in prod (depending on public landing pages).
  • APIs: 4.6 with optional /docs/ allow.
  • All staging/previews: 4.5 + HTTP Basic Auth.

Changelog (keep this section in your wiki)

  • 2025-11-04 — Initial version; includes Drupal/Symfony overlays, AI bot policy options, and operational checklists.

To produce the final robots.txt files, list the exact domains and the bucket each belongs to (marketing/app/api/staging), then apply the matching template from section 4 (per the mapping in section 15) and the relevant Apache/Nginx snippet for that host.