Documentation

A practical, opinionated reference tailored to Drupal/Symfony projects, intended to be dropped into repo wikis/Notion as-is.

robots.txt — Complete Guide (Concepts, Usage & Recommendations)

Audience: Developers, PMs, and SEO maintainers
Applies to: Drupal, Symfony, static sites, APIs, staging & production


1) What robots.txt is (and isn’t)

  • What it is: A plain-text file at the site root (e.g., https://example.com/robots.txt) that tells well-behaved crawlers which URLs they may or may not crawl.

  • What it isn’t:

    • A security control. It does not hide private content.
    • A de-indexer for already indexed URLs (use noindex via meta or headers).
    • A rate limiter (some engines ignore Crawl-delay; use server/serverless throttling).

Rule of thumb: Use robots.txt to shape crawl paths; use meta tags/headers and canonicals to shape indexing; use authentication to protect private areas.


2) Syntax quick reference

A robots.txt file is parsed top-to-bottom in groups of:

User-agent: <identifier or *>
Allow: <path>
Disallow: <path>
Sitemap: <absolute-URL-to-sitemap>

2.1 User-agent

  • User-agent: * applies to all bots.
  • One or more specific agents can have their own blocks (e.g., User-agent: Googlebot).

2.2 Allow / Disallow

  • Paths are relative to site root.

  • Longest match wins when Allow and Disallow conflict (see the example after this list).

  • Wildcards:

    • * matches any sequence
    • $ anchors end of URL
  • Examples:

    • Disallow: /admin/ → blocks any URL starting with /admin/
    • Disallow: /*?*utm_ → blocks URLs containing query params like utm_...
    • Allow: /public/ with Disallow: / → only /public/ allowed
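
A minimal sketch of how the longest-match rule plays out (paths are illustrative):

User-agent: *
Disallow: /
Allow: /public/

# /public/page.html  -> crawlable: Allow: /public/ is the longer (more specific) match
# /private/page.html -> blocked by Disallow: /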

2.3 Sitemaps

  • Use absolute URLs, one per line if multiple:

    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/sitemap-news.xml
    

2.4 Crawl-delay (⚠️)

  • Not part of the robots.txt standard; Google ignores it, and while some engines may honor it, many don’t. Prefer server-side rate limiting if bandwidth is a concern (sketch below).
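
If crawl pressure is a real concern, throttle at the web server instead. A minimal Nginx sketch, assuming you only want to slow clients whose User-Agent looks bot-like (zone name, regex, and rate are placeholders to tune):

# nginx.conf, http{} context
map $http_user_agent $bot_limit_key {
  default                "";                    # empty key = no limit applied
  ~*(bot|crawler|spider) $binary_remote_addr;   # rate-limit per IP for bot-like UAs
}
limit_req_zone $bot_limit_key zone=bots:10m rate=1r/s;

# server{} or location{} context
limit_req zone=bots burst=5;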

3) Choosing a policy: decision tree

  1. Is the site public marketing/docs? → Allow all, block noisy/duplicate paths, provide sitemap(s).

  2. Is it a web app (login-gated)? → Disallow everything. Keep authentication as the real barrier.

  3. Is it staging/preview? → Disallow everything and enforce HTTP Basic Auth.

  4. Is it an API? → Disallow everything; optionally allow /docs/ only.

  5. Do you want SEO tools’ crawlers (Ahrefs/Semrush/etc.)? → Allow only if you use them; otherwise disallow to save bandwidth.

  6. AI/open-corpus crawlers (GPTBot/CCBot/Amazonbot)? → Decide explicitly. Default to opt-out if data use is sensitive.


4) Templates (copy/paste)

Replace example.com. Keep only the blocks you need.

4.1 Open SEO (marketing/blog)

User-agent: *
Allow: /

# Internal / transactional
Disallow: /admin/
Disallow: /user/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/

# Duplicates / facets / tracking params
Disallow: /search/
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*utm_
Disallow: /*?*fbclid=
Disallow: /*?*gclid=

# Social preview bots (make sure they can fetch OG/Twitter cards)
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /

Sitemap: https://example.com/sitemap.xml

4.2 Drupal overlay (append to 4.1 when site is Drupal)

# Drupal internals (non-canonical content and scaffolding)
Disallow: /core/
Disallow: /modules/
Disallow: /profiles/
Disallow: /themes/
Disallow: /vendor/
Disallow: /CHANGELOG.txt
Disallow: /install.php
Disallow: /update.php
Disallow: /filter/tips/
Disallow: /comment/reply/

# When "Clean URLs" fail:
Disallow: /?q=*
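
Note: Drupal core’s stock robots.txt re-allows CSS/JS under these directories so crawlers can render pages properly. If you block /core/ and /themes/ wholesale, consider keeping equivalent Allow rules (a sketch; adjust to how your assets are actually served, e.g., aggregation under /sites/default/files/):

Allow: /core/*.css$
Allow: /core/*.js$
Allow: /themes/*.css$
Allow: /themes/*.js$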

4.3 Symfony overlay

# Common debug/profiler endpoints (shouldn't be public)
User-agent: *
Disallow: /_wdt/
Disallow: /_profiler/
Disallow: /*?*debug=

Sitemap: https://example.com/sitemap.xml

4.4 Apps/landing pages shared in chats (lock most, allow previews)

User-agent: *
Disallow: /

# Allow only social link previews to render OG tags
User-agent: facebookexternalhit
Allow: /
User-agent: Facebot
Allow: /
User-agent: Twitterbot
Allow: /
User-agent: LinkedInBot
Allow: /
User-agent: WhatsApp
Allow: /

4.5 Staging / preview

User-agent: *
Disallow: /

Also protect with HTTP Basic Auth to prevent accidental indexing.
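
A minimal Nginx sketch for the Basic Auth layer (realm and file path are placeholders):

# staging vhost: require credentials for everything
auth_basic "Staging";
auth_basic_user_file /etc/nginx/.htpasswd;   # e.g., created with: htpasswd -c /etc/nginx/.htpasswd deploy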

4.6 API surface

User-agent: *
Disallow: /

# Optionally allow docs only:
# Allow: /docs/

4.7 AI & open-corpus bots (pick a stance)

Opt-out

User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /

Allow a specific one, block others

User-agent: GPTBot
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /

5) Bot groups & default stance (quick matrix)

| Group | Examples (UA) | Default | Notes |
|---|---|---|---|
| Core search | Googlebot, bingbot, DuckDuckBot, Applebot, YandexBot, Baiduspider, PetalBot | Allow | If your market is geographically narrow, you may allow only the engines relevant to it and block the rest. |
| Social preview | facebookexternalhit/Facebot, Twitterbot, LinkedInBot, WhatsApp | Allow | Needed for OG/Twitter Card previews. |
| Media | Googlebot-Image, Googlebot-Video, Bing media crawlers | Allow (if visuals matter) | Disable if bandwidth is tight or media is private. |
| SEO tools | AhrefsBot, SemrushBot, MJ12bot, DotBot | Conditional | Allow if you use them; otherwise disallow to reduce crawl load. |
| AI/corpus | GPTBot, CCBot, Amazonbot | Decide explicitly | Default to disallow if data use is sensitive. |
| Scrapers/unknown | (various) | Disallow | Use WAF/rate limits; robots.txt won’t stop bad actors. |

6) Indexing vs Crawling: the common pitfalls

  • Blocking via robots.txt ≠ noindex. If a URL is Disallowed, Google may still index it (without reading its content) when it is linked from elsewhere.

    • To prevent indexing, use:

      • Meta tag: <meta name="robots" content="noindex, nofollow">
      • HTTP header: X-Robots-Tag: noindex, nofollow (useful for non-HTML; see the Nginx sketch after this list)
    • Only apply noindex to pages crawlers can actually fetch; don’t Disallow them in robots.txt, or the tag/header will never be seen.

  • Canonical URLs still matter. Use <link rel="canonical"...> to consolidate variants (e.g., with/without parameters).
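
A minimal Nginx sketch for the header approach, assuming PDFs are the non-HTML content you want kept out of the index (the pattern and scope are placeholders):

# PDFs can't carry a meta robots tag; send the directive as an HTTP header instead
location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}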


7) Environment strategy (recommended)

  • Production: open by default, block only noisy/duplicate/internal paths; expose sitemaps.
  • Staging/QA/Preview: Disallow: / + HTTP Basic Auth.
  • Dev (local): no need for robots.txt; don’t expose publicly.

8) File location and delivery

  • Must be reachable at: https://<host>/robots.txt
  • If you have multiple hosts (www, country sites, sub-apps), each host should serve its own file.
  • For headless/CDN setups, ensure routing rules deliver the plain text file at root.
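
If there is no document root to drop a file into (headless frontends, proxied apps), the web server or edge can emit robots.txt directly. A minimal Nginx sketch; replace the body with whichever template from section 4 applies to that host (see section 9 for the file-based variant):

location = /robots.txt {
  default_type text/plain;
  return 200 "User-agent: *\nDisallow: /\n";
}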

9) Server snippets (optional but handy)

Apache

# Serve robots.txt as text/plain
<Files "robots.txt">
  ForceType text/plain
</Files>

Nginx

location = /robots.txt {
  default_type text/plain;
  try_files $uri =404;
}

10) Verification & monitoring

Fetch and inspect

curl -i https://example.com/robots.txt

Check top bots in Apache access logs (24h)

# Adjust path to your logs; field positions may vary with your LogFormat
awk -v d="$(date -u +"%d/%b/%Y")" '$4 ~ d { for (i=1;i<=NF;i++) if ($i ~ /bot|Bot|crawler|spider/) print $i }' /var/log/apache2/access.log* \
  | sed 's/"//g' | sort | uniq -c | sort -nr | head

Search console tools

  • Google Search Console → robots.txt report & URL Inspection
  • Bing Webmaster Tools → robots tester

Health checklist

  • ✅ robots.txt is reachable (200 OK, text/plain).
  • ✅ Sitemap: lines use absolute URLs and return 200 OK.
  • ✅ No accidental Disallow: / in production (see the guard commands below).
  • ✅ Staging/preview hosts are password-protected.
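
A small post-deploy/CI guard for the checklist above (sketch; HOST is a placeholder). The first command should succeed; the second should find nothing on production:

HOST="https://example.com"

# reachable and served as text/plain
curl -sfI "$HOST/robots.txt" | grep -i "content-type: text/plain"

# must print NOTHING on production (a match means a blanket Disallow slipped through)
curl -sf "$HOST/robots.txt" | grep -x "Disallow: /"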

11) Drupal & Symfony specifics

Drupal

  • Many core paths don’t need crawl access (see template 4.2).
  • Public files under /sites/default/files/ are fine to crawl; rely on canonicals.
  • Views with exposed filters can generate faceted URLs; block noisy params (sort, filter, page) and use canonicals/pagination.
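
A sketch for typical Views exposed-sort/pager parameters; the exact names depend on your exposed filter configuration, so check real URLs before committing:

# Append to the Drupal overlay (4.2) if exposed filters create crawlable duplicates
Disallow: /*?*sort_by=
Disallow: /*?*sort_order=
Disallow: /*?*items_per_page=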

Symfony

  • Ensure /_wdt and /_profiler are never public (see the quick check after this list).
  • If you serve from /public, framework internals aren’t web-exposed; still block debug params and duplicate patterns.
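
A quick reachability check for the profiler route (the URL is a placeholder; anything other than 404/403 on production deserves a look):

# should NOT return 200 on production; the profiler must only exist in the dev environment
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/_profiler/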

12) Playbook for common scenarios

A) Marketing site with blog and products

  • Start from Open SEO template.
  • Add Allow rules for critical media if needed.
  • Keep Sitemap: updated post-deploy (CI job or cron).
  • Optionally allow Ahrefs/Semrush if you use them.

B) Web app/dashboard

  • Disallow: /
  • Ensure authenticated routes aren’t publicly linkable.
  • If you share marketing pages under the same host, move them to a separate host (e.g., app.example.com vs www.example.com) with its own robots.txt.

C) API

  • Disallow: /
  • Optionally Allow: /docs/ only.
  • Document rate limits on the docs, not via robots.

D) Staging

  • Disallow: / + HTTP Basic Auth.
  • Block by IP if feasible.
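
A minimal Nginx sketch for the IP restriction (the CIDR is a placeholder for your office/VPN range):

# staging vhost: only the listed network gets in
allow 203.0.113.0/24;
deny  all;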

E) Opting out of AI datasets

  • Add the opt-out block for GPTBot/CCBot/Amazonbot.
  • Document rationale in repo (compliance/governance).

13) FAQ

Q: Will Disallow remove pages already in Google? A: No. Use noindex header/meta and allow crawling until they drop, or remove/redirect.

Q: Can I use robots.txt to hide secrets? A: No. Use authentication/authorization.

Q: Multiple User-agent groups: which one applies? A: The most specific user-agent group that matches the crawler applies; within that group, longest path match decides Allow vs Disallow.
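
For example:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/

# Googlebot obeys only its own group: /drafts/ is blocked for it, but /private/ is not.
# Every other crawler falls back to the * group.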


14) Maintenance checklist (monthly)

  • Confirm robots.txt returns 200 and correct content type.
  • Validate sitemaps (URLs, freshness, status codes).
  • Review new routes/filters and add disallows if they create duplicates.
  • Audit logs for unexpected bots; add Disallow or server-side mitigations if needed.
  • Revisit AI/SEO crawler policy as business needs change.

15) Quick templates for your projects (fill & commit)

Create robots.txt per host and commit to each repo or serve via web server/CDN.

  • ProtonSystems marketing (Drupal/Symfony): 4.1 + 4.2/4.3 + opt-out AI.
  • Zutritto & other dashboards: 4.5 (staging) → 4.4 or full Disallow in prod (depending on public landing pages).
  • APIs: 4.6 with optional /docs/ allow.
  • All staging/previews: 4.5 + HTTP Basic Auth.

Changelog (keep this section in your wiki)

  • 2025-11-04 — Initial version; includes Drupal/Symfony overlays, AI bot policy options, and operational checklists.

To produce the final robots.txt files, list the exact domains and the bucket each belongs to (marketing/app/api/staging), then apply the matching template from section 4 (per the mapping in section 15) and the relevant Apache/Nginx snippet for that host.