🗺️ Sitemaps — Complete Guide (Concepts, Usage & Recommendations)

Audience: Devs, PMs, SEO maintainers
Applies to: Drupal, Symfony, static sites, APIs, staging & production


1) What a sitemap is (and isn’t) 🧭

  • What it is: An XML file that lists canonical URLs you want crawlers to discover efficiently. It can include lastmod, changefreq, priority, and media/alt-language hints.

  • What it isn’t:

    • A guarantee of indexing (search engines choose what to index).
    • A place for non-canonical, blocked, or 404/301 URLs.
    • A replacement for internal linking and canonical tags.

Rule of thumb: Sitemaps help discovery & freshness. Indexing quality still depends on content, links, canonicals, performance, and UX.


2) Sitemap flavors 🍦

  • URL (web) sitemap: sitemap.xml or sitemap-*.xml
  • Sitemap Index: points to multiple child sitemaps (best for big sites)
  • Image sitemap: adds <image:image> entries (galleries, product photos)
  • Video sitemap: adds <video:video> entries (video hubs, product demos)
  • News sitemap: time-sensitive (for eligible news sites)
  • Alternate/hreflang support: add <xhtml:link rel="alternate" hreflang="…"> to each URL entry

3) Key limits & structure 🚦

  • Max URLs per sitemap: 50,000, or 50 MB uncompressed (52,428,800 bytes), whichever limit is hit first

  • Compression: GZIP allowed (.xml.gz)

  • Multiple sitemaps: Use a sitemap index to organize by section/lang/date

  • Paths: URLs must be absolute and on the same host as the sitemap (or properly authorized across hosts)

  • Signals to include:

    • lastmod → highly recommended (ISO 8601 / W3C Datetime, preferably in UTC)
    • changefreq → optional (informational only)
    • priority → optional (search engines often ignore)
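
A minimal Python sketch of how the two limits above could be enforced before writing files. The function and constant names are illustrative assumptions, not part of any sitemap library.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000               # protocol limit per file
MAX_BYTES_PER_SITEMAP = 50 * 1024 * 1024    # 50 MB, measured uncompressed


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs, one slice per child sitemap."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def within_size_limit(xml_bytes: bytes) -> bool:
    """True if the uncompressed document fits the 50 MB limit."""
    return len(xml_bytes) <= MAX_BYTES_PER_SITEMAP
```

If a chunk of 50,000 URLs still exceeds 50 MB (long URLs, many image entries), a smaller `size` is the simplest fix.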

4) Minimal examples (copy/paste) 🧩

4.1 Basic web sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-10-15T12:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/my-post</loc>
    <lastmod>2025-10-20T09:12:00+00:00</lastmod>
  </url>
</urlset>
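
For generating a file like 4.1 from code, one possible approach uses only the Python standard library. This is a sketch: `build_urlset` is a hypothetical helper name, and the input is assumed to be (URL, timezone-aware datetime or None) pairs.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries: Iterable[Tuple[str, Optional[datetime]]]) -> bytes:
    """Serialize (loc, lastmod) pairs into a sitemap <urlset> document."""
    ET.register_namespace("", SITEMAP_NS)  # emit the sitemap namespace without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes &, < and > in text, so raw URLs are safe here
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod is not None:
            if lastmod.tzinfo is None:
                raise ValueError("lastmod must be timezone-aware")
            stamp = lastmod.astimezone(timezone.utc).isoformat(timespec="seconds")
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = stamp
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Normalizing every timestamp to UTC keeps lastmod values comparable across servers in different timezones.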

4.2 Sitemap index (multiple children)

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/sitemap-pages.xml.gz</loc>
    <lastmod>2025-10-20T10:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/sitemap-blog.xml.gz</loc>
    <lastmod>2025-10-20T10:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
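
The index in 4.2 can be rendered and its children compressed with a few lines of Python. A sketch under the assumption that child URLs and lastmod strings are already known; `build_index` and `gzip_bytes` are illustrative names.

```python
import gzip
from typing import Iterable, Tuple
from xml.sax.saxutils import escape


def build_index(children: Iterable[Tuple[str, str]]) -> bytes:
    """Render (child_url, lastmod) pairs as a <sitemapindex> document."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for loc, lastmod in children:
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(loc)}</loc>")
        lines.append(f"    <lastmod>{escape(lastmod)}</lastmod>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return ("\n".join(lines) + "\n").encode("utf-8")


def gzip_bytes(xml_bytes: bytes) -> bytes:
    """Compress a sitemap for serving as .xml.gz; mtime=0 keeps the output reproducible."""
    return gzip.compress(xml_bytes, mtime=0)
```

Fixing `mtime` means identical XML always produces identical .gz bytes, which makes change detection and caching simpler.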

4.3 Image entries (inside a normal URL sitemap)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/product/abc</loc>
    <lastmod>2025-10-21T08:00:00+00:00</lastmod>
    <image:image>
      <image:loc>https://example.com/images/abc-1.jpg</image:loc>
      <!-- image:title and image:caption were deprecated by Google in 2022; image:loc is sufficient -->
    </image:image>
    <image:image>
      <image:loc>https://example.com/images/abc-2.jpg</image:loc>
    </image:image>
  </url>
</urlset>

4.4 Video entries

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://example.com/tutorials/how-to-x</loc>
    <video:video>
      <video:thumbnail_loc>https://example.com/thumbnails/how-to-x.jpg</video:thumbnail_loc>
      <video:title>How to X</video:title>
      <video:description>A 2-minute guide to X.</video:description>
      <video:content_loc>https://cdn.example.com/videos/how-to-x.mp4</video:content_loc>
      <video:publication_date>2025-10-15T10:00:00+00:00</video:publication_date>
    </video:video>
  </url>
</urlset>

4.5 Hreflang (multilingual) 🌍

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <!-- Repeat an entry like this for every language version, each listing the same set of alternates. -->
    <loc>https://example.com/de/produkt/abc</loc>
    <lastmod>2025-10-19T11:20:00+00:00</lastmod>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/produkt/abc"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/product/abc"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/product/abc"/>
  </url>
</urlset>
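
Hreflang entries are easy to get wrong by hand because every language version needs its own entry with the full set of alternates. The sketch below shows one way to generate them; `hreflang_entries` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Dict, List
from xml.sax.saxutils import escape, quoteattr


def hreflang_entries(alternates: Dict[str, str], x_default_lang: str) -> List[str]:
    """Return one <url> block per language; each lists every alternate plus x-default."""
    if x_default_lang not in alternates:
        raise KeyError(f"no URL for x-default language {x_default_lang!r}")
    # quoteattr() returns the value already wrapped in quotes and XML-escaped
    links = [
        f'    <xhtml:link rel="alternate" hreflang={quoteattr(lang)} href={quoteattr(href)}/>'
        for lang, href in alternates.items()
    ]
    links.append(
        f'    <xhtml:link rel="alternate" hreflang="x-default" '
        f'href={quoteattr(alternates[x_default_lang])}/>'
    )
    blocks = []
    for href in alternates.values():
        blocks.append("\n".join(["  <url>", f"    <loc>{escape(href)}</loc>", *links, "  </url>"]))
    return blocks
```

The returned blocks still need to be wrapped in a `<urlset>` that declares the `xmlns:xhtml` namespace, as in the example above.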

5) Generation strategies (Drupal/Symfony) ⚙️

Drupal 🟦

  • Use a maintained sitemap module (e.g., Simple XML Sitemap) to:

    • Automatically include nodes, terms, media with canonical URLs
    • Split large sitemaps into multiple files + index
    • Add images/video where applicable
    • Respect “noindex” / unpublished statuses
  • Add custom inclusion/exclusion rules via path patterns or entity conditions.

  • Regenerate on cron or on relevant content events.

Symfony 🟪

  • Implement a SitemapController that:

    • Streams XML (and GZIP variant)
    • Pulls canonical URLs from your routing/entities
    • Includes lastmod from updated timestamps
    • Splits into multiple files and serves a sitemap index for large catalogs
  • Cache output (HTTP cache / reverse proxy) and invalidate on content changes (Messenger events, Doctrine listeners).
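
The streaming idea is language-agnostic, so here is a compact Python sketch of it (in Symfony the same pattern would typically sit behind a StreamedResponse). `stream_urlset` and the `batch` size are assumptions for illustration; rows are assumed to be (URL, lastmod string) pairs coming from a database cursor or similar.

```python
from typing import Iterable, Iterator, List, Tuple
from xml.sax.saxutils import escape

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)


def stream_urlset(rows: Iterable[Tuple[str, str]], batch: int = 1000) -> Iterator[str]:
    """Yield the document in pieces so memory use stays flat for large catalogs."""
    yield HEADER
    buffer: List[str] = []
    for loc, lastmod in rows:
        buffer.append(
            f"  <url><loc>{escape(loc)}</loc><lastmod>{escape(lastmod)}</lastmod></url>\n"
        )
        if len(buffer) >= batch:
            yield "".join(buffer)
            buffer.clear()
    if buffer:
        yield "".join(buffer)  # flush the final partial batch
    yield "</urlset>\n"
```

Because `rows` can be any iterable, the generator never needs the full URL list in memory at once.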


6) What to include (and exclude) ✅❌

Include

  • Canonical URLs that should be indexed
  • Important listing pages (only if canonical and not thin/duplicate)
  • Product/article/detail pages
  • Media pages if they add value (and are indexable)

Exclude

  • Non-canonical variants (with tracking or sort/filter params)
  • Paginated duplicates unless they have unique canonical signals
  • Thin, gated, or noindex content
  • 3xx/4xx/5xx URLs (keep sitemap clean)
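
The include/exclude rules above translate into a small filter. A sketch, assuming each page record already carries its canonical URL, HTTP status, and noindex flag; the `Page` type and the parameter list are hypothetical and should be adapted to the project.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit

# Assumed examples of tracking/sort/filter parameters; extend per project
EXCLUDED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "filter"}


@dataclass
class Page:
    url: str
    canonical: str
    status: int
    noindex: bool = False


def is_eligible(page: Page) -> bool:
    """True only for indexable, canonical, 200-status URLs without noisy parameters."""
    if page.status != 200 or page.noindex:
        return False
    if page.url != page.canonical:
        return False
    params = {key for key, _ in parse_qsl(urlsplit(page.url).query, keep_blank_values=True)}
    return not (params & EXCLUDED_PARAMS)
```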

7) Freshness & automation 🔄

  • lastmod discipline: update on publish/edit (not on every cache clear)
  • CI/CD: after deploy or import jobs, trigger a sitemap rebuild step
  • Cron: daily (or more frequent) regeneration for active sites
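  • Ping: Skip it. Google retired its sitemap ping endpoint in 2023 and Bing dropped anonymous sitemap pings in 2022; rely on the robots.txt pointer, webmaster-tool submission, and accurate lastmod values instead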
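
One way to keep rebuilds from creating false churn is to write a child sitemap only when its bytes actually changed, and bump the index lastmod only in that case. A sketch with a hypothetical helper name:

```python
from pathlib import Path


def write_if_changed(path: Path, xml_bytes: bytes) -> bool:
    """Write the sitemap only when its content differs; return True if it was written."""
    if path.exists() and path.read_bytes() == xml_bytes:
        return False  # identical output: leave the file and its lastmod alone
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(xml_bytes)
    tmp.replace(path)  # swap the finished file into place in one step
    return True
```

A cron or CI rebuild step can call this for every child and update the index entry only for files where it returns True.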

8) robots.txt integration 🤝

Always advertise your sitemaps in robots.txt:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemaps/sitemap-index.xml

Even if you already submit them in Google Search Console (GSC), the robots.txt pointer lets other engines and crawlers discover them without any manual submission.
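
To confirm the pointers are actually readable, the Python standard library can parse them out of a robots.txt body. A small sketch; `sitemaps_from_robots` is an illustrative wrapper, and fetching the file is left out on purpose.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs from all Sitemap: lines in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap: line exists
    return parser.site_maps() or []
```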


9) QA & monitoring 🕵️

Quick checks

# Return headers & preview
curl -I https://example.com/sitemap.xml
curl -s https://example.com/sitemap.xml | head -n 40

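# Check headers of a gzip child, then confirm it decompresses to XML
curl -I https://example.com/sitemaps/sitemap-pages.xml.gz
curl -s https://example.com/sitemaps/sitemap-pages.xml.gz | gunzip -c | head -n 20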

Search Console

  • Submit the sitemap index (not every child)
  • Watch the Page indexing report (formerly Coverage): errors, excluded URLs, and “discovered – currently not indexed”
  • Investigate sudden gaps between submitted and indexed counts

Health checklist

  • ✅ 200 OK + Content-Type: application/xml (or text/xml)
  • ✅ URLs resolve to 200 and match canonicals
  • ✅ No blocked (robots Disallow) URLs listed
  • ✅ Reasonable file sizes; split when needed
  • ✅ Accurate lastmod (no false churn)
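
Parts of this checklist can be automated offline. The sketch below audits an already-downloaded sitemap for URL count, host, and lastmod format; it does not check HTTP status codes or robots rules, which need network access. `audit_sitemap` is a hypothetical name.

```python
from datetime import datetime
from typing import List
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_sitemap(xml_bytes: bytes, expected_host: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: List[str] = []
    root = ET.fromstring(xml_bytes)
    urls = root.findall("sm:url", NS)
    if len(urls) > 50_000:
        problems.append(f"too many URLs: {len(urls)}")
    for url in urls:
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        parts = urlsplit(loc)
        if parts.scheme not in ("http", "https") or parts.netloc != expected_host:
            problems.append(f"relative or off-host URL: {loc!r}")
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod:
            try:
                # Map a trailing Z to an offset for Python versions before 3.11
                datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
            except ValueError:
                problems.append(f"unparseable lastmod for {loc!r}: {lastmod!r}")
    return problems
```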

10) Server routing & headers 🧱

Apache

# Serve XML correctly (often automatic)
AddType application/xml .xml

# Optional: redirect the conventional path to the sitemap index
Redirect 301 /sitemap.xml /sitemaps/sitemap-index.xml

Nginx

location = /sitemap.xml {
  try_files /sitemaps/sitemap-index.xml =404;
  types { application/xml xml; }
  default_type application/xml;
}

Caching

  • Cache XML for minutes to hours (purge on update).
  • Avoid app-level regeneration on every request.
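
If the sitemap is served by the application rather than as a static file, conditional requests keep repeat fetches cheap. A framework-neutral sketch: `respond` is a hypothetical handler, and the one-hour max-age is just one assumed value within the "minutes to hours" range.

```python
import hashlib
from typing import Dict, Optional, Tuple


def etag_for(body: bytes) -> str:
    """Derive a quoted ETag from the sitemap bytes."""
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload), answering 304 when the client copy is current."""
    headers = {"ETag": etag_for(body), "Cache-Control": "public, max-age=3600"}
    if if_none_match == headers["ETag"]:
        return 304, headers, b""
    headers["Content-Type"] = "application/xml"
    return 200, headers, body
```

The `body` here should come from a cache or a pre-generated file, so that computing the ETag does not trigger a rebuild.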

11) Multisite & multilingual patterns 🏢🌐

  • By site/host: one sitemap index per host, kept separate (e.g., de.example.com vs en.example.com)
  • By type: /sitemaps/sitemap-pages.xml.gz, /sitemaps/sitemap-blog.xml.gz, /sitemaps/sitemap-products-1.xml.gz, etc.
  • By language: /sitemaps/sitemap-de.xml.gz, /sitemaps/sitemap-en.xml.gz (with hreflang links in each URL entry)
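
The naming patterns above can be derived mechanically from grouped URLs. A sketch, assuming URLs are already grouped by section or language; `plan_child_files` is an illustrative name.

```python
from typing import Dict, List


def plan_child_files(sections: Dict[str, List[str]], max_urls: int = 50_000) -> Dict[str, List[str]]:
    """Map child filename to its URLs; sections over the limit get numbered parts."""
    plan: Dict[str, List[str]] = {}
    for name, urls in sections.items():
        parts = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
        if len(parts) == 1:
            plan[f"sitemap-{name}.xml.gz"] = parts[0]
        else:
            # Empty sections yield no parts and therefore no file
            for number, part in enumerate(parts, start=1):
                plan[f"sitemap-{name}-{number}.xml.gz"] = part
    return plan
```

The keys of the returned plan are exactly the entries the sitemap index for that host needs to list.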

12) Playbook (fast start) 🚀

Marketing/blog (Drupal)

  1. Install Simple XML Sitemap (or equivalent).
  2. Enable nodes/terms/media; set lastmod rule.
  3. Exclude search/facet/archive noise.
  4. Generate sitemap index; add Sitemap: lines to robots.txt.
  5. Submit index in Google Search Console.

Catalog app (Symfony)

  1. Create generator service + controller.
  2. Stream index + chunked sitemaps (at most 50,000 URLs each).
  3. Include lastmod from entities.
  4. Cache & purge on writes.
  5. Wire Sitemap: in robots.txt; submit in GSC.

13) Common pitfalls 🧨

  • Listing non-canonical or blocked URLs (wastes crawl budget)
  • Omitting lastmod → slower freshness detection on large sites
  • Giant single sitemap → exceeds size/URL limits
  • Regenerating too often with fake lastmod churn → noisy, low signal
  • Mixing hosts improperly (cross-host URLs without proper ownership)

14) Maintenance checklist (monthly) 🧰

  • Validate status codes of top URLs (200, not 3xx/4xx/5xx)
  • Compare submitted vs indexed charts in GSC
  • Review exclusions (duplicates, canonicalized) and clean sitemap inputs
  • Check file counts/size; split as needed
  • Ensure robots.txt Sitemap: pointers are correct

15) Quick FAQ ❓

Q: Should I include paginated pages?
A: Only if they’re canonical and valuable. Otherwise rely on internal links from listing pages to detail pages.

Q: Do priority and changefreq matter?
A: Barely. Google ignores both, and other engines may treat them as weak hints at best. Content quality, links, and an accurate lastmod carry more weight.

Q: Can I include noindex URLs?
A: Don’t. Sitemaps should list indexable canonical URLs only.


Changelog

  • 2025-11-04 — Initial version; includes Drupal/Symfony recipes, image/video/hreflang examples, and ops checklists.
