Here’s a drop-in wiki entry you can keep in your repos/Notion. It’s pragmatic, SEO-friendly, and tailored for Drupal/Symfony projects—now with emojis 😄
🗺️ Sitemaps — Complete Guide (Concepts, Usage & Recommendations)
Audience: Devs, PMs, SEO maintainers
Applies to: Drupal, Symfony, static sites, APIs, staging & production
1) What a sitemap is (and isn’t) 🧭
- What it is: An XML file that lists the canonical URLs you want crawlers to discover efficiently. It can include lastmod, changefreq, priority, and media/alt-language hints.
- What it isn’t:
  - A guarantee of indexing (search engines choose what to index).
  - A place for non-canonical, blocked, or 404/301 URLs.
  - A replacement for internal linking and canonical tags.
Rule of thumb: Sitemaps help discovery & freshness. Indexing quality still depends on content, links, canonicals, performance, and UX.
2) Sitemap flavors 🍦
- URL (web) sitemap: sitemap.xml or sitemap-*.xml
- Sitemap index: points to multiple child sitemaps (best for big sites)
- Image sitemap: adds <image:image> entries (galleries, product photos)
- Video sitemap: adds <video:video> entries (video hubs, product demos)
- News sitemap: time-sensitive (for eligible news sites)
- Alternate/hreflang support: add <xhtml:link rel="alternate" hreflang="…"> to each URL entry
3) Key limits & structure 🚦
- Max URLs per sitemap: 50,000, or 50 MB uncompressed, whichever comes first
- Compression: GZIP allowed (.xml.gz)
- Multiple sitemaps: Use a sitemap index to organize by section/lang/date
- Paths: URLs must be absolute and on the same host as the sitemap (or properly authorized across hosts)
- Signals to include:
  - lastmod → highly recommended (W3C Datetime, an ISO 8601 profile; UTC is a safe default)
  - changefreq → optional (informational only; Google ignores it)
  - priority → optional (Google ignores it; other engines often do too)
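The 50,000-URL limit above usually translates into a chunking step in the generator. Here is a minimal Python sketch of that step, written language-neutrally (a Drupal or Symfony project would do the same in PHP). The function name chunk_urls and the example URLs are hypothetical, and the sketch only enforces the URL count, not the 50 MB size limit.

```python
from typing import Iterable, Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted in the list above


def chunk_urls(urls: Iterable[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` URLs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled chunk
        yield batch


if __name__ == "__main__":
    urls = (f"https://example.com/p/{i}" for i in range(120_000))
    print([len(chunk) for chunk in chunk_urls(urls)])  # [50000, 50000, 20000]
```

Each chunk would become one child sitemap, with a sitemap index pointing at all of them.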
4) Minimal examples (copy/paste) 🧩
4.1 Basic web sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2025-10-15T12:30:00+00:00</lastmod>
</url>
<url>
<loc>https://example.com/blog/my-post</loc>
<lastmod>2025-10-20T09:12:00+00:00</lastmod>
</url>
</urlset>
4.2 Sitemap index (multiple children)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemaps/sitemap-pages.xml.gz</loc>
<lastmod>2025-10-20T10:00:00+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemaps/sitemap-blog.xml.gz</loc>
<lastmod>2025-10-20T10:00:00+00:00</lastmod>
</sitemap>
</sitemapindex>
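An index like the one in 4.2 is normally generated rather than hand-written. The sketch below builds one with Python's standard xml.etree.ElementTree; the function name build_sitemap_index is an assumption for illustration, and a single shared lastmod is used for simplicity (a real generator would likely track one per child file).

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(child_urls: Iterable[str], lastmod: str) -> bytes:
    """Return a sitemapindex document listing each child sitemap URL."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default xmlns, no prefix
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for child in child_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        # Element text is XML-escaped on serialization, so "&" in URLs is safe.
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = child
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    children = [
        "https://example.com/sitemaps/sitemap-pages.xml.gz",
        "https://example.com/sitemaps/sitemap-blog.xml.gz",
    ]
    print(build_sitemap_index(children, "2025-10-20T10:00:00+00:00").decode("utf-8"))
```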
4.3 Image entries (inside a normal URL sitemap)
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://example.com/product/abc</loc>
<lastmod>2025-10-21T08:00:00+00:00</lastmod>
<image:image>
<image:loc>https://example.com/images/abc-1.jpg</image:loc>
<image:title>Product ABC front view</image:title>
</image:image>
<image:image>
<image:loc>https://example.com/images/abc-2.jpg</image:loc>
</image:image>
</url>
</urlset>
4.4 Video entries
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://example.com/tutorials/how-to-x</loc>
<video:video>
<video:thumbnail_loc>https://example.com/thumbnails/how-to-x.jpg</video:thumbnail_loc>
<video:title>How to X</video:title>
<video:description>A 2-minute guide to X.</video:description>
<video:content_loc>https://cdn.example.com/videos/how-to-x.mp4</video:content_loc>
<video:publication_date>2025-10-15T10:00:00+00:00</video:publication_date>
</video:video>
</url>
</urlset>
4.5 Hreflang (multilingual) 🌍
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://example.com/de/produkt/abc</loc>
<xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/produkt/abc"/>
<xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/product/abc"/>
<xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/en/product/abc"/>
<lastmod>2025-10-19T11:20:00+00:00</lastmod>
</url>
</urlset>
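The example in 4.5 shows only the German URL. In practice every language version gets its own url entry, and each entry lists all alternates, itself included. A hedged Python sketch of that expansion follows; build_hreflang_urlset and its parameters are hypothetical names, and lastmod is left out to keep it short.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_urlset(translations: Dict[str, str], x_default: str) -> bytes:
    """Emit one <url> per language, each listing every alternate (itself included).

    `translations` maps a language code to its URL; `x_default` names the
    language code whose URL should serve as the x-default fallback.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in translations.values():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        for lang, href in translations.items():
            ET.SubElement(url, f"{{{XHTML_NS}}}link",
                          {"rel": "alternate", "hreflang": lang, "href": href})
        ET.SubElement(url, f"{{{XHTML_NS}}}link",
                      {"rel": "alternate", "hreflang": "x-default",
                       "href": translations[x_default]})
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    pages = {
        "de": "https://example.com/de/produkt/abc",
        "en": "https://example.com/en/product/abc",
    }
    print(build_hreflang_urlset(pages, x_default="en").decode("utf-8"))
```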
5) Generation strategies (Drupal/Symfony) ⚙️
Drupal 🟦
- Use a maintained sitemap module (e.g., Simple XML Sitemap) to:
  - Automatically include nodes, terms, media with canonical URLs
  - Split large sitemaps into multiple files + index
  - Add images/video where applicable
  - Respect “noindex” / unpublished statuses
- Add custom inclusion/exclusion rules via path patterns or entity conditions.
- Regenerate on cron or on relevant content events.
Symfony 🟪
- Implement a SitemapController that:
  - Streams XML (and GZIP variant)
  - Pulls canonical URLs from your routing/entities
  - Includes lastmod from updated timestamps
  - Splits into multiple files and serves a sitemap index for large catalogs
- Cache output (HTTP cache / reverse proxy) and invalidate on content changes (Messenger events, Doctrine listeners).
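The "stream XML plus a GZIP variant" idea can be sketched independently of any framework. The Python below is an illustration only, not Symfony code: iter_sitemap_xml and write_gzip_sitemap are hypothetical names, and in a real Symfony app the same generator pattern would sit behind a streamed response in PHP.

```python
import gzip
import io
from typing import BinaryIO, Iterable, Iterator, Tuple
from xml.sax.saxutils import escape

Entry = Tuple[str, str]  # (loc, lastmod)


def iter_sitemap_xml(entries: Iterable[Entry]) -> Iterator[str]:
    """Yield a urlset document piece by piece so it never sits in memory whole."""
    yield '<?xml version="1.0" encoding="UTF-8"?>\n'
    yield '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    for loc, lastmod in entries:
        # escape() handles &, < and > so query strings stay well-formed XML.
        yield f"  <url><loc>{escape(loc)}</loc><lastmod>{escape(lastmod)}</lastmod></url>\n"
    yield "</urlset>\n"


def write_gzip_sitemap(out: BinaryIO, entries: Iterable[Entry]) -> None:
    """Stream the document through gzip into any binary file-like object."""
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for piece in iter_sitemap_xml(entries):
            gz.write(piece.encode("utf-8"))


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_gzip_sitemap(buffer, [("https://example.com/?a=1&b=2", "2025-10-20T09:12:00+00:00")])
    print(gzip.decompress(buffer.getvalue()).decode("utf-8"))
```

Writing to a file object rather than returning a string keeps memory flat for large catalogs; the same function works for an on-disk .xml.gz file or an in-memory buffer.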
6) What to include (and exclude) ✅❌
Include
- Canonical URLs that should be indexed
- Important listing pages (only if canonical and not thin/duplicate)
- Product/article/detail pages
- Media pages if they add value (and are indexable)
Exclude
- Non-canonical variants (with tracking or sort/filter params)
- Paginated duplicates unless they have unique canonical signals
- Thin, gated, or noindex content
- 3xx/4xx/5xx URLs (keep sitemap clean)
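The include/exclude rules above can be expressed as a single filter in the generator. The following Python sketch assumes a hypothetical Page record with url, canonical_url, status, and noindex fields, plus an assumed list of query parameters that mark non-canonical variants; adapt both to the project's data model.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import parse_qsl, urlsplit

# Assumed parameters that signal a tracking/sort/filter variant; adjust per project.
NON_CANONICAL_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "filter"}


@dataclass(frozen=True)
class Page:
    """Hypothetical view of a page as the sitemap generator sees it."""
    url: str
    canonical_url: str
    status: int
    noindex: bool = False


def is_sitemap_candidate(page: Page) -> bool:
    """Apply the exclusion rules: non-200, noindex, non-canonical, variant params."""
    if page.status != 200 or page.noindex:
        return False
    if page.url != page.canonical_url:
        return False
    query = urlsplit(page.url).query
    params = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    return not (params & NON_CANONICAL_PARAMS)


def select_for_sitemap(pages: Iterable[Page]) -> List[str]:
    return [page.url for page in pages if is_sitemap_candidate(page)]


if __name__ == "__main__":
    pages = [
        Page("https://example.com/blog/my-post", "https://example.com/blog/my-post", 200),
        Page("https://example.com/blog/my-post?utm_source=x", "https://example.com/blog/my-post", 200),
        Page("https://example.com/old", "https://example.com/old", 301),
        Page("https://example.com/members", "https://example.com/members", 200, noindex=True),
    ]
    print(select_for_sitemap(pages))  # ['https://example.com/blog/my-post']
```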
7) Freshness & automation 🔄
- lastmod discipline: update on publish/edit (not on every cache clear)
- CI/CD: after deploy or import jobs, trigger a sitemap rebuild step
- Cron: daily (or more frequent) regeneration for active sites
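- Ping: safe to skip. Google retired its sitemap ping endpoint in 2023 and Bing no longer accepts anonymous sitemap pings; rely on the robots.txt pointer, webmaster-tool submission, and accurate lastmod instead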
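For lastmod discipline, it helps to funnel every timestamp through one formatter so all sitemaps emit the same UTC format. A small Python sketch follows; format_lastmod is a hypothetical helper, and it assumes that naive timestamps coming from storage are already in UTC.

```python
from datetime import datetime, timedelta, timezone


def format_lastmod(updated_at: datetime) -> str:
    """Format a content timestamp as a UTC W3C datetime for <lastmod>."""
    if updated_at.tzinfo is None:
        # Assumption: naive timestamps from storage are already UTC.
        updated_at = updated_at.replace(tzinfo=timezone.utc)
    return updated_at.astimezone(timezone.utc).isoformat(timespec="seconds")


if __name__ == "__main__":
    print(format_lastmod(datetime(2025, 10, 15, 12, 30)))
    # 2025-10-15T12:30:00+00:00
    berlin_summer = timezone(timedelta(hours=2))
    print(format_lastmod(datetime(2025, 10, 20, 11, 12, tzinfo=berlin_summer)))
    # 2025-10-20T09:12:00+00:00
```

Feed it the entity's publish/edit timestamp, not the time the sitemap was rebuilt, so a cache clear or redeploy does not change lastmod.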
8) robots.txt integration 🤝
Always advertise your sitemaps in robots.txt:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemaps/sitemap-index.xml
Even if you have already submitted them in GSC, the robots.txt pointer improves discoverability and is useful across engines.
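To confirm that robots.txt really advertises the expected sitemaps, the Sitemap: lines can be read back with Python's standard urllib.robotparser (site_maps() requires Python 3.8+). The wrapper name sitemaps_from_robots is an assumption, and the sketch parses a string instead of fetching over the network so it can run in CI without network access.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on Sitemap: lines, in file order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None (not an empty list) when no Sitemap: line exists.
    return parser.site_maps() or []


if __name__ == "__main__":
    robots = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemaps/sitemap-index.xml
"""
    print(sitemaps_from_robots(robots))
```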
9) QA & monitoring 🕵️
Quick checks
# Return headers & preview
curl -I https://example.com/sitemap.xml
curl -s https://example.com/sitemap.xml | head -n 40
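# Validate gzip child (headers, then archive integrity)
curl -I https://example.com/sitemaps/sitemap-pages.xml.gz
curl -s https://example.com/sitemaps/sitemap-pages.xml.gz | gzip -t && echo "gzip OK"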
Search Console
- Submit the sitemap index (not every child)
- Watch coverage: errors, excluded URLs, and “discovered – currently not indexed”
- Investigate spikes in indexed vs submitted trends
Health checklist
- ✅ 200 OK + Content-Type: application/xml (or text/xml)
- ✅ URLs resolve to 200 and match canonicals
- ✅ No blocked (robots Disallow) URLs listed
- ✅ Reasonable file sizes; split when needed
- ✅ Accurate lastmod (no false churn)
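The "no blocked URLs listed" item in the checklist above can be automated. The sketch below cross-checks a sitemap document against robots.txt rules using only the Python standard library; blocked_sitemap_urls is a hypothetical name, both inputs are passed in as already-downloaded content, and the default user agent "*" is an assumption (pass a specific crawler name to test its rules).

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(sitemap_xml: bytes, robots_txt: str, user_agent: str = "*") -> List[str]:
    """Return sitemap <loc> URLs that robots.txt disallows for `user_agent`."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [loc for loc in locs if not robots.can_fetch(user_agent, loc)]


if __name__ == "__main__":
    sitemap = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/admin/settings</loc></url>
</urlset>"""
    robots = "User-agent: *\nDisallow: /admin/\n"
    print(blocked_sitemap_urls(sitemap, robots))  # ['https://example.com/admin/settings']
```

An empty result means the sitemap and robots.txt agree; anything returned should be removed from the sitemap or unblocked.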
10) Server routing & headers 🧱
Apache
# Serve XML correctly (often automatic)
AddType application/xml .xml
# Optional: Redirect pretty path to generator
Redirect 301 /sitemap.xml /sitemaps/sitemap-index.xml
Nginx
location = /sitemap.xml {
try_files /sitemaps/sitemap-index.xml =404;
types { application/xml xml; }
default_type application/xml;
}
Caching
- Cache XML for minutes to hours (purge on update).
- Avoid app-level regeneration on every request.
11) Multisite & multilingual patterns 🏢🌐
- By site/host: one sitemap index per host, kept separate (e.g., de.example.com vs en.example.com)
- By type: /sitemaps/sitemap-pages.xml.gz, /sitemaps/sitemap-blog.xml.gz, /sitemaps/sitemap-products-1.xml.gz, etc.
- By language: /sitemaps/sitemap-de.xml.gz, /sitemaps/sitemap-en.xml.gz (with hreflang links in each URL entry)
12) Playbook (fast start) 🚀
Marketing/blog (Drupal)
- Install Simple XML Sitemap (or equivalent).
- Enable nodes/terms/media; set lastmod rule.
- Exclude search/facet/archive noise.
- Generate sitemap index; add Sitemap: lines to robots.txt.
- Submit index in Google Search Console.
Catalog app (Symfony)
- Create generator service + controller.
- Stream index + chunked sitemaps (50k URLs each).
- Include lastmod from entities.
- Cache & purge on writes.
- Wire Sitemap: in robots.txt; submit in GSC.
13) Common pitfalls 🧨
- Listing non-canonical or blocked URLs (wastes crawl budget)
- Omitting lastmod → slower freshness detection on large sites
- Giant single sitemap → exceeds size/URL limits
- Regenerating too often with fake lastmod churn → noisy, low signal
- Mixing hosts improperly (cross-host URLs without proper ownership)
14) Maintenance checklist (monthly) 🧰
- Validate status codes of top URLs (200, not 3xx/4xx/5xx)
- Compare submitted vs indexed charts in GSC
- Review exclusions (duplicates, canonicalized) and clean sitemap inputs
- Check file counts/size; split as needed
- Ensure robots.txt Sitemap: pointers are correct
15) Quick FAQ ❓
Q: Should I include paginated pages?
A: Only if they’re canonical and valuable. Prefer strong listing → detail linking with good internal links.
Q: Do priority and changefreq matter?
A: They’re optional hints; engines rely more on content quality, links, and freshness signals.
Q: Can I include noindex URLs?
A: Don’t. Sitemaps should list indexable canonical URLs only.
Changelog
2025-11-04: Initial version; includes Drupal/Symfony recipes, image/video/hreflang examples, and ops checklists.
If you tell me the exact domains and content types (blog, products, docs), I can generate final sitemap index + child files plan, plus ready-to-wire Drupal/Symfony configs and a tiny CI job snippet.