In recent versions, I have noticed a reference to a “Google SiteMap” in the administration pages. Not knowing what sitemaps were, I ignored it. Yesterday, I decided to look a bit further into them. This was partly because lately I’ve felt like a large amount of my server bandwidth has been taken by search robots, and I wondered/hoped that a sitemap would make the crawlers use less bandwidth.
When Googling for “sitemap”, most top links take you to the Google Webmaster Tools. Unfortunately, even if you have a Google account, this won’t really show you any information about what a sitemap is until after you’ve gone through a bit of extra registration, and even then, I don’t feel like there’s much information there.
The best links I’ve found about sitemaps are the “Official” sitemaps page and the Wikipedia Sitemaps page. The first thing I wanted to know was whether there was a way, as with “robots.txt”, for web crawlers to discover them. (In the robots.txt case, the crawler just checks to see if the file is there, and IMO, it makes sense for the sitemap to be found in the same way.) Yet I found no reference to a “standard” naming convention for sitemaps. Instead, most webpages I found went on and on about using RPC ‘pings’ to notify crawlers of sitemaps, or uploading your sitemap using the Google Webmaster Tools. This is useful, I agree, but I was looking for a method which did not involve actively trying to get crawlers interested in my site, but rather would just be there when they came looking. What I eventually found (and it is mentioned in both of the links above) is that you can list the URL of the sitemap on a “Sitemap” line in the robots.txt file, which can clue web crawlers in to its presence.
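To make the robots.txt approach concrete, here is a minimal sketch in Python of how a crawler might pick up the sitemap location. The robots.txt content, the example.com domain and the sitemap path are placeholders I made up for illustration; `RobotFileParser.site_maps()` is part of the standard library from Python 3.8 onward. It shows one plausible way to read the file, not how any particular search engine does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; the domain and paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs listed in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap lines.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(find_sitemaps(ROBOTS_TXT))
```

The point is simply that the “Sitemap” line sits passively in a file the crawler already fetches, so nothing has to be pinged or uploaded.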
(Incidentally, at this point, it is worth mentioning that there are two things called sitemaps. One is the “Google sitemap”, supported by practically all major search crawlers. The other is a generic term for a page which links to all other pages on your site, which has many of the same benefits, but is free-form rather than carrying semantic meaning. I don’t actually know if the robots.txt should specifically point to a “Google Sitemap” or if it can be any page which links to all other pages on your site.)
There is little information about what crawlers do with sitemaps, but I can at least share with you what information is in one. In addition to the URL of each page on your site that you want included, the sitemap tells the crawler (1) when the page was last modified, (2) how often the page is modified, and (3) relatively speaking, how important the page is to you. The sitemap seems to serve a few useful purposes. First, it may tell the crawler that pages have not been modified and need not be crawled again. Next, it may point out pages which are not easily reached through links on your site, and which the crawler might otherwise overlook. Also, it tells the crawler how often different pages change, which it may use as a hint as to when to re-crawl them.
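For the curious, the three pieces of information above correspond to the `lastmod`, `changefreq` and `priority` elements of the sitemap XML format, alongside `loc` for the URL. Here is a small sketch in Python that builds such a file. The page URLs, dates and values are invented for illustration, and `build_sitemap` is my own helper name, not something from any sitemap tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap XML format.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from dicts with loc, lastmod, changefreq, priority."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url, tag).text = page[tag]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages: an album page that changes, an image that does not.
    print(build_sitemap([
        {"loc": "https://example.com/gallery/album1", "lastmod": "2008-01-15",
         "changefreq": "weekly", "priority": "0.8"},
        {"loc": "https://example.com/gallery/album1/photo1.jpg",
         "lastmod": "2007-06-01", "changefreq": "never", "priority": "0.3"},
    ]))
```

Priority runs from 0.0 to 1.0 and is only relative to other pages on the same site, and `changefreq` takes values such as `daily`, `weekly` or `never`.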
So back to my original question: “Does a sitemap reduce the crawler bandwidth?” I think the answer is that no one knows, outside of the secret teams who implement the crawlers. But in the case of Gallery (which is the largest content source on my site), the sitemap tells the crawler that the images themselves never change, but that new links may appear on gallery pages. This could reduce bandwidth, as the crawler need not re-download all my pictures very often. Better still, if it trusts my last-modified timestamps, the crawler could limit its regular crawling to pages not listed in the sitemap and fetch a listed page only when its timestamp changes, because whenever a listed page is updated, the sitemap will contain the new timestamp.
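To illustrate the kind of saving I am hoping for, here is a purely speculative sketch in Python of how a crawler could use the timestamps. I have no idea whether any real crawler works this way; `pages_to_recrawl`, the URLs and the dates are all made up for the example.

```python
from datetime import datetime


def pages_to_recrawl(sitemap_lastmod, last_crawled):
    """Return listed URLs that are new or modified since the last crawl.

    sitemap_lastmod: dict of URL -> last-modified time from the sitemap.
    last_crawled:    dict of URL -> time the crawler last fetched the page.
    """
    due = []
    for url, lastmod in sitemap_lastmod.items():
        seen = last_crawled.get(url)
        # Fetch pages never seen before, or changed after the last visit.
        if seen is None or lastmod > seen:
            due.append(url)
    return due


if __name__ == "__main__":
    # Hypothetical data: the album page changed, the photo did not.
    sitemap = {
        "https://example.com/gallery/album1": datetime(2008, 1, 15),
        "https://example.com/gallery/album1/photo1.jpg": datetime(2007, 6, 1),
    }
    crawled = {
        "https://example.com/gallery/album1": datetime(2008, 1, 1),
        "https://example.com/gallery/album1/photo1.jpg": datetime(2008, 1, 1),
    }
    print(pages_to_recrawl(sitemap, crawled))
```

Under this assumption, only the album page would be fetched again and the unchanged photo would be skipped, which is exactly where the bandwidth goes on my site.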
Finding this somewhat satisfactory, I also went looking for a sitemap generator for WordPress, and found one by Arne Brachhold on the extension site. This extension even notifies search engines when the sitemap changes. It also has the added benefit of allowing me to add other pages (which are not written in WordPress) to the sitemap, since the blog is only part of my site.
Who knows if it will make any difference to my site. But I figure it cannot hurt.