Problems creating your XML sitemap?

Some users experience problems when creating an XML sitemap because of how their website is implemented and hosted. Here are some common problems with websites that result in incomplete or inconsistent XML sitemaps.

  • Server / performance issues
  • Inconsistent urls / domains
  • iFramed homepage
  • Bad header tags 
  • Incorrect server response
  • Page size too large
  • No “real” links / Non-native behaviours
  • Inconsistent behaviour for different user agents / browsers
  • Poor HTML mark up
  • Incorrect use of character sets
  • Incorrect modified date header

Server / performance issues

Some web servers are slow to respond, throw errors, or crash. Your web server needs to be responsive and able to serve pages quickly. We limit the amount of time our spider will wait for a given page, and the total time allowed to create a sitemap, so if your server is slow or producing errors our spider will fail to complete your sitemap.

Inconsistent urls / domains

Our spider only covers the current domain. If your website spans multiple domains, our spider will not follow links to them. This can be subtle: http vs https, and addresses with or without www, count as different domains. For example, illustrative links that mix protocols and sub-domains:

  <a href="http://xmlsitemapgenerator.org/home.html">home</a>
  <a href="https://xmlsitemapgenerator.org/about.html">about</a>
  <a href="https://www.xmlsitemapgenerator.org/contact.html">contact</a>

Always use the same domain throughout your website, or better still use relative URLs, for example:

  <a href="/home.html">home</a>
  <a href="/about.html">about</a>
  <a href="/contact.html">contact</a>

This goes for links, framesets, image maps, etc. The sitemap spider works within the context of one domain, so it will ignore URLs on other domains.

iFramed / framed pages

Problems with iFrames are usually related to inconsistent urls/domains as above.

e.g. if your website is https://xmlsitemapgenerator.org and on your homepage you have an iFrame as below, our spider will not get beyond your homepage.
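
The snippet below is purely illustrative; note that the www sub-domain in the iFrame src does not match the website's own domain:

  <!-- illustrative iFrame: www vs non-www mismatch -->
  <iframe src="https://www.xmlsitemapgenerator.org/home.html"></iframe>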

To resolve this, ensure the domains match or use relative URLs. If you cannot do this, use the address from the iFrame to create your sitemap.

Bad header tags

If you use the canonical URL meta tag or noindex / nofollow, make sure you use them correctly. We have seen cases where people have a noindex / nofollow on their homepage! Our spider will ignore pages with a nofollow / noindex, and this is generally very bad for SEO!

Similarly, a bad or circular canonical URL sends our spider round in circles until it gives up. We essentially treat canonical tags as a redirect.
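
For reference, a correctly used canonical tag and robots meta tag look something like this (the page URL is just an example); the canonical URL should point at the page's one preferred address, not at another page:

  <!-- example canonical and robots tags; the href is illustrative -->
  <link rel="canonical" href="https://xmlsitemapgenerator.org/about.html" />
  <meta name="robots" content="index, follow" />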

Incorrect server response

To include a page in your website it must return a 200 or 206 success code. If it returns anything else it will be excluded from your sitemap.

Misconfiguration or server errors are usually the cause of problems, e.g. issuing a 404 incorrectly, a bad 302 redirect, or a server error such as a 500.

Page size too large

For performance reasons we limit the size of content that our spider processes per page. If your page is very large we will truncate it. If we truncate it before finding any URLs, the spider will fail to get beyond the given page.

This usually occurs if you have a lot of header content such as embedded CSS and JavaScript. It’s generally good practice to separate these out into external files to enable better caching and reduce page sizes, e.g. as shown below.
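
A minimal sketch (the file names are illustrative): reference styles and scripts as external files rather than embedding them in every page:

  <!-- external stylesheet and script; paths are examples only -->
  <link rel="stylesheet" href="/css/site.css" />
  <script src="/js/site.js"></script>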

No “real” links / Non-native behaviours

Some websites use Flash and JavaScript for links. Remember that spiders and search engines struggle with these, and they are not particularly accessible to users with disabilities. You should always cater for native link behaviour, especially within your home page, to ensure that all visitors and spiders can find their way into your website.

These problems can occur when using some JavaScript frameworks that render pages or manage navigation outside native HTML “a” tags and href attributes.

You should also follow best practice and native behaviour for carrying out common tasks. For example, if you want to do redirects, use the correct meta tag or response headers. Using patterns such as document.location.href = 'index.asp' will prevent our spider from finding your pages, as illustrated below.
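
A rough illustration (the page name is made up) of a native link our spider can follow versus script-only navigation it cannot:

  <!-- native link: the spider can follow the href -->
  <a href="/contact.html">contact</a>

  <!-- script-only navigation: there is no href for the spider to follow -->
  <span onclick="document.location.href = '/contact.html'">contact</span>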

Poor HTML mark up

Not all web design software, and indeed not all web designers, create high quality HTML. Some basic tips, with an illustrative example after the list:

  • Always enclose your HTML attributes in quotes.
  • Don’t use unnecessary spaces in your markup.
  • Don’t forget to include closing tags.
  • Use the correct syntax for URLs, including the protocol (http:// or https://) for absolute links.
  • Always use the same domain throughout your website, or better still use relative URLs, as above.
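
An illustrative sketch of the kind of markup problems we see (the URLs and text are made up), followed by corrected versions:

  <!-- problems: stray spaces, unquoted attribute, missing closing tags, missing protocol -->
  < a href=contact.html >contact
  <p>Hello World
  <a href="www.xmlsitemapgenerator.org/contact.html">contact</a>

  <!-- corrected -->
  <a href="contact.html">contact</a>
  <p>Hello World</p>
  <a href="https://xmlsitemapgenerator.org/contact.html">contact</a>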

You can use a free HTML / XHTML validator to help check your HTML.
http://validator.w3.org/

Inconsistent behaviour for different user agents / browsers

We’ve noticed that some websites redirect browsers but don’t redirect our spider, or vice versa. For example, when we visit your website in a browser you redirect us to a particular page or folder, but your server presents a different page to our spider. Some users are not aware of this page, or had forgotten about it, and its links are often broken or out of date.

Similarly links and content can be presented differently. Always treat our spider as you would a browser to get the sitemap you expect.

Note you can identify our sitemap spider from our user agent string:

“XmlSitemapGenerator - http://xmlsitemapgenerator.org”

Incorrect use of character sets

Make sure you are using the correct character set in your response headers. This is particularly important if you are using non-Latin character sets such as Arabic, Chinese, etc.
http://www.w3schools.com/tags/ref_charactersets.asp
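
As a minimal illustration (UTF-8 shown as an example), you can also declare the character set in the page itself; it should match the charset sent in your response headers:

  <!-- example charset declaration; use whichever encoding your pages actually use -->
  <meta charset="utf-8" />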

Incorrect modified date header

Make sure your server response includes the correct last modified date for a page. Some servers just return the current date and time, which results in an incorrect sitemap and can mean it takes longer to spider your website.
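
For example, a correct response might include a header along these lines (the date shown is arbitrary):

  Last-Modified: Tue, 01 Jul 2014 10:30:00 GMT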

Conclusion 

Our spider trusts your server response, the headers, meta tags and page content.

Adopt standards and be clean and consistent in your approach to ensure that our spider, users and search engines can navigate your website effectively.

The cleaner your website implementation, the more effective it will be and the easier it will be to create an XML sitemap.