Sunday, 23 February 2014

Problems creating your XML sitemap?

Some users experience problems when they create an XML sitemap because of how their website is implemented and hosted. Here are some common problems with websites that result in inconsistent XML sitemaps.
  • Server / performance issues
  • Inconsistent urls / domains
  • iFramed / framed pages
  • Bad header tags 
  • Incorrect server response
  • Page size too large
  • No "real" links / Non native behaviors
  • Inconsistent behavior for different user agents / browsers
  • Poor HTML mark up
  • Incorrect use of character sets
  • Incorrect modified date header


Server / performance issues


Some web servers are slow to respond, throw errors, or crash. Your web server needs to be responsive and able to serve pages quickly. We limit the amount of time our spider will wait for a given page and the total time allowed to create a sitemap. If your server is slow or producing errors, our spider will fail to complete your sitemap.

Inconsistent urls / domains


Our spider only covers the current domain. If you use multiple domains within your website, our spider will not follow them. This can be subtle, for example http vs https, or addresses with or without www.

<a href="http://xmlsitemapgenerator.org/index.aspx">home</a>
<a href="http://www.xmlsitemapgenerator.org/about.aspx">about</a>
<a href="http://www.xmlsitemapgenerator.org.uk/contact.aspx">contact</a>

Always use the same domain throughout your website; even better, use relative URLs:

<a href="/index.aspx">home</a>
<a href="/about.aspx">about</a>
<a href="/contact.aspx">contact</a>

This goes for links, framesets, image maps, etc. The sitemap spider works within the context of one domain, so it will ignore URLs in other domains.

iFramed / framed pages


Problems with iFrames are usually related to inconsistent URLs/domains, as described above.

E.g. if your website is https://xmlsitemapgenerator.org and your homepage contains an iFrame like the one below, our spider will not get beyond your homepage.

<iframe src="http://www.xmlsitemapgenerator.org"></iframe>

To resolve this, ensure the domains match or use relative URLs. If you cannot do this, use the address from the iFrame to create your sitemap.
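For example, a relative iFrame source keeps our spider on your own domain (the page name below is just an illustration):

<iframe src="/content.aspx"></iframe>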

Bad header tags


If you use the canonical URL tag or noindex / nofollow, make sure you use them correctly. We have seen cases where people have a noindex/nofollow on their homepage! Our spider will ignore pages with a nofollow / noindex, and this is generally very bad for SEO!

Similarly, a bad or circular canonical URL sends our spider round in circles until it gives up. We essentially treat canonical tags as a redirect.
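For reference, a typical canonical link and robots meta tag look like this (the URL below is just an example):

<link rel="canonical" href="http://xmlsitemapgenerator.org/about.aspx" />
<meta name="robots" content="index, follow" />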

Incorrect server response


For a page to be included in your sitemap it must return a 200 or 206 success code. If it returns anything else it will be excluded from your sitemap.

Misconfiguration or server errors are usually the cause of problems, e.g. issuing a 404 incorrectly, a bad 302 redirect, or a server error such as 500.
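You can check this by looking at the status line your server actually returns. A page that should appear in your sitemap needs a response like the first line below; responses like the others will be excluded (these are just example status lines):

HTTP/1.1 200 OK
HTTP/1.1 404 Not Found
HTTP/1.1 500 Internal Server Error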

Page size too large


For performance reasons we limit the size of content that our spider processes per page. If your page is very large we will truncate it. If we truncate it before finding any URLs, the spider will fail to get beyond the given page.

This usually occurs if you have a lot of header content such as embedded CSS and JavaScript. It's generally good practice to separate these out to enable better caching and reduce page sizes. E.g.

<link rel="stylesheet" href="myCssFile.css" />
<script type="text/javascript" src="/myJavascriptsFile.js"></script>

No "real" links / non native behaviours


Some websites use Flash and JavaScript for links. Remember that spiders and search engines struggle with these, and they are not particularly accessible to users with disabilities. You should always cater for native link behaviour, especially within your home page, to ensure that all visitors and spiders can find their way into your website.

These problems can occur when using some JavaScript frameworks that render pages or manage navigation outside native HTML "a" tags and href attributes.

You should also follow best practice and native behaviour when carrying out common tasks. For example, if you want to do redirects use the correct meta tag or response headers. Patterns such as document.location.href = 'index.asp' will prevent our spider finding your pages.
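As an illustration (the page name is just an example), our spider cannot follow the first link below because there is no real href to read, but it can follow the second:

<a href="#" onclick="document.location.href='/about.aspx';">about</a>
<a href="/about.aspx">about</a>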

Poor HTML mark up


Not all web design software, and indeed not all web designers, create high quality HTML. Some basic tips include:

Always enclose your HTML attributes in quotes:
<meta name="description" content="something" />

Don't use unnecessary spaces in markup:
<meta name = "description" content = "something" />

Don't forget to include closing tags:
<a href="/contact.aspx">contact</a>
<div>Hello World</div>

Use the correct syntax for URLs. We quite often see things like this:
<a href="///contact.aspx">contact</a>

As noted above, always use the same domain throughout your website; even better, use relative URLs.

You can use a free HTML / XHTML validator to help check your HTML.
http://validator.w3.org/

Inconsistent behaviour for different user agents / browsers

We've noticed that some websites redirect browsers but don't redirect our spider, or vice versa. For example, when we visit your website in a browser you redirect to a particular page or folder, but your server presents a different page to our spider. Some users are not aware of this page, or have forgotten about it, and its links are often broken or out of date.

Similarly, links and content can be presented differently. Always treat our spider as you would a browser to get the sitemap you expect.

Note that you can detect our sitemap spider from our user agent string:

"XmlSitemapGenerator - http://xmlsitemapgenerator.org"


Incorrect use of character sets

Make sure you are using the correct character set in your response headers. This is particularly important if you are using non-Latin character sets such as Arabic, Chinese, etc.
http://www.w3schools.com/tags/ref_charactersets.asp
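For example, if your pages are UTF-8 encoded (an assumption for illustration only), the response header and the equivalent meta tag would look like this:

Content-Type: text/html; charset=utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />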

 

Incorrect modified date header

Make sure your server response includes the correct modified date for a page. Some servers just return the current date and time, which results in an incorrect sitemap and can mean it takes longer to spider your website.
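For reference, the relevant response header looks like this (the date shown is just an example):

Last-Modified: Sat, 15 Feb 2014 09:30:00 GMT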

 

Conclusion 

Our spider trusts your server response, the headers, meta tags and page content.

Adopt standards and be clean and consistent in your approach to ensure that our spider, users and search engines can navigate your website effectively.

The cleaner your website implementation, the more effective it will be and the easier it will be to create an XML sitemap.

Saturday, 15 February 2014

Improved support for character encoding and redirects

Character encoding

We've improved our spider so that it can cope with a wider range of character sets including Arabic and Chinese.

Don't forget that for this feature to work correctly it is important that we can understand your website's encoding; otherwise our spider won't interpret it correctly and your sitemap will contain strange characters and symbols.

http://www.w3schools.com/tags/ref_charactersets.asp

Improved HTTP 301 redirect and 302 Moved handling

We now follow HTTP 301 and HTTP 302 redirects automatically. In addition, if we detect a 301 or 302 when the spider first hits your home page, we will work out what the base domain for your website is.

Coupled with the improved pattern matching (see below), we can now determine your primary domain and match more variations to come up with a complete XML sitemap with canonical URLs.
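For reference, a typical redirect response from the non-www address might look like this (the target URL is just an example):

HTTP/1.1 301 Moved Permanently
Location: http://www.xmlsitemapgenerator.org/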

Improved pattern matching

If you enter a single domain we will now check for more patterns:

http://www.xmlsitemapgenerator.org
https://www.xmlsitemapgenerator.org
http://xmlsitemapgenerator.org
https://xmlsitemapgenerator.org

When we create your sitemap we will use the version that you entered in the settings, so make sure this is the correct primary domain for your website.

Fixed https validation

You can now enter an https:// address without getting an invalid URL error.

We also now try to handle badly constructed URLs better, such as:
xmlsitemapgenerator.org///directory/test.html

Thursday, 13 February 2014

Spider Performance Improvements

Our spider was looking a little tired! At times some users were waiting a while for it to complete its job, and in some particularly busy periods they were getting timeout errors.

The good news is we have done some housekeeping, clearing out millions of records, re-building, de-fragging, etc., and we are now ticking over a bit more smoothly.

Don't forget if you are having problems you can always contact us.

Sunday, 9 February 2014

Support for http-equiv="refresh" added and more ....

On the 9th of Feb we made some minor updates.....

New : Follow meta refresh tag e.g. http-equiv="refresh"


We found quite a number of websites using meta tags on their homepage to redirect to another page. Previously we were not detecting this, so our spider only found a homepage.

We now take the http-equiv tag URL and follow it. E.g.:

<meta http-equiv="refresh" content="0; url=http://example.com/">

New : Automatically follow both www. and non www. urls.

Some users did not understand which domain they were using and were specifying full URLs within pages, mixing the www and non-www equivalents,
e.g. http://www.xmlsitemapgenerator.org and http://xmlsitemapgenerator.org

If you enter one, we now follow both by default, but we don't duplicate your URLs. We list them once using the domain you entered on the setup page.

Fix : Errors with duplicate urls for frames.

We noticed that when there were frames on a page with the same URL it was causing our spider problems and throwing an error, so we fixed this!