Improved HTML spidering (Robots, Canonical, Rel)

Friday, 28 March 2014

We now parse a number of HTML elements to better understand your website and determine which pages should be in your sitemap.

Canonical URLs

We now detect the link rel="canonical" tag.

 <link rel="canonical" href="https://www.example.com/canonical-page" />

Where we detect this tag and it points to another page, we will not include the current page in the sitemap and will instead spider the URL specified in the tag's href attribute.

Meta robots

We now obey the robots meta tag.

<meta name="robots" content="noindex, nofollow" />
<meta name="robots" content="noindex" />
<meta name="robots" content="nofollow" />

Where a noindex or nofollow value is detected, we will not index the page or will stop following the URLs on it, respectively.

Anchor rel attribute

We now obey rel="nofollow" in anchor tags.

<a rel="nofollow" href="/index.aspx">...</a>

As with the robots meta tag, if we detect a nofollow value we will not follow the URL.


Text, HTML sitemaps, Robots.txt and more

Sunday, 23 March 2014

This version includes some new updates that people have been asking for, including a number of new sitemap formats.

We recommend that you have a valid robots.txt file, an XML sitemap and an HTML sitemap in your website root folder to optimise your sitemap coverage.

XML and HTML Sitemaps, and Robots.txt

HTML sitemaps

The great thing about an HTML sitemap is that once you publish it, any search engine can deal with it whether it officially supports sitemaps or not. At the minute we list all URLs in alphabetical order. If you have any suggestions to improve this, please feel free to contact us.

Text Sitemaps

We also added text sitemaps to the list of generated files; a text sitemap is simply a plain list of URLs.
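For illustration (the URLs are placeholders), a text sitemap is just one absolute URL per line:

```text
https://www.example.com/
https://www.example.com/about.html
https://www.example.com/contact.html
```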


We also produce a robots.txt file for you which you can upload to your website (or use to modify your existing robots file).
The robots.txt file makes it easier for search engines to automatically discover your sitemaps.
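As a sketch, a minimal robots.txt that advertises your sitemaps might look like the following (the filenames are placeholders; use the names of the files you actually upload):

```text
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap.txt
```

The Sitemap directive is how crawlers auto-discover your sitemap files, and an empty Disallow line permits crawling of the whole site.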

Other changes

We've also improved the error report, changing it from an XML file to a standard HTML page to make it easier to understand, and refined our quick guide to publishing a sitemap.

How long does it take to generate my XML sitemap?

Friday, 14 March 2014

Our spider can sometimes take a while to process your website and people ask how long they should wait. The time to generate a sitemap can vary quite dramatically from a few seconds to over 10 minutes and this can be influenced by a number of factors.


Key Influencing Factors


Your website performance

The key limiting factor will usually be how fast your website or web server can respond to our spider's requests. If your web server is on the other side of the world and/or slow to respond, this will delay our spider. Your website may seem fast to you if you are geographically close to it, but it may not seem so to our spider.


Size of pages

Our spider has to download and process every page to find links. The smaller and more efficient your pages, the faster we can spider them. If your pages are large, they will take longer to download.


Number of pages

The more pages you have on your site, the longer it will take; a slow server or lots of large pages compounds this.

For example, if you have 10 pages that each take 5 seconds, the crawl will take at least 50 seconds to complete. If you have 200 pages at 2 seconds each, it will take at least 400 seconds (about 6.7 minutes).
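The arithmetic above can be sketched as a simple lower-bound estimate (assuming pages are fetched one at a time; the function name is ours, not part of the tool):

```python
def estimated_crawl_seconds(page_count: int, seconds_per_page: float) -> float:
    # Lower bound: assumes sequential fetches with no
    # processing or queuing overhead on top.
    return page_count * seconds_per_page

print(estimated_crawl_seconds(10, 5))   # 50 seconds
print(estimated_crawl_seconds(200, 2))  # 400 seconds, about 6.7 minutes
```

The real crawl will usually take longer than this estimate, since it ignores processing time and service load.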


Our service load

During busy times our spider may be indexing lots of websites and we have limited CPU, memory and bandwidth. This can cause the service to run slower.




Try to ensure your web pages are well designed and as lean as possible to aid the spidering of your website.

Ensure that you are not throttling the browsing of your website for our user agent, "XmlSitemapGenerator -".

Most importantly, when you generate your XML sitemap, ensure you enter your email address. That way, if there is a delay, you will be notified by email when your sitemap is ready.

Home page redirects fixed

Tuesday, 11 March 2014

Some homepage redirects were causing problems for our spider and resulted in sitemaps with no files for a small number of users. We believe this is now resolved. Thanks for the feedback.

Why do you limit the number of URLs in a sitemap?

Sunday, 9 March 2014

We get asked this quite a lot...

The main reason is that XmlSitemapGenerator is a free tool and generating sitemaps is not a free process.

Our spider indexes thousands of pages a day, utilizing lots of server resources (memory, CPU and bandwidth) and racking up gigabytes of data as it indexes pages and builds up profiles.

The overhead of the spidering process is quite large, especially at busy times when many people are generating sitemaps. To help ease the pressure we have a queuing mechanism to avoid a logjam on the server, but that means people then have to wait.

Therefore, to control resource utilisation and minimize wait times, we limit the number of URLs our spider will crawl for a given website.

Over time we have increased the number of URLs we accept from 50 to 100, and at the time of writing we support 250. Of course we constantly review this and may decide to raise or lower it in the future.

From reviewing our statistics, 99% of users of our free XML sitemap generator don't hit this limit, and many don't even come close, so for now we are happy that the tool is fit for purpose.


What if I have more URLs?

Generally, if your website comprises more than 100 pages, you are probably using some sort of content management system (CMS) or an ecommerce system. With such systems there are much better ways to create sitemaps faster and more effectively, directly from the database. Many systems support adapters and plugins to help you generate your sitemap.

Improved download and error reports

Sunday, 2 March 2014

We've made some improvements to the sitemap download page to make it easier to download your sitemaps.
  • Firstly, we've made all the files available as a single zip file download.
  • We have also added a simple table that gives you access to your XML sitemap, RSS sitemap and a new error report.