Friday, 28 March 2014

Improved HTML spidering (Robots, Canonical, Rel)

We now parse a number of HTML elements to better understand your website and decide which files should be in your sitemap.

Canonical URLs


We now detect the link rel="canonical" tag.

 <link rel="canonical" href="http://xmlsitemapgenerator.org"/>

Where we detect this tag and it points to another page, we will not include the current page in the sitemap and will instead spider the URL specified in the href attribute of the tag.
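For readers who want to apply the same rule in their own tooling, here is a minimal Python sketch of canonical detection. It is an illustration only, not our actual spider code: the names CanonicalFinder and canonical_target are hypothetical, and ignoring a trailing slash when comparing URLs is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_target(page_url, html):
    """Return the canonical URL if it points to another page, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    # Resolve relative hrefs against the page being spidered.
    target = urljoin(page_url, finder.canonical)
    # Assumption: URLs that differ only by a trailing slash are the same page.
    if target.rstrip("/") == page_url.rstrip("/"):
        return None
    return target
```

A spider using this sketch would skip the current page and queue the returned URL whenever canonical_target returns something other than None.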

Meta robots 

 

We now obey the meta tag for robots.

<meta name="robots" content="noindex, nofollow" />
<meta name="robots" content="noindex" />
<meta name="robots" content="nofollow" />

Where a noindex value is detected we will not index the page, and where a nofollow value is detected we will not follow the URLs on that page.
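The meta robots rule can be sketched in Python roughly as follows. This is a simplified illustration, not our production code: MetaRobotsParser and robots_policy are hypothetical names, and only the noindex and nofollow values are interpreted.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of values.
        content = attributes.get("content") or ""
        for value in content.split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_policy(html):
    """Return (index_page, follow_links) for the given page HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    index_page = "noindex" not in parser.directives
    follow_links = "nofollow" not in parser.directives
    return index_page, follow_links
```

For the three example tags above, this sketch returns (False, False), (False, True) and (True, False) respectively; a page with no meta robots tag gives (True, True).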

Anchor rel attribute

 

We now obey rel="nofollow" in anchor tags.

<a rel="nofollow" href="/index.aspx">Link text</a>

As with the meta robots tag, if we detect a nofollow value we will not follow this URL.
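As a rough Python illustration of the anchor rule, the sketch below collects only the links a spider would be allowed to follow. The names FollowableLinks and followable_links are hypothetical, and resolving relative hrefs against a base URL is an assumption about how a spider would normally queue links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FollowableLinks(HTMLParser):
    """Collects anchor hrefs, skipping any anchor marked rel="nofollow"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            return
        self.links.append(urljoin(self.base_url, href))


def followable_links(base_url, html):
    """Return absolute URLs of the links on a page that may be followed."""
    parser = FollowableLinks(base_url)
    parser.feed(html)
    return parser.links
```

Given a page containing the nofollow anchor above plus an ordinary link to /about.aspx, only the /about.aspx URL would be returned.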