From Wikipedia, the free encyclopedia
The Sitemaps Protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs on the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.
Sitemaps are particularly beneficial in situations:
- where users cannot reach all areas of a website through a browseable interface, so a search engine cannot find those pages. For example, a site with a large "archive" or "database" of resources that are poorly linked to each other (if at all) and accessible only via a search form.
- where webmasters use rich AJAX or Flash content that search engines cannot navigate through to reach the content.
The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, MSN, Yahoo!, and Ask now use the same protocol, a single Sitemap lets the largest search engines receive up-to-date page information.
Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. By submitting Sitemaps to a search engine a webmaster is only helping that engine's crawlers to do a better job of crawling their site(s). Using this protocol does not guarantee that your webpages will be included in search indexes nor does it influence the way that pages are ranked by a search engine.
History of Sitemaps
- Google first introduced Sitemaps 0.84 in June 2005 so web developers could publish lists of links from across their sites.
- Google, MSN and Yahoo announced joint support for the Sitemaps protocol in November 2006. The schema version was changed to "Sitemap 0.90", but no other changes were made.
- In April 2007, Ask.com and IBM announced support for Sitemaps. Also, Google, Yahoo! and Microsoft announced auto-discovery for Sitemaps through robots.txt.
- In May 2007, the state governments of Arizona, California, Utah and Virginia announced they would use Sitemaps on their web sites.
The Sitemaps protocol is based on ideas[1] from "Crawler-friendly Web Servers".[2]
XML Sitemap Format
The Sitemap Protocol format consists of XML tags. The file itself must be UTF-8 encoded.
Sample
A sample Sitemap that contains just one URL and uses all optional tags is shown below.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
   <url>
      <loc>http://www.wikipedia.org</loc>
      <lastmod>2006-11-18</lastmod>
      <changefreq>daily</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
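A file with this structure can also be produced programmatically. The following is a minimal sketch using Python's standard library; the helper name `build_sitemap` is illustrative, and the entry values are taken from the sample above rather than being required defaults.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a sitemap XML string from (loc, lastmod, changefreq, priority) tuples."""
    ET.register_namespace("", NS)  # serialize with the default sitemap namespace
    urlset = ET.Element("{%s}urlset" % NS)
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "{%s}url" % NS)
        ET.SubElement(url, "{%s}loc" % NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % NS).text = lastmod
        ET.SubElement(url, "{%s}changefreq" % NS).text = changefreq
        ET.SubElement(url, "{%s}priority" % NS).text = priority
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([("http://www.wikipedia.org", "2006-11-18", "daily", "0.8")])
print(xml)
```

Using `ElementTree` also guarantees that reserved XML characters in URLs are escaped automatically.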
Submitting Sitemaps
If Sitemaps are submitted directly to a search engine, it will return status information and any processing errors. Refer to Google Webmaster Tools.
Also, the location of the Sitemap can be specified using a robots.txt file to help search engines find the Sitemaps. To do this, the following lines need to be added to robots.txt:
Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the Sitemap, such as http://www.example.org/sitemap.xml.
This directive is independent of the user-agent line, so it doesn't matter where you place it in your file. If you have a Sitemap index file, you can include the location of just that file. You don't need to list each individual Sitemap listed in the index file.
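Auto-discovery can be sketched as a scan of robots.txt for `Sitemap:` lines. This is an illustrative sketch, not the parser any particular engine uses; it relies only on the properties stated above (the directive is independent of any User-agent section).

```python
def find_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt body.

    The Sitemap directive is independent of User-agent sections,
    so every matching line in the file counts.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the "http://" in the
        # URL value is left intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

robots = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.org/sitemap.xml"""
print(find_sitemaps(robots))  # ['http://www.example.org/sitemap.xml']
```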
URLs in the Sitemap need to be completely specified
You need to include the protocol (for instance, http) in your URL. You also need to include a trailing slash in your URL if your web server requires one. For example, http://www.example.org/ is a valid URL for a Sitemap, whereas www.example.org is not.
Sitemap URL
It is strongly recommended that you place your Sitemap at the root directory of your HTML server; that is, place it at http://example.org/sitemap.xml.
In some situations, you may want to produce different Sitemaps for different paths on your site — e.g., if security permissions in your organization compartmentalize write access to different directories.
We assume that if you have the permission to upload http://example.org/path/sitemap.xml, you also have permission to report metadata under http://example.org/path/.
All URLs listed in the Sitemap must reside on the same host as the Sitemap. For instance, if the Sitemap is located at http://www.example.org/sitemap.xml, it can't include URLs from http://subdomain.example.org. If the Sitemap is located at http://www.example.org/myfolder/sitemap.xml, it can't include URLs from http://www.example.org. [3].
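Both rules, the fully qualified URL requirement and the same-host restriction, can be checked mechanically. Here is a sketch using Python's `urllib.parse`; the function name `url_allowed` is illustrative.

```python
from urllib.parse import urlparse

def url_allowed(sitemap_url, candidate):
    """Check that candidate is fully qualified, on the Sitemap's host,
    and not above the Sitemap's own directory."""
    s, c = urlparse(sitemap_url), urlparse(candidate)
    if not c.scheme or not c.netloc:
        return False  # e.g. "www.example.org" lacks a protocol
    if (c.scheme, c.netloc) != (s.scheme, s.netloc):
        return False  # different host, e.g. a subdomain
    # The candidate must live under the directory holding the Sitemap.
    directory = s.path.rsplit("/", 1)[0] + "/"
    return c.path.startswith(directory)

sitemap = "http://www.example.org/sitemap.xml"
print(url_allowed(sitemap, "http://www.example.org/page.html"))  # True
print(url_allowed(sitemap, "http://subdomain.example.org/"))     # False
print(url_allowed(sitemap, "www.example.org"))                   # False
```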
Sitemap Limits
Sitemap files are limited to 50,000 URLs and 10 megabytes per Sitemap. Sitemaps can be compressed using gzip, reducing bandwidth consumption. Multiple Sitemap files are supported, with a Sitemap index file serving as an entry point for up to 1,000 Sitemaps.
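Compression needs no special tooling; a sketch with Python's standard `gzip` module (the file name `sitemap.xml.gz` and the body are illustrative):

```python
import gzip

# A minimal sitemap body; compressing it with gzip reduces the
# bandwidth needed to serve it to crawlers.
xml_body = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '   <url><loc>http://www.example.org/</loc></url>\n'
    '</urlset>\n'
)
# "wt" opens the gzip stream in text mode so the UTF-8 encoding
# requirement is handled on write.
with gzip.open("sitemap.xml.gz", "wt", encoding="utf-8") as f:
    f.write(xml_body)
```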
As with all XML files, any data values (including URLs) must use entity escape codes for the characters ampersand (&), single quote ('), double quote ("), less than (<) and greater than (>).
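The escaping can be delegated to the standard library rather than done by hand. A sketch using `xml.sax.saxutils.escape`, which handles &, < and > by default; the two quote characters are passed explicitly, and the helper name `escape_url` is illustrative:

```python
from xml.sax.saxutils import escape

def escape_url(value):
    """Entity-escape &, <, >, ' and " for inclusion in a Sitemap."""
    # escape() covers & < > on its own; the extra entities dict
    # adds the single- and double-quote replacements.
    return escape(value, {"'": "&apos;", '"': "&quot;"})

print(escape_url("http://www.example.org/view?widget=3&count>2"))
# http://www.example.org/view?widget=3&amp;count&gt;2
```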
Google's tools
Google provides a Python tool to generate the XML file based on the Sitemap Protocol.[1] It can work from server logs, a web directory, or a list of URLs. The program can be scheduled to run via cron or Windows Task Scheduler; during execution it notifies Google that the Sitemap has changed, so that a download of that Sitemap can be scheduled.
Notes
- ^ M.L. Nelson, J.A. Smith, del Campo, H. Van de Sompel, X. Liu (2006). "Efficient, Automated Web Resource Harvesting". WIDM'06.
- ^ O. Brandman, J. Cho, H. Garcia-Molina and N. Shivakumar (2000). "Crawler-friendly web servers". ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 2.
- ^ FAQ of sitemapwriter.com
External links
- Official page
- Third party programs & websites listed on code.google.com
- MySitemaps.org automatic sitemaps.xml generator