8 Ways To Ensure Search Engines Can Crawl Your Website

Matt McGee

Thursday, July 28th, 2011

In the physical world, storeowners put a lot of time and energy into where products are placed and how customers flow through the sales floor. In grocery stores, for example, the most profitable products are often placed at eye-level. In clothing stores, the most attractive and desirable items are often displayed in store windows or right inside the entry to entice people to come inside.

Similar ideas come into play with your website.

But rather than placing products or content where visitors will most easily find them—which is very important, of course—I’m talking about setting up your website to ensure that search engine spiders can find your products, services, and all the great content you’ve published.

It’s called crawlability—a concept that doesn’t get as much attention as content and links, but it’s no less important where SEO is concerned. Why? Because Web pages that can’t be crawled and indexed will never rank highly.

Here are eight ways to make sure search engine spiders have no trouble finding and indexing your Web pages:

1. Avoid Flash

Flash isn’t inherently bad. When used correctly, it can enhance a visitor’s experience. But your website shouldn’t be built entirely in Flash, nor should your site navigation be done only in Flash. Search engines have claimed for a couple years now that they’re better at crawling Flash, but it’s still not a substitute for good, crawlable site menus and content.
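If you do use Flash, one common workaround (sketched here with hypothetical file names and paths) is to put fallback HTML inside the embedding element, so spiders and Flash-less visitors still get crawlable content:

```html
<!-- The .swf plays for visitors with Flash; everything inside the
     element is fallback content that spiders can crawl. -->
<object type="application/x-shockwave-flash" data="intro.swf"
        width="600" height="400">
  <p>Welcome to Example Store. Browse our <a href="/products/">products</a>
     or read <a href="/about/">about us</a>.</p>
</object>
```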

2. Avoid AJAX

The same ideas mentioned above regarding Flash apply here to AJAX. It can add to your site’s user experience, but AJAX has historically not been visible to search engine crawlers. Google offers guidelines to help make AJAX-based content crawlable, but it’s complicated and the SEO “best practice” recommendations remain the same: Don’t put important content in AJAX.
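A common workaround is progressive enhancement: serve the content at a real URL behind a plain HTML link, then let your JavaScript intercept the click and load it via AJAX for human visitors. A sketch (the class name and URL here are hypothetical):

```html
<!-- Spiders follow the plain href to a real, crawlable page; your
     script can hijack the click and fetch the same content with AJAX. -->
<a href="/products/shoes" class="ajax-nav">Shoes</a>
```

Either way, the content lives at a URL a spider can reach on its own.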

3. Avoid complex JavaScript menus

JavaScript is another technology that search engines are getting better at crawling, but it’s still best avoided as the primary method of presenting site navigation. Back in 2007, Google explained:

While we are working to better understand JavaScript, your best bet for creating a site that's crawlable by Google and other search engines is to provide HTML links to your content.

That’s still the best practice today: Make sure your site navigation is presented in simple, easy-to-crawl HTML links.
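For illustration, a crawlable menu needs nothing fancier than plain anchor tags (the paths below are hypothetical):

```html
<ul id="nav">
  <li><a href="/">Home</a></li>
  <li><a href="/products/">Products</a></li>
  <li><a href="/about/">About</a></li>
  <li><a href="/contact/">Contact</a></li>
</ul>
```

You can still style and script a menu like this all you want; the links themselves stay visible to spiders.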

4. Avoid long dynamic URLs

A “dynamic URL” is most simply defined as one that has a “?” in it, like this illustrative example:

http://example.com/products.php?category=shoes

That’s a very simple dynamic URL and today’s search engines have no trouble crawling something like that. But when dynamic URLs get longer and more complicated, search engines may be less likely to crawl them (for a variety of reasons, one of which is that studies show searchers prefer short URLs). So, if your URLs look anything like this, you may have crawlability problems:

http://example.com/products.php?cat=2&subcat=17&color=red&size=10&sort=price&page=3&sessionid=A1B2C3D4E5
Google’s webmaster help page says it well: “…be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.”
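As a rough self-check, you can count a URL’s parameters with a few lines of Python’s standard library (a sketch; the URLs are made up):

```python
from urllib.parse import urlparse, parse_qs

def count_params(url):
    """Return the number of distinct query-string parameters in a URL."""
    return len(parse_qs(urlparse(url).query))

# A short, crawl-friendly dynamic URL:
print(count_params("http://example.com/products.php?category=shoes"))  # 1

# A long one that may cause crawl problems:
print(count_params(
    "http://example.com/p.php?cat=2&color=red&size=10&sort=price&page=3"))  # 5
```

There’s no magic cutoff, but the fewer parameters a spider has to chew through, the better.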

5. Avoid session IDs in URLs

This is an offshoot of the previous item, but it should be mentioned separately. Search engines don’t like to crawl and index URLs that have a session ID. Why? Because even though the session ID makes the URL different each time the spider visits, the actual content on the page is the same. If they indexed URLs with session IDs, there’d be a ton of duplicate content showing up in the search results.

If you look at the long URL I shared above, the last piece says something like

sessionid=A1B2C3D4E5

That’s a red flag to search engine spiders. Make sure your site doesn’t have session IDs in the URLs.
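If your platform appends a session ID as a query parameter, one mitigation (sketched here with a hypothetical parameter name) is to strip it from URLs before they’re emitted, or to redirect spiders to the clean URL:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_param(url, name):
    """Return url with the named query-string parameter removed."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != name]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(strip_param("http://example.com/p.php?id=7&sessionid=A1B2C3", "sessionid"))
# http://example.com/p.php?id=7
```

The real fix is usually a platform setting (store the session in a cookie instead), but a helper like this is handy for cleaning URLs in sitemaps or links.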

6. Avoid code bloat

By “code bloat,” I’m referring to situations where the code required to render your page is dramatically more substantial than the actual content of the page. In many cases, this is not something you’ll need to worry about—search engines have gotten better at dealing with pages that have heavy code and little content. Code bloat isn’t a problem until it’s a big problem…but it’s something website owners should be aware of.
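There’s no official threshold, but one rough way to eyeball a page’s code-to-content ratio is to compare its visible text against the total markup. A small sketch using Python’s standard library (it’s approximate—for instance, it also counts text inside script and style tags):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes from an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def content_ratio(html):
    """Fraction of the page's characters that are text rather than markup."""
    parser = TextExtractor()
    parser.feed(html)
    return len("".join(parser.chunks)) / len(html)

print(round(content_ratio("<div><p>Hello, world.</p></div>"), 2))  # 0.42
```

A very low ratio on a real page suggests the markup is doing a lot of work for very little content.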

7. Avoid robots.txt blocking

First, you’re not required to have a robots.txt file on your website; millions of websites are doing just fine without one. But if you use one (perhaps because you want to make sure your Admin or Members-only pages aren’t crawled), be careful not to completely block spiders from your entire website.

Under no circumstances should your robots.txt file have something like this:

User-agent: *
Disallow: /

That code blocks all spiders from accessing your site. If you ever have questions about using a robots.txt file, visit robotstxt.org.
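You can also sanity-check your own rules with Python’s built-in robots.txt parser (the rules and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Block only the admin area, not the whole site.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("*", "http://example.com/products.html"))  # True
print(rp.can_fetch("*", "http://example.com/admin/login"))    # False
```

If `can_fetch` comes back False for pages you want indexed, your rules are too broad.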

8. Avoid incorrect XML sitemaps

An XML sitemap lets you give a list of URLs to search engines for possible crawling and indexing. It’s not a replacement for correct on-site navigation, and it’s not a cure-all for situations where your website is difficult to crawl.

If implemented properly, an XML sitemap can help search engines become aware of content on your site that they may have missed. But, if implemented incorrectly, the XML sitemap might actually deter spiders from crawling.
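For reference, a minimal valid sitemap follows the sitemaps.org protocol—only the loc element is required for each URL (the address and date below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2011-07-28</lastmod>
  </url>
</urlset>
```

Malformed XML, wrong URLs, or a missing namespace are the usual ways a sitemap goes from helpful to harmful.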

If you’re curious, I’ve only once recommended that a client use XML sitemaps, and that was a website with upwards of 15 million pages. If you want to learn more about XML sitemaps, check out sitemaps.org.

If you take care of all the issues above, you can rest assured that you’ve made it as easy as possible for search engines to crawl and index your website.

About Matt McGee