You'd think this kind of thing wouldn't be a common problem.
The most basic thing SEO needs to accomplish is to get your pages listed somewhere in the search results. In fact, it seems so basic that you'd think an SEO hardly needs to worry about indexation.
Just get an XML sitemap up and you're done, right?
In fact, many of the pages you think are indexed may not actually be in Google's main index. If a site shows up in Webmaster Tools, I wouldn't necessarily call it fully indexed. If it's been crawled, that doesn't mean it's been indexed.
Even if you can find it in a Google search, I still wouldn't necessarily say it's in the main index.
Let me explain, and let's talk about how to prevent these problems.
Not All Indexes Are Created Equal
Here's something a lot of people don't realize or think about when it comes to Google.
Google operates 12 data centers around the world, 6 of them in the United States. If you think all of those data centers are perfectly synced and contain the exact same Google index...well, they don't. Just because the content on your site has been indexed, that doesn't mean it's accessible from Google's main search.
Even if it is, I wouldn't necessarily count on it being accessible from all 12 data centers.
It took two months after Panda was unleashed in the US for Google to roll it out in Europe. Two months is an eternity online. That's two months where the US and Europe had very different indexes. Just because you can find a page in Google, that doesn't mean others can.
Here's something else I've noticed, and that used to be a lot more common back when my personal site wasn't crawled very often.
If I performed a search for the title of the page, it would show up right at the top of the search results.
But if I turned around and copied some exact text from the blog post and pasted it into the search bar, guess what? That's right, no search results.
Another occurrence I've seen often? Sometimes you can find a page if you add "site:site.com" to your query, or even just "site.com," but not if you search for the exact title of the page, or exact text from the page.
Would you really call those pages indexed?
What to Do About It
Obviously, if you don't have an XML sitemap set up, that should be the first thing you take care of. If you're on WordPress, install the Google XML Sitemaps plugin. Make sure that your sitemap shows up under Crawl > Sitemaps in Google Webmaster Tools, and that it is always up to date with all of your new pages.
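If you aren't on WordPress, a sitemap is simple enough to generate yourself. Here is a minimal sketch in Python, assuming you already have a list of your page URLs and, optionally, their last-modified dates. The function name build_sitemap and the example.com URLs are placeholders of mine, not part of any plugin or tool mentioned above.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is a "YYYY-MM-DD" string, or None to leave the tag out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in pages:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    print(build_sitemap([
        ("https://example.com/", "2024-01-01"),
        ("https://example.com/blog/first-post", None),
    ]))
```

Regenerating this file whenever you publish a page is what keeps the sitemap up to date. Submitting it still only tells Google the pages exist, which is not the same as getting them indexed.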
Beyond that, there are tons of less obvious issues that can hurt crawling and indexing, meaning people won't find you in the search results:
- Links are very important for crawling, which is a prerequisite for indexing, but don't expect your pages to be indexed well just because you have a link, or even tons of links, pointing at your site.
Pages that nobody has heard of are rarely if ever crawled, and that means Google's robots aren't going to follow the links on them to your site. If your site isn't seeing any referral traffic at all from its links, you can't necessarily count on Google's robots arriving through them either.
- Few things help with indexing more than your internal link structure. Your homepage can be indexed or even rank very well; it doesn't mean all of your internal pages will get indexed.
Consider how many Blogger or WordPress.com blogs aren't indexed, despite the fact that they are sitting on a very heavily linked-to site.
In my experience, most SEOs place too much emphasis on a made-up metric called "domain authority." While it can be useful as a quick gauge of how popular a site is, Google doesn't generally evaluate sites; it evaluates pages.
If you want a page to be indexed and to rank well, it shouldn't be too many clicks away from the home page of your site.
- All too often, webmasters will have a robots meta tag with content="noindex" hidden on their pages without even knowing about it, due to some template they used or the work of an ignorant developer.
This directive tells Google not to index the page. Make sure it only exists on page duplicates, or other pages you don't want showing up in the search results for whatever reason.
- If you don't use the rel="canonical" attribute where you should, Google won't necessarily be able to handle the duplicate content on your site. It may end up indexing only a weak version of the page, or none of them at all.
- One of the most common problems is the overuse of the rel="nofollow" attribute in links. While Google probably uses these for indexing in some circumstances, these links are by their very nature supposed to be "irrelevant," and that likely means the crawlers don't follow them as often.
Do not use the nofollow attribute on any of your internal links. This can slow down or outright prevent some of your pages from being indexed.
- Don't assume that a page will stay indexed just because it's been indexed. If a page isn't crawled very frequently, Google may ultimately drop that page from the main index. While Google "remembers everything" internally, that doesn't mean it wants to show it to users. I'm not just talking about penalties, either. If a page is too far away from anything active, Google may consider it irrelevant and drop it.
I wouldn't go as far as saying that every page on your site should be seeing referral traffic from either external or internal links. I would say that if no page that does see that kind of traffic links to the page in question, Google might just drop it.
- If you try to hoard PageRank by sculpting your links to point to your "money pages," you can actually end up preventing indexation. It should be easy for users to navigate to any page on your site with hypertext links. If it isn't, Google may not index those pages.
- HTML errors can prevent pages from being fully indexed. While Google's spiders are a lot smarter than they used to be, it's not safe to assume that all of your content can be indexed properly just because you can see it in a browser.
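Several of the issues above (a stray noindex, a missing canonical, nofollow on internal links) can be spotted by reading a page's HTML. Here is a rough sketch using only Python's standard library. The class and function names are my own, the example.com URLs are placeholders, and the checks are a simplified reading of how these tags work, not a reproduction of what Google's crawler does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class IndexabilityAudit(HTMLParser):
    """Collects noindex, canonical, and internal nofollow signals from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.noindex = False
        self.canonical = None
        self.nofollow_internal = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; values may be None.
        a = {name: (value or "") for name, value in attrs}
        rel = a.get("rel", "").lower().split()
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in a.get("content", "").split(",")]
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in rel:
            self.canonical = urljoin(self.page_url, a.get("href", ""))
        elif tag == "a" and "href" in a and "nofollow" in rel:
            target = urljoin(self.page_url, a["href"])
            # Only links that stay on the same host count as internal here.
            if urlparse(target).netloc == self.host:
                self.nofollow_internal.append(target)


def audit(html, page_url):
    """Return a dict of indexing-related findings for one page's HTML."""
    parser = IndexabilityAudit(page_url)
    parser.feed(html)
    parser.close()
    return {
        "noindex": parser.noindex,
        "canonical": parser.canonical,
        "nofollow_internal": parser.nofollow_internal,
    }
```

Calling audit(html, "https://example.com/some-page") on the source of each page you care about gives you a quick list of pages that are telling Google to stay away. It only looks at the HTML, so it will miss a noindex sent in an HTTP header or added by JavaScript.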
So to wrap this up, there's more to staying indexed in Google than you might think, and you very well might be overestimating how many of your pages are actually indexed.
Ignore indexing at your own peril. Just because the traffic for your vanity keywords keeps going up, that doesn't mean you're seeing your site's true potential.
Image credit: Jan Krömer