Don’t worry too much about duplicate content, Google says

Google Blog tells webmasters how to cope with duplicate content.

The Official Google Webmaster Central Blog has an interesting post about duplicate content today. The main message is that you should not worry too much about it.

Duplicate content is a phenomenon with different sub-categories.

  • You may have two identical pages on your own site (a common thing on blogs, where pages representing a particular date are more or less identical to a regular post page)
  • You may have pages with content fetched from another web site (with or without the author’s permission)
  • You may have “fit for print” or “fit for mobile phones” pages with the same content as regular web pages.
  • Different URLs may point to the same page (www.pandia.com, www.pandia.com/index.html, pandia.com etc.)
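
The last case can be illustrated with a small sketch that maps URL variants onto one preferred form. This is only an illustration under assumed site conventions (that the "www." host is the preferred one and that index.html is the default document); the function name canonical_url is our own and does not describe how Google handles such URLs.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed site conventions (not from the Google post): the "www." host is
# preferred and "index.html" / "index.htm" are default directory documents.
DEFAULT_DOCUMENTS = {"index.html", "index.htm"}


def canonical_url(url: str, preferred_host: str = "www.pandia.com") -> str:
    """Map common variants of the same page onto one canonical URL."""
    # Accept scheme-less input such as "pandia.com".
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)

    # Treat the bare host and the "www." host as the same site.
    host = parts.netloc.lower()
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # Drop a trailing default document and never leave the path empty.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCUMENTS:
        path = head + "/"

    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


if __name__ == "__main__":
    for variant in ("www.pandia.com", "www.pandia.com/index.html", "pandia.com"):
        print(variant, "->", canonical_url(variant))
```

All three variants from the list end up as the same address, which is the kind of consistency you would want in your own internal links.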

Normally you would solve such problems by telling Google not to index some of the versions (using a noindex meta tag or the robots.txt file).
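
As a rough sketch of the robots.txt approach, the snippet below uses Python's standard urllib.robotparser module to check which URLs a set of rules would keep crawlers away from. The /print/ and /mobile/ paths and the example.com addresses are assumptions made for the example; your own site will use different paths. The meta tag alternative is a single line in the page head, <meta name="robots" content="noindex">.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /print/ and /mobile/ paths are assumptions
# for this example, not paths named in the Google post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /mobile/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/article.html",
        "http://www.example.com/print/article.html",
    ):
        status = "crawlable" if is_crawlable(url) else "blocked"
        print(url, "->", status)
```

Checking rules this way before publishing them helps avoid blocking the version you actually want listed.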

Google has ways of selecting versions

In essence, Adam Lasnik of Google says that you should not worry too much about duplicate content. Google has ways of identifying similar pages and will in most instances index only one of them:

This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list.

However, if Google believes that you are intentionally trying to influence your search engine ranking by use of duplicate content, Google may “punish” you by letting your pages slide down that slippery slope to oblivion:

In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved.

Help Google identify the original

Lasnik then goes on to give some useful advice on how you can help Google identify the correct page.

We won’t repeat that list here — you might as well read his own article. However, there are a few items that we found particularly interesting.

Many web authors have syndicated their content to other sites, mainly to generate PR (of both kinds) for their sites or their company. There are even sites out there that contain large libraries of articles that other sites can republish for free.

Syndicated content

“If you syndicate your content on other sites,” Lasnik says, “make sure they include a link back to the original article on each syndicated article. Even with that, note that we’ll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer.”

In other words: Google will select one version of the article to be presented in search results. This may not necessarily be the first version of the article, although you might help Google identify the “original” by including a link back to it on all versions of that article. If not, we guess Google will boost versions published by authority sites or the version they spidered first.
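
If you syndicate articles yourself, you could check that each republished copy really does link back to the original. The sketch below is one simple way to do that with Python's standard html.parser module; the names LinkCollector and links_back and the example.com address are our own, and the check only looks for an exact link to the original URL (ignoring a trailing slash). It says nothing about how Google itself evaluates such links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_back(html: str, original_url: str) -> bool:
    """Return True if the page contains a link to original_url."""
    collector = LinkCollector()
    collector.feed(html)
    # Compare without trailing slashes so /article and /article/ match.
    target = original_url.rstrip("/")
    return any(href.rstrip("/") == target for href in collector.hrefs)


if __name__ == "__main__":
    # Hypothetical syndicated copy; the URL is an assumption for the example.
    page = '<p>Originally published at <a href="http://www.example.com/article/">Example</a>.</p>'
    print(links_back(page, "http://www.example.com/article"))
```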

Boilerplate repetition

Lasnik also warns against “boilerplate repetition”:

For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.

We guess this will only be a problem if the repetitive text is longer than the original content of the page.

Scraper sites

It is also important to note that Google says it will not punish you if someone steals your content:

“Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google.”

Meaning, we suppose, that Google trusts its own ability to identify scraper sites. Well, not all of us do, but it is good to hear that Google will not let the existence of such pages harm the rankings of ordinary web sites.
