How to avoid duplicate content in search engine promotion
The search engines hate finding two pages with the same or similar content. Unfortunately, modern blog software and content management systems tend to generate many different URLs for the same page. So what is a webmaster to do?
Shari Thurow
At the recent Search Engine Strategies conference in New York, Shari Thurow from Grantastic Designs, Inc. talked about how the search engines identify duplicate content.
The search engines do not like to see duplicate content, she said, because it slows down the retrieval process and degrades the search experience (you do not want to see several search engine result listings delivering the same text).
They do not agree on what constitutes duplicate content, however. Still, the concept includes more than exact replicas. They look at sequences of terms, with or without formatting differences.
Duplicate content filters
There are several types of duplicate content filters, i.e. systems that help search engines identify duplicate content, she said.
They do “boilerplate stripping”, i.e. they identify all elements that are not part of the page’s unique fingerprint, like headers, menus etc. This is all the stuff that is repeated on many or all pages of a site.
The search engines try to decide what is unique on the page (“the unique fingerprint”), for instance by looking for heavy HTML density (large amounts of regular text).
If the linkage properties are too similar, meaning that two pages have the same set of inbound and outbound links, we’re probably dealing with duplicate content. You may compare linkage properties in Yahoo Site Explorer.
The search engines look at content evolution, Thurow said. High quality sites usually do not change much over time. Normally less than 65% of web content changes on a weekly basis. On average 0.8% of web content will change completely every one to two weeks.
Among other tactics used by the search engines, Thurow mentioned host name resolution. If you’re switching servers, this may be an indication of you trying to spam.
She then talked about shingles or word sets. If the shingles are the same, but placed on different pages and the domain is the same, it’s duplicate content even if they’re sorted in different ways.
However, some kinds of duplicate content are spam, others are not.
Some are copyright infringement. Report these to the search engines.
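The shingle comparison Thurow described can be illustrated with a small sketch. The talk did not say how the engines size or score shingles, so the four-word shingle size and the Jaccard overlap used here are assumptions made for illustration only.

```python
import re

def shingles(text, size=4):
    """Return the set of word shingles (runs of `size` consecutive words)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def resemblance(text_a, text_b, size=4):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)

# The same two sentences in a different order still share most shingles.
page_a = "Red shoes for running on trails. Blue shoes for walking in town."
page_b = "Blue shoes for walking in town. Red shoes for running on trails."
score = resemblance(page_a, page_b)
```

Because shingles are compared as a set, reordering blocks of text only changes the few shingles that straddle the block boundaries, which is why sorting the same content differently does not hide it.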
How to solve the problem
If your CMS (content management system or blog software) is creating duplicate content, use the robot exclusion protocol and 301 redirects to lead the search engines to the correct pages, she said.
You may also use the nofollow attribute on links and include the relevant robots meta tags to keep the search engines from indexing a page (see the Pandia Search Engine Marketing 101 for more information).
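Before relying on robots exclusion rules it can help to check that they block the duplicate URLs and nothing else. The sketch below uses Python’s standard urllib.robotparser for that; the rules, the /print/ and /tag/ paths, the example.com URLs and the ExampleBot name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep spiders away from the print-friendly and
# tag-archive copies of posts, which duplicate the main post pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /tag/
"""

def is_crawlable(url, robots_txt=ROBOTS_TXT, user_agent="ExampleBot"):
    """Return True if the given rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

original_ok = is_crawlable("http://www.example.com/post-1")
duplicate_ok = is_crawlable("http://www.example.com/print/post-1")
```

Here the original post stays crawlable while the print copy is excluded.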
deMib
Mikkel deMib Svendsen said that there are many ways to create duplicate content, many of them unintended:
You may present identical pages with URLs with and without the www subdomain (www.pandia.com vs. pandia.com). The search engines are getting smarter, but some of them continue to think that these are two separate but identical pages. Pick one way to present pages: with or without the www.
It is always better to use direct (absolute) links in your HTML (i.e. http://www.pandia.com/goalgetter/) than indirect (relative) links (../goalgetter), as it ensures that the search engine spiders are not confused.
Identical pages may have different session IDs. He told the audience about a site where the search engine spider had tracked 200,000 versions of the same page!
To solve this problem you may use cookies to identify users, or you could try to identify the spiders and feed them non-ID URLs only.
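Both of the fixes above, picking one host name and keeping session IDs out of URLs, come down to mapping every variant of an address to a single form. Here is a minimal sketch of that mapping. It assumes the www version is the chosen host, and the session parameter names are guesses that will differ from site to site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

CANONICAL_HOST = "www.pandia.com"   # assumption: the www form was picked
ALIAS_HOSTS = {"pandia.com"}        # other host names serving the same pages
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # assumed parameter names

def canonical_url(url):
    """Return `url` on the canonical host with session parameters removed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    netloc = CANONICAL_HOST if host in ALIAS_HOSTS else parts.netloc
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, netloc, parts.path,
                       urlencode(kept), parts.fragment))

clean = canonical_url("http://pandia.com/goalgetter/?sid=abc123")
```

A server could 301 any request whose URL differs from its canonical_url() result, or use the function when generating the links it shows to spiders.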
URL rewriting
URL rewriting is another problem.
Wordpress and other blog software will have a default URL with a question mark and number (e.g. http://www.demib.dk/?p=287).
However, you may also ask Wordpress to generate a permalink following a regular URL format (see the Wordpress generated URL of the page you are reading now or http://www.demib.dk/google-noodp-287.html).
The solution is to 301 the non-official version to the official URL.
There are Wordpress plug-ins that may help you do this, he said.
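The idea behind such a redirect can be sketched in a few lines. This is not how any particular plug-in works: the lookup table below is hypothetical and holds only the one post from the example above, where a real blog would read the permalink from its database.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup table: post ID -> official permalink path.
PERMALINKS = {287: "/google-noodp-287.html"}

def permalink_redirect(url):
    """Return (301, official_url) for a default '?p=ID' URL, else None."""
    parts = urlsplit(url)
    post_ids = parse_qs(parts.query).get("p", [])
    if parts.path in ("", "/") and post_ids and post_ids[0].isdecimal():
        path = PERMALINKS.get(int(post_ids[0]))
        if path:
            return 301, f"{parts.scheme}://{parts.netloc}{path}"
    return None  # already official, or an unknown post

redirect = permalink_redirect("http://www.demib.dk/?p=287")
```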
Then there are “many-to-ones” in forums. He used an example from the Search Engine Watch Forums where http://forums.searchenginewatch.com/showthread.php?t=933 and http://forums.searchenginewatch.com/showthread.php?t=9331&page=1&pp=20 resolve to the same page.
In this case they have used a POST form, which the search engines do not submit. Also, the “Rate This” form on this forum is only available to logged-in users. But both URLs are accessible by all (including spiders) if someone links to them.
So how do you cope?
Mikkel says that you should check for major search engine bots and 301 redirect all URLs with the “pp=” parameter or “page=1” value to the default URL for the page.
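The rule Mikkel describes can be written as a small function that works out the default URL for a thread. The sketch leaves out the bot check he mentions and only shows the URL rewriting; the function name is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def default_thread_url(url):
    """Return the default URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Drop every "pp" parameter and any "page" parameter whose value is 1.
    kept = [(key, value) for key, value in params
            if key != "pp" and not (key == "page" and value == "1")]
    if kept == params:
        return None  # nothing to strip, so this is already the default URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

target = default_thread_url(
    "http://forums.searchenginewatch.com/showthread.php?t=9331&page=1&pp=20")
```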
Breadcrumbs
Breadcrumb navigation can also be an issue. He gave an example of one identical page being placed in two different subdirectories.
Products -> Shoes -> Running Shoes -> Adidas
www.domain.com/shoes/running-shoes/adidas.html
Products -> Sports Equipment -> Shoes -> Adidas
www.domain.com/sports-equipment/shoes/adidas.html
Do not put multiple types of URL in the breadcrumbs!
“There are infinite ways to create multiple URLs to a single page,” Mikkel concluded. “Whatever you do, don’t leave it to the engines to deal with!”
Anne Kennedy
Anne Kennedy from Beyond Ink drew attention to the Google Webmaster Guidelines which say that you should not create multiple pages, subdomains, or domains with substantially duplicate content.
In the same way Yahoo! argues that you should not create multiple sites offering the same content.
International versions
Various international versions of the same site should be OK. Google identifies users by IP. But the US is one region and Google aims at returning one result for a set of content.
You need one canonical domain, and all internal links on the site should point to it. Then use the robots.txt file to exclude landing pages set up for tracking from the search engines.
[As far as we know, different language versions are not considered duplicates. The Editor]
Web feeds
You may use RSS feeds to help Google identify the right URL. Embed links to your original content in your feed to help Google identify the source. She recommended trying Feedburner for this purpose.
Use 301 redirects to point all your domains to a single site. Test to make sure it’s working: Your home page should display under only one domain.
Use 302 redirects only for content that is going to change and only for temporary content.
Beyond Ink has a “301 Redirect How-to’s & Code” available at www.beyondink.com/301redirect, which tells you how to perform 301 redirects in Apache .htaccess, IIS, PHP, ASP and ColdFusion.
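The redirect test suggested above can be partly automated. The sketch below is one possible approach and does not come from the Beyond Ink guide: it requests a home page without following redirects, then checks that the answer is a 301 pointing at the canonical host. The host names in the usage note are placeholders.

```python
import http.client
from urllib.parse import urlsplit

def fetch_status_and_location(host, path="/"):
    """Request `path` on `host` and return (status, Location header).

    http.client does not follow redirects, so a 301 is reported as such.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def redirects_to(status, location, canonical_host):
    """True if the response is a 301 whose Location is on `canonical_host`."""
    if status != 301 or not location:
        return False
    return (urlsplit(location).hostname or "").lower() == canonical_host.lower()
```

To check an alias domain, call redirects_to(*fetch_status_and_location("example.com"), "www.example.com"). A 302 or a 200 on the alias means the domains are not yet consolidated.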
She then presented several online tools that might help you identify duplicate content as it is indexed by the search engines, including the Google Webmaster Central toolbox and the Yahoo URL status form.
See also How to Remedy Duplicate Content and Magical % Thinking (Stuntdubl)
See also the rest of Pandia’s NY Search Engine Strategies coverage.