SEO School #1: Duplicate Content
What happens if you repost your blog posts on other sites to attract more traffic?
Will it result in a Google penalty and a loss of traffic?
Could using a manufacturer’s product descriptions hurt rankings?
If you’ve been asking yourself these questions and wondered how content uniqueness affects your site’s search performance, this post is for you.
Today we’re going to discuss everything you need to know about duplicate content.
We’ll also learn some ways to prevent it.
What is Duplicate Content?
The simplest way to define duplicate content is as:
Content that appears on the Internet in more than one location.
This could be a significant block of text appearing on a number of pages on a single website.
Or the same content appearing on different websites.
Google defines duplicate content as:
“Substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”
In other words, the search engine would consider identical content on two sites or two pages on the same domain as duplicate.
Why is Duplicate Content an Issue?
For one, a number of identical pieces of content on the web make it difficult for search engines to discern:
- Which version is more relevant to a particular search query and,
- Which version to display in search listings.
When people start linking to and referencing different versions of the same content, search engines might also have trouble deciding where to direct link metrics such as trust, authority and link juice.
Should they attribute it to a page they’d consider the original content or distribute across all duplicates?
These problems might result in search engines:
- Displaying a clone in search results instead of the original version,
- Demoting the content altogether and not displaying any versions.
What Causes Duplicate Content?
It’s only natural to assume that duplicate content is a result of a deliberate action:
- Reposting content on different websites or domains,
- Using manufacturer’s product descriptions,
- Reusing significant parts of content across all pages on the domain (e.g. your product pages might feature extensive T&Cs covering pricing, returns and other policies. Depending on the size of this text, it might be considered duplicate content.)
But there are dozens of other reasons for duplicate content, many of which you might not have any control over.
Here are the most common ones:
Domain and Subdomain
Search engines see the bare address (http://mydomain.com) as a domain, and the URL containing the “www” as its subdomain.
As a result, they will crawl and index them as separate websites.
Needless to say, any content they contain will be considered a duplicate.
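To make the idea concrete, here is a minimal Python sketch of how you might map every URL onto a single host before comparing or linking to pages. The choice of the “www” version as the preferred host and the names PREFERRED_HOST, ALTERNATE_HOSTS and to_preferred_host are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: the "www" host is the one we want indexed.
PREFERRED_HOST = "www.mydomain.com"
# Hosts that serve the same content and should be folded into the preferred one.
ALTERNATE_HOSTS = {"mydomain.com"}


def to_preferred_host(url: str) -> str:
    """Return the same URL on the preferred host, or the URL unchanged."""
    parts = urlsplit(url)
    if parts.hostname in ALTERNATE_HOSTS:
        # Note: replacing netloc also drops any port or credentials in the URL.
        parts = parts._replace(netloc=PREFERRED_HOST)
    return urlunsplit(parts)


if __name__ == "__main__":
    print(to_preferred_host("http://mydomain.com/about"))
    print(to_preferred_host("http://www.mydomain.com/about"))
```

Both calls produce the same address, which is the point: once every link and sitemap entry goes through one host, there is only one version left to index.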
Multiple URLs pointing to the same page
This is a very common problem on ecommerce websites.
Variables a visitor uses to navigate through the site might generate unique URLs pointing to the same page.
Consider your category pages:
Depending on what you sell, visitors can filter and view products by various factors: size, color, brand etc. And every time they do, your website might generate a different URL.
The same goes for the print-ready version of the page, the mobile site and so on.
Take a look at these URLs:
- http://mystore.com/category/laptops/ – that would be your main category page,
- http://mystore.com/category/laptops/?sort=price – this URL is created when users view products ordered by price,
- http://mystore.com/category/laptops/?sort=brand – this one displays products sorted by brand,
- http://mystore.com/category/laptops/print – and this one shows the print-only version of the page.
Even though each of these URLs points to the same page, to a search engine they are unique. Google will index these web addresses as individual pages, even though they contain the same content.
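A short Python sketch shows how the four example URLs collapse into one page once presentation-only details are ignored. Treating “sort” as a presentation-only parameter and “print” as the print-view suffix is an assumption based on the examples above, and canonical_url is a hypothetical helper, not part of any SEO tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters change ordering only, not the content.
PRESENTATION_PARAMS = {"sort"}
# Assumption: the print-only view is served under a trailing "print" segment.
PRINT_SUFFIX = "print"


def canonical_url(url: str) -> str:
    """Reduce a category URL to the address of the main category page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    path = parts.path
    trimmed = path.rstrip("/")
    if trimmed.endswith("/" + PRINT_SUFFIX):
        # Cut the "print" segment but keep the slash in front of it.
        path = trimmed[: -len(PRINT_SUFFIX)]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))


if __name__ == "__main__":
    urls = [
        "http://mystore.com/category/laptops/",
        "http://mystore.com/category/laptops/?sort=price",
        "http://mystore.com/category/laptops/?sort=brand",
        "http://mystore.com/category/laptops/print",
    ]
    # Four crawlable addresses, one underlying page.
    print({canonical_url(u) for u in urls})
```

A search engine has no such rule to rely on unless you tell it, which is why each address ends up indexed separately.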
Similarly, your system might be automatically creating and assigning session IDs.
To track your visitors and let them add items to cart, for instance, you need to assign a “session” to each of them. A session is a brief history of what a particular visitor did on the site.
For that session to work, your website needs to create a unique identifier that will move along with the visitor as he or she clicks through your site to record their actions.
That identifier is often stored in the URL and passed from page to page. As a result, however, it creates a new URL and thus duplicate content.
Tracking code you add to URLs to monitor traffic coming from various marketing activities could also cause a duplicate content issue.
Take a look at these two URLs for instance:
Even though they seem the same, to a search engine they are two separate URLs.
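As a rough illustration, the Python sketch below strips session and tracking parameters so that such URLs compare as equal. The parameter names in SESSION_PARAMS, the “utm_” prefix and the sample URLs are assumptions chosen for the example; your own site may use different ones.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; check what your own platform appends to URLs.
SESSION_PARAMS = {"sessionid", "phpsessid", "sid"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Drop session and campaign-tracking parameters, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    plain = "http://mystore.com/category/laptops/"
    tagged = "http://mystore.com/category/laptops/?utm_source=newsletter&utm_medium=email"
    print(strip_tracking(tagged) == plain)
```

Parameters that do change the content, such as a page number, are left in place.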
Lastly, other websites might use your content too.
Their owners scrape the web searching for relevant articles and repost them on their site, often without your consent.
Since they rarely reference where the original came from, search engines are left to treat it as duplicate content.
Is duplicate content a serious issue?
That’s a difficult question to answer.
In this video, Matt Cutts states that duplicate content won’t hurt you, unless it’s spammy.
He also points out that Google recognizes certain types of content (e.g. terms and conditions or legal information) as likely to be duplicated across a domain for legal reasons and won’t take any action.
However, any content that’s duplicated to gain more ranking benefit, or that confuses a search engine, might cause some issues for your site’s rankings.
How to Overcome the Duplicate Content Issue
1. Use a Canonical Tag
We’ve posted quite a thorough guide to the canonical tag and I urge you to read the entire post.
But to recap:
A canonical tag is a piece of code you place in the <head> section of your web page’s HTML code. Its role is to tell a search engine which is the original version of content to display in search results.
This is what a canonical tag looks like:
<link rel="canonical" href="http://mystore.com/category/laptops/" />
The “rel” attribute tells the search engine that this in fact is a canonical tag.
The “href” attribute directs it to the URL to display in search.
This one line of code tells a search engine that regardless of which URL it found the content through, e.g. http://mystore.com/category/laptops/?sort=brand, the page listed in the canonical tag is the original source and should be displayed in search.
Most modern CMS systems automatically create a canonical tag for each page. However, if yours doesn’t, you may have to get your developer to put appropriate tags on each page.
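If you do end up generating the tag yourself, a template helper can be as small as the Python sketch below. The function name canonical_tag is hypothetical, and the sketch assumes you already know the canonical URL for the page being rendered.

```python
from html import escape


def canonical_tag(canonical_url: str) -> str:
    """Build the <link> element to place in the page's <head>."""
    # Escape the URL so characters such as & or " cannot break the attribute.
    href = escape(canonical_url, quote=True)
    return f'<link rel="canonical" href="{href}" />'


if __name__ == "__main__":
    print(canonical_tag("http://mystore.com/category/laptops/"))
```

Every variant of the page (sorted, filtered, print view) should output the same tag, pointing at the main URL.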
2. Set the Preferred Domain
To avoid confusing Google about which version of the domain to display in search, the main domain or the subdomain, set a preferred domain. It tells Google whether your site should be shown with or without the “www” in search results.
To set the preferred domain, go to Google Webmaster Tools, click on the little cog icon and choose Site Settings from the drop-down menu.
3. Use 301 Redirects
If you have pages duplicating content from another, original page on your domain, consider redirecting them using a 301 redirect.
This way, pages will no longer compete with one another in search results. They will also consolidate their authority and search strength, positively impacting rankings of the page you redirect them to.
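How you set up a 301 depends on your server or CMS, but the mechanics are simple: respond with status 301 and a Location header pointing at the original page. Here is a minimal sketch as a Python WSGI application; the paths in REDIRECTS are made up for the example, and a real site would more likely configure this in the web server itself.

```python
# Hypothetical map of duplicate paths to the original page they copy.
REDIRECTS = {
    "/laptops-old/": "/category/laptops/",
    "/category/notebooks/": "/category/laptops/",
}


def app(environ, start_response):
    """WSGI app that permanently redirects known duplicate paths."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells browsers and search engines the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"original page"]
```

To try it locally, you could serve app with the standard library, for example wsgiref.simple_server.make_server("", 8000, app).serve_forever(), and request one of the duplicate paths.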
4. Exclude Pages from Search
Lastly, if you want duplicate pages to remain live but not indexed by search engines, exclude them from the search index.
To do that, use the “noindex, follow” meta-robots tag.
It will still allow search bots to crawl the page and follow any links on it. The tag will, however, prevent the search engine from indexing the page itself.
To use this tag, simply add this line to the <head> section of the page’s HTML code:
<meta name="robots" content="noindex,follow">
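Once the tag is in place, it is worth checking that the page really outputs it. The Python sketch below reads the robots directives from a page's HTML using only the standard library. It assumes the directives sit in the content attribute of the robots meta tag, and the names RobotsMetaParser and robots_directives are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex,follow".
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html_text: str) -> set:
    """Return the set of robots directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
    print("noindex" in robots_directives(page))
```

If the check comes back empty for a page you meant to exclude, look for a typo in the attribute name or quotes first.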
Was this helpful?
Did this guide answer your questions about duplicate content? Is there anything we’ve missed? Ask us about it in the comments.
Creative commons image by Woodleywonderworks / Flickr.