I've been following the discussion about Google and mirrored
information for some time. It is "common knowledge" that
Google penalizes page rank when it determines that content
is duplicated somewhere else. In fact, I've read many experts
stating that there should be no duplicate domain names and
no duplicate content anywhere.
On the face of it the arguments appear to be sound. Google
obviously has several billion pages in its database and
could, it appears, easily determine if content is duplicated.
It also seems, again on the face of it, that it's reasonable
to check for duplicate content, as this is the "mark of a
spammer" and not necessary on the web with hyperlinking
available. At least, this is the common wisdom.
However, sometimes what seems reasonable and possible is
not: not by a long shot.
Let's begin with the technical side of things. You've got
domain x and domain y with exactly the same content. How on
earth would Google be able to figure that out? Let's say
Google had 3 billion pages in its database. To compare every
page to every other page would be an enormous task - roughly
4.5 quintillion comparisons.
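The scale of a naive every-page-to-every-page comparison can be checked with a quick calculation. The sketch below counts the distinct pairs among 3 billion pages; the comparison rate is an assumption chosen purely for illustration and is not a figure from any search engine.

```python
def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n pages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


PAGES = 3_000_000_000  # index size used in the text

# Assumption for illustration only: one million comparisons per second.
COMPARISONS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

pairs = pair_count(PAGES)
years = pairs / COMPARISONS_PER_SECOND / SECONDS_PER_YEAR

print(f"{pairs:,} pairwise comparisons")
print(f"about {years:,.0f} years at {COMPARISONS_PER_SECOND:,} comparisons per second")
```

The pair count is on the order of 10^18, and at the assumed rate the job would run for well over a hundred thousand years.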
Now, if site x had page "page1" which linked to site y which
also had "page1", then it would be possible for Google to
determine the duplicate content. Conceivably, it could check
this out.
Not only is the task enormous, but the benefit is so tiny as
to be insignificant. Duplicate content does not in any way,
shape or form imply spamming. In actual fact, a duplicate
site is generally going to lower the page rank of BOTH sites.
Instead of having 100 links to one site, there will
presumably be 50 links to one and 50 to another. This would
tend (all things being equal) to lower the page ranking of
both sites. So Google gains nothing by this incredible
expenditure of resources.
There are several reasons for duplicate content which have
nothing to do with spamming. Sometimes the content is actually
duplicated, and sometimes it's just that there are several
different domains (at least the www and non-www versions)
for the same website:
Mirroring a site for load balancing - This is very common.
The purpose is to split up the traffic between two copies
of the site.
Mirroring for region - Sometimes
site mirroring is done simply to make it more efficient
on the internet backbone itself. You might put an identical
copy of a site in Europe, for example, to reduce traffic
across the Atlantic, which should make it faster in European
countries.
Viral marketing - It's extremely
common to allow other sites to republish articles in return
for a link.
Different domain names - Sometimes
a site might be referenced on many different domain names.
You might want to allow the .com, .net and .org versions
of the name to all work the same, you might allow for common
misspellings or you might cover different keywords (sewing-tips
and sewing-secrets are examples of possible combinations).
Different domain names for different markets - You might
also want to reference your site by different names in order
to target different markets.
You could, for example, have a site about search engine
optimization and want to target both SEO and web designers.
Thus domain names like seo.com and webdesign.com would make
sense.
www - Any good webmaster
knows his or her site needs to be referenced with and without
the www.
Okay, so what's the smart thing to do? Well, it is possible
that search engines do compare a limited number of pages to
check for duplication. They could certainly check if someone
reported something, and they might check directly linked
pages (although this is still a heck of a lot of overhead
for very little benefit).
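Checking only directly linked pages needs one comparison per link rather than one per pair of pages. The sketch below shows that limited check under stated assumptions: the `pages` and `links` structures, the URLs, and the whitespace-and-case normalization are all hypothetical and are not a description of how any search engine works.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not matter."""
    return " ".join(text.split()).lower()


def linked_duplicates(pages: dict[str, str], links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (source, target) links whose two pages carry the same text."""
    duplicates = []
    for source, target in links:
        # Skip links that point at pages we have no copy of.
        if source not in pages or target not in pages:
            continue
        if normalize(pages[source]) == normalize(pages[target]):
            duplicates.append((source, target))
    return duplicates


# Hypothetical sample data: site x links to two pages on site y.
pages = {
    "x.example/page1": "Ten tips for  sewing",
    "y.example/page1": "ten tips for sewing",
    "y.example/page2": "Something else entirely",
}
links = [
    ("x.example/page1", "y.example/page1"),
    ("x.example/page1", "y.example/page2"),
    ("x.example/page1", "z.example/missing"),
]

print(linked_duplicates(pages, links))
```

The work grows with the number of links examined, not with the square of the number of pages, which is why this narrower check is at least conceivable.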
Of course, Google and the other search engines can account
for a hefty percentage of the traffic received by a site. In
fact, sometimes the number can exceed 70 percent. So it's
wise to spend some time ensuring that you are totally clean
when it comes to search engine optimization. In other words,
a technician from any search engine should be able to examine
your site down to its smallest detail and find no evidence
of any kind of search engine spamming (attempting to get
higher rankings by unethical means). This is absolutely
critical to a site's survival for the long term.
Keeping that in mind, here's what I tend to do.
Multiple domains - Pointing multiple domains at the same
site has a tremendous number of advantages. Thus, I tend to
follow the advice given by others: take advantage of
permanent redirection. In other words, set up a redirection
(a 301 status code) which simply tells the browser "this
page has moved, proceed to this page, and the move is
permanent." This tells the spider about the redirection with
no possibility of misunderstanding, yet allows for the
multiple domains.
Republished articles - I
allow others to republish many of my articles, and at this
time I have records of over 10,000 of them all over the
internet on thousands of web sites. This is not a problem,
as these articles are sent in text format. The webmaster
must then drop this text into his site, which requires some
reformatting and shuffling around. Thus, the finished articles
may have the same text but the formatting is very, very
different. This is a highly respected method of gaining
a large number of incoming links: I give you something (an
article, i.e., content) and you give me something (a link
back to my site).
Mirroring - I haven't needed
to do this yet, so I have no advice as to what to do if
a site requires actual, physical multiple versions of itself.
I would tend to just do it overtly (out in the open) and
not worry about it.
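The permanent redirection described under "Multiple domains" can be sketched with Python's standard `http.server` module. This is a minimal illustration, not a production setup: the canonical host `www.example.com`, the port, the `http` scheme, and the names `redirect_target`, `RedirectHandler` and `serve` are all assumptions made for the example, and in practice the redirect is usually configured in the web server itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the one hostname that should serve content directly.
CANONICAL_HOST = "www.example.com"


def redirect_target(host_header: str, path: str):
    """Return the canonical URL for an alias host, or None if already canonical."""
    # Drop any port suffix; IPv6 literals are not handled in this sketch.
    hostname = host_header.split(":")[0].lower()
    if hostname == CANONICAL_HOST:
        return None
    return f"http://{CANONICAL_HOST}{path}"


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.headers.get("Host", ""), self.path)
        if target is None:
            # Canonical host: serve the page normally.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(b"canonical site content\n")
        else:
            # Alias host: 301 tells browsers and spiders the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()


def serve(port: int = 8080):
    """Start the demo server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` starts the demo. A request whose `Host` header is `example.com` or `example.net` receives a 301 with a `Location` pointing at the same path on the canonical host, while a request for `www.example.com` gets the page itself.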