SEO and Internationalisation

Separate language from territory; clarify your international site structure to Google

This is part of my Entreprenerd: Marketing for Programmers book, which is currently available to read for free online.

So you’ve seized the market in your homeland and now it’s time to build your empire abroad. Enter the need for internationalisation. This concept, alternatively known as regionalisation, encompasses all the changes you’ll need to make in order to serve customers in your newest regions. This includes translating text into other languages, displaying prices in other currencies (taking into account foreign tax rules), switching time zones, selectively showing or hiding content depending on regional relevance, and installing new payment and delivery options.

Nearly all of these internationalisation modifications will change the public surface area of your website, which means that they will also impact your website’s SEO. As we have seen in earlier chapters, Google dislikes duplicate or near-duplicate content. This can be a problem, because internationalisation efforts such as minor regional modifications run the risk of triggering these duplicate-content alarms and thereby thwarting SEO efforts.

This chapter helps you steer through these straits.

Distinguish Language from Territory in Architecture

Let’s start by disentangling two concepts in internationalisation that tend to get lumped together: language and territory.

Think about this: It is possible for a website to be focused on the geographic region of, say, the United States in terms of product range, delivery possibilities, and time zones, but nonetheless want to serve this territory through various languages (e.g., in English and in Spanish, the two main languages spoken in the US).

In a similar vein, it’s also not difficult to imagine a website whose English language pages could span many geographic regions, serving various English-speaking countries like Ireland and New Zealand—and also serving expat contingents resident in other countries (e.g., in Germany).

I bring up this distinction because a website architecture ought to let you modify regional and linguistic content independently of one another, instead of conflating the two concepts. If one were to categorically say that “this is our American site” or “this is our German site”, this would be a failure to appreciate the more subtle demands of going global.
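To make the separation concrete, here is a minimal Python sketch of one way to model it. Everything in it (the `Locale` class, the `TERRITORY_SETTINGS` and `TRANSLATIONS` tables, and the sample values) is a hypothetical illustration rather than a prescribed design; the point is only that language and territory live in separate fields and are looked up separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locale:
    """A visitor-facing variant: language and territory are separate fields."""
    language: str                    # ISO 639-1 code, e.g. "en", "es"
    territory: Optional[str] = None  # ISO 3166-1 alpha-2 code, e.g. "US"; None = any region


# Hypothetical per-territory settings: what is sold, shipped, and charged.
TERRITORY_SETTINGS = {
    "US": {"currency": "USD", "time_zone": "America/New_York"},
    "DE": {"currency": "EUR", "time_zone": "Europe/Berlin"},
}

# Hypothetical per-language strings: what the visitor reads.
TRANSLATIONS = {
    "en": {"checkout": "Checkout"},
    "es": {"checkout": "Finalizar compra"},
    "de": {"checkout": "Zur Kasse"},
}


def render_checkout_label(locale: Locale) -> str:
    """Combine a language-driven label with a territory-driven currency."""
    label = TRANSLATIONS[locale.language]["checkout"]
    currency = TERRITORY_SETTINGS[locale.territory]["currency"] if locale.territory else ""
    return f"{label} ({currency})" if currency else label
```

With this shape, `Locale("es", "US")` gives Spanish text with American prices, while `Locale("en", "DE")` serves English-speaking expats in Germany, and neither case requires a new “site”.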

Choose a Fitting Structure for Internationalisation

After deciding that it’s time to turn a previously one-country-only website into a more cosmopolitan affair, the biggest question from the SEO perspective is that of how to structure the URLs to accommodate all this newly internationalised content.

One temptation might be to retain exactly the same URL structure as you had pre-internationalisation, and instead use cookies or JavaScript to display translated versions of the text. With this setup, the URL “example.com/about-us” would be served to both German and American visitors, yet readers in each of these regions would view differing content specialised for their particular regions.

Don’t do this. Google strongly advises against it, specifying that regionalised content ought to appear on separate URLs. Search engines expect that each URL has an unchanging identity; they are unprepared to probe a page by tweaking language and region variables, so internationalisation efforts that deploy these technologies inadvertently have the effect of hiding the internationalised parts of the website from search engines, which is not what you want.

Google’s condemnation also makes sense from the user experience perspective:

  1. Why should we expect a German speaker to understand the URL “/about-us” when the equivalent in their language is “/ueber-uns”? This sort of presumption would amount to linguistic colonialism.

  2. The user experience of automatic internationalisation is maddeningly frustrating. What if an American visitor borrows his German friend’s laptop and wishes to place an order in English rather than in German? Is he compelled to fiddle with browser language settings and rummage for some other kind of hack just to reach the English content? Surely it would be a better user experience if the internationalisation were obvious from the URL structure alone, enabling the visitor to switch to their preferred regionalisation without the muddle (e.g., by changing “de.example.com” to “en.example.com”).

Now that we’ve decided that it’s better to place internationalised content on separate URLs, which particular URL structure is best suited for these purposes? Let’s go through the various options one by one and evaluate them:

  1. URL parameters: “example.com/?country=fr”, “example.com/?country=de”. We’re going to throw this one out immediately because Google specifically advises not to internationalise in this way.^1-9-1 Their primary reason is based on difficulties with URL segmentation (i.e., it is challenging for high-speed computer programs to reliably discern the different countries by scanning the URL parameters; the data just isn’t as structured as it is with some of the other options below). And just as this structure is challenging for Google’s high-speed indexers, so too will it be a nuisance for you (e.g., when working with segmentation in Google Analytics or other tools that work with URLs). I can also see a third flaw with this approach: country is one of the most commonly used GET parameters for on-site searches or filters (e.g., a filter to show the book retailers whose country=Spain). Dedicating this commonly used parameter to internationalisation risks clobbering your other URLs. ’Nuff said. Next please!

  2. Country-specific domains: “example.fr”, “example.de”, “example.co.uk”. This approach is, without question, crystal clear to visitors and search engines alike, and it also has the most professional air to it. That said, I would advise small website owners to steer clear of this approach based on the following arguments: Firstly, country-specific domains are heavy on administrative and other “busy work”. Many countries, such as Australia, require you to present their authorities with a case for why you should have some particular domain, a process that could necessitate annoyances such as registering a trademark or business in the region. Secondly, the availability of your desired local domain names is not guaranteed, either because of the presence of other businesses or because of opportunistic domain squatters who snap up variants of your domain and hold them to ransom. As soon as there is just one territory with a domain name unavailable to you, the technical and branding purity of this “one country-specific domain for each country” approach is irreversibly tarred because you will now need to resort to hacks and bandages in order to reach people in some countries. Thirdly, this approach is more expensive than the others, with costs accruing in terms of renewing domain names, SSL certs, and applying to governments for trademarks. Fourthly, this approach is inferior for SEO because it splits SEO juice across many domains instead of concentrating it into the one. Each of these domains will need to gather its own backlinks and reputation, which is hardly ideal.

  3. Subdomains: “de.example.com”, “fr.example.com”. This approach is easy to administer and cheap to set up. Additionally, the website’s internationalisation structure is crystal clear to the end-user. Google claims that subdomains are every bit as efficient as subdirectories for imparting SEO juice, but experiments by SEO experts show that there is still a significant penalty for subdomains over subdirectories (which we’ll detail next).^1-9-2 The crux of the issue is that Google sometimes views separate subdomains as separate websites, meaning that they don’t always pool the ranking power from all the subdomains together, which has the effect of dulling the SEO power. Subdomains are best avoided for this reason.

  4. Subdirectories: “example.com/fr/”, “example.com/de/”. This approach is easy and cheap to administer, since there is only one domain to buy and deal with. It is also the easiest to program because a website partitioned by folders is about as standard as it gets, and doesn’t necessitate complex fiddling with host files, SSL certs, or DNS configurations. Google’s SEO meister Matt Cutts has praised this approach over subdomains because internationalising with directories leaves open the possibility of hosting separate services on your various subdomains (“maps.google.com”, “mail.google.com”) instead of filling up those slots with internationalisation information.^1-9-3 And, more importantly, it’s seen to have the strongest SEO effect. In short, this is the approach for you.
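As a rough illustration of why the subdirectory option is simple to program against, here is a minimal Python sketch that peels the internationalisation folder off a request path. The `SUPPORTED_PREFIXES` set and the `split_locale_prefix` function are hypothetical names of my own, not part of any framework; most web frameworks offer an equivalent in their routing layer.

```python
from typing import Optional, Tuple

# Hypothetical set of locale prefixes this site serves as top-level folders.
SUPPORTED_PREFIXES = {"fr", "de", "en"}


def split_locale_prefix(path: str) -> Tuple[Optional[str], str]:
    """Split "/de/ueber-uns" into ("de", "/ueber-uns").

    Returns (None, path) unchanged when the first segment is not a known prefix.
    """
    segments = path.lstrip("/").split("/", 1)
    first = segments[0].lower()
    if first in SUPPORTED_PREFIXES:
        rest = "/" + segments[1] if len(segments) > 1 else "/"
        return first, rest
    return None, path
```

Because the prefix is an ordinary path segment, there is no DNS, certificate, or hosts-file work involved; the rest of the application can route on the remaining path as it did before internationalisation.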

Communicate Your Structure to Google

It is not enough to just internationalise your website; you also have to clarify to search engines what your internationalisation structure is. This serves two rather important purposes: Firstly, signalling your internationalisation structure helps steer you clear of Google’s penalties for duplicate content. Secondly, it helps ensure that regional searchers see appropriate content from your website (e.g., you don’t want googlers based in Italy to see search results for your UK website, since this mismatch would scare off organic traffic).

The hreflang attribute

This is a special piece of data added to parts of your website in order to inform Google that regionalised alternative versions of content exist.

There are three different ways of specifying this hreflang attribute: within HTTP headers, within the HTML head, and within your sitemaps. Having these three possibilities open to you doesn’t, however, mean that you should use any more than one of these formats—besides this being redundant, this sort of repetition raises the potential for errors and inconsistencies.
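To show how the same information looks in each of the three places, here is a small Python sketch that renders one set of alternates in all three formats. The function names and the idea of passing a dictionary of `{hreflang code: URL}` are my own assumptions for illustration; the output shapes follow the formats Google documents for the HTML head, the `Link` HTTP header, and sitemap `xhtml:link` entries.

```python
from xml.sax.saxutils import quoteattr


def as_html_head(alternates):
    """<link> elements for the HTML <head>."""
    return "\n".join(
        f'<link rel="alternate" hreflang={quoteattr(code)} href={quoteattr(url)} />'
        for code, url in alternates.items()
    )


def as_http_header(alternates):
    """A single Link response header, useful for non-HTML files such as PDFs."""
    return "Link: " + ", ".join(
        f'<{url}>; rel="alternate"; hreflang="{code}"'
        for code, url in alternates.items()
    )


def as_sitemap_entries(alternates):
    """<xhtml:link> children to place inside a sitemap <url> element."""
    return "\n".join(
        f'<xhtml:link rel="alternate" hreflang={quoteattr(code)} href={quoteattr(url)} />'
        for code, url in alternates.items()
    )
```

In practice you would call only one of these per website, for the reason given above: keeping a single source of truth avoids contradictory annotations.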

The hreflang attribute does not need to be set for every page of your website. Instead, it is only necessary for pages that appear in two or more regionalised alternatives. Let me illustrate this idea by referring to an imaginary pharmacy website. This pharmacy sells the drug paracetamol in all three territories it serves (France, the UK, and Russia), and the language, currency, and delivery options vary depending on the territory in which a visitor resides. Because there are three regional variations of this paracetamol page, it makes abundant sense to attach a hreflang attribute to it.

But the pharmacy also stocks other drugs—for example, controversial narcotic painkillers that are illegal in France and the UK. These medicines only appear on the Russian website. Because there is no mirrored content in any other region, there is no need for a hreflang attribute to be specified in this case—indeed, doing so would send Googlebot down a dead end.
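The rule of thumb in the pharmacy example can be sketched in a few lines of Python. The `PAGES` catalogue, its URLs, and the choice of hreflang codes are all invented for illustration; the only logic that matters is that a page with fewer than two regional variants gets no annotations at all.

```python
# Hypothetical catalogue: page key -> {hreflang code: URL} for each regional variant.
PAGES = {
    "paracetamol": {
        "fr-FR": "https://example.com/fr/paracetamol",
        "en-GB": "https://example.com/uk/paracetamol",
        "ru-RU": "https://example.com/ru/paracetamol",
    },
    "restricted-painkiller": {
        "ru-RU": "https://example.com/ru/restricted-painkiller",
    },
}


def hreflang_alternates(page_key, pages=PAGES):
    """Return the variants to annotate, or {} when the page exists in one region only."""
    variants = pages.get(page_key, {})
    return dict(variants) if len(variants) >= 2 else {}
```

Here `hreflang_alternates("paracetamol")` returns all three variants, whereas the Russia-only page returns an empty dictionary and so produces no hreflang output.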

Let’s dig a little into the specific format of the hreflang attribute. Here’s what you would expect to see on a page that has three international variations:

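```html
<!-- This page is example.com/en/about-us, where the programming team chose to put the hreflang attribute in the HTML head. -->
<link rel="alternate" hreflang="de" href="https://example.com/de/ueber-uns" />
<link rel="alternate" hreflang="en-GB" href="https://example.com/en/about-us" />
<link rel="alternate" hreflang="en-DE" href="https://example.com/de/about-us" />
```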

Let me point out a few potential gotchas:

1. The value of the hreflang attribute can contain either one or two components, with the second component separated from the first with a hyphen. The first entry is always the language (in ISO 639-1 format), and the (optional) second component is the geographic region (in ISO 3166-1 Alpha 2 format).

Running through the example from top to bottom, we have:

  • de: German language content, independent of geographic region

  • en-GB: English language content, for visitors in the United Kingdom (the region code is GB)

  • en-DE: English language content, for visitors in Germany

Now time for the pop quiz: Does hreflang=“be” specify content for Belgium? The answer is “no”! That’s because the hreflang format places the language first and “be” targets the Belarusian language. To target French speakers in Belgium, you would need “fr-be”.

2. Every page must identify all internationalised variations, including itself. As you’ll notice above, the second entry pointed back to the URL that was currently being accessed.

3. If the hreflang entries in page A link to a regional variation in page B, then the hreflang entries in page B must also link back to page A. Google calls these “confirmation links” and warns that annotations that are missing these pointers back may be ignored. The upshot of this is that the exact same block of hreflang links you saw in the above example (for the URL “example.com/en/about-us”) must also appear on the URLs “example.com/de/ueber-uns” and “example.com/de/about-us”. Web applications that have built-in fragment caching systems could cache and share these snippets across the related pages instead of regenerating them afresh every time, thereby bolstering performance.
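The three gotchas lend themselves to an automated check. What follows is a minimal Python sketch, not an official validator: `parse_hreflang` and `find_annotation_problems` are hypothetical helpers, and the pattern is deliberately simplified to the two-part form discussed here.

```python
import re

# Language first (two letters), then an optional hyphen plus two-letter region.
# Deliberately simplified: real hreflang values also allow script subtags and "x-default".
HREFLANG_PATTERN = re.compile(r"^([A-Za-z]{2})(?:-([A-Za-z]{2}))?$")


def parse_hreflang(value):
    """Split "fr-be" into ("fr", "BE"); a lone "be" is a language, not Belgium."""
    match = HREFLANG_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unsupported hreflang value: {value!r}")
    language, region = match.groups()
    return language.lower(), region.upper() if region else None


def find_annotation_problems(annotations):
    """annotations maps each page URL to its {hreflang: target URL} block.

    Returns a sorted list of human-readable problems:
    pages that omit themselves, and links with no link back.
    """
    problems = []
    for page, block in annotations.items():
        targets = set(block.values())
        if page not in targets:
            problems.append(f"{page} does not list itself")
        for target in targets - {page}:
            if page not in annotations.get(target, {}).values():
                problems.append(f"{target} does not link back to {page}")
    return sorted(problems)
```

Feeding this the hreflang blocks scraped from each regional variant of a page should return an empty list when every variant carries the same complete set of links; anything else points to a missing self-reference or a missing confirmation link.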

Viewing your internationalised Google search results

It’s possible to check whether your hreflang efforts were successfully understood by Google through their Search Console. After logging in, navigate to Search Traffic > International Targeting, and you should see a readout if everything is in order.

Because Google Search remembers and detects your region and preferred language, it can be difficult to see how your website presents itself in searches originating from other regions or done through other languages.

Luckily, there’s a helpful tool for doing this: Google’s Ad Preview.^1-9-4 Ostensibly designed for viewing adverts in faraway lands, this tool also displays regionalised organic search results—which is very handy for evaluating internationalisation measures.

