This video continues where Part I left off. Here I talk about managing the limited Google crawl budget (by instructing it to ignore boring pages), preserving accrued SEO juice when users edit their own content on your website (or when you fix typos), getting into the Google Images game, dealing with shadowed content when two users create pages with the same name, and automated tools to audit your on-page SEO.
Transcribed by Rugo Obi
There’s some debate about whether or not you need to explicitly tell Google and other search engines where your content is.
But it's been my experience that Google doesn't find stuff, at least on a relatively large website like mine, unless you tell it explicitly where that content is, using a sitemap.
My sitemap is relatively complicated because I deal with multiple domain names.
But in general, my sitemap is responsible for two things.
1) For listing all the pages - specifically for going through all the relevant database objects and then generating the relevant pages out of them.
2) And, until relatively recently (there's a bug with this bit at the moment), submitting image sitemaps so that I get listed in Google Images as well.
Something else I like to do is to add a lastmod attribute to all the entries in my sitemap. This is kind of a nod towards Google having limited resources: it tells Google when it should bother coming back and checking the website again.
Google has a limited amount of crawl capacity it gives each website. And at least in the past, mine was around 12k even though I had about 20k pages submitted at the time. I've since reduced the number of pages submitted and also told Google that most of my stuff doesn't need to be recrawled, and now the whole website has been indexed.
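For readers who want something concrete, a simplified sketch of this kind of setup - assuming the sitemap_generator gem and a hypothetical Product model with updated_at and image_url attributes - might look like this (the real version handles multiple domains and is more involved):

    # config/sitemap.rb - a minimal sketch assuming the sitemap_generator gem.
    # Model and attribute names (Product, updated_at, image_url) are illustrative.
    SitemapGenerator::Sitemap.default_host = "https://www.example.com"

    SitemapGenerator::Sitemap.create do
      Product.find_each do |product|
        add product_path(product),
            lastmod: product.updated_at,         # hints when Google should bother recrawling
            images: [{ loc: product.image_url }] # image sitemap entries for Google Images
      end
    end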
The code for generating that sitemap is relatively slow, so I run it on a schedule - every 10 minutes, I believe. I use a scheduling process, which you're looking at here, to coordinate that. And here you see that it runs the task every 10 minutes; when we go into that task - this is a Rakefile - you can see that the sitemap refresh no-ping task gets run.
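The scheduler itself isn't named in the video, but a cron-style equivalent - sketched here with the whenever gem's DSL, and assuming the task is sitemap_generator's sitemap:refresh:no_ping - would be:

    # config/schedule.rb - a sketch using the whenever gem; the scheduler shown
    # in the video is a different tool, and the task name is an assumption.
    every 10.minutes do
      rake "sitemap:refresh:no_ping" # regenerate the sitemap without pinging search engines
    end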
In case you’re curious, this is how the Google Search for images works. You can see that I get these really nice looking images for each of my products, and then you can visit the website via them. And this is another way to generate traffic through Google.
One of the things that helped me get fully indexed in Google was to stop submitting low-value pages that didn't really change or didn't really have any useful keywords on them.
For example, nested within every single product page, I have a separate question page. This is very old-fashioned; you'd probably use a JavaScript popup now. And I essentially had as many of these pages as I had products in my database.
And these pages were extremely uninteresting and caused Google to think my website wasn't worth indexing.
The parallel I have to that in the code is a custom function, seo_noindex, which you can see in action over here. Basically it adds a meta tag with name: "robots", content: "noindex" to the page.
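The helper isn't shown in full, but a minimal Rails-style sketch of a seo_noindex helper along those lines could be:

    # app/helpers/seo_helper.rb - a minimal sketch; the real helper may differ.
    module SeoHelper
      # Call seo_noindex in a view to mark that page as not worth indexing.
      def seo_noindex
        content_for(:head) { tag.meta(name: "robots", content: "noindex") }
      end
    end

    # The layout then needs a `<%= yield :head %>` inside its <head> element.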
A great way to figure out what you need to do next, at least for your on-site SEO is to use the Lighthouse audit report provided in Google Chrome.
Here I'm just selecting the SEO report. It's going to take a while to run a test and then it's going to give me a score along with a list of action items.
My score is 92. I've run through this before; as you can see, I'm missing alt attributes on my images, and fixing that might help things out.
Here we can see all the past audits - the things I've done right, like having a title tag, a meta description, a success status code... all that sort of stuff.
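If you prefer the command line to Chrome's DevTools, the same audit can be run with the Lighthouse CLI (the URL here is just a placeholder):

    # Run only the SEO category of the Lighthouse audit and open the report.
    npx lighthouse https://www.example.com --only-categories=seo --view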
How User Edits Can Foil SEO
Next, I'm going to talk about how changes in user-generated content, such as edits or deletes, can have an effect on your SEO.
Let's start by taking a look at the most popular sample URLs from my website in the last week. Here we see one that ends with the slug structuring-a-plea-in-mitigation.
Next, let's look at the production database and see what that looks like from this perspective.
So I'm going to find a notes file that matches that particular slug. Just pasting it in. And here we can see that the ORM has grabbed that data and wrapped it in an object.
You notice that it has a matching slug field as well. And this slug was generated from the original data file name, structuring-a-plea-in-mitigation.docx.
That NotesFile data comes from authors uploading notes to the website. But because there's a self-serve area, authors were able to delete their notes or to edit them.
There's nothing I can do about deletions, but for edits at least, I do want to preserve the old slugs that performed well or that have accrued search-engine juice.
There's nothing I can do to stop an author changing the title of something. But I can stop them from changing the URL, and therefore stop myself from throwing away SEO juice.
I do this by making the URL permanent, using a permalink, i.e. a non-changing slug that stays the same despite changes to the title or whatever.
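Since the slugs on the site are FriendlyId slugs (as you'll see later in this video), one way to get that permalink behaviour - a sketch, not necessarily my exact code - is to stop FriendlyId from regenerating the slug once it exists:

    # A sketch of a permalink-style slug with FriendlyId: the slug is generated once
    # (e.g. from the uploaded file's name) and never changes afterwards.
    class NotesFile < ApplicationRecord
      extend FriendlyId
      friendly_id :name, use: :slugged # :name is an assumed attribute

      # Only generate a slug when there isn't one yet, so later edits to the
      # name leave the URL - and its accrued SEO juice - untouched.
      def should_generate_new_friendly_id?
        slug.blank?
      end
    end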
Another issue with websites containing user-generated content is that sometimes users name their files in ways that are good for SEO, like "introduction to land law" here or "leases". But other times, they name them in really non-descriptive and useless ways, like "outline one" here, or "outline two.doc" in this case.
Something I generated for the website was an admin area where someone can go in and look at the particular notes on sale and then figure out the best name for it from an SEO perspective.
This affects all the SEO data. From the meta description, to the title, to the permalink.
Here you can see me opening up this “outline one.doc” and using a document viewer piece of JavaScript and then I can rename that according to what’s inside. That looks like the “Sherman Act” to me so I'm going to rename it along those lines. That's going to be a lot better for SEO.
Another problem that may affect you on a large website containing lots of user-generated content is shadowing of entities.
Here I have two different notes files, each of which has the same name and slug, but they belong to different products: the first one belongs to Australian criminal law and the second one belongs to Irish criminal law.
In reality, I have some sort of uniqueness validation and database constraints preventing this situation. But that wasn't always the case with the website.
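That validation and constraint would look roughly like this (table and column names assumed):

    # Model-level guard - a sketch: only one notes file may use a given slug.
    class NotesFile < ApplicationRecord
      validates :slug, presence: true, uniqueness: true
    end

    # Plus a matching database constraint, so concurrent writes can't sneak
    # a duplicate past the model validation.
    class AddUniqueIndexToNotesFilesSlug < ActiveRecord::Migration[6.0]
      def change
        add_index :notes_files, :slug, unique: true
      end
    end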
Next, let's take a look at a simplified version of the router.
This corresponds to the /notes_files URL; it takes a :slug parameter and then runs this inline function, finding a NotesFile by the slug and rendering the show template with that particular notes file.
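In code terms, that simplified router boils down to something like this Sinatra-style sketch (the real router differs in its details):

    # A sketch of the simplified route: the notes file is looked up purely by slug.
    get "/notes_files/:slug" do
      notes_file = NotesFile.find_by!(slug: params[:slug]) # raises if there's no match
      erb :show, locals: { notes_file: notes_file }
    end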
The issue here, as you can see, is that the two notes files in the right-hand pane have the same slug but belong to different products, so they correspond to different entities. And the one created more recently is going to shadow the one created earlier and cause all sorts of confusion.
To make things more concrete for you, I've created two web pages on my site. One for Australian criminal law and one for Irish criminal law.
These are the two we had in Vim just there. And then you're going to see a link to that notes file - here, criminal law notes.
And the one here, which is Australian, is going to link to the same one as the Irish one, which we can see down here.
Preserving SEO Redirect Juice
Over a period of years, your website is going to change its URL structure quite a bit. Either because programmers on the team modify the URLs or more likely because user-generated content changes.
This is something that you should address in your SEO strategy.
For example, I have this particular product here, 276, BPTC civil litigation. Its current slug is bptc-law-civil-litigation, and you can see that right there.
But I have this method, slugs, which looks at a history of the slugs I used to have.
So you can see here that I have two slugs, each of them a FriendlyId slug item. The first one is what you see presently, but there's an old one which, if you look at this second-to-last line, was created on Sun 20 September 2015, so it's about five years old. And this one is bptc-law-bptc-civil-ligitation, so there are two problems with it:
"bptc" is there twice and "litigation" is misspelt.
Of course I generated - I stupidly generated - a couple of links towards that page, and I wanted to preserve them. So I have in place an automatic system to ensure that my website still responds with content for that particular URL.
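That slugs method and the automatic fallback to old URLs line up with FriendlyId's history module, so the setup presumably looks something like this sketch (assumed, not copied from the codebase):

    # A sketch assuming FriendlyId's :history module, which records old slugs in a
    # friendly_id_slugs table and lets lookups match them as well as the current one.
    class Product < ApplicationRecord
      extend FriendlyId
      friendly_id :name, use: [:slugged, :history] # :name is an assumed attribute
    end

    # Lookup that matches the current slug *or* any historical slug, so the old
    # misspelt URL still responds with content:
    product = Product.friendly.find("bptc-law-bptc-civil-ligitation")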
And let's confirm that this works with curl.
I'm going to curl the current slug along with its full URL and you can see at the top here response 200 OK.
Now I'm going to modify that previous command and use the second slug, the one with the mistakes.
When I curl that, I also get a 200 OK, which is exactly what I want. Well no, it's NOT exactly what I want. What I'd want, exactly, is a 3xx: a redirect. Unfortunately that isn't supported - I only realized that at the time - but it's a feature I'd love to have.
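For what it's worth, the redirect I'm wishing for can usually be bolted on by hand in the controller - a hedged sketch of how that might look, not something that exists on the site today:

    # If the request came in on an old slug, issue a 301 to the canonical URL
    # instead of serving the same content at two addresses.
    def show
      @product = Product.friendly.find(params[:id])
      if request.path != product_path(@product)
        redirect_to product_path(@product), status: :moved_permanently
      end
    end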
Just to check that everything is working alright, I'm going to curl a slug that doesn't exist in my database - the same thing with an s at the end. When I do that, I get a 404 Not Found, as expected.
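Condensed into reusable shell commands, those three checks look like this (domain and path are placeholders):

    # Print only the HTTP status code for each URL.
    curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/products/bptc-law-civil-litigation       # current slug -> 200
    curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/products/bptc-law-bptc-civil-ligitation  # old slug     -> 200 (ideally a redirect)
    curl -s -o /dev/null -w "%{http_code}\n" https://www.example.com/products/bptc-law-bptc-civil-ligitations # missing slug -> 404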
Before we go, let's check how that continuous integration test went and whether or not I can deploy the change to the article micro-data.
You can see here that everything worked perfectly: RSpec finished its 316 examples. With that I deploy, and now I can use the Google structured data testing tool to see if my change worked. And now you can see the article has zero warnings.
Hopefully by the time you watch the next video, I'll actually be getting an articles micro snippet in Google.
See you next time.