Tuesday 14 June 2011

On Google PageRank and SEO

Content aggregators are bad. They simply repackage original work and, using SEO/SEM techniques, manage to push their copies above the originals.

A friend had a problem where his original post enjoyed a brief appearance on the first pages of Google results and then got picked up by an aggregator/scraper. It went downhill from there.

https://www.sumo.gr/2011/06/01/ta-10-kalytera-nhsia-toy-kosmoy/

So here is some free (as in free beer) advice on SEO:

The hyperlinks of the articles should ideally have the form of
* https://www.sumo.gr/{date}/{article-title}.html
The .html extension is important: it essentially tricks Googlebot into thinking you are serving a static page (as opposed to a dynamic one), and static-looking pages get a boost.
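To illustrate the contrast (the dynamic URL below is hypothetical, not sumo.gr's actual one):

<!-- looks dynamic; no boost: -->
<a href="https://www.sumo.gr/index.php?p=1234">Τα 10 καλύτερα νησιά του κόσμου</a>
<!-- looks static; preferred: -->
<a href="https://www.sumo.gr/2011/06/01/ta-10-kalytera-nhsia-toy-kosmoy.html">Τα 10 καλύτερα νησιά του κόσμου</a>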

* Even better, the link path and file name should be in the article's original language:
https://www.sumo.gr/2011/06/01/Τα-10-καλύτερα-νησιά-του-κόσμου.html
That way, a search for "τα καλύτερα νησιά" ("the best islands") correlates partially and directly with the address itself.
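Note that a non-ASCII path is actually transmitted percent-encoded as UTF-8; the browser (and Google's result page) simply displays the decoded form. The same link as it travels over the wire (encoding truncated here):

<a href="https://www.sumo.gr/2011/06/01/%CE%A4%CE%B1-10-%CE%BA%CE%B1%CE%BB%CF%8D%CF%84%CE%B5%CF%81%CE%B1-....html">Τα 10 καλύτερα νησιά του κόσμου</a>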

* IMG tags should always have ALT text! Ideally it should be normal prose, so that it too can be included in the index. The filename itself should normally be human-readable as well.
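A minimal example (both the filename and the ALT text here are made up):

<img src="bora-bora-lagoon.jpg" alt="The lagoon of Bora Bora seen from above" />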

* The page title must not include the site's name on each and every page/article; it should carry only the article's title.
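For example (the "avoid" variant is hypothetical):

<!-- avoid: -->
<title>sumo.gr - Τα 10 καλύτερα νησιά του κόσμου</title>
<!-- prefer: -->
<title>Τα 10 καλύτερα νησιά του κόσμου</title>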

* The <meta name="keywords" /> is very bad. So much so that I believe it reads as spam to the bot. Keep it to 10-15 keywords (per language if you insist, although only the original language matters) with no repeats. Right now the keyword "hawaii" alone appears 5 times, and the title appears something like 7-8 times!
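Something closer to this would be saner (the keyword list is purely illustrative):

<meta name="keywords" content="νησιά, ταξίδια, διακοπές, hawaii, bora bora, maldives, santorini" />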

* The <meta name="description" /> is a little long. Keep it close to 140-150 characters (it is currently 167; Google keeps 160, Bing 150).
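For example (placeholder text; the point is the length):

<meta name="description" content="Τα 10 καλύτερα νησιά του κόσμου: a unique one-sentence summary of the article, kept to roughly 150 characters." />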

Here I should note that the meta description/keywords allegedly are not used in the ranking algorithm (since 2009), but that doesn't mean they can't get flagged as spam.

* Content-Encoding: gzip. 'nuff said. In general the site suffers from a performance perspective, and that lowers its PageSpeed score.
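For reference, the relevant pair of headers (compression happens only when both sides agree):

Accept-Encoding: gzip,deflate      <- what the browser offers
Content-Encoding: gzip             <- what the server should answer with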

* Caching:
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: ...1981
Really? This does not influence PageRank directly, but it does hurt PageSpeed, and *that* influences PageRank.
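A saner alternative, assuming the pages can safely be cached for an hour (the exact lifetime is a judgment call):

Cache-Control: public, max-age=3600
Expires: Tue, 14 Jun 2011 13:00:00 GMT    <- example: a date in the future instead of 1981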

In fact, the latest trend in Google search is giving direct preference to page speed.

* Passing the W3C validator may or may not be important:
http://validator.w3.org/check?uri=https%3A%2F%2Fwww.sumo.gr%2F2011%2F06%2F01%2Fta-10-kalytera-nhsia-toy-kosmoy%2F&charset=%28detect+automatically%29&doctype=Inline&group=0

Ignoring the various pseudo-attributes for Facebook and metadata such as rel (which are generally not considered "breaking"), there are actual HTML errors. The most important one I could detect is that double quotes are not escaped. Take for example the text «Το πρώτο νέο "σπίτι" της ανθρωπότητας» ("Humanity's first new 'home'"). If this is placed in a title attribute (and it is), the whole element breaks, disrupting the bot's work:
<a href="..." title="Το πρώτο νέο "σπίτι" της ανθρωπότητας"> is read as
<a href="..." title="Το πρώτο νέο " σπίτι" της ανθρωπότητας">, where «σπίτι» becomes a stray, invalid attribute.
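The fix is to escape the inner double quotes as &quot; so the attribute survives intact:

<a href="..." title="Το πρώτο νέο &quot;σπίτι&quot; της ανθρωπότητας">...</a>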

Also, "Stray end tag a." is not good; bots expect quality HTML.
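That is, a closing tag with no matching opening one, e.g. (a made-up fragment):

<p>some text</p></a>  <!-- this </a> opens nowhere, so the validator flags it -->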

Generally, the aggregator's W3C validation is much, much better, and this may boost them instead of you.

This is definitely not an exhaustive list, just some things I noticed at a quick glance.
