Development and Design

What is the best Character-Set and Collation for WordPress Database

Written by Sakthi Tharan

Like most content management systems, WordPress depends on a database to manage the content of the blog or site. The character set and collation define how data is stored in the database and how it is compared and sorted when it is read. Character encoding is the way characters are stored as bytes. There was no default character set or collation for the WordPress database when I started using WordPress in 2007. Today the default character set is already set in the wp-config-sample.php file when you download the WordPress files. The default character set has been ‘utf8’ since at least WordPress version 3.0 (to my knowledge).
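To see which values a site actually declares, the config file can be inspected. Below is a minimal Python sketch, assuming the settings appear as plain `define(...)` calls for the `DB_CHARSET` and `DB_COLLATE` constants; the function name `read_db_encoding` and the sample text are made up for illustration, and the pattern does not skip commented-out lines.

```python
import re

# Matches calls such as: define( 'DB_CHARSET', 'utf8' );
# Assumption: the values are simple quoted strings on a define() call.
_DEFINE_RE = re.compile(
    r"""define\(\s*['"](DB_CHARSET|DB_COLLATE)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)


def read_db_encoding(config_text: str) -> dict:
    """Return the DB_CHARSET / DB_COLLATE values found in the config text."""
    return {name: value for name, value in _DEFINE_RE.findall(config_text)}


# Hypothetical excerpt of a WordPress config file.
sample = """
define( 'DB_CHARSET', 'utf8' );
define( 'DB_COLLATE', '' );
"""

settings = read_db_encoding(sample)
print(settings)
```

An empty `DB_COLLATE` value means no collation is named in the config, so the database falls back to the default collation of the chosen character set.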

So, is it the best character set? No.

According to Gary Pendergast’s post, WordPress started to adopt the ‘utf8mb4’ character set from version 4.2. So if you installed WordPress after the 4.2 release, you don’t have to worry about this, because the tables created by WordPress will have ‘utf8mb4’ as their character set. Unfortunately, sites installed before that version may have a different character set, which may not play well with WordPress code.

Those who have ‘utf8mb4’ should use either ‘utf8mb4_unicode_ci’ or ‘utf8mb4_general_ci’ as the collation. StackOverflow user Thomas Rutter has given a clear view of the difference between these two collations. According to him, ‘utf8mb4_unicode_ci’ is slightly slower than ‘utf8mb4_general_ci’. On the other hand, that is the price we have to pay for accuracy. The ‘unicode’ type collation sorts and compares text more accurately across many language scripts. Support for modern smileys comes from the ‘utf8mb4’ character set itself.

When I started blogging with WordPress, the PHP and MySQL versions had no support for utf8mb4. Since then my WordPress database has gone through many upgrades, but sadly none to utf8mb4. To date, many of our sites use ‘utf8’ as the default character set. The default collation was ‘latin1_swedish_ci’ for all the tables and columns. Due to some technical issue, the tables’ collation couldn’t be changed from a ‘latin1..’ collation to a ‘utf8mb4..’ collation. The utf8mb4 character set can also need more bytes to store a character: up to four, against up to three for utf8 and one for latin1. So I left it for later.
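The byte counts are easy to check. Here is a small Python sketch, under the assumption that MySQL’s ‘utf8’ stores at most three bytes per character and ‘utf8mb4’ at most four; the helper names `utf8_length` and `fits_mysql_utf8` are invented for this example.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes (the 'utf8' limit)."""
    return all(utf8_length(ch) <= 3 for ch in text)


# Sample characters written as escapes so the file encoding does not matter.
samples = {
    "a": "Latin letter",
    "\u00e9": "accented Latin letter",
    "\u0ba4": "Tamil letter",
    "\U0001F600": "emoji",
}

for ch, label in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s), fits 'utf8': {fits_mysql_utf8(ch)}")
```

Only characters outside the three-byte range, such as the emoji here, take the fourth byte, so ordinary Latin or Indian-script text should occupy the same number of bytes under either character set.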

For now, we have changed the collation of every table, and of every column in each table, to ‘utf8_unicode_ci’.

‘utf8_general_ci’ is the best collation for almost all types of WordPress database. Then why did we go for ‘utf8_unicode_ci’?

We need character support for local Indian languages, which ‘utf8_general_ci’ does not handle well. ‘utf8_unicode_ci’ handles Indian language scripts well, and most others too. One caveat: most modern smileys are four-byte characters, which the ‘utf8’ character set cannot store natively whatever the collation; that needs ‘utf8mb4’. So if your default character set is ‘utf8’, use the unicode version of the collation. It may be a bit slower in operations like sorting, but I believe it is still worth it for the accuracy.

I used David Winterbottom’s code from StackOverflow to convert the collation of all my tables to ‘utf8_unicode_ci’ through phpMyAdmin. Later I found a blog post that does the same with a PHP file. Remember that this is a risky process, a dark art you will regret if it goes wrong, and there is a good chance of that. So always take a backup first.
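To give an idea of what such a conversion involves, here is a minimal Python sketch that only builds the SQL text, one `ALTER TABLE ... CONVERT TO CHARACTER SET` statement per table. It is not the StackOverflow code or the blog post’s method; the function name `build_convert_statements` is hypothetical and the table names are just examples.

```python
def build_convert_statements(tables, charset="utf8", collation="utf8_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for table in tables:
        # Backticks quote the identifier; a backtick inside a name is doubled.
        quoted = "`" + table.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


# Example table names only; list the tables of your own database instead.
for sql in build_convert_statements(["wp_posts", "wp_options"]):
    print(sql)
```

The sketch prints the statements instead of running them, so they can be reviewed and tried on a copy of the database before anything touches the live site.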

After reading many experts’ articles and recommendations, I have settled on what I consider the best character set and collation settings for WordPress. But I can’t vouch for the two conversion methods mentioned above. If you are new to this, I suggest going to a professional.

About the author

Sakthi Tharan

Alpha Geek | Blogger by interest | Former Web Developer & Designer | Research Scholar | Likes to share what he learned.