I recently started to encounter MySQL’s dreaded illegal-mix-of-collations error. The occasional query would result in Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='.
MySQL’s default character set is latin1 with an adventurous Swedish collation, which presumably seemed like a good idea at the time. However Rails uses UTF-8 everywhere and from time to time the two collations collide.
Fortunately this problem is not new and clever people have already solved it: see Getting out of MySQL Character Set Hell and Fixing a MySQL Character Encoding Mismatch.
Those two articles advise more or less the same process. Here I’m just jotting down some notes where I found something confusing or ambiguous in case I ever have to do this again.
After exporting the data and hacking the dump file, I wasn’t sure whether to configure MySQL for UTF-8 and then import the data, or the other way round.
I chose to import the data and then configure MySQL to avoid double-encoding the data. It seems to have worked. It silently truncated all my accented text but I was able to fix that.
First figure out where you stand. In MySQL do this:
mysql> show variables like 'char%';
The result you want is this:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
At this stage several of these values will be latin1.
Similarly for collation:
mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
The utf8_general_ci collation will do but ideally we want utf8_unicode_ci. The former basically ignores all accents: it treats ‘ΓΌ’ the same as ‘u’. The latter pays attention to accents; it’s a tiny bit slower but more accurate.
To see what your tables are using:
mysql> show table status \G
…and look at the Collation field in each table.
Here’s the command I used:
mysqldump -uUSER -pPASSWORD --quick --single-transaction --create-options --skip-set-charset --default-character-set=latin1 DATABASE > dump.sql
…where USER, PASSWORD, and DATABASE should be replaced appropriately.
My SQL dump was only 40MB so I was able to edit the tables' character sets in Vim (set :syn off for speed). In Vim:
:%s/DEFAULT CHARSET=latin1;/DEFAULT CHARSET=utf8;/
I added this configuration to /etc/mysql/my.cnf (on Ubuntu 10.04):
[client]:
default-character-set=utf8
[mysqld]
# MySQL 5.5.3+
character-set-server=utf8
collation-server=utf8_general_ci
# MySQL <5.5.3
#default-character-set=utf8
#default-collation=utf8_general_ci
The default-character-set and default-collation options were replaced in MySQL 5.5.3 by character-set-server and collation-server.
After I had done all this and restarted mysql, I found all the character_set_* variables were correct except character_set_database. To fix this:
mysql> alter database DATABASE default character set utf8;
mysql> alter database DATABASE default collate utf8_general_ci;
Verify with show create database DATABASE. (Documentation.)
In the MySQL console I ran these:
mysql> show variables like 'char%';
mysql> show variables like 'collation%';
mysql> show create database DATABASE;
mysql> show table status \G
And then in the Rails console:
>> ActiveRecord::Base.connection.collation
"utf_general_ci"