My previous post on the issue of MySQL character sets addressed the problem when the data in the database has been stored incorrectly by the encoding of the connection. This time, I’d like to cover what to do when the connections are okay, but the table definitions are wrong.
Normally, this isn’t a problem, by the “rubbish in, same rubbish out” principal, whereby even though the database doesn’t understand the data as long as goes in and comes out the same there will not be a problem. This has worked for me in the past, but it turns out it doesn’t work when you start dealing with Cyrillic and symbolic languages such as Russian and Chinese. (I have yet to discover exactly why, but I suspect it has something to do with the range of the bits in UTF-8 encoding.)
Firstly, we’re going to create a new database to hold the properly formated data, so enter the mysql console and create a new database with the following command:
CREATE DATABASE new_database CHARACTER SET utf8;
Unfortunately, MySQL does not allow us to rename a database, so the new name must be permanent and all the permissions that use the database name will need to be modified. The old and new databases for this miniguide are distinguished as old_database and new_database respectively. There are hacks to get around the renaming issue, and solutions to this can be found elsewhere. Phpmyadmin also makes renaming pretty straight forward by copying the database.
Back on the command line, we need to copy and modify the table definitions to use the new character set. At this point, you might want to stop your website to ensure nothing else gets written, and in this instance so that the autonumber fields do not loose track. Dump just the definitions of the original database as follows:
mysqldump --no-data -u root -p old_database > /tmp/tables.sql
The resulting file must be opened, and everything that says “latin1” must be converted to “utf8”. (Vim search and replace might help with this.) Also, you might want to remove any of the COLLATE commands, as they will default to something reasonable if the character set is correct.
With the tables corrected, we need to dump just the data, including the default character set of course.
mysqldump --no-create-info --default-character-set=utf8 -u root -p old_database > /tmp/data.sql
Lets now dump this back into the new database:
cat /tmp/tables.sql | mysql -u root -p new_database
cat /tmp/data.sql | mysql -u root -p new_database
Thats it! The new database should running completely using utf-8 and you shouldn’t have corrupted any of the data that was already in use. Let me know if you find this useful or come up with a problem!