So when I loaded my data into MySQL the other day, MySQL silently truncated every piece of text at the first accented character. This affected about 13,000 records. The truncation was bad enough, though I suppose MySQL just couldn’t figure out what the byte stream meant at that point. But to do it silently?
Not cool.
Why did MySQL truncate my text? I have no idea.
Move to PostgreSQL.
I couldn’t find the affected records in the new database because the truncated data was truly lost. How do you query what’s not there?
So I loaded my data dump into a fresh development database. Strangely this worked despite having gone pear-shaped in production.
I then looked for all texty fields in the database containing a French or German accent (based on my customers' languages) and generated a SQL script with their values. The accent query below just wouldn’t work until I used the binary keyword.
Here’s the Rake task I wrote:
task :fix_accents => :environment do
accents = %w[ ß ä ë ö ü é à è â ô ]
ActiveRecord::Base.connection.tables.
reject { |name| %w[delayed_jobs schema_migrations].include? name }.
each do |table|
klass = table.classify.constantize
texty_columns = klass.columns_hash.
select { |name,col| col.type == :string || col.type == :text }.
map { |name,col| name }.
each do |name|
accents.each do |accent|
klass.select("id, `#{name}`").where("binary `#{name}` like '%#{accent}%'").each do |row|
puts "UPDATE #{table} SET `#{name}` = #{row.send(name).inspect} WHERE id = #{row.id};"
end
end
end
end
end
I piped the output into a new file, tested a few key lines on the production database, then piped the whole thing into the production database. Problem solved.
I’m still moving to PostreSQL though.