Encoding pitfalls when exporting and importing data

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the databases category.

Last Updated: 2024-12-03

My job was to transfer data from an sqlite3 database I found on the internet to a mysql database in our system.

Dumping an .sql file from sqlite3 and then importing it was my first thought, but I ruled this out upon learning there were syntax differences. Instead I exported a csv file.

This, however, turned out to challenging to import for the following reasons:

  1. The character encoding in the CSV file had to be specified in the import statement in mysql:
  CHARACTER SET UTF8
  1. Adding to the confusion - the database itself and the individual columns had different encodings - (i.e. db is latin1 but cols are utf8)

  2. The line ending styles (CRLF/LF) - i.e. carriage return and line feed had to match.

Sqlite, by default, exported with \r\n, which is the Windows style. But I wanted \n for unix. Indeed git automatically removed the \r when I committed this CSV file (it warned me in fairness).

So I needed to modify the sqlite3 exporter:

```sql
.mode csv
# First arg is field separator, next is new line
.separator , "\n"
.once output.csv
select * from mytable;
```
  1. Some columns in the output were quoted - e.g. the 2nd column here: 10999,"Bockstraße" - I was expecting 10999,Bockstraße. I did not anticipate this happening, and ended up importing the quotes to database. Thus a search for Bockstraße failed in a confusing manner, since only "Bockstraße" was there.