Digital work in and around the Humanities often involves moving data from one system or format to another. That data often involves complex textual materials in multiple languages and writing systems. One commonly used format is the "Comma-Separated Values" text file. It's not uncommon to find that characters not used in English get garbled when exported from a spreadsheet program like Microsoft Excel to CSV (or imported from CSV into such a program). What's going on and how do you make it stop?
Why
CSV began life in an era before Unicode and, because of that background, some software assumes that CSV should be encoded using the ASCII text encoding scheme (some older versions of Excel). Some software defaults to ASCII but lets you override it manually (more recent versions of Excel). Some software tries to guess which encoding to use when reading or writing a given CSV file, but its guess may not be foolproof. Some software writes a special code called a Byte-Order Mark (BOM) at the beginning of any CSV file that uses a Unicode-aware encoding (Excel for Mac 2016). Some software doesn't expect a BOM and will fail to read the data correctly even if the encoding (e.g., UTF-8) is otherwise supported.
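To make the two failure modes concrete, here is a small Python sketch. The sample data and the helper name `has_utf8_bom` are invented for illustration, and Windows-1252 stands in for whatever legacy encoding a given program might assume. It shows UTF-8 bytes being misread under the wrong encoding, and a BOM leaking into the first column name when a reader doesn't expect it.

```python
import codecs

def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 Byte-Order Mark."""
    return raw.startswith(codecs.BOM_UTF8)

# Hypothetical sample data with characters not used in English.
sample = "name,city\nZoë,Kraków\n"

utf8_bytes = sample.encode("utf-8")      # UTF-8 without a BOM
with_bom = sample.encode("utf-8-sig")    # UTF-8 with a BOM prepended

# Failure mode 1: UTF-8 bytes read as if they were a legacy
# single-byte encoding (Windows-1252 is assumed here as an example).
garbled = utf8_bytes.decode("cp1252")

# Failure mode 2: a reader that does not expect a BOM keeps it as an
# invisible character glued to the first column name.
first_header = with_bom.decode("utf-8").splitlines()[0].split(",")[0]

print(has_utf8_bom(utf8_bytes), has_utf8_bom(with_bom))
print(garbled)
print(repr(first_header))
```

In the garbled output, each accented letter turns into two unrelated characters, which is the typical symptom of this kind of mismatch.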
How to make it stop
The best way to make it stop is to:
- Make sure that any CSV file you import or export is encoded in UTF-8 without a Byte-Order Mark.
- Make sure that any software you're using is capable of reading and writing CSV files in UTF-8 without BOM and has been told to do so.
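If a file has already been written with a BOM, or in some other known encoding, one way to normalize it is to rewrite it yourself. The following is a minimal Python sketch, not a definitive tool: the function name `rewrite_as_utf8_no_bom` and its parameters are hypothetical, and it assumes you already know the source file's encoding. The default, Python's `utf-8-sig` codec, reads UTF-8 and drops a leading BOM if one is present.

```python
def rewrite_as_utf8_no_bom(src, dst, source_encoding="utf-8-sig"):
    """Copy a CSV file, writing the result as UTF-8 without a BOM.

    source_encoding must name the encoding the file is actually in;
    the default "utf-8-sig" accepts UTF-8 with or without a BOM.
    """
    # newline="" leaves line endings untouched, so line breaks inside
    # quoted CSV fields survive the round trip unchanged.
    with open(src, "r", encoding=source_encoding, newline="") as infile:
        text = infile.read()
    # The plain "utf-8" codec never writes a BOM.
    with open(dst, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
```

For example, `rewrite_as_utf8_no_bom("export.csv", "clean.csv")` would handle a UTF-8 export that carries a BOM (the file names are placeholders). For a file in a different encoding, pass that encoding explicitly, e.g. `source_encoding="cp1252"`. Guessing wrong will either raise an error or produce garbled text, so check the output before relying on it.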