Internationalization (i18n)
clear IDA 7.0: Internationalization (i18n)
Intended audience
This document describes an important change that happened in the code while designing IDA version 7.0: the move to using UTF-8 everywhere.
This is mostly of interest to plugin authors, either binary or IDAPython
plugins.
Before 7.0
Prior to version 7.0, IDA would store the following strings:
segment names
function names
function local labels
comments fonr function, segment & address
database notepad contents
...
into the database, using the local 8-bit encoding. Specifically:
on Windows: usually, the OEM codepage, but sometimes the ANSI codepage (for the database notepad)
on Linux & OSX: the locale's 8-bit encoding, which in most cases ended up being UTF-8
The problem
There were many issues with this, but the most obvious ones are:
inability to pass an IDB to someone using a different locale, and have sensible non-ASCII text
it would sometimes be unclear what kind of encoding was used, for what transiting data (e.g., for plugins)
The solution
While we were knee-deep in the refactorings we did for 7.0, we decided it was a good time to improve the situation, and we did so by imposing UTF-8 everywhere in IDA: any string that transits within IDA's memory, is now encoded in UTF-8.
Plugin writers needing to support more than just the ASCII subset of Unicode code points, will certainly find the current situation more comfortable.
What about my older databases?
Because a byte is a byte, and without additional context it's impossible to know how that byte should be interpreted, we had to resort to heuristics when a database is ported to the 7.0 format.
The following will be converted during a database upgrade:
function comments
decompiler pseudo-C function comments
segment comments
hidden areas descriptions
structures comments
structure members comments
enum comments
enum members comments
database notepad contents
script snippets
And, for any of those, here's how IDA will decide whether or not a conversion is needed:
if the contents is not valid UTF-8, or
if a conversion encoding has been specified (see below)
..then IDA will convert that data to UTF-8, using the following rule:
if a conversion encoding has been specified (see below), use that
otherwise
if on windows, assume the source is in the locale's codepage(s)
otherwise (i.e., on Linux or OSX), assume the codepage(s) are those of "Western Europe" -- hopefully covering most databases
And if such a conversion happens (and no conversion encoding was specified), IDA will display an example, post-conversion, text in the messages list for you to figure out whether it did the right thing or not. E.g.,
(in this particular case, IDA converted a test function comment, which in 6.95 would be stored using the OEM codepage CP850 (i.e., "Western Europe"))
As you can see at the bottom of the message above, IDA hints at the UPGRADE_CPSTRINGS_*
configuration directives. Let's have a look at those.
Specifying the source encoding for conversion to UTF-8 using UPGRADE_CPSTRINGS_*
UPGRADE_CPSTRINGS_*
In case IDA got it wrong, and either:
your locale's codepages on Windows don't correspond to those of the machine where the IDB was created, or
you are on Linux or OSX and the IDB was created on a Windows machine with a locale that features codepages other than those for "Western Europe"
...then IDA will improperly interpret those bytes, and the conversion to UTF-8 will yield wrong, possibly garbled results.
In that case, you can help IDA by specifying:
UPGRADE_CPSTRINGS_SRCENC
UPGRADE_CPSTRINGS_SRCENCA
in order to instruct it what encodings/codepages should be used for data that's encoded using the OEM codepage or the ANSI codepage, respectively.
Please have a look in the cfg/ida.cfg
file for more documentation regarding those directives.
Specifying those on the command-line
It's also worth pointing out that those directives can be passed on the command line, using the following syntax:
In this case, IDA will use CP866 (i.e., Cyrillic OEM codepage) to perform the conversion of those string that are stored using the OEM codepage.
Last updated