IDA 7.0: Automatic discovery of string literals during auto-analysis
Intended audience: experienced power users who wish to obtain the best initial auto-analysis results, in particular on files containing non-ASCII string literals.
Note that IDA usually already provides very good results out of the box, so the information below is really for expert, fine-tuning purposes.
When it performs its initial auto-analysis, IDA will, among many other things, look for string literals in the segments that were loaded from the file.
That "looking for string literals" relies rather heavily on heuristics to tell possible string literals apart from other data. Some of the concepts used by those heuristics are:
length of candidate string
proximity of other strings
whether characters of candidate strings are printable
whether characters are part of ida.cfg's set of acceptable chars in a string literal
whether the characters met in the candidate string are either ASCII or, for the non-ASCII ones, all part of the same language
...
The rest of this document will focus on the 4th item: the set of acceptable chars in a string literal.
Prior to IDA 7.0, string literals were treated as plain strings of bytes, and it was assumed that the locale's encoding should be used whenever decoding them into actual, displayable strings.
That worked reasonably well, but it led to many false positives, and it made it impossible for IDA to perform the best possible auto-analysis, even when the user knew which specific encodings were used in the file.
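To illustrate why decoding raw bytes with whatever the locale's encoding happens to be is fragile, here is a minimal Python sketch. It is not IDA code: the helper name `decode_with` and the sample bytes are assumptions made up for this illustration. It decodes the same bytes with two of the single-byte codepages mentioned in this document.

```python
# Illustration only (not IDA code): the same bytes read differently
# depending on the encoding assumed when decoding them.
SAMPLE = b"caf\xe9"  # hypothetical bytes found in a binary


def decode_with(data: bytes, encoding: str) -> str:
    """Decode raw bytes using the given single-byte codepage."""
    return data.decode(encoding)


# In CP1252, 0xE9 is LATIN SMALL LETTER E WITH ACUTE.
as_western = decode_with(SAMPLE, "cp1252")
# In CP1251, 0xE9 is CYRILLIC SMALL LETTER SHORT I.
as_cyrillic = decode_with(SAMPLE, "cp1251")

print(as_western)
print(as_cyrillic)
```

The bytes are identical in both cases; only the assumed encoding changes what the user sees.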
IDA 7.0 changes that, and always assigns a default encoding for each of the 1-, 2- and 4-bytes-per-unit kinds of string literals.
Example 1-byte-per-unit encodings are: CP1252, CP1251, UTF-8
Example 2-bytes-per-unit encodings are: UTF-16
Example 4-bytes-per-unit encodings are: UTF-32
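As a rough illustration of what "bytes-per-unit" means, the following Python sketch measures how many bytes a couple of characters take in each of the example encodings above. The helper `encoded_size` is a name chosen for this illustration only. Note that a 1-byte-per-unit encoding such as UTF-8 may still need several units for one character.

```python
# Illustration only: bytes needed to store a character in the example
# encodings listed above.
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes needed to store `text` in `encoding`."""
    return len(text.encode(encoding))


# The little-endian variants are used so that Python does not prepend a
# byte order mark, which would distort the sizes.
for enc in ("cp1252", "utf-8", "utf-16-le", "utf-32-le"):
    print(enc, encoded_size("A", enc), encoded_size("é", enc))
```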
Unless one is specified, IDA will 'guess' those encodings, and for the 1-byte-per-unit encoding, it'll do so in the following manner:
if the file is a typical Windows or DOS binary (i.e., PE, EXE or COM), then
  if running on Windows, then use the locale codepage
  else (i.e., running on Linux or OSX) default to CP1252
otherwise,
  default to UTF-8
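The decision list above can be sketched as a small Python function. This is only a restatement of the rules as described here, not IDA's implementation: the function name, its parameters and the string values used for file formats and operating systems are all hypothetical.

```python
def guess_1bpu_encoding(file_format: str, host_os: str, locale_codepage: str) -> str:
    """Restate the guessing rules above (sketch; all names are hypothetical).

    file_format     -- e.g. "PE", "EXE", "COM", "ELF"
    host_os         -- "windows", "linux" or "osx"
    locale_codepage -- codepage of the host locale, e.g. "CP1251"
    """
    if file_format.upper() in {"PE", "EXE", "COM"}:
        # Typical Windows or DOS binary.
        if host_os.lower() == "windows":
            return locale_codepage
        return "CP1252"
    # Everything else defaults to UTF-8.
    return "UTF-8"


print(guess_1bpu_encoding("PE", "windows", "CP1251"))
print(guess_1bpu_encoding("PE", "linux", "CP1251"))
print(guess_1bpu_encoding("ELF", "linux", "CP1251"))
```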
Those are the "best guess" defaults and they are, in effect, not very different from what happened in IDA before version 7.0.
ENCODING configuration directive
Specifying ENCODING in the ida.cfg configuration file (or on the command line) lets the user inform IDA that the bytes in a 1-byte-per-unit string literal are encoded using that encoding.
Now that the default (or ENCODING-specified) encoding topic is covered, let's get back to the root of the problem.
Before 7.0, IDA would use ida.cfg's (somewhat confusingly named) AsciiStringChars directive to determine which bytes could be part of a string literal.
That AsciiStringChars directive is a byte string, which contains essentially all printable ASCII chars as well as a subset of the upper 128 values of the [0-256) range.
The most visible problems with this are:
whenever a user wants to improve AsciiStringChars to match the set of bytes that look valid in a different encoding, the user typically has to:
  look up that encoding's definition, to see which values above 0x7F are likely valid string literal characters in that encoding
  encode those in the global ida.cfg file, which can be pretty tricky if the user's editor is not set up to work in that target encoding: it will show those byte values as other characters
no support for UTF-8 sequences: AsciiStringChars doesn't support multibyte encodings. If the user is analyzing, say, a Linux binary file, it's likely that non-ASCII string literals are encoded using a multibyte encoding such as UTF-8. There was no way for the user to express in ida.cfg which non-ASCII UTF-8 sequences are acceptable.
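One way to see why a set of acceptable bytes cannot express acceptable UTF-8 sequences is sketched below in Python. This is an illustration under assumptions, not a description of IDA internals: the two "wanted" characters and the helper name are made up. Whitelisting the bytes of two characters also lets through other characters built from the same bytes.

```python
# Illustration only: a byte-based whitelist cannot single out specific
# UTF-8 characters, because different characters share the same bytes.
wanted = ["é", "ī"]  # hypothetical characters the user wants to accept

acceptable_bytes = set()
for ch in wanted:
    # "é" is C3 A9 and "ī" is C4 AB in UTF-8.
    acceptable_bytes.update(ch.encode("utf-8"))


def accepted_by_byte_set(text: str) -> bool:
    """True if every UTF-8 byte of `text` is in the byte whitelist."""
    return all(b in acceptable_bytes for b in text.encode("utf-8"))


# "ë" is C3 AB and "ĩ" is C4 A9: never requested, yet accepted.
unintended = [ch for ch in ("ë", "ĩ") if accepted_by_byte_set(ch)]

print(sorted(hex(b) for b in acceptable_bytes))
print(unintended)
```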
Instead of AsciiStringChars consisting of a C-like string of bytes describing the acceptable set of characters, we have:
renamed AsciiStringChars to the less ambiguous StrlitChars
turned StrlitChars into something more evolved, which can contain not only character literals, but also other forms of content
Let's look at those.
StrlitChars format
The new StrlitChars is composed of a sequence of entries. E.g.,
We can observe that:
entries are separated by ',' (commas)
string literals are accepted, which allows adding ASCII printable characters very easily
Unicode codepoints (uXXXX entries) are accepted
you can add a whole 'culture' to the set of accepted characters/codepoints
you can add the 'current culture' to the set of accepted characters/codepoints
When IDA starts, it will compile that directive into an efficient lookup table containing all the codepoints that were specified, and that lookup table will be used, just like AsciiStringChars was, to determine which codepoints are acceptable in a string literal.
Let's now take a closer look at the notions of 'culture' and 'current culture'.
First of all, let's be blunt: we use the term 'culture' for lack of a better word. It doesn't represent an actual culture in terms of history, tradition, etc.
A 'culture' in IDA is a quick way to represent a set of codepoints, that conceptually belong together. Typically, those 'culture's will contain many letters, but very few symbol or punctuation codepoints (in order to reduce the number of false positives in automatic string detection.)
As an example, if we wanted to add the set of characters supported by the "Western Europe" charsets to the StrlitChars directive without using 'cultures', we could do it like so:
Note that we just introduced two additional syntactic possibilities [1] here:
Unicode codepoint range: uXXXX..uXXXX (end inclusive)
Codepoint suppression: -uXXXX
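To make the entry forms described so far more concrete, here is a minimal Python sketch that turns a list of already-split entries into a set of accepted codepoints. It is not IDA's parser. The function names, the choice to apply entries from left to right, and the sample entries (including the Latin-1 letter range) are assumptions made for this illustration. 'Culture' entries are deliberately left out.

```python
import re

# Entry forms taken from the text: uXXXX, uXXXX..uXXXX and -uXXXX.
CODEPOINT = re.compile(r"^u([0-9A-Fa-f]{4})$")
RANGE = re.compile(r"^u([0-9A-Fa-f]{4})\.\.u([0-9A-Fa-f]{4})$")
SUPPRESS = re.compile(r"^-u([0-9A-Fa-f]{4})$")


def compile_entries(entries):
    """Build a set of accepted codepoints from already-split entries.

    Assumption: entries are applied left to right, so a suppression only
    removes codepoints added by earlier entries.
    """
    accepted = set()
    for entry in entries:
        if len(entry) >= 2 and entry[0] == entry[-1] == '"':
            # String literal: accept each of its characters.
            accepted.update(ord(ch) for ch in entry[1:-1])
        elif (m := RANGE.match(entry)):
            first, last = int(m.group(1), 16), int(m.group(2), 16)
            accepted.update(range(first, last + 1))  # end inclusive
        elif (m := CODEPOINT.match(entry)):
            accepted.add(int(m.group(1), 16))
        elif (m := SUPPRESS.match(entry)):
            accepted.discard(int(m.group(1), 16))
        else:
            raise ValueError(f"unsupported entry: {entry!r}")
    return frozenset(accepted)


def is_acceptable(text, accepted):
    """True if every codepoint of `text` is in the compiled set."""
    return all(ord(ch) in accepted for ch in text)


# Hypothetical entries: lowercase ASCII and space, the copyright sign,
# and the Latin-1 letters minus the multiplication and division signs.
table = compile_entries(
    ['"abcdefghijklmnopqrstuvwxyz "', "u00A9", "u00C0..u00FF", "-u00D7", "-u00F7"]
)
print(is_acceptable("café ©", table))
print(is_acceptable("×", table))
```

A frozen set is used here as the "lookup table"; any structure with fast membership tests would serve the same purpose.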
As you can guess, this can become a tad tedious. And Latin 1 is simple: if I wanted to add the characters that are likely to be found in, say, the "Baltic" culture (which roughly corresponds to codepage CP1257), I would have to add ~70 disjoint codepoints, which quickly becomes cryptic and error-prone.
IDA ships with a predefined set of 'culture' files. They can be found in the cfg/ directory:
...but you are of course free to add your own, and/or modify or improve the existing ones as needed (you can even send those back to us; they'll be very much welcome!)
Ok, so now you know a bit about what a 'culture' is in IDA's parlance. There's one more thing to cover though, and it's non-trivial: the CURRENT_CULTURE token.
What is CURRENT_CULTURE about?
The StrlitChars directive will typically contain the CURRENT_CULTURE token. That instructs IDA that all codepoints derived from the 'current culture' that IDA is operating with should be considered valid codepoints in string literals.
There can be 2 sources of information for IDA to know what 'current culture' it should be operating with:
the CULTURE config directive (in ida.cfg), or
the default 1-byte-per-unit character encoding of the IDB, if that encoding is not UTF-8 [2] (regardless of whether IDA assigned that default 1-byte-per-unit character encoding, or whether the ENCODING directive was provided.)
Let's have a look at those.
CULTURE config directive
It is possible to tell IDA, at start-time, what 'culture' it should be operating with, by setting the CULTURE configuration directive in the ida.cfg file. E.g.,
The above statement means that IDA will load the cfg/Cyrillic.clt file, parse its set of codepoints, and add that to the ones already specified by the StrlitChars directive.
Therefore, when performing its initial auto-analysis, IDA will consider valid for a string literal all codepoints defined by StrlitChars, and that means:
codepoints within the specified ASCII subset,
or among the set of carefully-selected symbols ('COPYRIGHT_SIGN', etc.),
or among the set of codepoints featured in the cfg/Cyrillic.clt file.
If you didn't specify the CULTURE config directive though (which is the default), IDA will try to 'guess' the culture from the current 1-byte-per-unit encoding of the database, but only if that encoding is not a multibyte encoding (e.g., UTF-8.)
However, if the encoding is UTF-8, things will be different...
Non-UTF-8 files: deriving the 'culture' from the default 1-byte-per-unit encoding
By default, IDA doesn't have a CULTURE specified in its ida.cfg file. Instead, it will try to derive the 'current culture' from the default 1-byte-per-unit encoding (provided that encoding is not UTF-8.)
Whether that encoding is specified using the ENCODING directive or guessed from the system's locale, IDA will derive the 'current culture' from that encoding using the following table in ida.cfg:
For example, if the default 1-byte-per-unit encoding is CP1252, IDA derives that the 'culture' is Latin_1, causing auto-analysis to discover the following string in a file:
...but if that encoding is something else (e.g., CP1251), then you might end up with this instead:
That is because IDA derived the 'culture' from the encoding, which in this case led to the 'Cyrillic' culture, which doesn't contain the French letter 'é', causing string recognition to fail.
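The effect described above can be imitated with a small Python sketch. Everything in it is a simplified assumption: the culture contents are rough stand-ins for the real .clt files, the encoding-to-culture table only lists the two mappings named in this section, and the check works on already-decoded text rather than on raw bytes.

```python
# Hypothetical, heavily simplified culture contents (sets of codepoints).
CULTURES = {
    "Latin_1": frozenset(range(0x00C0, 0x0100)) - {0x00D7, 0x00F7},
    "Cyrillic": frozenset(range(0x0410, 0x0450)),
}

# Only the two mappings named in the text; the real table is in ida.cfg.
ENCODING_TO_CULTURE = {"CP1252": "Latin_1", "CP1251": "Cyrillic"}

ASCII_PRINTABLE = frozenset(range(0x20, 0x7F))


def accepted_codepoints(encoding):
    """ASCII plus the codepoints of the culture derived from `encoding`."""
    culture = ENCODING_TO_CULTURE.get(encoding.upper())
    return ASCII_PRINTABLE | CULTURES.get(culture, frozenset())


def looks_like_string(text, encoding):
    """True if every codepoint of `text` is acceptable for `encoding`."""
    accepted = accepted_codepoints(encoding)
    return all(ord(ch) in accepted for ch in text)


print(looks_like_string("résumé", "CP1252"))  # Latin_1 contains 'é'
print(looks_like_string("résumé", "CP1251"))  # Cyrillic does not
```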
In order to fix this, you can run IDA like so:
Then, all is fine again: IDA could find that string literal:
In addition, if you often disassemble files that require you to specify a given ENCODING, you can simplify your workflow by either:
setting ENCODING in ida.cfg: ENCODING=CP1252
adding Latin_1 as culture in StrlitChars:
UTF-8 files: specifying a CULTURE for IDA to provide the best auto-analysis
In case the default database encoding is UTF-8, however, IDA cannot derive a 'culture' from it.
In that case, IDA will consider by default that all non-ASCII codepoints are not acceptable. That's because accepting all non-ASCII codepoints by default would likely bring too many false positives.
To change that behavior, you can specify the CULTURE configuration directive to match the language(s) that you believe the binary file's strings are written in.
For example, in a UTF-8 Android Dalvik file that contains some French text, IDA might fail to recognize the following string:
...and turn it into double-words instead at the end of the auto-analysis:
In order to fix this, you can specify the 'culture' for IDA to consider the acceptable set of non-ASCII codepoints for that file:
...and IDA will be able to determine that there is indeed a string there:
CULTURE=all: accept codepoints from all cultures
Although in the previous section we mentioned that accepting all codepoints by default in a string literal might lead to many false positives, it is still possible to instruct IDA to do so, by using the all wildcard:
CURRENT_CULTURE: wrapping up
Therefore, the user can either:
specify an ENCODING for 1-byte-per-unit string literals, and if that encoding is not UTF-8 let IDA derive the 'current culture' from it, or
specify a CULTURE, to override whatever IDA might have derived from the effective database 1-byte-per-unit encoding (regardless of whether it was guessed, or specified with ENCODING)
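The precedence between the two options can be summarized with the following Python sketch. It is a restatement of the rules described in this document, not IDA code: the function name, its parameters and the two-entry mapping passed to it are hypothetical.

```python
def resolve_current_culture(culture_directive, encoding, encoding_to_culture):
    """Return the name of the 'current culture', or None if there is none.

    culture_directive   -- value of CULTURE, or None when it is not set
    encoding            -- effective 1-byte-per-unit encoding (guessed or
                           specified with ENCODING)
    encoding_to_culture -- mapping from encoding name to culture name
    """
    if culture_directive is not None:
        # An explicit CULTURE overrides anything derived from the encoding.
        return culture_directive
    if encoding.upper() == "UTF-8":
        # No culture can be derived from UTF-8.
        return None
    return encoding_to_culture.get(encoding.upper())


# Only the two mappings named in this document are listed here.
table = {"CP1252": "Latin_1", "CP1251": "Cyrillic"}

print(resolve_current_culture(None, "CP1252", table))
print(resolve_current_culture("Cyrillic", "CP1252", table))
print(resolve_current_culture(None, "UTF-8", table))
print(resolve_current_culture("Latin_1", "UTF-8", table))
```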
There's a lot of non-trivial information for you to process in this document, and by now you might be either a bit overwhelmed, or just plain confused.
Let me sum up the information in the following manner:
On encodings:
  IDA now automatically guesses & assigns 1-byte-per-unit, 2-bpu and 4-bpu encodings to a database
  That guess can be overridden by specifying an ENCODING
  Regardless of whether it was guessed or specified, that encoding can be used to derive a 'current culture'. That doesn't work for UTF-8 though, as that encoding covers the whole Unicode range
On StrlitChars:
  IDA 7.0 introduces the notion of 'culture'. A 'culture' file describes a set of codepoints that are conceptually grouped together, although they can be disjoint in the Unicode specification
  IDA 7.0 extends the previous AsciiStringChars directive, making it capable of expressing much more than just 1-byte characters, and renames it to StrlitChars
  StrlitChars has a rather flexible syntax, allowing for literals, codepoints, codepoint ranges, codepoint blocks, codepoint suppressions, embedding 'cultures', and even embedding the 'current culture'
  The 'current culture' is either guessed from the 1-byte-per-unit default encoding, or can be specified with the CULTURE directive
  Just as with IDA 6.95's AsciiStringChars, the new StrlitChars will be used by the initial auto-analysis, in order to guess possible string literals in the program
[1] See ida.cfg for a wider coverage of the syntax.
[2] UTF-8 covers the whole Unicode codepoint range, and thus a 'culture' derived from the UTF-8 encoding would be overly inclusive and turn up many false positives.