Description

Disassembly and decompilation are two powerful tools in the analyst’s belt, as they make it possible to inspect the internals of an executable file (EXE, ELF, DEX). A disassembler reads the bytes of the binary and tries to interpret them as instructions of a given computer architecture. The DEX format is usually simpler to disassemble because each method carries its own bytes. In other formats such as ELF or EXE, all the executable bytes are commonly stored together in a .text section, so algorithms are needed to recognize the functions in the binary.

Two algorithms are commonly used to disassemble a binary: linear sweep and recursive disassembly. The former goes from the first byte to the last, disassembling each instruction in turn; the latter takes a starting point (commonly the binary’s entry point) and follows the control flow to disassemble the instructions. Newer algorithms combine both approaches (speculative disassembly). Once the code has been disassembled, other algorithms try to find the boundaries of the functions. This is a best-effort process that commonly relies on fixed patterns or control-flow analysis.
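The difference between the two strategies can be sketched on a toy instruction set. Everything below (the opcodes, `TOY_ISA`, `SAMPLE`) is hypothetical and only meant to illustrate the idea; it does not model any real architecture. A junk byte placed after an unconditional jump desynchronizes the linear sweep, while the recursive traversal follows the jump and never decodes it.

```python
# Hypothetical toy ISA: opcode -> (mnemonic, length in bytes).
TOY_ISA = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("ret", 1), 0x04: ("load", 2)}

def decode(code, off):
    """Decode one instruction at `off`; unknown bytes become 1-byte data."""
    name, length = TOY_ISA.get(code[off], ("db", 1))
    operand = code[off + 1] if length == 2 and off + 1 < len(code) else None
    return name, length, operand

def linear_sweep(code):
    """Decode every byte in order, from the first offset to the last."""
    out, off = [], 0
    while off < len(code):
        name, length, operand = decode(code, off)
        out.append((off, name, operand))
        off += length
    return out

def recursive_traversal(code, entry=0):
    """Decode only what is reachable from `entry` by following control flow."""
    out, seen, work = [], set(), [entry]
    while work:
        off = work.pop()
        while off < len(code) and off not in seen:
            seen.add(off)
            name, length, operand = decode(code, off)
            out.append((off, name, operand))
            if name == "jmp":              # unconditional: continue at the target only
                if operand is not None:
                    work.append(operand)
                break
            if name == "ret":              # end of this path
                break
            off += length
    return sorted(out)

# jmp 3; <junk 0x04>; nop; ret
SAMPLE = bytes([0x02, 0x03, 0x04, 0x01, 0x03])
print(linear_sweep(SAMPLE))         # decodes the junk byte as "load" and swallows the nop
print(recursive_traversal(SAMPLE))  # skips the junk byte and finds the nop
```

In this sketch the linear sweep reports a `load` at offset 2 that is never executed, which is the kind of desynchronization that anti-disassembly tricks aim for.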

A decompiler goes one step further: it takes the previously discovered functions and uses different techniques to recognize common patterns from high-level languages (conditional code such as if and if-else, loops such as while or for, etc.), generating pseudo-code that makes the analysis simpler.

Because of how these algorithms work, it’s possible to exploit their flaws or to write specially crafted code that breaks the logic of the algorithm and makes it produce incorrect code.

Techniques

Incorrect Opcodes

While Dalvik bytecode defines a large set of opcodes, this set is not as large as that of other ISAs like x86 or x86-64, where many combinations of opcodes are used to create the different instructions. A disassembler for Dalvik takes the bytecode defined for each method and tries to disassemble all the bytes from the first to the last. The first byte of each instruction (the opcode) is commonly used to detect the type of instruction, and the remaining bytes (how many depends on the instruction format) are used to decode parameters such as registers, fields used, strings accessed, classes, etc. While the Dalvik virtual machine (or currently ART) will not run incorrect instructions, a protector can keep an incorrect set of bytes in the classes.dex file and modify them before they are interpreted.
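As a rough illustration of this opcode-then-format decoding, the sketch below handles a tiny excerpt of the Dalvik opcode table. The names `OPCODES`, `FORMAT_UNITS` and `decode_one` are made up for this example; a real disassembler covers every opcode and format, and validates each field.

```python
# Small excerpt of the Dalvik opcode table: opcode -> (mnemonic, format).
OPCODES = {
    0x00: ("nop", "10x"),
    0x1D: ("monitor-enter", "11x"),
    0x4A: ("aget-short", "23x"),
    0xA7: ("sub-float", "23x"),
}

# Size of each format in 16-bit code units.
FORMAT_UNITS = {"10x": 1, "11x": 1, "23x": 2}

def decode_one(insns, offset):
    """Decode the instruction at byte `offset`; returns (text, size_in_bytes)."""
    opcode = insns[offset]                 # the first byte selects the instruction
    name, fmt = OPCODES[opcode]            # KeyError for opcodes outside the excerpt
    if fmt == "10x":                       # op (no operands)
        text = name
    elif fmt == "11x":                     # op vAA
        text = f"{name} v{insns[offset + 1]}"
    else:                                  # 23x: op vAA, vBB, vCC
        aa, bb, cc = insns[offset + 1 : offset + 4]
        text = f"{name} v{aa}, v{bb}, v{cc}"
    return text, 2 * FORMAT_UNITS[fmt]

print(decode_one(bytes.fromhex("a71c1367"), 0))
```

The four bytes used here also appear in the sample analysed below. They decode to `sub-float v28, v19, v103`, far beyond the two registers that method declares, which is a first hint that the bytes are not real code.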

The next example corresponds to a sample with MD5 78888acc8f2e5b0d59f91ad3b5f6afee:

**************************************************************
* Landroid/support/0RxDAGZKgW2jP4O8XMSGzp8cOHObsCyTp4c1Un... *
*                                                            *
* Instruction Bytes: 0x22                                    *
* Registers Size: 0x2                                        *
* Incoming Size: 0x1                                         *
* Outgoing Size: 0x0                                         *
* Tries Size: 0x0                                            *
*                                                            *
**************************************************************
00463bf0 02 00 01        code_ite
        00 00 00 
        00 00 65 
00463bf0 02 00           dw        2h                      registers_size
00463bf2 01 00           dw        1h                      ins_size
00463bf4 00 00           dw        0h                      outs_size
00463bf6 00 00           dw        0h                      tries_size
00463bf8 65 b4 3e 00     ddw       3EB465h                 debug_info_off
00463bfc 11 00 00 00     ddw       11h                     insns_size
00463c00 00 c2 a7 1c 13  dw[17]                            insns
        67 1d 82 4a bb 
        45 a8 1b 82 4c
    00463c00 [0]             C200h,  1CA7h,  6713h,  821Dh
    00463c08 [4]             BB4Ah,  A845h,  821Bh,  264Ch
    00463c10 [8]             9948h,  6927h,  671Bh,  1EAEh
    00463c18 [12]            206Bh,  5559h,  1EA0h,  B567h
    00463c20 [16]            7D8Ch
00463c22 00 00           dw        0h                      padding

The next buffer corresponds to the bytes of the instructions:

00 c2 a7 1c 13 67 1d 82 4a bb 45 a8 1b 82 4c 26 48 99 27 69 1b 67 ae 1e 6b 20 59 55 a0 1e 67 b5 8c 7d

The parser starts by reading the first bytes. The byte 0x00 can begin several instruction formats: Instruction10x, FillArrayData, PackedSwitch and SparseSwitch. Since the second byte is not 0x01, 0x02 or 0x03, the instruction has to use the Instruction10x format, which in this case means a NOP instruction.

The Instruction10x format requires the second byte to be 0x00. Because here it has another value (0xC2), the disassembler doesn’t understand the instruction, and it’s up to the disassembler to recover or simply skip those bytes in order to keep working.
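The check described above can be sketched in a few lines. `classify_first_unit` and `PAYLOAD_IDENTS` are names invented for this example, and the returned strings are only illustrative; the byte buffer is the one from the sample.

```python
import struct

# Instruction bytes of the sample method, as listed above.
INSNS = bytes.fromhex(
    "00 c2 a7 1c 13 67 1d 82 4a bb 45 a8 1b 82 4c 26 48 "
    "99 27 69 1b 67 ae 1e 6b 20 59 55 a0 1e 67 b5 8c 7d"
)

# Pseudo-instructions that share opcode 0x00 with nop (full 16-bit code unit).
PAYLOAD_IDENTS = {
    0x0100: "packed-switch-payload",
    0x0200: "sparse-switch-payload",
    0x0300: "fill-array-data-payload",
}

def classify_first_unit(insns):
    """Classify the first 16-bit little-endian code unit of a method."""
    (unit,) = struct.unpack_from("<H", insns, 0)
    opcode, high = unit & 0xFF, unit >> 8
    if opcode != 0x00:
        return "other opcode"
    if unit in PAYLOAD_IDENTS:
        return PAYLOAD_IDENTS[unit]
    if high == 0x00:
        return "nop"                       # well-formed Instruction10x
    return f"malformed nop (unit 0x{unit:04x})"

print(classify_first_unit(INSNS))  # malformed nop (unit 0xc200)
```

What a tool does with the malformed unit (skip two bytes, skip one byte, or abort) is a design choice, and the tools discussed next each pick a different one.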

In the case of jadx, the disassembler emits a NOP for these bytes, but it crashes later while disassembling the rest of the method:

.method public JTOhbpONI5DyGC9b1eFzkaeNVyp6mL0Ra4eKLhYVjiJFA4wP0A2oox5m06CwbJ1Ks6o9PsuKisOuqncbe5d6FdV7siv3scfMz3ixhUTbhq2W3dF0dJrPC9XBrn3Ww37VFaGPQnmWzqaLdqe1jwDZu0Si4ZUByWrZeBbOrPAMr9J63Xelz6BB()I
    .registers 2

    .prologue
    .line 7
    #unknown opcode: 0xc200
    nop

    sub-float p27, p18, p102

    monitor-enter p129

    aget-short p186, p68, p167

Error generating smali code: Encountered small uint that is out of range at offset 0x463c0e
org.jf.util.ExceptionWithContext: Encountered small uint that is out of range at offset 0x463c0e
...
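The offset in the error can be reproduced by walking the buffer with the instruction sizes of the opcodes seen in the output. This is a minimal sketch under two assumptions that the source does not confirm: that the byte following the three decoded instructions (0x1b, const-string/jumbo) is what triggers the failure, and that the "small uint" check rejects any 32-bit value with its top bit set. `SIZES` and `find_bad_string_index` are hypothetical names.

```python
import struct

BASE = 0x463C00  # file offset of the insns array in the listing above
INSNS = bytes.fromhex(
    "00 c2 a7 1c 13 67 1d 82 4a bb 45 a8 1b 82 4c 26 48 "
    "99 27 69 1b 67 ae 1e 6b 20 59 55 a0 1e 67 b5 8c 7d"
)

# Size in bytes of the only opcodes this walk needs:
# nop (10x), monitor-enter (11x): 2; sub-float, aget-short (23x): 4;
# const-string/jumbo (31c): 6.
SIZES = {0x00: 2, 0x1D: 2, 0xA7: 4, 0x4A: 4, 0x1B: 6}

def find_bad_string_index(insns, base):
    """Return (file_offset, value) of the first out-of-range jumbo string index."""
    off = 0
    while off < len(insns):
        opcode = insns[off]
        if opcode == 0x1B:
            # The 32-bit string index starts after the opcode and register bytes.
            (index,) = struct.unpack_from("<I", insns, off + 2)
            if index > 0x7FFFFFFF:         # assumed range check (top bit set)
                return base + off + 2, index
        off += SIZES[opcode]               # KeyError for opcodes outside this sketch
    return None

offset, value = find_bad_string_index(INSNS, BASE)
print(hex(offset), hex(value))  # 0x463c0e 0x9948264c
```

The walk treats the malformed first unit as a two-byte nop, as the output above does, and lands on offset 0x463c0e, the same offset reported in the exception.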

apktool behaves similarly: it replaces the unknown instructions with nop, but for methods that raise an exception it does not generate the smali of the class.

Ghidra treats the NOP operation as a single-byte instruction and then continues disassembling the method.

Although this technique is not possible without modifying the bytecode before execution, it is very powerful against disassembly and, because no good disassembly can be obtained for the method, against decompilation as well.

References