Description

Pointer detection analysis.

This analysis attempts to discover which memory addresses store pointer variables and whether those pointer variables point to code or data. The goal is to detect the storage location of things like "arg1", "arg2", and "var2" in the following C code after it is compiled into a binary:

int f1(bool (*arg1)(), int *arg2) {
    int *var2 = arg2;
    return arg1() ? 1 : *var2;
}

Depending on how the binary is compiled (e.g., which compiler optimizations were applied), it may or may not be possible to detect all the pointer variables. On the other hand, the compiler may generate temporary pointers that don't exist in the source code. Since binary files have no explicit type information (except perhaps in debug tables, on which we don't want to depend), we have to discover that something is a pointer by how it's used. The property that distinguishes data pointers from non-pointers is that they're used as addresses when reading from or writing to memory.

Algorithm

The algorithm works by performing a data-flow analysis in the symbolic domain with each CFG vertex also keeping track of which memory locations are read. When the data-flow step completes, the algorithm scans all memory locations (across all CFG vertices) to get a list of addresses. Each address expression includes a list of all instructions that were used to define the address. For instance, given this simpler code:

; int deref(int *ptr, int index) { return ptr[index]; }
L0: push ebp
L1: mov ebp, esp
L3: mov eax, [ebp+8]
L6: mov ecx, [ebp+12]
L9: mov eax, [eax + ecx*4]
Lc: leave
Ld: ret

L9 reads from memory address eax + ecx * 4, and that address was calculated by previous instructions:

L3 read a value from the stack, therefore L3 is a definer of EAX's value before L9
L6 read a value from the stack, therefore L6 is a definer of ECX's value before L9
L9 performed arithmetic on EAX and ECX, the result of which is defined by L3, L6, and L9.
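The definer bookkeeping in the list above can be illustrated with a minimal sketch. It is not the ROSE implementation: SymValue, load, combine, and derefAddress are hypothetical names invented for this illustration, and the sketch ignores the stack-pointer definers (such as L0) that the real analysis also records.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical stand-in for a symbolic value: an expression plus the
// addresses of the instructions that contributed to it (its definers).
struct SymValue {
    std::string expr;
    std::set<uint32_t> definers;
};

// A value loaded from memory by instruction `insn` is a fresh variable
// whose only definer is that instruction.
SymValue load(const std::string &name, uint32_t insn) {
    return SymValue{name, {insn}};
}

// Arithmetic performed by instruction `insn` yields a value defined by
// the definers of both operands plus `insn` itself.
SymValue combine(const std::string &op, const SymValue &a, const SymValue &b, uint32_t insn) {
    assert(!op.empty());
    SymValue result;
    result.expr = "(" + op + " " + a.expr + " " + b.expr + ")";
    result.definers = a.definers;
    result.definers.insert(b.definers.begin(), b.definers.end());
    result.definers.insert(insn);
    return result;
}

// Rebuilds the address read by L9 in the deref example: eax + ecx*4.
SymValue derefAddress() {
    SymValue eax = load("arg_ptr", 0x3);     // L3: mov eax, [ebp+8]
    SymValue ecx = load("arg_index", 0x6);   // L6: mov ecx, [ebp+12]
    SymValue four{"4", {}};                  // constant scale, no definers
    SymValue scaled = combine("mul", ecx, four, 0x9);
    return combine("add", eax, scaled, 0x9); // L9 computes the final address
}
```

Calling derefAddress() yields an address whose definer set is {0x3, 0x6, 0x9}, matching the three instructions named above.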

Other addresses in addition to the one read by L9 are:

The return address stored at the top of the initial stack used by the RET instruction. Defined by L0 and Lc.
The location of the first function argument, defined by L0 and L3.
The location of the second function argument, defined by L0 and L6.
The location of the saved EBP, defined by L0.

A second step, which needs no second data-flow pass but reuses the information gathered by the first, looks at the addresses read by the instructions that defined an address. For instance, L3, L6, and L9 are the instructions that defined the address used by L9, and all three of them read some memory:

L3 read the first argument starting at four bytes past the original ESP.
L6 read the second argument starting at eight bytes past the original ESP.
L9 read an element of the array.

Since L9 reads from the same address whose definers we are processing, we discard the information from L9, keeping only the two reads from L3 and L6. Both of these reads match the width of the stack pointer, therefore we keep both (matching on width is an optional setting of this analysis) and the analysis deems them "addresses of data pointers". Incidentally, the width of the stack pointer is used as the width of data pointers, and the width of the instruction pointer is used as the width of code pointers. The result is that eight bytes on the stack are deemed addresses of data pointers. They are:

(add[32] esp_0[32] 0x00000004[32])
(add[32] esp_0[32] 0x00000005[32])
(add[32] esp_0[32] 0x00000006[32])
(add[32] esp_0[32] 0x00000007[32])
(add[32] esp_0[32] 0x00000008[32])
(add[32] esp_0[32] 0x00000009[32])
(add[32] esp_0[32] 0x0000000a[32])
(add[32] esp_0[32] 0x0000000b[32])
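The filtering that produces these eight byte addresses can be sketched as follows. This is an illustration under stated assumptions, not ROSE code: MemoryRead, pointerByteAddresses, and derefExample are hypothetical names, addresses are reduced to a symbolic base plus a constant offset, and the pointer width is fixed at four bytes to match the 32-bit example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one memory read observed during the data-flow step.
struct MemoryRead {
    uint32_t insn;      // address of the instruction that performed the read
    std::string base;   // symbolic base of the address that was read
    uint32_t offset;    // constant byte offset added to the base
    unsigned nBytes;    // number of bytes read
};

// Returns the byte addresses deemed to hold data pointers: reads made by a
// definer of the address, excluding the read through that address itself,
// and keeping only reads as wide as a pointer.
std::vector<std::string> pointerByteAddresses(const std::vector<MemoryRead> &reads,
                                              const std::set<uint32_t> &definers,
                                              uint32_t addressUser,
                                              unsigned pointerBytes) {
    assert(pointerBytes > 0);
    std::vector<std::string> result;
    for (const MemoryRead &r : reads) {
        if (definers.count(r.insn) == 0)
            continue;                        // not a definer of the address
        if (r.insn == addressUser)
            continue;                        // the dereference itself (L9)
        if (r.nBytes != pointerBytes)
            continue;                        // width does not match a pointer
        for (unsigned i = 0; i < r.nBytes; ++i) {
            char buf[128];
            std::snprintf(buf, sizeof buf, "(add[32] %s[32] 0x%08x[32])",
                          r.base.c_str(), static_cast<unsigned>(r.offset + i));
            result.push_back(buf);
        }
    }
    return result;
}

// The three reads from the deref example, filtered for the address used by L9.
std::vector<std::string> derefExample() {
    const std::vector<MemoryRead> reads = {
        {0x3, "esp_0", 4, 4},          // L3 loads "ptr"
        {0x6, "esp_0", 8, 4},          // L6 loads "index"
        {0x9, "array_element", 0, 4},  // L9 loads ptr[index]
    };
    return pointerByteAddresses(reads, {0x3, 0x6, 0x9}, 0x9, 4);
}
```

In this sketch derefExample() returns eight strings, from offset 0x00000004 through 0x0000000b relative to esp_0, in the same order as the list above.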

An astute observer will notice that the algorithm has detected both "ptr" and "index" as pointers. Although they are not both "pointers" per se in the C language, they are indeed both pointers by some definition of assembly language: they're both used as indexes into a global memory address space.

The analysis also detects other pointers that are not evident from the C source code: EBP's stored location just below the original top-of-stack is a pointer, and the return address stored at the top of the stack is a pointer.

Usage

Like most binary analysis functionality, binary pointer detection is encapsulated in its own namespace. The main class, Analysis, performs most of the work. A user instantiates an analysis object, giving it a configuration at the same time, then invokes one of its analysis methods, such as Analysis::analyzeFunction, one or more times and queries the results after each analysis. The results are returned as symbolic address expressions relative to some initial state.

The "testPointerDetection.C" tester has an example use case.

Classes
class	Analysis
	Pointer analysis.

class	PointerDescriptor
	Description of one pointer.

class	Settings
	Settings to control the pointer analysis.

Typedefs
using	PointerDescriptors = std::list< PointerDescriptor >
	Set of pointers.

Functions
void	initDiagnostics ()
	Initialize diagnostics.

Variables
Sawyer::Message::Facility	mlog
	Facility for diagnostic output.

Typedef Documentation

◆ PointerDescriptors

using Rose::BinaryAnalysis::PointerDetection::PointerDescriptors = std::list<PointerDescriptor>

Set of pointers.

Definition at line 217 of file PointerDetection.h.

Function Documentation

◆ initDiagnostics()

void Rose::BinaryAnalysis::PointerDetection::initDiagnostics ( )

Initialize diagnostics.

This is normally called as part of ROSE's diagnostics initialization, but calling it more than once is harmless.

Variable Documentation

◆ mlog

Sawyer::Message::Facility Rose::BinaryAnalysis::PointerDetection::mlog

extern

Facility for diagnostic output.

The facility can be controlled directly or via ROSE's command line.
