Binary analysis FAQ

Frequently asked questions about binary analysis with answers.

These are questions that have been asked by users and answered (or at least reviewed) by a ROSE developer.

General information

Q1. What does binary analysis have to do with source-to-source translation?

ROSE was, and is still primarily, a source-to-source compiler. But it turns out that much of what happens in source code analysis can be applied to binary analysis. For instance, parsing of an ELF or PE container (the non-instructions, non-initialized data parts of an executable) is much like parsing source code: some files are read, an abstract syntax tree is produced, and the tree can be analyzed and modified, and unparsed to create a new container. Disassembly produces instructions that have static information much like a statement in a source language. Binaries have control flow graphs, call graphs, data flow, etc. that's very similar to source code. A goal is to be able to unify source and binary analysis as much as possible.

Q2. I'm interested in binary analysis. Where is a good place to start?

Most of the binary analysis capabilities are documented in the library itself and are part the doxygen output. The best place to start is the BinaryAnalysis namespace. In short,

Q3. Why is binary analysis slow, and what can I do about it?

ROSE has extensive invariant checks (i.e., assert), encapsulation, and polymorphic classes. When compiling ROSE, make sure to turn on optimizations (e.g., -O3 -fomit-frame-pointer), turn off assertion checking (-DNDEBUG), and turn off interator checking (i.e., don't define _GLIBCXX_DEBUG). Also use optimized versions of libraries, especially Boost libraries. Doing these can easily make ROSE three or four times faster.

The disassembler in ROSE is different than most. First of all, by default it uses a recursive descent algorithm that requires a certain amount of reasoning about instructions, some of which may invoke an external SMT solver. One can use a linear disassembly instead (see projects/BinaryAnalysisTools/linearDisassemble.C) but then most analysis capabilities are sacrificed because there's no control flow graph. Other options are to turn off SMT usage or use the Yices library instead of a separate SMT executable. Secondly, being an analysis framework, ROSE's disassembler and instruction semantics APIs are more modular, complex, and configurable than simple disassemblers, and this comes at a cost.

Disassembling and partitioning

Q1. What is "partitioning"?

Unlike many disassemblers that use a linear sweep through memory disassembling everything they encounter, ROSE uses a recursive approach. It starts with a set of addresses obtained from the user or from parsing the ELF or PE container, disassembles an instruction, and then uses instruction semantics to determine the next possible values for the instruction pointer register. It then adds those addresses to its work list, repeating until the worklist is empty. This not only generates the individual instructions (SgAsmInstruction), but also a global control flow graph (CFG). The next step is to group instructions together to form functions, essentially coloring parts of the control flow graph, thereby partitioning its vertices into smaller subgraphs.

Q2. Why does ROSE use basic blocks?

ROSE's definition of basic block (BB) relaxes the usual requirement that a BB's instructions be contiguous and non-overlapping in memory, and instead uses only control flow properties: a BB's instructions are always executed sequentially from the first to the last. ROSE uses BBs in most of its analysis because at the semantics level, a BB and an individual instruction are indistinguishable, but a BB "does more". Most analysis is more efficient when it's able to perform larger units of work.

Q3. (No longer applicable.)

Q4. Why is ROSE's CFG structured the way it is?

ROSE has many CFGs that it uses internally, some of which are exposed through an API. For instance, the Partitioner2 has a CFG that it creates while discovering instructions, and a resulting global CFG at the end. Many users find this CFG convenient for analysis, but others need things to be a bit different. The good news is that CFGs are copiable, so all one needs to do is copy the CFG and then make any adjustments he needs for his analysis. There are also V_USER_DEFINED and E_USER_DEFINED vertex and edge types if one needs to insert special things into the graph (they carry no data, so a separate lookup table may be required). The Partitioner2 basic blocks, functions, etc. have a mechanism for attaching arbitrary data.

Specifically, the Partitioner2 CFG has a single vertex to represent an indeterminate address, and all function return sites branch to this vertex. A function call site has an extra "call-return" edge from the call site to the point where the called function would return.

Control flow graphs


Instruction semantics

Q1. Why not store semantics in the AST?

Instruction semantics are encoded as C++ code instead of explicitly stored in the AST as subtrees of SgAsmInstruction for two main reasons:

But having said that, ROSE does have the ability to represent semantics within the AST. See StaticSemantics. The approach used by this relatively simple semantic domain can be used to construct pretty much any kind of semantic data structure the user wants.

Q2. Why not use SgExpression AST nodes as symbolic expression?

We decided to use SymbolicExpr instead because:

Q3. Is it possible to generate an SgExpression AST from semantics?

Yes. In fact, the semantics were designed specifically for generating a variety of data structures, and the ROSE AST is just one possibility. Some examples:

By default, the AST is not populated with instruction semantics because we've found the virtual representation to be more flexible and uses less memory. But if you want them, see StaticSemantics.

Q4. How does one access the memory of a binary specimen?

ROSE provides at least two levels of access. If all you've done is parse a binary container (ELF, PE) then you can traverse the AST to find the SgAsmGenericFile or SgAsmGenericSection in which you're interested and use one of its methods.

If the BinaryLoader has executed then you can traverse the AST to find a SgAsmInterpretation, which has a method to return a MemoryMap.

Q5. How can semantics be made to use specimen memory?

Some implementations of RiscOperators have a method that can be called to set the memory map. However, many of these semantics will only read from memory locations that are non-writable because they assume that such memory will not change during the execution of a program. You might need to remove the write access for parts of a memory map, and this can be done by copying the map (a cheap operation) and using the MemoryMap API.

But let's say you want to use a semantic domain like SymbolicSemantics that doesn't have such a method. In that case, you'll need to subclass RiscOperators and augment its readMemory method to do what you want, namely, read the memory and construct a new symbolic value from what was read.

Q6. How do I add a data member to an existing SValue?

To extend an SValue, create a new SValue class that inherits from an existing SValue. InstructionSemantics2::BaseSemantics::SValue is the base class of the SValue hierarchy. As described here, the convention is that each semantic domain has a number of main components enclosed in a namespace. If you're not overriding all of the components then add typedefs for those that you're not overriding in order that all the main components are defined in your namespace.

Within your new SValue class, add whatever members you need. You may inherit or augment the virtual constructors; they are: undefined_, number_, boolean_, and copy (all but boolean_ are pure virtual in the base class). For most other semantic classes the virtual constructors are named create and clone. You'll also need a static method to create new allocated instances; the convention is to name this method instance. You might want to augment the may_equal and must_equal if your new data member affects value equality. You might also want the print method to output your new data members, and perhaps a new Formatter to control whether and how those members are printed.

If your new SValue data members are set/modified as the result of some RISC operation, then you must also define a new RiscOperators class that inherits ultimately from InstructionSemantics2::BaseSemantics::RiscOperators. Usually your RiscOperators will inherit from the RiscOperators defined in the same namespace as your new SValue class's superclass. You need to override only those RiscOperators operators whose behavior will be different (a list can be found here), plus provide implementations for the virtual constructors. Any operators that you don't override will use the virtual constructor mechanism (the SValue virtual constructors mentioned above) to construct your new SValue objects from a prototypical SValue specified when the RiscOperators was created. Your new RiscOperators methods should never create an SValue using a specific class name–doing so will make your operators less user friendly.

We highly recommend that you test your new semantic domain with InstructionSemantics2::TestSemantics, which tries to catch most type-related errors in your new classes and their direct super classes. If you plan to submit your semantics back to ROSE (or you want an easy way to test your semantics against real programs), then the projects/BinaryAnalysisTools/debugSemantics.C should be modified to support your semantics domain.

Q7. Can I attach information to symbolic expressions?

Each symbolic expression node can store one user-defined datum of arbitrary, copyable type via SymbolicExpr::Node::userData property. The user data is mutable, does not participate in expression hashes, and can be clobbered by ROSE's built-in simplification layer. Users can change the value at any time from any thread (with user-supplied synchronization) regardless of how many expressions are sharing this subexpression. The boost::any type is used as the storage machanism because the AST attribute storage mechanism is too expensive (it's common to have hundreds of millions of these nodes creating during an analysis).