ROSE 0.11.145.147
Binary analysis tutorial

Getting started guide for binary analysis.

Why binary analysis?

ROSE was originally designed as a source-to-source analysis and transformation system, but it turns out that parsing and analyzing a binary specimen shares many of the same basic ideas:

  • A binary specimen usually lives in a container that describes areas of memory, symbols, and other things needed when the specimen is executed or linked. These various containers (executables, libraries, object files, core dumps) can be parsed by a ROSE frontend and result in an abstract syntax tree (AST) much like parsing of source code.
  • The machine instructions of a binary specimen can be parsed an have semantics similar to statements in a source language. They also result in abstract syntax trees.
  • Data areas in the specimen can be parsed and are similar to initialized data in a source language. These also become abstract syntax trees.
  • Analysis can be performed on an intermediate representation (IR) for both binaries and source code. One such intermediate representation is the abstract syntax tree.
  • Transformations modify the AST and the resulting AST can be "unparsed" to create a new binary specimen, much like unparsing transformed source code results in a new implementation.

Of course there are also many differences between source code and binary specimens, but ROSE is situated in the unique position of being able to process both. On the one hand, it can operate on a specimen for which no source code is available, and on the other it can perform analysis and transformations where both source and binary are available at once.

What binary analysis features are available?

ROSE can parse a variety of binary inputs. It can parse executable formats such as Linux ELF executables, shared libraries, core dumps, object files, and library archives; and Microsoft Windows PE and DOS executables and libraries. It can parse memory initialization formats such as Motorola S-Records, Intel HEX files, and raw memory dumps. It can analyze running or paused programs on Linux. ROSE can also analyze combinations of these formats&emdash;such as an ELF file extended by providing additional initialized memory from an S-Record file&emdash;and has a mini command-line language for specifying this for all binary analysis tools.

The ROSE disassembler uses a hybrid linear sweep and recursive descent approach that handles things like overlapping instructions (on architectures where that's possible), interleaved functions, rewritten execution flow, and more. ROSE has instruction decoders for ARM, AMD64 (x86_64), Intel x86, MIPS, Motorola 68k, and PowerPC (32- and 64-bit). ROSE is designed so that most analysis is architecture independent.

ROSE understands semantics for common subsets of AMD64, Intel x86, Motorola 68k, and PowerPC (32- and 64-bit) and can therefore reason about how execution of instructions affects a machine's state. It can perform this reasoning symbolically, using only concrete values, using sets of intervals of concrete values, and many more special-purpose domains. If an Satisfiability Modulo Theory (SMT) solver is available, ROSE can use it to increase the fidelity of its analyses, including reasoning that's employed during the disassembly phase.

ROSE is able to construct control flow graphs and call graphs and has a number of pre-built analysis capabilities such as tainted flow analysis, edit distances, code statistics, memory maps, reasoning about register interactions, address usage maps, execution paths and their feasibility, generic inlining and interprocedural data-flow, no-op sequence detection, opaque predicate detection, algorithmic CFG rewriting, calling convention detection, function may-return analysis, address labeling passes, thunk detection and manipulating, switch statement analysis, interrupt vector processing, overlapping instructions, cross referencing, function stack delta analysis, register and memory usage analysis, magic numbers, static and dynamic string decoding, control flow dominance, pointer detection, DWARF parsing, library identification, etc.

Analysis results are available in a number of formats, depending on the analysis: through various APIs, decoration of the abstract syntax tree or control flow graph, commentary in assembly listings, databases, or to a certain extent as a new translated binary specimen.

This document only attempts to describe a few of the easiest features to get you started.

Installing and confguring ROSE for binary analysis

The binary analysis features have a large number of software dependencies beyond what the source-to-source side of ROSE uses, but fortunately all of them are optional. If a software dependency is not available then the analysis that uses it is simply not available. In a few cases cases the analysis can use a different dependency or a slower or less accurate implementation. See installation for instructions on how to install ROSE. ROSE must be configured with "binaries" being one of the "--enable-languages" values.

Hello, World!

Every system needs a basic "Hello, World!" example, so here's that example for binary analysis within the ROSE framework. It parses a specimen from a variety of formats (see its "--help" output), then disassembles the instructions optionally using instruction semantics to help drive the disassembly, then partitions the instructions into functional units, and finally displays an assembly listing.

#include <rose.h>

All programs start by including the main ROSE include file that declares most things that are common to source and binary analysis. This must be the first ROSE header to be included.

#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
#include <AsmUnparser.h>

ROSE's binary components are declared in individual header files that can only be included after including "rose.h". The binary headers themselves can be included in any order (after rose.h) since each header includes those other headers on which it depends. The binary analysis headers usually have the same names as the primary class they contain.

ROSE_INITIALIZE; // see Rose::initialize
std::string purpose = "disassembles a binary specimen";
std::string description =
"This tool disassembles the specified file and presents the results "
"as a pseudo assembly listing, a listing intended for human consumption "
"rather than assembling. This implementation serves as the \"Hello, "
"World!\" example for binary analysis, so let's keep it simple!";

The binary analysis uses a different command-line processor than most of the rest of ROSE because the command-line options for these tools don't typically interact with a compiler and therefore don't need to parse the host of compiler switches that the source tools need to recognize. This also frees the binary tool's command-line parser to be able to generate Unix man pages on the fly. For instance, if you invoke this tool with "--help" (or just "-h") you should see quite lengthy documentation about how to invoke the tool and what all its switches mean. A tool more advanced than "Hello, World!" can modify the command-line parser.

SgAsmBlock *gblock = engine->frontend(argc, argv, purpose, description);
static Ptr instance()
Allocating constructor.
Instruction basic block.

ROSE divides disassembly into two phases. Within ROSE, a Disassembler is responsible for parsing the bytes of a single machine instruction and turning them into an abstract syntax tree, while the Partitioner2 namespace contains the machinery for driving disassembly. The partitioner partitions the specimen's memory image into parts that are either instructions or data (actually, "partitioner" is a bit of a misnomer since instructions and data can also overlap, although they seldom do in practice).

The Partitioner2 namespace contains many classes, but Engine and Partitioner are the two most important classes. The Engine is used to drive the disassembly process and create a Partitioner, which holds the results in data structures that have better time complexities than an AST when it comes to most forms of analysis. In this example, we are invoking the Engine::frontend, which is similar in nature to ROSE's global ::frontend in that it does everything necessary to make ROSE ready to do analysis: it parses, loads, disassembles, partitions, runs basic analysis, and checks consistency. In the end, it returns an abstract syntax tree (AST) instead of a Partitioner.

AST nodes in ROSE all begin with the letters "Sg" (from "Sage"). Those nodes that are specific to binary analysis begin with "SgAsm" (from "Sage assembly"). Within source code analysis, source files combine to form a single SgProject node, and the same is often true with binary analysis also. For specimens that have a container, like Linux ELF and Windows PE files, the AST has two large subtrees: one subtree describes the containers (there may be more than one), and another subtree describes the instructions organized into interpretation subtrees (SgAsmInterpretation). Interpretations organize instructions into coherent sets, such as the PE32 and DOS subcomponents of a Windows PE file. However not all specimen formats have containers, in which case the AST is abbreviated and contains only the instructions rooted at a "global block" (SgAsmBlock).

unparser.unparse(std::cout, gblock);
Unparses binary AST into text.
virtual size_t unparse(std::ostream &, SgNode *ast)
Unparse part of the AST.

Now that an abstract syntax tree has been created we can "unparse" it. Unparsing in a binary is slightly different than unparsing source code. Binary specimens are usually not transformed like source code, so the unparser for a binary generates a human-readable pseudo-assembly listing instead. We could have also called the global ::backend (same as source code) which would have produced a number of output files including a new executable, but backend only works on SgProject nodes which might not be present (see above). Note that there are better ways to obtain an unparser, and we'll see them later.

Here's the entire "Hello, World!" program:

1
2#include <rose.h>
4
6#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
7#include <AsmUnparser.h>
9
10int
11main(int argc, char *argv[]) {
13 ROSE_INITIALIZE; // see Rose::initialize
14 std::string purpose = "disassembles a binary specimen";
15 std::string description =
16 "This tool disassembles the specified file and presents the results "
17 "as a pseudo assembly listing, a listing intended for human consumption "
18 "rather than assembling. This implementation serves as the \"Hello, "
19 "World!\" example for binary analysis, so let's keep it simple!";
21
24 SgAsmBlock *gblock = engine->frontend(argc, argv, purpose, description);
26
29 unparser.unparse(std::cout, gblock);
31}

Compiling programs that use ROSE

To compile this program you'll want to use the same compiler, preprocessor switches, compiler switches, and loader switches as are used when compiling ROSE. This is important! Since C++ has no official, documented application binary interface, mixing libraries compiled with different compilers, compiler versions, compiler optimization levels, or even other seemingly benign compiler switches is not guaranteed to work. The easiest way to compile the above example is to change directories to $ROSE_BUILD/tutorial, where $ROSE_BUILD is the top of your build tree, and run make binaryHelloWorld. Similar commands will work for the other examples in this tutorial.

On the other hand, if you've already installed the ROSE library and blown away your build tree, or if you're writing your own programs that use ROSE, you won't be able to leverage ROSE's own build system to compile your program, and you probably have no idea what compiler command was used to compile ROSE. Fear not, ROSE installed a "rose-config.cfg" file that has the information you need, and in a format that can be included directly into GNU make files. See Developing ROSE-based tools for an example Makefile.

Generating a function control flow graph.

This example shows how to load a specimen into analysis space and generate a control flow graph for each function. First we include rose.h and then any additional binary analysis header files:

#include <rose.h>
#include <Rose/Diagnostics.h>
#include <Rose/BinaryAnalysis/Partitioner2/BasicBlock.h>
#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
#include <Rose/BinaryAnalysis/Partitioner2/GraphViz.h>
#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
#include <Sawyer/CommandLine.h>
static const char* purpose = "obtains a function control flow graph from a binary specimen";
static const char* description =
"This tool disassembles the specified file and generates prints a control flow "
"graph to standard output for each function. It is intended as a demo of how to "
"obtain a function CFG; there are better ways to print graphs.";
using namespace Rose::Diagnostics;
using namespace Rose::BinaryAnalysis;
Binary analysis.
Controls diagnostic messages from ROSE.

Then, in the main program, we create a disassembling and partitioning engine and use it to parse the command-line. This allows our tool to easily support the multitude of partitioner settings and container formats. The command-line parser and its documentation is easily customizable for more advanced tools, but the following works fine for this tool:

ROSE_INITIALIZE; // see Rose::initialize
std::vector<std::string> specimen = engine->parseCommandLine(argc, argv, purpose, description).unreachedArgs();
if (specimen.empty()) {
mlog[FATAL] <<"no binary specimen specified; see --help\n";
exit(1);
}

The "mlog[FATAL]" in the previous code uses ROSE's diagnostics facility, a large set of coordinated streams for different software components and message severity levels, all of which can be controlled from the command-line.

Next, now that the engine is configured and we know the name(s) or resources for the specimen, we can parse the specimen container, load the specimen into memory, disassemble, and discover functions. The results are stored in a Partitioner object:

Partitioner2::Partitioner::Ptr partitioner = engine->partition(specimen);

Finally, we'll iterate over those functions to create function control flow graphs and print each graph. The cfg method returns a const reference to a global control flow graph, which normally serves as the starting point for many analyses. But in this case, we want only the subgraph that represents an individual function, so we copy the global CFG and then remove those vertices and edges that are uninteresting:

for (const Partitioner2::Function::Ptr &function : partitioner->functions()) {
// global control flow graph
Partitioner2::ControlFlowGraph cfg = partitioner->cfg();
// Erase all vertices that don't belong to the function of interest, and their incident edges
Partitioner2::ControlFlowGraph::VertexIterator vi = cfg.vertices().begin();
while (vi != cfg.vertices().end()) {
if (!vi->value().isOwningFunction(function)) {
cfg.eraseVertex(vi++);
} else {
++vi;
}
}
// Print the results
std::cout <<"CFG for " <<function->printableName() <<"\n"
<<" Vertices:\n";
for (const Partitioner2::ControlFlowGraph::Vertex &v : cfg.vertices())
std::cout <<" " <<partitioner->vertexName(v) <<"\n";
std::cout <<" Edges:\n";
for (const Partitioner2::ControlFlowGraph::Edge &e : cfg.edges())
std::cout <<" " <<partitioner->edgeName(e) <<"\n";
}

In the above code, the Partitioner2::ControlFlowGraph is a specialization of Sawyer::Graph, which is a generic yet easy-to-use graph implementation with very good time complexity for most operations. So the previous "for" loop has linear time complexity with respect to the total number of vertices and edges.

The Partitioner2::DataFlow namespace has a buldDfCfg function that creates a slightly different kind of control flow graph–one that's more useful for data-flow–which also permits intra- and inter-procedural analysis based on user criteria.

Here's the entire program:

1
2#include <rose.h>
3
4#include <Rose/Diagnostics.h>
5#include <Rose/BinaryAnalysis/Partitioner2/BasicBlock.h>
6#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
7#include <Rose/BinaryAnalysis/Partitioner2/GraphViz.h>
8#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
9#include <Sawyer/CommandLine.h>
10
11static const char* purpose = "obtains a function control flow graph from a binary specimen";
12static const char* description =
13 "This tool disassembles the specified file and generates prints a control flow "
14 "graph to standard output for each function. It is intended as a demo of how to "
15 "obtain a function CFG; there are better ways to print graphs.";
16
17using namespace Rose::Diagnostics;
18using namespace Rose::BinaryAnalysis;
20
21int
22main(int argc, char *argv[]) {
24 ROSE_INITIALIZE; // see Rose::initialize
26 std::vector<std::string> specimen = engine->parseCommandLine(argc, argv, purpose, description).unreachedArgs();
27 if (specimen.empty()) {
28 mlog[FATAL] <<"no binary specimen specified; see --help\n";
29 exit(1);
30 }
32
34 Partitioner2::Partitioner::Ptr partitioner = engine->partition(specimen);
36
38 for (const Partitioner2::Function::Ptr &function : partitioner->functions()) {
39 // global control flow graph
40 Partitioner2::ControlFlowGraph cfg = partitioner->cfg();
41
42 // Erase all vertices that don't belong to the function of interest, and their incident edges
43 Partitioner2::ControlFlowGraph::VertexIterator vi = cfg.vertices().begin();
44 while (vi != cfg.vertices().end()) {
45 if (!vi->value().isOwningFunction(function)) {
46 cfg.eraseVertex(vi++);
47 } else {
48 ++vi;
49 }
50 }
51
52 // Print the results
53 std::cout <<"CFG for " <<function->printableName() <<"\n"
54 <<" Vertices:\n";
55 for (const Partitioner2::ControlFlowGraph::Vertex &v : cfg.vertices())
56 std::cout <<" " <<partitioner->vertexName(v) <<"\n";
57 std::cout <<" Edges:\n";
58 for (const Partitioner2::ControlFlowGraph::Edge &e : cfg.edges())
59 std::cout <<" " <<partitioner->edgeName(e) <<"\n";
60 }
62}

Generating a binary function call graph in GraphViz format.

This example is similar to the Hello, World! example, but demonstrates how to analyze function call information to construct a function call graph (CG) and then emit that graph as a GraphViz file. The output can be converted to a picture with the "dot" command, or visualized interactively with ZGRViewer.

#include <rose.h>
#include <Rose/Diagnostics.h>
#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
#include <Rose/BinaryAnalysis/Partitioner2/GraphViz.h>
#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
#include <Sawyer/CommandLine.h>

As before, the "rose.h" header is the first of the ROSE headers to be included, followed by the binary analysis headers in any order.

using namespace Rose;
using namespace Rose::Diagnostics;
using namespace Rose::BinaryAnalysis;
The ROSE library.

This time we'll use a few namespaces to reduce our typing. The Rose::Diagnostics namespace brings into scope the diagnostic support (mlog and FATAL in this example) which is accessed through the "--log" command-line switches and controls what kinds of diagnostic output is produced. If you run this program with "--log=all" you'll get traces and debugging from all the ROSE components that use this mechanism (lots of output); see "--log=help" for info about fine tuning this.

struct Settings {
std::string outputName;
Settings()
: outputName("cg.dot") {}
};

Many of the binary analysis tools find that holding all command-line settings in a single struct is a convenient way to organize things. This example only has one tool-specific setting–the name of the output file for the call graph.

static std::vector<std::string>
parseCommandLine(int argc, char *argv[], Partitioner2::Engine &engine, Settings &settings) {
using namespace Sawyer::CommandLine;
std::string purpose = "obtains a function call graph from a binary specimen";
std::string description =
"This tool disassembles the specified file and generates a function call "
"graph named \"cg.dot\" in the current working directory. The dot file "
"can be processed with GraphViz commands or viewed directly with ZGRViewer.";
Parser parser = engine.commandLineParser(purpose, description);
SwitchGroup tool("Tool switches");
tool.insert(Switch("output", 'O')
.argument("filename", anyParser(settings.outputName))
.doc("Specifies the name of the call graph that is generated by "
"this tool. The default is \"" +
StringUtility::cEscape(settings.outputName) + "\""));
return parser.with(tool).parse(argc, argv).apply().unreachedArgs();
}
Base class for engines driving the partitioner.
virtual Sawyer::CommandLine::Parser commandLineParser(const std::string &purpose, const std::string &description)
Creates a command-line parser.
const ParserResult & apply() const
Saves parsed values in switch-specified locations.
std::vector< std::string > unreachedArgs() const
Returns program arguments that were not reached during parsing.
The parser for a program command line.
Parser & with(const SwitchGroup &sg)
Add switch declarations.
ParserResult parse(int argc, char *argv[])
Parse program arguments.
A collection of related switch declarations.
Describes one command-line switch.
Switch & doc(const std::string &s)
Property: detailed description.
ROSE_UTIL_API std::string cEscape(const std::string &, char context='"')
Escapes characters that are special to C/C++.
Parses program command line switches and arguments.

The Hello, World! example showed the simplest way to parse a command-line, but this time we'll do something a little more complex. This time we want to augment the command-line parser so it knows about the switches that this tool supports. The parser itself comes from the Sawyer library, part of which is distributed along with the ROSE source code. We obtain the default parser from the disassembly and partitioning engine, and augment it with a switch group containing the switches specific to this tool. Our "--output" or "-O" switch takes a single argument, a file name that can be any string. The "doc" property is the documentation for our switch and will appear in the Unix man page produced by running with "--help". Finally, we invoke the parser with our additional switch group on the argument list in argv. If the parsing is successful we apply the results to our settings and then return the rest of the command line (probably information about the specimen).

ROSE_INITIALIZE;
Settings settings;
std::vector<std::string> specimen = parseCommandLine(argc, argv, *engine, settings);
if (specimen.empty()) {
mlog[FATAL] <<"no binary specimen specified; see --help\n";
exit(1);
}

In the main program we initialize our settings and instantiate a disassembly/partitioning engine, then parse the command-line and get the list of non-switch arguments. If there are none, give the user some help; this is often an indication that the user invoked the command with no arguments at all in order to get an error message that hopefully contains some usage hints.

Partitioner2::Partitioner::Ptr partitioner = engine->partition(specimen);

Instead of calling engine.frontend, this time we call engine.partition in order to get access to the partitioning analysis results. No AST is created in this case, although we could get one if we wanted by querying the engine for it. The data structures used by the partitioner are much more efficiently tuned for analysis than an AST, so we'll stick with the partitioner.

The partitioner knows how to construct a call graph from its internal data structures. The FunctionCallGraph class can also be manipulated directly. Now's a good time to point out that many binary analysis data structures use pointers to shared objects. The objects are reference counted and deleted automatically. Classes that are used in this way do not have public constructors, but rather instance methods (and sometimes additional factories as well) that allocate and initialize the object and return a smart pointer. In this example, if we were to delete the partitioner or engine the callgraph would still point to valid functions, which still point to valid instructions, etc.

std::ofstream output(settings.outputName.c_str());
Partitioner2::GraphViz::CgEmitter emitter(partitioner, callgraph);
emitter.emitCallGraph(output);

Finally we can emit the call graph as a GraphViz file. This is done through a CgEmitter that specializes a more general GraphViz emitter. The emitter API allows you to fine tune the output by adjusting colors and other GraphViz edge and vertex properties.

Here's the full listing. Compile it using the same instructions as for the Hello, World! example.

1
2#include <rose.h>
3
4#include <Rose/Diagnostics.h>
5#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
6#include <Rose/BinaryAnalysis/Partitioner2/GraphViz.h>
7#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
8#include <Sawyer/CommandLine.h>
10
12using namespace Rose;
13using namespace Rose::Diagnostics;
14using namespace Rose::BinaryAnalysis;
16
18struct Settings {
19 std::string outputName;
20
21 Settings()
22 : outputName("cg.dot") {}
23};
25
27static std::vector<std::string>
28parseCommandLine(int argc, char *argv[], Partitioner2::Engine &engine, Settings &settings) {
29 using namespace Sawyer::CommandLine;
30
31 std::string purpose = "obtains a function call graph from a binary specimen";
32 std::string description =
33 "This tool disassembles the specified file and generates a function call "
34 "graph named \"cg.dot\" in the current working directory. The dot file "
35 "can be processed with GraphViz commands or viewed directly with ZGRViewer.";
36
37 Parser parser = engine.commandLineParser(purpose, description);
38
39 SwitchGroup tool("Tool switches");
40 tool.insert(Switch("output", 'O')
41 .argument("filename", anyParser(settings.outputName))
42 .doc("Specifies the name of the call graph that is generated by "
43 "this tool. The default is \"" +
44 StringUtility::cEscape(settings.outputName) + "\""));
45
46 return parser.with(tool).parse(argc, argv).apply().unreachedArgs();
47}
49
50int
51main(int argc, char *argv[]) {
53 ROSE_INITIALIZE;
54 Settings settings;
56 std::vector<std::string> specimen = parseCommandLine(argc, argv, *engine, settings);
57 if (specimen.empty()) {
58 mlog[FATAL] <<"no binary specimen specified; see --help\n";
59 exit(1);
60 }
62
64 Partitioner2::Partitioner::Ptr partitioner = engine->partition(specimen);
66
68 Partitioner2::FunctionCallGraph callgraph = partitioner->functionCallGraph(Partitioner2::AllowParallelEdges::NO);
70
72 std::ofstream output(settings.outputName.c_str());
73 Partitioner2::GraphViz::CgEmitter emitter(partitioner, callgraph);
74 emitter.emitCallGraph(output);
76}
AnyParser< T >::Ptr anyParser(T &storage)
Factory for value parsers.
ROSE_DLL_API Sawyer::Message::Facility mlog
Diagnostic facility for the ROSE library as a whole.
Settings settings
Command-line settings for the rosebud tool.
@ FATAL
Messages that indicate an abnormal situation from which the program was unable to recover.
Definition Message.h:332

Finding static strings in a binary specimen

This example parses and disassembles a binary specimen and search for all static strings, similar to the Unix "strings" command. This simple example can be a starting point for a more in depth strings analysis than what's possible with the Unix command.

#include <rose.h>
#include <Rose/BinaryAnalysis/Architecture/Base.h>
#include <Rose/BinaryAnalysis/Disassembler/Base.h>
#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
#include <Rose/BinaryAnalysis/String.h>
#include <boost/foreach.hpp>
using namespace Rose;
using namespace Rose::BinaryAnalysis;

Include headers. The "rose.h" must always be before other ROSE headers.

ROSE_INITIALIZE; // see Rose::initialize
std::string purpose = "finds static strings in a binary specimen";
std::string description =
"This tool disassembles a binary specimen and then scans the "
"read-only parts of memory to find static strings. It looks for "
"C-style NUL-termianted printable ASCII strings, zero-terminated "
"UTF-16 little-endian strings, two-byte little-endian length-encoded "
"ASCII strings, and some other common formats.";
std::vector<std::string> specimen =
engine->parseCommandLine(argc, argv, purpose, description).unreachedArgs();

Yet another way to parse a command-line. This time we're trying to avoid calling Partitioner2::Engine::frontend because we don't actually need to disassemble or partition anything–string searching operates on raw memory rather than instructions, so we can save some time.

MemoryMap::Ptr map = engine->loadSpecimens(specimen);
ByteOrder::Endianness sex = engine->architecture()->byteOrder();

The engine.loadSpecimens method parses the speicmen container if present (e.g., Linux ELF) and determines how the specimen should be mapped into memory. The result is a MemoryMap that describes memory segments with addresses, permissions, names, etc. Try inserting a call to MemoryMap::dump here to get an idea of what it contains.

The byte order will be needed by the string decoder in order to know how to decode the multi-byte length fields. We could have gotten this same information any number of other ways, but this is the most convenient in this situation. Note that knowledge of the byte order depends upon knowledge of the specimen architecture even though we don't actually need to disassemble anything.

Strings::StringFinder finder; // the string analyzer
finder.settings().minLength = 5; // no strings shorter than 5 characters
finder.settings().maxLength = 8192; // no strings longer than 8k characters
finder.insertCommonEncoders(sex); // match common encodings of strings
finder.find(map->require(MemoryMap::READABLE).prohibit(MemoryMap::WRITABLE));
Analysis to find encoded strings.
Definition String.h:806
StringFinder & insertCommonEncoders(ByteOrder::Endianness)
Inserts common encodings.
const Settings & settings() const
Property: Analysis settings often set from a command-line.
Definition String.h:866
StringFinder & find(const MemoryMap::ConstConstraints &, Sawyer::Container::MatchFlags flags=0)
Finds strings by searching memory.
AddressMapConstraints< const AddressMap > require(unsigned x) const
Constraint: required access bits.
size_t maxLength
Maximum length of matched strings.
Definition String.h:822
size_t minLength
Minimum length of matched strings.
Definition String.h:816

Here we perform the string searching. Most binary analysis algorithms are packaged into a class. The idea is that one instantiates the analyzer, configures it, calls some method to perform the analysis (here it's find), and then queries the results.

The MemoryMap::require and MemoryMap::prohibit methods are a form of filtering. They're filtering the memory map so that the string analyzer only sees memory that's readable but not writable.

// Output, or just do "std::cout <<finder" if you're not picky.
for (const Strings::EncodedString &string : finder.strings()) {
std::cout <<"string at " <<string.address() <<" for " <<string.size() <<" bytes\n";
std::cout <<"encoding: " <<string.encoder()->name() <<"\n";
std::cout <<"narrow value: \"" <<StringUtility::cEscape(string.narrow()) <<"\"\n";
}

Generating the output is a matter of iterating over the strings that were found and printing some information. Most analyzer objects also know how to print themselves although those defaults are not always suitable for a polished tool.

Here's the entire program:

1
2#include <rose.h>
3
4#include <Rose/BinaryAnalysis/Architecture/Base.h>
5#include <Rose/BinaryAnalysis/Disassembler/Base.h>
6#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
7#include <Rose/BinaryAnalysis/String.h>
8
9#include <boost/foreach.hpp>
10
11using namespace Rose;
12using namespace Rose::BinaryAnalysis;
14
15int
16main(int argc, char *argv[]) {
18 ROSE_INITIALIZE; // see Rose::initialize
19 std::string purpose = "finds static strings in a binary specimen";
20 std::string description =
21 "This tool disassembles a binary specimen and then scans the "
22 "read-only parts of memory to find static strings. It looks for "
23 "C-style NUL-termianted printable ASCII strings, zero-terminated "
24 "UTF-16 little-endian strings, two-byte little-endian length-encoded "
25 "ASCII strings, and some other common formats.";
26
28 std::vector<std::string> specimen =
29 engine->parseCommandLine(argc, argv, purpose, description).unreachedArgs();
31
33 MemoryMap::Ptr map = engine->loadSpecimens(specimen);
34 ByteOrder::Endianness sex = engine->architecture()->byteOrder();
36
38 Strings::StringFinder finder; // the string analyzer
39 finder.settings().minLength = 5; // no strings shorter than 5 characters
40 finder.settings().maxLength = 8192; // no strings longer than 8k characters
41 finder.insertCommonEncoders(sex); // match common encodings of strings
42 finder.find(map->require(MemoryMap::READABLE).prohibit(MemoryMap::WRITABLE));
44
46 // Output, or just do "std::cout <<finder" if you're not picky.
47 for (const Strings::EncodedString &string : finder.strings()) {
48 std::cout <<"string at " <<string.address() <<" for " <<string.size() <<" bytes\n";
49 std::cout <<"encoding: " <<string.encoder()->name() <<"\n";
50 std::cout <<"narrow value: \"" <<StringUtility::cEscape(string.narrow()) <<"\"\n";
51 }
53}

Graph dominators and post dominators

Loosely speaking, the dominator of a vertex of a graph is another vertex that's visited on every path from some starting vertex to the vertex in question. Most of the time, we're interested in an immediate dominator, which is the closest dominator to the vertex in question. A more rigorous definition can be found in Wikipedia, among other places. Post-dominators are similar in that a post dominator of a vertex is some other vertex that's visited on every path from the vertex in question to some ending vertex.

ROSE uses the Sawyer Graph API for all binary analysis graphs, and Sawyer has functions for calculating dominators and post-dominators. The following example is a tool that finds the dominator for each vertex in the control flow graph of each function.

static const char *programPurpose = "prints dominators for all function CFGs";
static const char *programDescription =
"The purpose of this tool is to demonstrate how to calculate dominators (and by extension, post-dominators) for all "
"vertices of a function control flow graph (and by extension, any Sawyer graph). It parses, loads, disassembles, and "
"partitions a binary specimen and then iterates over the functions. For each function, it obtains a function CFG, "
"calculates the immediate dominator for each vertex, and shows the results.";
#include <rose.h>
#include <Rose/BinaryAnalysis/Partitioner2/BasicBlock.h>
#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
#include <Sawyer/GraphAlgorithm.h>
Binary function detection.

The first step, above, is to include the appropriate declarations. Our convention for writing tools is to describe the tool with a couple of strings that appear at the very top of the source code. These strings are used later in the program to generate the documentation for the "--help" output.

This tool is so simple that everything else is in "main". First we initialize things:

int
main(int argc, char *argv[])
{
ROSE_INITIALIZE;
P2::Engine::Ptr engine = P2::EngineBinary::instance();
P2::Engine::Settings &settings = engine->settings();
settings.partitioner.doingPostAnalysis = false; // not needed for this tool, and faster without
settings.partitioner.namingSyscalls = false; // for consistent results w.r.t. the answer file since the system...
settings.partitioner.syscallHeader = "/dev/null"; // ...call mapping comes from run-time files.
std::vector<std::string> specimen = engine->parseCommandLine(argc, argv, programPurpose, programDescription).unreachedArgs();
P2::Partitioner::Ptr partitioner = engine->partition(specimen);

The ROSE_INITIALIZE macro is always the first ROSE-related statement and checks that this tool was compiled with header files that are binary compatible with the ROSE library. Then we create a partitioning engine to parse the command-line to get the specimen resources. The specimen is some combination of a ELF or PE file name, raw memory dumps, S-Records, and/or process IDs. The engine.partition call does all the hard work of parsing containers, loading data into the virtual address space, decoding CPU instructions, and creating basic blocks and functions. It returns a partitioner object that stores all the results, which include a global control flow graph.

Then we iterate over the functions:

for (P2::Function::Ptr function : partitioner->functions()) {
std::cout <<"CFG dominators for function " <<Rose::StringUtility::addrToString(function->address()) <<":\n"; // or function->printableName()
ROSE_UTIL_API std::string addrToString(uint64_t value, size_t nbits=0)
Convert a virtual address to a string.

We want to create a function CFG, but all we have from the partitioner is a global CFG. We can use the fact that a function CFG is a subset of the global CFG and therefore create the function CFG by copying the global CFG and removing all vertices and edges that are not part of the function CFG. Although this isn't the most efficient way to create function CFGs in a loop over all functions, it is very simple.

We'll need to know the entry vertex of the function's CFG:

P2::ControlFlowGraph cfg = partitioner->cfg();
P2::ControlFlowGraph::VertexIterator entryVertex = cfg.findVertex(partitioner->findPlaceholder(function->address())->id());

We found the entry vertex in the function CFG by first finding it in the global CFG and then using a feature of Sawyer graphs, namely, when a graph is copied its vertices and edges have the same IDs as those in the source graph, and we can obtain a vertex pointer (iterator) in constant time if we know its ID. We have to convert the ID to a vertex pointer before removing the non-function vertices because IDs are not stable across erase operations, but iterators are.

Next we erase the non-function vertices and edges:

P2::ControlFlowGraph::VertexIterator vi = cfg.vertices().begin();
while (vi != cfg.vertices().end()) {
if (!vi->value().isOwningFunction(function)) { // a basic block can belong to multiple functions
ASSERT_forbid(vi == entryVertex);
cfg.eraseVertex(vi++);
} else {
++vi;
}
}

This is the simple way to build a function CFG from a global CFG. It's important to use a post-increment in the eraseVertex call. Since we're inside a loop iterating over every function, a more efficient implementation would have created all the function CFGs in a single pass over the global CFG. As of June 2017, ROSE does not have functions for returning function CFGs – it only has a global CFG – because there are many details that need to be considered depending on the situation (e.g., function calls and returns are two cases that typically need massaging in a function CFG depending on the purpose of the CFG).

Finally, we get to the real meat of this example: finding the immediate dominator for every vertex in the function CFG given the CFG's entry vertex:

std::vector<P2::ControlFlowGraph::VertexIterator> idoms = Sawyer::Container::Algorithm::graphDominators(cfg, entryVertex);
std::vector< typename GraphTraits< Graph >::VertexIterator > graphDominators(Graph &g, typename GraphTraits< Graph >::VertexIterator root)
Find immediate pre-dominators.

The graphDominators function can handle any type of Sawyer graph, thus we could also pass a function call graph or a data-flow graph.

The return value is a vector indexed by vertex ID, whose values point to either the corresponding immediate dominator or the vertex end iterator (not all vertices have an immediate dominator). Using a vector indexed by vertex ID is an idiom used throughout ROSE whenever we need to associated extra data with an existing graph. Since a vertex pointer (iterator) can be converted to a vertex ID in constant time, and indexing into the vector is constant time, we can always find the extra data in constant time. And since vertex IDs are consecutive integers beginning at zero, this is also a space-efficient way to represent data that's present for most, if not all, vertices.

Finally, let us iterate over the results to print them:

for (size_t i=0; i<idoms.size(); ++i) {
std::cout <<" function " <<Rose::StringUtility::addrToString(function->address())
<<" dominator of " <<P2::Partitioner::vertexName(*cfg.findVertex(i))
<<" is ";
if (cfg.isValidVertex(idoms[i])) {
std::cout <<P2::Partitioner::vertexName(*idoms[i]) <<"\n";
} else {
std::cout <<"none\n";
}
}

Since vector indexes are equivalent to vertex IDs, we can obtain a vertex with a constant-time findVertex call and a constant-time iterator dereference. Since this is a CFG from the partitioner (or at least a copy thereof), we can use the vertexName static method to print some info about it. Using the static method (that takes a vertex value) is important; the non-static method (that takes a vertex iterator) will think that we're handing it a pointer to some vertex in the partitioner's own global CFG.

Here's the entire program:

1// Extensive documentation for this program is in the Binary Analysis Tutorial in doxygen.
2// Do not change the output format -- this tool is also used by tests/nonsmoke/functional/BinaryAnalysis
3
5static const char *programPurpose = "prints dominators for all function CFGs";
6static const char *programDescription =
7 "The purpose of this tool is to demonstrate how to calculate dominators (and by extension, post-dominators) for all "
8 "vertices of a function control flow graph (and by extension, any Sawyer graph). It parses, loads, disassembles, and "
9 "partitions a binary specimen and then iterates over the functions. For each function, it obtains a function CFG, "
10 "calculates the immediate dominator for each vertex, and shows the results.";
11
12#include <rose.h>
13#include <Rose/BinaryAnalysis/Partitioner2/BasicBlock.h>
14#include <Rose/BinaryAnalysis/Partitioner2/EngineBinary.h>
15#include <Rose/BinaryAnalysis/Partitioner2/Partitioner.h>
16#include <Sawyer/GraphAlgorithm.h>
17
20
22int
23main(int argc, char *argv[])
24{
25 ROSE_INITIALIZE;
26 P2::Engine::Ptr engine = P2::EngineBinary::instance();
27 P2::Engine::Settings &settings = engine->settings();
28 settings.partitioner.doingPostAnalysis = false; // not needed for this tool, and faster without
29 settings.partitioner.namingSyscalls = false; // for consistent results w.r.t. the answer file since the system...
30 settings.partitioner.syscallHeader = "/dev/null"; // ...call mapping comes from run-time files.
31 std::vector<std::string> specimen = engine->parseCommandLine(argc, argv, programPurpose, programDescription).unreachedArgs();
32 P2::Partitioner::Ptr partitioner = engine->partition(specimen);
34
36 for (P2::Function::Ptr function : partitioner->functions()) {
37 std::cout <<"CFG dominators for function " <<Rose::StringUtility::addrToString(function->address()) <<":\n"; // or function->printableName()
39
41 P2::ControlFlowGraph cfg = partitioner->cfg();
42 P2::ControlFlowGraph::VertexIterator entryVertex = cfg.findVertex(partitioner->findPlaceholder(function->address())->id());
44 ASSERT_require(cfg.isValidVertex(entryVertex));
45
47 P2::ControlFlowGraph::VertexIterator vi = cfg.vertices().begin();
48 while (vi != cfg.vertices().end()) {
49 if (!vi->value().isOwningFunction(function)) { // a basic block can belong to multiple functions
50 ASSERT_forbid(vi == entryVertex);
51 cfg.eraseVertex(vi++);
52 } else {
53 ++vi;
54 }
55 }
57
59 std::vector<P2::ControlFlowGraph::VertexIterator> idoms = Sawyer::Container::Algorithm::graphDominators(cfg, entryVertex);
61 ASSERT_require(idoms.size() == cfg.nVertices());
62
64 for (size_t i=0; i<idoms.size(); ++i) {
65 std::cout <<" function " <<Rose::StringUtility::addrToString(function->address())
66 <<" dominator of " <<P2::Partitioner::vertexName(*cfg.findVertex(i))
67 <<" is ";
68 if (cfg.isValidVertex(idoms[i])) {
69 std::cout <<P2::Partitioner::vertexName(*idoms[i]) <<"\n";
70 } else {
71 std::cout <<"none\n";
72 }
73 }
75 }
76}

Next steps

Most binary analysis capabilities are documented in the Rose::BinaryAnalysis namespace. The test directory, "$ROSE_SOURCE/tests/nonsmoke/functional/roseTests/BinaryTests", also has many examples, some of which are slightly hard to read since there main purpose is to test rather than demonstrate. The "$ROSE_SOURCE/projects/BinaryAnalysisTools" directory (as well as some other project directories) has real programs that use the binary analysis interface.