ROSE 0.11.145.192
|
Getting started guide for binary analysis.
See also Rose::BinaryAnalysis.
The rest of this document contains older examples that possibly use older approaches and APIs. Use with care.
This example shows how to load a specimen into analysis space and generate a control flow graph for each function. First we include rose.h
and then any additional binary analysis header files:
Then, in the main
program, we create a disassembling and partitioning engine and use it to parse the command-line. This allows our tool to easily support the multitude of partitioner settings and container formats. The command-line parser and its documentation is easily customizable for more advanced tools, but the following works fine for this tool:
The "mlog[FATAL]" in the previous code uses ROSE's diagnostics facility, a large set of coordinated streams for different software components and message severity levels, all of which can be controlled from the command-line.
Next, now that the engine is configured and we know the name(s) or resources for the specimen, we can parse the specimen container, load the specimen into memory, disassemble, and discover functions. The results are stored in a Partitioner object:
Finally, we'll iterate over those functions to create function control flow graphs and print each graph. The cfg method returns a const reference to a global control flow graph, which normally serves as the starting point for many analyses. But in this case, we want only the subgraph that represents an individual function, so we copy the global CFG and then remove those vertices and edges that are uninteresting:
In the above code, the Partitioner2::ControlFlowGraph is a specialization of Sawyer::Graph, which is a generic yet easy-to-use graph implementation with very good time complexity for most operations. So the previous "for" loop has linear time complexity with respect to the total number of vertices and edges.
The Partitioner2::DataFlow namespace has a buldDfCfg function that creates a slightly different kind of control flow graph–one that's more useful for data-flow–which also permits intra- and inter-procedural analysis based on user criteria.
Here's the entire program:
This example is similar to the bintut_helloworld example, but demonstrates how to analyze function call information to construct a function call graph (CG) and then emit that graph as a GraphViz file. The output can be converted to a picture with the "dot" command, or visualized interactively with ZGRViewer.
As before, the "rose.h" header is the first of the ROSE headers to be included, followed by the binary analysis headers in any order.
This time we'll use a few namespaces to reduce our typing. The Rose::Diagnostics namespace brings into scope the diagnostic support (mlog
and FATAL
in this example) which is accessed through the "--log" command-line switches and controls what kinds of diagnostic output is produced. If you run this program with "--log=all" you'll get traces and debugging from all the ROSE components that use this mechanism (lots of output); see "--log=help" for info about fine tuning this.
Many of the binary analysis tools find that holding all command-line settings in a single struct is a convenient way to organize things. This example only has one tool-specific setting–the name of the output file for the call graph.
The bintut_helloworld example showed the simplest way to parse a command-line, but this time we'll do something a little more complex. This time we want to augment the command-line parser so it knows about the switches that this tool supports. The parser itself comes from the Sawyer library, part of which is distributed along with the ROSE source code. We obtain the default parser from the disassembly and partitioning engine, and augment it with a switch group containing the switches specific to this tool. Our "--output" or "-O" switch takes a single argument, a file name that can be any string. The "doc" property is the documentation for our switch and will appear in the Unix man page produced by running with "--help". Finally, we invoke the parser with our additional switch group on the argument list in argv
. If the parsing is successful we apply the results to our settings
and then return the rest of the command line (probably information about the specimen).
In the main
program we initialize our settings
and instantiate a disassembly/partitioning engine, then parse the command-line and get the list of non-switch arguments. If there are none, give the user some help; this is often an indication that the user invoked the command with no arguments at all in order to get an error message that hopefully contains some usage hints.
Instead of calling engine.frontend, this time we call engine.partition in order to get access to the partitioning analysis results. No AST is created in this case, although we could get one if we wanted by querying the engine for it. The data structures used by the partitioner are much more efficiently tuned for analysis than an AST, so we'll stick with the partitioner.
The partitioner knows how to construct a call graph from its internal data structures. The FunctionCallGraph class can also be manipulated directly. Now's a good time to point out that many binary analysis data structures use pointers to shared objects. The objects are reference counted and deleted automatically. Classes that are used in this way do not have public constructors, but rather instance
methods (and sometimes additional factories as well) that allocate and initialize the object and return a smart pointer. In this example, if we were to delete the partitioner
or engine
the callgraph
would still point to valid functions, which still point to valid instructions, etc.
Finally we can emit the call graph as a GraphViz file. This is done through a CgEmitter that specializes a more general GraphViz emitter. The emitter API allows you to fine tune the output by adjusting colors and other GraphViz edge and vertex properties.
Here's the full listing. Compile it using the same instructions as for the bintut_helloworld example.
This example parses and disassembles a binary specimen and search for all static strings, similar to the Unix "strings" command. This simple example can be a starting point for a more in depth strings analysis than what's possible with the Unix command.
Include headers. The "rose.h" must always be before other ROSE headers.
Yet another way to parse a command-line. This time we're trying to avoid calling Partitioner2::Engine::frontend because we don't actually need to disassemble or partition anything–string searching operates on raw memory rather than instructions, so we can save some time.
The engine.loadSpecimens method parses the speicmen container if present (e.g., Linux ELF) and determines how the specimen should be mapped into memory. The result is a MemoryMap that describes memory segments with addresses, permissions, names, etc. Try inserting a call to MemoryMap::dump here to get an idea of what it contains.
The byte order will be needed by the string decoder in order to know how to decode the multi-byte length fields. We could have gotten this same information any number of other ways, but this is the most convenient in this situation. Note that knowledge of the byte order depends upon knowledge of the specimen architecture even though we don't actually need to disassemble anything.
Here we perform the string searching. Most binary analysis algorithms are packaged into a class. The idea is that one instantiates the analyzer, configures it, calls some method to perform the analysis (here it's find), and then queries the results.
The MemoryMap::require and MemoryMap::prohibit methods are a form of filtering. They're filtering the memory map so that the string analyzer only sees memory that's readable but not writable.
Generating the output is a matter of iterating over the strings that were found and printing some information. Most analyzer objects also know how to print themselves although those defaults are not always suitable for a polished tool.
Here's the entire program:
Loosely speaking, the dominator of a vertex of a graph is another vertex that's visited on every path from some starting vertex to the vertex in question. Most of the time, we're interested in an immediate dominator, which is the closest dominator to the vertex in question. A more rigorous definition can be found in Wikipedia, among other places. Post-dominators are similar in that a post dominator of a vertex is some other vertex that's visited on every path from the vertex in question to some ending vertex.
ROSE uses the Sawyer Graph API for all binary analysis graphs, and Sawyer has functions for calculating dominators and post-dominators. The following example is a tool that finds the dominator for each vertex in the control flow graph of each function.
The first step, above, is to include the appropriate declarations. Our convention for writing tools is to describe the tool with a couple of strings that appear at the very top of the source code. These strings are used later in the program to generate the documentation for the "--help" output.
This tool is so simple that everything else is in "main". First we initialize things:
The ROSE_INITIALIZE macro is always the first ROSE-related statement and checks that this tool was compiled with header files that are binary compatible with the ROSE library. Then we create a partitioning engine to parse the command-line to get the specimen resources. The specimen is some combination of a ELF or PE file name, raw memory dumps, S-Records, and/or process IDs. The engine.partition call does all the hard work of parsing containers, loading data into the virtual address space, decoding CPU instructions, and creating basic blocks and functions. It returns a partitioner object that stores all the results, which include a global control flow graph.
Then we iterate over the functions:
We want to create a function CFG, but all we have from the partitioner is a global CFG. We can use the fact that a function CFG is a subset of the global CFG and therefore create the function CFG by copying the global CFG and removing all vertices and edges that are not part of the function CFG. Although this isn't the most efficient way to create function CFGs in a loop over all functions, it is very simple.
We'll need to know the entry vertex of the function's CFG:
We found the entry vertex in the function CFG by first finding it in the global CFG and then using a feature of Sawyer graphs, namely, when a graph is copied its vertices and edges have the same IDs as those in the source graph, and we can obtain a vertex pointer (iterator) in constant time if we know its ID. We have to convert the ID to a vertex pointer before removing the non-function vertices because IDs are not stable across erase operations, but iterators are.
Next we erase the non-function vertices and edges:
This is the simple way to build a function CFG from a global CFG. It's important to use a post-increment in the eraseVertex
call. Since we're inside a loop iterating over every function, a more efficient implementation would have created all the function CFGs in a single pass over the global CFG. As of June 2017, ROSE does not have functions for returning function CFGs – it only has a global CFG – because there are many details that need to be considered depending on the situation (e.g., function calls and returns are two cases that typically need massaging in a function CFG depending on the purpose of the CFG).
Finally, we get to the real meat of this example: finding the immediate dominator for every vertex in the function CFG given the CFG's entry vertex:
The graphDominators function can handle any type of Sawyer graph, thus we could also pass a function call graph or a data-flow graph.
The return value is a vector indexed by vertex ID, whose values point to either the corresponding immediate dominator or the vertex end iterator (not all vertices have an immediate dominator). Using a vector indexed by vertex ID is an idiom used throughout ROSE whenever we need to associated extra data with an existing graph. Since a vertex pointer (iterator) can be converted to a vertex ID in constant time, and indexing into the vector is constant time, we can always find the extra data in constant time. And since vertex IDs are consecutive integers beginning at zero, this is also a space-efficient way to represent data that's present for most, if not all, vertices.
Finally, let us iterate over the results to print them:
Since vector indexes are equivalent to vertex IDs, we can obtain a vertex with a constant-time findVertex
call and a constant-time iterator dereference. Since this is a CFG from the partitioner (or at least a copy thereof), we can use the vertexName static method to print some info about it. Using the static method (that takes a vertex value) is important; the non-static method (that takes a vertex iterator) will think that we're handing it a pointer to some vertex in the partitioner's own global CFG.
Here's the entire program:
Most binary analysis capabilities are documented in the Rose::BinaryAnalysis namespace. The test directory, "$ROSE_SOURCE/tests/nonsmoke/functional/roseTests/BinaryTests", also has many examples, some of which are slightly hard to read since there main purpose is to test rather than demonstrate. The "$ROSE_SOURCE/projects/BinaryAnalysisTools" directory (as well as some other project directories) has real programs that use the binary analysis interface.