ROSE  0.9.9.109
Rose::BinaryAnalysis::InstructionSemantics2 Namespace Reference

Description

Binary instruction semantics.

Entities in this namespace deal with the semantics of machine instructions, and with the process of "executing" a machine instruction in a particular semantic domain. Instruction "execution" is a very broad term and can refer to execution in the traditional sense where each instruction modifies the machine state (registers and memory) in a particular domain (concrete, interval, sign, symbolic, user-defined). But it can also refer to any kind of analysis that depends on the semantics of individual machine instructions (def-use, taint, etc). It can even refer to the transformation of machine instructions in ROSE's internal representation to some other representation (e.g., ROSE RISC or LLVM) where the other representation is built by "executing" the instruction.

Components of instruction semantics

ROSE's binary semantics framework has two major components: the dispatchers and the semantic domains. The instruction dispatcher "executes" a machine instruction by translating it into a sequence of RISC-like operations, and the semantic domain defines what the RISC operators do (e.g., change a concrete machine state, produce an output listing of RISC operations, build an LLVM representation).
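The split between dispatcher and semantic domain can be illustrated with a minimal, self-contained sketch. None of the names below (ToyOps, ConcreteOps, ListingOps, ToyDispatcher, processInc) are ROSE classes or methods; they are hypothetical stand-ins, and values are plain 32-bit integers purely for brevity. The same dispatcher drives two domains: one changes a concrete register state, the other only records a listing of the RISC operations.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical RISC operator interface: the "semantic domain" side.
struct ToyOps {
    virtual ~ToyOps() {}
    virtual std::uint32_t readRegister(const std::string &name) = 0;
    virtual void writeRegister(const std::string &name, std::uint32_t v) = 0;
    virtual std::uint32_t number(std::uint32_t n) = 0;
    virtual std::uint32_t add(std::uint32_t a, std::uint32_t b) = 0;
};

// A domain that really modifies a (toy) machine state.
struct ConcreteOps : ToyOps {
    std::map<std::string, std::uint32_t> regs;
    std::uint32_t readRegister(const std::string &name) override { return regs[name]; }
    void writeRegister(const std::string &name, std::uint32_t v) override { regs[name] = v; }
    std::uint32_t number(std::uint32_t n) override { return n; }
    std::uint32_t add(std::uint32_t a, std::uint32_t b) override { return a + b; }
};

// A domain that only produces a listing of the operations it was asked to do.
struct ListingOps : ToyOps {
    std::vector<std::string> listing;
    std::uint32_t readRegister(const std::string &name) override {
        listing.push_back("readRegister " + name);
        return 0;
    }
    void writeRegister(const std::string &name, std::uint32_t) override {
        listing.push_back("writeRegister " + name);
    }
    std::uint32_t number(std::uint32_t) override { listing.push_back("number"); return 0; }
    std::uint32_t add(std::uint32_t, std::uint32_t) override { listing.push_back("add"); return 0; }
};

// The dispatcher side: translates one instruction into RISC-like operations
// without knowing which domain will carry them out.
struct ToyDispatcher {
    std::shared_ptr<ToyOps> ops;
    explicit ToyDispatcher(std::shared_ptr<ToyOps> o) : ops(std::move(o)) {}

    // "Executes" a hypothetical "inc <reg>" instruction.
    void processInc(const std::string &reg) {
        std::uint32_t oldValue = ops->readRegister(reg);
        std::uint32_t one = ops->number(1);
        std::uint32_t sum = ops->add(oldValue, one);
        ops->writeRegister(reg, sum);
    }
};
```

Swapping the operators object is the only change needed to turn a concrete execution into a listing; the dispatcher code is untouched.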

ROSE defines one dispatcher class per machine architecture. In this respect, the dispatcher is akin to the microcontroller for a CISC architecture, such as the x86 microcontroller within an x86 CPU. The base class for all dispatchers is BaseSemantics::Dispatcher.

The semantic domain is a loose term that refers to at least three parts taken as a whole: a value type, a machine state type, and the RISC operators. Semantic domains have names like "concrete domain", "interval domain", "sign domain", "symbolic domain", etc. The term is used loosely since one could have different implementations of, say, a "concrete domain" by using different combinations of dispatcher, state, and value type classes. For instance, one concrete domain might use the PartialSymbolicSemantics classes in a concrete way, while another might use custom classes tuned for higher performance. ROSE defines a set of semantic domains, each defined by grouping its three components (value type, machine state type, and RISC operations) into a single namespace or class.

The values of a semantic domain (a.k.a., "svalues") are defined by a class type for that domain. For instance, a concrete domain's value type would likely hold bit vectors of varying sizes. Instances of the value type are used for register contents, memory contents, memory addresses, and temporary values that exist during execution of a machine instruction. Every value has a width measured in bits. For instance, an x86 architecture needs values that are 1, 5, 8, 16, 32, and 64 bits wide (the 1-bit values are for Booleans in the EFLAGS register; the five-bit values are shift counts on a 32-bit architecture; the 64-bit values are needed for integer multiply on a 32-bit architecture; this list is likely not exhaustive). The various kinds of value types form a class hierarchy whose root is BaseSemantics::SValue.
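To make the role of the bit width concrete, here is a small sketch of a width-carrying concrete value. WidthValue and its helper functions are hypothetical and are not part of ROSE; the sketch only assumes that a concrete value is a bit vector of at most 64 bits that is truncated to its declared width.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical concrete value: a bit vector of 1 to 64 bits.
struct WidthValue {
    unsigned width;        // width in bits
    std::uint64_t bits;    // payload, always truncated to 'width' bits
};

// All-ones mask for the given width.
inline std::uint64_t maskFor(unsigned width) {
    assert(width >= 1 && width <= 64);
    return width == 64 ? ~std::uint64_t(0) : ((std::uint64_t(1) << width) - 1);
}

// Construct a value, discarding bits that do not fit in the width.
inline WidthValue number(unsigned width, std::uint64_t v) {
    WidthValue result = { width, v & maskFor(width) };
    return result;
}

// Addition keeps the operand width and wraps around on overflow.
inline WidthValue add(const WidthValue &a, const WidthValue &b) {
    assert(a.width == b.width);
    return number(a.width, a.bits + b.bits);
}

// Unsigned multiply widens: the result has the sum of the operand widths,
// which is why a 32-bit architecture needs 64-bit values.
inline WidthValue unsignedMultiply(const WidthValue &a, const WidthValue &b) {
    assert(a.width + b.width <= 64);
    return number(a.width + b.width, a.bits * b.bits);
}
```

In this sketch a 5-bit shift count is simply number(5, count), so a count of 33 is stored as 1, and multiplying two 32-bit values yields a 64-bit value.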

As instructions execute they use inputs and generate outputs, which are read from and written to a machine state. The machine state consists of registers and memory, each of which holds a value which is instantiated from the domain's value type. Furthermore, memory addresses are also described by instances of the domain's value type (although internally, they can use a different type as long as a translation is provided to and from that type). The names and inter-relationships of the architecture's registers are contained in a RegisterDictionary while the state itself contains the values stored in those registers. The organization of registers and memory within the state is defined by the state. Various kinds of states form a class hierarchy whose root is BaseSemantics::State.

The RISC operators class provides the implementations for the RISC operators. Those operators are documented in the BaseSemantics::RiscOperators class, which is the root of a class hierarchy. Most of the RISC operators are pure virtual.

In order to use binary instruction semantics the user must create the various parts and link them together in a lattice. The parts are usually built from the bottom up since higher-level parts take lower-level parts as their constructor arguments: svalue, register state, memory state, RISC operators, and dispatcher. However, most of the RiscOperators classes have a default (or mostly default) constructor that builds the prerequisite objects and links them together, so the only time a user would need to do it explicitly is when they want to mix in a custom part.
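The bottom-up construction order can be sketched with stand-in types. Every class below is hypothetical (the Sketch prefix marks them as such) and only models the "higher-level part takes lower-level part as constructor argument" relationship described above; the combined SketchState object and the no-argument SketchOps::instance() convenience are assumptions made for the illustration, not ROSE signatures.

```cpp
#include <cassert>
#include <memory>

struct SketchValue { int width; SketchValue() : width(32) {} };
typedef std::shared_ptr<SketchValue> SketchValuePtr;

// Register and memory states are each built from a prototypical value.
struct SketchRegisterState {
    SketchValuePtr protoval;
    explicit SketchRegisterState(const SketchValuePtr &p) : protoval(p) {}
};
typedef std::shared_ptr<SketchRegisterState> SketchRegisterStatePtr;

struct SketchMemoryState {
    SketchValuePtr protoval;
    explicit SketchMemoryState(const SketchValuePtr &p) : protoval(p) {}
};
typedef std::shared_ptr<SketchMemoryState> SketchMemoryStatePtr;

// The machine state groups registers and memory.
struct SketchState {
    SketchRegisterStatePtr registers;
    SketchMemoryStatePtr memory;
    SketchState(const SketchRegisterStatePtr &r, const SketchMemoryStatePtr &m)
        : registers(r), memory(m) {}
};
typedef std::shared_ptr<SketchState> SketchStatePtr;

// RISC operators are built from a state.
struct SketchOps {
    SketchStatePtr state;
    explicit SketchOps(const SketchStatePtr &s) : state(s) {}

    // Mostly-default constructor: builds and links the prerequisites itself.
    static std::shared_ptr<SketchOps> instance() {
        SketchValuePtr protoval = std::make_shared<SketchValue>();
        SketchStatePtr st = std::make_shared<SketchState>(
            std::make_shared<SketchRegisterState>(protoval),
            std::make_shared<SketchMemoryState>(protoval));
        return std::make_shared<SketchOps>(st);
    }
};
typedef std::shared_ptr<SketchOps> SketchOpsPtr;

// The dispatcher is built last, from the RISC operators.
struct SketchDispatcher {
    SketchOpsPtr ops;
    explicit SketchDispatcher(const SketchOpsPtr &o) : ops(o) {}
};
typedef std::shared_ptr<SketchDispatcher> SketchDispatcherPtr;

// Explicit bottom-up wiring, needed only when mixing in a custom part
// (here the custom part would be the prototypical value).
inline SketchDispatcherPtr buildBottomUp(const SketchValuePtr &protoval) {
    SketchRegisterStatePtr regs = std::make_shared<SketchRegisterState>(protoval);
    SketchMemoryStatePtr mem = std::make_shared<SketchMemoryState>(protoval);
    SketchStatePtr state = std::make_shared<SketchState>(regs, mem);
    SketchOpsPtr ops = std::make_shared<SketchOps>(state);
    return std::make_shared<SketchDispatcher>(ops);
}
```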

Memory Management

Most of the instruction semantics components have abstract base classes. Instances of concrete subclasses thereof are passed around by pointers, and in order to simplify memory management issues, those objects are reference counted. Most objects use boost::shared_ptr, but SValue objects use a faster custom smart pointer (it also uses a custom allocator, and testing showed a substantial speed improvement over Boost when compiled with GCC's "-O3" switch). In any case, to relieve the user of having to remember which kind of objects use which smart pointer implementation, pointer typedefs are created for each class. Their names are the same as the class but suffixed with "Ptr". Users will almost exclusively work with pointers to the objects rather than the objects themselves. In fact, holding only a normal pointer to an object is a bit dangerous since the object will be deleted when the last smart pointer disappears.
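The lifetime rule can be demonstrated with a short sketch. It uses std::shared_ptr as a stand-in for boost::shared_ptr and a hypothetical DemoValue class; a std::weak_ptr plays the role of the "normal pointer" so that the destruction can be observed safely instead of through a dangling pointer.

```cpp
#include <cassert>
#include <memory>

// Hypothetical reference-counted object and its "Ptr" typedef.
struct DemoValue {
    int bits;
    explicit DemoValue(int b) : bits(b) {}
};
typedef std::shared_ptr<DemoValue> DemoValuePtr;

// Copying a smart pointer adds an owner; returns the owner count after one copy.
inline long ownersAfterCopy() {
    DemoValuePtr first = std::make_shared<DemoValue>(32);
    DemoValuePtr second = first;
    return second.use_count();
}

// Returns true if the object is gone once its last smart pointer is gone.
inline bool destroyedWhenLastOwnerLeaves() {
    std::weak_ptr<DemoValue> observer;      // non-owning, like a raw pointer
    {
        DemoValuePtr owner = std::make_shared<DemoValue>(32);
        observer = owner;
        assert(!observer.expired());        // still alive while 'owner' exists
    }                                       // last owner disappears here
    return observer.expired();
}
```

A raw pointer obtained inside the inner scope would be dangling after the closing brace, which is the danger the paragraph above warns about.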

In order to encourage users to use the provided smart pointers and not allocate semantic objects on the stack, the normal constructors are protected. To create a new object from a class name known at compile time, use the static instance() method which returns a smart pointer. This is how users will typically create the various semantic objects.

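// user code
#include <IntervalSemantics.h>
using namespace Rose::BinaryAnalysis::InstructionSemantics2;
BaseSemantics::SValuePtr value = IntervalSemantics::SValue::instance();
// no need to ever delete the object that 'value' points to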
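The pattern behind instance() can be sketched independently of ROSE. DemoInterval below is a hypothetical class, and std::shared_ptr stands in for the smart pointer types used by the framework; the sketch shows how a protected constructor rules out stack allocation while a static allocating constructor hands out smart pointers.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

class DemoInterval;
typedef std::shared_ptr<DemoInterval> DemoIntervalPtr;

class DemoInterval {
    int lo_, hi_;

protected:
    // Real constructor: not reachable from user code.
    DemoInterval(int lo, int hi) : lo_(lo), hi_(hi) {}

public:
    virtual ~DemoInterval() {}

    // Static allocating constructor: the only way for users to get an object.
    static DemoIntervalPtr instance(int lo, int hi) {
        return DemoIntervalPtr(new DemoInterval(lo, hi));
    }

    int lo() const { return lo_; }
    int hi() const { return hi_; }
};

// User code cannot construct the object directly (for example on the stack).
static_assert(!std::is_constructible<DemoInterval, int, int>::value,
              "the real constructor is protected");
```

A user would write DemoInterval::instance(0, 7) and keep the returned pointer; the object is deleted automatically when that pointer and all of its copies are gone.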

Most of the semantic objects also provide virtual constructors. The this pointer is used only to obtain the dynamic type of the object in order to call the correct virtual constructor. The virtual constructor will create a new object having the same dynamic type and return a smart pointer to it. The object on which the virtual constructor is invoked is a "prototypical object". For instance, when a RegisterState object is created its constructor is supplied with a prototypical SValue (a "protoval") which will be used to create new values whenever one is needed (such as when setting the initial values for the registers). Virtual constructors are usually named create(), but some classes, particularly SValue, define other virtual constructors as well. Virtual constructors are most often used when a function overrides a declaration from a base class, such as when a user defines their own RISC operation:

BaseSemantics::SValuePtr my_accumulate(const BaseSemantics::SValuePtr &operand, int addend) {
    BaseSemantics::SValuePtr retval = operand->create(operand->sum + addend);
    return retval;
}
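The protoval mechanism can also be shown with a self-contained sketch. ProtoValue, TaggedValue, and DemoRegisters are hypothetical classes, and std::shared_ptr stands in for the framework's smart pointers. The register container never names the concrete value class; it only calls the virtual constructor on the prototypical value it was given.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class ProtoValue;
typedef std::shared_ptr<ProtoValue> ProtoValuePtr;

class ProtoValue {
    int n_;
protected:
    explicit ProtoValue(int n) : n_(n) {}
public:
    virtual ~ProtoValue() {}
    static ProtoValuePtr instance(int n) { return ProtoValuePtr(new ProtoValue(n)); }

    // Virtual constructor: 'this' only selects the dynamic type.
    virtual ProtoValuePtr create(int n) const { return instance(n); }

    virtual std::string kind() const { return "base"; }
    int number() const { return n_; }
};

// A user-defined value type that overrides the virtual constructor.
class TaggedValue : public ProtoValue {
protected:
    explicit TaggedValue(int n) : ProtoValue(n) {}
public:
    static std::shared_ptr<TaggedValue> instance(int n) {
        return std::shared_ptr<TaggedValue>(new TaggedValue(n));
    }
    ProtoValuePtr create(int n) const override { return instance(n); }
    std::string kind() const override { return "tagged"; }
};

// Knows nothing about TaggedValue, yet fills itself with TaggedValue objects
// when handed a TaggedValue as the prototypical value.
class DemoRegisters {
    ProtoValuePtr protoval_;
    std::vector<ProtoValuePtr> regs_;
public:
    DemoRegisters(const ProtoValuePtr &protoval, std::size_t nregs)
        : protoval_(protoval) {
        for (std::size_t i = 0; i < nregs; ++i)
            regs_.push_back(protoval_->create(0));   // initial register values
    }
    ProtoValuePtr get(std::size_t i) const { return regs_.at(i); }
};
```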

Some of the semantic objects have a virtual copy constructor named copy(). This operates like a normal copy constructor but also adjusts reference counts.

Specialization

The instruction semantics architecture is designed to allow users to specialize nearly every part of it. ROSE defines triplets (value type, state type, RISC operators) that are designed to work together to implement a particular semantic domain, but users are free to subclass any of those components to build customized semantic domains. For example, the x86 simulator (in "projects/simulator2") subclasses the PartialSymbolicSemantics state in order to use memory mapped via ROSE's MemoryMap class, and to handle system calls (among other things).

When writing a subclass the author should implement three versions of each constructor: the real constructor, the static allocating constructor, and the virtual constructor. Fortunately, the amount of extra code needed is not substantial since the virtual constructor can call the static allocating constructor, which can call the real constructor. The three versions in more detail are:

  1. Real Constructors: These are the normal C++ constructors. They should have protected access and are used only by authors of subclasses.
  2. Static Allocating Constructors: These are class methods that allocate a specific kind of object on the heap and return a smart pointer to the object. They are named "instance" to emphasize that they instantiate a new instance of a particular class and they return the pointer type that is specific to the class (i.e., not one of the BaseSemantics pointer types). When an end user constructs a dispatcher, RISC operators, etc., they have particular classes in mind and use those classes' "instance" methods to create objects. Static allocating constructors are seldom called by authors of subclasses; instead the author usually has an object whose provenance can be traced back to a user-created object (such as a prototypical object), and invokes one of that object's virtual constructors.
  3. Virtual Constructors: A virtual constructor creates a new object having the same run-time type as the object on which the method is invoked. Virtual constructors are often named "create", with the virtual copy constructor named "clone". However, the SValue class hierarchy follows a different naming scheme for historic reasons: its virtual constructors end with an underscore. Virtual constructors return pointer types that are defined in BaseSemantics. Subclass authors usually use this kind of object creation because it frees them from having to know a specific type and allows their classes to be easily subclassed.

When writing a subclass the author should implement the three versions for each constructor inherited from the super class. The author may also add any additional constructors that are deemed necessary, realizing that all subclasses of that class will also need to implement those constructors.

The subclass may define a public virtual destructor that will be called by the smart pointer implementation when the final pointer to the object is destroyed.

Here is an example of specializing a class that is itself derived from something in the ROSE semantics framework.

#include <cassert>
#include <cstring>
#include <boost/shared_ptr.hpp>

// Smart pointer for the subclass
typedef boost::shared_ptr<class MyThing> MyThingPtr;

// Class derived from OtherThing, which eventually derives from a class
// defined in BinaryAnalysis::InstructionSemantics2::BaseSemantics. Let's
// say BaseSemantics::Thing, a non-existent class that follows the rules
// outlined above.
class MyThing: public OtherThing {
private:
    char *data; // some data allocated on the heap w/out a smart pointer

    // Real constructors. Normally this will be all the same constructors as
    // in the super class, and possibly a few new ones. Thus anything you add
    // here will need to also be implemented in all subclasses hereof. Let's
    // pretend that the super class has two constructors: a copy constructor
    // and one that takes a pointer to a register state.
protected:
    explicit MyThing(const BaseSemantics::RegisterStatePtr &rstate)
        : OtherThing(rstate), data(NULL) {}

    MyThing(const MyThing &other)
        : OtherThing(other), data(copy_string(other.data)) {}

    // Define the virtual destructor if necessary. This won't be called until
    // the last smart pointer reference to this object is destroyed.
public:
    virtual ~MyThing() {
        delete[] data; // array form, because copy_string() allocates with new[]
    }

    // Static allocating constructors. One static allocating constructor
    // for each real constructor, including the copy constructor.
public:
    static MyThingPtr instance(const BaseSemantics::RegisterStatePtr &rstate) {
        return MyThingPtr(new MyThing(rstate));
    }

    static MyThingPtr instance(const MyThingPtr &other) {
        return MyThingPtr(new MyThing(*other));
    }

    // Virtual constructors. One virtual constructor for each static allocating
    // constructor. It is of utmost importance that we cover all the virtual
    // constructors from the super class. These return the most super type
    // possible, usually something from BaseSemantics.
public:
    virtual BaseSemantics::ThingPtr create(const BaseSemantics::RegisterStatePtr &rstate) {
        return instance(rstate);
    }

    // Name the virtual copy constructor "clone" rather than "create".
    virtual BaseSemantics::ThingPtr clone(const BaseSemantics::ThingPtr &other_) {
        MyThingPtr other = MyThing::promote(other_);
        return instance(other);
    }

    // Define the checking dynamic pointer cast.
public:
    static MyThingPtr promote(const BaseSemantics::ThingPtr &obj) {
        // The template argument is the pointee type, not the pointer typedef.
        MyThingPtr retval = boost::dynamic_pointer_cast<MyThing>(obj);
        assert(retval!=NULL);
        return retval;
    }

    // Define the methods you need for this class.
public:
    virtual char *get_data() const {
        return data; // or maybe return a copy in case this gets deleted?
    }

    virtual void set_data(const char *s) {
        char *replacement = copy_string(s); // copy first in case s aliases data
        delete[] data;                      // release the old string
        data = replacement;
    }

private:
    // Static so that the copy constructor's initializer list can call it.
    static char *copy_string(const char *s) {
        if (s==NULL)
            return NULL;
        char *retval = new char[strlen(s)+1];
        strcpy(retval, s);
        return retval;
    }
};

Other major changes

The new API exists in the Rose::BinaryAnalysis::InstructionSemantics2 namespace and can coexist with the original API in Rose::BinaryAnalysis::InstructionSemantics, so a program can use both APIs at the same time.

The mapping of class names (and some methods) from old API to new API is:

The biggest difference between the APIs is that almost everything in the new API is allocated on the heap and passed by pointer instead of being allocated on the stack and passed by value. However, when converting from old API to new API, one does not need to add calls to delete objects since this happens automatically.

The dispatchers are table driven rather than having a giant "switch" statement. While nothing prevents a user from subclassing a dispatcher to override its processInstruction() method, it's often easier to just allocate a new instruction handler and register it with the dispatcher. This also makes it easy to add semantics for instructions that we hadn't considered in the original design. See DispatcherX86 for some examples.
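The table-driven idea can be sketched as follows. TableDispatcher, ToyInsn, and registerHandler are hypothetical names chosen for this illustration and do not reflect the actual registration interface of DispatcherX86; the machine state is reduced to a single integer accumulator.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical instruction: a mnemonic and one immediate operand.
struct ToyInsn {
    std::string mnemonic;
    int operand;
};

class TableDispatcher {
public:
    typedef std::function<void(const ToyInsn&, int&)> Handler;

private:
    std::map<std::string, Handler> table_;   // mnemonic -> handler

public:
    // Adding semantics for a new instruction is a table insertion,
    // not a change to a switch statement or a new subclass.
    void registerHandler(const std::string &mnemonic, const Handler &handler) {
        table_[mnemonic] = handler;
    }

    // Look up the handler for the instruction and invoke it.
    void processInstruction(const ToyInsn &insn, int &accumulator) const {
        std::map<std::string, Handler>::const_iterator found = table_.find(insn.mnemonic);
        if (found == table_.end())
            throw std::runtime_error("no semantics for " + insn.mnemonic);
        found->second(insn, accumulator);
    }
};
```

Registering a handler for a mnemonic that already has one simply replaces the old entry, which is how a user could override the semantics of a single instruction in this sketch.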

The interface between RiscOperators and either MemoryState or RegisterState has been formalized somewhat. See documentation for RiscOperators::readMemory and RiscOperators::readRegister.

Future work

Floating-point instructions. Floating point registers are defined in the various RegisterDictionary objects but none of the semantic states actually define space for them, and we haven't defined any floating-point RISC operations for policies to implement. As for existing machine instructions, the dispatchers will translate machine floating point instructions to RISC operations, and the specifics of those operations will be defined by the various semantic policies. For instance, the RISC operators for a concrete semantic domain might use the host machine's native IEEE floating point to emulate the target machine's floating-point operations.

Example

See actual source code for examples since this interface is an active area of ROSE development (as of Jan-2013). The tests/nonsmoke/functional/roseTests/binaryTests/semanticSpeed.C has very simple examples for a variety of semantic domains. In order to use one of ROSE's predefined semantic domains you'll likely need to define some types and variables. Here's what the code would look like when using default components of the Symbolic domain:

// New API
// Old API for comparison
using namespace Rose::BinaryAnalysis::InstructionSemantics;
typedef SymbolicSemantics::Policy<> Policy;
Policy policy;
X86InstructionSemantics<Policy, SymbolicSemantics::ValueType> semantics(policy);

And here's almost the same example but explicitly creating all the parts. Normally you'd only write it this way if you were replacing one or more of the parts with your own class, so we'll use MySemanticValue as the semantic value type:

// New API, constructing the lattice from bottom up.
// Almost copied from SymbolicSemantics::RiscOperators::instance()
BaseSemantics::SValuePtr protoval = MySemanticValue::instance();
// The old API was a bit more concise for the user, but was not able to override all the
// components as easily, and the implementation of MySemanticValue would certainly have been
// more complex, not to mention that it wasn't even possible for end users to always correctly
// override a particular method by subclassing.
using namespace Rose::BinaryAnalysis::InstructionSemantics;
typedef SymbolicSemantics::Policy<SymbolicSemantics::State, MySemanticValue> Policy;
Policy policy;
X86InstructionSemantics<Policy, MySemanticValue> semantics(policy);

In order to analyze a sequence of instructions, one calls the dispatcher's processInstruction() method one instruction at a time. The dispatcher breaks the instruction down into a sequence of RISC-like operations and invokes those operations in the chosen semantic domain. The RISC operations produce domain-specific result values and/or update the machine state (registers, memory, etc). Each RISC operator domain provides methods by which the user can inspect and/or modify the state. In fact, in order to follow flow-of-control from one instruction to another, it is customary to read the x86 EIP (instruction pointer register) value to get the address for the next instruction fetch.
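The fetch-and-process loop described above might look like the following sketch. Everything in it is hypothetical: LoopInsn is a three-instruction toy instruction set, LoopState holds only an instruction pointer and a counter, and the free function processInstruction stands in for a dispatcher's method of the same name. The point is only that the next fetch address is read back from the state after each instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

typedef std::uint32_t Address;

// Hypothetical instruction: increment a counter, jump, or halt.
struct LoopInsn {
    enum Kind { INC, JMP, HLT } kind;
    Address target;   // jump target, used only by JMP
    Address size;     // encoded size in bytes
};

// Hypothetical machine state: instruction pointer plus one counter register.
struct LoopState {
    Address ip;
    std::uint32_t counter;
};

// Stand-in for the dispatcher: updates the state for one instruction,
// including the instruction pointer.
inline void processInstruction(const LoopInsn &insn, Address va, LoopState &state) {
    state.ip = va + insn.size;              // fall through by default
    if (insn.kind == LoopInsn::INC) {
        ++state.counter;
    } else if (insn.kind == LoopInsn::JMP) {
        state.ip = insn.target;
    }
}

// Follow flow of control by reading the instruction pointer from the state
// after every instruction. Stops at HLT, at an unmapped address, or after
// maxSteps instructions. Returns the final counter value.
inline std::uint32_t run(const std::map<Address, LoopInsn> &program,
                         Address entry, std::size_t maxSteps) {
    LoopState state = { entry, 0 };
    for (std::size_t step = 0; step < maxSteps; ++step) {
        std::map<Address, LoopInsn>::const_iterator found = program.find(state.ip);
        if (found == program.end() || found->second.kind == LoopInsn::HLT)
            break;
        processInstruction(found->second, found->first, state);
    }
    return state.counter;
}
```

In a symbolic domain the instruction pointer might not be a single known number after a conditional branch, so a real analysis would need a policy for that case; the sketch assumes a concrete domain.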

One can find actual uses of instruction semantics in ROSE by searching for DispatcherX86. Also, the simulator2 project (in projects/simulator2) has many examples of how to use instruction semantics. In fact, the simulator defines its own concrete domain by subclassing PartialSymbolicSemantics in order to execute specimen programs.

Namespaces

 BaseSemantics
 Base classes for instruction semantics.
 
 ConcreteSemantics
 A concrete semantic domain.
 
 IntervalSemantics
 An interval analysis semantic domain.
 
 LlvmSemantics
 A semantic domain to generate LLVM.
 
 MultiSemantics
 Semantic domain composed of subdomains.
 
 NullSemantics
 Semantic domain that does nothing, but is well documented.
 
 PartialSymbolicSemantics
 A fast, partially symbolic semantic domain.
 
 SourceAstSemantics
 Generate C source AST from binary AST.
 
 StaticSemantics
 Generate static semantics and attach to the AST.
 
 SymbolicSemantics
 A fully symbolic semantic domain.
 
 TraceSemantics
 A semantics domain wrapper that prints and checks all RISC operators as they occur.
 

Classes

class  DispatcherM68k
 
class  DispatcherPowerpc
 
class  DispatcherX86
 
class  TestSemantics
 Provides functions for testing binary instruction semantics.
 

Typedefs

typedef boost::shared_ptr< class DispatcherM68k > DispatcherM68kPtr
 Shared-ownership pointer to an M68k instruction dispatcher.
 
typedef boost::shared_ptr< class DispatcherPowerpc > DispatcherPowerpcPtr
 Shared-ownership pointer to a PowerPC instruction dispatcher.
 
typedef boost::shared_ptr< class DispatcherX86 > DispatcherX86Ptr
 Shared-ownership pointer to an x86 instruction dispatcher.
 

Functions

void initDiagnostics ()
 Initialize diagnostics for instruction semantics.
 

Variables

Sawyer::Message::Facility mlog
 Diagnostics logging facility for instruction semantics.
 

Typedef Documentation

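typedef boost::shared_ptr< class DispatcherM68k > Rose::BinaryAnalysis::InstructionSemantics2::DispatcherM68kPtr

Shared-ownership pointer to an M68k instruction dispatcher.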

See Shared ownership.

Definition at line 16 of file DispatcherM68k.h.

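typedef boost::shared_ptr< class DispatcherPowerpc > Rose::BinaryAnalysis::InstructionSemantics2::DispatcherPowerpcPtr

Shared-ownership pointer to a PowerPC instruction dispatcher.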

See Shared ownership.

Definition at line 19 of file DispatcherPowerpc.h.

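typedef boost::shared_ptr< class DispatcherX86 > Rose::BinaryAnalysis::InstructionSemantics2::DispatcherX86Ptr

Shared-ownership pointer to an x86 instruction dispatcher.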

See Shared ownership.

Definition at line 21 of file DispatcherX86.h.

Function Documentation

void Rose::BinaryAnalysis::InstructionSemantics2::initDiagnostics ( )

Initialize diagnostics for instruction semantics.

Variable Documentation

Sawyer::Message::Facility Rose::BinaryAnalysis::InstructionSemantics2::mlog

Diagnostics logging facility for instruction semantics.