Unstripping Stripped Binaries

Tavis Ormandy

$Id: a07cf90837a3c4373b82d6724b97593810766af7 $

Intro

I’ve written before about how much I enjoy vintage software. Lately I’ve been tinkering with WordPerfect for UNIX.

It’s working great. Combined with Lotus 1-2-3, you can have a full-featured office suite in an xterm! 😂

WordPerfect

Debugging

These are 30-year-old stripped binaries that I’ve somehow managed to patch into a working state. As you can imagine, when something doesn’t work, tracking down what went wrong can be a real challenge.

What I really want is to take the names and types I’ve figured out from my disassembler, and make them visible to gdb.

I’ve found a simple solution that’s been working well for me; here are some notes on it.

Stabs

You’re probably familiar with DWARF, the debugging format used everywhere on Linux. DWARF is really neat: it’s capable of expressing the most complex locations and types possible.

In this context, “location” means explaining to a debugger how to find a variable. In some cases that’s really easy, but it gets complicated fast. Perhaps the variable moves in and out of registers or requires complex address calculation logic (e.g. a bitfield struct member stored in a stack frame).

Figuring out these locations requires the debugger to execute little bytecode programs called DWARF expressions – crazy stuff!
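To make the idea of "little bytecode programs" concrete, here is a toy sketch of a stack machine that runs a three-opcode subset of DWARF expressions. The opcode values come from the DWARF specification, but the register contents, memory contents and the expression itself are invented for illustration; a real debugger handles far more opcodes and edge cases.

```python
# Toy evaluator for a tiny subset of DWARF location expressions.
# Register and memory contents below are made up for illustration.

DW_OP_deref = 0x06        # pop an address, push the word stored there
DW_OP_plus_uconst = 0x23  # add a ULEB128 operand to the top of the stack
DW_OP_breg0 = 0x70        # DW_OP_breg0..breg31: push register N + SLEB128 offset


def read_leb128(code, pos, signed):
    """Decode one LEB128 value starting at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = code[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if signed and byte & 0x40:
        value -= 1 << shift
    return value, pos


def evaluate(code, regs, read_word):
    """Run the expression and return the value left on top of the stack."""
    stack, pos = [], 0
    while pos < len(code):
        op = code[pos]
        pos += 1
        if DW_OP_breg0 <= op <= DW_OP_breg0 + 31:
            offset, pos = read_leb128(code, pos, signed=True)
            stack.append(regs[op - DW_OP_breg0] + offset)
        elif op == DW_OP_plus_uconst:
            operand, pos = read_leb128(code, pos, signed=False)
            stack.append(stack.pop() + operand)
        elif op == DW_OP_deref:
            stack.append(read_word(stack.pop()))
        else:
            raise NotImplementedError(hex(op))
    return stack[-1]


# "Load the pointer stored at ebp+8, then add 4": the address of a field
# at offset 4 inside a struct passed by pointer as the first argument.
EBP = 5  # DWARF register number for ebp on i386
expr = bytes([DW_OP_breg0 + EBP, 8, DW_OP_deref, DW_OP_plus_uconst, 4])
regs = {EBP: 0xFFFFC000}              # hypothetical frame pointer
memory = {0xFFFFC008: 0x08350DB0}     # hypothetical stack contents
location = evaluate(expr, regs, memory.__getitem__)
print(hex(location))
```

Even this five-byte expression needs an interpreter, a register file and memory access, which is exactly the complexity that simple STABS locations avoid.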

Before DWARF, STABS (Symbol TABle Strings) was the predominant debugging format. STABS takes simple locations and types, encodes them into strings, and then stuffs them into the symbol table. It’s not very elegant, but it worked.

STABS is entirely obsolete; DWARF is superior in every way. However… STABS does have one benefit that DWARF doesn’t – expressing simple stuff (very simple) is so easy you can even do it by hand…

Tools

All the GNU tools still have native support for STABS. In fact, I’ve noticed the GNU assembler even has pseudo-instructions that you can write manually:

https://sourceware.org/binutils/docs/as/Stab.html:

There are three directives that begin ‘.stab’. All emit symbols (see Symbols), for use by symbolic debuggers. Up to five fields are required:

string
This is the symbol’s name.

type
An absolute expression. The symbol’s type is set to the low 8 bits of this expression. Any bit pattern is permitted, but ld and debuggers choke on silly bit patterns.

[…]

value
An absolute expression which becomes the symbol’s value.

If you just want basic native types and simple locations, nothing could be simpler than this.

Examples

Let’s say I’ve figured out there is a function at 0x8005bba like this:

void example(unsigned int *foo, unsigned long bar);

The stab to declare that is just this:

.stabs "example:f-11", N_FUN, 0, 0, 0x8005bba

Here f means this is a function, and -11 is the predefined type for void, which is the function’s return type. If you only want to use basic types, you don’t even have to define them!
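As a rough sketch of how mechanical this is, the directive can be produced by a few lines of Python. The helper name function_stab is hypothetical, and N_FUN is left symbolic just as in the directive above, so it is assumed to be defined elsewhere before gas sees the output (its conventional value is 0x24).

```python
# Sketch: build the .stabs directive for a function symbol.
# function_stab is a hypothetical helper, not part of any existing tool.

VOID = -11  # predefined stabs type number for void


def function_stab(name, address, return_type=VOID):
    """Return a .stabs line declaring a function at the given address."""
    return f'.stabs "{name}:f{return_type}", N_FUN, 0, 0, {address:#x}'


print(function_stab("example", 0x8005BBA))
```

Only the name, the return type number and the address vary from one function to the next.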

Here is a list of some of the predefined types that GDB recognizes:

Num  Type
-1   int, 32 bit signed integral type.
-2   char, 8 bit type holding a character.
-4   long, 32 bit signed integral type.
-5   unsigned char, 8 bit unsigned integral type.
-6   signed char, 8 bit signed integral type.
-7   unsigned short, 16 bit unsigned integral type.
-8   unsigned int, 32 bit unsigned integral type.
-9   unsigned, 32 bit unsigned integral type.
-10  unsigned long, 32 bit unsigned integral type.
-11  void, type indicating the lack of a value.
-31  long long, 64 bit signed integral type.
-32  unsigned long long, 64 bit unsigned integral type.

There are some confusing choices in there, but there are about 30 predefined types that GDB knows about.

Notice I said “knows about”. That’s because there is no STABS standard, just analyses of what crazy incompatible things all the 90s UNIX vendors were doing!

This Cygnus document on stabs is great; it’s well written and thorough, but occasional glimpses of frustration with Sun and IBM for their incompatible undocumented extensions seep through.

https://sourceware.org/gdb/onlinedocs/stabs.pdf

Parameters

Okay, functions are working, what about function parameters? If they’re one of the predefined types and this is a standard cdecl function, that’s easy too!

.stabs "foo:p*-8", N_PSYM, 0, 0, 8
.stabs "bar:p-10", N_PSYM, 0, 0, 12

This means there is a parameter foo, a pointer to an unsigned int, at ebp+8, and an unsigned long called bar at ebp+12.
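The offsets follow a simple rule, sketched below in Python. It assumes a standard 32-bit cdecl frame where every parameter occupies one 4-byte slot and the first sits at ebp+8, above the saved ebp and the return address. The helper name param_stabs and the PREDEFINED table are illustrative only, and N_PSYM is assumed to be defined elsewhere (conventionally 0xa0).

```python
# Sketch: emit N_PSYM stabs for the parameters of a 32-bit cdecl function.
# Assumes one 4-byte stack slot per parameter, first parameter at ebp+8.

PARAM_SIZE = 4

# Subset of the predefined type numbers listed earlier.
PREDEFINED = {"int": -1, "char": -2, "unsigned int": -8,
              "unsigned long": -10, "void": -11}


def param_stabs(params):
    """params: list of (name, base_type, pointer_depth) in declaration order."""
    lines = []
    for index, (name, base, depth) in enumerate(params):
        # Each level of pointer indirection adds one '*' before the type number.
        type_str = "*" * depth + str(PREDEFINED[base])
        offset = 2 * PARAM_SIZE + index * PARAM_SIZE
        lines.append(f'.stabs "{name}:p{type_str}", N_PSYM, 0, 0, {offset}')
    return lines


stabs = param_stabs([("foo", "unsigned int", 1), ("bar", "unsigned long", 0)])
print("\n".join(stabs))
```

Parameters wider than one slot, such as long long, would break the one-slot-per-parameter assumption and need their own offset handling.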

I wrote some gas macros to make this less laborious, and now I can just write this:

function main, 0x8128000, %int
    param argc, %int
    param argv, ** %char
    param envp, ** %char

Pretty neat!

You don’t even need to specify the offset – gas macros can store counters between invocations, so I just keep incrementing it for each new parameter, then reset it when a new function starts!

The macros are really simple; they look like this (some code omitted):

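.macro function name, address, ptr=, type
    /* A new function starts: reset the parameter counter. */
    .set _arg, 0
    .stabs "\name:f\ptr\type", N_FUN, 0, 0, \address
.endm

.macro param name, ptr=, type
    /* Each parameter takes the next PARAM_SIZE slot above the saved ebp and return address. */
    .set _arg, _arg + 1
    .stabs "\name:p\ptr\type", N_PSYM, 0, 0, PARAM_SIZE+_arg*PARAM_SIZE
.endm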

It works great. Here is a sample gdb session, where you can see me set breakpoints, examine values, print types, and so on.

(gdb) add-symbol-file symbols.dbg
(gdb) pt rddec
type = boolean (char **, unsigned short *)
(gdb) x/i rddec
0x81dac90 <rddec>:   push   ebp
(gdb) b rddec
Breakpoint 1 at 0x81dac96
(gdb) c
Breakpoint 1, 0x081dac96 in rddec (numstr=0xffffc044, num=0x8350dbe)
(gdb) pt numstr
type = char **
(gdb) p *numstr
$1 = 0xffffc048 "06"

Even conditional breakpoints on parameter values work; it’s just like unstripping the binary.

To get a symbol file, I assemble and link the macro files like this:

$ as --32 -gstabs -o wp.o wp.s
$ as --32 -gstabs -o types.o types.s
$ ld -m elf_i386 -shared -Tdata=082d7938 -Ttext=0804a5f0 -Tbss=083377c0 -o wp.dbg wp.o types.o
$ strip --only-keep-debug wp.dbg

It’s important to have the sections lined up with the target binary, or gdb will get confused.

Now you can just do this:

(gdb) add-symbol-file wp.dbg
add symbol table from file "wp.dbg"
Reading symbols from wp.dbg...

I haven’t tried it, but I bet objcopy --add-gnu-debuglink would work too!

Usage

I can write these symbols manually when I need to, but I also wrote a quick script to export them from my disassembler.

The output is just thousands of lines like this:

function g_init, 0x0815EFA0, %int
function dflt_init, 0x0815F1A0, %int
function tool_init, 0x0815F620, %int
function g_close, 0x0815F630, %int
function g_inits, 0x0815F710, %int
function g_dint, 0x0815F720, %int
function g_dot, 0x0815F730, %int
function sub_8160290, 0x08160290, * %void
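The formatting half of such an export script might look like the sketch below. How the names, addresses and types are pulled out of the disassembler depends entirely on its scripting API, so the input here is just a hypothetical list of tuples, and export_line is an illustrative name.

```python
# Sketch: format disassembler symbols as "function" macro invocations.
# The (name, address, type, pointer) tuples stand in for whatever the
# disassembler's scripting API would actually return.

def export_line(name, address, return_type="%int", pointer=""):
    """Format one symbol; pointer is a string of '*' for pointer return types."""
    prefix = f"{pointer} " if pointer else ""
    return f"function {name}, 0x{address:08X}, {prefix}{return_type}"


symbols = [
    ("g_init", 0x0815EFA0, "%int", ""),
    ("sub_8160290", 0x08160290, "%void", "*"),
]
lines = [export_line(*symbol) for symbol in symbols]
print("\n".join(lines))
```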

Putting it all together, my stripped binary now has symbols and parameter information in gdb, woohoo!

Breakpoint 2, 0x0815f1b9 in dflt_init ()
(gdb) bt
#0  0x0815f1b9 in dflt_init ()
#1  0x0815f02a in g_init ()
#2  0x0815acfc in int_dsp_xxx ()
#3  0x0814f8ad in gshow_init ()

Symbol Porting

I actually have a huge advantage here that I didn’t mention.

After a few hours digging around on archive.org, I found a binary for an older version of WordPerfect that wasn’t stripped! It must have been a mistake while building the final RTM binaries.

$ ls -l wp
-rw------- 1 taviso taviso 4.5M Jul 31  1996 wp
$ file wp
wp: ELF 32-bit MSB executable, SPARC, version 1 (SYSV), dynamically linked, interpreter /usr/lib/ld.so.1, not stripped

Unfortunately it’s not only an older version, but also for a different architecture and operating system, Solaris SPARC. Even so, BinDiff does pretty well at matching these symbols to my i386 binary.

I’ve found that people are sometimes surprised this works! Most of the clever tricks BinDiff uses to match functions between two binaries are actually architecture neutral. That means that if you have a stripped binary for ARM and an unstripped version of the same binary for MIPS, BinDiff can figure out which functions are which for you.

String References

Some of the techniques used are easy to understand, like string reference matching. If only a single function references the string “error in function foo”, then it doesn’t matter if it’s SPARC or x86, clearly this is the same function, so you just learned a symbol name!
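A minimal sketch of that idea in Python, assuming the string references of each function have already been extracted into a dictionary. The function names, strings and helper names are hypothetical, and this is a simplification of the idea rather than BinDiff's actual implementation.

```python
from collections import defaultdict


def unique_string_owners(string_refs):
    """Map each string to its function, for strings used by exactly one function."""
    owners = defaultdict(set)
    for function, strings in string_refs.items():
        for s in strings:
            owners[s].add(function)
    return {s: next(iter(funcs)) for s, funcs in owners.items() if len(funcs) == 1}


def match_by_strings(stripped, unstripped):
    """Return {stripped_name: real_name} for functions sharing a unique string."""
    a = unique_string_owners(stripped)
    b = unique_string_owners(unstripped)
    # Only strings that identify a single function in BOTH binaries count.
    return {a[s]: b[s] for s in a.keys() & b.keys()}


# Hypothetical string references from the two binaries.
stripped = {"sub_8150000": {"error in function foo", "%s: %d"},
            "sub_8151000": {"%s: %d"}}
unstripped = {"foo": {"error in function foo", "%s: %d"},
              "bar": {"%s: %d", "out of memory"}}
matches = match_by_strings(stripped, unstripped)
print(matches)
```

The shared format string "%s: %d" is ignored because more than one function references it, so it identifies nothing on its own.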

CFG

It’s rarely that easy though, and BinDiff is full of insanely clever tricks. Many of them involve graph matching, both on the control flow graph (CFG) inside each function and on the call graph between functions.

If you generate a graph of all the function calls in two similar binaries, there should be lots of matches.

If function foo calls function bar, which calls functions baz and quux, then you don’t need to know what architecture this is or what the functions do. You can simply find the same unique graph and be confident these are the same functions!

Control Flow Graph Matching

DOT Code…

digraph {
    rankdir="LR"
    sub_123 -> sub_456
    sub_456 -> sub_789
    sub_456 -> sub_abc

    foo -> bar
    bar -> baz
    bar -> quux
}
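The graph above is small enough to match with a naive sketch: give every node a name-free signature describing the shape of the graph above and below it, then pair up nodes whose signature is unique in both graphs. This is only an illustration of the principle under the assumption that the graph has no cycles; it is not how BinDiff is implemented, and the helper names are made up.

```python
from collections import Counter


def parse_edges(text):
    """Parse 'a -> b' lines into a list of (caller, callee) pairs."""
    edges = []
    for line in text.splitlines():
        if "->" in line:
            caller, callee = (part.strip() for part in line.split("->"))
            edges.append((caller, callee))
    return edges


def signatures(edges):
    """Describe each node by the shape of its callers and of its callees."""
    callees, callers = {}, {}
    for caller, callee in edges:
        callees.setdefault(caller, []).append(callee)
        callers.setdefault(callee, []).append(caller)
    nodes = set(callees) | set(callers)

    def shape(node, neighbours):
        # Recursive canonical form; assumes the graph has no cycles.
        return tuple(sorted(shape(n, neighbours) for n in neighbours.get(node, [])))

    return {n: (shape(n, callers), shape(n, callees)) for n in nodes}


def match(edges_a, edges_b):
    """Pair nodes whose signature is unique in both graphs."""
    sig_a, sig_b = signatures(edges_a), signatures(edges_b)
    count_a, count_b = Counter(sig_a.values()), Counter(sig_b.values())
    by_sig_b = {sig: node for node, sig in sig_b.items()}
    return {node: by_sig_b[sig] for node, sig in sig_a.items()
            if count_a[sig] == 1 and count_b.get(sig) == 1}


stripped = parse_edges("""
    sub_123 -> sub_456
    sub_456 -> sub_789
    sub_456 -> sub_abc
""")
named = parse_edges("""
    foo -> bar
    bar -> baz
    bar -> quux
""")
result = match(stripped, named)
print(result)
```

Here sub_123 and sub_456 pair up with foo and bar, while the two leaf functions stay unmatched because they look identical from the graph alone; telling them apart needs other evidence, such as string references.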

This is just a trivial example; you can read about some more of the BinDiff matching strategies here.

These tricks combined have let me debug and track down some pretty gnarly issues!

Code

If the idea of using stabs to debug stripped code has piqued your interest, you can take a look at my macros and code on GitHub here.

Hopefully this could be a useful base for someone else’s project!