Yesterday, I was presented with a file format reverse engineering puzzle. My mom had a old address book database that she wanted to recover. It had been created in Microsoft Works 6.0 (I think) database format (the files with the .wdb extension). My first thought was to run strings(1) on it, and see if the data was in connected ascii text. This worked, mostly, as the text did show up; but it was not organized into fields and records, which would make it harder to recreate a database from it. When she explained that she was going to need the whole thing, not just a few entries, I said I would work on recovering more of the format.
My first attempts at finding a wdb parser involved a web search. Google can parse wdb files - but they don't publish their translation programs, and I didn't want to upload our personal address book to the web and wait for google to index it. I found various commerical programs that claimed to be able to recover wdb files, but I wanted to understand the format, not just buy a product which, might, or might not, work. I was unable to find any discussion in the free software/open source world on the details of the format(which is the main reason I am writing this right now). I did find various tutorials on reverse engineering, including a nice page on reversing file formats by John Rittenhouse, and the Reverse Engineering Wikibook. Both of these helped give me some pointers in my work. I also found a few other wdb files to examine via google, which I used a little in my work on the format.
Since I was unable to find a already written tool to do the job, I decided to investigate the format and write one myself. I used Python, along with Emacs, (and a little use of hexdump) as my tools of choice, as they are what I am most familiar with. I examined the files fruitlessly for a while - identifying various often used characters, but not getting a sense of what they meant. The breakthrough happened when I realized there was a recurring, 10 byte header before each of the ascii text strings representing the values of the fields(which were terminated by a NULL). This header always began with a \x0f byte and NULL, followed by 4 data bytes alternated with NULLs.
The format, as I currently understand it, is as follows:
I've written a program to parse the format and translate it to CSV format; it is available here. If you know anything else about the format, or have any questions, please let me know. Any feedback gratefully accepted.
This tool works fine for me! thanks for the job. --eldraco (talk) 00:01, 11 May 2008 (UTC)