| Oct | NOV | Dec |
| 26 | ||
| 2019 | 2020 | 2021 |
COLLECTED BY
Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. The group is 100% composed of volunteers and interested parties, and has expanded into a large amount of related projects for saving online and digital history.
History is littered with hundreds of conflicts over the future of a community, group, location or business that were "resolved" when one of the parties stepped ahead and destroyed what was there. With the original point of contention destroyed, the debates would fall to the wayside. Archive Team believes that by duplicated condemned data, the conversation and debate can continue, as well as the richness and insight gained by keeping the materials. Our projects have ranged in size from a single volunteer downloading the data to a small-but-critical site, to over 100 volunteers stepping forward to acquire terabytes of user-created data to save for future generations.
The main site for Archive Team is at archiveteam.org and contains up to the date information on various projects, manifestos, plans and walkthroughs.
This collection contains the output of many Archive Team projects, both ongoing and completed. Thanks to the generous providing of disk space by the Internet Archive, multi-terabyte datasets can be made available, as well as in use by the Wayback Machine, providing a path back to lost websites and work.
Our collection has grown to the point of having sub-collections for the type of data we acquire. If you are seeking to browse the contents of these collections, the Wayback Machine is the best first stop. Otherwise, you are free to dig into the stacks to see what you may find.
The Archive Team Panic Downloads are full pulldowns of currently extant websites, meant to serve as emergency backups for needed sites that are in danger of closing, or which will be missed dearly if suddenly lost due to hard drive crashes or server failures.
Collection: Archive Team: URLs
.exe to something else, you
could open it in notepad! And if you renamed something else to a .exe,
you'd get a neat error dialog.
Clearly, something was different about these files. Seen from notepad, they
were mostly gibberish, but there had to be order in that chaos. 12-year-old
me knew that, although he didn't quite know how or where to dig to make sense
of it all.
So, this series is dedicated to my past self. In it we'll attempt to
understand how Linux executables are organized, how they are executed, and
how to make a program that takes an executable fresh off the linker and
compresses it - just because we can.
Since the last big series, Making our own ping,
was all about Windows, this one will be focused on 64-bit Linux.
As a reminder, Filippo's Searchable Linux Syscall Table is quite excellent. We can build and link it quite easily:X86 Assembly; in `hello.asm` global _start section .text _start: mov rdi, 1 ; stdout fd mov rsi, msg mov rdx, 9 ; 8 chars + newline mov rax, 1 ; write syscall syscall xor rdi, rdi ; return code 0 mov rax, 60 ; exit syscall syscall section .data msg: db "hi there", 10
Good! Now, our executable is about 8.68 KiB as-is. If we useShell session$ nasm -f elf64 hello.asm $ ld hello.o -o hello $ file hello hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped $ ./hello hi there
gzip -9 on it, we
can easily get that down to 372 B, so I'm kinda curious to see what's in
there, to be quite honest.
Cool bear's hot tip
This is a /genuine/ curious, not a "oh god why" curious.
This series is not about lamenting the good old days when everything used
to fit on a floppy. It's not about how modern software is bloated and slow.
It's not about to go into a tangent about developers being "lazy" and it
"not being that hard if only someone tried".
We are not developing something practical here, we are picking a challenge
purely for the sake of learning stuff.
If you need a chunky dose of nostalgia instead, feel free to check out gestures
vaguely at almost all of the internet.
If we look at an hex dump of hello, we see the ELF string early on, followed
by a bunch of binary data:
But we can pretty easily see that most of the file is made out of null (zero) bytes:Shell session$ xxd < hello | head 00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............ 00000010: 0200 3e00 0100 0000 0010 4000 0000 0000 ..>.......@..... 00000020: 4000 0000 0000 0000 3821 0000 0000 0000 @.......8!...... 00000030: 0000 0000 4000 3800 0300 4000 0600 0500 ....@.8...@..... 00000040: 0100 0000 0400 0000 0000 0000 0000 0000 ................ 00000050: 0000 4000 0000 0000 0000 4000 0000 0000 ..@.......@..... 00000060: e800 0000 0000 0000 e800 0000 0000 0000 ................ 00000070: 0010 0000 0000 0000 0100 0000 0500 0000 ................ 00000080: 0010 0000 0000 0000 0010 4000 0000 0000 ..........@..... 00000090: 0010 4000 0000 0000 2500 0000 0000 0000 ..@.....%.......
We can also find a section that contains a bunch of names, maybe they have some sort of meaning?Shell session$ xxd < hello | tail -60 | head 00001f00: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f10: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f20: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f30: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f40: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f50: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f60: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f70: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f80: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00001f90: 0000 0000 0000 0000 0000 0000 0000 0000 ................
Now, there's a whole host of tools that would let us poke this ELF file from the comfort of a terminal. Ohh yes. A slew of them. Tools and tools as far as the eye can see. But we shan't be using them today. Not today, matey. Today, we parse things ourselves. Using theShell session$ xxd < hello | tail -32 | head 000020c0: 0920 4000 0000 0000 0000 0000 0000 0000 . @............. 000020d0: 2b00 0000 1000 0200 1020 4000 0000 0000 +........ @..... 000020e0: 0000 0000 0000 0000 0068 656c 6c6f 2d6f .........hello-o 000020f0: 7269 6769 6e61 6c2e 6173 6d00 6d73 6700 riginal.asm.msg. 00002100: 5f5f 6273 735f 7374 6172 7400 5f65 6461 __bss_start._eda 00002110: 7461 005f 656e 6400 002e 7379 6d74 6162 ta._end...symtab 00002120: 002e 7374 7274 6162 002e 7368 7374 7274 ..strtab..shstrt 00002130: 6162 002e 7465 7874 002e 6461 7461 0000 ab..text..data.. 00002140: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00002150: 0000 0000 0000 0000 0000 0000 0000 0000 ................
nom crate, which we used in the
Making our own ping series, and armed with
which I would be willing to parse darn near anything - even PSD.
For that, though, we'll need a few pointers. The Wikipedia page for
ELF is not half
bad, but it's not the greatest overview - in part because it bothers itself
with 32-bit ELF, which we will, for the entire length of this series,
conveniently forget.
Cool bear's hot tip
In other words, we're going to "pull a Canonical".
So, here's what a 64-bit ELF file header looks like:
62in the file, there's an "index of entry with section names".
For our hello executable, that's...
The bytesShell session$ # -s = seek, -l = length $ xxd -s 62 -l 2 ./hello 0000003e: 0500 ..
05and 00- now, we're dealing with a little-endian file, so that means 0x0005,
which is just 5. So the fifth section header in the table contains section names.
At this point we have no idea what sections are, but I think it's safe to say
that the file is divided into them and that their beginning and size is stored
in those section headers.
The section header table starts at...
Shell session$ # -g = group size, -e = little-endian $ xxd -s 40 -l 8 -g 8 -e ./hello 00000028: 0000000000002140
0x2140, which is...
Cool bear's hot tip Even thoughShell session$ echo $((0x2140)) 8512
xxd itself doesn't support hexadecimal notation, we can
perform shell arithmetic with $((expr))!
According to the Wikipedia page on ELF,
every section header contains the offset in the file where the section data is stored... at offset 0x18!
Every section header has length 0x40, and we want the fifth one, so...
That means that the data of the section that contains section names should be atShell session$ xxd -s $((0x2140 + 0x40 * 5 + 0x18)) -l 8 -g 8 -e ./hello 00002298: 0000000000002118 .!......
0x2118. Let's check it out:
...those are the strings we noticed earlier, when randomly paging through the dump! Well, that's one mystery solved. Onto the next hundred or so. It was fun looking through the file by hand, and we learned to useShell session$ xxd -s $((0x2118)) ./hello | head -4 00002118: 002e 7379 6d74 6162 002e 7374 7274 6162 ..symtab..strtab 00002128: 002e 7368 7374 7274 6162 002e 7465 7874 ..shstrtab..text 00002138: 002e 6461 7461 0000 0000 0000 0000 0000 ..data.......... 00002148: 0000 0000 0000 0000 0000 0000 0000 0000 ................
xxd to
do exactly that (which might definitely come in handy in the nuclear winter
if a graphical hex viewer/editor isn't available), but we probably want
to start writing an actual parser now.
We'll name our library delf, for demystify ELF
For convenience, we'll add a parse module with anShell session$ cargo new --lib delf $ cd delf $ cargo add nom Adding nom v5.1.2 to dependencies
Input and Result type:
Rust code// in `delf/src/lib.rs` mod parse;
We can parse the first 16 bytes pretty trivially:Rust code// in `delf/src/parse.rs` pub type Input<'a> = &'a [u8]; pub type Result<'a, O> = nom::IResult<Input<'a>, O, nom::error::VerboseError<Input<'a>>>;
That looks reasonable. Let's make another crate to test theRust code// in `delf/src/lib.rs` #[derive(Debug)] pub struct File { // we'll add fields as we go } impl File { const MAGIC: &'static [u8] = &[0x7f, 0x45, 0x4c, 0x46]; pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{ bytes::complete::{tag, take}, error::context, sequence::tuple, }; let (i, _) = tuple(( // ------- context("Magic", tag(Self::MAGIC)), context("Class", tag(&[0x2])), context("Endianness", tag(&[0x1])), context("Version", tag(&[0x1])), context("OS ABI", nom::branch::alt((tag(&[0x0]), tag(&[0x3])))), // ------- context("Padding", take(8_usize)), ))(i)?;Ok((i, Self {})) } }
delf crate.
We'll name it "elk", for "Executable & Linker Kit"
Shell session$ cargo new --bin elk $ cargo add ../delf Adding delf (unknown version) to dependencies
And take it for a spin:Rust code// in `src/elk/main.rs` use std::{env, error::Error, fs}; fn main() -> Result<(), Box<dyn Error>> { let input_path = env::args().nth(1).expect("usage: elk FILE"); let input = fs::read(&input_path)?; delf::File::parse(&input[..]).map_err(|e| format!("{:?}", e))?; println!("input is a supported ELF file!");Ok(()) }
Good! Good start. Now we'll have to deal with some enumsShell session$ # in elk/ $ cargo build -q $ ./target/debug/elk /bin/true input is a supported ELF file!
Debug for printing, Clone and
Copy so it has copy semantics (instead of transmitting ownership),
PartialEq and Eqto compare Type values for equality.
When reading our ELF file though, we won't get aRust code// in `delf/src/lib.rs` #[derive(Debug, Clone, Copy, PartialEq, Eq)] pub enum Type { None, Rel, Exec, Dyn, Core, }
Type - we'll get a u16.
And similarly, when we write out an ELF file, we'll also need a u16.
We could do the mapping ourselves, like so:
Or we could just set the representation of theRust code// in `delf/src/lib.rs` impl Type { to_u16(&self) -> u16 { match self { Self::None => 0, Self::Rel => 1, Self::Exec => 2, Self::Dyn => 3, // etc. } } }
Type enum to u16 - and
then we'd get that for free:
Now, we can use theRust code// in `delf/src/lib.rs` #[derive(Debug, Clone, Copy, PartialEq, Eq)] #[repr(u16)] pub enum Type { None = 0x0, Rel = 0x1, Exec = 0x2, Dyn = 0x3, Core = 0x4, }
asoperator to cast our Typetou16 - let's
write a quick test to validate our assumptions:
And run it:Rust code// in `delf/src/lib.rs` #[cfg(test)] mod tests { #[test] fn type_to_u16() { assert_eq!(super::Type::Dyn as u16, 0x3); } }
Splendid! Now there's the small matter of going the other way - converting anShell session$ # t == test, --lib == only library unit tests (not doc tests) $ cargo t --lib Finished test [unoptimized + debuginfo] target(s) in 0.02s Running target/debug/deps/delf-d6fdd5529c793a0b running 1 test test tests::type_to_u16 ... ok test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
u16 to a Type. The problem here, of course, is that not all u16 values
are valid Type values.
In fact, had we written from_u16 ourselves, we would have had to deal
with this problem head-on:
The code above fails to compile, with reason:Rust code// in `delf/src/lib.rs` impl Type { pub fn from_u16(x: u16) -> Self { match x { 0 => Self::None, 1 => Self::Rel, 2 => Self::Exec, 3 => Self::Dyn, 4 => Self::Core, } } }
To deal with that possibility, we could return anShell session# b = build, -q = quiet $ cargo b -q error[E0004]: non-exhaustive patterns: `5u16..=std::u16::MAX` not covered --> src/lib.rs:17:15 | 17 | match x { | ^ pattern `5u16..=std::u16::MAX` not covered | = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
Option<Self> instead:
And that works!Rust code// in `delf/src/lib.rs` impl Type { pub fn from_u16(x: u16) -> Option<Self> { match x { 0 => Some(Self::None), 1 => Some(Self::Rel), 2 => Some(Self::Exec), 3 => Some(Self::Dyn), 4 => Some(Self::Core), _ => None, } } } #[cfg(test)] mod tests { // omitted: previous tests #[test] fn type_from_u16() { assert_eq!(super::Type::from_u16(0x3), Some(super::Type::Dyn)); assert_eq!(super::Type::from_u16(0xf00d), None); } }
...but it's kinda tedious. We already specify aShell session$ cargo t --lib Finished test [unoptimized + debuginfo] target(s) in 0.02s Running target/debug/deps/delf-d6fdd5529c793a0b running 2 tests test tests::type_to_u16 ... ok test tests::type_from_u16 ... ok test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Type <-> u16 mapping, when we
first define the enum. Why should we repeat ourselves?
As it turns out.. eyes light up in wonder there is a crate for that!
derive-try-from-primitive
solves exactly that problem. Let's give it a shot:
Shell session$ cargo add derive-try-from-primitive Adding derive-try-from-primitive v1.0.0 to dependencies
Cool bear's hot tip In a previous version of this article, we usedRust code// in `delf/src/lib.rs` use derive_try_from_primitive::TryFromPrimitive; #[derive(Debug, Clone, Copy, PartialEq, Eq, TryFromPrimitive)] #[repr(u16)] pub enum Type { None = 0x0, Rel = 0x1, Exec = 0x2, Dyn = 0x3, Core = 0x4, } #[derive(Debug, Clone, Copy, PartialEq, Eq, TryFromPrimitive)] #[repr(u16)] pub enum Machine { X86 = 0x03, X86_64 = 0x3e, } #[cfg(test)] mod tests { use super::Machine; use std::convert::TryFrom; #[test] fn try_enums() { assert_eq!(Machine::X86_64 as u16, 0x3E); assert_eq!(Machine::try_from(0x3E), Ok(Machine::X86_64)); assert_eq!(Machine::try_from(0xFA), Err(0xFA)); } }
derive-try-from-primitive v0.1.0,
which returned an Option<T>. But, since then, derive-try-from-primitive v1.0.0
was released, which returns a Result<T, E> since it implements the standard
TryFrom interface.
Sure enough, it works as expected!
Now, we can finally parse both the type and the machine. First let's add them to theShell session$ cargo t Finished test [unoptimized + debuginfo] target(s) in 0.01s Running /home/amos/ftl/elf-series/target/debug/deps/delf-2d44b198a598eda8 running 1 test test tests::try_enums ... ok test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out Doc-tests delf running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
File struct:
But wait... isn'tRust code// in `delf/src/lib.rs` #[derive(Debug)] pub struct File { pub type: Type, pub machine: Machine, }
type a keyword?
It is! We could useShell session$ cargo b -q error: expected identifier, found keyword `type` --> src/lib.rs:24:5 | 24 | type: Type, | ^^^^ expected identifier, found keyword | help: you can escape reserved keywords to use them as identifiers | 24 | r#type: Type, | ^^^^^^
typ instead - but just this once, let's use the
escaped form instead, like the compiler suggests.
Better! Kinda awkward, but at least this compiles. Let's parse:Rust code#[derive(Debug)] pub struct File { pub r#type: Type, pub machine: Machine, }
To try that, we'll switch back to our binary crate,Rust code// in `delf/src/lib.rs` // new! use std::convert::TryFrom; impl File { // omitted: magic constant pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{ bytes::complete::{tag, take}, error::context, sequence::tuple, combinator::map, number::complete::le_u16, }; // omitted: parse magic, etc. let (i, (r#type, machine)) = tuple(( context("Type", map(le_u16, |x| Type::try_from(x).unwrap())), context("Machine", map(le_u16, |x| Machine::try_from(x).unwrap())), ))(i)?; let res = Self { machine, r#type };Ok((i, res)) } }
elk, for a second,
and print the File struct, now that it has fields!
Rust code// in `elk/src/main.rs` fn main() -> Result<(), Box<dyn Error>> { let input_path = env::args().nth(1).expect("usage: elk FILE"); let input = fs::read(&input_path)?; let (_, file) = delf::File::parse(&input[..]).map_err(|e| format!("{:?}", e))?; // new! println!("{:#?}", file);Ok(()) }
Wonderful! What happens if we run it on a 32-bit ELF?Shell session$ # in elk/ $ cargo b -q $ ./target/debug/elk /bin/true File { type: Dyn, machine: X86_64, }
Oh my. That's too much output. And we'd rather see hexadecimal notation here, wouldn't we? Let's make a quick helper:Shell session$ ./target/debug/elk /usr/lib32/libc.so Error: "Error(VerboseError { errors: [([47, 42, 32, 71, 78, 85, 32, 108, 100, 32, 115, 99, 114, 105, 112, 116, 10, 32, 32, 32, 85, 115, 101, 32, 116, 104, 101, 32, 115, 104, 97, 114, 101, 100, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 98, 117, 116, 32, 115, 111, 109, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 114, 101, 32, 111, 110, 108, 121, 32, 105, 110, 10, 32, 32, 32, 116, 104, 101, 32, 115, 116, 97, 116, 105, 99, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 115, 111, 32, 116, 114, 121, 32, 116, 104, 97, 116, 32, 115, 101, 99, 111, 110, 100, 97, 114, 105, 108, 121, 46, 32, 32, 42, 47, 10, 79, 85, 84, 80, 85, 84, 95, 70, 79, 82, 77, 65, 84, 40, 101, 108, 102, 51, 50, 45, 105, 51, 56, 54, 41, 10, 71, 82, 79, 85, 80, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 46, 115, 111, 46, 54, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 95, 110, 111, 110, 115, 104, 97, 114, 101, 100, 46, 97, 32, 32, 65, 83, 95, 78, 69, 69, 68, 69, 68, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 100, 45, 108, 105, 110, 117, 120, 46, 115, 111, 46, 50, 32, 41, 32, 41, 10], Nom(Tag)), ([47, 42, 32, 71, 78, 85, 32, 108, 100, 32, 115, 99, 114, 105, 112, 116, 10, 32, 32, 32, 85, 115, 101, 32, 116, 104, 101, 32, 115, 104, 97, 114, 101, 100, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 98, 117, 116, 32, 115, 111, 109, 101, 32, 102, 117, 110, 99, 116, 105, 111, 110, 115, 32, 97, 114, 101, 32, 111, 110, 108, 121, 32, 105, 110, 10, 32, 32, 32, 116, 104, 101, 32, 115, 116, 97, 116, 105, 99, 32, 108, 105, 98, 114, 97, 114, 121, 44, 32, 115, 111, 32, 116, 114, 121, 32, 116, 104, 97, 116, 32, 115, 101, 99, 111, 110, 100, 97, 114, 105, 108, 121, 46, 32, 32, 42, 47, 10, 79, 85, 84, 80, 85, 84, 95, 70, 79, 82, 77, 65, 84, 40, 101, 108, 102, 51, 50, 45, 105, 51, 56, 54, 41, 10, 71, 82, 79, 85, 80, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 46, 115, 111, 46, 54, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 105, 98, 99, 95, 110, 111, 110, 115, 104, 97, 114, 101, 100, 46, 97, 32, 32, 65, 83, 95, 78, 69, 69, 68, 69, 68, 32, 40, 32, 47, 117, 115, 114, 47, 108, 105, 98, 51, 50, 47, 108, 100, 45, 108, 105, 110, 117, 120, 46, 115, 111, 46, 50, 32, 41, 32, 41, 10], Context(\"Magic\"))] })"
And try it. Remember there are two failure modes inRust code// in `delf/src/lib.rs` pub struct HexDump<'a>(&'a [u8]); use std::fmt; impl<'a> fmt::Debug for HexDump<'a> { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { for &xin self.0.iter().take(20) { write!(f, "{:02x} ", x)?; } Ok(()) } }
nom:
●nom::Err::Error is recoverable. Maybe another branch will work
(if we're trying several parsers in a row), or maybe we just
need to get more input.
●nom::Err::Failure is unrecoverable - we've tried all the parsers,
more input won't help, something is just plain wrong.
In either case, they just wrap a nom::error::VerboseError, which itself
can contain multiple nom::error::ErrorKind, along with the relevant input slice.
I know, I know - that's a lot! But thanks to pattern matching, it's not so bad.
We can make a much better error printer in just a few lines. We'll add it
directly in delf, so that we don't have use nom types in elk:
Now we can use it fromRust code// in `delf/src/lib.rs` impl File { pub fn parse(i: parse::Input) -> parse::Result<Self> { let original_i = i; use nom::{ error::{ErrorKind, ParseError, VerboseError}, number::complete::le_u16, Err, }; let (i, x) = le_u16(i)?; match Self::try_from(x) { Ok(res) =>Ok((i, res)), Err(_) => Err(Err::Failure(VerboseError::from_error_kind( original_i, ErrorKind::Alt, ))), } } }
elk:
Rust code// in `elk/src/main.rs` fn main() -> Result<(), Box<dyn Error>> { // omitted: grabbing first argument let file = match delf::File::parse_or_print_error(&input[..]) { Some(f) => f, None => std::process::exit(1), }; println!("{:#?}", file); }
Wonderful! Well. It could be improved a lot - for now we'll just add the position of each error in the file, as a byte offset:Shell session$ cargo b -q $ ./target/debug/elk /usr/lib32/libc.so Parsing failed: Nom(Tag) at: 2f 2a 20 47 4e 55 20 6c 64 20 73 63 72 69 70 74 0a 20 20 20 Context("Magic") at: 2f 2a 20 47 4e 55 20 6c 64 20 73 63 72 69 70 74 0a 20 20 20
And here's the result:Rust code// in `delf/src/lib.rs` impl File { pub fn parse_or_print_error(i: parse::Input) -> Option<Self> { match Self::parse(i) { Ok((_, file)) => Some(file), Err(nom::Err::Failure(err)) | Err(nom::Err::Error(err)) => { eprintln!("Parsing failed:"); for (input, err) in err.errors { // new! use nom::Offset; let offset = i.offset(input); eprintln!("{:?} at position {}:", err, offset); eprintln!("{:>08x}: {:?}", offset, HexDump(input)); } None } Err(_) => panic!("unexpected nom error"), } }
Cool bear's hot tip We could get much, much nicer output, but let's not get distracted - there is so much to do! If you want to read more about error handling (and pretty printing!) withShell session$ cargo b -q && ./target/debug/elk /usr/lib32/libz.so Parsing failed: Nom(Tag) at position 4: 00000004: 01 01 01 00 00 00 00 00 00 00 00 00 03 00 03 00 01 00 00 00 Context("Class") at position 4: 00000004: 01 01 01 00 00 00 00 00 00 00 00 00 03 00 03 00 01 00 00 00
nom, read Making our own ping: Part 9.
Let's not waste any more time and move forward with our parser. One thing
I don't like about our current parser is that it panics if it encounters
say, a value for Machine it doesn't support.
We're going to allow ourselves to panic in elk - the binary crate, our CLI
tool, but we'll avoid panicking in delf - the library crate. Luckily, we've
already learned how to do that.
In our case, we want to implement parse on the Machine type. It should
return a parse::Result - like any other parser, and if we find a variant
we don't support, we shall return one of the built-in nom error kinds, say...
Alt. Because what are enums, if not a miserable little pile of alternatives?
Okay. So. That's a little bit on the verbose side. Also, there's no context! We won't know which enum failed to parse. I'm sure we can do a bit better than that. We've used theRust code// in `delf/src/lib.rs` impl Machine { pub fn parse(i: parse::Input) -> parse::Result<Self> { let original_i = i; use nom::{ error::{ErrorKind, ParseError, VerboseError}, number::complete::le_u16, Err, }; let (i, x) = le_u16(i)?; match Self::try_from(x) { Some(res) =>Ok((i, res)), None => Err(Err::Failure(VerboseError::from_error_kind( original_i, ErrorKind::Alt, ))), } } }
map combinator before, but it won't work here - map works when
all outputs of the previous parser are valid, whereas in our case, try_from can fail.
Luckily, there's another combinator, map_res, that fits our needs. It allows
us to return a Result<O2, E2> instead - where O2 is the output of our mapping
function, and E2 is the errors our mapping function can return.
Better! NowRust code// in `delf/src/lib.rs` impl Machine { pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{combinator::map_res, error::ErrorKind, number::complete::le_u16}; map_res(le_u16, |x| match Self::try_from(x) { Ok(x) =>Ok(x), Err(_) => Err(ErrorKind::Alt), })(i) } }
nom takes care of making a "verbose error", and picks an Err
variant for us. It even annotates with the location, since map_res has a
reference to the input before le_u16 even runs.
But our code isn't ideal right now - try_from returns a Result<T, E>, and
we're using pattern matching to turn it into another Result<T, E>. And as
much as I like pattern matching, there's a better tool for the job: map_err.
Let's look at its prototype (keep in mind it's implemented on Result<T, E>):
So, if the Result is a theRust code// rust standard library pub fn map_err<F, O>(self, op: O) -> Result<T, F> where O: FnOnce(E) -> F
Err variant, map_err will call the function we
provide, op, to change it to another error type - from EtoF.
This means we can just write:
We can even add some context now. It's all just combinators!Rust code// in `delf/src/lib.rs` impl Machine { pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{combinator::map_res, error::ErrorKind, number::complete::le_u16}; map_res(le_u16, |x| Self::try_from(x).map_err(|_| ErrorKind::Alt))(i) } }
And even though this implementation is much more pleasant, I wouldn't want it to write it by hand more than once, so.. let's make a macro! We'll slap it in theRust codeimpl Machine { pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{ combinator::map_res, error::{context, ErrorKind}, number::complete::le_u16, }; context( "Machine", map_res(le_u16, |x| Self::try_from(x).map_err(|_| ErrorKind::Alt)), )(i) } }
parse module:
And now we can implementRust code// in `delf/src/parse.rs` // omitted: definitions for `Input` and `Result` #[macro_export] macro_rules! impl_parse_for_enum { ($type: ident, $number_parser: ident) => { impl $type { pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{ combinator::map_res, error::{context, ErrorKind}, number::complete::$number_parser, }; let parser = map_res($number_parser, |x| { Self::try_from(x).map_err(|_| ErrorKind::Alt) }); context(stringify!($type), parser)(i) } } }; }
parse for both Type and Machine:
And finally use those in our main parser:Rust code// in `delf/src/lib.rs` // omitted: struct definitions for `Type` and `Machine` impl_parse_for_enum!(Type, le_u16); impl_parse_for_enum!(Machine, le_u16);
Phew. I feel lighter already. Nothing like a good abstraction to clear the mind. Does everything still work?Rust codeimpl File { const MAGIC: &'static [u8] = &[0x7f, 0x45, 0x4c, 0x46]; pub fn parse(i: parse::Input) -> parse::Result<Self> { // note: you can remove `map` and `le_u16` imports // omitted: rest of parser // before: let (i, (r#type, machine)) = tuple(( context("Type", map(le_u16, |x| Type::try_from(x).unwrap())), context("Machine", map(le_u16, |x| Machine::try_from(x).unwrap())), ))(i)?; // after: let (i, (r#type, machine)) = tuple((Type::parse, Machine::parse))(i)?; let res = Self { machine, r#type };Ok((i, res)) } }
It does! Good, good. I guess it's time to parse the rest of the ELF header. Next up is theShell session$ cargo b -q && ./target/debug/elk /bin/true File { type: Dyn, machine: X86_64, }
entry point - in ELF64, it's.. a 64-bit integer. Well, it's
more of a memory address. We'll want to print that in hexadecimal. So let's make
a wrapper type!
We might end up doing some arithmetic with those addresses, so we'll quickly
grab a crate to derive more traits: derive_more:
Shell session$ cargo add derive_more Adding derive_more v0.99.9 to dependencies
Well, that was a mouthful. But it should be smooth sailing from now on! We can add it to ourRust code// in `delf/src/lib.rs` use derive_more::*; // "Add" and "Sub" are in `derive_more` #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Add, Sub)] pub struct Addr(pub u64); impl fmt::Debug for Addr { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { write!(f, "{:08x}", self.0) } } impl fmt::Display for Addr { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { fmt::Debug::fmt(self, f) } } // This will come in handy when serializing impl Into<u64> for Addr { fn into(self) -> u64 { self.0 } } // This will come in handy when indexing / sub-slicing slices impl Into<usize> for Addr { fn into(self) -> usize { self.0 as usize } } // This will come in handy when parsing impl From<u64> for Addr { fn from(x: u64) -> Self { Self(x) } } impl Addr { pub fn parse(i: parse::Input) -> parse::Result<Self> { use nom::{combinator::map, number::complete::le_u64}; map(le_u64, From::from)(i) } }
File struct, and our parser:
And then giveRust code// in `delf/src/lib.rs` #[derive(Debug)] pub struct File { pub r#type: Type, pub machine: Machine, // new! pub entry_point: Addr, } impl File { pub fn parse(i: parse::Input) -> parse::Result<Self> { // omitted: most of parser use nom::{combinator::verify, number::complete::le_u32}; let (i, (r#type, machine)) = tuple((Type::parse, Machine::parse))(i)?; // this 32-bit integer should always be set to 1 in the current // version of ELF, see the diagram. We don't *have* to check it, // but it's so easy to, let's anyway! let (i, _) = context("Version (bis)", verify(le_u32, |&x| x == 1))(i)?; let (i, entry_point) = Addr::parse(i)?; let res = Self { machine, r#type, entry_point, };Ok((i, res)) } }
elk a go! We don't need to change anything in it -
it refers to delf by path, and we've updated delf's File to have a
new field, so we just need to build and run it again:
Nice! But what doesShell session$ cargo b -q && ./target/debug/elk /bin/true File { type: Dyn, machine: X86_64, entry_point: 00002130, }
entry_point mean anyway?
Our diagram said it's the "address at which program execution starts".
For example, if we run it on our sample assembly program from above:
How can we verify our assumption? Well, we can debug our program! When researching this article series, I went hunting for a good GDB GUI - and I'm sorry to report that the state of GDB graphical user interfaces is... not great. I mean, Nemiver still runs, but I didn't have to look very far to find several annoying bugs. So, naturally, I went hunting on https://crates.io, and found... ugdb! It's a TUI (text user interface), but it's nicer than the built-in GDB TUI, has syntax highlighting, a cozy default pane layout. I like it a lot! To install it, simply run:Shell session$ cargo b -q && ./target/debug/elk samples/hello File { type: Exec, machine: X86_64, entry_point: 00401000, }
We can then load upShell session$ # note: you'll need cmake. use --force if updating $ cargo install ugdb
samples/helloinugdb simply with:
And be greeted with this:Shell session$ ugdb ./samples/hello
Now, if we just use the start GDB command, our program will run entirely,
and we'll see its output in the right pane:
(gdb) start
But that's not very interesting. GDB will not break anywhere in our program by
default, because our program doesn't crash - it just makes two syscalls:
write and exit.
Cool bear's hot tip
To quit GDB (and similarly, ugdb), use the quit command (orq, for short). If it asks
for a confirmation, reply with y(and press enter).
We can't set breakpoint on syscalls... but we can set catchpoints!
(gdb) catch syscall Catchpoint 1 (any syscall) (gdb) start
Nice! GDB paused program execution right before the write syscall was executed.
We can already see in this screenshot that the beginning of _start - our entry point -
is at 0x401000. Everything checks out so far!
Another way we could've checked this is to set a breakpoint at address 0x401000,
like so:
(gdb) break *0x401000 Breakpoint 1 at 0x401000 (gdb) start
While in ugdb, we can use the stepi command (orsi for short) to advance to
the next instruction.
Here's a quick video:
There's one thing that bothers me though - the disassembly shown in ugdb doesn't
really look like our source program. We wrote mov rdi, 1, not mov $0x1, %edi.
The explanation is simple: GDB is showing us AT&T assembly syntax, whereas we got
used to Intel syntax. Luckily, we can adjust that by creating a ~/.gdbinit file:
Now GDB (and ugdb) shows us disassembly in Intel syntax!# in ~/.gdbinit set disassembly-flavor intel
Let's try doing all that on /bin/true - the entry point for it was 0x2130, so,
Oh. That's unfortunate. What address does(gdb) break *0x2130 (gdb) start Starting program: /usr/bin/true Warning: Cannot insert breakpoint 1. Cannot access memory at address 0x2130
/bin/true start at, then?
And how can we break at the entry point, if it's not 0x2130?
Recent versions of gdb come with the starti command, which does exactly that.
We can use it to inspect at which address /bin/true actually starts:
Oh. 0x7ffff7fd4100. Okay. Well, that's 140 terabytes. I sorta doubt /bin/true
needs that much memory to proceed, but uh...
Okay, forget /bin/true for a minute - let's make a C program that prints
the address of main. That should be the entry point, right?
C code// in `elk/samples/entry_point.c` #include <stdio.h> int main() { printf("main is at %p\n", &main); }
Okay. Uh. Let's try gdb?Shell session$ # in `elk/samples/` $ gcc entry_point.c -o entry_point $ ../target/debug/elk File { type: Dyn, machine: X86_64, entry_point: 00001040, } $ ./entry_point main is at 0x562601e0f139 $ ./entry_point main is at 0x56311bd0c139 $ ./entry_point main is at 0x5588a0a39139 $ ./entry_point main is at 0x559c3e3df139 $ ./entry_point main is at 0x557578dc4139 $ ./entry_point main is at 0x55c80e189139
Well. That doesn't answer any of my questions. If anything, I have more questions now. ●Why is the entry point stored in an ELF file sometimes the same one we see in GDB and sometimes not? ●Why is the entry point a valid offset into the ELF file and sometimes not? ●Why is the entry point always the same in GDB but not when running the executable directly? Clearly, we have a lot more detective work to do... What did we learn? The ELF binary file format is old, but still in use today! It contains information about the Operating System's Application Binary Interface (OS ABI) and the processor architecture (ie. the(gdb) starti Starting program /home/amos/ftl/elk/samples/entry_point Program stopped. 0x0000000000401ac0 in _start () (gdb) c Continuing. main is at 0x401be5 [Inferior 1 (process 7394) exited normally] (gdb) starti Starting program: /home/amos/ftl/elk/samples/entry_point Program stopped. 0x0000000000401ac0 in _start () (gdb) c Continuing. main is at 0x401be5 [Inferior 1 (process 7397) exited normally] (gdb) starti Starting program: /home/amos/ftl/elk/samples/entry_point Program stopped. 0x0000000000401ac0 in _start () (gdb) c Continuing. main is at 0x401be5 [Inferior 1 (process 7398) exited normally]
Machine), among other things.
It is easily parsed with nom. For Rust
enums, we can use the
derive-try-from-primitive
crate to convert back and forth between the integer representation (u8, u16, u32) and the Rust enum variant.
The ELF header also contains the "entry point" for a program - where it should start executing. Although, when we run a program, we sometimes find a different address
being executed.
We can use xxd to visualize the contents of a binary file - it's a
"hexdump" (hexadecimal.. printer) tool. It accepts options such as: offset,
length, endianness, and how many bytes to group together when printing. It does
not understand hexadecimal offsets, but we can use shell arithmetic - $(()) -
to do that.
GDB, the GNU debugger, will let us start a program and step through it, even
if doesn't have debug or source information - in that case, it will show us
disassembly of the program. We can even ask GDB to use Intel syntax when
showing assembly. ugdb is a nice text UI frontend
to GDB.
This article was made possible thanks to my patrons:
Alex Rudy, Romet Tagobert, Douglas Creager, Gus W, Corey Alexander, Molly
Howell, knutwalker, Zachary Dremann, Sebastian Ziebell, Amber Kowalski, T,
Juniper Wilde, Paul Kline, Kristoffer Ström, Astrid Bek, Yoh Deadfall, Justin
Ossevoort, taziden, Harsh Shandilya, Tomáš Duda, Jeremy Banks, Rasmus Larsen,
Torben Clasen, Sam Rose, C J Silverio, Walther, Pete Bevin, Shane Sveller,
Thomas Schultz, Ivan Dubrov, jer, Wonwoo Choi, João Veiga, Adam Perry,
Benjamin Röjder Delnavaz, Justin Liew, Matt Jadczak, tavr, Mara Bos, Jonathan
Knapp, Mayfield Reynolds, Maximilian, Seth Stadick, brianloveswords, Sean
Bryant, Sebastian Zimmer, Fernando, Makoto Nakashima, Geert Depuydt, Geoff
Cant, Geoffroy Couprie, Michael Alyn Miller, o0Ignition0o, Zaki, Raphael
Gaschignard, Romain Ruetschi, Ignacio Vergara, Pascal, Jane Lusby, Nicolas
Goy, Ted Mielczarek, Someone, Jérémy Gtld, Ryszard Sommefeldt, Jesús
Higueras, Aurora.
This article is part 1 of the
Making our own executable packer
series.
Read the next part
Discuss this article on /r/fasterthanlime
If you liked this article, please support my work on Patreon!
Become a Patron
Looking for the homepage?
Another article: Working with strings in Rust
@fasterthanlime
—
/r/fasterthanlime
— Patreon
— Atom feed