Lecture 2: A First Compiler – Neonate + x86 Basics
Today we’re going to implement a compiler. It will be called Neonate, because it’s fun to name things and the name will fit a theme in future weeks.
It’s not going to be terrifically useful, as it will only compile a very small
language: programs that consist of a single integer.
1 The Big Picture
The heart of each compiler we write will be a Rust program that takes an input program and generates assembly code. That leaves open a few questions:
How will the input program be handed to, and represented in, Rust?
How will the generated assembly code be run?
Our answer to the first question is going to be simple for today: we’ll expect
that all programs are files containing a single integer, so there’s little
“front-end” for the compiler to consider. Most of this lab is about the
second question: how to run the generated assembly code.
2 The Wrapper
(The idea here is taken directly from Abdulaziz Ghuloum.)
Our model for the code we generate is that it will start from a C-style function call. This allows us to do a few things:
We can use a Rust program as the wrapper around our code, which makes it somewhat more cross-platform than it would be otherwise
We can defer some details to our Rust wrapper that we want to skip or leave until later
So, our wrapper will be a Rust program stub.rs with a traditional main that calls a
function that we will define with our generated code:
#[link(name = "compiled_code")]
extern "C" {
    fn start_here() -> i64;
}

fn main() {
    let output = unsafe { start_here() };
    println!("Assembly code returned: {}", output);
}
So right now, our compiled program had better return an integer, and
our wrapper will handle printing it out for us. The extern block
tells the Rust compiler that we are expecting:
in a library called "compiled_code"
    #[link(name = "compiled_code")]
...there will be some functions using the C calling convention
    extern "C"
...specifically, one called "start_here", which expects no arguments and returns an i64.
The main function is mostly normal, except it uses an unsafe
block. Rust as a language was designed to have nice programming
properties like memory safety, but when we call external libraries we
implemented in assembly code, the compiler can no longer guarantee
that those libraries respect Rust’s invariants. So when we call
external functions, we have to wrap them in an unsafe block to
tell the Rust compiler we are willing to accept the risks of stepping
outside the nice guarantees of safe Rust. For this course, our
compiler will never use unsafe code, but our runtime system will use
it a great deal because it is interacting directly with our compiled
assembly code.
If we try to compile stub.rs now, we get an error:
$ rustc stub.rs
...
note: ld: library not found for -lcompiled_code
This says that the linker couldn’t find a library with the name "compiled_code". So let’s implement one!
3 Hello, x64
Our next goal is to:
Write an assembly program that defines start_here
Link that program with stub.rs and create an executable
In order to write assembly, we need to pick a syntax and an instruction set. We’re going to generate 64-bit x64 assembly, and use the so-called Intel syntax (there’s also an AT&T syntax, for those curious), because I like a particular guide that uses the Intel syntax, and because it works with the particular assembler we’ll use.
Here’s a very simple assembly program, matching the above constraints,
that will act like a C function of no arguments and return a constant
number (37) as the return value (on Mac OS X, you will need to write
_start_here, with an extra underscore):
section .text
global start_here
start_here:
mov rax, 37
ret
The pieces mean, line by line:
section .text
    Here comes some code, in text form!
global start_here
    This assembly code defines a globally-accessible symbol called start_here. This is what makes it so that when we generate an object file later, the linker will know what names come from where.
start_here:
    Here’s where the code for this symbol starts. If other code jumps to start_here, this is where it begins.
mov rax, 37
    Take the constant number 37 and put it in the register called rax. This register is the one that compiled C programs expect to find return values in, so we should put our “answer” there.
ret
    Do mechanics related to managing the stack, which we will talk about in much more detail later, then jump to wherever the caller of start_here left off.
We can put this in a file called compiled_code.s (.s is a typical extension for
assembly code), and then we just need to know how to assemble and link it with
the main we wrote.
4 Hello, nasm
We will be using a program called nasm as our
assembler, because it works well across a few platforms, and is simple to use.
The main way we will use it is to take assembly (.s) files and turn them
into object (.o) files. The command we’ll use to build with nasm (on Linux) is:
$ nasm -f elf64 -o compiled_code.o compiled_code.s
This creates a file called compiled_code.o in
Executable and Linkable Format (ELF).
We won’t go into detail about this binary structure. For our
purposes, it’s simply a version of the assembly we wrote that our
particular operating system understands.
If you are on OSX, you can use -f macho64 rather than -f elf64,
which will produce an OSX-compatible object file. If you are
on Windows, you can try -f win64 and share on Piazza if it
works.
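The choice of object format can also be made in code rather than by hand. The following is a minimal sketch, not part of the course code: the helper names nasm_format and nasm_args are hypothetical, and the operating system strings are assumed to be the ones reported by std::env::consts::OS ("linux", "macos", "windows"). As noted above, the win64 case is untested.

```rust
/// Pick the nasm `-f` value for an operating system name.
/// Returns None for systems these notes do not cover.
fn nasm_format(os: &str) -> Option<&'static str> {
    match os {
        "linux" => Some("elf64"),
        "macos" => Some("macho64"),
        "windows" => Some("win64"), // untested, as noted in the text
        _ => None,
    }
}

/// Build the argument list for the nasm invocation shown above:
/// nasm -f <format> -o <obj_file> <asm_file>
fn nasm_args(os: &str, asm_file: &str, obj_file: &str) -> Option<Vec<String>> {
    let format = nasm_format(os)?;
    Some(vec![
        "-f".to_string(),
        format.to_string(),
        "-o".to_string(),
        obj_file.to_string(),
        asm_file.to_string(),
    ])
}
```

A caller could pass std::env::consts::OS as the first argument and hand the resulting list to std::process::Command::new("nasm").args(...).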
Next, to link with Rust code, we need to turn our object file into the
type of file rustc expects for libraries. We will use a static library
so that our assembled code is put directly into the executable
file. On Mac and Linux this means producing an archive file
libcompiled_code.a using the following command:
$ ar r libcompiled_code.a compiled_code.o
Finally, we need to compile our Rust file while informing the compiler
to look for libraries in the current directory (-L):
$ rustc stub.rs -L .
This builds an executable named stub (rustc names it after the source file), which we can run:
$ ./stub
Assembly code returned: 37
5 Hello, Compiler
With this pipeline in place, the only step left is to write a Rust program that can generate assembly programs. Then we can automate the process and get a pipeline from user program all the way to executable.
A very simple compiler might just take the name of a file, and output the
compiled assembly code on standard output. Let’s try that; here’s a simple
main.rs that takes a file as a command line argument, expects it to
contain a single integer on one line, and generates the corresponding assembly
code:
type AST = i64;

fn main() {
    use std::fs;
    let args: Vec<String> = std::env::args().collect(); // get the program arguments as a Vec<String>
    let inp = fs::read_to_string(&args[1]).unwrap(); // read the file named by args[1] into a String
    let num = parse(&inp).unwrap();
    print!("{}", compile(num));
}

fn parse(s: &str) -> Result<AST, String> {
    match i64::from_str_radix(s.trim(), 10) { // .trim() removes leading and trailing whitespace
        Ok(x) => Ok(x),
        Err(e) => Err(e.to_string())
    }
}

fn compile(n: AST) -> String { // Add _ to the front of the label for Mac OS X
    format!("\
section .text
global start_here
start_here:
mov rax, {}
ret\n",
            n)
}
Make a new cargo project and put this into src/main.rs, then
create another file 2021.int that contains just the number
2021, then run:
$ cargo run 2021.int
...
section .text
global start_here
start_here:
mov rax, 2021
ret
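The Mac OS X underscore mentioned earlier can be handled by the compiler instead of being edited in by hand. This is a sketch under assumptions: entry_label and compile_for are hypothetical names, and the macos flag is something the caller supplies. It is not how the course's compile function is defined.

```rust
/// Symbol name for the entry point. On Mac OS X the label needs a
/// leading underscore; `macos` is a flag supplied by the caller.
fn entry_label(macos: bool) -> &'static str {
    if macos { "_start_here" } else { "start_here" }
}

/// Same output as `compile` in the text, except that the label
/// depends on the target platform.
fn compile_for(n: i64, macos: bool) -> String {
    let label = entry_label(macos);
    format!(
        "section .text\nglobal {label}\n{label}:\nmov rax, {n}\nret\n",
        label = label,
        n = n
    )
}
```

One way to choose the flag automatically is compile_for(num, cfg!(target_os = "macos")), which picks the label for the platform the compiler itself was built on.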
How exciting! We can redirect the output to a file, and get an entire
pipeline of compilation to work out (assuming stub.rs is in
the same directory):
$ cargo run 2021.int > 2021.s
$ nasm -f elf64 -o 2021.o 2021.s
$ ar r libcompiled_code.a 2021.o
$ rustc stub.rs -L . -o 2021.run
$ ./2021.run
Assembly code returned: 2021
Then we can use Makefiles or custom scripts to pipe these all together.
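As one possible custom script, the same steps could be driven from Rust with std::process::Command. This is only a sketch: build_steps and run_pipeline are hypothetical names, the Linux elf64 format is hard-coded, and it assumes cargo, nasm, ar, and rustc are on the PATH and that stub.rs is in the current directory.

```rust
use std::fs;
use std::io;
use std::process::Command;

/// Convert a slice of string literals and borrowed strings to owned Strings.
fn strings(xs: &[&str]) -> Vec<String> {
    xs.iter().map(|x| x.to_string()).collect()
}

/// The assemble, archive, and link steps from the listing above, as
/// (program, arguments) pairs. `stem` is the input name without its
/// extension, e.g. "2021".
fn build_steps(stem: &str) -> Vec<(&'static str, Vec<String>)> {
    let asm = format!("{}.s", stem);
    let obj = format!("{}.o", stem);
    let exe = format!("{}.run", stem);
    vec![
        ("nasm", strings(&["-f", "elf64", "-o", obj.as_str(), asm.as_str()])),
        ("ar", strings(&["r", "libcompiled_code.a", obj.as_str()])),
        ("rustc", strings(&["stub.rs", "-L", ".", "-o", exe.as_str()])),
    ]
}

/// Run the compiler on `<stem>.int`, save its standard output to
/// `<stem>.s`, then run each build step, stopping at the first failure.
fn run_pipeline(stem: &str) -> io::Result<()> {
    let compiled = Command::new("cargo")
        .arg("run")
        .arg(format!("{}.int", stem))
        .output()?;
    if !compiled.status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "compiler failed"));
    }
    fs::write(format!("{}.s", stem), &compiled.stdout)?;
    for (program, args) in build_steps(stem) {
        let status = Command::new(program).args(&args).status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("{} failed", program),
            ));
        }
    }
    Ok(())
}
```

Keeping the command list in build_steps, separate from the code that runs it, makes the steps easy to inspect or print without invoking any external tools.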
Of course, this is “just” a bunch of boilerplate. It got us to the point
where we have a Rust program that’s defining our translation from input
program to assembly code. Our input programs are pretty boring, so those will
need to get more sophisticated, and correspondingly the function compile
will need to become more impressive. That’s where our focus will be in the
coming weeks.
6 x86-64 Basics
x86-64 has 16 general-purpose registers, each of which holds a 64-bit value:
rax
rcx
rdx
rbx
rsp
rbp
rsi
rdi
r8-r15
We will learn more about them as we dive deeper into the stack and
calling conventions, but for today, we just need to know that
rax is where return values go in the C calling convention we
use to interface with Rust.
We also discussed two instructions in more depth: mov and add.
The basic semantics of mov x, y are that it moves
whatever is in y to x. x and y might be registers, memory
references or immediates. Only the following 5 combinations make
sense:
mov reg, reg
    move from a register to another register
mov reg, imm64
    move a 64-bit integer value into a register
mov reg, mem
    load from memory into a register
mov mem, reg
    store the contents of a register at a memory location
mov mem, imm32
    store a 32-bit integer value at a memory location
Note that in particular we cannot directly move from one memory location into another.
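The operand rules for mov can be written down as a small check. This is an illustrative sketch only: the Operand type and the functions fits_imm32 and mov_is_valid are hypothetical names, and the model ignores which register or address is involved. It assumes a 32-bit immediate means a value in the signed 32-bit range.

```rust
/// Operand kinds from the list above. `Imm` carries its value so we
/// can check whether it fits in 32 bits.
#[derive(Clone, Copy, Debug)]
enum Operand {
    Reg,
    Mem,
    Imm(i64),
}

/// True when `n` lies in the signed 32-bit range.
fn fits_imm32(n: i64) -> bool {
    n >= i64::from(i32::MIN) && n <= i64::from(i32::MAX)
}

/// True for exactly the five `mov dst, src` combinations listed above.
fn mov_is_valid(dst: Operand, src: Operand) -> bool {
    use Operand::*;
    match (dst, src) {
        (Reg, Reg) | (Reg, Mem) | (Mem, Reg) => true,
        (Reg, Imm(_)) => true,          // any 64-bit immediate
        (Mem, Imm(n)) => fits_imm32(n), // only 32-bit immediates
        (Mem, Mem) => false,            // no memory-to-memory move
        (Imm(_), _) => false,           // an immediate is never a destination
    }
}
```

For instance, mov_is_valid(Operand::Mem, Operand::Mem) is false, matching the restriction on memory-to-memory moves.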
Next, add x, y acts like the += operation: its semantics is
to put x + y in x. The combinations allowed for add
are quite similar to mov, with one notable exception:
add reg, reg
add reg, imm32
    add a 32-bit integer value to a register value
add reg, mem
add mem, reg
add mem, imm32
add only allows for a 32-bit integer immediate to be added to a
register. In fact, mov is unique among the instructions we will
use in that it allows for a full 64-bit immediate.
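One consequence for a code generator is that adding a large constant takes two instructions: a mov of the 64-bit immediate into a register, then a register-to-register add. The sketch below illustrates this. The function name emit_add_const is hypothetical, and using rcx as the scratch register is an assumption of this example, not a convention from these notes.

```rust
/// Emit assembly that adds the constant `n` to rax.
/// Assumption: rcx is free to use as a scratch register.
fn emit_add_const(n: i64) -> String {
    // add only accepts immediates in the signed 32-bit range
    let fits_imm32 = n >= i64::from(i32::MIN) && n <= i64::from(i32::MAX);
    if fits_imm32 {
        format!("add rax, {}\n", n)
    } else {
        // mov accepts a full 64-bit immediate, so load it first
        format!("mov rcx, {}\nadd rax, rcx\n", n)
    }
}
```

Small constants keep the one-instruction form; only constants outside the 32-bit range pay for the extra mov.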