Lecture 2: A First Compiler -- Neonate + x86 Basics

Today we’re going to implement a compiler. It will be called Neonate, because it’s fun to name things and the name will fit a theme in future weeks.

It’s not going to be terrifically useful, as it will only compile a very small language — integers. That is, it will take a user program (a number), and create an executable binary that prints the number. There are no files in this repository because the point of the lab is for you to see how this is built up from scratch. That way, you’ll understand the infrastructure that future assignments’ support code will use.

1 The Big Picture

The heart of each compiler we write will be a Rust program that takes an input program and generates assembly code. That leaves open a few questions:

How will the input program be handed to, and represented in, Rust?
How will the generated assembly code be run?

Our answer to the first question is going to be simple for today: we’ll expect that all programs are files containing a single integer, so there’s little “front-end” for the compiler to consider. Most of this lab is about the second question — how we take our generated assembly and meaningfully run it while avoiding both (a) the feeling that there’s too much magic going on, and (b) getting bogged down in system-level details that don’t enlighten us about compilers.

2 The Wrapper

(The idea here is taken directly from Abdulaziz Ghuloum.)

Our model for the code we generate is that it will start from a C-style function call. This allows us to do a few things:

We can use a Rust program as the wrapper around our code, which makes it somewhat more cross-platform than it would be otherwise
We can defer some details to our Rust wrapper that we want to skip or leave until later

So, our wrapper will be a Rust program stub.rs with a traditional main that calls a function that we will define with our generated code:

#[link(name = "compiled_code")]
extern "C" {
    fn start_here() -> i64;
}

fn main() {
    let output = unsafe { start_here() };
    println!("Assembly code returned: {}", output);
}

So right now, our compiled program had better return an integer, and our wrapper will handle printing it out for us. The extern block tells the Rust compiler that we are expecting:

in a library called "compiled_code" (#[link(name = "compiled_code")])...
there will be some functions using the C calling convention (extern "C")...
specifically, one called "start_here", which expects no arguments and returns an i64.

The main function is mostly normal, except it uses an unsafe block. Rust as a language was designed to have nice programming properties like memory safety, but when we call external libraries we implemented in assembly code, the compiler can no longer guarantee that those libraries respect Rust’s invariants. So when we call external functions, we have to wrap them in an unsafe block to tell the Rust compiler we are willing to accept the risks of stepping outside the nice guarantees of safe Rust. For this course, our compiler will never use unsafe code, but our runtime system will use it a great deal because it is interacting directly with our compiled assembly code.
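To make the contract concrete before any assembly exists, here is a minimal sketch of a pure-Rust stand-in with the same shape as start_here. It is not part of stub.rs and is only an illustration: the body returning 37 is an arbitrary placeholder, and the function is defined in Rust rather than linked from a library.

```rust
// Hypothetical stand-in for the assembly we have not written yet.
// `extern "C"` asks rustc to use the C calling convention, the same
// convention the wrapper assumes for the real `start_here`.
extern "C" fn start_here() -> i64 {
    37 // placeholder value, chosen only for illustration
}

fn main() {
    // No `unsafe` is needed here: the function body is Rust, so the
    // compiler can check it. The real wrapper needs `unsafe` because the
    // body lives in assembly that the Rust compiler never sees.
    let output = start_here();
    println!("Assembly code returned: {}", output);
}
```

The only difference from the real wrapper is where the function body comes from, which is exactly what the unsafe block acknowledges.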

If we try to compile stub.rs now we get an error.

$ rustc stub.rs
...
  note: ld: library not found for -lcompiled_code

This says that the linker couldn’t find a library with the name "compiled_code". So let’s implement one!

3 Hello, x64

Our next goal is to:

Write an assembly program that defines start_here
Link that program with stub.rs and create an executable

In order to write assembly, we need to pick a syntax and an instruction set. We’re going to generate 64-bit x86 assembly (x86-64, also called x64), and use the so-called Intel syntax (there’s also an AT&T syntax, for those curious), because I like a particular guide that uses the Intel syntax, and because it works with the particular assembler we’ll use.

Here’s a very simple assembly program, matching the above constraints, that will act like a C function of no arguments and return a constant number (37) as the return value. (On Mac OS X, you will need to write _start_here, with an extra leading underscore.)

        section .text
        global start_here
start_here:
        mov rax, 37
        ret

The pieces mean, line by line:

section .text: Here comes some code, in text form!
global start_here: This assembly code defines a globally accessible symbol called start_here. This is what makes it so that when we generate an object file later, the linker will know what names come from where.
start_here: marks where the code for this symbol starts. If other code jumps to start_here, this is where it begins.
mov rax, 37: Take the constant number 37 and put it in the register called rax. This register is the one that compiled C programs expect to find return values in, so we should put our “answer” there.
ret: Do mechanics related to managing the stack, which we will talk about in much more detail later, then jump back to wherever the caller of start_here left off.

We can put this in a file called compiled_code.s (.s is a typical extension for assembly code), and then we just need to know how to assemble and link it with the main we wrote.

4 Hello, `nasm`

We will be using a program called nasm as our assembler, because it works well across a few platforms, and is simple to use. The main way we will use it is to take assembly (.s) files and turn them into object (.o) files. The command we’ll use to build with nasm (in Linux) is:

$ nasm -f elf64 -o compiled_code.o compiled_code.s

This creates a file called compiled_code.o in Executable and Linkable Format. We won’t go into detail about this binary structure. For our purposes, it’s simply a version of the assembly we wrote that our particular operating system understands.

If you are on OSX, you can use -f macho64 rather than -f elf64, which will produce an OSX-compatible object file. If you are on Windows, you can try -f win64 and share on Piazza if it works.
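If the build is ever scripted, the per-platform choice can be written down once. The following is a small sketch, not part of the lab code: the helper names nasm_format and entry_symbol are hypothetical, the OS strings follow Rust’s std::env::consts::OS, and the assumption that only Mac OS X needs the leading underscore comes from the note in the previous section.

```rust
/// Pick the nasm output format for an OS name as reported by
/// `std::env::consts::OS` ("linux", "macos", "windows").
/// (Hypothetical helper, for illustration only.)
fn nasm_format(os: &str) -> Option<&'static str> {
    match os {
        "linux" => Some("elf64"),
        "macos" => Some("macho64"),
        "windows" => Some("win64"), // untested, as the text above says
        _ => None,
    }
}

/// The symbol name the assembly should define on that OS.
/// Assumption: only Mac OS X wants the extra leading underscore.
fn entry_symbol(os: &str) -> &'static str {
    if os == "macos" {
        "_start_here"
    } else {
        "start_here"
    }
}

fn main() {
    let os = std::env::consts::OS;
    match nasm_format(os) {
        Some(fmt) => println!("nasm -f {} (entry symbol: {})", fmt, entry_symbol(os)),
        None => println!("no known nasm format for {}", os),
    }
}
```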

Next, to link with Rust code, we need to turn our object file into the type of file rustc expects for libraries. We will use a static library so that our assembled code is put directly into the executable file. On Mac and Linux this means producing an archive file libcompiled_code.a using the following command:

$ ar r libcompiled_code.a compiled_code.o

Finally, we need to compile our Rust file while telling the compiler to look for libraries in the current directory (-L):

$ rustc stub.rs -L .

This builds an executable called stub (rustc names it after the source file), which we can run:

$ ./stub
Assembly code returned: 37

5 Hello, Compiler

With this pipeline in place, the only step left is to write a Rust program that can generate assembly programs. Then we can automate the process and get a pipeline from user program all the way to executable.

A very simple compiler might just take the name of a file, and output the compiled assembly code on standard output. Let’s try that; here’s a simple main.rs that takes a file as a command line argument, expects it to contain a single integer on one line, and generates the corresponding assembly code:

type AST = i64;

fn main() {
    use std::fs;

    let args: Vec<String> = std::env::args().collect(); // get the program arguments as a Vec<String>
    let inp = fs::read_to_string(&args[1]).unwrap();    // read the file named by args[1] into a String
    let num = parse(&inp).unwrap();
    print!("{}", compile(num));
}

fn parse(s: &str) -> Result<AST, String> {
    match i64::from_str_radix(s.trim(), 10) { // .trim() removes leading and trailing whitespace
        Ok(x) => Ok(x),
        Err(e) => Err(e.to_string())
    }
}

fn compile(n: AST) -> String { // Add _ to the front of the label for Mac OS X
    format!("\
        section .text
        global start_here
start_here:
        mov rax, {}
        ret\n",
    n)
}

Make a new cargo project and put this into src/main.rs, then create another file 2021.int that contains just the number 2021, then run:

$ cargo run 2021.int
...
        section .text
        global start_here
start_here:
        mov rax, 2021
        ret

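How exciting! We can redirect the output to a file and get the entire compilation pipeline working. This assumes stub.rs is in the same directory, and that no libcompiled_code.a is left over from the previous section (ar r would add 2021.o alongside the old compiled_code.o, and the linker could then pick up the old start_here):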

$ cargo run 2021.int > 2021.s
$ nasm -f elf64 -o 2021.o 2021.s
$ ar r libcompiled_code.a 2021.o
$ rustc stub.rs -L . -o 2021.run
$ ./2021.run
Assembly code returned: 2021

Then we can use Makefiles or custom scripts to chain these steps together.
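As one possible shape for such a script, here is a rough sketch of a Rust driver that assembles the same three commands. The function names build_steps and run_steps and the --run flag are made up for this example, and the elf64 format assumes Linux. By default it only prints the commands, so it can be tried without nasm installed.

```rust
use std::process::Command;

fn to_strings(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Build the three commands from the text for an assembly file `<stem>.s`.
/// (Hypothetical helper; `stem` is the file name without its extension.)
fn build_steps(stem: &str, nasm_format: &str) -> Vec<Vec<String>> {
    let asm = format!("{}.s", stem);
    let obj = format!("{}.o", stem);
    let exe = format!("{}.run", stem);
    vec![
        to_strings(&["nasm", "-f", nasm_format, "-o", obj.as_str(), asm.as_str()]),
        to_strings(&["ar", "r", "libcompiled_code.a", obj.as_str()]),
        to_strings(&["rustc", "stub.rs", "-L", ".", "-o", exe.as_str()]),
    ]
}

/// Run each step in order, stopping at the first one that fails.
fn run_steps(steps: &[Vec<String>]) -> Result<(), String> {
    for step in steps {
        let status = Command::new(&step[0])
            .args(&step[1..])
            .status()
            .map_err(|e| format!("could not start {}: {}", step[0], e))?;
        if !status.success() {
            return Err(format!("{} failed with {}", step[0], status));
        }
    }
    Ok(())
}

fn main() {
    let steps = build_steps("2021", "elf64"); // elf64 assumes Linux
    for step in &steps {
        println!("$ {}", step.join(" "));
    }
    // Only invoke the tools when explicitly asked to.
    if std::env::args().any(|a| a == "--run") {
        if let Err(e) = run_steps(&steps) {
            eprintln!("{}", e);
            std::process::exit(1);
        }
    }
}
```

Keeping the command construction separate from the code that runs it makes the pipeline easy to inspect and to check without touching the file system.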

Of course, this is “just” a bunch of boilerplate. It got us to the point where we have a Rust program that’s defining our translation from input program to assembly code. Our input programs are pretty boring, so those will need to get more sophisticated, and correspondingly the function compile will need to become more impressive. That’s where our focus will be in the coming weeks.
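Before compile grows, it can help to pin down what the current parse and compile already do. The sketch below repeats the two functions from main.rs so that it stands alone, then checks a few inputs. The particular inputs are just examples chosen here, not part of the lab.

```rust
type AST = i64;

// Same definition as in main.rs above.
fn parse(s: &str) -> Result<AST, String> {
    match i64::from_str_radix(s.trim(), 10) {
        Ok(x) => Ok(x),
        Err(e) => Err(e.to_string()),
    }
}

// Same definition as in main.rs above.
fn compile(n: AST) -> String {
    format!("\
        section .text
        global start_here
start_here:
        mov rax, {}
        ret\n",
    n)
}

fn main() {
    // Surrounding whitespace is trimmed before parsing.
    assert_eq!(parse(" 2021\n"), Ok(2021));
    // Negative numbers are accepted.
    assert_eq!(parse("-5"), Ok(-5));
    // Anything that is not a single integer is rejected.
    assert!(parse("12 34").is_err());
    assert!(parse("").is_err());
    // The largest i64 parses; one past it does not.
    assert_eq!(parse("9223372036854775807"), Ok(i64::MAX));
    assert!(parse("9223372036854775808").is_err());
    // The number lands in the mov instruction.
    assert!(compile(2021).contains("mov rax, 2021"));
    assert!(compile(2021).ends_with("ret\n"));
    println!("all checks passed");
}
```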

6 x86-64 Basics

x86-64 has 16 general-purpose registers, each of which holds a 64-bit value:

rax
rcx
rdx
rbx
rsp
rbp
rsi
rdi
r8-r15

We will learn more about them as we dive deeper into the stack and calling conventions, but for today, we just need to know that rax is where return values go in the C calling convention we use to interface with Rust.

We also discussed two instructions in more depth: mov and add. The basic semantics of mov x, y are that it moves whatever is in y to x. x and y might be registers, memory references or immediates. Only the following 5 combinations make sense:

mov reg, reg move from a register to another register
mov reg, imm64 move a 64-bit integer value into a register
mov reg, mem load from memory into a register
mov mem, reg store the contents of a register at a memory location
mov mem, imm32 store a 32-bit integer value at a memory location

Note that in particular we cannot directly move from one memory location into another.
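The table above can be read as a small predicate. Here is a sketch of it in Rust. The Arg type and mov_is_valid are names invented for this example, and registers and memory references carry no detail because only the kind of operand matters for this check.

```rust
/// The three kinds of operand from the table above.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum Arg {
    Reg,
    Mem,
    Imm(i64),
}

/// Does the immediate fit in a signed 32-bit value?
fn fits_i32(n: i64) -> bool {
    i32::try_from(n).is_ok()
}

/// True exactly for the five `mov dst, src` combinations listed above.
fn mov_is_valid(dst: Arg, src: Arg) -> bool {
    match (dst, src) {
        (Arg::Reg, Arg::Reg) => true,
        (Arg::Reg, Arg::Imm(_)) => true, // full 64-bit immediate allowed
        (Arg::Reg, Arg::Mem) => true,
        (Arg::Mem, Arg::Reg) => true,
        (Arg::Mem, Arg::Imm(n)) => fits_i32(n), // only 32-bit immediates
        (Arg::Mem, Arg::Mem) => false,          // no memory-to-memory move
        (Arg::Imm(_), _) => false,              // cannot write to a constant
    }
}

fn main() {
    println!("mov reg, imm64: {}", mov_is_valid(Arg::Reg, Arg::Imm(i64::MAX)));
    println!("mov mem, mem:   {}", mov_is_valid(Arg::Mem, Arg::Mem));
    println!("mov mem, 2^40:  {}", mov_is_valid(Arg::Mem, Arg::Imm(1 << 40)));
}
```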

Next, add x, y acts like the += operation: it puts x + y in x. The combinations allowed for add are quite similar to those for mov, with one notable exception:

add reg, reg
add reg, imm32 add a 32-bit integer value to a register value
add reg, mem
add mem, reg
add mem, imm32

add only allows for a 32-bit integer immediate to be added to a register. In fact, mov is unique among the instructions we will use in that it allows for a full 64-bit immediate.
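One consequence is that adding a large constant takes two instructions: first mov the constant into a register, then add that register. Here is a sketch of how a code generator might handle this. The function name add_constant is invented for this example, and using r10 as the scratch register is an assumption made here (it must differ from the destination register).

```rust
/// Emit assembly lines that add the constant `n` to the register `dst`.
/// (Hypothetical helper; r10 as scratch register is an assumption.)
fn add_constant(dst: &str, n: i64) -> Vec<String> {
    if i32::try_from(n).is_ok() {
        // The constant fits in 32 bits, so `add reg, imm32` applies directly.
        vec![format!("add {}, {}", dst, n)]
    } else {
        // Too large for an add immediate: load it with mov, which accepts
        // a full 64-bit immediate, then add register to register.
        vec![format!("mov r10, {}", n), format!("add {}, r10", dst)]
    }
}

fn main() {
    for n in [5i64, 1i64 << 40] {
        println!("; rax += {}", n);
        for line in add_constant("rax", n) {
            println!("        {}", line);
        }
    }
}
```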
