Lecture 2: A First Compiler – Neonate + x86 Basics
1 The Big Picture
2 The Wrapper
3 Hello, x64
4 Hello, nasm
5 Hello, Compiler
6 x86-64 Basics


Today we’re going to implement a compiler. It will be called Neonate, because it’s fun to name things and the name will fit a theme in future weeks.

It’s not going to be terrifically useful, as it will only compile a very small language — integers. That is, it will take a user program (a number), and create an executable binary that prints the number. There are no files in this repository because the point of the lab is for you to see how this is built up from scratch. That way, you’ll understand the infrastructure that future assignments’ support code will use.

1 The Big Picture

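The heart of each compiler we write will be a Rust program that takes an input program and generates assembly code. That leaves open a few questions, in particular: first, what form does the input program take when it reaches the compiler; and second, how do we turn the generated assembly into something we can run?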

Our answer to the first question is going to be simple for today: we’ll expect that all programs are files containing a single integer, so there’s little “front-end” for the compiler to consider. Most of this lab is about the second question — how we take our generated assembly and meaningfully run it while avoiding both (a) the feeling that there’s too much magic going on, and (b) getting bogged down in system-level details that don’t enlighten us about compilers.

2 The Wrapper

(The idea here is taken directly from Abdulaziz Ghuloum.)

Our model for the code we generate is that it will start from a C-style function call. This allows us to do a few things:

So, our wrapper will be a Rust program stub.rs with a traditional main that calls a function that we will define with our generated code:

#[link(name = "compiled_code")]
extern "C" {
    fn start_here() -> i64;
}

fn main() {
    let output = unsafe { start_here() };
    println!("Assembly code returned: {}", output);
}

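So right now, our compiled program had better return an integer, and our wrapper will handle printing it out for us. The extern block tells the Rust compiler that we are expecting a function named start_here, taking no arguments and returning an i64, to be defined outside this file and called using the C calling convention. The #[link(name = "compiled_code")] attribute says that the definition will come from a library named compiled_code.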

The main function is mostly normal, except it uses an unsafe block. Rust as a language was designed to have nice programming properties like memory safety, but when we call external libraries we implemented in assembly code, the compiler can no longer guarantee that those libraries respect Rust’s invariants. So when we call external functions, we have to wrap them in an unsafe block to tell the Rust compiler we are willing to accept the risks of stepping outside the nice guarantees of safe Rust. For this course, our compiler will never use unsafe code, but our runtime system will use it a great deal because it is interacting directly with our compiled assembly code.
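
To make the role of the extern declaration concrete, here is a minimal sketch that can be tried before any assembly exists. Both names are assumptions made for illustration: start_here_stand_in is an ordinary Rust function given the C calling convention, standing in for the routine our generated code will eventually provide, and report is a helper that is not part of stub.rs.

```rust
// Hypothetical stand-in for the assembly routine: an ordinary Rust function
// that uses the C calling convention and returns an i64.
extern "C" fn start_here_stand_in() -> i64 {
    37
}

// Illustrative helper (not in stub.rs): builds the line that the wrapper's
// main function prints.
fn report(output: i64) -> String {
    format!("Assembly code returned: {}", output)
}
```

Here report(start_here_stand_in()) produces the string "Assembly code returned: 37". No unsafe block is needed in this sketch because the function body is written in Rust and the compiler can check it; the real stub.rs needs one because the definition arrives from an external library.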

If we try to compile stub.rs now we get an error.

$ rustc stub.rs
...
  note: ld: library not found for -lcompiled_code

This says that the linker couldn’t find a library with the name "compiled_code". So let’s implement one!

3 Hello, x64

Our next goal is to:

In order to write assembly, we need to pick a syntax and an instruction set. We’re going to generate 64-bit x86 assembly (x86-64, also called x64) and use the so-called Intel syntax (there’s also an AT&T syntax, for those curious), because I like a particular guide that uses the Intel syntax, and because it works with the particular assembler we’ll use.

Here’s a very simple assembly program, matching the above constraints, that will act like a C function of no arguments and return a constant number (37) as the return value. (For Mac OS X, you will need to write _start_here, with an extra underscore.)

        section .text
        global start_here
start_here:
        mov rax, 37
        ret

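The pieces mean, line by line:

section .text tells the assembler that what follows is code, to be placed in the text section of the object file.
global start_here exports the label start_here so that the linker can connect it to the declaration in stub.rs.
start_here: is a label naming the address of the instruction that follows it; calling start_here transfers control there.
mov rax, 37 puts the constant 37 into the register rax, which is where the caller expects to find the return value.
ret returns control to the caller.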

We can put this in a file called compiled_code.s (.s is a typical extension for assembly code), and then we just need to know how to assemble and link it with the main we wrote.

4 Hello, nasm

We will be using a program called nasm as our assembler, because it works well across a few platforms, and is simple to use. The main way we will use it is to take assembly (.s) files and turn them into object (.o) files. The command we’ll use to build with nasm (in Linux) is:

$ nasm -f elf64 -o compiled_code.o compiled_code.s

This creates a file called compiled_code.o in Executable and Linkable Format. We won’t go into detail about this binary structure. For our purposes, it’s simply a version of the assembly we wrote that our particular operating system understands.

If you are on OSX, you can use -f macho64 rather than -f elf64, which will produce an OSX-compatible object file. If you are on Windows, you can try -f win64 and share on Piazza if it works.
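
Since the object format depends on the operating system, a build script can choose it automatically. The following is only a sketch: the function names nasm_format and host_nasm_format are made up for illustration, and the table contains just the three formats mentioned above (with win64 as untested here as it is in the text).

```rust
// Maps an operating system name, in the spelling used by
// std::env::consts::OS, to the nasm output format named in the text.
// Returns None for any system the notes do not cover.
fn nasm_format(os: &str) -> Option<&'static str> {
    match os {
        "linux" => Some("elf64"),
        "macos" => Some("macho64"),
        "windows" => Some("win64"), // untested in these notes
        _ => None,
    }
}

// Format for the machine this program is running on, if it is covered.
fn host_nasm_format() -> Option<&'static str> {
    nasm_format(std::env::consts::OS)
}
```

The result of host_nasm_format() can then be passed to nasm after the -f flag.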

Next, to link with Rust code, we need to turn our object file into the type of file rustc expects for libraries. We will use a static library so that our assembled code is put directly into the executable file. On Mac and Linux this means producing an archive file libcompiled_code.a using the following command:

$ ar r libcompiled_code.a compiled_code.o

Finally, we need to compile our Rust file while telling the compiler to look for libraries in the current directory (-L):

$ rustc stub.rs -L .

This builds an executable, named stub by default, that we can run:

$ ./stub
Assembly code returned: 37

5 Hello, Compiler

With this pipeline in place, the only step left is to write a Rust program that can generate assembly programs. Then we can automate the process and get a pipeline from user program all the way to executable.

A very simple compiler might just take the name of a file, and output the compiled assembly code on standard output. Let’s try that; here’s a simple main.rs that takes a file as a command line argument, expects it to contain a single integer on one line, and generates the corresponding assembly code:

type AST = i64;

fn main() {
    use std::fs;

    let args: Vec<String> = std::env::args().collect(); // get the program arguments as a Vec<String>
    let inp = fs::read_to_string(&args[1]).unwrap();    // read the file named by args[1] into a String
    let num = parse(&inp).unwrap();
    print!("{}", compile(num));
}

fn parse(s: &str) -> Result<AST, String> {
    match i64::from_str_radix(s.trim(), 10) { // .trim() removes leading and trailing whitespace
        Ok(x) => Ok(x),
        Err(e) => Err(e.to_string())
    }
}

fn compile(n: AST) -> String { // Add _ to the front of the label for Mac OS X
    format!("\
        section .text
        global start_here
start_here:
        mov rax, {}
        ret\n",
    n)
}

Make a new cargo project and put this into src/main.rs, then create another file 2021.int that contains just the number 2021, then run:

$ cargo run 2021.int
...
        section .text
        global start_here
start_here:
        mov rax, 2021
        ret
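
The comment on compile mentions that Mac OS X needs an underscore in front of the label. One way to avoid editing the string by hand is to pass the choice in as a flag. This is a sketch only: compile_for and its macos parameter are hypothetical and do not appear in main.rs.

```rust
// Sketch of a compile function that picks the entry label by platform.
// The `macos` parameter is an assumption made for this example.
fn compile_for(n: i64, macos: bool) -> String {
    // Mac OS X expects the symbol to carry a leading underscore.
    let label = if macos { "_start_here" } else { "start_here" };
    format!(
        "        section .text\n        global {label}\n{label}:\n        mov rax, {n}\n        ret\n",
        label = label,
        n = n
    )
}
```

A caller could write compile_for(num, cfg!(target_os = "macos")) so that the label follows the platform the compiler itself was built for.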

How exciting! We can redirect the output to a file, and get an entire pipeline of compilation to work out (assuming stub.rs is in the same directory):

$ cargo run 2021.int > 2021.s
$ nasm -f elf64 -o 2021.o 2021.s
$ ar r libcompiled_code.a 2021.o
$ rustc stub.rs -L . -o 2021.run
$ ./2021.run
Assembly code returned: 2021

Then we can use Makefiles or custom scripts to chain these steps together.
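
As one possible shape for such a script, here is a sketch in Rust that builds the three commands that follow the cargo run step and runs them in order. The names pipeline_commands, words and run_all are invented for this example, and it assumes that nasm, ar and rustc are on the PATH and that stub.rs and the .s file are in the current directory.

```rust
use std::process::Command;

// Turns a slice of string slices into an owned command line.
fn words(line: &[&str]) -> Vec<String> {
    line.iter().map(|w| w.to_string()).collect()
}

// Returns the assemble, archive and link commands from the text for a
// program whose assembly is in `<stem>.s`. Each inner Vec is one command:
// the program name followed by its arguments.
fn pipeline_commands(stem: &str, nasm_format: &str) -> Vec<Vec<String>> {
    let asm = format!("{}.s", stem);
    let obj = format!("{}.o", stem);
    let exe = format!("{}.run", stem);
    vec![
        words(&["nasm", "-f", nasm_format, "-o", obj.as_str(), asm.as_str()]),
        words(&["ar", "r", "libcompiled_code.a", obj.as_str()]),
        words(&["rustc", "stub.rs", "-L", ".", "-o", exe.as_str()]),
    ]
}

// Runs the commands in order, stopping at the first one that fails.
// Returns Ok(true) only if every command exited successfully.
fn run_all(commands: &[Vec<String>]) -> std::io::Result<bool> {
    for cmd in commands {
        let status = Command::new(&cmd[0]).args(&cmd[1..]).status()?;
        if !status.success() {
            return Ok(false);
        }
    }
    Ok(true)
}
```

For example, run_all(&pipeline_commands("2021", "elf64")) would repeat the steps shown above once 2021.s exists. Note that ar r adds to an existing archive, so it may be worth deleting a leftover libcompiled_code.a before building a different program.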

Of course, this is “just” a bunch of boilerplate. It got us to the point where we have a Rust program that’s defining our translation from input program to assembly code. Our input programs are pretty boring, so those will need to get more sophisticated, and correspondingly the function compile will need to become more impressive. That’s where our focus will be in the coming weeks.

6 x86-64 Basics

x86-64 has 16 general-purpose registers, each of which can hold a 64-bit value: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, and r8 through r15.

We will learn more about them as we dive deeper into the stack and calling conventions, but for today, we just need to know that rax is where return values go in the C calling convention we use to interface with Rust.

We also discussed two instructions in more depth: mov and add. The basic semantics of mov x, y are that it moves whatever is in y to x. x and y might be registers, memory references or immediates. Only the following 5 combinations make sense:

mov reg, reg
mov reg, mem
mov reg, imm
mov mem, reg
mov mem, imm

Note that in particular we cannot directly move from one memory location into another.
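
The rule can be written down as a small check. In this sketch the Operand type and the mov_allowed function are illustrative names, not part of any compiler code shown so far; the check takes the five sensible combinations to be every pairing of a register or memory destination with a register, memory or immediate source, minus memory-to-memory.

```rust
// The three kinds of operand the text mentions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    Reg, // a register such as rax
    Mem, // a memory reference
    Imm, // a constant such as 37
}

// True when `mov dst, src` is one of the five combinations that make sense.
fn mov_allowed(dst: Operand, src: Operand) -> bool {
    match (dst, src) {
        // A constant is not a place we can store into.
        (Operand::Imm, _) => false,
        // No direct move from one memory location to another.
        (Operand::Mem, Operand::Mem) => false,
        // reg <- reg, reg <- mem, reg <- imm, mem <- reg, mem <- imm
        _ => true,
    }
}
```

A code generator could call a check like this before emitting a mov, and route a memory-to-memory move through a register instead.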

Next, add x, y acts like the += operation: its semantics is to put x + y in x. The combinations allowed for add are quite similar to those for mov, with one notable exception:

add only allows for a 32-bit integer immediate to be added to a register. In fact, mov is unique among the instructions we will use in that it allows for a full 64-bit immediate.
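
To see how a compiler might respect this limit, here is a sketch that adds a constant to rax. The function names are invented for the example, and the use of r10 as a scratch register is an assumption: it presumes nothing else needs the value in r10 at that point.

```rust
// True when `n` can be written as a 32-bit signed immediate.
fn fits_in_imm32(n: i64) -> bool {
    n >= i32::MIN as i64 && n <= i32::MAX as i64
}

// Emits instructions that add the constant `n` to rax.
// Assumption: r10 is free to use as a scratch register here.
fn add_const_to_rax(n: i64) -> Vec<String> {
    if fits_in_imm32(n) {
        vec![format!("add rax, {}", n)]
    } else {
        // Only mov accepts a full 64-bit immediate, so load the constant
        // into a register first and then add register to register.
        vec![format!("mov r10, {}", n), "add rax, r10".to_string()]
    }
}
```

With this sketch, add_const_to_rax(5) yields the single instruction add rax, 5, while add_const_to_rax(4294967296) yields mov r10, 4294967296 followed by add rax, r10.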