Lecture 3: Abstract Syntax Trees and Interpreters
1 Growing our Language: Basic Computation
Last time, we considered the following minuscule language:
‹expr› ::= NUMBER
Our Rust representation of this program was simply an n: i64
and our compiler simply placed that integer in the appropriate place in the
assembly:
section .text
global start_here
start_here:
mov rax, n
ret
Let’s extend our language so that it does some actual computation, adding increment and decrement operations. Say that we have access to some kind of functions add1 and sub1. Here are a few examples of programs that we might write in this language, and what answer we expect them to evaluate to:
Concrete Syntax       | Answer
7                     | 7
add1(7)               | 8
sub1(add1(7))         | 7
add1(sub1(add1(7)))   | 8
Now how can we go about implementing this language? The first step is to take in our program as an input string and (1) check that it is a well-formed program and (2) rewrite it into a more convenient representation to use.
1.1 Concrete Syntax
How programs are written as strings is called the concrete syntax of the programming language. We can describe the syntax for our Snake language using a grammar like the following:
‹expr› ::= NUMBER
         | add1 ( ‹expr› )
         | sub1 ( ‹expr› )
A grammar like this describes the concrete syntax of a language by specifying how you could generate the terms in the language. In this case we have one form called expr, and to create an expr we either pick a number, or we call an add1 or sub1 function on another expr, with the argument wrapped in parentheses. By convention, we ignore whitespace in this description. We’ll keep this all a bit informal for now and make it completely rigorous in the last month of the course.
Even for a simple language like this, writing a program that checks if a given string conforms to this grammar is fairly tedious. Thankfully there are widely available programs called parser generators that will generate code for a parser from fairly simple descriptions like this. In this course we use one called LALRPOP (pronounced kind of like "lollipop"). A LALRPOP implementation of the grammar for this language is given by:
pub Expr = {
<Num>,
"add1" "(" <Expr> ")",
"sub1" "(" <Expr> ")",
}
Num = {
<r"[+-]?[0-9]+">
}
This is almost exactly what we’ve written above, except that it is more precise in defining the syntax for numbers. Running the lalrpop tool (you can install it yourself with cargo install lalrpop) on this code produces 500 lines of fairly inscrutable Rust code that we are happy not to have to write by hand.
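To appreciate what the generator saves us, here is a sketch of what a hand-written recognizer for this grammar might look like. All the names here (recognize, expr, call_argument, number) are our own inventions, not part of LALRPOP's generated code, and the whitespace and sign handling are only lightly tested:

```rust
// A hand-rolled recognizer for the grammar above: returns true iff the
// input is a well-formed expression of our language.
fn recognize(input: &str) -> bool {
    matches!(expr(input), Some(rest) if rest.trim().is_empty())
}

// Try to consume one ‹expr› from the front of `s`; on success, return
// the remaining unconsumed input.
fn expr(s: &str) -> Option<&str> {
    let s = s.trim_start();
    if let Some(rest) = s.strip_prefix("add1") {
        call_argument(rest)
    } else if let Some(rest) = s.strip_prefix("sub1") {
        call_argument(rest)
    } else {
        number(s)
    }
}

// Consume "(" ‹expr› ")".
fn call_argument(s: &str) -> Option<&str> {
    let rest = s.trim_start().strip_prefix('(')?;
    let rest = expr(rest)?;
    rest.trim_start().strip_prefix(')')
}

// Consume a number matching [+-]?[0-9]+.
fn number(s: &str) -> Option<&str> {
    let body = if s.starts_with('+') || s.starts_with('-') { &s[1..] } else { s };
    let digits = body.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 { None } else { Some(&body[digits..]) }
}

fn main() {
    assert!(recognize("add1(sub1(7))"));
    assert!(!recognize("add1(7"));
    println!("ok");
}
```

Even for this tiny grammar, getting the corner cases right by hand is fiddly, which is exactly why we prefer to generate the parser.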
1.2 Abstract Syntax and Parsing
The lalrpop program above doesn’t really define a parser, but instead a recognizer, i.e., a function that takes in an input string and tells us whether or not the input string matches our language’s grammar. A recognizer alone is not very useful for a programming language implementation. After we have identified that the input string is in the language, we are still left with just a string, which is not a very convenient representation for writing an interpreter or a compiler. When we write the remainder of the implementation, we would prefer to use a representation where we can very easily answer some basic questions:
- Is the expression a number or an operation?
- If it’s a number, what is that number represented as an i64?
- If it’s an operation, is it an add1 or a sub1, and what is the sub-expression?
We could answer these questions using just a string as our representation, but it would be tedious and involve scanning the string, which is what the recognizer just did anyway! For more complex languages, this becomes completely infeasible.
Instead what we’ll do is represent our programs as special kinds of trees called abstract syntax trees. For this language, a tree is either a leaf, which should have a number, or a node, which should be labeled as add1 or sub1 and have one child. One of the main reasons we use Rust in this course is that Rust has very good support for defining such tree types:
enum Expression {
Number(i64),
Add1(Box<Expression>),
Sub1(Box<Expression>),
}
This is a Rust enum type, so-called because it is a type defined by enumerating the possible values. You can read more about enum types in chapter 6 of the Rust book. This definition matches our abstract description of the grammar: either the input is a number, or an add1 operation performed on another expression, or a sub1 operation performed on another.
Here are a couple of examples of programs and how to construct their corresponding abstract syntax trees:
Concrete Syntax | Abstract Syntax
7               | Expression::Number(7)
add1(7)         | Expression::Add1(Box::new(Expression::Number(7)))
sub1(add1(7))   | Expression::Sub1(Box::new(Expression::Add1(Box::new(Expression::Number(7)))))
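For instance, the tree for add1(sub1(7)) can be built and printed like so. The #[derive(Debug, PartialEq)] attribute is our own addition, purely so the value can be printed and compared in tests:

```rust
// The same enum as in the text, with Debug and PartialEq derived (our
// addition) so the tree can be printed and compared.
#[derive(Debug, PartialEq)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn main() {
    // add1(sub1(7)) as an abstract syntax tree. The Box provides the
    // indirection Rust requires for a recursive type.
    let e = Expression::Add1(Box::new(Expression::Sub1(Box::new(
        Expression::Number(7),
    ))));
    println!("{:?}", e);
}
```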
But notice how we’ve abstracted away the details of the concrete syntax. This same abstract syntax could apply to many different concrete syntaxes. For example, we could use an imperative syntax

17; sub1; add1; sub1

or a lisp-style syntax

(sub1 (add1 (sub1 17)))

but almost none of those details are relevant to the rest of compilation (the one thing that compilers do often keep around is source-location information, for providing better error messages).
To extract an abstract syntax tree from our input string, we update our lalrpop file to generate an output expression in addition to specifying the grammar:
pub Expr: Expression = {
<n: Num> => Expression::Number(n),
"add1" "(" <e: Expr> ")" => Expression::Add1(Box::new(e)),
"sub1" "(" <e: Expr> ")" => Expression::Sub1(Box::new(e)),
}
Num: i64 = {
<s:r"[+-]?[0-9]+"> => s.parse().unwrap()
}
For this language, our type of abstract syntax trees is very simple, but this approach scales up to very complicated languages. The analogous enum in the Rust compiler (which is written in Rust) contains about 40 different cases!
1.3 Semantics
Before we define a compiler for our new language, we should first
specify its semantics, i.e., how programs should be evaluated.
A simple method for doing this is to define a corresponding
interpreter. Like our previous language, the programs here
should output a single integer number, but now add1
and
sub1
should perform those mathematical operations.
We can implement this functionality by using Rust’s pattern-matching feature, which allows us to perform a case-analysis of which constructor was used, and get access to the data contained within.
fn interpret(e: &Expression) -> i64 {
match e {
Expression::Number(n) => *n,
Expression::Add1(arg) => interpret(arg) + 1,
Expression::Sub1(arg) => interpret(arg) - 1,
}
}
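For example, on the tree for add1(sub1(add1(7))) the recursion bottoms out at the leaf and each frame adjusts the result on the way back up (the enum and interpreter are repeated here so the snippet is self-contained):

```rust
#[derive(Debug)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression) -> i64 {
    match e {
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg) + 1,
        Expression::Sub1(arg) => interpret(arg) - 1,
    }
}

fn main() {
    // add1(sub1(add1(7))): evaluates to 7 + 1 - 1 + 1 = 8.
    let e = Expression::Add1(Box::new(Expression::Sub1(Box::new(
        Expression::Add1(Box::new(Expression::Number(7))),
    ))));
    println!("{}", interpret(&e)); // 8
}
```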
Exercise
Why is the input to this interpreter a &Expression rather than an Expression? What difference does this decision make?
This pattern of programming is very common when we work with abstract syntax trees: we define a function by matching on the tree, and then recursively call the function on the sub-tree(s).
1.4 Compilation
Next, let’s instead write a compiler, where the additions and subtractions actually happen in assembly code rather than in our Rust implementation.
1.4.1 x86 Basics
Recall that we want to compile our program to a function in assembly code that takes no inputs and returns a 64-bit integer. The calling convention we use (System V AMD64) dictates that the result of such a function is stored in a register called rax. Registers like rax are a part of the abstract machine we program against in x86. Each of the general-purpose registers rax, rcx, rdx, rbx, rdi, rsi, rsp, rbp, r8-r15 stores a 64-bit value and can be manipulated using a large variety of assembly code instructions. These registers are mostly indistinguishable except by conventions (a notable exception is that rsp is treated as a stack pointer by many instructions).
The only instructions we’ll need today are mov, add/sub, and ret.

First, the mov instruction

mov dest, src

copies a value from src to dest. mov is a surprisingly complex operation in x86, encompassing loads, stores, register-to-register copies, constants, and some memory offset calculations. In its full generality it is even Turing complete. For today let’s use it in a very simple form: dest can be a register and src can be a constant or another register, in which case it stores the value of the constant or the current value of src into dest.

Next, the two arithmetic instructions
add dest, arg
sub dest, arg
are like mov in that the first argument is updated by the instruction. The semantics of these instructions is to add (or subtract) the value of arg to (or from) the current value of dest, and update dest with the result. I.e., you can think of add like a += operation and sub like a -= operation:
add dest, arg ;;; dest += arg
sub dest, arg ;;; dest -= arg
Finally, we have

ret

ret will work properly to implement a function return as long as we make sure that, when it is executed, we have not updated rsp, and the value we want to return is in rax.
1.4.2 Compiling to x86
We already know how to compile a number to a function that returns it: we simply mov the number into rax and then execute the ret. We can generalize this to an Adder expression like add1(sub1(add1(7))) by using the rax register as a place to store our intermediate results:

mov rax, 7 ;; rax holds 7
add rax, 1 ;; rax holds add1(7)
sub rax, 1 ;; rax holds sub1(add1(7))
add rax, 1 ;; rax holds add1(sub1(add1(7)))
ret        ;; return to the caller with rax holding the correct output
We have commented the above assembly code with its correspondence to the source program, and we see that in a way the assembly code is "reversed" from the original concrete syntax. Now how do we turn this implementation strategy into a compiler for arbitrary expressions? Just like the interpreter, we can implement the compiler by recursive descent on the input expression. If we look at the comments above, we see a recursive pattern: after each line is executed, rax holds the result of a larger and larger sub-expression. Then we can develop a recursive strategy for implementation: compile the source program to a sequence of x86 instructions that places the result into rax. For constants this is simply a mov, and for the recursive cases of add1 and sub1, we can append a corresponding assembly operation. Finally, we append a ret instruction at the end.
Here’s an implementation that emits the assembly instructions directly to stdout:
fn compile(e: &Expression) {
fn compile_rec(e: &Expression) {
match e {
Expression::Number(n) => println!("mov rax, {}", n),
Expression::Add1(arg) => {
compile_rec(arg);
println!("add rax, 1");
}
Expression::Sub1(arg) => {
compile_rec(arg);
println!("sub rax, 1");
}
}
}
println!(
" section .text
global start_here
start_here:"
);
compile_rec(e);
println!("ret");
}
Notice here that it would be counter-productive to directly use compile itself on the recursive sub-trees. Instead we define a helper function compile_rec whose more useful behavior is to produce just the instructions that move the value into rax.
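A variant of this compiler that returns the assembly as a String, instead of printing it, is easier to inspect and test. This refactoring is our own sketch, not the required interface:

```rust
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

// Same recursive strategy as compile/compile_rec, but the instructions
// accumulate into a String instead of going to stdout.
fn compile_to_string(e: &Expression) -> String {
    fn compile_rec(e: &Expression, out: &mut String) {
        match e {
            Expression::Number(n) => out.push_str(&format!("mov rax, {}\n", n)),
            Expression::Add1(arg) => {
                compile_rec(arg, out);
                out.push_str("add rax, 1\n");
            }
            Expression::Sub1(arg) => {
                compile_rec(arg, out);
                out.push_str("sub rax, 1\n");
            }
        }
    }
    let mut out = String::from("section .text\nglobal start_here\nstart_here:\n");
    compile_rec(e, &mut out);
    out.push_str("ret\n");
    out
}

fn main() {
    let asm = compile_to_string(&Expression::Add1(Box::new(Expression::Number(7))));
    print!("{}", asm);
}
```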
Having a compiler and an interpreter is also helpful for testing. We can write automated tests that check that our compiler and interpreter have the same behavior on all of our examples.
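One way to sketch such a differential test without assembling and running real machine code is to evaluate the emitted instructions with a tiny rax-only emulator. The emulator here is entirely our own testing device, covering only the three instruction shapes we emit:

```rust
#[derive(Debug)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression) -> i64 {
    match e {
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg) + 1,
        Expression::Sub1(arg) => interpret(arg) - 1,
    }
}

// Emit just the instruction sequence, without the section/label
// boilerplate, so it is easy to inspect.
fn compile_body(e: &Expression) -> Vec<String> {
    match e {
        Expression::Number(n) => vec![format!("mov rax, {}", n)],
        Expression::Add1(arg) => {
            let mut v = compile_body(arg);
            v.push("add rax, 1".to_string());
            v
        }
        Expression::Sub1(arg) => {
            let mut v = compile_body(arg);
            v.push("sub rax, 1".to_string());
            v
        }
    }
}

// A toy evaluator for the three instruction shapes we emit; it stands
// in for assembling and running the code in a unit test.
fn run_emulated(instrs: &[String]) -> i64 {
    let mut rax: i64 = 0;
    for i in instrs {
        if let Some(n) = i.strip_prefix("mov rax, ") {
            rax = n.parse().unwrap();
        } else if let Some(n) = i.strip_prefix("add rax, ") {
            rax += n.parse::<i64>().unwrap();
        } else if let Some(n) = i.strip_prefix("sub rax, ") {
            rax -= n.parse::<i64>().unwrap();
        } else {
            panic!("unknown instruction: {}", i);
        }
    }
    rax
}

fn main() {
    let e = Expression::Sub1(Box::new(Expression::Add1(Box::new(Expression::Number(10)))));
    // The compiled instructions and the interpreter must agree.
    assert_eq!(run_emulated(&compile_body(&e)), interpret(&e));
    println!("compiler and interpreter agree");
}
```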
1.4.3 Optimization
For this simple language there is an obvious way to produce an optimized assembly code program, one that uses as few instructions as possible. We can simply run the interpreter and compile to a program that returns the number that is output by the interpreter.
fn optimized_compile(e: &Expression) {
println!(
" section .text
global start_here
start_here:
mov rax, {}
ret",
        interpret(e)
);
}
Here we have improved the efficiency of the compiled program by doing more work at compile-time. Usually this is a good tradeoff, as programs are run many more times than they are compiled.
2 Extending the language: Dynamically determined Input
Our programs were very easy to optimize because all the information we needed to determine the result was available at compile-time. Obviously this isn’t typical: useful programs interact with the external world, e.g., by making network requests, inspecting files or reading command-line arguments. Let’s extend our language to take an input from the command line.
‹prog› ::= def main ( x ) : ‹expr›
‹expr› ::= NUMBER | x | add1 ( ‹expr› ) | sub1 ( ‹expr› )
Now our program consists not of a single expression, but a main "function" that takes in an argument named x. For today, let’s say that the parameter is always called x; we’ll talk about a more robust treatment of variables next time.
What is the impact on our abstract syntax? We just need to add a new kind of leaf node to our abstract syntax trees for when the program uses the input variable.
pub enum Expression {
Variable(),
Number(i64),
Add1(Box<Expression>),
Sub1(Box<Expression>),
}
And now our interpreter takes an additional argument, which corresponds to the input:
fn interpret(e: &Expression, x: i64) -> i64 {
match e {
Expression::Variable() => x,
Expression::Number(n) => *n,
Expression::Add1(arg) => interpret(arg, x) + 1,
Expression::Sub1(arg) => interpret(arg, x) - 1,
}
}
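For example, the program def main(x): add1(add1(x)) maps the input 40 to 42. The enum and interpreter are repeated here so the snippet is self-contained:

```rust
#[derive(Debug)]
enum Expression {
    Variable(),
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression, x: i64) -> i64 {
    match e {
        Expression::Variable() => x,
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg, x) + 1,
        Expression::Sub1(arg) => interpret(arg, x) - 1,
    }
}

fn main() {
    // def main(x): add1(add1(x)), run on the input 40.
    let e = Expression::Add1(Box::new(Expression::Add1(Box::new(
        Expression::Variable(),
    ))));
    println!("{}", interpret(&e, 40)); // 42
}
```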
We correspondingly need to change our Rust stub.rs wrapper to provide an input argument:
#[link(name = "compiled_code", kind = "static")]
extern "sysv64" {
    #[link_name = "\x01entry"]
    fn entry(param: i64) -> i64;
}

fn main() {
    // std::env::args() yields the program name first, so one
    // command-line argument means exactly two entries.
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 2 {
        eprintln!("usage: {} number", args[0]);
        std::process::exit(1);
    }
    let arg = args[1]
        .parse::<i64>()
        .expect("input must be a 64-bit integer");
    let output = unsafe { entry(arg) };
    println!("{}", output);
}
Now the external function entry is defined to take an integer as an argument. If we consult the System V AMD64 calling convention, we find that the first input argument is placed in the register rdi. Then we can compile the Variable case quite similarly to Number, but moving from rdi rather than a constant:
fn compile(e: &Expression) {
fn compile_rec(e: &Expression) {
match e {
Expression::Variable() => println!("mov rax, rdi"),
Expression::Number(n) => println!("mov rax, {}", n),
Expression::Add1(arg) => {
compile_rec(arg);
println!("add rax, 1");
}
Expression::Sub1(arg) => {
compile_rec(arg);
println!("sub rax, 1");
}
}
}
println!(
" section .text
global start_here
start_here:"
);
compile_rec(e);
println!("ret");
}
Exercise
How would you write an optimized version of this compiler, such that the output program always uses at most 3 instructions?