Lecture 3: Abstract Syntax Trees and Interpreters
1 Growing our Language: Basic Computation
Last time, we considered the following minuscule language:
‹expr› ::= NUMBER
Our Rust representation of this program was simply an n: i64
and our compiler simply placed that integer in the appropriate place in the
assembly:
section .text
global start_here
start_here:
mov rax, n
ret
Let’s extend our language so that it does some actual computation, adding increment and decrement operations. Say that we have access to some kind of functions add1 and sub1. Here are a few examples of programs that we might write in this language, and what answer we expect them to evaluate to:
Concrete Syntax       | Answer
7                     | 7
add1(7)               | 8
sub1(add1(7))         | 7
add1(sub1(add1(7)))   | 8
Now how can we go about implementing this language? The first step is to take in our program as an input string and (1) check that it is a well-formed program and (2) rewrite it into a more convenient representation to use.
1.1 Concrete Syntax
How programs are written as strings is called the concrete syntax of the programming language. We can describe the syntax for our Snake language using a grammar like the following:
‹expr› ::= NUMBER
         | add1 ( ‹expr› )
         | sub1 ( ‹expr› )
A grammar like this describes the concrete syntax of a language by specifying how you could generate the terms in the language. In this case we have one form called expr, and to create an expr we either pick a number, or we call an add1 or sub1 function on another expr, with the argument wrapped in parentheses. By convention, we ignore whitespace in this description. We’ll keep this all a bit informal for now and make it completely rigorous in the last month of the course.
Even for a simple language like this, writing a program that checks if a given string conforms to this grammar is fairly tedious. Thankfully there are widely available programs called parser generators that will generate code for a parser from fairly simple descriptions like this. In this course we use one called LALRPOP (pronounced kind of like "lollipop"). A LALRPOP implementation of the grammar for this language is given by:
pub Expr = {
<Num>,
"add1" "(" <Expr> ")",
"sub1" "(" <Expr> ")",
}
Num = {
<r"[+-]?[0-9]+">
}
This is almost exactly what we’ve written above, except that it is more precise in defining the syntax for numbers. Running the lalrpop tool (you can install it yourself with cargo install lalrpop) on this code produces 500 lines of fairly inscrutable Rust code that we are happy not to have to write by hand.
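To appreciate what the generator saves us, here is a sketch of what a hand-written recognizer for this grammar might look like. All the names here (recognize, expr, call_argument, number) are our own inventions, not part of LALRPOP's generated code, and the whitespace and sign handling are only lightly tested:

```rust
// A hand-rolled recognizer for the grammar above: returns true iff the
// input is a well-formed expression of our language.
fn recognize(input: &str) -> bool {
    matches!(expr(input), Some(rest) if rest.trim().is_empty())
}

// Try to consume one ‹expr› from the front of `s`; on success, return
// the remaining unconsumed input.
fn expr(s: &str) -> Option<&str> {
    let s = s.trim_start();
    if let Some(rest) = s.strip_prefix("add1") {
        call_argument(rest)
    } else if let Some(rest) = s.strip_prefix("sub1") {
        call_argument(rest)
    } else {
        number(s)
    }
}

// Consume "(" ‹expr› ")".
fn call_argument(s: &str) -> Option<&str> {
    let rest = s.trim_start().strip_prefix('(')?;
    let rest = expr(rest)?;
    rest.trim_start().strip_prefix(')')
}

// Consume a number matching [+-]?[0-9]+.
fn number(s: &str) -> Option<&str> {
    let body = if s.starts_with('+') || s.starts_with('-') { &s[1..] } else { s };
    let digits = body.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 { None } else { Some(&body[digits..]) }
}

fn main() {
    assert!(recognize("add1(sub1(7))"));
    assert!(!recognize("add1(7"));
    println!("ok");
}
```

Even for this tiny grammar, getting the corner cases right by hand is fiddly, which is exactly why we prefer to generate the parser.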
1.2 Abstract Syntax and Parsing
The lalrpop program above doesn’t really define a parser, but instead a recognizer, i.e., a function that takes in an input string and tells us whether or not the input string matches our language’s grammar. A recognizer alone is not very useful for a programming language implementation. After we have identified that the input string is in the language, we are still left with just a string, which is not a very convenient representation for writing an interpreter or a compiler. When we write the remainder of the implementation, we would prefer to use a representation where we can very easily answer some basic questions:
- Is the expression a number or an operation?
- If it’s a number, what is that number represented as an i64?
- If it’s an operation, is it an add1 or a sub1, and what is the sub-expression?
We could answer these questions using just a string as our representation, but it would be tedious and involve scanning the string, which is what the recognizer just did anyway! For more complex languages, this becomes completely infeasible.
Instead what we’ll do is represent our programs as special kinds of trees called abstract syntax trees. For this language, a tree is either a leaf, which should have a number, or a node, which should be labeled as add1 or sub1 and have one child. One of the main reasons we use Rust in this course is that Rust has very good support for defining such tree types:
enum Expression {
Number(i64),
Add1(Box<Expression>),
Sub1(Box<Expression>),
}
This is a Rust enum type, so-called because it is a type defined by enumerating the possible values. You can read more about enum types in chapter 6 of the Rust book. This definition matches our abstract description of the grammar: either the input is a number, or an add1 operation performed on another expression, or a sub1 operation performed on another.
Here are a couple of examples of programs and how to construct their corresponding abstract syntax trees:
Concrete Syntax | Abstract Syntax
7               | Expression::Number(7)
add1(7)         | Expression::Add1(Box::new(Expression::Number(7)))
sub1(add1(7))   | Expression::Sub1(Box::new(Expression::Add1(Box::new(Expression::Number(7)))))
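For instance, the tree for add1(sub1(7)) can be built and printed like so. The #[derive(Debug, PartialEq)] attribute is our own addition, purely so the value can be printed and compared in tests:

```rust
// The same enum as in the text, with Debug and PartialEq derived (our
// addition) so the tree can be printed and compared.
#[derive(Debug, PartialEq)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn main() {
    // add1(sub1(7)) as an abstract syntax tree. The Box provides the
    // indirection Rust requires for a recursive type.
    let e = Expression::Add1(Box::new(Expression::Sub1(Box::new(
        Expression::Number(7),
    ))));
    println!("{:?}", e);
}
```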
But notice how we’ve abstracted away the details of the concrete syntax. This same abstract syntax could apply to many different concrete syntaxes. For example, we could use an imperative syntax

17; sub1; add1; sub1

or a lisp-style syntax

(sub1 (add1 (sub1 17)))

but almost none of those details are relevant to the rest of compilation (the one thing that compilers do often keep around is source-location information, for providing better error messages).
To extract an abstract syntax tree from our input string, we update our lalrpop file to generate an output expression in addition to specifying the grammar:
pub Expr: Expression = {
<n: Num> => Expression::Number(n),
"add1" "(" <e: Expr> ")" => Expression::Add1(Box::new(e)),
"sub1" "(" <e: Expr> ")" => Expression::Sub1(Box::new(e)),
}
Num: i64 = {
<s:r"[+-]?[0-9]+"> => s.parse().unwrap()
}
For this language, our type of abstract syntax trees is very simple, but this approach scales up to very complicated languages. The analogous enum in the Rust compiler (which is written in Rust) contains about 40 different cases!
1.3 Semantics
Before we define a compiler for our new language, we should first
specify its semantics, i.e., how programs should be evaluated.
A simple method for doing this is to define a corresponding
interpreter. Like our previous language, the programs here
should output a single integer number, but now add1
and
sub1
should perform those mathematical operations.
We can implement this functionality by using Rust’s pattern-matching feature, which allows us to perform a case-analysis of which constructor was used, and get access to the data contained within.
fn interpret(e: &Expression) -> i64 {
match e {
Expression::Number(n) => *n,
Expression::Add1(arg) => interpret(arg) + 1,
Expression::Sub1(arg) => interpret(arg) - 1,
}
}
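For example, on the tree for add1(sub1(add1(7))) the recursion bottoms out at the leaf and each frame adjusts the result on the way back up (the enum and interpreter are repeated here so the snippet is self-contained):

```rust
#[derive(Debug)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression) -> i64 {
    match e {
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg) + 1,
        Expression::Sub1(arg) => interpret(arg) - 1,
    }
}

fn main() {
    // add1(sub1(add1(7))): evaluates to 7 + 1 - 1 + 1 = 8.
    let e = Expression::Add1(Box::new(Expression::Sub1(Box::new(
        Expression::Add1(Box::new(Expression::Number(7))),
    ))));
    println!("{}", interpret(&e)); // 8
}
```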
Exercise
Why is the input to this interpreter a &Expression rather than an Expression? What difference does this decision make?
This pattern of programming is very common when we work with abstract syntax trees: we define a function by matching on the tree, and then recursively call the function on the sub-tree(s).
1.4 Compilation
Next, let’s instead write a compiler, where the additions and subtractions actually happen in assembly code rather than in our Rust implementation.
1.4.1 x86 Basics
Recall that we want to compile our program to a function in assembly code that takes no inputs and returns a 64-bit integer. The calling convention we use (System V AMD64) dictates that the result of such a function is stored in a register called rax. Registers like rax are a part of the abstract machine we program against in x86. Each of the general-purpose registers rax, rcx, rdx, rbx, rdi, rsi, rsp, rbp, r8-r15 stores a 64-bit value and can be manipulated using a large variety of assembly code instructions. These registers are mostly indistinguishable except by conventions (a notable exception is that rsp is treated as a stack pointer by many instructions).
The only instructions we’ll need today are mov, add/sub, and ret.

First, the mov instruction

mov dest, src

copies a value from src to dest. mov is a surprisingly complex operation in x86, encompassing loads, stores, register-to-register copies, constants, and some memory offset calculations. In its full generality it is even Turing complete. For today let’s use it in a very simple form: dest can be a register and src can be a constant or another register, in which case it stores the value of the constant or the current value of src into dest.

Next, the two arithmetic instructions
add dest, arg
sub dest, arg
are like mov in that the first argument is updated by the instruction. The semantics of these instructions is to add (or subtract) the value of arg to (or from) the current value of dest, and update dest with the result. I.e., you can think of add like a += operation and sub like a -= operation:
add dest, arg ;;; dest += arg
sub dest, arg ;;; dest -= arg
Finally, we have

ret

ret will work properly to implement a function return as long as we make sure that, when it is executed, we have not updated rsp, and the value we want to return is in rax.
1.4.2 Compiling to x86
We already know how to compile a number to a function that returns it: we simply mov the number into rax and then execute the ret. We can generalize this to an Adder expression like add1(sub1(add1(7))) by using the rax register as a place to store our intermediate results:

mov rax, 7 ;; rax holds 7
add rax, 1 ;; rax holds add1(7)
sub rax, 1 ;; rax holds sub1(add1(7))
add rax, 1 ;; rax holds add1(sub1(add1(7)))
ret        ;; return to the caller with rax holding the correct output
We have commented the above assembly code with its correspondence to the source program, and we see that in a way the assembly code is "reversed" from the original concrete syntax. Now how do we turn this implementation strategy into a compiler for arbitrary expressions? Just like the interpreter, we can implement the compiler by recursive descent on the input expression. If we look at the comments above, we see a recursive pattern: after each line is executed, rax holds the result of a larger and larger sub-expression. Then we can develop a recursive strategy for implementation: compile the source program to a sequence of x86 instructions that places the result into rax. For constants this is simply a mov, and for the recursive cases of add1 and sub1, we can append a corresponding assembly operation. Finally, we append a ret instruction at the end.
Here’s an implementation that emits the assembly instructions directly to stdout:
fn compile(e: &Expression) {
fn compile_rec(e: &Expression) {
match e {
Expression::Number(n) => println!("mov rax, {}", n),
Expression::Add1(arg) => {
compile_rec(arg);
println!("add rax, 1");
}
Expression::Sub1(arg) => {
compile_rec(arg);
println!("sub rax, 1");
}
}
}
println!(
" section .text
global start_here
start_here:"
);
compile_rec(e);
println!("ret");
}
Notice here that it would be counter-productive to directly use compile itself on the recursive sub-trees. Instead we define a helper function compile_rec whose more useful behavior is to produce just the instructions that move the value into rax.
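A variant of this compiler that returns the assembly as a String, instead of printing it, is easier to inspect and test. This refactoring is our own sketch, not the required interface:

```rust
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

// Same recursive strategy as compile/compile_rec, but the instructions
// accumulate into a String instead of going to stdout.
fn compile_to_string(e: &Expression) -> String {
    fn compile_rec(e: &Expression, out: &mut String) {
        match e {
            Expression::Number(n) => out.push_str(&format!("mov rax, {}\n", n)),
            Expression::Add1(arg) => {
                compile_rec(arg, out);
                out.push_str("add rax, 1\n");
            }
            Expression::Sub1(arg) => {
                compile_rec(arg, out);
                out.push_str("sub rax, 1\n");
            }
        }
    }
    let mut out = String::from("section .text\nglobal start_here\nstart_here:\n");
    compile_rec(e, &mut out);
    out.push_str("ret\n");
    out
}

fn main() {
    let asm = compile_to_string(&Expression::Add1(Box::new(Expression::Number(7))));
    print!("{}", asm);
}
```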
Having a compiler and an interpreter is also helpful for testing. We can write automated tests that check that our compiler and interpreter have the same behavior on all of our examples.
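One way to sketch such a differential test without assembling and running real machine code is to evaluate the emitted instructions with a tiny rax-only emulator. The emulator here is entirely our own testing device, covering only the three instruction shapes we emit:

```rust
#[derive(Debug)]
enum Expression {
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression) -> i64 {
    match e {
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg) + 1,
        Expression::Sub1(arg) => interpret(arg) - 1,
    }
}

// Emit just the instruction sequence, without the section/label
// boilerplate, so it is easy to inspect.
fn compile_body(e: &Expression) -> Vec<String> {
    match e {
        Expression::Number(n) => vec![format!("mov rax, {}", n)],
        Expression::Add1(arg) => {
            let mut v = compile_body(arg);
            v.push("add rax, 1".to_string());
            v
        }
        Expression::Sub1(arg) => {
            let mut v = compile_body(arg);
            v.push("sub rax, 1".to_string());
            v
        }
    }
}

// A toy evaluator for the three instruction shapes we emit; it stands
// in for assembling and running the code in a unit test.
fn run_emulated(instrs: &[String]) -> i64 {
    let mut rax: i64 = 0;
    for i in instrs {
        if let Some(n) = i.strip_prefix("mov rax, ") {
            rax = n.parse().unwrap();
        } else if let Some(n) = i.strip_prefix("add rax, ") {
            rax += n.parse::<i64>().unwrap();
        } else if let Some(n) = i.strip_prefix("sub rax, ") {
            rax -= n.parse::<i64>().unwrap();
        } else {
            panic!("unknown instruction: {}", i);
        }
    }
    rax
}

fn main() {
    let e = Expression::Sub1(Box::new(Expression::Add1(Box::new(Expression::Number(10)))));
    // The compiled instructions and the interpreter must agree.
    assert_eq!(run_emulated(&compile_body(&e)), interpret(&e));
    println!("compiler and interpreter agree");
}
```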
1.4.3 Optimization
For this simple language there is an obvious way to produce an optimized assembly code program, one that uses as few instructions as possible. We can simply run the interpreter and compile to a program that returns the number that is output by the interpreter.
fn optimized_compile(e: &Expression) {
println!(
" section .text
global start_here
start_here:
mov rax, {}
ret",
        interpret(e)
);
}
Here we have improved the efficiency of the compiled program by doing more work at compile-time. Usually this is a good tradeoff, as programs are run many more times than they are compiled.
2 Extending the language: Dynamically determined Input
Our programs were very easy to optimize because all the information we needed to determine the result was available at compile-time. Obviously this isn’t typical: useful programs interact with the external world, e.g., by making network requests, inspecting files or reading command-line arguments. Let’s extend our language to take an input from the command line.
‹prog› ::= def main ( x ) : ‹expr›
‹expr› ::= NUMBER | x | add1 ( ‹expr› ) | sub1 ( ‹expr› )
Now our program consists not of a single expression, but a main "function" that takes in an argument named x. For today, let’s say that the parameter is always called x; we’ll talk about a more robust treatment of variables next time.
What is the impact on our abstract syntax? We just need to add a new kind of leaf node to our abstract syntax trees for when the program uses the input variable.
pub enum Expression {
Variable(),
Number(i64),
Add1(Box<Expression>),
Sub1(Box<Expression>),
}
And now our interpreter takes an additional argument, which corresponds to the input:
fn interpret(e: &Expression, x: i64) -> i64 {
match e {
Expression::Variable() => x,
Expression::Number(n) => *n,
Expression::Add1(arg) => interpret(arg, x) + 1,
Expression::Sub1(arg) => interpret(arg, x) - 1,
}
}
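For example, the program def main(x): add1(add1(x)) maps the input 40 to 42. The enum and interpreter are repeated here so the snippet is self-contained:

```rust
#[derive(Debug)]
enum Expression {
    Variable(),
    Number(i64),
    Add1(Box<Expression>),
    Sub1(Box<Expression>),
}

fn interpret(e: &Expression, x: i64) -> i64 {
    match e {
        Expression::Variable() => x,
        Expression::Number(n) => *n,
        Expression::Add1(arg) => interpret(arg, x) + 1,
        Expression::Sub1(arg) => interpret(arg, x) - 1,
    }
}

fn main() {
    // def main(x): add1(add1(x)), run on the input 40.
    let e = Expression::Add1(Box::new(Expression::Add1(Box::new(
        Expression::Variable(),
    ))));
    println!("{}", interpret(&e, 40)); // 42
}
```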
We correspondingly need to change our Rust stub.rs wrapper to provide an input argument:
#[link(name = "compiled_code", kind = "static")]
extern "sysv64" {
    #[link_name = "\x01entry"]
    fn entry(param: i64) -> i64;
}

fn main() {
    // std::env::args() yields the program name first, so one
    // command-line argument means exactly two entries.
    let args: Vec<String> = std::env::args().collect();
    if args.len() != 2 {
        eprintln!("usage: {} number", args[0]);
        std::process::exit(1);
    }
    let arg = args[1]
        .parse::<i64>()
        .expect("input must be a 64-bit integer");
    let output = unsafe { entry(arg) };
    println!("{}", output);
}
Now the external function entry is defined to take an integer as an argument. If we consult the System V AMD64 calling convention, we find that the first input argument is placed in the register rdi. Then we can compile the Variable case quite similarly to Number, but moving from rdi rather than a constant:
fn compile(e: &Expression) {
fn compile_rec(e: &Expression) {
match e {
Expression::Variable() => println!("mov rax, rdi"),
Expression::Number(n) => println!("mov rax, {}", n),
Expression::Add1(arg) => {
compile_rec(arg);
println!("add rax, 1");
}
Expression::Sub1(arg) => {
compile_rec(arg);
println!("sub rax, 1");
}
}
}
println!(
" section .text
global start_here
start_here:"
);
compile_rec(e);
println!("ret");
}
Exercise
How would you write an optimized version of this compiler, such that the output program always uses at most 3 instructions?