Assignment 2: Adder: A starter language

8.2

Assignment 2: Adder: A starter language

Due: Tue 09/21 at 9:00pm

git clone

In this assignment you’ll implement a compiler for a small language called Adder (because it primarily adds things). Click for scary snake picture!

1 The Adder Language

In each of the next several assignments, we’ll introduce a language that we’ll implement. We’ll start small, and build up features incrementally. We’re starting with Adder, which has just a few features — defining variables, and primitive operations on numbers.

There are a few pieces that go into defining a language for us to compile.

A description of the concrete syntax — the text the programmer writes
A description of the abstract syntax — how to express what the programmer wrote in a data structure our compiler uses.
The semantics — or description of the behavior — of the abstract syntax, so our compiler knows what the code it generates should do.

1.1 Concrete Syntax

For this assignment we will be using an s-expression based syntax so you can see how your s-expression parser from last time fits into the compiler pipeline. Next week we resume using a more python/ruby-like syntax.

The concrete syntax of Adder is (remember EPSILON means the empty string):

‹expr›: | NUMBER | IDENTIFIER | ( let ( ‹bindings› ) ‹expr› ) | ( add1 ‹expr› ) | ( sub1 ‹expr› ) ‹bindings›: | EPSILON | ( IDENTIFIER ‹expr› ) ‹bindings›

The main difference from the language we discussed in class (besides the S-expression style) is that a let expression can have zero, one or many bindings. An IDENTIFIER is any non-sequence of letters and digits (starting with a letter). Any highlighted text is a literal token, meaning the programmer must type exactly those characters or keywords.

1.2 Abstract Syntax

The abstract syntax of Adder is a Rust datatype, and corresponds nearly one-to-one with the concrete syntax. We parameterize the AST by a type of annotations.

pub enum Exp<Ann> {
    Num(i64, Ann),
    Var(String, Ann),
    Prim1(Prim1, Box<Exp<Ann>>, Ann),
    Let { bindings: Vec<(String, Exp<Ann>)>, // new binding declarations
          body: Box<Exp<Ann>>,  // the expression in which the new variables are bound
          ann: Ann
        }
}

pub enum Prim1 {
    Add1,
    Sub1,
}

1.3 Semantics

An Adder program always evaluates to a single integer. Nums evaluate to themselves (so a program just consisting of Num(5) should evaluate to the integer 5). Primitive expressions perform addition or subtraction by one on their argument. Let Let bindings should evaluate all the binding expressions to values one by one, and after each, store a mapping from the given name to the corresponding value in both (a) the rest of the bindings, and (b) the body of the let expression. Identifiers evaluate to whatever their current stored value is. There are several examples further down to make this concrete.

The compiler should return an error if

There is a binding list containing two or more bindings with the same name
An identifier is unbound (there is no surrounding let binding for it)

These errors are encoded in the CompileError type in compile.rs.

Here are some examples of Adder programs (ignoring the Ann written in the style of Debug):

Concrete Syntax	Abstract Syntax	Answer
`5`	`Num(5)`	`5`
`(sub1 (add1 (sub1 5)))`	`Prim1(Sub1, Prim1(Add1, Prim1(Sub1, Num(5))))`	`4`
`(let ((x 5)) (add1 x))`	`Let{ bindings: [("x", Num(5))], body: Prim1(Add1, Var("x")) }`	`6`
`(let ((x 5) (y (sub1 x))) (sub1 y))`	`Let{ bindings: [("x", Num(5)), ("y", Prim1(Sub1, Var("x")))], body: Prim1(Sub1, Var("y"))}`	`3`

Note that as discussed in class, shadowing of variables is allowed, but all variable names in a single binding list must be unique. So

(let ((x 3))
  (let ((x 4)) x))

is a valid program, but

(let ((x 3)
      (x 4))
  x)

should produce a compile-time error.

2 Starter code for this assignment

You’ve been given a starter codebase that has several pieces of infrastructure:

A file syntax.rs that defines tokens, s-expressions and abstract syntax trees and the associated tokenizer/parsers. This includes an s-expression parser that shows one approach to the previous homework.
A file asm.rs that defines types registers, instructions and their arguments. Read over these types carefully to see how we encode the invariants of x86-64 using Rust’s type system.
A file compile.rs that defines the compiler to a sequence of instructions.
lib.rs which declares the modules in the package. You don’t need to edit this.
A main program (main.rs) that uses the support code in runner.rs and interp.rs to implement interactively running the compiler, linking with the runtime and using the reference interpreter. You don’t need to edit either of these files.
A file runtime/stub.rs that links with your compilation output to form executables.
A file tests/examples.rs in which you will write integration tests (see below).
A file examples/parse_error.adder to demonstrate where to put example files.

Your edits — which will be to write the compiler for Adder, and test it — will happen in test/examples.rs, compile.rs and asm.rs.

3 Implementing a Compiler for Adder

3.1 Writing the Compiler

The primary task of writing the Adder compiler is simple to state: take an instance of the Expr datatype and turn it into a list of assembly instructions. The provided compiler skeleton is set up to do just this, broken up over a few functions.

The first is in compile.rs:

compile_to_instrs(&Expr<Span>) -> Result<Vec<Instr>, CompileErr>

which takes a expr value (abstract syntax, tagged with source location information) and turns it into a list of assembly instructions, represented by the instruction type or returns one of the possible compilation errors defined in the file. Use only the provided instruction types for this assignment; we will be gradually expanding this set of instructions as the semester progresses. This function has an associated helper that takes some extra arguments to track the variable environment and stack offset.

The other component you need to implement is in asm.rs:

instrs_to_string(is: &[Instr]) -> String

which renders individual instances of the instruction datatype into a string representation of the instruction (this is done for you for mov). Additionally there are several helper functions for printing registers, memory references etc that you will need to fill in. This part is straightforward, but forces you to understand the syntax of the assembly code you are generating. Most of the compiler concepts happen in the first step, where we are generating assembly instructions from abstract syntax. Do use this assembly guide — or ask! — if you have questions about the concrete syntax of an instruction. Just note that this guide is written for 32-bit x86, so registers have names like eax, esp, but otherwise the syntax is mostly unchanged.

3.2 Testing the Compiler

The code-generation portion of pipeline is not as amenable to unit testing since we do not want to over-constrain what specific instructions we use for a given input. Instead, this week we will be introducing integration tests, which test the entire compiler pipeline.

Cargo looks for integration tests in the tests/ directory. Included in the startup code are two macros, mk_test and mk_fail_test for writing end-to-end compiler tests. To use them, add a file to the examples/ directory and then call mk_test with a name for the test, the name of the file in the example directory you are testing, and the expected number for it to output. For failing tests, you can use mk_fail_test to include a string that you expect to occur in the error message.

You may still unit test your code, especially if you have tricky helper functions you define, but only integration test coverage will be graded this time. In future assignments, when we have multiple analyses and transformations, we will resume unit testing.

3.3 Running main and compiling to binary

main.rs provides some convenient infrastructure for running your compiler and interpreter from a terminal. To see the expected arguments simply run cargo run with no options. Note that to distinguish the flags for your program from Cargo’s flags you need to insert -- before your flags. For example, to run the interpreter on a file example.adder you can use

$ cargo run -- --interp example.adder

Alternatively, you can compile main.rs to a binary using cargo build --release. Then the binary will be located at ./target/release/snake.

4 Cross-platform Issues

I have tried to make the building and linking as cross-platform as possible. I have tested that the programs work on Linux and Mac OS X, and I think they should work on Windows as well. On any platform you need to have nasm and ar installed. Ask on Piazza if you have any issues installing these.

Here are the explicit commands the runner uses to build an executable from an assembly code file compiled_code.s and the rust stub stub.rs. First build an object file using nasm, passing the appropriate format flag for your platform.

$ nasm -felf64 compiled_code.s -o compiled_code.o   # Linux
$ nasm -fmacho64 compiled_code.s -o compiled_code.o # Mac
$ nasm -fwin64 compiled_code.s -o compiled_code.o   # Windows

Then build a static library file to link with the rust file

$ ar r libcompiled_code.a compiled_code.o # Linux & Mac
$ ar r compiled_code.lib compiled_code.o  # Windows

Finally, compile the rust file, indicating where to look for the static library.

$ rustc stub.rs -L . -o stub.exe

This should make an executable that you can run.

$ ./stub.exe

I recommend you try this out with the sample file

section .text
global start_here
start_here:
  mov RAX, 483
  ret

To make sure you have installed nasm and ar correctly. Note that you do not need an underscore in front of start_here on Mac anymore as I found an undocumented workaround to that issue.

5 Rust Tips

To read more about pattern matching on structs like MemRef and enums with named arguments like Exp see this chapter of the Rust book.

Sometimes you need to convert between different numeric types in Rust, typically using the Into and TryInto traits. These traits can be a bit tedious to use in my opinion because they often need type annotations in weird places. To make things simple for you, I’ve included a function usize_to_i32 in compile.rs that tries to convert a usize to an i32, panicking if it does not fit. I used this in the reference solution so you may find it useful.

Finally, in checking binding-lists for duplicates you might want to use a data structure to keep track of the names. You can use a vector, like we use in the interpreter, or you may want to try out the HashSet module in the standard library.

6 List of Deliverables

Include (at least) the following files in your zip submission:

Cargo.toml
src/compile.rs
src/asm.rs
src/syntax.rs
src/interp.rs
src/runner.rs
src/lib.rs
any additional modules you saw fit to write
your compiler integration tests (tests/examples.rs)
The test input programs (examples/*.adder files) you wrote

7 Grading Standards

For this assignment, you will be graded on

Whether your code implements the specification (functional correctness),
the clarity and cleanliness of your code, and
the comprehensiveness of your test coverage

8 Submission

Wait! Please read the assignment again and verify that you have not forgotten anything!

Please submit your homework on gradescope by the above deadline.

1	The Adder Language
2	Starter code for this assignment
3	Implementing a Compiler for Adder
4	Cross-platform Issues
5	Rust Tips
6	List of Deliverables
7	Grading Standards
8	Submission