
## Lecture 5: Binary Operations and Sequential Form

Today we will extend the compiler to support binary arithmetic operations, not just increment and decrement. This is surprisingly tricky: it introduces the ambiguity of evaluation order into the language, so we will need to add a new compiler pass that makes the evaluation order explicit in the structure of the term.

### 1 Growing the language: adding infix operators

Again, we follow our standard recipe:

1. Its impact on the concrete syntax of the language

2. Examples using the new enhancements, so we build intuition of them

3. Its impact on the abstract syntax and semantics of the language

4. Any new or changed transformations needed to process the new forms

5. Executable tests to confirm the enhancement works as intended

#### 1.1 The new concrete syntax

‹expr›: ... | ‹expr› + ‹expr› | ‹expr› - ‹expr› | ‹expr› * ‹expr› | ( ‹expr› )

#### 1.2 Examples and semantics

These new expression forms should be familiar from standard arithmetic notation.

The parser will take care of operator precedence. That is, the term

(2 - 3) + 4 * 5

will parse the same way as

(2 - 3) + (4 * 5)

according to the PEMDAS rules.

However, consider what would happen if we added new forms print0(expr) and print1(expr) to our language, which print 0 or 1 respectively before evaluating expr.¹ Then how should the following program evaluate?

print0(6) * print1(7)

First, no matter what, the expression should evaluate to 6 * 7 = 42, so the question is what it should print. There are several reasonable choices:

• Prints "01", this is left-to-right evaluation order

• Prints "10", this is right-to-left evaluation order

• May print either "01" or "10", this means the evaluation order is unspecified, or implementation dependent

Which do you prefer? Either of the first two seems very reasonable, with left-to-right matching the way we write English. The third option is something probably only a compiler writer would choose: it makes programs easier to optimize, because you can arbitrarily re-order things!

We’ll go with the first choice: left-to-right evaluation order.

Note that doing things left-to-right like this is not quite the same as the PEMDAS rules. For instance the following arithmetic expression evaluates:

    (2 - 3) + 4 * 5
==> -1 + (4 * 5)
==> -1 + 20
==> 19

rather than the possible alternative of doing the multiplication first. Following PEMDAS for the evaluation order would be very confusing. Consider:

(print0(2) - 3) + print1(4) * 5

If we follow our left-to-right evaluation then this prints "01", but if we follow PEMDAS literally we would probably print "10", which I hope you’ll agree is quite counter-intuitive.
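To make this concrete, here is a tiny stand-alone evaluator sketch demonstrating left-to-right evaluation of print0(6) * print1(7). The types here are simplified stand-ins of my own, not our compiler's actual AST; in particular `Print` collects its output into a string rather than printing.

```rust
// Simplified stand-in expression type: numbers, a print-then-evaluate
// form (standing in for print0/print1), and multiplication.
enum E {
    Num(i64),
    Print(char, Box<E>),
    Mul(Box<E>, Box<E>),
}

fn eval(e: &E, out: &mut String) -> i64 {
    match e {
        E::Num(n) => *n,
        E::Print(c, inner) => {
            out.push(*c); // record the printed character
            eval(inner, out)
        }
        // Left-to-right: the left operand is fully evaluated (including
        // its effects) before the right operand is touched.
        E::Mul(l, r) => {
            let lv = eval(l, out);
            let rv = eval(r, out);
            lv * rv
        }
    }
}

fn main() {
    // print0(6) * print1(7)
    let prog = E::Mul(
        Box::new(E::Print('0', Box::new(E::Num(6)))),
        Box::new(E::Print('1', Box::new(E::Num(7)))),
    );
    let mut out = String::new();
    let v = eval(&prog, &mut out);
    assert_eq!(v, 42);
    assert_eq!(out, "01"); // left-to-right prints "0" then "1"
    println!("{} prints {:?}", v, out);
}
```

Swapping the two recursive calls in the `Mul` case would give right-to-left order and print "10" instead.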

#### 1.3 Enhancing the abstract syntax

enum Prim2 {
    Add,
    Sub,
    Mul,
}

enum Exp<Ann> {
    ...
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

We simply add a new constructor describing our primitive binary operations, and an enumeration of what those operations might be. The parser will do the hard work of figuring out the correct tree structure for un-parenthesized expressions like "1 - 2 + x * y".
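As a sketch of what the parser produces, here is the tree for "1 - 2 + x * y" built by hand. I assume `Num` and `Var` constructors as in previous lectures, and the `show` pretty-printer is my own helper, added only to make the tree structure visible.

```rust
#[derive(Debug, Clone, Copy)]
enum Prim2 { Add, Sub, Mul }

#[derive(Debug)]
enum Exp<Ann> {
    Num(i64, Ann),
    Var(String, Ann),
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

// Fully parenthesized rendering, to make the tree structure visible.
fn show(e: &Exp<()>) -> String {
    match e {
        Exp::Num(n, _) => n.to_string(),
        Exp::Var(x, _) => x.clone(),
        Exp::Prim2(op, l, r, _) => {
            let sym = match op { Prim2::Add => "+", Prim2::Sub => "-", Prim2::Mul => "*" };
            format!("({} {} {})", show(l), sym, show(r))
        }
    }
}

fn main() {
    // "1 - 2 + x * y": "+" and "-" associate left at the same precedence,
    // and "*" binds tighter, so we get ((1 - 2) + (x * y)).
    let tree: Exp<()> = Exp::Prim2(
        Prim2::Add,
        Box::new(Exp::Prim2(
            Prim2::Sub,
            Box::new(Exp::Num(1, ())),
            Box::new(Exp::Num(2, ())),
            (),
        )),
        Box::new(Exp::Prim2(
            Prim2::Mul,
            Box::new(Exp::Var("x".to_string(), ())),
            Box::new(Exp::Var("y".to_string(), ())),
            (),
        )),
        (),
    );
    assert_eq!(show(&tree), "((1 - 2) + (x * y))");
    println!("{}", show(&tree));
}
```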

#### 1.4 Enhancing the transformations: a new intermediate representation (IR)

Exercise

What goes wrong with our current naive transformations? How can we fix them?

Let’s try manually “compiling” some simple binary-operator expressions to assembly:

| Original expression | Compiled assembly |
|---------------------|-------------------|
| `(2 + 3) + 4` | `mov RAX, 2`<br>`add RAX, 3`<br>`add RAX, 4` |
| `(4 - 3) - 2` | `mov RAX, 4`<br>`sub RAX, 3`<br>`sub RAX, 2` |
| `((4 - 3) - 2) * 5` | `mov RAX, 4`<br>`sub RAX, 3`<br>`sub RAX, 2`<br>`mul RAX, 5` |
| `(2 - 3) + (4 * 5)` | `mov RAX, 2`<br>`sub RAX, 3`<br>`?????` |

Do Now!

Convince yourself that using a let-bound variable in place of any of these constants will work just as well.

So far, our compiler has only ever had to deal with a single active expression at a time: it moves the result into RAX, increments or decrements it, and then potentially moves it somewhere onto the stack, for retrieval and later use. But with our new compound expression forms, that won’t suffice: the execution of (2 - 3) + (4 * 5) above clearly must stash the result of (2 - 3) somewhere, to make room in RAX for the subsequent multiplication. We might try to use another register (RBX, maybe?), but clearly this approach won’t scale up, since there are only a handful of registers available. What to do?

##### 1.4.1 Immediate expressions

Do Now!

Why did the first few expressions compile successfully?

Notice that for the first few expressions, all the arguments to the operators were immediately ready:

• They required no further computation to be ready.

• They were either constants, or variables that could be read off the stack.

Perhaps we can salvage the final program by transforming it somehow, such that all its operations are on immediate values, too.

Do Now!

Try to do this: Find a program that computes the same answer, in the same order of operations, but where every operator is applied only to immediate values.

Note that conceptually, our last program is equivalent to the following:

let first = 2 - 3 in
let second = 4 * 5 in
first + second

This program has decomposed the compound addition expression into the sum of two let-bound variables, each of which is a single operation on immediate values. We can easily compile each individual operation, and we already know how to save results to the stack and restore them for later use, which means we can compile this transformed program to assembly successfully.
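Assuming the stack conventions from previous lectures (each let-bound variable gets an 8-byte slot below RSP; your compiler's exact offsets and instruction choices may differ), the transformed program could compile to something like:

    mov RAX, 2            ; first = 2 - 3
    sub RAX, 3
    mov [RSP - 8], RAX    ; save first to its stack slot
    mov RAX, 4            ; second = 4 * 5
    mul RAX, 5
    mov [RSP - 16], RAX   ; save second
    mov RAX, [RSP - 8]    ; first + second
    add RAX, [RSP - 16]

Each operation touches only RAX and one immediate or stack slot, which is exactly the shape our instruction selection already knows how to emit.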

Come to think of it, compiling operations applied to immediate values is so easy that it would be nice to do the same thing for unary primitives and if. That way every intermediate result gets a name, which will then be assigned a place on the stack (or, once we get to register allocation, a register), instead of every intermediate result necessarily going through RAX.

#### 1.5 Testing

Do Now!

Once you’ve completed the section below, run the given source programs through our compiler pipeline. It should give us exactly the handwritten assembly we intend. If not, debug the compiler until it does.

### 2 Sequential Form

Our goal is to transform our program such that every operator is applied only to immediate values (constants/variables), and every expression (besides let) does exactly one thing with no other internal computation necessary. We will call such a form Sequential Form.²

There are at least two ways to implement this. First, we could write a function sequentialize(&Exp) -> Exp that puts our expressions into sequential form. This type makes sense because the sequential expressions form a subset of all expressions. However, this type signature is imprecise: it doesn’t reflect the fact that the output is sequential. This means that when we write the next function, compile_to_instrs(&Exp) -> Vec<Instr>, we will still have to cover all expressions in our code, likely by using panic! when the input is not sequential. Instead, we can eliminate this mismatch by developing a new type SeqExp that can express only those programs in sequential form. We also need a type ImmExp for describing the subset of immediate expressions.

enum ImmExp {
    Num(i64),
    Var(String),
}

enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Prim1(Prim1, ImmExp, Ann),
    Prim2(Prim2, ImmExp, ImmExp, Ann),
    Let {
        var: String,
        bound_exp: Box<SeqExp<Ann>>,
        body: Box<SeqExp<Ann>>,
        ann: Ann,
    },
    If {
        cond: ImmExp,
        thn: Box<SeqExp<Ann>>,
        els: Box<SeqExp<Ann>>,
        ann: Ann,
    },
}

Do Now!

Why did we choose to make cond an immediate, but not thn and els?

So Prim1 and Prim2 require that their arguments be immediates, while in the Let form we require only that the two sub-expressions themselves be in sequential form. In the If case the branches are allowed to be arbitrary sequential expressions, since we don’t want to evaluate them unless they are selected by the condition. The condition, on the other hand, is an immediate, since it will always be evaluated.

While we already knew how to compile Prim1 and If with full sub-expressions, requiring the sub-expressions to be immediates simplifies the code-generation pass since all "sequencing" code goes into the Let case. Now when we add more constructs to the language, we can relegate all sequencing code to the Let case and not re-implement it for the new constructs.

Also note that while Exp allowed for multiple bindings, here we allow for only one binding at a time. This also simplifies the code generation since we only have to handle one let at a time, and once we have taken care of scope-checking, they should have equivalent semantics.
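To sketch how the one-at-a-time form arises, here is a small stand-alone helper (using pared-down, unannotated stand-in types; the name `nest_lets` is my own, not the assignment's) that folds a list of bindings into nested single lets:

```rust
#[derive(Debug, PartialEq)]
enum SeqExp {
    Var(String),
    Num(i64),
    Let { var: String, bound_exp: Box<SeqExp>, body: Box<SeqExp> },
}

// Fold a multi-binding let (as in Exp) into nested one-at-a-time lets
// (as in SeqExp). Assumes the right-hand sides are already sequential.
fn nest_lets(bindings: Vec<(String, SeqExp)>, body: SeqExp) -> SeqExp {
    // Fold from the last binding outward so the first binding ends up
    // outermost, preserving left-to-right scoping.
    bindings.into_iter().rev().fold(body, |acc, (var, bound)| SeqExp::Let {
        var,
        bound_exp: Box::new(bound),
        body: Box::new(acc),
    })
}

fn main() {
    // let x = 1, y = 2 in y  ==>  let x = 1 in let y = 2 in y
    let result = nest_lets(
        vec![("x".into(), SeqExp::Num(1)), ("y".into(), SeqExp::Num(2))],
        SeqExp::Var("y".into()),
    );
    let expected = SeqExp::Let {
        var: "x".into(),
        bound_exp: Box::new(SeqExp::Num(1)),
        body: Box::new(SeqExp::Let {
            var: "y".into(),
            bound_exp: Box::new(SeqExp::Num(2)),
            body: Box::new(SeqExp::Var("y".into())),
        }),
    };
    assert_eq!(result, expected);
    println!("{:?}", result);
}
```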

#### 2.1 Sequentializing our Programs

Exercise

Try to systematically define a conversion function sequentialize(&Exp<u32>) -> SeqExp<()> such that the resulting expression has the same semantics.

Exercise

Why should the type of the function be (&Exp<u32>) -> SeqExp<()>? In particular, why do we discard the input tags?

The central idea is that to convert some expression e1 + e2 (or any other operator), we add new let-bindings for every sub-expression. So e1 + e2 becomes let x1 = se1 in let x2 = se2 in x1 + x2 where se1 is the result of putting e1 into sequential form, and similarly for se2. The trickiest part of implementing this is making sure that the variable names we use x1, x2 are different from all the names used by the source code, as well as different from other variables we generate. To make sure they are different from each other, we can use the unique tag we have annotated on the term in a previous pass. To ensure they are different from names from the source code, we can give them names that are not valid syntactically. For instance, our parser only accepts variable names that start with an ASCII alphabetic character, so if we start our generated variable names with a non-alphabetic character we won’t clash with source variable names.

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        ...
        Exp::Prim2(op, e1, e2, tag) => {
            let s_e1 = sequentialize(e1);
            let s_e2 = sequentialize(e2);
            let name1 = format!("#prim2_1_{}", tag);
            let name2 = format!("#prim2_2_{}", tag);
            SeqExp::Let {
                var: name1.clone(),
                bound_exp: Box::new(s_e1),
                ann: (),
                body: Box::new(SeqExp::Let {
                    var: name2.clone(),
                    bound_exp: Box::new(s_e2),
                    ann: (),
                    body: Box::new(SeqExp::Prim2(
                        *op,
                        ImmExp::Var(name1),
                        ImmExp::Var(name2),
                        (),
                    )),
                }),
            }
        }
        ...
    }
}

Note that we discard the tags and replace them with empty annotations () in the output program. This makes sense because a Prim2 gets translated to multiple expression forms (two Lets and a Prim2), so we cannot simply preserve the tag without violating our invariant that all sub-expressions have unique tags.

The other cases are similar, with the Let case handling the mismatch between binding sequences in Exp and the "one-at-a-time" Let in SeqExp. The main thing to be careful of is to not get too greedy in sequentializing. When we sequentialize an If

if e1: e2 else: e3

we should simply sequentialize the branches and lift only the condition:

let x1 = se1 in if x1: se2 else: se3

rather than lifting all of them

let x1 = se1 in
let x2 = se2 in
let x3 = se3 in
if x1: x2 else: x3

This last form would always run both branches.
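Putting this together, the If case might look like the following self-contained sketch. The stand-in types below are pared down (no annotations, only the constructors this example needs) and the generated name `#if_{tag}` is my own convention, not the assignment's:

```rust
#[derive(Debug, PartialEq)]
enum Exp {
    Num(i64, u32),
    If { cond: Box<Exp>, thn: Box<Exp>, els: Box<Exp>, tag: u32 },
}

#[derive(Debug, PartialEq)]
enum ImmExp { Num(i64), Var(String) }

#[derive(Debug, PartialEq)]
enum SeqExp {
    Imm(ImmExp),
    Let { var: String, bound_exp: Box<SeqExp>, body: Box<SeqExp> },
    If { cond: ImmExp, thn: Box<SeqExp>, els: Box<SeqExp> },
}

fn sequentialize(e: &Exp) -> SeqExp {
    match e {
        Exp::Num(n, _) => SeqExp::Imm(ImmExp::Num(*n)),
        Exp::If { cond, thn, els, tag } => {
            // Lift only the condition into a let; the branches are merely
            // recursively sequentialized, so neither runs unless selected.
            let name = format!("#if_{}", tag);
            SeqExp::Let {
                var: name.clone(),
                bound_exp: Box::new(sequentialize(cond)),
                body: Box::new(SeqExp::If {
                    cond: ImmExp::Var(name),
                    thn: Box::new(sequentialize(thn)),
                    els: Box::new(sequentialize(els)),
                }),
            }
        }
    }
}

fn main() {
    // if 1: 2 else: 3  ==>  let #if_0 = 1 in if #if_0: 2 else: 3
    let prog = Exp::If {
        cond: Box::new(Exp::Num(1, 1)),
        thn: Box::new(Exp::Num(2, 2)),
        els: Box::new(Exp::Num(3, 3)),
        tag: 0,
    };
    println!("{:?}", sequentialize(&prog));
}
```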

#### 2.2 Improving the translation

This sequentialization pass is somewhat sloppy: it will generate many unnecessary temporary variables.

Do Now!

Find a simple expression that need not generate any extra variables, but for which sequentialize generates at least one unneeded variable.

For instance x + y is already in sequential form, but this translation will still add new bindings let #prim2_1_0 = x in let #prim2_2_0 = y in #prim2_1_0 + #prim2_2_0. There are at least two ways to remedy this:

• We could make the sequentialization code more complex by checking for this special case and not generating extra variables unless necessary

• We could keep the sequentialization code the same and rely on later optimizations to eliminate these extra bindings.

We will discuss the relevant optimizations later in the semester. For now it is optional whether you want to make your sequentialization code eliminate these unnecessary bindings. If you do, I encourage you to find an elegant solution that uses a helper function rather than manually inspecting the sub-expressions to check if they are immediates in each case.
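One possible shape for such a helper, sketched with stand-in types (the name `imm_of`, the `#tmp_` naming scheme, and the accumulated-bindings design are my own, not the assignment's): return the immediate for a sub-expression, pushing a new binding only when the expression is not already immediate.

```rust
#[derive(Debug, Clone, PartialEq)]
enum ImmExp { Num(i64), Var(String) }

#[derive(Debug, PartialEq)]
enum SeqExp {
    Imm(ImmExp),
    Add(ImmExp, ImmExp), // stand-in for Prim2(Add, ...)
    Let { var: String, bound_exp: Box<SeqExp>, body: Box<SeqExp> },
}

// If `se` is already immediate, reuse it directly; otherwise bind it to a
// fresh name (unique thanks to `tag`) and return a variable reference.
fn imm_of(se: SeqExp, tag: u32, bindings: &mut Vec<(String, SeqExp)>) -> ImmExp {
    match se {
        SeqExp::Imm(i) => i,
        other => {
            let name = format!("#tmp_{}", tag);
            bindings.push((name.clone(), other));
            ImmExp::Var(name)
        }
    }
}

fn main() {
    let mut bindings = Vec::new();
    // x is already immediate: no binding generated.
    let x = imm_of(SeqExp::Imm(ImmExp::Var("x".into())), 0, &mut bindings);
    assert_eq!(bindings.len(), 0);
    // 1 + 2 is not immediate: one binding generated.
    let t = imm_of(SeqExp::Add(ImmExp::Num(1), ImmExp::Num(2)), 1, &mut bindings);
    assert_eq!(bindings.len(), 1);
    println!("{:?} {:?} {:?}", x, t, bindings);
}
```

With a helper like this, x + y sequentializes to itself with an empty binding list, while compound operands still get lifted into lets.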

Now, we can finally look at our current compiler pipeline:

fn compile(e: Exp<Span>) -> String {
    /* make sure all names are in scope, and then */
    let tagged = tag_exp(&e);
    let se = sequentialize(&tagged);
    let tagged_se = tag_seq_exp(&se);
    let compiled = compile_to_instrs(&tagged_se);
    /* ... surround compiled with prelude as needed ... */
}

Quite a lot of changes, just for adding arithmetic and conditionals!

¹ We will see how to implement printing soon.

² This is the name I have chosen to use in this class. The most common name for this intermediate representation is monadic normal form. There are many names for quite similar intermediate representations: SSA (static single assignment) is the most common, used in the LLVM framework. Additionally, there are CPS (continuation-passing style) and ANF (A-normal form). See here for more on the comparison between this form and SSA.