Lecture 5: Binary Operations and Sequential Form

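Today we will extend the compiler to support binary arithmetic operations and not just increment and decrement. This is somewhat more difficult since both operands of a binary operation may themselves require computation, so intermediate results must be saved somewhere while the rest of the expression is evaluated.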

1 Growing the language: adding infix operators

Again, we follow our standard recipe:

Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended

1.1 The new concrete syntax

‹expr›: ... | ‹expr› + ‹expr› | ‹expr› - ‹expr› | ‹expr› * ‹expr› | ( ‹expr› )

1.2 Examples and semantics

These new expression forms should be familiar from standard arithmetic notation.

Note that while operator precedence determines the tree structure the expression is parsed into, it does not affect the evaluation order. For this language, we will decide that the order of evaluation should be leftmost-innermost: that is, in the expression (2 - 3) + 4 * 5, the evaluation order should step through

    (2 - 3) + 4 * 5
==> -1 + (4 * 5)
==> -1 + 20
==> 19

rather than the possible alternative of doing the multiplication first (a more literal reading of PEMDAS taught to American children).
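To make the chosen order concrete, here is a minimal sketch of an evaluator that records each operation as it is performed. The cut-down `Exp` type (numbers and binary operators only, no annotations), the `eval` function, and the `example` helper are all assumptions made for this illustration, not part of the course compiler.

```rust
// Hypothetical mini-AST: just enough to trace evaluation order.
#[derive(Debug, Clone, Copy)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

#[derive(Debug)]
enum Exp {
    Num(i64),
    Prim2(Prim2, Box<Exp>, Box<Exp>),
}

// Evaluate the left operand completely, then the right operand,
// then apply the operator. Every operation performed is appended
// to `trace`, so the trace shows the order of evaluation.
fn eval(e: &Exp, trace: &mut Vec<String>) -> i64 {
    match e {
        Exp::Num(n) => *n,
        Exp::Prim2(op, l, r) => {
            let lv = eval(l, trace);
            let rv = eval(r, trace);
            let (sym, v) = match op {
                Prim2::Add => ("+", lv + rv),
                Prim2::Sub => ("-", lv - rv),
                Prim2::Mul => ("*", lv * rv),
            };
            trace.push(format!("{} {} {} = {}", lv, sym, rv, v));
            v
        }
    }
}

// The tree for (2 - 3) + 4 * 5 under the usual precedence rules.
fn example() -> Exp {
    Exp::Prim2(
        Prim2::Add,
        Box::new(Exp::Prim2(
            Prim2::Sub,
            Box::new(Exp::Num(2)),
            Box::new(Exp::Num(3)),
        )),
        Box::new(Exp::Prim2(
            Prim2::Mul,
            Box::new(Exp::Num(4)),
            Box::new(Exp::Num(5)),
        )),
    )
}
```

Calling `eval(&example(), &mut trace)` on an empty `trace` records the subtraction first, then the multiplication, then the addition, matching the steps listed above.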

1.3 Enhancing the abstract syntax

enum Prim2 {
    Add,
    Sub,
    Mul,
}

enum Exp<Ann> {
  ...
  Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

We simply add a new constructor describing our primitive binary operations, and an enumeration of what those operations might be. The parser will do the hard work of figuring out the correct tree structure for un-parenthesized expressions like "1 - 2 + x * y".
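As an illustration, the sketch below builds by hand the tree we would expect the parser to produce for "1 - 2 + x * y", assuming the usual precedence and left associativity. The `Num` and `Var` constructors stand in for the elided part of `Exp` and are assumptions, as are the helper functions `num`, `var`, `prim2`, `example_tree`, and `show`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

#[derive(Debug, PartialEq)]
enum Exp<Ann> {
    Num(i64, Ann),    // assumed constructor, for illustration only
    Var(String, Ann), // assumed constructor, for illustration only
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

// Hypothetical helpers that hide the boxing and the unit annotation.
fn num(n: i64) -> Box<Exp<()>> {
    Box::new(Exp::Num(n, ()))
}

fn var(x: &str) -> Box<Exp<()>> {
    Box::new(Exp::Var(x.to_string(), ()))
}

fn prim2(op: Prim2, l: Box<Exp<()>>, r: Box<Exp<()>>) -> Box<Exp<()>> {
    Box::new(Exp::Prim2(op, l, r, ()))
}

// 1 - 2 + x * y groups as (1 - 2) + (x * y).
fn example_tree() -> Exp<()> {
    *prim2(
        Prim2::Add,
        prim2(Prim2::Sub, num(1), num(2)),
        prim2(Prim2::Mul, var("x"), var("y")),
    )
}

// Render a tree fully parenthesized, making its structure explicit.
fn show(e: &Exp<()>) -> String {
    match e {
        Exp::Num(n, _) => n.to_string(),
        Exp::Var(x, _) => x.clone(),
        Exp::Prim2(op, l, r, _) => {
            let sym = match op {
                Prim2::Add => "+",
                Prim2::Sub => "-",
                Prim2::Mul => "*",
            };
            format!("({} {} {})", show(l), sym, show(r))
        }
    }
}
```

Printing `show(&example_tree())` makes the grouping visible even though the source text had no parentheses.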

1.4 Enhancing the transformations: Normalization

Exercise
What goes wrong with our current naive transformations? How can we fix them?

Let’s try manually “compiling” some simple binary-operator expressions to assembly:

Original expression

Compiled assembly

(2 + 3) + 4

mov RAX, 2
add RAX, 3
add RAX, 4

(4 - 3) - 2

mov RAX, 4
sub RAX, 3
sub RAX, 2

((4 - 3) - 2) * 5

mov RAX, 4
sub RAX, 3
sub RAX, 2
imul RAX, 5

(2 - 3) + (4 * 5)

mov RAX, 2
sub RAX, 3
?????

Do Now!
Convince yourself that using a let-bound variable in place of any of these constants will work just as well.

So far, our compiler has only ever had to deal with a single active expression at a time: it moves the result into RAX, increments or decrements it, and then potentially moves it somewhere onto the stack, for retrieval and later use. But with our new compound expression forms, that won’t suffice: the execution of (2 - 3) + (4 * 5) above clearly must stash the result of (2 - 3) somewhere, to make room in RAX for the subsequent multiplication. We might try to use another register (RBX, maybe?), but clearly this approach won’t scale up, since there are only a handful of registers available. What to do?

1.4.1 Immediate expressions

Do Now!
Why did the first few expressions compile successfully?

Notice that for the first few expressions, all the arguments to the operators were immediately ready:

They required no further computation to be ready.
They were either constants, or variables that could be read off the stack.

Perhaps we can salvage the final program by transforming it somehow, such that all its operations are on immediate values, too.

Do Now!
Try to do this: Find a program that computes the same answer, in the same order of operations, but where every operator is applied only to immediate values.

Note that conceptually, our last program is equivalent to the following:

let first = 2 - 3 in
let second = 4 * 5 in
first + second

This program has decomposed the compound addition expression into the sum of two let-bound variables, each of which is a single operation on immediate values. We can easily compile each individual operation, and we already know how to save results to the stack and restore them for later use, which means we can compile this transformed program to assembly successfully.

Come to think of it, compiling operations when they are applied to immediate values is so easy, wouldn’t it be nice if we did the same thing for unary primitives and if? Then the compilation case for those constructs would only involve the actual operation, rather than the extra part about running a subexpression and putting its value in RAX. Then the only expression form that would deal with sequentially executing two expressions would be the let form. This has the added benefit that if we were ever to change how we run two programs sequentially, we would only have to change the let case.

1.5 Testing

Do Now!
Once you’ve completed the section below, run the given source programs through our compiler pipeline. It should give us exactly the handwritten assembly we intend. If not, debug the compiler until it does.

2 Sequential Form

Our goal is to transform our program such that every operator is applied only to immediate values (constants/variables), and every expression (besides let) does exactly one thing with no other internal computation necessary. We will call such a form Sequential Form [1].

There are at least two ways to implement this. Firstly, we could write a function sequentialize(&Exp) -> Exp that puts our expressions into a sequential form. This type makes sense because the sequential expressions form a subset of all expressions. However, this type signature is imprecise in that the output doesn’t reflect the fact that the output is sequential. This means when we write the next function compile_to_instrs(&Exp) -> Vec<Instr> we will still have to cover all expressions in our code, likely by using panic! when the input is not sequential. Instead we can eliminate this mismatch by developing a new type SeqExp that allows for expressing only those programs in sequential form. We also need to make a type ImmExp for describing the subset of immediate expressions.

enum ImmExp {
    Num(i64),
    Var(String),
}

enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Prim1(Prim1, ImmExp, Ann),
    Prim2(Prim2, ImmExp, ImmExp, Ann),
    Let { var:       String,
          bound_exp: Box<SeqExp<Ann>>,
          body:      Box<SeqExp<Ann>>,
          ann:       Ann
    },
    If { cond: ImmExp,
         thn: Box<SeqExp<Ann>>,
         els: Box<SeqExp<Ann>>,
         ann: Ann
    },
}

Do Now!
Why did we choose to make cond an immediate, but not thn and els?

So Prim1, Prim2 require that their arguments are immediates, while in the Let form we require only that the two sub-expressions are in sequential form themselves. For the If case the branches are allowed to be arbitrary sequential expressions, since we don’t want to evaluate them unless they are selected by the condition. The condition, on the other hand, is an immediate since it will always be evaluated.

While we already knew how to compile Prim1 and If with full sub-expressions, requiring the sub-expressions to be immediates simplifies the code-generation pass since all "sequencing" code goes into the Let case. Now when we add more constructs to the language, we can relegate all sequencing code to the Let case and not re-implement it for the new constructs.

Also note that while Exp allowed for multiple bindings, here we allow for only one binding at a time. This also simplifies the code generation since we only have to handle one let at a time, and once we have taken care of scope-checking, they should have equivalent semantics.
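To see why this form is easy to compile, here is a rough sketch of a code generator for a cut-down SeqExp (no annotations, no Prim1, no If). Everything here is an assumption made for illustration: the function names `compile_seq` and `imm_operand`, the environment representation, and the convention that the n-th stack slot lives at `[RSP - 8*n]`. It also ignores the fact that x86 arithmetic instructions only accept 32-bit immediates.

```rust
#[derive(Debug, Clone, Copy)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

#[derive(Debug, Clone)]
enum ImmExp {
    Num(i64),
    Var(String),
}

// Simplified: annotations, Prim1 and If are omitted.
#[derive(Debug)]
enum SeqExp {
    Imm(ImmExp),
    Prim2(Prim2, ImmExp, ImmExp),
    Let { var: String, bound_exp: Box<SeqExp>, body: Box<SeqExp> },
}

// Render an immediate as an assembly operand. `env` maps variables to
// stack slots; the innermost binding is the last entry.
fn imm_operand(i: &ImmExp, env: &[(String, usize)]) -> String {
    match i {
        ImmExp::Num(n) => n.to_string(),
        ImmExp::Var(x) => {
            let slot = env
                .iter()
                .rev()
                .find(|(y, _)| y == x)
                .map(|(_, s)| *s)
                .expect("unbound variable; scope checking should reject this");
            format!("[RSP - {}]", 8 * slot)
        }
    }
}

// Only the Let case sequences two computations; the other cases are a
// fixed handful of instructions on operands that are already available.
fn compile_seq(e: &SeqExp, env: &mut Vec<(String, usize)>, out: &mut Vec<String>) {
    match e {
        SeqExp::Imm(i) => out.push(format!("mov RAX, {}", imm_operand(i, env))),
        SeqExp::Prim2(op, l, r) => {
            let instr = match op {
                Prim2::Add => "add",
                Prim2::Sub => "sub",
                Prim2::Mul => "imul",
            };
            out.push(format!("mov RAX, {}", imm_operand(l, env)));
            out.push(format!("{} RAX, {}", instr, imm_operand(r, env)));
        }
        SeqExp::Let { var, bound_exp, body } => {
            compile_seq(bound_exp, env, out);
            let slot = env.len() + 1; // next free slot
            out.push(format!("mov [RSP - {}], RAX", 8 * slot));
            env.push((var.clone(), slot));
            compile_seq(body, env, out);
            env.pop(); // the binding goes out of scope
        }
    }
}

// let first = 2 - 3 in let second = 4 * 5 in first + second
fn example_program() -> SeqExp {
    SeqExp::Let {
        var: "first".to_string(),
        bound_exp: Box::new(SeqExp::Prim2(Prim2::Sub, ImmExp::Num(2), ImmExp::Num(3))),
        body: Box::new(SeqExp::Let {
            var: "second".to_string(),
            bound_exp: Box::new(SeqExp::Prim2(Prim2::Mul, ImmExp::Num(4), ImmExp::Num(5))),
            body: Box::new(SeqExp::Prim2(
                Prim2::Add,
                ImmExp::Var("first".to_string()),
                ImmExp::Var("second".to_string()),
            )),
        }),
    }
}
```

Running `compile_seq` on `example_program()` with an empty environment yields eight instructions: each operation is a `mov` plus one arithmetic instruction, and each let adds a single store to the stack.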

2.1 Sequentializing our Programs

Exercise
Try to systematically define a conversion function sequentialize(&Exp<u32>) -> SeqExp<()> such that the resulting expression has the same semantics.

Exercise
Why should the type of the function be (&Exp<u32>) -> SeqExp<()>? In particular, why do we discard the input tags?

The central idea is that to convert some expression e1 + e2 (or any other operator), we add new let-bindings for every sub-expression. So e1 + e2 becomes let x1 = se1 in let x2 = se2 in x1 + x2 where se1 is the result of putting e1 into sequential form, and similarly for se2. The trickiest part of implementing this is making sure that the variable names we use x1, x2 are different from all the names used by the source code, as well as different from other variables we generate. To make sure they are different from each other, we can use the unique tag we have annotated on the term in a previous pass. To ensure they are different from names from the source code, we can give them names that are not valid syntactically. For instance, our parser only accepts variable names that start with an ASCII alphabetic character, so if we start our generated variable names with a non-alphabetic character we won’t clash with source variable names.

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
    ...
        Exp::Prim2(op, e1, e2, tag) => {
            let s_e1 = sequentialize(e1);
            let s_e2 = sequentialize(e2);
            let name1 = format!("#prim2_1_{}", tag);
            let name2 = format!("#prim2_2_{}", tag);
            SeqExp::Let {
                var: name1.clone(), bound_exp: Box::new(s_e1), ann: (),
                body:
                Box::new(SeqExp::Let {
                    var: name2.clone(), bound_exp: Box::new(s_e2), ann: (),
                    body: Box::new(SeqExp::Prim2(*op, ImmExp::Var(name1), ImmExp::Var(name2), ())),
                })
            }
        },
   ...
   }
}

Note that we discard the tags and replace them with empty annotations () in the output program. This makes sense because a Prim2 gets translated to multiple expression forms (two Lets and a Prim2), so we cannot simply preserve the tag without violating our invariant that all sub-expressions have a unique tag.
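For experimentation, the Prim2 case can be embedded in a small self-contained program. This is only a sketch under several assumptions: the `Num` and `Var` constructors, the pre-order numbering used by `tag_exp` (with the hypothetical helper `tag_from`), and the pretty-printer `show` are all invented here for illustration.

```rust
#[derive(Debug, Clone, Copy)]
enum Prim2 { Add, Sub, Mul }

#[derive(Debug)]
enum Exp<Ann> {
    Num(i64, Ann),    // assumed constructor
    Var(String, Ann), // assumed constructor
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

#[derive(Debug)]
enum ImmExp { Num(i64), Var(String) }

#[derive(Debug)]
enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Prim2(Prim2, ImmExp, ImmExp, Ann),
    Let { var: String, bound_exp: Box<SeqExp<Ann>>, body: Box<SeqExp<Ann>>, ann: Ann },
}

// Assumed tagging pass: number the nodes in pre-order, root first.
fn tag_from(e: &Exp<()>, next: &mut u32) -> Exp<u32> {
    let tag = *next;
    *next += 1;
    match e {
        Exp::Num(n, _) => Exp::Num(*n, tag),
        Exp::Var(x, _) => Exp::Var(x.clone(), tag),
        Exp::Prim2(op, l, r, _) => {
            let tl = tag_from(l, next);
            let tr = tag_from(r, next);
            Exp::Prim2(*op, Box::new(tl), Box::new(tr), tag)
        }
    }
}

fn tag_exp(e: &Exp<()>) -> Exp<u32> {
    tag_from(e, &mut 0)
}

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        Exp::Num(n, _) => SeqExp::Imm(ImmExp::Num(*n), ()),
        Exp::Var(x, _) => SeqExp::Imm(ImmExp::Var(x.clone()), ()),
        Exp::Prim2(op, e1, e2, tag) => {
            // '#' cannot start a source variable, and the tag is unique.
            let name1 = format!("#prim2_1_{}", tag);
            let name2 = format!("#prim2_2_{}", tag);
            SeqExp::Let {
                var: name1.clone(),
                bound_exp: Box::new(sequentialize(e1)),
                ann: (),
                body: Box::new(SeqExp::Let {
                    var: name2.clone(),
                    bound_exp: Box::new(sequentialize(e2)),
                    ann: (),
                    body: Box::new(SeqExp::Prim2(*op, ImmExp::Var(name1), ImmExp::Var(name2), ())),
                }),
            }
        }
    }
}

fn show_imm(i: &ImmExp) -> String {
    match i {
        ImmExp::Num(n) => n.to_string(),
        ImmExp::Var(x) => x.clone(),
    }
}

// Print in the let/in notation used in these notes; a bound expression
// that is itself a let is parenthesized.
fn show(e: &SeqExp<()>) -> String {
    match e {
        SeqExp::Imm(i, _) => show_imm(i),
        SeqExp::Prim2(op, l, r, _) => {
            let sym = match op { Prim2::Add => "+", Prim2::Sub => "-", Prim2::Mul => "*" };
            format!("{} {} {}", show_imm(l), sym, show_imm(r))
        }
        SeqExp::Let { var, bound_exp, body, .. } => {
            let b = if matches!(**bound_exp, SeqExp::Let { .. }) {
                format!("({})", show(bound_exp))
            } else {
                show(bound_exp)
            };
            format!("let {} = {} in {}", var, b, show(body))
        }
    }
}
```

Building the tree for x + y, then applying `tag_exp`, `sequentialize`, and `show` in turn, prints the two generated bindings followed by the addition of the two temporaries.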

The other cases are similar, with the Let case handling the mismatch between binding sequences in Exp and the "one-at-a-time" Let in SeqExp. The main thing to be careful of is to not get too greedy in sequentializing. When we sequentialize an If

if e1: e2 else: e3

we should make sure to sequentialize the branches in place and lift only the condition

let x1 = se1 in if x1: se2 else: se3

rather than lifting all of them

let x1 = se1 in
let x2 = se2 in
let x3 = se3 in
if x1: x2 else: x3

which would always run both branches.
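A sketch of the non-greedy If case might look as follows. The shape of the `Exp::If` constructor, the `#if_` naming scheme, and the `show` printer are assumptions for illustration; only the structure of the output (condition lifted into a Let, branches sequentialized in place) comes from the discussion above.

```rust
#[derive(Debug)]
enum Exp<Ann> {
    Num(i64, Ann),    // assumed constructor
    Var(String, Ann), // assumed constructor
    // Assumed shape: condition, then-branch, else-branch, annotation.
    If(Box<Exp<Ann>>, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

#[derive(Debug)]
enum ImmExp {
    Num(i64),
    Var(String),
}

#[derive(Debug)]
enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Let { var: String, bound_exp: Box<SeqExp<Ann>>, body: Box<SeqExp<Ann>>, ann: Ann },
    If { cond: ImmExp, thn: Box<SeqExp<Ann>>, els: Box<SeqExp<Ann>>, ann: Ann },
}

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        Exp::Num(n, _) => SeqExp::Imm(ImmExp::Num(*n), ()),
        Exp::Var(x, _) => SeqExp::Imm(ImmExp::Var(x.clone()), ()),
        Exp::If(cond, thn, els, tag) => {
            // Only the condition gets a temporary: it is always evaluated.
            let name = format!("#if_{}", tag);
            SeqExp::Let {
                var: name.clone(),
                bound_exp: Box::new(sequentialize(cond)),
                ann: (),
                body: Box::new(SeqExp::If {
                    cond: ImmExp::Var(name),
                    // The branches stay inside the If, so at most one runs.
                    thn: Box::new(sequentialize(thn)),
                    els: Box::new(sequentialize(els)),
                    ann: (),
                }),
            }
        }
    }
}

fn show_imm(i: &ImmExp) -> String {
    match i {
        ImmExp::Num(n) => n.to_string(),
        ImmExp::Var(x) => x.clone(),
    }
}

// Print in the concrete syntax used in these notes.
fn show(e: &SeqExp<()>) -> String {
    match e {
        SeqExp::Imm(i, _) => show_imm(i),
        SeqExp::Let { var, bound_exp, body, .. } => {
            format!("let {} = {} in {}", var, show(bound_exp), show(body))
        }
        SeqExp::If { cond, thn, els, .. } => {
            format!("if {}: {} else: {}", show_imm(cond), show(thn), show(els))
        }
    }
}
```

Because the branches are never bound to variables, no code for them is emitted ahead of the conditional jump.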

2.2 Improving the translation

This sequentialization pass is somewhat sloppy: it will generate many unnecessary temporary variables.

Do Now!
Find a simple expression that need not generate any extra variables, but for which sequentialize generates at least one unneeded variable.

For instance x + y is already in sequential form, but this translation will still add new bindings let #prim2_1_0 = x in let #prim2_2_0 = y in #prim2_1_0 + #prim2_2_0. There are at least two ways to remedy this:

We could make the sequentialization code more complex by checking for this special case and not generating extra variables unless necessary
We could keep the sequentialization code the same and rely on later optimizations to eliminate these extra bindings.

We will discuss the relevant optimizations later in the semester. For now it is optional whether you want to make your sequentialization code eliminate these unnecessary bindings. If you do, I encourage you to find an elegant solution that uses a helper function rather than manually inspecting the sub-expressions to check if they are immediates in each case.

2.3 An alternate approach: Just use the stack!

One could make the argument that converting to sequential form is a complicated waste of effort. We could simply walk the tree of Prim2 expressions, evaluate their left arguments, and push them onto the stack. After all, we have the next-available stack index as a parameter to our compiler, since we use it to compile let-bindings. Then we evaluate the right argument, and push it onto the stack. We can then retrieve both arguments from the stack (since we know where they were placed) and operate on them as normal: effectively, we’ve made them into immediate arguments without going through the motions of creating all those let-bindings. Then we can implicitly pop the two values off the stack, basically by forgetting they even exist, just as we do with let-bound variables that go out of scope. Surely this is simpler!

On the face of it, it is indeed simpler. But as we’ll see later, this will cause some additional headaches, because it entails that our stack frames are of dynamic size, growing and shrinking depending on the complexity of the expression being evaluated. This isn’t inherently a bad thing — in fact, it helps ensure that our stack is “compact”, without holes for values we haven’t defined or used yet — but it will require remembering that our stack frame size can change, independently of the let-bound variables in scope, which will make subsequent phases of the compiler more tightly coupled to this one.

Additionally, though it isn’t apparent so far, having code in sequential form actually enables some subsequent compiler passes, like optimizations, that would be incredibly difficult to pull off otherwise. The advantages of keeping the compiler phases less tightly coupled, along with the later benefits of having code in a normalized form, tend to make Sequential Form the winning engineering tradeoff.
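For comparison, the stack-based alternative described at the start of this section could be sketched roughly as follows. The function name `compile_direct`, the cut-down `Exp` type, and the `[RSP - 8*si]` slot convention are assumptions for illustration only.

```rust
#[derive(Debug, Clone, Copy)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

// Cut-down source language: numbers and binary operators only.
#[derive(Debug)]
enum Exp {
    Num(i64),
    Prim2(Prim2, Box<Exp>, Box<Exp>),
}

// `si` is the next available stack slot. The left result is parked at
// slot `si`, the right result at slot `si + 1`, and then both are read
// back as if they were immediates.
fn compile_direct(e: &Exp, si: usize, out: &mut Vec<String>) {
    match e {
        Exp::Num(n) => out.push(format!("mov RAX, {}", n)),
        Exp::Prim2(op, l, r) => {
            compile_direct(l, si, out);
            out.push(format!("mov [RSP - {}], RAX", 8 * si));
            // The right operand must not clobber slot `si`.
            compile_direct(r, si + 1, out);
            out.push(format!("mov [RSP - {}], RAX", 8 * (si + 1)));
            out.push(format!("mov RAX, [RSP - {}]", 8 * si));
            let instr = match op {
                Prim2::Add => "add",
                Prim2::Sub => "sub",
                Prim2::Mul => "imul",
            };
            out.push(format!("{} RAX, [RSP - {}]", instr, 8 * (si + 1)));
            // Slots `si` and `si + 1` are now implicitly free again.
        }
    }
}
```

Note how the number of stack slots in use depends on the depth of the expression rather than on the let-bound variables in scope, which is exactly the coupling discussed above.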

Now, we can finally look at our current compiler pipeline:

fn compile(e: Exp<Span>) -> String {
  /* make sure all names are in scope, and then */
  let tagged = tag_exp(&e);
  let se = sequentialize(&tagged);
  let tagged_se = tag_seq_exp(&se);
  let compiled = compile_to_instrs(&tagged_se);
  /* ... surround compiled with prelude as needed ... */
}

Quite a lot of changes, just for adding arithmetic and conditionals!

[1] This is the name I have chosen to use in this class. The most common name for this intermediate representation is monadic normal form. There are many names for quite similar intermediate representations: SSA (static single assignment) is the most common, used in the LLVM framework. Additionally, there are CPS (continuation-passing style) and ANF (A-normal form). See here for more on the comparison between this form and SSA.
