Lecture 5: Binary Operations and Sequential Form
Today we will extend the compiler to support binary arithmetic operations and not just increment and decrement. This is surprisingly difficult as it introduces the ambiguity of evaluation order into the language, and so we will need to add a new pass to the compiler that makes the evaluation order obvious in the structure of the term.
1 Growing the language: adding infix operators
Again, we follow our standard recipe:
Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended
1.1 The new concrete syntax
‹expr›: ...
      | ‹expr› + ‹expr›
      | ‹expr› - ‹expr›
      | ‹expr› * ‹expr›
      | ( ‹expr› )
1.2 Examples and semantics
These new expression forms should be familiar from standard arithmetic notation.
The parser will take care of operator precedence. I.e., the term
(2 - 3) + 4 * 5
will parse the same way as
(2 - 3) + (4 * 5)
according to the PEMDAS rules.
However, consider what would happen if we added new forms print0(expr) and print1(expr) to our language, which print 0 or 1, respectively, before evaluating expr. (We will see how to implement printing soon.) Then how should the following program evaluate?
print0(6) * print1(7)
First, no matter what, the expression should evaluate to 6 * 7 = 42, so the question is: what should it print?
There are several reasonable choices:
Prints "01": this is left-to-right evaluation order
Prints "10": this is right-to-left evaluation order
May print either "01" or "10": this means the evaluation order is unspecified, or implementation dependent
Which do you prefer? Either of the first two seems very reasonable, with left-to-right arguably the more natural since it matches the way we read English. The third option is something probably only a compiler writer would choose: leaving the order unspecified makes the program easier to optimize, because the compiler may re-order the operands arbitrarily.
We’ll go with the first choice: left-to-right evaluation order.
Note that evaluating left-to-right like this is not quite the same as following the PEMDAS rules. For instance, the following arithmetic expression evaluates as:
(2 - 3) + 4 * 5
==> -1 + (4 * 5)
==> -1 + 20
==> 19
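rather than the possible alternative of doing the multiplication first. Using PEMDAS to determine the evaluation order would be very confusing: in
(print0(2) - 3) + print1(4) * 5
the multiplication would run first, so the program would print "10" even though print0 appears to the left of print1.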
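To make the left-to-right choice concrete, here is a small sketch of an interpreter that records what gets printed. The Expr type, its Print constructor, and the eval and run_example functions are made up for this illustration and are not part of our compiler; output is collected in a String rather than written to the screen.

```rust
// Hypothetical mini-language for illustration only: numbers, print0/print1,
// and binary operators. None of these names come from the course compiler.
#[derive(Debug, Clone, Copy)]
enum Op {
    Add,
    Sub,
    Mul,
}

#[derive(Debug)]
enum Expr {
    Num(i64),
    // Print(d, e): print the digit d, then evaluate e.
    Print(u8, Box<Expr>),
    Bin(Op, Box<Expr>, Box<Expr>),
}

// Evaluate left to right, appending everything "printed" to `out`.
fn eval(e: &Expr, out: &mut String) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Print(d, inner) => {
            out.push_str(&d.to_string());
            eval(inner, out)
        }
        Expr::Bin(op, l, r) => {
            // The left operand is fully evaluated before the right one,
            // regardless of which operator this is.
            let lv = eval(l, out);
            let rv = eval(r, out);
            match op {
                Op::Add => lv + rv,
                Op::Sub => lv - rv,
                Op::Mul => lv * rv,
            }
        }
    }
}

// Build print0(6) * print1(7) and run it, returning (value, printed output).
fn run_example() -> (i64, String) {
    let prog = Expr::Bin(
        Op::Mul,
        Box::new(Expr::Print(0, Box::new(Expr::Num(6)))),
        Box::new(Expr::Print(1, Box::new(Expr::Num(7)))),
    );
    let mut out = String::new();
    let value = eval(&prog, &mut out);
    (value, out)
}
```

With this sketch, run_example() yields the value 42 and the output "01". Swapping the two recursive calls in the Bin case would give right-to-left order instead.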
1.3 Enhancing the abstract syntax
enum Prim2 {
    Add,
    Sub,
    Mul,
}

enum Exp<Ann> {
    ...
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}
We simply add a new constructor describing our primitive binary operations, and an enumeration of what those operations might be. The parser will do the hard work of figuring out the correct tree structure for un-parenthesized expressions like "1 - 2 + x * y".
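As a sketch of the tree we expect back from the parser for "1 - 2 + x * y", the following builds it by hand. The Num and Var constructors, the show helper, and example_tree are assumptions made for this illustration; only the Prim2 pieces mirror the definitions above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

// A trimmed-down Exp. The Num and Var constructors are assumed here.
#[derive(Debug, Clone, PartialEq)]
enum Exp<Ann> {
    Num(i64, Ann),
    Var(String, Ann),
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

// Hypothetical helper: render an expression with every binary
// operation explicitly parenthesized, so the tree shape is visible.
fn show<Ann>(e: &Exp<Ann>) -> String {
    match e {
        Exp::Num(n, _) => n.to_string(),
        Exp::Var(x, _) => x.clone(),
        Exp::Prim2(op, l, r, _) => {
            let sym = match op {
                Prim2::Add => "+",
                Prim2::Sub => "-",
                Prim2::Mul => "*",
            };
            format!("({} {} {})", show(l), sym, show(r))
        }
    }
}

// The tree a precedence-aware, left-associative parser should produce
// for "1 - 2 + x * y": the addition is at the root.
fn example_tree() -> Exp<()> {
    let sub = Exp::Prim2(
        Prim2::Sub,
        Box::new(Exp::Num(1, ())),
        Box::new(Exp::Num(2, ())),
        (),
    );
    let mul = Exp::Prim2(
        Prim2::Mul,
        Box::new(Exp::Var("x".to_string(), ())),
        Box::new(Exp::Var("y".to_string(), ())),
        (),
    );
    Exp::Prim2(Prim2::Add, Box::new(sub), Box::new(mul), ())
}
```

Calling show(&example_tree()) gives "((1 - 2) + (x * y))", which makes the grouping chosen by the parser explicit.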
1.4 Enhancing the transformations: a new intermediate representation (IR)
Exercise
What goes wrong with our current naive transformations? How can we fix them?
Let’s try manually “compiling” some simple binary-operator expressions to assembly:
[Table: Original expression | Compiled assembly]
Do Now!
Convince yourself that using a let-bound variable in place of any of these constants will work just as well.
So far, our compiler has only ever had to deal with a single active expression at a time: it moves the result into RAX, increments or decrements it, and then potentially moves it somewhere onto the stack, for retrieval and later use. But with our new compound expression forms, that won’t suffice: the execution of (2 - 3) + (4 * 5) above clearly must stash the result of (2 - 3) somewhere, to make room in RAX for the subsequent multiplication. We might try to use another register (RBX, maybe?), but clearly this approach won’t scale up, since there are only a handful of registers available. What to do?
1.4.1 Immediate expressions
Do Now!
Why did the first few expressions compile successfully?
Notice that for the first few expressions, all the arguments to the operators were immediately ready:
They required no further computation to be ready.
They were either constants, or variables that could be read off the stack.
Perhaps we can salvage the final program by transforming it somehow, such that all its operations are on immediate values, too.
Do Now!
Try to do this: Find a program that computes the same answer, in the same order of operations, but where every operator is applied only to immediate values.
Note that conceptually, our last program is equivalent to the following:
let first = 2 - 3 in
let second = 4 * 5 in
first + second
This program has decomposed the compound addition expression into the sum of two let-bound variables, each of which is a single operation on immediate values. We can easily compile each individual operation, and we already know how to save results to the stack and restore them for later use, which means we can compile this transformed program to assembly successfully.
Come to think of it, compiling operations when they are applied to immediate values is so easy, wouldn’t it be nice if we did the same thing for unary primitives and if? This way every intermediate result gets a name, which will then be assigned a place on the stack (or later on when we get to register allocation, a register) instead of every intermediate result necessarily going through rax.
1.5 Testing
Do Now!
Once you’ve completed the section below, run the given source programs through our compiler pipeline. It should give us exactly the handwritten assembly we intend. If not, debug the compiler until it does.
2 Sequential Form
Our goal is to transform our program such that every operator is applied only to immediate values (constants/variables), and every expression (besides let) does exactly one thing with no other internal computation necessary. We will call such a form Sequential Form. (This is the name I have chosen to use in this class. The most common name for this intermediate representation is monadic normal form. There are many names for quite similar intermediate representations: SSA (static-single assignment) is the most common, used in the LLVM framework. Additionally, there are CPS (continuation-passing style) and ANF (A-normal form). See here for more on the comparison between this form and SSA.)
There are at least two ways to implement this. Firstly, we could write a function sequentialize(&Exp) -> Exp that puts our expressions into a sequential form. This type makes sense because the sequential expressions form a subset of all expressions. However, this type signature is imprecise in that the return type doesn’t reflect the fact that the output is sequential. This means when we write the next function compile_to_instrs(&Exp) -> Vec<Instr> we will still have to cover all expressions in our code, likely by using panic! when the input is not sequential. Instead, we can eliminate this mismatch by developing a new type SeqExp that allows for expressing only those programs in sequential form. We also need to make a type ImmExp for describing the subset of immediate expressions.
enum ImmExp {
    Num(i64),
    Var(String),
}

enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Prim1(Prim1, ImmExp, Ann),
    Prim2(Prim2, ImmExp, ImmExp, Ann),
    Let {
        var: String,
        bound_exp: Box<SeqExp<Ann>>,
        body: Box<SeqExp<Ann>>,
        ann: Ann,
    },
    If {
        cond: ImmExp,
        thn: Box<SeqExp<Ann>>,
        els: Box<SeqExp<Ann>>,
        ann: Ann,
    },
}
Do Now!
Why did we choose to make cond an immediate, but not thn and els?
So Prim1 and Prim2 require that their arguments be immediates, while in the Let form we require only that the two sub-expressions are in sequential form themselves. For the If case the branches are allowed to be arbitrary sequential expressions, since we don’t want to evaluate them unless they are selected by the condition. The condition, on the other hand, is an immediate since it will always be evaluated.
While we already knew how to compile Prim1 and If with full sub-expressions, requiring the sub-expressions to be immediates simplifies the code-generation pass since all "sequencing" code goes into the Let case. Now when we add more constructs to the language, we can relegate all sequencing code to the Let case and not re-implement it for the new constructs.
Also note that while Exp allowed for multiple bindings, here we allow for only one binding at a time. This also simplifies the code generation since we only have to handle one let at a time, and once we have taken care of scope-checking, they should have equivalent semantics.
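To see why concentrating all sequencing in the Let case pays off, here is a rough sketch of a code generator over a cut-down SeqExp. It drops annotations, Prim1, and If for brevity, emits instructions as plain strings, and stores each let-bound variable at [rsp - 8 * slot]. The slot numbering, the helper names (imm_operand, compile, compile_program, example_program), and the string output are all assumptions of this sketch, not the compiler we build in class.

```rust
#[derive(Debug, Clone)]
enum ImmExp {
    Num(i64),
    Var(String),
}

#[derive(Debug, Clone, Copy)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

// Cut-down SeqExp: no annotations, no Prim1, no If.
#[derive(Debug, Clone)]
enum SeqExp {
    Imm(ImmExp),
    Prim2(Prim2, ImmExp, ImmExp),
    Let { var: String, bound_exp: Box<SeqExp>, body: Box<SeqExp> },
}

// Render an immediate as an operand: a literal or a stack slot.
fn imm_operand(i: &ImmExp, env: &[(String, usize)]) -> String {
    match i {
        ImmExp::Num(n) => n.to_string(),
        ImmExp::Var(x) => {
            // Search from the back so inner bindings shadow outer ones.
            let slot = env
                .iter()
                .rev()
                .find(|(y, _)| y == x)
                .map(|(_, s)| *s)
                .expect("unbound variable; scope checking should reject this");
            format!("[rsp - {}]", 8 * slot)
        }
    }
}

// Leave the value of `e` in rax. Only the Let case touches the stack.
fn compile(e: &SeqExp, env: &mut Vec<(String, usize)>, out: &mut Vec<String>) {
    match e {
        SeqExp::Imm(i) => out.push(format!("mov rax, {}", imm_operand(i, env))),
        SeqExp::Prim2(op, a, b) => {
            let mnemonic = match op {
                Prim2::Add => "add",
                Prim2::Sub => "sub",
                Prim2::Mul => "imul",
            };
            out.push(format!("mov rax, {}", imm_operand(a, env)));
            out.push(format!("{} rax, {}", mnemonic, imm_operand(b, env)));
        }
        SeqExp::Let { var, bound_exp, body } => {
            compile(bound_exp, env, out);
            // Assumed convention: the next free slot is one past the
            // number of variables currently in scope.
            let slot = env.len() + 1;
            out.push(format!("mov [rsp - {}], rax", 8 * slot));
            env.push((var.clone(), slot));
            compile(body, env, out);
            env.pop();
        }
    }
}

fn compile_program(e: &SeqExp) -> Vec<String> {
    let mut out = Vec::new();
    compile(e, &mut Vec::new(), &mut out);
    out
}

// let first = 2 - 3 in let second = 4 * 5 in first + second
fn example_program() -> SeqExp {
    let v = |x: &str| ImmExp::Var(x.to_string());
    SeqExp::Let {
        var: "first".to_string(),
        bound_exp: Box::new(SeqExp::Prim2(Prim2::Sub, ImmExp::Num(2), ImmExp::Num(3))),
        body: Box::new(SeqExp::Let {
            var: "second".to_string(),
            bound_exp: Box::new(SeqExp::Prim2(Prim2::Mul, ImmExp::Num(4), ImmExp::Num(5))),
            body: Box::new(SeqExp::Prim2(Prim2::Add, v("first"), v("second"))),
        }),
    }
}
```

Notice that the Prim2 case never needs to save anything: both operands are immediates, so two instructions suffice. Any new operator on immediates can be added the same way without touching the stack-handling code.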
2.1 Sequentializing our Programs
Exercise
Try to systematically define a conversion function sequentialize(&Exp<u32>) -> SeqExp<()> such that the resulting expression has the same semantics.
Exercise
Why should the type of the function be (&Exp<u32>) -> SeqExp<()>? In particular, why do we discard the input tags?
The central idea is that to convert some expression e1 + e2 (or any other operator), we add new let-bindings for every sub-expression. So e1 + e2 becomes let x1 = se1 in let x2 = se2 in x1 + x2, where se1 is the result of putting e1 into sequential form, and similarly for se2. The trickiest part of implementing this is making sure that the variable names we use (x1, x2) are different from all the names used by the source code, as well as different from other variables we generate. To make sure they are different from each other, we can use the unique tag we have annotated on the term in a previous pass. To ensure they are different from names from the source code, we can give them names that are not valid syntactically. For instance, our parser only accepts variable names that start with an ASCII alphabetic character, so if we start our generated variable names with a non-alphabetic character we won’t clash with source variable names.
fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        ...
        Exp::Prim2(op, e1, e2, tag) => {
            let s_e1 = sequentialize(e1);
            let s_e2 = sequentialize(e2);
            let name1 = format!("#prim2_1_{}", tag);
            let name2 = format!("#prim2_2_{}", tag);
            SeqExp::Let {
                var: name1.clone(),
                bound_exp: Box::new(s_e1),
                ann: (),
                body: Box::new(SeqExp::Let {
                    var: name2.clone(),
                    bound_exp: Box::new(s_e2),
                    ann: (),
                    body: Box::new(SeqExp::Prim2(*op, ImmExp::Var(name1), ImmExp::Var(name2), ())),
                }),
            }
        },
        ...
    }
}
Note that we discard the tags and replace them with empty annotations () in the output program. This makes sense because a Prim2 gets translated to multiple expression forms (two Lets and a Prim2), so we cannot simply preserve the tag without violating our invariant that all sub-expressions have a unique tag.
The other cases are similar, with the Let case handling the mismatch between binding sequences in Exp and the "one-at-a-time" Let in SeqExp. The main thing to be careful of is to not get too greedy in sequentializing. When we sequentialize an If
if e1: e2 else: e3
we should make sure to simply sequentialize the branches and lift the condition
let x1 = se1 in if x1: se2 else: se3
rather than lifting all of them
let x1 = se1 in
let x2 = se2 in
let x3 = se3 in
if x1: x2 else: x3
which would always run both branches.
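A minimal sketch of what the If case might look like, under some assumptions: the shape of Exp::If (boxed cond, thn, els plus a tag), the Num and Var constructors, and the "#if_" name prefix are all made up here to keep the example self-contained, and every other case is omitted.

```rust
#[derive(Debug, Clone, PartialEq)]
enum ImmExp {
    Num(i64),
    Var(String),
}

// Assumed shape of the source-level If; the real Exp has more cases.
#[derive(Debug, Clone, PartialEq)]
enum Exp<Ann> {
    Num(i64, Ann),
    Var(String, Ann),
    If { cond: Box<Exp<Ann>>, thn: Box<Exp<Ann>>, els: Box<Exp<Ann>>, ann: Ann },
}

#[derive(Debug, Clone, PartialEq)]
enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Let { var: String, bound_exp: Box<SeqExp<Ann>>, body: Box<SeqExp<Ann>>, ann: Ann },
    If { cond: ImmExp, thn: Box<SeqExp<Ann>>, els: Box<SeqExp<Ann>>, ann: Ann },
}

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        Exp::Num(n, _) => SeqExp::Imm(ImmExp::Num(*n), ()),
        Exp::Var(x, _) => SeqExp::Imm(ImmExp::Var(x.clone()), ()),
        Exp::If { cond, thn, els, ann: tag } => {
            // Only the condition is lifted into a let-binding. The name
            // starts with '#', so it cannot clash with a source variable.
            let name = format!("#if_{}", tag);
            SeqExp::Let {
                var: name.clone(),
                bound_exp: Box::new(sequentialize(cond)),
                body: Box::new(SeqExp::If {
                    cond: ImmExp::Var(name),
                    // Each branch stays inside the If, so only the
                    // selected one is ever evaluated.
                    thn: Box::new(sequentialize(thn)),
                    els: Box::new(sequentialize(els)),
                    ann: (),
                }),
                ann: (),
            }
        }
    }
}
```

The branches are sequentialized recursively but remain nested under the If, so any let-bindings they need are introduced inside the branch rather than before the test.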
2.2 Improving the translation
This sequentialization pass is somewhat sloppy: it will generate many unnecessary temporary variables.
Do Now!
Find a simple expression that need not generate any extra variables, but for which sequentialize generates at least one unneeded variable.
For instance x + y is already in sequential form, but this translation will still add new bindings: let #prim2_1_0 = x in let #prim2_2_0 = y in #prim2_1_0 + #prim2_2_0. There are at least two ways to remedy this:
We could make the sequentialization code more complex by checking for this special case and not generating extra variables unless necessary
We could keep the sequentialization code the same and rely on later optimizations to eliminate these extra bindings.
We will discuss the relevant optimizations later in the semester. For now it is optional whether you want to make your sequentialization code eliminate these unnecessary bindings. If you do, I encourage you to find an elegant solution that uses a helper function rather than manually inspecting the sub-expressions to check if they are immediates in each case.
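One possible shape for such a helper, offered as a sketch rather than the intended solution: a function (here called to_imm, a made-up name) that returns an immediate directly when the sub-expression already is one, and otherwise records a binding to be wrapped around the result. The Num and Var constructors of Exp are assumptions, and Prim1, Let, and If are omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Prim2 {
    Add,
    Sub,
    Mul,
}

#[derive(Debug, Clone, PartialEq)]
enum ImmExp {
    Num(i64),
    Var(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Exp<Ann> {
    Num(i64, Ann),
    Var(String, Ann),
    Prim2(Prim2, Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}

#[derive(Debug, Clone, PartialEq)]
enum SeqExp<Ann> {
    Imm(ImmExp, Ann),
    Prim2(Prim2, ImmExp, ImmExp, Ann),
    Let { var: String, bound_exp: Box<SeqExp<Ann>>, body: Box<SeqExp<Ann>>, ann: Ann },
}

// Hypothetical helper: produce an immediate standing for `e`. If `e` is
// already immediate it is reused; otherwise a binding for it is recorded
// under `name` and a reference to that name is returned.
fn to_imm(e: &Exp<u32>, name: String, bindings: &mut Vec<(String, SeqExp<()>)>) -> ImmExp {
    match e {
        Exp::Num(n, _) => ImmExp::Num(*n),
        Exp::Var(x, _) => ImmExp::Var(x.clone()),
        _ => {
            bindings.push((name.clone(), sequentialize(e)));
            ImmExp::Var(name)
        }
    }
}

fn sequentialize(e: &Exp<u32>) -> SeqExp<()> {
    match e {
        Exp::Num(n, _) => SeqExp::Imm(ImmExp::Num(*n), ()),
        Exp::Var(x, _) => SeqExp::Imm(ImmExp::Var(x.clone()), ()),
        Exp::Prim2(op, e1, e2, tag) => {
            let mut bindings = Vec::new();
            let i1 = to_imm(e1, format!("#prim2_1_{}", tag), &mut bindings);
            let i2 = to_imm(e2, format!("#prim2_2_{}", tag), &mut bindings);
            // Wrap the operation in the recorded bindings. Folding in
            // reverse puts the first recorded binding outermost, so the
            // left operand is still evaluated first.
            bindings.into_iter().rev().fold(
                SeqExp::Prim2(*op, i1, i2, ()),
                |body, (var, bound)| SeqExp::Let {
                    var,
                    bound_exp: Box::new(bound),
                    body: Box::new(body),
                    ann: (),
                },
            )
        }
    }
}
```

With this version x + y translates to itself with no new bindings, while (2 - 3) + 4 introduces a single binding for the subtraction. The same helper could be reused for other operators on immediates, which avoids inspecting sub-expressions by hand in each case.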
Now, we can finally look at our current compiler pipeline:
fn compile(e: Exp<Span>) -> String {
    /* make sure all names are in scope, and then */
    let tagged = tag_exp(&e);
    let se = sequentialize(&tagged);
    let tagged_se = tag_seq_exp(&se);
    let compiled = compile_to_instrs(&tagged_se);
    /* ... surround compiled with prelude as needed ... */
}
Quite a lot of changes, just for adding arithmetic and conditionals!