Lecture 11: Mutating Tuples and Desugaring
1 Computing with mutable data
Standard reminder: Every time we enhance our source language, we need to consider several things:
Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended
2 Mutating pairs in our language
2.1 Syntax and examples
We’ll write pairs as
‹expr›: ...
      | [ ‹expr› , ‹expr› ]
      | fst ‹expr›
      | snd ‹expr›
      | set-fst ‹expr› ‹expr›
      | set-snd ‹expr› ‹expr›
As before, the first expression creates a pair, and the next two access the first or second element of that pair. The two new expression forms allow us to modify the first or second item of the pair given by the first subexpression, and set it to the value of the second subexpression. We need to decide on what these expressions will return; let’s choose to make them return the pair itself.
let x = [4, 5] in set-fst x 10 # ==> [10, 5]
let y = [3, 2] in set-fst (set-snd y 8) 6 # ==> [6, 8]
Do Now!
Flesh out the semantics of these two new expressions. What new error conditions might there be?
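One possible way to flesh this out is to model the intended behavior in Rust. The sketch below is only a model of the semantics, not our compiler: the Value type and the helper names pair, set_fst and set_snd are assumptions made for illustration. It returns the same (aliased) pair after mutation, and reports an error when the first argument is not a pair.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical runtime values for this sketch: numbers, booleans, and
/// heap-allocated pairs that can be shared and mutated.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Num(i64),
    Bool(bool),
    Pair(Rc<RefCell<(Value, Value)>>),
}

/// Allocate a fresh pair (models `[a, b]`).
fn pair(a: Value, b: Value) -> Value {
    Value::Pair(Rc::new(RefCell::new((a, b))))
}

/// Models `set-fst target new_val`: overwrite the first component and
/// return the pair itself.
fn set_fst(target: &Value, new_val: Value) -> Result<Value, String> {
    match target {
        Value::Pair(cell) => {
            cell.borrow_mut().0 = new_val;
            // Cloning only copies the Rc, so the result aliases the original pair.
            Ok(target.clone())
        }
        other => Err(format!("set-fst expected a pair, got {:?}", other)),
    }
}

/// Models `set-snd target new_val`: same idea, second component.
fn set_snd(target: &Value, new_val: Value) -> Result<Value, String> {
    match target {
        Value::Pair(cell) => {
            cell.borrow_mut().1 = new_val;
            Ok(target.clone())
        }
        other => Err(format!("set-snd expected a pair, got {:?}", other)),
    }
}
```

Because the result shares storage with the original pair, the mutation is visible through every alias of it, which is exactly what distinguishes mutation from building a new pair.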
We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:
enum Exp<Ann> {
    ...
    Pair(Box<Exp<Ann>>, Box<Exp<Ann>>, Ann),
}
enum Prim1 { ..., Fst, Snd }
enum Prim2 { ..., Setfst, Setsnd }
2.2 Compilation
Given our representation of pairs (from Adding pairs to our language), it’s relatively straightforward to compile these two new primitives. We’ll focus on set-fst; the other is nearly identical. We first compile both subexpressions, and load them into two registers RAX and RDX. We ensure that the first argument is a pair by checking its tag, and if so then we untag the value by subtracting the tag. Now the content of the pair begins at [RAX], and the first item of the pair is located at offset zero from RAX: we simply move the second value into that location. However, our intended semantics is to return the pair value, which means we need to restore RAX to point to the pair: we must tag the value again.
mov RAX, <the pair>
mov RDX, <the new first value>
mov RCX, RAX ;; \
and RCX, 0x7 ;; | check to ensure
cmp RCX, 0x1 ;; | RAX is a pair
jne not_a_pair ;; /
sub RAX, 0x1 ;; untag the value into a raw pointer
mov [RAX + 8 * 0], RDX ;; perform the mutation
add RAX, 0x1 ;; tag the pointer back into a value
To compile set-snd, we do the same thing except use an offset address of [RAX + 8 * 1].
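Since the two primitives differ only in the offset, a code generator can share one helper for both. The following is a minimal sketch, assuming instructions are represented as plain strings and that the pair and the new value have already been loaded into RAX and RDX; the function name compile_set_field is hypothetical and not part of our actual compiler.

```rust
/// Emit the instruction sequence for `set-fst` (field_index = 0) or
/// `set-snd` (field_index = 1).
/// Assumes the pair is already in RAX and the new value is in RDX.
fn compile_set_field(field_index: usize) -> Vec<String> {
    vec![
        // Check that RAX holds a pair: mask out the tag bits and compare.
        "mov RCX, RAX".to_string(),
        "and RCX, 0x7".to_string(),
        "cmp RCX, 0x1".to_string(),
        "jne not_a_pair".to_string(),
        // Untag the value into a raw pointer.
        "sub RAX, 0x1".to_string(),
        // Perform the mutation at the chosen field.
        format!("mov [RAX + 8 * {}], RDX", field_index),
        // Tag the pointer again so RAX holds the pair value as the result.
        "add RAX, 0x1".to_string(),
    ]
}
```

Calling compile_set_field(0) reproduces the sequence shown above for set-fst, and compile_set_field(1) gives the set-snd variant.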
Supporting arbitrary length arrays is a straightforward extension.
3 Ergonomics in Programming with Arrays
There are a few more additions we would like to make to our language to make programming with arrays more ergonomic.
3.1 Semicolon
Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple should be the only thing we want to compute: we likely want to mutate a field and keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side-effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily, writing e1 ; e2 to evaluate e1 and then e2.
Do Now!
How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?
We’ll start by adding an ESeq constructor to our expressions. We have two options for how to proceed from here. We could update all of our compiler passes down to code generation to compile this new semicolon form, but it seems like wasted effort, since intuitively,
e1 ; e2
should mean the same thing as
let DONT_CARE = e1 in e2
Rather than create an explicit expression form, perhaps we could reuse the existing Let form and use our intuitive definition as the actual definition.
We could implement this by adding a "desugar" pass in our compiler somewhere that removes the "syntax sugar" from the language, in this case creating a (guaranteed unique) name in place of DONT_CARE above.
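As a rough illustration, such a pass could look like the sketch below. It uses a deliberately tiny, annotation-free Exp type rather than our real AST, and it assumes that source identifiers can never contain the character #, so generated names such as #seq_1 cannot clash with user variables. The names Seq, desugar and the counter parameter are choices made for this sketch only.

```rust
/// A cut-down expression type for illustration (no annotations).
#[derive(Debug, PartialEq)]
enum Exp {
    Num(i64),
    Var(String),
    Let(String, Box<Exp>, Box<Exp>),
    Seq(Box<Exp>, Box<Exp>), // e1 ; e2
}

/// Rewrite every `e1 ; e2` into `let <fresh> = e1 in e2`.
/// `counter` supplies the numbers used to build fresh names.
fn desugar(e: Exp, counter: &mut u32) -> Exp {
    match e {
        Exp::Num(_) | Exp::Var(_) => e,
        Exp::Let(x, rhs, body) => Exp::Let(
            x,
            Box::new(desugar(*rhs, counter)),
            Box::new(desugar(*body, counter)),
        ),
        Exp::Seq(first, second) => {
            *counter += 1;
            // Assumption: '#' is not allowed in source identifiers.
            let fresh = format!("#seq_{}", *counter);
            Exp::Let(
                fresh,
                Box::new(desugar(*first, counter)),
                Box::new(desugar(*second, counter)),
            )
        }
    }
}
```

After this pass runs, no Seq nodes remain, so every later phase can ignore the semicolon form entirely.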
Exercise
Where should this pass go?
3.2 Programming with Structured, Recursive Data
Now that we have arrays in our language, we can work with interesting structured data. As a simple example, our language is rich enough to write an interpreter. Let’s implement an interpreter for our adder language (minus the let bindings). In Rust we represented an AST as an enum:
enum AdderExp {
    Num(i64),
    Prim1(Prim1, Box<AdderExp>),
}
enum Prim1 {
    Add1,
    Sub1,
}
We now can encode these ASTs fairly easily as values in our Snake language. First, we can encode Prim1 as certain numbers or booleans. Let’s say true represents add1 and false represents sub1. How can we encode AdderExp trees? We need to first distinguish which case we are in, and then also encode the arguments of that case. We can implement these using arrays. An AdderExp will be represented as a pair, with the first element of the array indicating which branch we are in, say true indicates Num and false indicates Prim1. Then the second component contains another array with the contents of the constructor, so Num(3) ends up being represented as [true, [3]] and Prim1(Sub1, Num(5)) will be represented as [false, [false, [true, [5]]]].
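To make the encoding concrete, here is a small Rust sketch that translates an AdderExp into a stand-in for Snake values. The Value type and the encode function are assumptions introduced for this example; they only mirror the scheme described above.

```rust
enum Prim1 {
    Add1,
    Sub1,
}

enum AdderExp {
    Num(i64),
    Prim1(Prim1, Box<AdderExp>),
}

/// Stand-in for Snake values: numbers, booleans and arrays.
#[derive(Debug, PartialEq)]
enum Value {
    Num(i64),
    Bool(bool),
    Arr(Vec<Value>),
}

/// Encode an AdderExp as [tag, [contents...]], where the tag is
/// true for Num and false for Prim1, and the operator is encoded as
/// true for add1 and false for sub1.
fn encode(e: &AdderExp) -> Value {
    match e {
        AdderExp::Num(n) => Value::Arr(vec![
            Value::Bool(true),
            Value::Arr(vec![Value::Num(*n)]),
        ]),
        AdderExp::Prim1(op, sub) => {
            let op_val = Value::Bool(matches!(op, Prim1::Add1));
            Value::Arr(vec![
                Value::Bool(false),
                Value::Arr(vec![op_val, encode(sub)]),
            ])
        }
    }
}
```

Applying encode to Prim1(Sub1, Num(5)) yields the nested array [false, [false, [true, [5]]]] from the text.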
With support for recursive functions, we can even write an interpreter for this language, just as we did in an early homework assignment.
def interp(exp):
  let tag = exp[0],
      components = exp[1] in
  if tag:
    components[0]
  else:
    let op = components[0], sub_exp = components[1] in
    let val = interp(sub_exp) in
    if op:
      val + 1
    else:
      val - 1
end
This is quite a bit uglier to read than the interpreter we wrote in Rust, since we have to work directly with the encoding of enums rather than using pattern-matching. Let’s add a simple form of pattern matching to our language: destructuring arrays in let-bindings. We could write the above interpreter as
def interp(exp):
  let [tag, components] = exp in
  if tag:
    let [num] = components in num
  else:
    let [op, sub_exp] = components in
    let val = interp(sub_exp) in
    if op:
      val + 1
    else:
      val - 1
end
We could add this to our AST by changing Let to have either a variable or a complex array-destructuring expression:
enum Exp<Ann> {
    ...
    Let {
        bindings: Vec<(BindExp<Ann>, Exp<Ann>)>,
        body: Box<Exp<Ann>>,
        ann: Ann,
    },
}
enum BindExp<Ann> {
    Var(String, Ann),
    Arr(Vec<BindExp<Ann>>, Ann),
}
Note that this design naturally allows for nested bindings, so we can have complex destructurings like let [[a, b], c] = [[0, 1], false].
Similar to the semi-colon sequencing operation, we are left with a question of whether to implement this by carrying it all the way through to code generation or desugaring it to other forms. Desugaring sounds easier in principle, but we need to be careful in our design. Let’s consider an isolated example.
let [x, y, z] = e in
x + y * z
We could first try desugaring this to
let x = e[0], y = e[1], z = e[2] in
x + y * z
Exercise
There is a problem with this desugaring! What is it?
We need to think about what the intended semantics of a destructuring let is. If the expression e were print(0) then this desugaring would print three times! We could get similar issues if e manipulated the heap.
To stay consistent with our prior semantics of let bindings, let’s say that e should only be evaluated once. Then we could desugar it as:
let tmp = e,
x = tmp[0],
y = tmp[1],
z = tmp[2] in
x + y * z
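The difference between the two desugarings can be observed directly if we model e as an effectful computation. In the Rust sketch below, eval_e is a hypothetical stand-in for an expression like print(...) that returns an array: each evaluation bumps a counter, which plays the role of the visible side effect. All names here are invented for the illustration.

```rust
use std::cell::Cell;

/// Stand-in for an effectful expression `e`: every evaluation bumps the
/// counter (the "side effect") and yields a three-element array.
fn eval_e(effects: &Cell<u32>) -> Vec<i64> {
    effects.set(effects.get() + 1);
    vec![1, 2, 3]
}

/// Mirrors `let x = e[0], y = e[1], z = e[2] in x + y * z`:
/// `e` is re-evaluated for every projection.
fn naive_desugaring(effects: &Cell<u32>) -> i64 {
    let x = eval_e(effects)[0];
    let y = eval_e(effects)[1];
    let z = eval_e(effects)[2];
    x + y * z
}

/// Mirrors `let tmp = e, x = tmp[0], y = tmp[1], z = tmp[2] in x + y * z`:
/// `e` is evaluated exactly once.
fn tmp_desugaring(effects: &Cell<u32>) -> i64 {
    let tmp = eval_e(effects);
    let x = tmp[0];
    let y = tmp[1];
    let z = tmp[2];
    x + y * z
}
```

Both functions compute the same answer, but the first triggers the side effect three times while the second triggers it once.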
Exercise
There is a problem with this desugaring! What is it?
We have another subtle semantic question. In this desugaring, we simply project out the first three elements of e, so the destructuring would run as long as e has at least three elements. This would be confusing as the syntax let [x, y, z] = e suggests that the x,y,z are the entirety of the array. Instead the natural semantics would be for there to be a runtime error if the length of the array doesn’t match the number of variables we bind.
Then, if we want to desugar this form, we need some kind of assertion
about the size of the array:
let tmp = e,
DONT_CARE = assert_has_size(tmp, 3),
x = tmp[0],
y = tmp[1],
z = tmp[2] in
x + y * z
We don’t need to add this assert_has_size to our programmer-facing syntax; instead we could add this solely as an internal compiler form that checks if an expression is an array of given size. This gives us a best-of-both-worlds approach: most of the semantics is taken care of by our desugaring, but for the new part (checking the size) we add a new form for just the small part of code generation that we need to add to support our intended semantics.
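Putting the pieces together, a desugaring of destructuring bindings might be sketched as follows. This is an illustration under several assumptions, not our actual pass: the types drop the Ann annotations, Index and AssertHasSize are hypothetical constructor names for e[i] and the internal size check, and generated names start with # on the assumption that source identifiers cannot. The function flattens one binding into a list of simple name = expression bindings, and handles nested patterns by recursion.

```rust
/// Binding patterns without annotations, for illustration.
#[derive(Debug, PartialEq)]
enum BindExp {
    Var(String),
    Arr(Vec<BindExp>),
}

/// Just enough expression forms to express the desugared output.
#[derive(Debug, PartialEq)]
enum Exp {
    Var(String),
    Index(Box<Exp>, usize),         // e[i]
    AssertHasSize(Box<Exp>, usize), // internal-only size check
}

/// Flatten `pattern = rhs` into simple bindings, appended to `out` in
/// evaluation order. `counter` is used to build fresh names.
fn desugar_binding(
    pattern: &BindExp,
    rhs: Exp,
    counter: &mut u32,
    out: &mut Vec<(String, Exp)>,
) {
    match pattern {
        BindExp::Var(x) => out.push((x.clone(), rhs)),
        BindExp::Arr(parts) => {
            // Evaluate the right-hand side exactly once.
            *counter += 1;
            let tmp = format!("#tmp_{}", *counter);
            out.push((tmp.clone(), rhs));
            // Check the size before projecting any element.
            *counter += 1;
            let dont_care = format!("#dont_care_{}", *counter);
            let check = Exp::AssertHasSize(Box::new(Exp::Var(tmp.clone())), parts.len());
            out.push((dont_care, check));
            // Bind each sub-pattern to the matching element.
            for (i, part) in parts.iter().enumerate() {
                let elem = Exp::Index(Box::new(Exp::Var(tmp.clone())), i);
                desugar_binding(part, elem, counter, out);
            }
        }
    }
}
```

For the pattern [x, y, z] this produces five bindings in the same order as the desugared program above; for a nested pattern such as [[a, b], c], the inner array gets its own temporary and its own size check.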
Exercise
Now that we have recursive, mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?