Lecture 11: Mutating Tuples and Desugaring

1 Computing with mutable data

Standard reminder: Every time we enhance our source language, we need to consider several things:

Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended

2 Mutating pairs in our language

2.1 Syntax and examples

We’ll write pairs as

‹expr›: ... | [ ‹expr› , ‹expr› ] | fst ‹expr› | snd ‹expr› | set-fst ‹expr› ‹expr› | set-snd ‹expr› ‹expr›

As before, the first expression creates a pair, and the next two access the first or second elements of that pair. The two new expression forms allow us to modify the first or second item of the pair given by the first subexpression, setting it to the value of the second subexpression. We need to decide what these expressions will return; let’s choose to make them return the pair itself.

let x = [4, 5] in set-fst x 10             # ==> [10, 5]
let y = [3, 2] in set-fst (set-snd y 8) 6  # ==> [6, 8]

Do Now!
Flesh out the semantics of these two new expressions. What new error conditions might there be?

We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:

enum Exp<Ann> { ...
  Pair(Box<Exp<Ann>>, Box<Exp<Ann>>, Ann)
}
enum Prim1 { ..., Fst, Snd }
enum Prim2 { ..., Setfst, Setsnd }

2.2 Compilation

Given our representation of pairs (from Adding pairs to our language), it’s relatively straightforward to compile these two new primitives. We’ll focus on set-fst; the other is nearly identical. We first compile both subexpressions, and load them into two registers RAX and RDX. We ensure that the first argument is a pair by checking its tag, and if so then we untag the value by subtracting the tag. Now the content of the pair begins at [RAX], and the first item of the pair is located at offset zero from RAX: we simply move the second value into that location. However, our intended semantics is to return the pair value, which means we need to restore RAX to point to the pair: we must tag the value again.

mov RAX, <the pair>
mov RDX, <the new first value>
mov RCX, RAX           ;; \
and RCX, 0x7           ;; | check to ensure
cmp RCX, 0x1           ;; | RAX is a pair
jne not_a_pair         ;; /
sub RAX, 0x1           ;; untag the value into a raw pointer
mov [RAX + 8 * 0], RDX ;; perform the mutation
add RAX, 0x1           ;; tag the pointer back into a value

To compile set-snd, we do the same thing except use an offset address of [RAX + 8 * 1].
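Since the two primitives differ only in the offset, a code generator can share one routine between them. The following is a minimal sketch, not the course's actual code generator: the names PairField and compile_pair_set are hypothetical, instructions are modeled as plain strings, and we assume the pair is already in RAX and the new value in RDX.

```rust
/// Which field of the pair a mutation targets (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PairField {
    Fst,
    Snd,
}

/// Emits the instruction sequence for set-fst / set-snd.
/// Assumption: the pair is already in RAX and the new value is in RDX.
pub fn compile_pair_set(field: PairField) -> Vec<String> {
    // The first item lives at word offset 0, the second at word offset 1.
    let index = match field {
        PairField::Fst => 0,
        PairField::Snd => 1,
    };
    vec![
        // Check that RAX is tagged as a pair.
        "mov RCX, RAX".to_string(),
        "and RCX, 0x7".to_string(),
        "cmp RCX, 0x1".to_string(),
        "jne not_a_pair".to_string(),
        // Untag the value into a raw pointer.
        "sub RAX, 0x1".to_string(),
        // Perform the mutation at the chosen offset.
        format!("mov [RAX + 8 * {}], RDX", index),
        // Retag so that RAX holds the pair value again.
        "add RAX, 0x1".to_string(),
    ]
}
```

Calling compile_pair_set(PairField::Snd) yields the same seven instructions as the listing above, with the store going to [RAX + 8 * 1].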

Supporting arbitrary length arrays is a straightforward extension.

3 Ergonomics in Programming with Arrays

There are a few more additions we would like to make to our language to make programming with arrays more pleasant.

3.1 Semicolon

Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple should be the only thing we want to compute: we likely want to mutate a field and keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side-effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily:

‹expr›: ... | ‹expr› ; ‹expr›

Do Now!
How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?

We’ll start by adding an ESeq constructor to our expressions. We have two options for how to proceed from here. We could extend all of our compiler passes, down to code generation, to compile this new semicolon form, but that seems like wasted effort, since intuitively,

e1 ; e2

should mean the same thing as

let DONT_CARE = e1 in e2

Rather than create an explicit expression form, perhaps we could reuse the existing Let form and use our intuitive definition as the actual definition.

We could implement this by adding a "desugar" pass in our compiler somewhere that removes the "syntax sugar" from the language, in this case creating a (guaranteed unique) name in place of DONT_CARE above.
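Such a pass might look roughly like the sketch below. It is only an illustration under several assumptions: Exp here is a tiny, unannotated stand-in for our real expression type, the Seq variant plays the role of ESeq, and fresh names are built from a counter plus a leading # (a character we assume cannot appear in source identifiers, which is what makes the names unique).

```rust
/// A tiny stand-in for the compiler's expression type (assumption: the real
/// type has many more forms and carries annotations).
#[derive(Debug, Clone, PartialEq)]
pub enum Exp {
    Num(i64),
    Var(String),
    Let(String, Box<Exp>, Box<Exp>),
    Seq(Box<Exp>, Box<Exp>),
}

/// Rewrites every `e1 ; e2` into `let <fresh> = e1 in e2`.
/// `counter` supplies the numeric suffix for the generated names.
pub fn desugar(e: Exp, counter: &mut u32) -> Exp {
    match e {
        Exp::Num(n) => Exp::Num(n),
        Exp::Var(x) => Exp::Var(x),
        Exp::Let(x, rhs, body) => Exp::Let(
            x,
            Box::new(desugar(*rhs, counter)),
            Box::new(desugar(*body, counter)),
        ),
        Exp::Seq(first, second) => {
            // Generate the name before recursing so that outer sequences
            // receive smaller numbers than inner ones.
            let name = format!("#seq_{}", *counter);
            *counter += 1;
            Exp::Let(
                name,
                Box::new(desugar(*first, counter)),
                Box::new(desugar(*second, counter)),
            )
        }
    }
}
```

After this pass, no Seq nodes remain, so every later pass only has to handle the forms it already knows.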

Exercise
Where should this pass go?

3.2 Programming with Structured, Recursive Data

Now that we have arrays in our language, we can work with interesting structured data. As a simple example, our language is rich enough to write an interpreter. Let’s implement an interpreter for our adder language (minus the let bindings). In Rust we represented an AST as an enum:

enum AdderExp {
  Num(i64),
  Prim1(Prim1, Box<AdderExp>)
}

enum Prim1 {
  Add1,
  Sub1,
}

We can now encode these ASTs fairly easily as values in our Snake language. First, we can encode Prim1 as certain numbers or booleans. Let’s say true represents add1 and false represents sub1. How can we encode AdderExp trees? We need to first distinguish which case we are in, and then also encode the arguments of that case. We can implement both using arrays. An AdderExp will be represented as a pair, with the first element indicating which branch we are in: say true indicates Num and false indicates Prim1. The second element contains another array with the contents of the constructor, so Num(3) ends up being represented as [true, [3]] and Prim1(Sub1, Num(5)) will be represented as [false, [false, [true, [5]]]].
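To make the encoding concrete, here is a small Rust sketch that translates an AdderExp into a model of Snake values and prints it in Snake syntax. The SnakeVal type and the encode and render functions are hypothetical names introduced only for this illustration; they are not part of our compiler.

```rust
/// A model of the Snake values used by the encoding (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum SnakeVal {
    Num(i64),
    Bool(bool),
    Array(Vec<SnakeVal>),
}

pub enum Prim1 {
    Add1,
    Sub1,
}

pub enum AdderExp {
    Num(i64),
    Prim1(Prim1, Box<AdderExp>),
}

/// Encodes an AdderExp as [tag, [contents...]]:
/// tag true means Num, tag false means Prim1;
/// for operators, true means add1 and false means sub1.
pub fn encode(e: &AdderExp) -> SnakeVal {
    match e {
        AdderExp::Num(n) => SnakeVal::Array(vec![
            SnakeVal::Bool(true),
            SnakeVal::Array(vec![SnakeVal::Num(*n)]),
        ]),
        AdderExp::Prim1(op, sub) => {
            let op_val = SnakeVal::Bool(matches!(op, Prim1::Add1));
            SnakeVal::Array(vec![
                SnakeVal::Bool(false),
                SnakeVal::Array(vec![op_val, encode(sub)]),
            ])
        }
    }
}

/// Renders a value using Snake's array syntax.
pub fn render(v: &SnakeVal) -> String {
    match v {
        SnakeVal::Num(n) => n.to_string(),
        SnakeVal::Bool(b) => b.to_string(),
        SnakeVal::Array(items) => {
            let parts: Vec<String> = items.iter().map(render).collect();
            format!("[{}]", parts.join(", "))
        }
    }
}
```

Rendering the encoding of Num(3) gives [true, [3]], and rendering the encoding of Prim1(Sub1, Num(5)) gives [false, [false, [true, [5]]]], matching the two examples above.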

With support for recursive functions, we can even write an interpreter for this language, just as we did in an early homework assignment.

def interp(exp):
  let tag = exp[0] in
  let components = exp[1] in
  if tag:
    components[0]
  else:
    let op = components[0], sub_exp = components[1] in
    let val = interp(sub_exp) in
    if op:
      val + 1
    else:
      val - 1
end

This is quite a bit uglier to read than the interpreter we wrote in Rust, since we have to work directly with the encoding of enums rather than using pattern-matching. Let’s add a simple form of pattern matching to our language: destructuring arrays in let-bindings. We could write the above interpreter as

def interp(exp):
  let [tag, components] = exp in
  if tag:
    let [num] = components in num
  else:
    let [op, sub_exp] = components in
    let val = interp(sub_exp) in
    if op:
      val + 1
    else:
      val - 1
end

We could add this to our AST by changing Let to have either a variable or a complex array-destructuring expression:

enum Exp<Ann> {
    ...
    Let {
        bindings: Vec<(BindExp<Ann>, Exp<Ann>)>,
        body: Box<Exp<Ann>>,
        ann: Ann,
    },
}

enum BindExp<Ann> {
    Var(String, Ann),
    Arr(Vec<BindExp<Ann>>, Ann),
}

Note that this design naturally allows for nested bindings, so we can have complex destructurings like let [[a, b], c] = [[0, 1], false].
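Because BindExp is recursive, passes that care about binders (scope checking, for instance) can treat nested patterns with one short recursive function. The sketch below collects the variables a pattern binds, left to right. It uses a simplified BindExp without annotations, and bound_names is a hypothetical helper name.

```rust
/// A simplified BindExp without annotations (assumption made for brevity).
#[derive(Debug, Clone, PartialEq)]
pub enum BindExp {
    Var(String),
    Arr(Vec<BindExp>),
}

/// Collects every variable bound by a pattern, in left-to-right order.
pub fn bound_names(b: &BindExp) -> Vec<String> {
    match b {
        BindExp::Var(x) => vec![x.clone()],
        // Nested patterns contribute their own names recursively.
        BindExp::Arr(parts) => parts.iter().flat_map(bound_names).collect(),
    }
}
```

For the pattern [[a, b], c], this returns the names a, b, and c in that order.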

Similar to the semicolon sequencing operation, we are left with the question of whether to implement this by carrying it all the way through to code generation or by desugaring it to other forms. Desugaring sounds easier in principle, but we need to be careful in our design. Let’s consider an isolated example.

let [x, y, z] = e in
x + y * z

We could first try desugaring this to

let x = e[0], y = e[1], z = e[2] in
x + y * z

Exercise
There is a problem with this desugaring! What is it?

We need to think about what the intended semantics of a destructuring let is. If the expression e were print(0) then this desugaring would print three times! We could get similar issues if e manipulated the heap.

To stay consistent with our prior semantics of let bindings, let’s say that e should only be evaluated once. Then we could desugar it as:

let tmp = e,
    x = tmp[0],
    y = tmp[1],
    z = tmp[2] in
x + y * z

Exercise
There is a problem with this desugaring! What is it?

We have another subtle semantic question. In this desugaring, we simply project out the first three elements of e, so the destructuring would succeed as long as e has at least three elements. This would be confusing, since the syntax let [x, y, z] = e suggests that x, y, and z make up the entirety of the array. The more natural semantics is to raise a runtime error if the length of the array doesn’t match the number of variables we bind. Then, if we want to desugar this form, we need some kind of assertion about the size of the array:

let tmp = e,
    DONT_CARE = assert_has_size(tmp, 3),
    x = tmp[0],
    y = tmp[1],
    z = tmp[2] in
x + y * z

We don’t need to add this assert_has_size to our programmer-facing syntax. Instead, we could add it solely as an internal compiler form that checks whether an expression is an array of a given size. This gives us a best-of-both-worlds approach: most of the semantics is taken care of by our desugaring, and for the new part (checking the size) we add a new form covering just the small piece of code generation needed to support our intended semantics.
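The desugaring just described can be sketched as a recursive function that flattens one destructuring binding into a sequence of simple bindings. This is an illustration under stated assumptions rather than our actual pass: the BindExp and Exp types are cut-down stand-ins, the names fresh and flatten are hypothetical, and generated names start with # on the assumption that source identifiers cannot.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum BindExp {
    Var(String),
    Arr(Vec<BindExp>),
}

/// Only the expression forms the desugaring itself produces (assumption:
/// the real Exp type has many more forms).
#[derive(Debug, Clone, PartialEq)]
pub enum Exp {
    Var(String),
    Index(Box<Exp>, usize),         // e[i]
    AssertHasSize(Box<Exp>, usize), // internal-only form
}

/// Generates a name that cannot clash with source identifiers.
fn fresh(base: &str, counter: &mut u32) -> String {
    let name = format!("#{}_{}", base, *counter);
    *counter += 1;
    name
}

/// Flattens `bind = rhs` into simple `name = expr` bindings, appended to
/// `out` in evaluation order. `rhs` is bound exactly once per array pattern.
pub fn flatten(bind: &BindExp, rhs: Exp, counter: &mut u32, out: &mut Vec<(String, Exp)>) {
    match bind {
        BindExp::Var(x) => out.push((x.clone(), rhs)),
        BindExp::Arr(parts) => {
            // Evaluate the right-hand side once and remember it.
            let tmp = fresh("tmp", counter);
            out.push((tmp.clone(), rhs));
            // Check the size before projecting any element.
            let check = Exp::AssertHasSize(Box::new(Exp::Var(tmp.clone())), parts.len());
            out.push((fresh("dont_care", counter), check));
            // Bind each sub-pattern to the corresponding element.
            for (i, part) in parts.iter().enumerate() {
                let elem = Exp::Index(Box::new(Exp::Var(tmp.clone())), i);
                flatten(part, elem, counter, out);
            }
        }
    }
}
```

Because flatten recurses on each sub-pattern, nested patterns such as [[a, b], c] get their own temporary and their own size check at each level.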

Exercise
Now that we have recursive, mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?
