8.2

Assignment 1: Rust warmup, part 2: trees

Due: Thu 09/09 at 08:59pm

git clone

Almost every assignment we work on in this course will involve transforming sequential- or tree-shaped data from one form to another. In this assignment, you will be working with traversing a tree and producing results, and traversing a sequence and producing a tree.

1Evaluating arithmetic

Let’s define a data type for describing simple arithmetic expressions:

pub enum Arith {
Plus(Box<Arith>, Box<Arith>),
Times(Box<Arith>, Box<Arith>),
Variable(String),
Num(i32),
}

It should be mnemonically apparent what each case ought to represent. Note that there is no notion of a “parentheses expression”: the parenthesization is implicit in the tree structure. For example,

 "Math" syntax arith 3 * (4 + 5) Times(Num(3), Plus(Num(4), Num(5))) (3 * 4) + 5 Plus(Times(Num(3), Num(4)), Num(5))

This is a miniature language, and we reasonably want to do typical things with it: evaluate it and get an answer, or print it back out as text. Evaluation is straightforward, except for handling variables: we need some sort of environment to look up their values. We will use a fairly simple type for environments: &[(&str, i32)], i.e., a sequence of pairs of strings and associated values.

1. In the starter file arith.rs, we have given you the Arith data type, and a type for environments. Implement the five functions in that file:

/* Looks up the given variable name in the given environment */
fn get(env: &[(&str, i32)], x: &str) -> Option<i32>

/* Evaluates the given expression in the given environment */
fn evaluate(t: &Arith, vars: &[(&str, i32)]) -> Result<i32, String>

/* Neatly prints the given expression as a string */
fn pretty(t: &Arith) -> String

2. Be sure to thoroughly test your functions in the associated testing submodules. Think carefully about how to test them, and be sure to check for sneaky edge cases: get used to thinking about that now, while the languages are still small, so that it becomes second-nature when the languages get bigger!

The comments in the starter code include more precise instructions so read them carefully.

2Parsing Parenthesized expressions

In this part, you will be working on obtaining a tree of data from a string input: in other words, you will be parsing the data. Generally speaking, manually parsing an arbitrary language is tedious, so we will be working with a particularly simple language instead: S-expressions as used in the Lisp/Scheme family of languages.

An S-expression is either an atom or a parenthesized list of 0 or more S-expressions separated by spaces (EPSILON denotes the empty string). An atom is either an (unsigned) number, a boolean or an (ASCII) alphabetic string besides true/false

‹sexp›: | ‹atom› | ( ‹sexp_seq› ) ‹sexp_seq›: | EPSILON | ‹sexp_seq› ‹sexp› ‹atom›: | NUMBER | true | false | ALPHABETICSTRING

Here are some examples of S-expressions

5

iamanatom

true

()

((((()))) () ((a)))

(define (fac x)
(if (zero x)
1
(times x (fac (sub x 1)))))

This may look ugly for a programmer to use but for our purposes it has the advantage of being relatively easy to parse: essentially we just need to make sure the parentheses are matched.

The first step in parsing a file is lexing or tokenizing, that is, a function

tokenize(&str) -> Vec<Token>

You can think of this as breaking up the string into its component words and punctuation marks. The starter file you are given contains a definition of this function, and a definition of the token type, but those definitions are slightly more intricate than above. When you make a mistake while a program in almost any language, the compiler will give you an error message that contains the location of the error in the source. Accordingly, we need to define

struct Span {
start_line: u32,
end_line: u32,  // exclusive
start_col: u32,
end_col: u32    // exclusive
}

And our tokenizer will actually have the type The (Token, Span) is a tuple type, which is a particularly simple kind of struct:

tokenize(&str) -> Vec<(Token, Span)>

When we define tokens, we have a choice of how to represent the symbols, which can be arbitrary alphanumeric strings besides true and false. Should we use an owned, heap-allocated String or a slice of the input string &str? There is a tradeoff here: using &str will take up less memory in the tokens, but require that we keep the original input string alive while we are using the tokens. On the other hand, using String will require more memory to store the tokens, but the original string can be deleted. It seems sensible to say that since tokens are short-lived, we should represent them as &str since this only requires we keep the original string in memory during tokenization. For our AST however we will use String because (1) once we have fully parsed the input into an AST and performed some analyses we no longer need it the input in memory and (2) we may rename the variables in a later compiler pass.

This leads to the following definition of Tokens:

enum Token<'input> {
LParen,
RParen,
Sym(&'input str),
Int(u32),
Bool(bool)
}
Using &str in our Tokens requires us to explicitly annotate our token with the lifetime of those strings. I call this lifetime 'input to remind us that it is the lifetime of the original input program string.

Note that the above signature for tokenize works even though Token<'input> is parameterized by a lifetime. This is because there is only one lifetime in the input, and so Rust’s rules of lifetime elision insert the implicit lifetimes for us. An equivalent signature would be:

 tokenize<'input>(&'input str) -> Vec<Token<'input>> 

The lifetime of the input string slice is the same as the lifetime in Tok<'input> because symbol tokens contain are slices of the input. You can read more about Rust’s lifetime elision rules here.

The starter file also gives you a type definition of s-expressions. As mentioned before, we will use String to represent the variables. Additionally, while with tokens we kept the Span in a separate type, for S-expressions, we would like to know where in the source each of the sub-expressions was as well for later error messages. So we could include with each constructor a Span argument. Instead we will be slightly more general and allow an arbitrary type of "annotations" on each constructor. So Sexp<Ann> is an s-expression with annotations of type Ann:

enum Atom {
Sym(String),
Int(i32),
Bool(bool)
}
enum Sexp<Ann> {
Atom(Atom, Ann),
Nest(Vec<Sexp<T>>, Ann)
}

For parsing we will produce Sexp<Span> but later passes of the compiler will be able to fill this in with other useful datatypes.

1. Define a function

parse_tokens(&[(Token, Span)]) -> Result<Vec<Sexp<Span>>, String>
that parses a given sequence of tokens with position information into a Result: either it successfully parses a sequence of s-expressions from the token list, or else it should produce an error message explaining the error...and referencing the location at which the error occurred. For example:
parse_tokens(&tokenize("(a b)").unwrap()) ==>
Ok([Nest([Atom(Sym("a"), Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }),
Atom(Sym("b"), Span { start_line: 0, start_col: 3, end_line: 1, end_col: 4 })],
Span { start_line: 0, start_col: 0, end_line: 1, end_col: 5 })])

parse_tokens(&tokenize("a b").unwrap())   ===>
Ok([SExp::Atom(Atom::Sym(String::from("a")), Span{ start_line: 0, start_col: 0, end_line: 1, end_col: 1}),
SExp::Atom(Atom::Sym(String::from("b")), Span{ start_line: 0, start_col: 2, end_line: 1, end_col: 3})])

parse_tokens(&tokenize("(a (b true) 3)").unwrap()) ==>
Ok([Nest([Atom(Sym("a"), Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }),
Nest([Atom(Sym("b"), Span { start_line: 0, start_col: 4, end_line: 1, end_col: 5 }),
Atom(Bool(true), Span { start_line: 0, start_col: 6, end_line: 1, end_col: 10 })],
Span { start_line: 0, start_col: 3, end_line: 1, end_col: 11 }),
Atom(Int(3), Span { start_line: 0, start_col: 12, end_line: 1, end_col: 13 })],
Span { start_line: 0, start_col: 0, end_line: 1, end_col: 14 })])

parse_tokens(&tokenize("(a (b c").unwrap())  ==> Err("Left paren '(' at line 0, column 3 never matched")

parse_tokens(&tokenize("true) 3)").unwrap()) ==> Err("Unmatched right paren ')' on line 0, column 4")

Hint: the signature above is not general enough to be sufficient to parse s-expressions, and you’ll need one or more helper functions. First consider what a plausible “subtask” might be, to make progress on producing a Vec<Sexp<Span>>. Next, consider the data definition for Sexp carefully, and look very carefully at the second example above: at the point just after the token true and before the following RPAREN, how might you describe your intermediate state? What have you built up so far, what are you in the process of building, and what is left over?

Hint: you can and should solve this problem by looking at only one token at a time. Trying to “look ahead” in the list is pretty much guaranteed not to work, so you need to use some other technique to match up left and right parentheses. One mnemonic I was shown a long time ago: to check if an s-expression is balanced, mentally start a running count at zero, scan across the expression from left to right, increment the count at each left parenthesis, and decrement it at each right parenthesis. If the count is zero at the end, and never became negative, then you’ve got a correctly-matched s-expression. How might you carry the idea of this across into your code? (You don’t necessarily need to maintain an actual int counter; you’re implicitly maintaining the call stack...)

For this assignment, you will be graded on

• whether your code implements the specification (functional correctness),

• whether you thoroughly test every method that you write, and

4Submission

4.1Deliverables

Your submission should include all the provided files; you should not need to create any new ones.

To submit on gradescope, zip the files

\$ zip submission.zip Cargo.toml src/lib.rs src/tokenize.rs src/sexp.rs src/arith.rs

and submit that zip.