Assignment 1: Rust warmup, part 2: trees
Due: Thu 09/09 at 08:59pm
git clone
Almost every assignment we work on in this course will involve transforming sequential- or tree-shaped data from one form to another. In this assignment, you will work on traversing a tree and producing results, and traversing a sequence and producing a tree.
1 Evaluating arithmetic
Let’s define a data type for describing simple arithmetic expressions:
pub enum Arith {
    Plus(Box<Arith>, Box<Arith>),
    Times(Box<Arith>, Box<Arith>),
    Variable(String),
    Num(i32),
}
It should be mnemonically apparent what each case ought to represent. Note that there is no notion of a “parentheses expression”: the parenthesization is implicit in the tree structure.
"Math" syntax |
|
|
|
|
|
|
|
|
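For example, the expression we would conventionally write as (1 + x) * 3 is represented by the tree

Arith::Times(
    Box::new(Arith::Plus(
        Box::new(Arith::Num(1)),
        Box::new(Arith::Variable(String::from("x"))),
    )),
    Box::new(Arith::Num(3)),
)

Note that the grouping of 1 + x is captured entirely by the nesting of Plus inside Times; no parenthesis node is needed.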
This is a miniature language, and we reasonably want to do typical things with it: evaluate it and get an answer, or print it back out as text. Evaluation is straightforward, except for handling variables: we need some sort of environment to look up their values. We will use a fairly simple type for environments: &[(&str, i32)], i.e., a sequence of pairs of strings and associated values.
In the starter file arith.rs, we have given you the Arith data type and a type for environments. Implement the five functions in that file:

/* Looks up the given variable name in the given environment */
fn get(env: &[(&str, i32)], x: &str) -> Option<i32>

/* Evaluates the given expression in the given environment */
fn evaluate(t: &Arith, vars: &[(&str, i32)]) -> Result<i32, String>

/* Neatly prints the given expression as a string */
fn pretty(t: &Arith) -> String
Be sure to thoroughly test your functions in the associated testing submodules. Think carefully about how to test them, and be sure to check for sneaky edge cases: get used to thinking about that now, while the languages are still small, so that it becomes second-nature when the languages get bigger!
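As a starting point, here is a sketch of the kind of test you might put in the testing submodule (with use super::*; in scope), reusing the (1 + x) * 3 tree from above. The exact error string for an unbound variable is up to your implementation, so the last assertion only checks that some error is produced:

#[test]
fn warmup_examples() {
    // Lookups in a small environment.
    assert_eq!(get(&[("x", 2), ("y", 10)], "y"), Some(10));
    assert_eq!(get(&[], "y"), None);

    // (1 + x) * 3
    let t = Arith::Times(
        Box::new(Arith::Plus(
            Box::new(Arith::Num(1)),
            Box::new(Arith::Variable(String::from("x"))),
        )),
        Box::new(Arith::Num(3)),
    );
    // With x bound to 2, (1 + 2) * 3 evaluates to 9.
    assert_eq!(evaluate(&t, &[("x", 2)]), Ok(9));
    // With an empty environment, the unbound variable is an error.
    assert!(evaluate(&t, &[]).is_err());
}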
The comments in the starter code include more precise instructions, so read them carefully.
2 Parsing parenthesized expressions
In this part, you will be working on obtaining a tree of data from a string input: in other words, you will be parsing the data. Generally speaking, manually parsing an arbitrary language is tedious, so we will be working with a particularly simple language instead: S-expressions as used in the Lisp/Scheme family of languages.
An S-expression is either an atom or a parenthesized list of 0 or more S-expressions separated by spaces. An atom is either an (unsigned) number, a boolean, or an (ASCII) alphabetic string other than true/false. In the grammar below, EPSILON denotes the empty string:
‹sexp›     ::= ‹atom›
             | ( ‹sexp_seq› )
‹sexp_seq› ::= EPSILON
             | ‹sexp_seq› ‹sexp›
‹atom›     ::= NUMBER
             | true
             | false
             | ALPHABETICSTRING
Here are some examples of S-expressions:
5
iamanatom
true
()
((((()))) () ((a)))
(define (fac x)
(if (zero x)
1
(times x (fac (sub x 1)))))
This may look ugly for a programmer to use, but for our purposes it has the advantage of being relatively easy to parse: essentially, we just need to make sure the parentheses are matched.
The first step in parsing a file is lexing or tokenizing, that is, a function
tokenize(&str) -> Vec<Token>
You can think of this as breaking up the string into its component words and punctuation marks. The starter file you are given contains a definition of this function, and a definition of the token type, but those definitions are slightly more intricate than the above. When you make a mistake while writing a program in almost any language, the compiler will give you an error message that contains the location of the error in the source. Accordingly, we need to define
struct Span {
    start_line: u32,
    end_line: u32, // exclusive
    start_col: u32,
    end_col: u32,  // exclusive
}
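For example, in the input "(a b)", the symbol token a sits on line 0 at columns 1 through 2 (both ends exclusive), i.e., Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }, matching the parse_tokens examples later in this section.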
And our tokenizer will actually have the type

tokenize(&str) -> Vec<(Token, Span)>

The (Token, Span) is a tuple type, which is a particularly simple kind of struct.
When we define tokens, we have a choice of how to represent the symbols, which can be arbitrary alphanumeric strings besides true and false. Should we use an owned, heap-allocated String, or a slice of the input string, &str? There is a tradeoff here: using &str will take up less memory in the tokens, but requires that we keep the original input string alive while we are using the tokens. On the other hand, using String will require more memory to store the tokens, but the original string can be deleted. It seems sensible to say that since tokens are short-lived, we should represent them as &str, since this only requires that we keep the original string in memory during tokenization. For our AST, however, we will use String, because (1) once we have fully parsed the input into an AST and performed some analyses, we no longer need the input in memory, and (2) we may rename the variables in a later compiler pass.
This leads to the following definition of Tokens:
enum Token<'input> {
    LParen,
    RParen,
    Sym(&'input str),
    Int(u32),
    Bool(bool),
}
Using &str in our tokens requires us to explicitly annotate our token type with the lifetime of those strings. I call this lifetime 'input to remind us that it is the lifetime of the original input program string. Note that the above signature for tokenize works even though Token<'input> is parameterized by a lifetime. This is because there is only one lifetime in the input, and so Rust’s rules of lifetime elision insert the implicit lifetimes for us. An equivalent signature would be:

tokenize<'input>(&'input str) -> Vec<(Token<'input>, Span)>

The lifetime of the input string slice is the same as the lifetime in Token<'input> because symbol tokens contain slices of the input. You can read more about Rust’s lifetime elision rules here.
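As an illustration, following the span conventions used in the parse_tokens examples below (and eliding any wrapper the starter’s more intricate definition of tokenize may add, such as the Result suggested by the .unwrap() calls below), tokenizing "(a 1)" should produce something like:

tokenize("(a 1)")
  ==> [(LParen,   Span { start_line: 0, start_col: 0, end_line: 1, end_col: 1 }),
       (Sym("a"), Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }),
       (Int(1),   Span { start_line: 0, start_col: 3, end_line: 1, end_col: 4 }),
       (RParen,   Span { start_line: 0, start_col: 4, end_line: 1, end_col: 5 })]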
The starter file also gives you a type definition of s-expressions. As mentioned before, we will use String to represent the variables. Additionally, while with tokens we kept the Span in a separate type, for S-expressions we would also like to know where in the source each of the sub-expressions was, for later error messages. So we could include a Span argument with each constructor. Instead, we will be slightly more general and allow an arbitrary type of "annotations" on each constructor. So Sexp<Ann> is an s-expression with annotations of type Ann:
enum Atom {
    Sym(String),
    Int(i32),
    Bool(bool),
}

enum Sexp<Ann> {
    Atom(Atom, Ann),
    Nest(Vec<Sexp<Ann>>, Ann),
}
For parsing we will produce Sexp<Span>, but later passes of the compiler will be able to fill this in with other useful datatypes.
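To make the annotation parameter concrete: ignoring positions, the s-expression (a true) could be written with the trivial annotation type () as the value below. This is just an illustration of the generic parameter; your parser will produce Sexp<Span> values like the ones in the examples that follow.

Sexp::Nest(
    vec![
        Sexp::Atom(Atom::Sym(String::from("a")), ()),
        Sexp::Atom(Atom::Bool(true), ()),
    ],
    (),
)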
Define a function

parse_tokens(&[(Token, Span)]) -> Result<Vec<Sexp<Span>>, String>

that parses a given sequence of tokens with position information into a Result: either it successfully parses a sequence of s-expressions from the token list, or else it should produce an error message explaining the error and referencing the location at which the error occurred. For example:

parse_tokens(&tokenize("(a b)").unwrap())
  ==> Ok([Nest([Atom(Sym("a"), Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }),
                Atom(Sym("b"), Span { start_line: 0, start_col: 3, end_line: 1, end_col: 4 })],
               Span { start_line: 0, start_col: 0, end_line: 1, end_col: 5 })])

parse_tokens(&tokenize("a b").unwrap())
  ==> Ok([Atom(Sym("a"), Span { start_line: 0, start_col: 0, end_line: 1, end_col: 1 }),
          Atom(Sym("b"), Span { start_line: 0, start_col: 2, end_line: 1, end_col: 3 })])

parse_tokens(&tokenize("(a (b true) 3)").unwrap())
  ==> Ok([Nest([Atom(Sym("a"), Span { start_line: 0, start_col: 1, end_line: 1, end_col: 2 }),
                Nest([Atom(Sym("b"), Span { start_line: 0, start_col: 4, end_line: 1, end_col: 5 }),
                      Atom(Bool(true), Span { start_line: 0, start_col: 6, end_line: 1, end_col: 10 })],
                     Span { start_line: 0, start_col: 3, end_line: 1, end_col: 11 }),
                Atom(Int(3), Span { start_line: 0, start_col: 12, end_line: 1, end_col: 13 })],
               Span { start_line: 0, start_col: 0, end_line: 1, end_col: 14 })])

parse_tokens(&tokenize("(a (b c").unwrap())
  ==> Err("Left paren '(' at line 0, column 3 never matched")

parse_tokens(&tokenize("true) 3)").unwrap())
  ==> Err("Unmatched right paren ')' on line 0, column 4")
Test your parser carefully.
Hint: the signature above is not general enough to be sufficient to parse s-expressions, and you’ll need one or more helper functions. First consider what a plausible “subtask” might be, to make progress on producing a Vec<Sexp<Span>>. Next, consider the data definition for Sexp carefully, and look very carefully at the example (a (b true) 3) above: at the point just after the token true and before the following RPAREN, how might you describe your intermediate state? What have you built up so far, what are you in the process of building, and what is left over?
Hint: you can and should solve this problem by looking at only one token at a time. Trying to “look ahead” in the list is pretty much guaranteed not to work, so you need to use some other technique to match up left and right parentheses. One mnemonic I was shown a long time ago: to check if an s-expression is balanced, mentally start a running count at zero, scan across the expression from left to right, increment the count at each left parenthesis, and decrement it at each right parenthesis. If the count is zero at the end, and never became negative, then you’ve got a correctly-matched s-expression. How might you carry the idea of this across into your code? (You don’t necessarily need to maintain an actual int counter; you’re implicitly maintaining the call stack...)
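To make the mnemonic concrete, here is a minimal sketch of just the balance check, separate from parsing. It assumes the Token type from above; it is not one of the assignment’s required functions, only an illustration of the counting idea:

fn balanced(tokens: &[Token<'_>]) -> bool {
    // Running count of currently-unmatched left parentheses.
    let mut depth: u32 = 0;
    for tok in tokens {
        match tok {
            Token::LParen => depth += 1,
            Token::RParen => {
                if depth == 0 {
                    // A right paren with nothing to match: the count
                    // would go negative, so we are unbalanced.
                    return false;
                }
                depth -= 1;
            }
            // Atoms don't affect the count.
            _ => {}
        }
    }
    // Balanced only if every left paren found its match.
    depth == 0
}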
3 Grading standards
For this assignment, you will be graded on
whether your code compiles,
whether your code implements the specification (functional correctness),
whether you thoroughly test every method that you write, and
how readable your code is (indented well, commented well, etc).
4 Submission
4.1 Deliverables
Your submission should include all the provided files; you should not need to create any new ones.
To submit on gradescope, zip the files
$ zip submission.zip Cargo.toml src/lib.rs src/tokenize.rs src/sexp.rs src/arith.rs
and submit that zip.