Lecture 6: Runtime representations for multiple value types
We ought to be able to handle more kinds of values than just numbers. Let’s try to add at least one more type of value to our language. Doing so will trigger a large number of problems, which will all need to be resolved to make things work again.
1 Growing the language: adding boolean values
Reminder: Every time we enhance our source language, we need to consider several things:
Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended
When we introduced conditionals and branching control-flow last time, we were a bit dissatisfied with the semantics of "branch if it's nonzero or not": a proper boolean type would let us express that intent directly.
Do Now!
What new primitive operations should we add to operate with booleans? What should happen if we give numeric arguments to boolean operations? Or vice versa?
Exercise
We currently represent numbers as themselves. How can we represent booleans?
Exercise
How can we prevent things from going wrong?
The key challenge, from which all the other problems today will arise, is that our current encoding of numbers is too big. We’ve assigned a numeric meaning to every 64-bit pattern, and therefore used them all up: there are no bit patterns left to represent boolean values. Somehow, we need to carve out space for our booleans. As soon as we do that, though:
Not all operations should operate on all bit patterns, which means we have to ensure that operations can distinguish valid operands from invalid ones.
Not all patterns can be considered numbers, which means we have to ensure that our compilation of numbers makes sense in this new extended setup.
It becomes apparent that just supporting booleans is shortsighted: we’d like to be able to broaden these techniques to other datatypes later on, such as pointers, tuples, structures, closures, or others.
A lot of work, for just two values! But that work forms a blueprint to follow when growing the language further.
1.1 Wait, what about static typing?
Surely we can resolve some if not all of these problems by designing a type-checker for our language? After all, if we can determine statically that the program never misuses an operation with the wrong type of operands, then we don’t need to distinguish those types of values at runtime?
Sadly, not quite: while some operations will be simplified if we do this, there are still features we’d like to have in our compiler, such as garbage collection, that will inevitably need to treat some types of values differently than others. Without getting into the details now (we’ll save that for later), learning how to deal with distinguishing values now, while the language is still small, will be good practice when the language gets more complicated soon.
1.2 The new concrete syntax
Today’s language changes themselves are syntactically quite simple: only two new expression forms, and a bunch of new operators. Notice that we have operators that consume and return booleans, and also operators that consume numbers and return booleans – language features interact.
1.3 Examples
Exercise
Develop a suite of test examples for using booleans. Be thorough, and consider as many edge cases as possible.
1.4 Enhancing the abstract syntax
The abstract syntax changes are similarly straightforward to the concrete ones.
enum Prim1 {
    ...
    Not,
}
enum Prim2 {
    ...
    And,
    Or,
    Lt,
    Leq,
    Gt,
    Geq,
    Eq,
    Neq,
}
enum Exp<Ann> {
    ...
    Bool(bool),
}
enum ImmExp<Ann> {
    ...
    Bool(bool),
}
Like numbers, we will consider Bool(_)
values to be immediates for the
purposes of sequential form: no further computation on them is needed.
2 Representation choices
If all of our bit-patterns so far are used for numbers, then when we choose some patterns to reinterpret as booleans, we must have some way of distinguishing boolean (or other) patterns from numeric ones.
A note about notation: since we will be discussing actual bit-level representations today, it will be useful to use either binary or hexadecimal notations for numbers, rather than more common decimal notation. I.e.
0b00000000_0000....0000_00001010
= 0x000000000000000A
= 10
For clarity, if needed, we’ll include underscores to separate the bytes in the
binary representation, and elide the middle bits if they’re not interesting.
Further, if there are “wildcard” bits that we don’t care about, we will mark
them as w
in the binary representation.
2.1 Attempt 1: Why limit ourselves to 64-bits?
If all the patterns in a 64-bit word are used up...let’s use more words.
Suppose we encode all our values as two words, where the first word is a tag word describing what type of value comes next (0 for number, 1 for boolean), and the second word holds the value itself.
Do Now!
List at least three consequences of this approach.
On the one hand, this approach is certainly future-proof: we have room to distinguish 9 quintillion distinct types of data in our programs! But on the other, consider how we would have to implement any of our operations. Because our values are now larger than our word size, they won’t fit into registers any more, which in turn means that our invariant of “the answer always goes in RAX” won’t work any longer. Instead, we will need to always put our answers on the stack, and then put the address of the first word of the answer into RAX: i.e., the actual answer can be found in [RAX] and [RAX - 8]. This in turn means that every single operation we want to perform will require at least two memory reads and writes, a steep price to pay for every operation.
2.2 Attempt 2: We don’t need 64 bits for a tag
If having 64 tag bits is overkill, let’s try the minimum needed: could we get away with a single tag bit? We can pick any bit we like for this, but for reasons that will become apparent soon, let’s choose the least-significant bit (LSB) to denote numbers versus booleans:

LSB = 0 implies the value is a number
LSB = 1 implies the value is a boolean
This may still feel like overkill: we have 9 quintillion bit patterns reserved just for booleans, and we’ve given up half of our available space for numbers. In practice, we’ll accept the loss of range on numbers for now, and we’ll refine our notion of what it means to “tag a boolean” so that we can use the rest of that reserved space for other types of values, later.
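To make the tag check concrete, here is a minimal Rust sketch (the helper names are ours, not part of the compiler): classifying any value takes a single bitwise AND against the tag bit.

```rust
// Hypothetical helpers illustrating the LSB tagging scheme.
const TAG_MASK: u64 = 0x1;

fn is_number(v: u64) -> bool {
    v & TAG_MASK == 0 // LSB of 0 marks a number
}

fn is_boolean(v: u64) -> bool {
    v & TAG_MASK == 1 // LSB of 1 marks a boolean
}

fn main() {
    assert!(is_number(0b1010));  // even bit patterns are numbers
    assert!(is_boolean(0b1011)); // odd bit patterns are booleans
    println!("tag checks pass");
}
```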
2.2.1 Booleans
We now need to choose representations for true and false themselves. Since it’s very likely that tags are going to get more complex later, we probably want to reserve several low-order bits for future tag use, and distinguish the two booleans by the most-significant bit (MSB) instead:

MSB = 0 encodes false
MSB = 1 encodes true

In other words, our true and false values will therefore be 0b1wwwwwww_wwww.....wwww_wwwwwww1 and 0b0wwwwwww_wwww.....wwww_wwwwwww1, respectively. We can choose whatever bits we want in between: two obvious choices would be either all zeros or all ones. Arbitrarily, we’ll choose all-ones.
2.2.2 Numbers
How should we represent numbers, then? Since we’ve decided that an LSB of zero indicates a number, we effectively have all the bit-patterns-formerly-interpreted-as-even-numbers available to us as numbers. This lends itself to a convenient encoding: we’ll simply represent the number n as the bit-pattern that used to represent 2n:
Number | Representation
0      | 0x0000000000000000
1      | 0x0000000000000002
10     | 0x0000000000000014
-1     | 0xFFFFFFFFFFFFFFFE
-10    | 0xFFFFFFFFFFFFFFEC
n      | the bits of 2n
This is clearly a one-to-one mapping, so it’s a faithful encoding; let’s see if it’s a useful one, too.
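As a quick check of that usefulness, the encoding and its inverse are each a single shift. A small Rust sketch (the helper names are ours):

```rust
// n -> 2n encoding: shift left to encode, arithmetic shift right to decode.
fn encode(n: i64) -> i64 {
    n << 1
}

fn decode(rep: i64) -> i64 {
    rep >> 1 // on i64, >> is an arithmetic shift, so signs survive
}

fn main() {
    assert_eq!(encode(10), 20);
    assert_eq!(encode(-3), -6);
    assert_eq!(decode(encode(-3)), -3); // round-trips, including negatives
    println!("encoding round-trips");
}
```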
2.3 Compiling constants
Compiling our new constants is fairly straightforward. There is a twist: we need to ensure at compile time that any numeric constants we have in our source program aren’t inherently overflowing:
static MAX_SNAKE_INT: i64 = i64::MAX >> 1;
static MIN_SNAKE_INT: i64 = i64::MIN >> 1;
fn compile_imm(e: &ImmExp, env: &[(&str, i32)]) -> Arg64 {
    match e {
        ImmExp::Num(n) => {
            if *n > MAX_SNAKE_INT || *n < MIN_SNAKE_INT {
                panic!("Integer constant doesn't fit into 63 bits: {}", n);
            } else {
                Arg64::Imm(n << 1)
            }
        }
        ImmExp::Bool(b) => {
            if *b {
                SNAKE_TRUE
            } else {
                SNAKE_FALSE
            }
        }
    }
}
The << operator shifts the bits of its first operand by some number of places given by the second operand. The newly vacated bits are filled in with zeros. As a consequence, (n << 1) == 2 * n ...but since we’re contemplating these values as bit patterns, it’s a useful observation to notice that multiplying by two simply slides the bits around: perhaps we can rely on that later to unslide them as needed.
Do Now!
It’s worth pausing to confirm that << does have the desired semantics in all cases, especially when n is negative. Given that our numbers use two’s complement representation, what can we determine about the most significant bits of any negative number in our valid range of numbers? Given that, what can we deduce about the behavior of <<?
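One way to check the Do Now concretely (a standalone sketch, not compiler code): in two’s complement, any negative number in our 63-bit range has its top two bits equal, so shifting left by one still yields the correct doubled value.

```rust
// Doubling via shift, exactly what our compiled constants rely on.
fn double_by_shift(n: i64) -> i64 {
    n << 1 // vacated low bit filled with zero; equals 2 * n
}

fn main() {
    assert_eq!(double_by_shift(-5), -10);
    // bit-level view: -5 is 0xFFFF_FFFF_FFFF_FFFB, and -10 is 0xFFFF_FFFF_FFFF_FFF6
    assert_eq!(double_by_shift(-5) as u64, 0xFFFF_FFFF_FFFF_FFF6);
    assert_eq!(double_by_shift(3), 6);
    println!("left shift doubles negatives too");
}
```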
The SNAKE_TRUE and SNAKE_FALSE values are whatever bit patterns we chose above; for legibility, we ought to define them as named constants in our compiler, rather than scatter magic numbers throughout our code.
Exercise
Now that we have defined how to compile constants, what output will we get from executing the program 15? What will we get from running the program true?
2.4 Defining arithmetic over our representations
Well clearly, even before arithmetic, the first operation we should redefine is
print
– or else we cannot trust any output of our program!
Do Now!
How many cases of output should we check for, if we’re coding print ourselves?
Since we now have at least two different types of values in our language, it doesn’t quite make sense to use the Rust type i64 to describe them: they’re not meaningfully numbers in Rust, necessarily, but rather are just 64 bits of data. Accordingly, we’ll use a single-argument struct to create a wrapper type around u64, to remind ourselves of which values are meaningful values-in-our-program, and which values are meaningful-in-Rust. All bit patterns ending in a zero bit will be considered numbers. Only two patterns that end in a one have any meaning, as true and false; everything else is an error for now:
#[repr(C)]
#[derive(PartialEq, Eq, Copy, Clone)]
struct SNAKE_VAL(u64);

static BOOL_TAG: u64 = 0x00_00_00_00_00_00_00_01;
static SNAKE_TRU: SNAKE_VAL = SNAKE_VAL(0xFF_FF_FF_FF_FF_FF_FF_FF);
static SNAKE_FLS: SNAKE_VAL = SNAKE_VAL(0x7F_FF_FF_FF_FF_FF_FF_FF);

#[link(name = "compiled_code", kind = "static")]
extern "C" {
    #[link_name = "\x01start_here"]
    fn start_here() -> SNAKE_VAL;
}

// reinterprets the bytes of an unsigned number as a signed number
fn unsigned_to_signed(x: u64) -> i64 {
    i64::from_le_bytes(x.to_le_bytes())
}

fn sprint_snake_val(x: SNAKE_VAL) -> String {
    if x.0 & BOOL_TAG == 0 { // it's a number
        format!("{}", unsigned_to_signed(x.0) >> 1)
    } else if x == SNAKE_TRU {
        String::from("true")
    } else if x == SNAKE_FLS {
        String::from("false")
    } else {
        panic!("Invalid snake value")
    }
}

fn main() {
    let output = unsafe { start_here() };
    println!("As unsigned number: {}", output.0);
    println!("As signed number: {}", unsigned_to_signed(output.0));
    println!("As binary: 0b{:064b}", output.0);
    println!("As hex: 0x{:016x}", output.0);
    println!("As snake value: {}", sprint_snake_val(output));
}
Now let’s consider arithmetic operations.
Do Now!
What currently is output if we execute add1(0)? Redefine add1 and sub1 to fix this.
Zero is represented as 0x0000000000000000, so add1(0) currently produces 0x0000000000000001, which is tagged as a boolean value! Our implementations of add1, and more generally addition and subtraction, need to take that representation into account.
Operation | Representation | Result      | Interpretation
n1 + n2   | 2n1 + 2n2      | 2(n1 + n2)  | n1 + n2
n1 - n2   | 2n1 - 2n2      | 2(n1 - n2)  | n1 - n2
Interestingly, our notion of addition and subtraction just works with our encoding, because multiplication distributes over addition. This is very convenient: no changes are needed to how we compile addition or subtraction. (Not quite! What important cases are we missing? There’s nothing we can do about them yet, but we should be aware that there are lingering problems.) And it informs us how we should revise add1 and sub1: they should add or subtract not 1, but rather the representation of 1, i.e. 2.
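A small Rust sketch of this claim (the function names are ours, for illustration): adding representations is the same as representing the sum, and add1 becomes “add 2”.

```rust
// Addition works directly on representations: 2*a + 2*b == 2*(a + b).
fn add_reps(ra: i64, rb: i64) -> i64 {
    ra + rb // no untagging or retagging needed
}

// add1 must add the representation of 1, which is 2.
fn add1_rep(ra: i64) -> i64 {
    ra + 2
}

fn main() {
    assert_eq!(add_reps(7 << 1, -3 << 1), 4 << 1); // 7 + -3 == 4
    assert_eq!(add1_rep(0 << 1), 1 << 1);          // add1(0) == 1
    println!("addition needs no changes");
}
```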
Do Now!
What currently is output if we execute
3 * 2
? Redefine multiplication to fix this.
Following the same logic as above, 3 * 2
is encoded as 6 * 4
, which
produces 24
as a result, which is interpreted as 12, which is
twice as big as the desired result. When we multiply, we wind up with an extra
factor of two since both operands have individually been multiplied by two.
Our implementation of multiplication must shift the result right by one bit, to
compensate. The shr val, bits
instruction shifts the given val
to
the right by the given number of bits
.
Do Now!
Ok, so multiplication seems fixed. What happens if we execute
2 * (-1)
? Why does that result appear?
Our program prints 4611686018427387902, which is represented by 9223372036854775804, or 0x7FFFFFFFFFFFFFFC. The trouble is that negative numbers are represented by having the MSB set to 1 (the “sign bit”), and shifting right does not preserve that sign bit: our output has an MSB of zero, even though it should be representing a negative number. Instead, we need arithmetic shift right, rather than logical shift right: a dedicated instruction sar that behaves like shr, except it keeps the sign bit unchanged.
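We can check both points in Rust, where >> on i64 is an arithmetic shift (like sar) and >> on u64 is a logical shift (like shr); mul_reps is our own illustrative helper, not the compiler’s code:

```rust
// Multiplying two representations yields 2a * 2b == 4ab; one arithmetic
// shift right removes the extra factor of two while preserving the sign.
fn mul_reps(ra: i64, rb: i64) -> i64 {
    (ra * rb) >> 1 // i64's >> behaves like sar
}

fn main() {
    assert_eq!(mul_reps(3 << 1, 2 << 1), 6 << 1);   // 3 * 2 == 6
    assert_eq!(mul_reps(2 << 1, -1 << 1), -2 << 1); // 2 * -1 == -2
    // A logical shift (shr) instead would destroy the sign bit:
    let shr_result = ((4i64 * -2i64) as u64) >> 1;
    assert_eq!(shr_result, 0x7FFF_FFFF_FFFF_FFFC); // a huge bogus positive
    println!("sar, not shr");
}
```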
2.5 Defining logic operations over our representations
Logical and
and or
are very easy to define. Because we’ve confined
the differences between the representations of true
and false
to a
single bit (the MSB), and in all other bits they are the same, logical
and
and or
are equivalent to bitwise and
and or
, for
which we have dedicated assembly instructions. Specifically:
val1  | val2  | val1-rep           | val2-rep           | val1-rep bitwise-AND val2-rep | val1 and val2
true  | true  | 0xFFFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF            | true
true  | false | 0xFFFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF            | false
false | true  | 0x7FFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF            | false
false | false | 0x7FFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF            | false

val1  | val2  | val1-rep           | val2-rep           | val1-rep bitwise-OR val2-rep | val1 or val2
true  | true  | 0xFFFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF           | true
true  | false | 0xFFFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF           | true
false | true  | 0x7FFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF | 0xFFFFFFFFFFFFFFFF           | true
false | false | 0x7FFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF | 0x7FFFFFFFFFFFFFFF           | false
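The tables can be verified directly in Rust with our chosen all-ones representations (a standalone sketch, not runtime code):

```rust
// Chosen representations: all bits 1 for true; the same but with MSB 0 for false.
const SNAKE_TRUE: u64 = 0xFFFF_FFFF_FFFF_FFFF;
const SNAKE_FALSE: u64 = 0x7FFF_FFFF_FFFF_FFFF;

fn main() {
    // Bitwise AND implements logical and:
    assert_eq!(SNAKE_TRUE & SNAKE_TRUE, SNAKE_TRUE);
    assert_eq!(SNAKE_TRUE & SNAKE_FALSE, SNAKE_FALSE);
    assert_eq!(SNAKE_FALSE & SNAKE_FALSE, SNAKE_FALSE);
    // Bitwise OR implements logical or:
    assert_eq!(SNAKE_TRUE | SNAKE_FALSE, SNAKE_TRUE);
    assert_eq!(SNAKE_FALSE | SNAKE_FALSE, SNAKE_FALSE);
    println!("bitwise and/or implement logical and/or");
}
```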
Logical not is slightly trickier. We could use not to invert the MSB, but that would invert all the bits of the value, including the tag bit. What we need instead is an operation that flips exactly the bits we choose and leaves the rest alone:

Input bit | Flip? | Output
1         | 1     | 0
0         | 1     | 1
1         | 0     | 1
0         | 0     | 0
This is the truth table that defines the exclusive-or operation, and the
xor
instruction applies this operation bitwise. Suppose we designed a
mask BOOL_MASK
equal to 0x8000000000000000
, whose only active bit is
the MSB. Then xor RAX, BOOL_MASK
will negate only the MSB, and leave
the rest unchanged, as desired:
val   | val-rep            | BOOL_MASK          | val-rep bitwise-XOR BOOL_MASK | not(val)
true  | 0xFFFFFFFFFFFFFFFF | 0x8000000000000000 | 0x7FFFFFFFFFFFFFFF            | false
false | 0x7FFFFFFFFFFFFFFF | 0x8000000000000000 | 0xFFFFFFFFFFFFFFFF            | true
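In Rust, the same one-instruction negation looks like this (not_rep is our own name for the sketch):

```rust
const BOOL_MASK: u64 = 0x8000_0000_0000_0000;
const SNAKE_TRUE: u64 = 0xFFFF_FFFF_FFFF_FFFF;
const SNAKE_FALSE: u64 = 0x7FFF_FFFF_FFFF_FFFF;

// xor with BOOL_MASK flips only the MSB, leaving every other bit alone.
fn not_rep(rep: u64) -> u64 {
    rep ^ BOOL_MASK
}

fn main() {
    assert_eq!(not_rep(SNAKE_TRUE), SNAKE_FALSE);
    assert_eq!(not_rep(SNAKE_FALSE), SNAKE_TRUE);
    assert_eq!(not_rep(not_rep(SNAKE_TRUE)), SNAKE_TRUE); // negation is an involution
    println!("xor with BOOL_MASK negates");
}
```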
Note that unfortunately, we cannot write xor RAX, 0x8000000000000000
.
Due to a limitation of x64 syntax, we cannot specify a 64-bit literal argument
to xor
(or to many other operators, either). Instead, we would have to
mov
that value into a temporary register or a stack slot first, and then
we could xor
RAX
with that temporary value. This infelicity of x64
assembly syntax will come back to bother us later, so it’s useful to point out
now.
2.6 Defining comparisons over our representations
Comparisons are slightly finicky to define well.
Do Now!
What are the kinds of numbers most likely to cause problems? Work through the bitwise representation of those numbers, and try to construct some assembly sequence that would take two (representations of) numbers and produce a (representation of a) boolean that represents their comparison.
Here are two potential strategies, focusing just on the less-than operator:
2.6.1 Using cmp
Most simply, we can just use the cmp operation, which compares its two operands arithmetically. (In fact, the cmp operator simply subtracts one argument from the other, just like sub does, and sets the same overflow, zero, and signedness flags. In contrast to sub, cmp discards the result of the subtraction, and doesn’t overwrite its first argument. This link provides a nice summary of the various conditional jumps available in assembly, as well as the meaning of the flags that are set by comparisons or subtractions.) We then use one of the conditional jump instructions much like compiling an if-expression, and move the appropriate boolean constant into RAX as a result. Assuming we have values in stack slots 1 and 2, we might start with the following assembly:
;;; NOTE: cmp does not allow both arguments to be memory accesses
mov RAX, [RSP - 8*1] ;; Load first value
cmp RAX, [RSP - 8*2] ;; Compare with second value
jl less
greater:
....
jmp done
less:
....
done:
...
and we can refine it to produce the needed boolean. This scaffold looks pretty
similar to the structure for compiling an if
expression, but we don’t
need the full generality of that structure. We can assume the result will be
true
, perform the comparison and jump, and if the comparison fails
we’ll update the result to be false
:
mov RAX, [RSP - 8*1]
cmp RAX, [RSP - 8*2]
;; The comparison has now set the necessary flags,
;; so we can safely clobber RAX:
mov RAX, 0xFFFFFFFFFFFFFFFF ;; Assume the result is true
jl less ;; If this jump falls through,
mov RAX, 0x7FFFFFFFFFFFFFFF ;; then the comparison failed
less:
...
We could also have flipped the assumption, started from false
, and
jumped if the comparison failed:
mov RAX, [RSP - 8*1]
cmp RAX, [RSP - 8*2]
;; The comparison has now set the necessary flags,
;; so we can safely clobber RAX:
mov RAX, 0x7FFFFFFFFFFFFFFF ;; Assume the result is false
jge greater_eq ;; If this (inverted) jump falls through,
mov RAX, 0xFFFFFFFFFFFFFFFF ;; then the comparison succeeded
greater_eq:
...
This is straightforward, but potentially expensive, since conditional jump instructions can stall the processor if the processor “guesses wrong” about whether the branch will be taken or not. We can try to build an optimizer to choose which of these two compilations to pick, based on whether we think the condition is more often true or false, and that might be worthwhile for critical inner loops...
Can we do “better” and avoid needing any conditional jumps?
2.6.2 Being “clever”?
There are arithmetic and bitwise tricks we can play to get the answer correct
in nearly every case. For example, to determine if 6 < 15
, we can
simply compute their difference, or -9
. Because this is a negative
number, its MSB is 1
, which is exactly the defining bit of true
.
We merely need to set all the remaining bits to SNAKE_TRUE
’s bits. If we
chose a representation with all zeros, then and
’ing our answer with
BOOL_MASK
will force all the remaining bits to zeros. If we chose a
representation with all ones, then or
’ing our answer with the complement
of BOOL_MASK
will force all the remaining bits to ones. A similar argument
holds for checking if 15 < 6
, because that subtraction yields a
positive number, whose MSB is zero, which is the defining bit for
false
. So for example, we might try the following for 6 < 15
:
mov RAX, 12 ;; Representation of 6
sub RAX, 30 ;; Representation of 15
and RAX, 0x8000000000000000 ;; Mask off all the irrelevant bits,
or RAX, 0x0000000000000001 ;; and make sure to tag it back as a boolean!
Unfortunately, this approach fails for large numbers. Checking if
4611686018427387903 < -4611686018427387904
fails with our representations, because we
represent the former by 0x7ffffffffffffffe
, and the latter by 0x8000000000000000
.
Subtracting the latter from the former yields 0xfffffffffffffffe
, or -2
(which in turn encodes -1
in our representation), whose MSB is 1,
leading to a conclusion that 4611686018427387903 < -4611686018427387904
is true! The
problem is that our subtraction overflowed 63 bits on these large numbers, and
no longer makes sense under our representation. The solution, then, is to use
sar
to shift both operands right by one bit, before doing the
subtraction, so that overflow can’t spill into the sign bit. Using these two
extra shifts affects our compilation: even though we’ve compiled both arguments
to the comparison into immediates, we actually need to preprocess both of them,
which means we need a place to stash them both. Practically speaking, we can
use RAX
for one, and RBX
for the other, so long as nothing else in
our compilation is using RBX
(which at the moment, indeed, nothing is).
This would look like the following. (If we wanted to be a bit more careful with producing exactly our two canonical booleans, whose non-MSB bits are all 1, then in the final step we should or with 0x7fffffffffffffff.)
mov RAX, 0x7ffffffffffffffe ;; Representation of 2^62 - 1
sar RAX, 1 ;; Shift (arithmetic) right by 1 bit
mov RBX, 0x8000000000000000 ;; Representation of -2^62
sar RBX, 1 ;; Shift (arithmetic) right by 1 bit
sub RAX, RBX ;; Produce the difference
and RAX, 0x8000000000000000 ;; Mask off the irrelevant bits
or RAX, 0x0000000000000001 ;; and tag it again as a boolean
(Actually, technically, this approach also fails, but for a far less
interesting reason: again, due to the limitation on 64-bit literal constants in
most assembly instructions. If we were to try to implement the code above
properly, we’d actually have to move the argument to and
into a temporary
register first, and then and
RAX
with that temporary register. The
following command for or
“works”, only because truncating that constant
to its lower 32 bits yields the same value of 1
as the entire 64-bit
constant does.)
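We can replay both the failure and the repair in Rust (wrapping_sub stands in for the processor’s silently-overflowing sub; the function names are ours):

```rust
// Naive trick: subtract and inspect the sign bit of the (possibly wrapped) result.
fn naive_less(ra: i64, rb: i64) -> bool {
    ra.wrapping_sub(rb) < 0
}

// Repair: arithmetically shift both operands first, so the subtraction can't overflow.
fn fixed_less(ra: i64, rb: i64) -> bool {
    (ra >> 1) - (rb >> 1) < 0
}

fn main() {
    let big = 4611686018427387903i64 << 1;    // rep of 2^62 - 1: 0x7ffffffffffffffe
    let small = -4611686018427387904i64 << 1; // rep of -2^62:    0x8000000000000000
    assert!(naive_less(big, small));          // wrong: overflow flipped the sign bit
    assert!(!fixed_less(big, small));         // right: 2^62 - 1 is not < -2^62
    assert!(fixed_less(small, big));          // and the true ordering holds
    println!("shift before subtracting");
}
```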
2.6.3 So why does cmp
work?
Internally, cmp
really does perform a subtraction on its two arguments,
but as we just showed above, that doesn’t quite work as smoothly as we’d like!
Looking carefully at the failure above, we see that when an overflow occurs,
our sign bit is exactly wrong. Reading the documentation for cmp
and
sub
, it turns out they set a few flag bits indicating whether an
operation has overflowed, is negative, is zero, etc. The meaning of jl, subtly, is to check whether either the operation overflowed and the result looks non-negative, or the operation didn’t overflow and the result is negative. In other words, it checks whether the overflow flag is not equal to the sign flag. Since we don’t have a trivial way of implementing that check via and, or, or xor, our simple code above failed.
There exist techniques to copy the flags register into an actual register (or onto the stack) so that we can manipulate it further, but a detailed analysis of the cycle counts for conditional jumps versus pushing and popping the flags register shows that the conditional jump is both simpler and faster, uses fewer registers, and works in all cases.
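The SF-not-equal-OF rule is easy to simulate in Rust using overflowing_sub, which reports both the wrapped difference (whose sign plays the role of SF) and whether overflow occurred (OF); jl_taken is our own illustrative name:

```rust
// jl is taken exactly when the sign flag differs from the overflow flag.
fn jl_taken(a: i64, b: i64) -> bool {
    let (diff, overflowed) = a.overflowing_sub(b);
    (diff < 0) != overflowed
}

fn main() {
    assert!(jl_taken(-9, 3));               // ordinary case: SF=1, OF=0, taken
    assert!(!jl_taken(i64::MAX, i64::MIN)); // overflow: SF=1, OF=1, not taken
    assert!(jl_taken(i64::MIN, i64::MAX));  // overflow: SF=0, OF=1, taken
    assert_eq!(jl_taken(5, 7), 5 < 7);      // agrees with < in every case
    println!("jl == (SF != OF)");
}
```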
Do Now!
Complete this analysis for other comparison operations, being careful about overflow in all cases.
3 Overflow and Bogus Arguments
Unfortunately, our code currently does not prohibit bogus programs, such as
5 + true
. Indeed, we don’t even have a mechanism yet for enforcing such
things. Additionally, while our compilation of constants prohibits
compile-time constants that overflow 63 bits, nothing prohibits us from adding
numbers together until they overflow. Checking for such broken behaviors is the
focus of next lecture.
4 Testing
As we’ve now added a new type of value to our language, it’s crucial to test thoroughly that we properly process and render each type of value consistently: that is, any operations that produce booleans should always produce valid boolean representations, and similarly for numeric operations. Moreover, operations should process their operands correctly. (We don’t yet have to worry about operations handling operands of the wrong types, as noted above.) Catching inconsistencies at this stage is substantially easier than it will be later, since we haven’t added too many new expression forms to the language.