Representing information as bits
Bit-level manipulations
Integers
Representation: unsigned and signed
Conversion, casting
Expanding, truncating
Addition, negation, multiplication, shifting
Summary
Representations in memory, pointers, strings
Each bit is 0 or 1
By encoding/interpreting sets of bits in various ways
Computers determine what to do (instructions)
… and represent and manipulate numbers, sets, strings, etc.
Why bits? Electronic implementation
Easy to store with bistable elements
Reliably transmitted on noisy and inaccurate wires
Base 2 number representation
Represent \(15213_{10}\) as \(11101101101101_{2}\)
Represent \(1.20_{10}\) as \(1.0011001100110011[0011] \ldots_{2}\)
Represent \(1.5213 \times 10^4\) as \(1.1101101101101_{2} \times 2^{13}\)
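As a small illustration of the base 2 integer representation above, the following sketch reads a string of binary digits and accumulates its value. The function name `parse_binary` is a hypothetical helper, not part of any standard library.

```c
/* Hypothetical helper: interpret a string of '0'/'1' characters as an
   unsigned base 2 number, most significant bit first. Parsing stops at
   the first character that is not a binary digit. */
unsigned parse_binary(const char *s) {
    unsigned value = 0;
    for (; *s == '0' || *s == '1'; s++) {
        /* Shift the digits seen so far one place left, then add the new bit */
        value = value * 2 + (unsigned)(*s - '0');
    }
    return value;
}
```

For instance, `parse_binary("11101101101101")` yields 15213, matching the first example above.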
Byte = 8 bits
Binary: \(00000000_{2}\) to \(11111111_{2}\)
Decimal: \(0_{10}\) to \(255_{10}\)
Hexadecimal: \(00_{16}\) to \(FF_{16}\)
Base 16 number representation
Use characters ‘0’ to ‘9’ and ‘A’ to ‘F’
Typically written in most programming languages with the prefix 0x
Hex | Decimal | Binary |
---|---|---|
0 | 0 | 0000 |
1 | 1 | 0001 |
2 | 2 | 0010 |
3 | 3 | 0011 |
4 | 4 | 0100 |
5 | 5 | 0101 |
6 | 6 | 0110 |
7 | 7 | 0111 |
8 | 8 | 1000 |
9 | 9 | 1001 |
A | 10 | 1010 |
B | 11 | 1011 |
C | 12 | 1100 |
D | 13 | 1101 |
E | 14 | 1110 |
F | 15 | 1111 |
C Data Type (sizes in bytes) | Typical 32-bit | Typical 64-bit | x86-64 |
---|---|---|---|
char | 1 | 1 | 1 |
short | 2 | 2 | 2 |
int | 4 | 4 | 4 |
long | 4 | 8 | 8 |
float | 4 | 4 | 4 |
double | 8 | 8 | 8 |
pointer | 4 | 8 | 8 |
Algebraic representation of logic
Encode “true” as 1 and “false” as 0
Developed by George Boole in the 19th Century
Operations
and (&): a & b = 1 when both a = 1 and b = 1
or (|): a | b = 1 when either a = 1 or b = 1 (or both)
not (~): ~a = 1 when a = 0
xor (^): a ^ b = 1 when either a = 1 or b = 1, but not both
Operate on Bit Vectors
Example: \[\begin{align*} & 01101001\\ \texttt{\&} \; & 01010101\\ \hline & 01000001\\ \end{align*}\]
All of the properties of Boolean algebra apply
Representation
Width \(w\) bit vector represents subsets of \(\{0, \ldots, w-1\}\)
\(a_j = 1\) if \(j \in A\)
Operations
&: intersection
|: union
^: symmetric difference
~: complement
Examples with \(w = 8\)
\(x = 01101001 = \{0, 3, 5, 6\}\)
\(y = 01010101 = \{0, 2, 4, 6\}\)
\(x \; \texttt{\&} \; y = 01000001 = \{0, 6\}\)
\(x \; \texttt{|} \; y = 01111101 = \{0, 2, 3, 4, 5, 6\}\)
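The set interpretation can be sketched in C with one function per operation. The function names below are hypothetical and chosen only for illustration; each one is a thin wrapper around a single bitwise operator on an 8-bit vector.

```c
#include <stdint.h>

/* Hypothetical helpers treating an 8-bit vector as a subset of {0,...,7}:
   bit j is 1 exactly when j is a member of the set. */
uint8_t set_intersection(uint8_t a, uint8_t b) { return (uint8_t)(a & b); }
uint8_t set_union(uint8_t a, uint8_t b) { return (uint8_t)(a | b); }
uint8_t set_symmetric_difference(uint8_t a, uint8_t b) { return (uint8_t)(a ^ b); }

/* The cast discards the upper bits produced by integer promotion of ~a */
uint8_t set_complement(uint8_t a) { return (uint8_t)~a; }

/* Returns 1 if j (0..7) is in the set, 0 otherwise */
int set_contains(uint8_t a, int j) { return (a >> j) & 1; }
```

With x = 0x69 and y = 0x55 (the bit vectors 01101001 and 01010101), `set_intersection(x, y)` gives 0x41 and `set_union(x, y)` gives 0x7D, matching the examples above.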
The operations &, |, ~, and ^ are available in C
apply to any “integral” data type: long, int, short, char, unsigned
arguments are viewed as bit vectors
operations are applied bitwise
Examples with char type
~0x41 \(\rightarrow\) 0xBE
~0x00 \(\rightarrow\) 0xFF
0x69 & 0x55 \(\rightarrow\) 0x41
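One detail worth noting when trying these examples: C promotes a char operand to int before applying `~`, so the 8-bit results shown only appear after the value is converted back to a char type. The helper name `complement8` below is hypothetical; this is a minimal sketch of that conversion.

```c
/* Hypothetical helper: bitwise complement of an 8-bit value.
   The operand is promoted to int before ~ is applied, so the result is
   cast back to unsigned char to keep only the low 8 bits. */
unsigned char complement8(unsigned char x) {
    return (unsigned char)~x;
}
```

Without the cast, the expression `~0x41` has type int and is not equal to 0xBE, because the upper bits of the int are also inverted.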
The logical operations in C are &&, ||, and !
zero is viewed as “false”
any non-zero value is viewed as “true”
always return 0 or 1
short-circuit evaluation
Examples with char data type
!0x41 \(\rightarrow\) 0x00
!0x00 \(\rightarrow\) 0x01
0x42 && 0x55 \(\rightarrow\) 0x01
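Short-circuit evaluation is what makes a common null-pointer guard safe. The sketch below uses a hypothetical function name, `points_to_nonzero`, to show the idiom: the right operand of `&&` is evaluated only when the left operand is non-zero.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if p points to a non-zero int, else 0.
   Because && short-circuits, *p is never evaluated when p is NULL. */
int points_to_nonzero(const int *p) {
    return p && *p;
}
```

The contrast with the bitwise operator is visible on the slide's operands: `0x42 && 0x55` is 1, whereas `0x42 & 0x55` is 0x40.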
Left shift: x << y
shift bit vector x left y positions
fill with zeros on the right
Right shift: x >> y
shift bit vector x right y positions
logical shift: fill with zeros on the left
arithmetic shift: replicate most significant bit on the left
Undefined behavior: shift amount less than zero or greater than or equal to the bit vector length
x = 01100010
x << 3 = 00010000
logical: x >> 2 = 00011000
arithmetic: x >> 2 = 00011000
x = 10100010
x << 3 = 00010000
logical: x >> 2 = 00101000
arithmetic: x >> 2 = 11101000
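The two kinds of right shift can be reproduced on 8-bit vectors using only unsigned operations, which avoids depending on how a particular compiler shifts negative signed values. The function names are hypothetical, and the sketch assumes a shift amount k between 0 and 7.

```c
#include <stdint.h>

/* Left shift of an 8-bit vector; bits shifted past bit 7 are discarded */
uint8_t shl8(uint8_t x, unsigned k) {
    return (uint8_t)(x << k);
}

/* Logical right shift: zeros enter from the left */
uint8_t shr_logical8(uint8_t x, unsigned k) {
    return (uint8_t)(x >> k);
}

/* Arithmetic right shift: copies of the most significant bit enter from
   the left. Built from unsigned operations only. */
uint8_t shr_arithmetic8(uint8_t x, unsigned k) {
    uint8_t shifted = (uint8_t)(x >> k);
    uint8_t fill = (uint8_t)~(0xFFu >> k);   /* top k bits set to 1 */
    return (x & 0x80u) ? (uint8_t)(shifted | fill) : shifted;
}
```

For x = 0xA2 (10100010), `shr_logical8(x, 2)` gives 0x28 (00101000) and `shr_arithmetic8(x, 2)` gives 0xE8 (11101000), as in the second example above.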
Unsigned
\[B2U(x) = \sum_{i=0}^{w-1} x_i \cdot 2^i\]
where \(x\) is the bit vector and \(w\) is the length of the bit vector
Signed: two’s complement
\[B2T(x) = -x_{w-1} \cdot 2^{w-1} + \sum_{i=0}^{w-2} x_i \cdot 2^i\]
where \(x\) is the bit vector, \(w\) is the length of the bit vector, and \(x_{w-1}\) is the sign bit
bits | unsigned | two’s complement |
---|---|---|
000 | (0+0+0) = 0 | (0+0+0) = 0 |
001 | (0+0+1) = 1 | (0+0+1) = 1 |
010 | (0+2+0) = 2 | (0+2+0) = 2 |
011 | (0+2+1) = 3 | (0+2+1) = 3 |
100 | (4+0+0) = 4 | (-4+0+0) = -4 |
101 | (4+0+1) = 5 | (-4+0+1) = -3 |
110 | (4+2+0) = 6 | (-4+2+0) = -2 |
111 | (4+2+1) = 7 | (-4+2+1) = -1 |
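The two interpretations can be computed directly from their defining sums. The following is a minimal sketch; the function names `b2u` and `b2t` mirror the notation above but are otherwise hypothetical, and the code assumes the bit vector is held in the low w bits of a 32-bit word with 1 ≤ w ≤ 32.

```c
#include <stdint.h>

/* Unsigned interpretation: sum of x_i * 2^i for i = 0 .. w-1 */
uint64_t b2u(uint32_t bits, int w) {
    uint64_t value = 0;
    for (int i = 0; i < w; i++) {
        uint64_t x_i = (bits >> i) & 1u;
        value += x_i << i;
    }
    return value;
}

/* Two's complement interpretation: the top bit has weight -2^(w-1),
   all lower bits keep their positive weights. */
int64_t b2t(uint32_t bits, int w) {
    int64_t value = 0;
    for (int i = 0; i < w - 1; i++) {
        int64_t x_i = (bits >> i) & 1u;
        value += x_i << i;
    }
    int64_t sign_bit = (bits >> (w - 1)) & 1u;
    return value - sign_bit * ((int64_t)1 << (w - 1));
}
```

With w = 3, the bit pattern 101 gives `b2u(0x5, 3)` equal to 5 and `b2t(0x5, 3)` equal to -3, matching the table.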
Unsigned values
min = 0
max = \(2^{w} - 1\)
Two’s complement values
min = \(-2^{w-1}\)
max = \(2^{w-1} - 1\)
Values where \(w = 16\)
value | decimal | hex | binary |
---|---|---|---|
unsigned max | 65535 | FF FF | 11111111 11111111 |
signed max | 32767 | 7F FF | 01111111 11111111 |
signed min | -32768 | 80 00 | 10000000 00000000 |
-1 | -1 | FF FF | 11111111 11111111 |
0 | 0 | 00 00 | 00000000 00000000 |
Equivalence
Uniqueness
Every bit pattern represents a unique integer value
Each representable integer has a unique bit encoding
Can invert mappings
unsigned bit pattern = \(U2B(x) = B2U^{-1}(x)\)
two’s complement bit pattern = \(T2B(x) = B2T^{-1}(x)\)
Mappings between unsigned and two’s complement numbers: keep the bit representation and reinterpret.
Two’s complement to unsigned: \(T2U = B2U \circ T2B\), that is, \(T2U(x) = B2U(T2B(x))\)
Unsigned to two’s complement: \(U2T = B2T \circ U2B\), that is, \(U2T(x) = B2T(U2B(x))\)
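For 16-bit values, the two mappings can be sketched as follows. The function names are hypothetical. The signed-to-unsigned direction is an ordinary cast, which C defines as reduction modulo \(2^{16}\); the other direction is written out arithmetically here, as an assumption-free alternative to a cast whose result for out-of-range values has historically been implementation-defined.

```c
#include <stdint.h>

/* T2U for w = 16: same bits, reinterpreted as unsigned.
   Negative values have 2^16 added to them. */
uint16_t t2u16(int16_t x) {
    return (uint16_t)x;
}

/* U2T for w = 16: same bits, reinterpreted as two's complement.
   Values of 2^15 or more have 2^16 subtracted from them. */
int16_t u2t16(uint16_t u) {
    int32_t v = u;
    if (v >= 32768) {
        v -= 65536;
    }
    return (int16_t)v;
}
```

For example, `t2u16(-1)` is 65535 and `u2t16(32768)` is -32768; values in the range 0 to 32767 are unchanged in both directions.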
Constants
By default are considered to be signed integers
Unsigned if the suffix is “U”, for example 42U
Casting
Explicit casting between signed and unsigned same as \(U2T\) and \(T2U\)
Implicit casting also occurs via assignments and procedure calls
Expression evaluation
If there is a mix of unsigned and signed integers in a single expression, then signed values are implicitly cast to unsigned values.
Including comparison operations: <, >, ==, <=, >=
Examples
Operand 1 | Operand 2 | Relation | Evaluation |
---|---|---|---|
0 | 0U | == | unsigned |
-1 | 0 | < | signed |
-1 | 0U | > | unsigned |
-1 | -2 | > | signed |
Easy to make mistakes
Example 1
unsigned i;
/* Bug: i is unsigned, so i >= 0 is always true and the loop never terminates */
for (i = cnt-2; i >= 0; i--)
    a[i] += a[i+1];
Example 2
#define DELTA sizeof(int)
int i;
/* Bug: sizeof yields an unsigned value, so i-DELTA is unsigned and never negative */
for (i = CNT; i-DELTA >= 0; i -= DELTA)
    ...
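One way to repair the first loop while keeping an unsigned index is to test the index before decrementing it, so that it is never compared against zero after wrapping around. This is a sketch under the assumption that the loop is meant to accumulate suffix sums over an array of `cnt` ints; the function name `suffix_sums` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical corrected version of Example 1: after the call,
   a[i] holds the sum of the original a[i], a[i+1], ..., a[cnt-1]. */
void suffix_sums(int *a, size_t cnt) {
    if (cnt < 2) {
        return;   /* nothing to accumulate, and cnt - 1 must not wrap */
    }
    /* i runs over cnt-2, cnt-3, ..., 0. The test i-- > 0 checks the old
       value of i and then decrements, so the body never sees a wrapped index. */
    for (size_t i = cnt - 1; i-- > 0; ) {
        a[i] += a[i + 1];
    }
}
```

Using a signed index would also work for arrays small enough to index with an int; the version above avoids mixing signed and unsigned values entirely.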
Bit pattern is maintained, but reinterpreted
Can have unexpected effects: adding or subtracting \(2^w\)
An expression containing signed and unsigned ints implicitly casts the signed ints to unsigned ints
Task
Given \(w\)-bit signed integer \(x\)
Convert it to \(w+k\) bit integer \(x'\) with the same value
Rule
Make \(k\) copies of the sign bit:
\(x' = x_{w-1}, \ldots, x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_{0}\)
C automatically performs sign extension
Example of sign extensions from \(w=3\) to \(w=4\)
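The rule of copying the sign bit can be sketched for an arbitrary source width. The helper name `sign_extend` is hypothetical, and the code assumes the w-bit value sits in the low bits of a 32-bit word with 1 ≤ w ≤ 31. Rather than copying bits one at a time, it uses the equivalent arithmetic view: a set sign bit means the value is \(2^w\) less than its unsigned reading.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low w bits of x to a 32-bit
   two's complement integer with the same value. */
int32_t sign_extend(uint32_t x, int w) {
    int64_t v = x & ((UINT32_C(1) << w) - 1);   /* keep only the low w bits */
    if (v & (INT64_C(1) << (w - 1))) {
        v -= INT64_C(1) << w;                   /* sign bit set: subtract 2^w */
    }
    return (int32_t)v;
}
```

For w = 3, the pattern 101 (value -3) extends to a 32-bit integer that still equals -3, while 011 stays 3.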
Task:
Given \((w+k)\)-bit signed or unsigned integer \(x\)
Convert it to \(w\)-bit integer \(x'\) with the same value for “small enough” \(x\)
Rule:
Drop top \(k\) bits:
\(x' = x_{w-1}, x_{w-2}, \ldots, x_0\)
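Truncation from 32 to 16 bits can be sketched as below. The function names are hypothetical. The unsigned case is a plain cast, which C defines as reduction modulo \(2^{16}\); the signed case keeps the same low 16 bits and then reinterprets them as two's complement, written out arithmetically so the sketch does not rely on how a compiler converts out-of-range values.

```c
#include <stdint.h>

/* Unsigned truncation: keep the low 16 bits, i.e. x mod 2^16 */
uint16_t truncate_to_u16(uint32_t x) {
    return (uint16_t)x;
}

/* Signed truncation: keep the low 16 bits, then reinterpret them
   as a 16-bit two's complement value. */
int16_t truncate_to_s16(int32_t x) {
    int32_t v = (uint16_t)x;   /* low 16 bits as a value in 0..65535 */
    if (v >= 32768) {
        v -= 65536;
    }
    return (int16_t)v;
}
```

A value that fits in 16 bits survives unchanged, for example `truncate_to_s16(-12345)` is -12345, whereas 53191 does not fit in a signed 16-bit integer and truncates to -12345.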
Expanding (e.g. short to int)
Unsigned: zeros added
Signed: sign extension
Both yield expected result
Truncating (e.g. int to short)
Unsigned/signed: bits are truncated
Result is reinterpreted
Unsigned: modulus operation
Signed: similar to modulus
For small (in magnitude) numbers yields expected behavior
\(UAdd_{w}(u, v)\)
Operands: \(w\) bits
True sum: \(w+1\) bits
Discard carry: \(w\) bits
Standard addition function ignores carry output
Implements modular arithmetic
\[s = UAdd_w(u, v) = (u + v) \bmod 2^w\]
\(Add_4(u, v)\)
\(UAdd_4(u, v)\)
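A w-bit unsigned adder can be sketched by masking a wider sum. The function names are hypothetical, and the code assumes 1 ≤ w ≤ 31 and that both operands already fit in w bits. The overflow check relies on the fact that a wrapped sum is always smaller than either operand.

```c
#include <stdint.h>

/* w-bit unsigned addition: the carry out of bit w-1 is discarded */
uint32_t uadd(uint32_t u, uint32_t v, int w) {
    uint32_t mask = (UINT32_C(1) << w) - 1;
    return (u + v) & mask;
}

/* Returns 1 if the true sum u + v does not fit in w bits, else 0 */
int uadd_overflows(uint32_t u, uint32_t v, int w) {
    return uadd(u, v, w) < u;
}
```

With w = 4, `uadd(9, 12, 4)` is 5, because the true sum 21 is reduced modulo 16, and `uadd_overflows(9, 12, 4)` reports the lost carry.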
\(TAdd_{w}(u, v)\)
Operands: \(w\) bits
True sum: \(w+1\) bits
Discard carry: \(w\) bits
\(TAdd\) and \(UAdd\) have identical bit-level behavior
True add requires \(w+1\) bits; drop off the most significant bit and interpret as 2’s complement integer
\(TAdd_{4}(u, v)\)
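Two's complement addition can be sketched by performing the same masked sum as the unsigned case and then reinterpreting the result. The function name `tadd` is hypothetical; the code assumes 2 ≤ w ≤ 30 and operands within the w-bit signed range.

```c
#include <stdint.h>

/* w-bit two's complement addition: add the bit patterns as unsigned
   values, drop the carry, and read the low w bits as a signed number. */
int32_t tadd(int32_t u, int32_t v, int w) {
    uint32_t mask = (UINT32_C(1) << w) - 1;
    uint32_t bits = ((uint32_t)u + (uint32_t)v) & mask;   /* same sum as UAdd */
    int32_t value = (int32_t)bits;
    if (bits >> (w - 1)) {
        value -= (int32_t)(UINT32_C(1) << w);   /* sign bit set: subtract 2^w */
    }
    return value;
}
```

With w = 4, `tadd(5, 5, 4)` is -6 (positive overflow) and `tadd(-8, -1, 4)` is 7 (negative overflow), while `tadd(3, -5, 4)` is the expected -2.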
Problem: the exact product of \(w\)-bit numbers \(u, v\) might have a result that exceeds \(w\) bits.
Unsigned: up to \(2w\) bits
Two’s complement min (negative): up to \(2w-1\) bits
Two’s complement max (positive): up to \(2w\) bits
Maintaining exact results
would need to keep expanding word size with each product computed
is done in software if needed
\(UMul_{w}(u, v)\)
Operands: \(w\) bits
True product: \(2w\) bits
Discard \(w\) bits: \(w\) bits
Implements modular arithmetic
\[s = UMul_w(u, v) = (u \cdot v) \bmod 2^w\]
\(TMul_{w}(u, v)\)
Operands: \(w\) bits
True product: \(2w\) bits
Discard \(w\) bits: \(w\) bits
Ignores high order \(w\) bits, some of which are different for signed vs. unsigned multiplication
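The claim that the low-order w bits agree for signed and unsigned multiplication can be checked with an 8-bit sketch. The function names are hypothetical. The signed version reuses the unsigned product and only reinterprets its low 8 bits.

```c
#include <stdint.h>

/* 8-bit unsigned multiplication: true product reduced modulo 2^8 */
uint8_t umul8(uint8_t u, uint8_t v) {
    return (uint8_t)((unsigned)u * (unsigned)v);
}

/* 8-bit two's complement multiplication: same low 8 bits as umul8,
   reinterpreted as a signed value. */
int8_t tmul8(int8_t u, int8_t v) {
    int low = umul8((uint8_t)u, (uint8_t)v);   /* value in 0..255 */
    if (low >= 128) {
        low -= 256;
    }
    return (int8_t)low;
}
```

The bit pattern 0xC8 reads as 200 unsigned and as -56 signed. Multiplying either by 3 leaves the same low byte, 0x58: `umul8(200, 3)` and `tmul8(-56, 3)` both give 88.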
Operation u << k
Gives \(u \cdot 2^k\) for both signed and unsigned
Operands: \(w\) bits
True product: \(w+k\) bits
Discard \(k\) bits: \(w\) bits
Operation u >> k (unsigned u)
Gives
\[\bigg\lfloor \frac{u}{2^k} \bigg\rfloor\]
Uses logical shift
Operation u >> k (signed u)
Gives
\[\bigg\lfloor \frac{u}{2^k} \bigg\rfloor\]
Uses arithmetic shift
Rounds wrong direction when \(u < 0\)
Quotient of negative number by power of 2
Want
\[\bigg\lceil \frac{u}{2^k} \bigg\rceil\]
Compute as
\[\bigg\lfloor \frac{u+2^k-1}{2^k} \bigg\rfloor\]
In C: (u + (1<<k) - 1) >> k
Biases dividend toward 0
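The biasing trick can be sketched as a function that divides by \(2^k\) and rounds toward zero for both signs. The name `div_pow2` is hypothetical. The sketch assumes that `>>` on a negative int is an arithmetic shift, which the C standard leaves implementation-defined but which common compilers provide, and that 0 ≤ k < 31.

```c
/* Hypothetical helper: x / 2^k rounded toward zero, using a shift.
   Assumes >> on a negative int is an arithmetic shift. */
int div_pow2(int x, int k) {
    /* Only negative dividends need the bias of 2^k - 1 */
    int bias = (x < 0) ? (1 << k) - 1 : 0;
    return (x + bias) >> k;
}
```

For example, -15213 divided by 16 is about -950.8. A plain arithmetic shift rounds down to -951, whereas `div_pow2(-15213, 4)` gives -950, the same result as the C expression `-15213 / 16`.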
Negate through complement and increment
~x + 1 = -x
Examples
Value | x | ~x | ~x+1 | Result |
---|---|---|---|---|
15213 | 3B6D | C492 | C493 | -15213 |
0 | 0000 | FFFF | 0000 | 0 |
TMin | 8000 | 7FFF | 8000 | TMin |
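The complement-and-increment rule can be tried on 16-bit patterns directly. The function name `negate_bits16` is hypothetical; the sketch works on unsigned values so that every step is well defined, and the result is the bit pattern of the negated two's complement number.

```c
#include <stdint.h>

/* Hypothetical helper: returns the 16-bit pattern of -x, computed as
   ~x + 1 and truncated back to 16 bits. */
uint16_t negate_bits16(uint16_t x) {
    return (uint16_t)(~(uint32_t)x + 1u);
}
```

The three rows of the table correspond to `negate_bits16(0x3B6D)` giving 0xC493, `negate_bits16(0x0000)` giving 0x0000, and `negate_bits16(0x8000)` giving 0x8000, which shows that TMin is its own negation.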
Addition
Unsigned/signed: normal addition followed by truncate
Unsigned: addition mod \(2^w\)
Signed: modified addition mod \(2^w\) (result in proper range)
Multiplication
Unsigned/signed: normal multiplication followed by truncate
Unsigned: multiplication mod \(2^w\)
Signed: modified multiplication mod \(2^w\) (result in proper range)
Programs refer to data by address
Conceptually envision it as a very large array of bytes
An address is like an index into that array, and a pointer variable stores an address
Note: system provides private address space to each “process”
Think of a process as a program being executed
So, a program can clobber its own data, but not that of others
Any given computer has a “word size”
Until recently, most machines used 32 bits (4 bytes) as a word size
Increasingly, machines have 64 bit word size
Machines still support multiple data formats
Fractions or multiples of word size
Always integral number of bytes
Addresses specify byte locations
Address of first byte in word
Addresses of successive words differ by 4 (32 bit) or 8 (64 bit)
How are the bytes within a multi-byte word ordered in memory?
Conventions
Big endian: least significant byte has highest address
Little endian: least significant byte has lowest address
Example: 4-byte value of 0x01234567
Big endian: 01 23 45 67
Little endian: 67 45 23 01
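The two conventions can be sketched in C as functions that lay out the bytes of a 32-bit value explicitly, plus a small check of which convention the host machine uses. All function names here are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Store x with the most significant byte at the lowest address */
void store_big_endian32(uint32_t x, unsigned char out[4]) {
    out[0] = (unsigned char)(x >> 24);
    out[1] = (unsigned char)(x >> 16);
    out[2] = (unsigned char)(x >> 8);
    out[3] = (unsigned char)x;
}

/* Store x with the least significant byte at the lowest address */
void store_little_endian32(uint32_t x, unsigned char out[4]) {
    out[0] = (unsigned char)x;
    out[1] = (unsigned char)(x >> 8);
    out[2] = (unsigned char)(x >> 16);
    out[3] = (unsigned char)(x >> 24);
}

/* Returns 1 if the host stores the least significant byte first */
int is_little_endian(void) {
    uint32_t x = 0x01234567;
    unsigned char bytes[sizeof x];
    memcpy(bytes, &x, sizeof x);   /* copy the in-memory representation */
    return bytes[0] == 0x67;
}
```

The shift-based functions produce the same byte sequence on any machine, which is why this style is often used when a fixed byte order is needed, for example in file formats or network protocols.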
Code to print byte representation of data
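#include <stdio.h>

typedef unsigned char *pointer;

/* Print the address and value of each of the len bytes beginning at start */
void show_bytes(pointer start, size_t len) {
    size_t i;
    for (i = 0; i < len; i++) {
        /* %p expects a void * argument, so cast explicitly */
        printf("%p\t0x%.2x\n", (void *)(start+i), start[i]);
    }
    printf("\n");
}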
Strings in C
Represented by an array of characters
Each character is encoded in ASCII format
Strings should be null terminated (final character = 0)
Compatibility
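The null-termination convention can be illustrated with a hand-written length function that walks the character array until it reaches the terminating zero byte. The name `string_length` is hypothetical; the standard library's `strlen` behaves the same way.

```c
#include <stddef.h>

/* Hypothetical helper: count the characters before the terminating 0 byte */
size_t string_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

For the string "15213", `string_length` returns 5, while `sizeof("15213")` is 6 because the array also holds the terminator. On an ASCII system the first byte of that array is 0x31, the code for the character '1'.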
Disassembly
Text representation of binary machine code
Generated by program that reads the machine code
Example Fragment
Address | Instruction code | Assembly rendition |
---|---|---|
8048365: | 5b | pop %ebx |
8048366: | 81 c3 ab 12 00 00 | add $0x12ab,%ebx |
804836c: | 83 bb 28 00 00 00 00 | cmpl $0x0,0x28(%ebx) |