The Memory Hierarchy

CSC 235 - Computer Organization

References

  • Slides adapted from CMU

Outline

  • The memory abstraction
  • RAM: main memory
  • Locality of reference
  • The memory hierarchy
  • Storage technologies and trends

Writing and Reading Memory

  • Write
    • Transfer data from CPU to memory
    • Example: movq %rax, 8(%rsp)
    • “Store” operation
  • Read
    • Transfer data from memory to CPU
    • Example: movq 8(%rsp), %rax
    • “Load” operation
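  • A minimal C sketch of the same pair of operations (illustrative function names; the exact instructions a compiler emits will vary with compiler and flags):

    void store_example(long *p, long v) {
        p[1] = v;       /* store: a compiler might emit movq %rsi, 8(%rdi) */
    }

    long load_example(long *p) {
        return p[1];    /* load: a compiler might emit movq 8(%rdi), %rax */
    }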

Traditional Bus Structure Connecting CPU and Memory

  • A bus is a collection of parallel wires that carry address, data, and control signals

  • Buses are typically shared by multiple devices

[Figure: bus structure connecting the CPU and main memory]

Memory Read Transaction (1)

  • Example: movq A, %rax

  • CPU places address A on the memory bus

[Figure: memory bus]

Memory Read Transaction (2)

  • Example: movq A, %rax

  • Main memory reads A from the memory bus, retrieves word x, and places it on the bus

[Figure: memory bus]

Memory Read Transaction (3)

  • Example: movq A, %rax

  • CPU reads word x from the bus and copies it into register %rax

[Figure: memory bus]

Memory Write Transaction (1)

  • Example: movq %rax, A

  • CPU places address A on the memory bus; main memory reads it and waits for the corresponding data word to arrive

[Figure: memory bus]

Memory Write Transaction (2)

  • Example: movq %rax, A

  • CPU places data word y on the bus

[Figure: memory bus]

Memory Write Transaction (3)

  • Example: movq %rax, A

  • Main memory reads data word y from the bus and stores it at address A

[Figure: memory bus]

Random-Access Memory (RAM)

  • Key features
    • RAM is traditionally packaged as a chip or embedded as part of a processor chip
    • Basic storage unit is normally a cell (one bit per cell)
    • Multiple RAM chips form a memory
  • RAM comes in two varieties
    • SRAM (static RAM)
    • DRAM (dynamic RAM)

RAM Technologies

  • DRAM
    • 1 transistor + 1 capacitor per bit
    • Must refresh state periodically
  • SRAM
    • 6 transistors per bit
    • Holds state indefinitely (but will still lose data on power loss)
  • Trends
    • SRAM scales with semiconductor technology
    • DRAM scaling is limited by the need to maintain a minimum storage capacitance per cell

Enhanced DRAMs

  • Operation of DRAM cell has not changed since its invention
    • Commercialized by Intel in 1970
  • DRAM cores with better interface logic and faster I/O:
    • Synchronous DRAM (SDRAM)
      • Uses a conventional clock signal instead of asynchronous control
    • Double data-rate synchronous DRAM (DDR SDRAM)
      • Double edge clocking sends two bits per cycle per pin
      • Different types distinguished by size of small prefetch buffer
        • DDR (2-bit prefetch), DDR2 (4-bit), DDR3 (8-bit), DDR4 (8-bit, organized in bank groups)

Conventional DRAM Organization

  • \(d \times w\) DRAM
    • \(d \cdot w\) total bits organized as \(d\) supercells of size \(w\) bits
    • Example: a \(16 \times 8\) DRAM stores 128 bits as 16 supercells of 8 bits each, typically laid out as a \(4 \times 4\) array addressed by 2-bit row and column indices
[Figure: DRAM array of supercells]

Reading DRAM Supercell (2,1)

  • Step 1(a): Row access strobe (RAS) selects row 2
  • Step 1(b): Row 2 copied from DRAM array to row buffer
[Figure: RAS step: row 2 copied from the DRAM array to the row buffer]

Reading DRAM Supercell (2,1)

  • Step 2(a): Column access strobe (CAS) selects column 1
  • Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to the CPU
  • Step 3: All data written back to row to provide refresh
[Figure: CAS step: supercell (2,1) copied from the row buffer to the data lines]
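
  • A sketch of the addressing math (illustrative names; a real memory controller also manages timing and refresh): the row and column indices share the same address pins, so a supercell address goes out in two steps

    /* Split a linear supercell address into (row, col) for a DRAM
       organized as a rows x cols array of supercells. The row index
       is sent first (RAS), then the column index (CAS), reusing the
       same address pins; multiplexing them reduces the pin count. */
    void split_address(int addr, int cols, int *row, int *col) {
        *row = addr / cols;   /* RAS step selects the row    */
        *col = addr % cols;   /* CAS step selects the column */
    }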

Memory Modules

[Figure: memory module aggregating multiple DRAM chips]

The CPU-Memory Gap

  • The gap widens between DRAM, disk, and CPU speeds
[Figure: the widening CPU-memory gap]

Locality

  • Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

  • Temporal locality:
    • Recently referenced items are likely to be referenced again in the near future
  • Spatial locality:
    • Items with nearby addresses tend to be referenced close together in time

Locality Example

sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}
return sum;
  • Data references
    • References array elements in succession (spatial)
    • Reference variable sum each iteration (temporal)
  • Instruction references
    • Reference instructions in sequence (spatial)
    • Cycle through loop repeatedly (temporal)

Qualitative Estimates of Locality

  • Claim: being able to look at code and get a qualitative sense of its locality is a good skill for a professional programmer

  • Question: Does this function have good locality with respect to array a?

    int sum_array_rows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++) {
            for (j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }
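
  • Answer: yes. C stores arrays in row-major order, so the inner loop touches a[i][0], a[i][1], ... at consecutive addresses (stride 1). For contrast, the standard companion example below swaps the loops, producing stride-N accesses and poor spatial locality:

    int sum_array_cols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                sum += a[i][j];   /* jumps N elements between accesses */
            }
        }
        return sum;
    }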

Locality Example

  • Question: can you permute the loops so that the function scans the 3D array with a stride-1 reference pattern (and thus has good spatial locality)?

    int sum_array_3d(int a[M][N][N]) {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                for (k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }
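
  • One permutation that works (assuming row-major layout): make k, the first subscript, the outer loop and j, the last subscript, the inner loop, so successive iterations touch consecutive addresses:

    int sum_array_3d(int a[M][N][N]) {
        int i, j, k, sum = 0;
        for (k = 0; k < M; k++) {           /* first subscript: slowest */
            for (i = 0; i < N; i++) {
                for (j = 0; j < N; j++) {   /* last subscript: fastest */
                    sum += a[k][i][j];      /* stride-1 scan of the array */
                }
            }
        }
        return sum;
    }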

Memory Hierarchies

  • Some fundamental and enduring properties of hardware and software:
    • Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
    • The gap between CPU and main memory speed is widening
    • Well-written programs tend to exhibit good locality
  • These fundamental properties complement each other beautifully

  • These properties suggest an approach for organizing memory and storage systems known as a memory hierarchy

Example Memory Hierarchy

[Figure: example memory hierarchy]

Caches

  • Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device

  • Fundamental idea of a memory hierarchy:
    • For each \(k\), the faster, smaller device at level \(k\) serves as a cache for the larger, slower device at level \(k+1\)
  • Why do memory hierarchies work?
    • Because of locality, programs tend to access data at level \(k\) more often than they access the data at level \(k+1\)
    • Thus, the storage at level \(k+1\) can be slower, larger, and cheaper per bit
  • Big Idea (Ideal): The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but serves data to programs at the rate of the fast storage near the top

General Cache Concepts

[Figure: general cache concepts]

General Cache Concepts

  • A cache hit is when the data in block \(b\) is needed and is in the cache

  • A cache miss is when the data in block \(b\) is needed but is not in the cache

  • Types of cache misses:
    • Cold (compulsory) miss: occurs because the cache starts empty and this is the first reference to the block
    • Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache
    • Conflict miss: occurs when the level \(k\) cache is large enough, but multiple data objects all map to the same level \(k\) block, because each level \(k+1\) block is restricted to a small subset of the block positions at level \(k\) (see the sketch below)
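
  • A runnable toy model of conflict misses (the 8-line direct-mapped cache and the modulo placement rule are assumptions for illustration): blocks 0 and 8 both map to line 0, so alternating references miss every time, even though the cache is nearly empty

    #include <stdio.h>

    #define NUM_LINES 8                      /* assumed tiny cache */

    int main(void) {
        int cached[NUM_LINES];
        for (int i = 0; i < NUM_LINES; i++)
            cached[i] = -1;                  /* start cold (empty) */

        int refs[] = {0, 8, 0, 8, 0, 8};     /* alternating block numbers */
        for (int i = 0; i < 6; i++) {
            int block = refs[i];
            int line = block % NUM_LINES;    /* direct-mapped placement */
            printf("block %d -> line %d: %s\n", block, line,
                   cached[line] == block ? "hit" : "miss");
            cached[line] = block;            /* fill the line, evicting */
        }
        return 0;
    }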

Storage Technologies

  • Magnetic disks
    • Store on magnetic medium
    • Electromechanical access
  • Nonvolatile (Flash) memory
    • Store as persistent charge
    • Implemented with 3D structure

Disk Geometry

  • Disks consist of platters, each with two surfaces
  • Each surface consists of concentric rings called tracks
  • Each track consists of sectors separated by gaps
[Figure: disk geometry: platters with two surfaces, concentric tracks, and sectors separated by gaps]

Disk Capacity

  • Capacity: maximum number of bits that can be stored
    • Vendors express capacity in units of gigabytes (GB) or terabytes (TB), where 1 GB = \(10^{9}\) Bytes and 1 TB = \(10^{12}\) Bytes
  • Capacity is determined by these technology factors:
    • Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment
    • Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment
    • Areal density (bits/in\({^2}\)): product of recording density and track density

Disk Operation

[Figure: disk operation: the surface spins at a fixed rotational rate while the read/write head seeks across tracks]


Disk Access Time

  • Average time to access some target sector approximated by:
    • \(T_{access} = T_{seek} + T_{rotation} + T_{transfer}\)
  • Seek time
    • Time to position heads over cylinder containing target sector
    • Typical \(T_{seek}\) is 3 to 9 ms
  • Rotational latency
    • Time waiting for the first bit of target sector to pass under read/write head
    • \(T_{rotation} = \frac{1}{2} \cdot \frac{1}{RPM} \cdot \frac{60 \; s}{1 \; min}\)
    • Typical rotational rate is 7,200 RPM
  • Transfer time
    • Time to read the bits in the target sector
    • \(T_{transfer} = \frac{1}{RPM} \cdot \frac{1}{avg \; sectors \; per \; track} \cdot \frac{60 \; s}{1 \; min}\)

Disk Access Time Example

  • Given
    • Rotational rate = 7200 RPM
    • Average seek time = 9 ms
    • Average number of sectors per track = 400
  • Derived:
    • \(T_{rotation} = 4 \; ms\)
    • \(T_{transfer} = 0.02 \; ms\)
    • \(T_{access} = 9 \; ms + 4 \; ms + 0.02 \; ms = 13.02 \; ms\)
  • Important points:
    • Access time is dominated by seek time and rotational latency
    • The first bit in a sector is the most expensive; the rest are free
    • SRAM access time is about 4 ns per double word, DRAM about 60 ns
      • Disk is about 40,000 times slower than SRAM
      • Disk is about 2,500 times slower than DRAM
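
  • A small C check of the arithmetic above (same assumed parameters; the slide rounds \(T_{rotation}\) to 4 ms):

    #include <stdio.h>

    int main(void) {
        double rpm = 7200.0, seek_ms = 9.0, sectors_per_track = 400.0;

        double ms_per_rev = 60.0 * 1000.0 / rpm;             /* ~8.33 ms */
        double t_rotation = 0.5 * ms_per_rev;                /* ~4.17 ms */
        double t_transfer = ms_per_rev / sectors_per_track;  /* ~0.02 ms */
        double t_access   = seek_ms + t_rotation + t_transfer;

        printf("T_rotation = %.2f ms\n", t_rotation);
        printf("T_transfer = %.2f ms\n", t_transfer);
        printf("T_access   = %.2f ms\n", t_access);          /* ~13.19 ms */
        return 0;
    }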

I/O Bus

Reading a Disk Sector (1)

  • CPU initiates disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller

Reading a Disk Sector (2)

  • Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory

Reading a Disk Sector (3)

  • When the DMA transfer completes, the disk controller notifies the CPU with an interrupt

Nonvolatile Memories

  • DRAM and SRAM are volatile memories
    • Lose information if powered off
  • Nonvolatile memories retain value even if powered off
    • Read-only memory (ROM): programmed during production
    • Electrically erasable PROM (EEPROM): electronic erase capability
    • Flash memory: EEPROMs with partial (block-level) erase capability
  • Uses for Nonvolatile Memories
    • Firmware programs stored in a ROM
    • Solid state disks
    • Disk caches

Solid State Disks (SSDs)

[Figure: SSD internal organization: flash chips containing blocks of pages]
  • Pages: 512 B to 4 KB; blocks: 32 to 128 pages
  • Data read/written in units of pages
  • Page can be written only after its block has been erased
  • A block wears out after about 100,000 repeated writes

SSD Tradeoffs versus Rotating Disks

  • Advantages
    • No moving parts
  • Disadvantages
    • Have the potential to wear out
    • More expensive per byte
  • Applications
    • Smartphones, laptops
    • Increasingly common in desktops and servers

Summary

  • The speed gap between CPU, memory and mass storage continues to widen
  • Well-written programs exhibit a property called locality
  • Memory hierarchies based on caching close the gap by exploiting locality
  • Flash memory progress is outpacing all other memory and storage technologies