Memory Hierarchy and Caching
As programs and their data demand increasingly more memory, simply building a single, larger memory unit degrades performance, because larger memories are inherently slower to access. To maintain efficiency, memory systems are structured into multiple levels, each with a specific role, allowing for smarter, faster data management. This approach, known as the memory hierarchy, incorporates distinct storage layers, from small, fast caches close to the CPU to larger, slower main memory and beyond.
In today’s discussion, we’ll explore the concepts of memory hierarchy and caching, examining how these strategies help optimize performance by efficiently loading and storing data across different memory levels.
Elements of The Hierarchy
Most modern computers implement a six-level memory hierarchy to optimize data access and processing speeds. These levels include:
- Registers: Located within the CPU, registers are the fastest and smallest form of memory, typically only a few bytes in size. They store data the CPU is currently processing, such as intermediate results and instruction addresses.
- L1 Cache: This is the primary cache directly connected to the CPU cores. It is the fastest of the cache levels and is usually divided into separate data and instruction caches. L1 caches are small (usually between 32 KB and 128 KB per core), balancing high speed against size.
- L2 Cache: Larger and slightly slower than L1, the L2 cache is typically private to a single core or shared by a small group of cores, with sizes from 256 KB to a few MBs. It helps bridge the gap between the L1 cache and main memory by storing data that L1 misses.
- L3 Cache: This is a shared cache among all cores of a CPU. Although slower and larger than L1 and L2 caches, L3 offers improved hit rates, reducing the need to access main memory. L3 cache size typically ranges from a few MBs to tens of MBs.
- Main Memory (RAM): This is the primary memory where the OS, applications, and active data reside. DRAM is slower than the caches but has higher capacity (usually in gigabytes). RAM serves as the main storage for currently running programs and actively used data that doesn’t fit in cache memory.
- Secondary Storage (SSD/HDD): When the data exceeds RAM capacity, it is stored in secondary storage, such as SSDs (solid-state drives) or HDDs (hard disk drives). While significantly slower than RAM, secondary storage has a large capacity and provides non-volatile storage, meaning data persists even when the power is off.
By efficiently structuring data access across this hierarchy, systems can balance performance, cost, and storage capacity, ensuring that high-speed memory is used for frequently accessed data and more economical storage for infrequently accessed data.
Caching
Over the years, processors have become significantly faster, while memory speeds have struggled to keep pace, creating a substantial gap between CPU and memory performance. Accessing data from main memory can now take hundreds of CPU cycles, far longer than even the most complex CPU operations. To address this, caches have been introduced as a solution.
Caching works by storing recently accessed data in small, high-speed memory units, making it faster to retrieve data the CPU is likely to need soon. This approach leverages two main principles of data access patterns:
- Temporal Locality: Programs often access the same memory location repeatedly within a short timeframe, as seen in loops.
- Spatial Locality: Programs also tend to access memory locations near each other within a given timeframe, commonly seen in array processing.
These locality principles make caching highly effective, allowing data access to keep up with processor demands and reducing the frequency of slow main memory access.
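To make these two principles concrete, here is a minimal C sketch (the array sizes are arbitrary assumptions): the accumulator sum is reused on every iteration (temporal locality), and the row-major loop reads consecutive addresses so each fetched cache line is fully used (spatial locality). The column-major version performs the same arithmetic but strides across memory, and on real hardware it typically runs noticeably slower.

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static int matrix[ROWS][COLS];

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so each cache line brought in from memory is fully used (spatial locality).
 * The accumulator `sum` is reused on every iteration (temporal locality). */
long sum_row_major(void) {
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += matrix[i][j];
    return sum;
}

/* Column-major traversal: consecutive iterations jump COLS * sizeof(int)
 * bytes apart, so most cache lines are only partially used before being
 * evicted -- same arithmetic, far worse locality. */
long sum_col_major(void) {
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += matrix[i][j];
    return sum;
}
```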
Cache Terminologies
In memory hierarchy, memory chunks and cache blocks (also known as cache lines) are key concepts that help in organizing data for efficient storage and retrieval:
A memory chunk refers to a contiguous block of memory in the main memory (RAM) allocated for storing data. Memory chunks can vary in size and are commonly used to hold data structures.
A cache block or cache line is the smallest unit of data that can be transferred between the main memory and the cache.
When data from a larger memory chunk is accessed, it is divided into multiple smaller cache blocks as it is transferred into the cache. The cache system uses these smaller blocks to load and store data closer to the CPU efficiently, ensuring that frequently accessed data can be retrieved quickly without repeatedly accessing the larger memory chunk in RAM.
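As a rough sketch of that division, the snippet below computes how many cache lines a memory chunk spans, assuming a 64-byte line size (a common value, but an assumption here; real line sizes vary by CPU).

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64  /* assumed line size in bytes; varies by CPU */

/* Number of cache lines touched by a chunk starting at `addr` spanning `size` bytes. */
static size_t lines_spanned(uintptr_t addr, size_t size) {
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_SIZE;
    return (size_t)(last_line - first_line + 1);
}

int main(void) {
    /* A hypothetical 1 KB chunk: an array of 256 ints (assuming 4-byte int). */
    static int chunk[256];
    size_t n = lines_spanned((uintptr_t)chunk, sizeof chunk);
    printf("%zu-byte chunk spans %zu cache lines of %d bytes\n",
           sizeof chunk, n, CACHE_LINE_SIZE);
    return 0;
}
```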
A HIT occurs when the processor successfully finds the requested data in the cache, allowing it to avoid a time-consuming trip to main memory. This saves a substantial number of clock cycles, improving performance.
On the other hand, a MISS means that the data is not available in the cache, forcing the processor to fetch it from main memory. This access is much slower, costing additional cycles and reducing efficiency.
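A common way to quantify the cost of hits and misses is the average memory access time, AMAT = hit time + miss rate × miss penalty. The latencies below are illustrative assumptions, not measurements of any particular CPU.

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
 * All latencies in CPU cycles; the values used below are illustrative assumptions. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Hypothetical single-level cache: 4-cycle hit, 200-cycle miss penalty. */
    printf("95%% hit rate: %.1f cycles\n", amat(4.0, 0.05, 200.0)); /* 14.0 */
    printf("99%% hit rate: %.1f cycles\n", amat(4.0, 0.01, 200.0)); /*  6.0 */
    return 0;
}
```

Even with these made-up numbers, the point is visible: a 95% hit rate turns a 200-cycle memory access into an average of 14 cycles, which is why even small caches pay off.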
Cache Block
A cache block generally comprises three types of bits to manage data storage and retrieval efficiently:
- Data Bits: These bits hold the actual data stored in the cache block, typically 32 to 128 bytes representing a contiguous segment of main memory. The exact size depends on the cache architecture.
- Tag Bits: Tags act as unique identifiers mapping each cache block to its specific memory address in main memory. When the processor requests data, these tags help determine if the cache has the requested address, resulting in either a “cache hit” or “cache miss.”
- Status/Control Bits: These bits provide extra details about the cache block’s status, usually including:
- Valid Bit: Indicates if the data within the cache block is valid or not.
- Dirty Bit: Shows whether the data has been modified in the cache but not yet written back to main memory.
- Replacement Policy Bits: Help manage replacement policies, indicating which cache block to replace when the cache is full.
With these components, cache blocks efficiently track data validity, recency, and state, allowing for quick and accurate data retrieval in a structured memory hierarchy.
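As a minimal sketch, a software cache simulator might group these fields roughly as follows (the field widths and the 64-byte block size are assumptions; real hardware stores these bits in SRAM arrays, not in a C struct):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 64  /* assumed block size in bytes */

/* One cache block/line as a cache simulator might represent it. */
typedef struct {
    bool     valid;            /* valid bit: block holds real data              */
    bool     dirty;            /* dirty bit: modified but not yet written back  */
    uint8_t  lru_counter;      /* replacement-policy bits (e.g., an LRU age)    */
    uint64_t tag;              /* tag bits identifying the mapped memory address */
    uint8_t  data[BLOCK_SIZE]; /* the cached bytes themselves                   */
} cache_line_t;
```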
Addressing A Cache Block
To determine the cache block for a given memory address, a typical memory address is divided into three main parts: the tag, index, and offset. These parts help the cache locate and manage data effectively by identifying the specific block where the data resides and the exact location within that block.
- Tag: The tag portion uniquely identifies which block in main memory maps to a particular block in the cache. It is used during cache lookup to verify whether the data in a specific cache block matches the requested memory address.
- Index: The index portion indicates which cache block will store the data. It points to a specific location within the cache by selecting the cache block or set where the data might be found. The number of bits needed for the index depends on the number of blocks or sets in the cache.
- Offset: The offset specifies the exact location of the data within a cache block. Since a cache block typically contains multiple bytes (32, 64, or 128 bytes), the offset allows accessing a specific byte or word within that block. The number of offset bits depends on the block size; for example, if the cache block size is 64 bytes, you need 6 offset bits (since 2⁶ = 64).
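For concreteness, here is how the field widths follow from the cache geometry for a hypothetical cache with 64-byte blocks and 256 sets (both numbers are assumptions chosen for illustration):

```c
#include <stdint.h>

/* Hypothetical cache geometry chosen for illustration. */
#define BLOCK_SIZE 64   /* bytes per block   -> 6 offset bits (2^6 = 64)  */
#define NUM_SETS   256  /* sets in the cache -> 8 index bits  (2^8 = 256) */

#define OFFSET_BITS 6
#define INDEX_BITS  8

/* Split a memory address into its offset, index, and tag fields. */
static uint64_t addr_offset(uint64_t addr) { return addr & (BLOCK_SIZE - 1); }
static uint64_t addr_index(uint64_t addr)  { return (addr >> OFFSET_BITS) & (NUM_SETS - 1); }
static uint64_t addr_tag(uint64_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```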
When a memory address is accessed:
- The index is used to locate the specific cache block or set where the data could be.
- The tag of the address is compared with the tag stored in the cache block to verify if it’s a match. If the tags match and the valid bit is set, it’s a cache hit.
- The offset is then used to access the exact location within the cache block, retrieving the desired data.
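Putting these steps together, a lookup in a simple direct-mapped software cache model might look roughly like this (a sketch using the same assumed geometry as above, not a description of any specific CPU):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    256
#define OFFSET_BITS 6
#define INDEX_BITS  8

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[1 << OFFSET_BITS];
} line_t;

static line_t cache[NUM_SETS];  /* one line per set: direct-mapped */

/* Returns true on a hit and copies the requested byte into *out. */
static bool cache_lookup(uint64_t addr, uint8_t *out) {
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    line_t *line = &cache[index];          /* 1. index selects the cache block   */
    if (line->valid && line->tag == tag) { /* 2. tag match + valid bit -> hit    */
        *out = line->data[offset];         /* 3. offset picks the byte           */
        return true;
    }
    return false;                          /* miss: fetch from memory (not shown) */
}
```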
Chunk Mapping Strategies
Chunk mapping strategies determine how larger memory chunks (typically allocated in main memory) are mapped to cache blocks for efficient access. The goal of these strategies is to leverage the cache effectively, improving data retrieval performance by aligning with common access patterns. There are three types of chunk mapping strategies:
- Direct Mapping: Each memory chunk is assigned to exactly one specific cache block. This is done using a simple modulus operation that determines which cache block will hold the chunk based on its address.
- Fully Associative Mapping: Any memory chunk can be mapped to any cache block. There are no constraints on where a chunk can be stored within the cache, which maximizes flexibility and reduces cache misses due to conflict.
- Set-Associative Mapping: A balance between direct and fully associative mapping. In this strategy, the cache is divided into several sets, and each memory chunk maps to a specific set but can occupy any block within that set.
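As a small illustration, the index computations for these strategies differ only in how many candidate locations an address can map to; the geometry below is a hypothetical example.

```c
#include <stdint.h>

#define BLOCK_SIZE 64   /* assumed block size in bytes               */
#define NUM_BLOCKS 512  /* total cache blocks (assumption)           */
#define NUM_WAYS   4    /* ways per set for the set-associative case */
#define NUM_SETS   (NUM_BLOCKS / NUM_WAYS)

/* Direct mapping: the block number modulo the number of cache blocks
 * picks exactly one possible cache block for this address. */
static uint64_t direct_mapped_index(uint64_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* Set-associative: the same modulus picks a set; the block may then
 * occupy any of the NUM_WAYS lines inside that set. */
static uint64_t set_index(uint64_t addr) {
    return (addr / BLOCK_SIZE) % NUM_SETS;
}

/* Fully associative: there is no index at all -- every block in the cache
 * is a candidate, and only the tag comparison decides where the data is. */
```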
There are many more caching concepts we haven’t covered here, like conflict resolution and prefetching algorithms. To keep this article concise, I’ll stop here, but I plan to dive deeper into these topics in future posts. If you’re interested, follow me to get updates when those articles go live!