Increasing Cache Performance with Prefetching

Enes Harman
5 min read · Nov 17, 2024


Caching is one of the most essential techniques for achieving high performance in modern computers. It enables access to frequently used data within a few cycles, compared to the hundreds of cycles needed to fetch data from main memory. However, the performance benefit depends heavily on the cache hit rate: even a cache that can return data in a single cycle is of little use if the data the program needs is not present at the time of the access. To tackle this issue and improve cache hit rates, prefetching mechanisms are employed.

In this article, I will delve into the types of cache misses and explore prefetching mechanisms designed to minimize these misses.

Cache Miss Types

Before diving into prefetching, let’s first understand why it is necessary to prefetch data. Prefetching involves retrieving data that a program is likely to need in advance, aiming to prevent stalls caused by cache misses. Cache misses can significantly impact program performance, and they generally fall into three categories:

  • Compulsory Miss: These occur on the first reference to a memory block, since the data has never been loaded into the cache before.
  • Capacity Miss: These happen when the cache is too small to store all the data needed by the program, leading to data being evicted and subsequently reloaded.
  • Conflict Miss: These occur when multiple memory addresses compete for the same cache location, despite the cache having sufficient capacity.

Prefetching is particularly useful for avoiding compulsory misses. By definition, the first access to a memory address will result in a miss because the data is not yet in the cache. To mitigate this, prefetching anticipates these accesses and loads the data into the cache ahead of time, ensuring that when the main program flow requires the data, it is already available. This proactive approach minimizes stalls in the main process and enhances overall performance.
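
To make this concrete, here is a minimal sketch of programmer-driven prefetching in C, using the __builtin_prefetch hint provided by GCC and Clang. The prefetch distance of 16 elements is an illustrative assumption; the right value depends on memory latency and on how much work each iteration does.

#include <stddef.h>

/* Minimal sketch: software prefetching in a sequential reduction.
   PREFETCH_DISTANCE is a tuning parameter chosen only for illustration. */
#define PREFETCH_DISTANCE 16  /* elements ahead of the current access */

long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Ask the hardware to start bringing a future element into the
           cache before the loop actually needs it. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        sum += data[i];
    }
    return sum;
}

For a simple sequential scan like this, modern hardware prefetchers usually do the job on their own; explicit hints tend to pay off more for irregular or indirect access patterns.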

Understanding these types of misses highlights the importance of prefetching mechanisms, which aim to reduce their impact and optimize cache utilization.

Prefetching

As mentioned earlier, prefetching involves fetching data before it is actually needed by the program. Its effectiveness lies in hiding the high latency of memory operations: by prefetching data accurately and at the right time, we can significantly reduce the delays caused by waiting for memory accesses.

Prefetching works by predicting which memory addresses will be accessed in the future. Unlike a mispredicted branch, which must be detected and squashed to preserve program correctness, an incorrect prefetch has no impact on the program’s correctness. It simply wastes memory bandwidth and cache space, making prefetching a comparatively safe speculation technique.

Fundamentals of Prefetching

There are several ways to implement prefetching, each leveraging different aspects of the computing stack:

  • Programmer
  • System
  • Hardware
  • Compiler

Before deciding on a prefetching approach, several critical questions must be addressed to design an effective and precise prefetching system:

  1. What address to prefetch?
    Prefetching irrelevant or unused data wastes resources such as bandwidth and energy. To determine what to prefetch, systems often analyze past access patterns, such as sequential or strided memory access, and predict future needs.
  2. When to initiate a prefetch request?
    Timing is crucial. If data is fetched too early, it risks being evicted before it’s used, wasting the effort. Conversely, fetching too late might not fully hide the memory latency, defeating the purpose of prefetching. Striking the right balance ensures optimal performance.
  3. Where to prefetch?
    Prefetching can target different levels of the memory hierarchy, such as L1, L2, or even main memory. Modern systems often include separate prefetchers for different levels, each designed to optimize the data flow specific to its location.
  4. How to prefetch?
    As discussed earlier, prefetching can be implemented through various methods: hardware, compiler, programmer-driven, or system-level prefetching. The choice depends on the application, system constraints, and desired performance gains.

Answering these questions thoughtfully helps in designing prefetching mechanisms that minimize overhead, maximize cache efficiency, and improve overall system performance.
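
To make the first two questions concrete, here is a toy model of a single stride-prefetcher entry in C: it learns the stride of an access stream (what to prefetch) and, once confident, issues prefetches several strides ahead (when, expressed as a prefetch degree). The structure, thresholds, and constants are illustrative assumptions, not a description of any real CPU.

#include <stdint.h>

/* Toy model of one stride-prefetcher entry. */
typedef struct {
    uint64_t last_addr;   /* address of the previous access */
    int64_t  stride;      /* last observed stride */
    int      confidence;  /* how many times the same stride repeated */
} stride_entry_t;

#define CONF_THRESHOLD  2   /* issue prefetches only after the stride repeats */
#define PREFETCH_DEGREE 4   /* how many strides ahead to prefetch */

/* Called on every demand access; writes up to PREFETCH_DEGREE prefetch
   addresses into out[] and returns how many were generated. */
int on_access(stride_entry_t *e, uint64_t addr, uint64_t out[])
{
    int64_t stride = (int64_t)(addr - e->last_addr);
    int n = 0;

    if (stride != 0 && stride == e->stride) {
        if (e->confidence < CONF_THRESHOLD)
            e->confidence++;
    } else {
        e->stride = stride;       /* new pattern: relearn the stride */
        e->confidence = 0;
    }

    if (e->confidence >= CONF_THRESHOLD) {
        /* Question 1 (what): addresses that continue the learned stride.
           Question 2 (when): far enough ahead to hide memory latency, but
           not so far ahead that the lines are evicted before they are used. */
        for (int d = 1; d <= PREFETCH_DEGREE; d++)
            out[n++] = addr + (uint64_t)(d * e->stride);
    }

    e->last_addr = addr;
    return n;
}

Hardware stride prefetchers typically keep a table of such entries indexed by the address of the load instruction, so each loop in the program is tracked with its own stride.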

Execution Based Prefetching

The idea behind execution-based prefetching is to pre-execute a portion of the program purely to prefetch its data. This work is carried out by speculative threads, which are designed to predict and load data into the cache before it is needed, reducing stalls caused by cache misses.

Speculative threads can operate in various ways:

  1. On a separate processor or core
    Entire processors or cores are dedicated to running speculative threads, ensuring that prefetching tasks do not interfere with the main computation.
  2. On a separate hardware thread context
    Speculative threads utilize separate hardware thread contexts (e.g., multithreading) to execute concurrently with the main thread, balancing resource sharing and isolation.
  3. On the same thread context during idle cycles
    Speculative threads run opportunistically during idle cycles of the main thread, making use of otherwise unused processing time.
[Figure: Execution Based Prefetching]

In the example above, additional threads are forked from the main thread to perform prefetching tasks. These threads run ahead independently, fetching data and filling the cache before the main program flow reaches it, which prevents stalls caused by memory latency. The technique is especially effective when the speculative thread can compute upcoming addresses early enough, since the main thread then stays busy with useful computation instead of waiting on memory.
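
The sketch below shows one way a programmer-visible version of this idea might look, assuming a POSIX threads environment: a helper thread walks the same index stream as the main thread, staying a bounded distance ahead and touching the data with __builtin_prefetch. The array sizes, the RUNAHEAD distance, and the busy-wait are illustrative simplifications.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N        (1 << 20)
#define RUNAHEAD 256           /* how far the helper stays ahead */

static size_t idx[N];          /* irregular, data-dependent access order */
static long   data[N];
static atomic_size_t main_pos; /* progress of the main thread */

static void *prefetch_helper(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < N; i++) {
        /* Stay at most RUNAHEAD elements ahead of the main thread. */
        while (i > atomic_load(&main_pos) + RUNAHEAD)
            ;  /* spin; a real implementation would yield or block */
        __builtin_prefetch(&data[idx[i]], 0, 1);
    }
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++) {
        idx[i] = (i * 7919) % N;   /* pseudo-random indirection */
        data[i] = (long)i;
    }

    pthread_t helper;
    pthread_create(&helper, NULL, prefetch_helper, NULL);

    long sum = 0;
    for (size_t i = 0; i < N; i++) {
        sum += data[idx[i]];       /* ideally already brought into the cache */
        atomic_store(&main_pos, i);
    }

    pthread_join(helper, NULL);
    printf("sum = %ld\n", sum);
    return 0;
}

On a real machine the helper competes with the main thread for cores, cache space, and memory bandwidth, which is exactly the trade-off the three execution options above describe.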

Runahead Execution

Runahead execution, which is employed in some modern processors, is a speculative technique that lets the processor keep executing instructions past a long-latency cache miss instead of stalling on it. The instructions executed in this mode are not retired for correctness; they run purely to trigger future cache misses early, addressing a critical bottleneck in modern computing: the very large instruction window that out-of-order execution would otherwise need to tolerate main-memory latencies.

The process is:

  • When the oldest instruction in the pipeline encounters a long-latency cache miss, the system checkpoints the architectural state and enters runahead mode.
  • In this mode, instructions are pre-executed speculatively, not for correctness but to generate useful prefetches for future execution.
  • Once the original cache miss is resolved, the system restores the checkpoint, flushes the speculative pipeline, and resumes normal sequential execution.
[Figure: Runahead Execution Example]

In standard execution, instructions must retire in order, which keeps dependent instructions locked in the instruction window, leading to long stalls during cache misses. Runahead execution circumvents this by speculatively pre-executing instructions in runahead mode. These instructions leave the window as they generate useful prefetches, reducing the bottleneck caused by long-latency memory accesses.
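
The following toy, cycle-counting model (a conceptual sketch in C, not how hardware implements it) illustrates the checkpoint / runahead / restore flow: while a miss is being serviced, later loads are pre-executed only to warm up the cache, and the state is then rolled back to the checkpoint.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of runahead execution. A "program" is a sequence of load block
   addresses; the cache is a tiny direct-mapped tag array; a miss costs
   MISS_LATENCY cycles, during which runahead mode pre-executes later loads. */

#define NUM_LOADS    8
#define CACHE_BLOCKS 8
#define MISS_LATENCY 10

static long cache_tag[CACHE_BLOCKS];

static bool cache_lookup(long block) { return cache_tag[block % CACHE_BLOCKS] == block; }
static void cache_fill(long block)   { cache_tag[block % CACHE_BLOCKS] = block; }

int main(void)
{
    long loads[NUM_LOADS] = {3, 7, 1, 5, 0, 6, 2, 4};
    long cycles = 0;

    for (int i = 0; i < CACHE_BLOCKS; i++)
        cache_tag[i] = -1;

    for (int pc = 0; pc < NUM_LOADS; pc++) {
        cycles++;                        /* one cycle per retired load */
        if (cache_lookup(loads[pc]))
            continue;                    /* hit: retire normally */

        /* Long-latency miss on the oldest load: checkpoint the state
           (here just 'pc') and enter runahead mode while it is serviced. */
        printf("miss on block %ld -> runahead while waiting %d cycles\n",
               loads[pc], MISS_LATENCY);
        for (int ra = pc + 1; ra < NUM_LOADS && ra <= pc + MISS_LATENCY; ra++) {
            if (!cache_lookup(loads[ra]))
                cache_fill(loads[ra]);   /* speculative pre-execution turns
                                            this future miss into a prefetch */
        }
        cycles += MISS_LATENCY;
        cache_fill(loads[pc]);           /* the original miss is resolved */
        /* Restore the checkpoint: speculative results are discarded and normal
           execution resumes after the missing load; only the warmed-up cache
           survives runahead mode. */
    }
    printf("finished in %ld cycles\n", cycles);
    return 0;
}

In this toy run, the very first miss is enough for runahead mode to warm up the cache for every later load, so only one memory stall is paid instead of one per load; a real processor likewise discards all runahead results except these prefetch effects.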

Benefits of Runahead Execution:

  • Increased Cache Hit Rate: Runahead execution proactively fetches data into the cache, reducing future memory latency.
  • Improved Instruction Throughput: By clearing dependent instructions from the window during runahead mode, it allows the system to prepare for smoother execution once the main pipeline resumes.

Resources

  1. Digital Design and Computer Architecture 2nd Edition by David Harris & Sarah Harris
  2. Onur Mutlu Lectures
