Vector Processors

Enes Harman
5 min read · Oct 3, 2024


A vector processor is a type of CPU designed to perform operations on entire vectors of data with a single instruction, rather than on individual scalar values. In this context, a vector is a one-dimensional array of numbers; a vector processor therefore executes instructions that operate on vectors rather than on single data elements (scalars).

In this article, I will explain the basic structure of vector processors and provide examples of code processing on a vector processor.

To process these data sets, vector processors rely on a set of specialized registers:

  1. Vector Data Register: a CPU register that holds multiple data elements (like an array) simultaneously.
  2. Vector Length Register (VLR): controls the number of elements processed by a vector instruction.
  3. Vector Stride Register (VSTR): specifies the memory spacing (stride) between consecutive elements of a vector when accessing memory.
  4. Vector Mask Register (VMASK): controls which elements of a vector operation are active or inactive.

We will discuss these registers in more detail as we delve deeper into the topic.

Loading/Storing Vectors

Loading and storing vectors inherently involve performing operations on multiple data elements simultaneously. In memory, the elements of a vector are separated by a constant distance, referred to as the stride. This value is stored in the VSTR (Vector Stride) register during vector processing.

Address Calculation of Vector Elements

The next address is calculated as:
Next Address = Previous Address + Stride
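This address recurrence can be sketched in a few lines of Python. The names (`base`, `stride`, `vlen`) and the 8-byte element size are illustrative assumptions, not part of any real ISA:

```python
def element_addresses(base, stride, vlen):
    """Return the memory address of each vector element,
    applying Next Address = Previous Address + Stride."""
    addrs = []
    addr = base
    for _ in range(vlen):
        addrs.append(addr)
        addr += stride
    return addrs

# A unit-stride vector of 8-byte doubles starting at address 0x1000:
print(element_addresses(0x1000, 8, 4))    # [4096, 4104, 4112, 4120]
# Accessing a matrix column stored row-major (100 doubles per row):
print(element_addresses(0x1000, 800, 4))  # [4096, 4896, 5696, 6496]
```

The second call shows why the stride register matters: walking down a column of a row-major matrix is still a regular access pattern, just with a larger stride.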

Irregular Access to Vector

If vector data is not stored contiguously in memory with a fixed stride, we use an indirection mechanism to combine or pack elements into vector registers. This technique is known as scatter/gather operations.
It is particularly useful for avoiding unnecessary computations on sparse vectors, where data elements are irregularly spaced.

These operations are often implemented in hardware to efficiently handle sparse vectors (or matrices) and cases of indirect indexing. During vector loads and stores, an index vector is used, which is added to a base register to generate the memory addresses for each element. This allows the processor to gather scattered elements from memory or scatter elements to non-contiguous locations.

Example of Scatter Operation

The gather operation works in a similar way. The data vector is filled by accessing memory locations specified by the index values in the index vector. This allows for collecting non-contiguous elements into a single vector register.
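The two operations can be sketched as follows; memory is modeled as a plain Python list, and the function names are illustrative, not a real instruction set:

```python
def gather(memory, base, index_vector):
    """Load non-contiguous elements into a dense vector register."""
    return [memory[base + idx] for idx in index_vector]

def scatter(memory, base, index_vector, data_vector):
    """Store a dense vector register to non-contiguous locations."""
    for idx, value in zip(index_vector, data_vector):
        memory[base + idx] = value

mem = [0] * 16
# Scatter a dense vector out to irregular locations...
scatter(mem, 0, [1, 4, 9, 13], [10, 20, 30, 40])
# ...then gather the same locations back into a dense vector:
print(gather(mem, 0, [1, 4, 9, 13]))  # [10, 20, 30, 40]
```

Note that each address is formed as base plus an entry of the index vector, exactly as described above.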

Conditional Operations with Vectors

If there is a condition that determines which data elements should be processed, the VMASK register is used to handle it. The VMASK register acts as a filter, enabling or disabling operations on specific elements based on the condition.

Let’s say we have a loop iterating over two vectors and we want to multiply their elements, but only when the corresponding value in the first vector is non-zero.

Here is the high-level code:

Example Usage of VMASK Register
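The masked loop described above can be sketched in Python. The mask is built from the condition A[i] != 0, and masked-off destination elements keep their old value; the function names are illustrative:

```python
def set_mask(vector):
    """Set a mask bit for every non-zero element."""
    return [1 if x != 0 else 0 for x in vector]

def masked_multiply(a, b, vmask, dest):
    """Multiply element-wise, writing back only active lanes."""
    for i, m in enumerate(vmask):
        if m:
            dest[i] = a[i] * b[i]
    return dest

A = [2, 0, 3, 0, 5]
B = [7, 7, 7, 7, 7]
C = [0, 0, 0, 0, 0]
vmask = set_mask(A)                     # [1, 0, 1, 0, 1]
print(masked_multiply(A, B, vmask, C))  # [14, 0, 21, 0, 35]
```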

There are essentially two types of mask processing:

  1. Simple Implementation: execute all operations, then turn off result writeback according to the VMASK register.
  2. Density-Time Implementation: scan the mask vector and execute only the operations with non-zero mask bits.

Diagrams of Mask Implementations

Depending on the data, one implementation may prove to be more efficient than the other.
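The trade-off can be sketched by counting multiply executions under each strategy. The op counters are only an illustration of work performed, not real cycle counts:

```python
def simple_impl(a, b, vmask):
    """Execute every lane; suppress writeback where the mask is 0."""
    ops = 0
    out = [0] * len(a)
    for i in range(len(a)):
        result = a[i] * b[i]   # always executed
        ops += 1
        if vmask[i]:
            out[i] = result    # writeback gated by VMASK
    return out, ops

def density_time_impl(a, b, vmask):
    """Scan the mask and execute only the active lanes."""
    ops = 0
    out = [0] * len(a)
    for i in (i for i, m in enumerate(vmask) if m):
        out[i] = a[i] * b[i]
        ops += 1
    return out, ops

vmask = [1, 0, 0, 0, 1]
a, b = [1, 2, 3, 4, 5], [10, 10, 10, 10, 10]
print(simple_impl(a, b, vmask))        # ([10, 0, 0, 0, 50], 5)
print(density_time_impl(a, b, vmask))  # ([10, 0, 0, 0, 50], 2)
```

With a sparse mask the density-time version does far less work, but scanning the mask adds overhead of its own, which is why neither implementation wins in every case.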

Getting High Performance From Vectors

In vector processing, the real bottleneck is usually the memory system. The throughput largely depends on the memory used in the system. Due to the nature of vector instructions, we don’t need to fetch instructions for every individual element, which results in a significant performance boost. Additionally, heavy instructions, such as load operations, can be easily pipelined. However, achieving this without any loss requires a memory system with sufficient banks and ports.

A memory bank is a logical division of memory that allows parallel access to different parts of the memory simultaneously.
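A common scheme is low-order interleaving: consecutive elements map to different banks, so a unit-stride access can touch all banks in parallel. A minimal sketch, assuming element-granularity interleaving over 8 banks:

```python
NUM_BANKS = 8  # illustrative bank count

def bank_of(element_index):
    """Low-order interleaving: bank = index mod number of banks."""
    return element_index % NUM_BANKS

# A unit-stride vector of 8 elements touches every bank once:
print([bank_of(i) for i in range(8)])      # [0, 1, 2, 3, 4, 5, 6, 7]
# A stride equal to the bank count hits a single bank (conflicts):
print([bank_of(i * 8) for i in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The second pattern shows why strides that share a factor with the bank count hurt throughput: the accesses serialize on one bank instead of overlapping.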

Let’s say we have this code:

for (i = 0; i < 50; i++)
    C[i] = (A[i] + B[i]) / 2;

In the code above, we add two vectors element-wise, divide the result by 2, and store the outcome in a third vector. Below is a high-level overview of the vector instructions used to achieve this operation:
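The instruction sequence can be sketched with small helper functions standing in for the vector instructions; the mnemonics (VLD, VADD, VSHR, VST) and memory layout are illustrative:

```python
VLEN = 50  # value held in the vector length register

def vld(memory, base):           # vector load
    return memory[base:base + VLEN]

def vadd(v1, v2):                # element-wise add
    return [x + y for x, y in zip(v1, v2)]

def vshr(v):                     # shift right by 1 == divide by 2
    return [x >> 1 for x in v]

def vst(memory, base, v):        # vector store
    memory[base:base + VLEN] = v

mem = list(range(100)) + [0] * VLEN   # A at 0, B at 50, C at 100
V1 = vld(mem, 0)                      # VLD  V1 <- A
V2 = vld(mem, 50)                     # VLD  V2 <- B
V3 = vadd(V1, V2)                     # VADD V3 <- V1 + V2
V4 = vshr(V3)                         # VSHR V4 <- V3 >> 1
vst(mem, 100, V4)                     # VST  C  <- V4
print(mem[100:105])                   # [25, 26, 27, 28, 29]
```

Each helper processes all 50 elements under one "instruction," which is the point: five instruction fetches cover the whole loop.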

We assume that the addition instruction takes 4 cycles and the vector load takes 11 cycles, with sufficient memory banks available. Due to pipelining, loading a complete vector in this example takes 11 + (VLEN - 1) cycles. Additionally, we fetch and decode the vector load (VLD) instruction only once per vector, compared to 50 times for a scalar processor, leading to significant savings in instruction fetch and decode cycles.
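Plugging in VLEN = 50 makes the pipelining benefit concrete: after the 11-cycle startup latency, one element arrives per cycle.

```python
VLEN = 50
load_startup = 11  # latency of the first element, from the text

# Pipelined: 11 cycles for the first element, then 1 per element.
pipelined_load_cycles = load_startup + (VLEN - 1)
# Unpipelined, the same load would pay full latency per element.
unpipelined_load_cycles = load_startup * VLEN

print(pipelined_load_cycles)    # 60 cycles for the full vector
print(unpipelined_load_cycles)  # 550 cycles without pipelining
```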

Conclusion

Vector processors are a powerful tool for efficiently processing large data sets by performing the same operation on multiple data elements simultaneously. They are especially well-suited for applications in scientific computing, graphics, and machine learning, where the workload is often structured around operations on vectors and matrices. By leveraging specialized vector registers, vector instructions, and optimized memory access patterns, vector processors can achieve substantial performance improvements over traditional scalar architectures.

Despite their complexity and dependence on a high-bandwidth memory system, the benefits of reduced instruction count and efficient handling of large data sets make vector processors a valuable addition to modern computing. Understanding their architecture and operation is essential for optimizing performance in vectorized applications and harnessing the full potential of parallel processing.

Resources

  1. Onur Mutlu Lectures
