Skip to Content
Author's profile photo Sudhir Verma

High Performance Application Using Vectorization

In this article I intend to share the advancement of processors taking Intel as example and how the latest changes in the computer architecture can be leveraged by software program. Over the years we have seen growth in the computer architecture and languages on the software side.

Computer languages starting from 8086 programming using instruction set, to basic, to HLL’s (High Level Language). HLL’s have helped software industry with high productivity, software engineer can write code for huge program at ease, whereas the same code could have taken lot of efforts having tried in basic language. The HLL’s have also helped IT industry with the easy maintenance, since it is much less code to maintain and also much more human readable compared to Instruction level programming. HLL programming also eliminated the need to understand the below hardware / microarchitecture level knowledge for programming, with most focus on language itself. With the growth of tooling in IDE (Interactive Development Environment), compilation and monitoring of software have improved a lot for programmer across world. Programmers can concentrate on problem in hand rather getting deep into computer architecture side. Industry for future too would wish that, also engineer designing computer system wants the same and trying to address the same too. Making sure the compilers, instruction are able to handle most of it automatically.

As much as it is true to remain independent in HLL’s from the beneath architecture of hardware / microarchitecture, at times it becomes important to have the knowledge of machine on which software is going to be executed. There are many examples in which it becomes important for programmer to understand the microarchitecture of computer prior to writing the code. Example such as database software or analytics or graphics, researchers across world is trying to create a database technology which aligns best to the hardware design for it to remain high performing. HANA itself is a great example of high performance database platform. To get deeper understanding of high performance, one need to get more insight (nicely covered in the following paper of Very Large Data Bases 2009 :  link here). HANA has leveraged the microarchitecture to its best form and used the latest vectorization approach in software programming to get the best performance of In-Memory database (as mentioned in the paper). To get into the details let us first look into the evolution of Intel processors register since 1993 to 2012 in Figure 1. Figure 1 shows that initial register having 16 bits capacity in 1993, 32 bits in 1997, Intel SSE with 128 bits, Intel AVX (advance vector extension) of 256 bits and in 2012 with Xeon 516 bits registers. Let us take a simple example of adding two arrays and storing them in another, consider space taken by integer is 16 bits:

Code Example 1:

For (int i=0; i<32; i++) {

      C[i] = A[i] + B[i];

}

SIMDRegisterGrowth.PNG

Figure 1, showing register storage increase since 1993 to 2012

One could imagine that all we are doing in code example 1 it to add consistently two number from given two arrays and storing them into another. Which means when converted into instruction it would look as following:

Move A[0] to register A

Move B[0] to register B

Add A and B register and store result in register C

Move A[31] to register A

Move B[31] to register B

Add A and B register and store result in register C

Total 3 instructions to perform simple add for one element of the array. In 1993 the above instruction would have been fired 32 times, which means a total of 32*3=96 instructions for the whole execution for adding arrays with 32 elements.

With the increased bits in register within Intel (any processor) and evolution of microarchitecture it has become possible to perform single instruction on multiple dataset (SIMD), which is the case in the given scenario of array addition. By 1999 Intel SSE (Intel’s Streaming SIMD Extension) made it possible to reduce the instruction required for adding the array as following:

Move A[0:4] to register A

Move B[0:4] to register B

Add A and B register and store result in register C

……

Move A[28:4] to register A

Move B[28:4] to register B

Add A and B register and store result in register C

Format within array index is: start address : Length (number of integers)

Overall the same add can be performed by 24 instructions, since the load loads the 4 integers into the register for A and B and the new add operation does the simultaneous add for all four together. Thanks to evolution of hardware and cost of making higher bit registers and efficient energy mechanism for executing instruction. The approach mentioned above is called vectorization approach, in which the operation are not performed as scalar (execution of one element) but on the multiple dataset.

By 2013, with 512 bit register, the same would be easily mapped to 3 instructions as given below:

Move A[0:32] to register A

Move B[0:32] to register B

Add A and B register and store result in register C

As one can see from the example above that solving same problem with 96 instruction Vs with just 3 instructions is great performance boost for a program performance. With advent of hardware cost going down and larger register space and cache makes any program performance gets automatic boost.

The real question is to leverage computer architecture (low level processor information) in the best way, which is made possible to utilize the vectorization notation or functions or pragma SIMD tags in the C++ program to completely utilize the power of hardware instead of leaving for complier to make choice. In the latest array notation structure can help programmer in C++ to write high performance vectorized code.

In the vectorized notation the code example 1 can be written as following:

C[:] = A[:] + B[:]; (Notation 1)

This code is same as code example 1 in which explicit for loop was used. In the array notation usage it is clear to the compiler that the vectorized instruction need to be implemented instead of scalar add. Intel has evolved its execution with vectorized format.

One question which comes to anyone mind is utilizing the power of multi-core, this can be easily done to change the instruction into small chunk instructions. In such cases the parallelism can be utilized by incrementing the for loop counter with the size of register (such as by 4 in case of Intel SSE), for example if we have four processors, simultaneously execution of 16 integer was possible and the total execution cycle can be reduced by quarter of original time in single core SSE implementation. Only difference is that the following notation would lead to such parallelism, instead of incrementing index ‘i’ by one we need to increment by length of register which in case of SSE is 4:

For (int i=0; i<32; i+4) {

C[i:4] = A[i:4] + B[i:4];

}

One thing to keep in mind is about age old challenge of maintaining HLL’s which utilize the microarchitecture of specific processor, it is very much tuned for the specific hardware configuration for high performance benefits. SAP HANA being a good example of leveraging Intel SSE vectorization for its high performance in searching or table scan, one can read in details in this link.

SIMD for database operations makes lots of sense, since it is same instruction which need to be executed over the large amount of dataset. Column based approach is a good fit for SIMD as the memory expected for SIMD load operation need to be continuous (not necessary, but continuous would be efficient). For processing a search query on particular column, database performs a compare operation on each element of the column. Now with SSE implementation the same operation is parallelized within instruction and across cores. So the concept of SIMD or AVX (Advance Vector Extensions) takes parallelism to new dimension, at instruction level. Figure 2 shows various kind of parallelism such as parallelism across application, within one application threading is used for utilizing the core for parallelism, within one thread instruction are parallelized across multiple cores, and with latest development of vectorization the parallelism can be explored at the level of single instruction.

TypesOfParallelism.PNG

Figure 2, Level of parallelism, indicating SIMD (single instruction) parallelism

Programming with SIMD or without which might automatically lead to vectorization is the choice of developers. Getting deeper into low level information for understanding the performance of application and at times the problem with performance requires good amount of knowledge both at program as well as the various implementations at the hardware level.

Applicability of such knowledge can be in any program but strong focus in database technology, analytic, graphics, data processing and cloud framework.

While writing about vectorization I got to know many things from C++ perspective also at the compiler level to leverage SIMD implementation automatically. I could not gather much on the Java side for the same. As I further read on this topic, would explore if we have similar notation for Java, which can help Java programmer to utilize the vectorization from high level language. In case you have some material to share on from Java notations or SIMD, please share as comment for this blog.

Assigned Tags

      2 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Vijay Pal Singh Gaur
      Vijay Pal Singh Gaur

      Very nice article. simple to read and understand!!

      Author's profile photo Amit Saini
      Amit Saini

      nice Article, one question which is coming to my mind .. could be straight forward

      "Vectorization is currently not possible for HLL like Java n ABAP currently"?