Vectorization, Part 2: Why and What?

June 22, 2017

This is the second in a series of blogs on vectorization, which is a key tool for dramatically improving the performance of code running on modern CPUs. Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
by Rohan Douglas, CEO & Jamie Elliott, Development Manager, Risk Architecture (Quantifi)

In my last blog I covered how CPUs have evolved and how software must leverage both threading and vectorization to get the highest performance possible from the latest generation of processors. In this blog I cover the why and what of vectorization.

Why Vectorize

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time.

Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD). For example, a CPU with a 512-bit register can hold sixteen 32-bit single-precision floating-point values and apply one instruction to all sixteen at once, which is up to 16 times faster than processing one value per instruction. Combining this with threading and multi-core CPUs can lead to orders-of-magnitude performance gains.

The following is code to add two vectors.

[Figure: code to add two vectors]
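A minimal sketch of such a loop in C is shown here for reference. The function name `add_arrays` and the choice of `float` elements are assumptions made for illustration and are not taken from the original listing.

```c
#include <stddef.h>

/* Element-wise addition: c[i] = a[i] + b[i] for every i in [0, n).
   Written as a plain scalar loop; an optimizing compiler may vectorize it. */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Each iteration is independent of the others, which is what makes this loop a natural candidate for vectorization.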

 

In a serial calculation, the individual vector (array) elements are added in sequence. The additional register space in modern CPUs is unused.

 

[Figure: serial calculation]

 

In a vectorized calculation, all elements of the vector (array) can be added in one calculation step.

 

[Figure: vectorized calculation]
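The shape of a vectorized loop can be modelled in plain C as a main loop that handles one full group of elements per step, followed by a remainder loop for the leftovers. The sketch below is only a model: it assumes a width of four lanes (`LANES`, a hypothetical constant standing for four 32-bit floats in a 128-bit register) and uses ordinary scalar statements where real vectorized code would issue a single SIMD instruction per group.

```c
#include <stddef.h>

#define LANES 4  /* assumed vector width: four 32-bit floats per register */

void add_arrays_blocked(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: each outer iteration covers one full group of LANES
       elements, the unit a single SIMD add instruction would process. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }

    /* Remainder loop: leftover elements when n is not a multiple of LANES. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With six elements, for instance, the main loop runs once for elements 0 to 3 and the remainder loop handles elements 4 and 5.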

What kind of problem is vectorizable?

Not all code can take advantage of vectorization. The problem set must be amenable to a vectorized solution. Vectorization works best on problems that require the same simple operation to be performed on each element in a data set. So, first of all, look for a loop. The prototypical example is used above - the addition of each element in an array.

[Figure: simple loop]

But many other primitive operators can also be vectorized. The kinds of matrix transformation seen in linear algebra are usually a good candidate for vectorization. The good news is that the Finance domain provides many problem sets that are suitable.
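As one illustration from finance, discounting a set of cash flows applies the same multiplication to every element. The sketch below is a hypothetical example, not code from the original post; the function name `discount_cashflows` and the array layout are assumptions.

```c
#include <stddef.h>

/* Present value of each cash flow: pv[i] = cashflow[i] * discount_factor[i].
   Every element is processed the same way and independently of the others. */
void discount_cashflows(const double *cashflow, const double *discount_factor,
                        double *pv, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        pv[i] = cashflow[i] * discount_factor[i];
    }
}
```

The loop has the same structure as array addition, only with a different primitive operator.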

 

Issues that impact vectorizing your code

A range of issues can limit the effectiveness of vectorization. Some of the more common ones include:

 

1.    Loop Dependencies (Avoid read-after-write)

2.    Indirect Memory Access (Use loop index directly. Seek unit loop stride)

3.    Non ‘Straight line’ code (function calls, conditions, unknown loop count)

 

 

[Figure: issues that impact vectorizing your code]
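The three issues listed above can each be sketched as a short C loop. These functions are hypothetical examples written for illustration; whether a given compiler vectorizes them depends on the compiler and target, so the comments describe typical behavior rather than a guarantee.

```c
#include <stddef.h>

/* 1. Loop dependency (read-after-write): iteration i reads a[i - 1], which
      iteration i - 1 has just written, so the iterations cannot simply be
      executed side by side. */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        a[i] = a[i] + a[i - 1];
    }
}

/* 2. Indirect memory access: the element read depends on idx[i], so the
      loads are scattered instead of contiguous (unit stride). */
void gather_add_one(const float *a, const size_t *idx, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[idx[i]] + 1.0f;
    }
}

/* 3. Non-straight-line code: the data-dependent early exit means the loop
      count is not known before the loop starts. */
size_t count_until_negative(const float *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] < 0.0f) {
            break;
        }
        ++count;
    }
    return count;
}
```

In each case the remedy suggested in the list applies: remove the dependency, index directly with the loop counter, or restructure the loop so that its body is straight-line code with a known trip count.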


 

Implementing Vectorization

Alternatives

There are a range of alternatives and tools for implementing Vectorization. They vary in terms of complexity, flexibility and future compatibility.

[Figure: alternatives for implementing vectorization]

Source: Intel
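One of the lighter-weight options, beyond relying on the compiler to vectorize automatically, is a compiler hint. The sketch below uses the OpenMP `simd` pragma on a scale-and-add loop; the function name `saxpy` is chosen for illustration. It assumes the compiler is invoked with OpenMP SIMD support enabled (for example `-fopenmp-simd` with GCC or Clang, or `-qopenmp-simd` with the Intel compiler); without it the pragma is ignored and the loop remains ordinary C.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i], with a hint asking the compiler to vectorize.
   The pragma asserts that the iterations are safe to execute in SIMD lanes,
   so x and y must not overlap. */
void saxpy(float alpha, const float *x, float *y, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

A hint of this kind keeps the source portable: the same loop compiles with or without vector support, unlike hand-written intrinsics or assembly, which tie the code to one instruction set.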

Intel’s 6 Step Program for Vectorization

The simplest way to implement vectorization is to start with Intel’s 6-step process. This process leverages Intel tools to provide a clear path to transforming existing code into modern, high-performance software leveraging multicore and manycore processors.

Step 1. Measure baseline release build performance

The starting point is a reference release build. A release build is important because:

  1. The compiler will optimize your code

  2. You need to have a baseline to measure how vectorization is improving performance

Ideally you should set a goal for performance to know when you are done.
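A baseline can be as simple as timing the code of interest over many repetitions in a release build. The sketch below is one way to do that with the standard C `clock()` function; `kernel` is a hypothetical stand-in for the code being tuned, and `time_kernel` is an illustrative helper, not part of any Intel tool.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel standing in for the code being tuned. */
static void kernel(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Returns the average CPU time in seconds per call of kernel(),
   measured over `repeats` consecutive calls. */
double time_kernel(const float *a, const float *b, float *c, size_t n,
                   int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r) {
        kernel(a, b, c, n);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Record the number this produces before making any changes, use large enough inputs and repeat counts for the time to be measurable, and make sure the results are actually used so that the optimizer cannot discard the work.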

Step 2. Determine hotspots

Tools like Intel’s performance profiler VTune™ Amplifier XE can be used to profile your application to find the most time-consuming areas of code or “Hotspots”. Identifying Hotspots helps focus effort on the areas of optimization that will generate the most benefit.

[Figure: Intel VTune Amplifier XE]

Step 3. Determine loop candidates

Compiler reports like Intel's Compiler Optimization Report can tell you which loops are suitable for vectorization. Loops in hotspots that are not automatically vectorizable can often be modified using various techniques to allow them to be vectorized.
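A common reason a report flags a loop is possible pointer aliasing: the compiler cannot prove that the output array does not overlap the input. The sketch below shows one such modification, using the C99 `restrict` qualifier. The function name `scale` is hypothetical, and the report flags in the trailing comment are examples to check against your own compiler's documentation.

```c
#include <stddef.h>

/* `restrict` is the programmer's promise that `in` and `out` do not overlap.
   Without it the compiler must assume they might, which can force a runtime
   overlap check or prevent vectorization of the loop. */
void scale(const float *restrict in, float *restrict out, float factor,
           size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = factor * in[i];
    }
}

/* Example flags that request a vectorization report:
     GCC:    -O3 -fopt-info-vec -fopt-info-vec-missed
     Clang:  -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
     Intel:  -qopt-report -qopt-report-phase=vec                      */
```

The promise is binding: calling `scale` with overlapping arrays is undefined behavior, so the qualifier should only be added where the calling code guarantees separate buffers.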

Step 4. Analyse specific hotspot code to measure performance gains

Tools like Intel's Advisor can be used to measure potential benefits from vectorization of specific code to help focus effort for the maximum gain.

[Figure: Intel Advisor]

Step 5. Implement Vectorization Recommendations

Implement recommendations for vectorizing code using re-ordering of code, compiler hints or other methods.
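As an example of re-ordering code, the sketch below computes column sums of a row-major matrix in two ways. Both functions are hypothetical illustrations. The first walks down each column, so consecutive reads are `cols` elements apart; the second interchanges the loops so that the inner loop reads and writes consecutive elements (unit stride), which is generally easier for a compiler to vectorize.

```c
#include <stddef.h>

/* Before: the inner loop walks down one column of a row-major matrix,
   so successive reads are cols elements apart. */
void column_sums_strided(const double *m, size_t rows, size_t cols,
                         double *sums)
{
    for (size_t j = 0; j < cols; ++j) {
        sums[j] = 0.0;
        for (size_t i = 0; i < rows; ++i) {
            sums[j] += m[i * cols + j];
        }
    }
}

/* After: loops interchanged so the inner loop moves along a row,
   touching consecutive elements of both m and sums. */
void column_sums_unit_stride(const double *m, size_t rows, size_t cols,
                             double *sums)
{
    for (size_t j = 0; j < cols; ++j) {
        sums[j] = 0.0;
    }
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            sums[j] += m[i * cols + j];
        }
    }
}
```

Both versions add the rows to each column total in the same order, so they produce identical results; only the memory access pattern changes.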

Step 6. Repeat

The process is iterative and should be repeated until the desired performance is reached.

 

Resources

Vectorization, Kirill Rogozhin, Intel, March 2017
Vectorization of Performance Dies for the Latest AVX SIMD, Kevin O’Leary, Intel, Aug 2016
A Guide to Vectorization with Intel® C++ Compilers, Intel, Nov 2010
Vectorization Codebook, Intel, Sep 2015
The Free Lunch Is Over - A Fundamental Turn Toward Concurrency in Software, Herb Sutter, March 2005
Recipe: Using Binomial Option Pricing Code as Representative Pricing Derivative Method, Shuo-li, Intel, June 2016
