by Rohan Douglas, CEO & Jamie Elliott, Development Manager, Risk Architecture (Quantifi)
This is the second in a series of blogs on vectorization, which is a key tool for dramatically improving the performance of code running on modern CPUs. In our last blog, we covered how CPUs have evolved and how software must leverage both Threading and Vectorization to get the highest performance possible from the latest generation of processors. In this blog, we’ll cover the why and what of Vectorization.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time.
Modern CPUs provide direct support for vector operations, where a single instruction is applied to multiple data (SIMD). For example, a CPU with a 512-bit register can hold 16 32-bit single-precision floating-point values and perform one calculation on all 16 at once, rather than executing 16 separate instructions. Combining this with threading on multi-core CPUs leads to orders-of-magnitude performance gains. The following is code to add two vectors.
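A minimal C++ sketch of such a loop (the function and variable names are illustrative):

```cpp
#include <cstddef>

// Add two vectors element by element. A simple loop like this, with
// independent iterations over contiguous data, is the canonical
// candidate for auto-vectorization.
void add(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```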
In a serial calculation, the individual vector (array) elements are added in sequence. The additional register space in modern CPUs is unused.
In a vectorized calculation, all elements of the vector (array) can be added in one calculation step.
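To illustrate what the hardware is doing, here is a sketch using x86 SSE intrinsics, which process four packed floats per instruction (this assumes an x86 target; wider AVX and AVX-512 registers extend the same idea to 8 or 16 lanes):

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics

// Add four floats in a single SIMD instruction. In practice you would
// usually let the compiler generate this from a plain loop; the
// intrinsic form is shown only to make the vector operation explicit.
void add4(const float* a, const float* b, float* c) {
    __m128 va = _mm_loadu_ps(a);          // load 4 floats
    __m128 vb = _mm_loadu_ps(b);          // load 4 floats
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); // one instruction adds all 4
}
```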
What kind of problem is vectorizable?
Not all code can take advantage of vectorization. The problem set must be amenable to a vectorized solution. Vectorization works best on problems that require the same simple operation to be performed on each element in a data set. So, first of all, look for a loop. The prototypical example, used above, is the element-wise addition of two arrays.
But many other primitive operators can also be vectorized. The kinds of matrix transformation seen in linear algebra are usually a good candidate for vectorization. The good news is that the Finance domain provides many problem sets that are suitable.
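For instance, a dense matrix–vector product vectorizes well because its inner loop applies the same multiply-add to contiguous data. A sketch (row-major storage assumed, names illustrative):

```cpp
#include <cstddef>

// y = M * x for a dense row-major matrix M (rows x cols).
// The inner loop is a unit-stride multiply-accumulate, which
// compilers can typically vectorize automatically.
void matvec(const float* m, const float* x, float* y,
            std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += m[r * cols + c] * x[c];
        y[r] = acc;
    }
}
```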
Issues that impact vectorizing your code
There is a range of issues that can impact the effectiveness of vectorization. Some of the more common ones include:
- Loop dependencies (avoid read-after-write)
- Indirect memory access (use the loop index directly; aim for unit loop stride)
- Non-'straight line' code (function calls, conditions, unknown loop counts)
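To illustrate the first point, here is a sketch contrasting a loop with a read-after-write dependency against one with independent iterations (function names are illustrative):

```cpp
#include <cstddef>

// Read-after-write dependency: each iteration reads the result of the
// previous one, so the iterations cannot run in parallel lanes and the
// loop cannot be vectorized as written.
void prefix_sum(float* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

// Independent iterations: no element depends on another, so the
// compiler is free to vectorize.
void scale(float* a, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] *= k;
}
```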
There is a range of alternatives and tools for implementing vectorization. They vary in terms of complexity, flexibility, and future compatibility.
Intel’s 6 Step Program for Vectorization
The simplest way to implement vectorization is to start with Intel’s 6-step process. This process uses Intel tools to provide a clear path for transforming existing code into modern, high-performance software that takes full advantage of multi-core and many-core processors.
Step 1. Measure baseline release build performance
The starting point is a reference release build. A release build is important because:
- The compiler will optimize your code
- You need to have a baseline to measure how vectorization is improving performance
Ideally you should set a goal for performance to know when you are done.
Step 2. Determine hotspots
Tools like Intel’s performance profiler VTune™ Amplifier XE can be used to profile your application to find the most time-consuming areas of code or “Hotspots”. Identifying Hotspots helps focus effort on the areas of optimization that will generate the most benefit.
Step 3. Determine loop candidates
Compiler reports like Intel’s Compiler Optimization Report can tell you which loops are suitable for vectorization. Loops in hotspots that are not automatically vectorizable can often be modified, using various techniques, so that they can be vectorized.
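A classic example of such a modification concerns pointer aliasing: if the compiler must assume the output may overlap an input, its report will flag the loop as unvectorizable. The `__restrict` keyword (a widely supported compiler extension in GCC, Clang, and MSVC) promises no overlap; a sketch:

```cpp
#include <cstddef>

// Without __restrict, the compiler may assume dst and src could alias
// and refuse to vectorize. The __restrict qualifier promises the
// pointers refer to non-overlapping memory, unblocking vectorization.
void scale_copy(float* __restrict dst, const float* __restrict src,
                float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```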
Step 4. Analyze specific hotspot code to measure performance gains
Tools like Intel’s Advisor can be used to measure potential benefits from vectorization of specific code to help focus effort for the maximum gain.
Step 5. Implement Vectorization Recommendations
Implement recommendations for vectorizing code using re-ordering of code, compiler hints or other methods.
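One common compiler hint is OpenMP’s `#pragma omp simd`, which asserts that the loop’s iterations are independent and may be executed in SIMD lanes (compile with `-fopenmp-simd` on GCC/Clang for the pragma to take effect; it is safely ignored otherwise). A sketch:

```cpp
#include <cstddef>

// y = a*x + y (saxpy). The pragma tells the compiler the iterations
// are independent, so it can vectorize without proving it itself.
void saxpy(float* y, const float* x, float a, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```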
Step 6. Repeat
The process is iterative and should be repeated until the desired performance is reached.