by Rohan Douglas, CEO & Jamie Elliott, Development Manager, Risk Architecture (Quantifi)
This is the first in a series of blogs on vectorization, a key tool for dramatically improving the performance of code running on modern CPUs. Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations, in which a single instruction is applied to multiple data elements (SIMD).
In this blog I cover how CPUs have evolved and how software must leverage both Threading and Vectorization to get the highest performance possible from the latest generation of processors.
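To make the definition of vectorization above concrete, here is a minimal C++ sketch of the same element-wise addition written two ways. The function names and the lane count of 8 are assumptions chosen for illustration. The `#pragma omp simd` line is only a hint: GCC and Clang honour it when built with `-fopenmp` or `-fopenmp-simd`, and otherwise ignore it, in which case the code still produces the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar form: conceptually one element is processed per loop iteration.
std::vector<float> add_scalar(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Assumption for illustration: 8 floats per vector (a 256-bit register).
constexpr std::size_t kLanes = 8;

// Vector form, written out by hand: each outer iteration handles a block of
// kLanes elements, which a compiler can map onto a single SIMD instruction.
std::vector<float> add_blocked(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> out(n);
    const float* pa = a.data();
    const float* pb = b.data();
    float* po = out.data();

    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        #pragma omp simd
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            po[i + lane] = pa[i + lane] + pb[i + lane];
        }
    }
    // Remainder loop: the last n % kLanes elements are handled one at a time.
    for (; i < n; ++i) {
        po[i] = pa[i] + pb[i];
    }
    return out;
}
```

In practice a compiler will often vectorize the scalar loop automatically. The blocked version simply spells out the transformation, including the remainder loop needed when the array length is not a multiple of the vector width.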
The Rise of Parallelism
For the past decade, Moore’s law has continued to prevail, but while chip makers have continued to pack more transistors into every square inch of silicon, the focus of innovation has moved away from greater clock speeds and towards multicore and manycore architectures.
As Herb Sutter famously observed in 2005, for developers this architectural shift meant the end of the “Free Lunch”, where existing software automatically ran faster with each new generation of hardware. Traditional applications based on a single serial thread of instructions no longer see performance gains from new hardware as CPU clock rates have flat-lined.
Since that time, a great deal of effort has gone into engineering applications that can exploit the growing number of CPU cores by running multi-threaded or grid-distributed calculations. This type of parallelism has become a routine part of designing performance-critical software.
At the same time as multicore chip design has given rise to task parallelism in software design, chipmakers have also been increasing the power of a second type of parallelism: data parallelism at the instruction level. Alongside the trend to increase core count, the width of SIMD (single instruction, multiple data) registers has been steadily increasing. The software changes required to exploit this form of parallelism are known as ‘vectorization’.
The most recent processors have many cores and hardware threads, and can apply a single instruction to an increasingly wide set of data elements (the SIMD width).
A key driver of these architectural changes was the power/performance trade-off between the alternative approaches:
- Wider SIMD – Linear increase in transistors and power
- Multi core – Quadratic increase in transistors and power
- Higher clock frequency – Cubic increase in power
SIMD provides a way to increase performance using less power.
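One common way to motivate the cubic figure for clock frequency is sketched below. It assumes the standard CMOS dynamic-power model and that supply voltage has to rise roughly in proportion to frequency. Both are textbook simplifications introduced here for illustration, and the symbols (activity factor \alpha, switched capacitance C, supply voltage V, frequency f) are not taken from the text above.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f,
\qquad V \propto f
\;\;\Longrightarrow\;\;
P_{\text{dyn}} \propto f^{3}
% Example: a 20\% higher clock costs about 1.2^{3} \approx 1.73 times the power.
```

Under the same simplified model, doubling the SIMD width roughly doubles the work done per instruction at the same voltage and frequency, which is why it is the cheapest of the three options in power terms.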
The first widely deployed desktop SIMD implementation was Intel’s MMX extension to the x86 architecture, introduced in 1996.
Intel’s latest generation of Xeon Phi processors, codenamed Knights Landing, uses Intel’s 14nm manufacturing process, has over 70 cores arranged in a 2D mesh, 4 threads per core, and can operate on 512-bit vectors (the SIMD width).
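To put the 512-bit figure in perspective, the short sketch below works out how many values fit in one vector register. The helper names `lanes` and `print_lane_table` are hypothetical. The 128-bit and 256-bit widths are included only for comparison and are not taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Number of elements of type T that fit in a register of the given width.
template <typename T>
std::size_t lanes(std::size_t register_bits) {
    const std::size_t element_bits = 8 * sizeof(T);
    assert(register_bits % element_bits == 0);
    return register_bits / element_bits;
}

// Prints a small table of lane counts for a few register widths.
void print_lane_table() {
    const std::size_t widths[] = {128, 256, 512};  // comparison widths (assumed)
    for (std::size_t bits : widths) {
        std::printf("%zu-bit register: %zu floats, %zu doubles\n",
                    bits, lanes<float>(bits), lanes<double>(bits));
    }
}
```

On a platform with 4-byte `float` and 8-byte `double`, a 512-bit register holds 16 single-precision or 8 double-precision values, so one vector instruction can do the work of 16 or 8 scalar ones.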
Software design must adapt to take advantage of these new processor technologies. Multi-threading and vectorization are each powerful tools on their own, but only by combining them can performance be maximized.
The above results are for a binomial options pricing example. Most existing code is either serial or implements Threading or Vectorization only. The combination of both Threading and Vectorization provides dramatic improvements and the scale of those improvements is growing with each new generation of hardware.
Modern software must leverage both Threading and Vectorization to get the highest performance possible from the latest generation of processors.
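As a rough illustration of combining the two techniques, the sketch below prices a set of European calls on a Cox-Ross-Rubinstein binomial tree. It is not the benchmark code behind the results discussed above. The `EuropeanCall` struct and both function names are hypothetical, and OpenMP is an assumed choice of tooling. Threads are spread across options with `#pragma omp parallel for`, while the loops over tree nodes are marked with `#pragma omp simd`. Without `-fopenmp` the pragmas are ignored and the code runs serially with the same results.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical option description, introduced only for this sketch.
struct EuropeanCall {
    double spot, strike, rate, volatility, maturity;
};

// Cox-Ross-Rubinstein binomial price of a single European call.
double price_binomial(const EuropeanCall& o, int steps) {
    assert(steps > 0);
    const double dt = o.maturity / steps;
    const double log_up = o.volatility * std::sqrt(dt);
    const double up = std::exp(log_up);
    const double down = 1.0 / up;
    const double p_up = (std::exp(o.rate * dt) - down) / (up - down);
    const double disc = std::exp(-o.rate * dt);
    const double w_up = disc * p_up;            // discounted up-move weight
    const double w_down = disc * (1.0 - p_up);  // discounted down-move weight

    // Two buffers, so each level reads one array and writes the other.
    std::vector<double> buf_a(static_cast<std::size_t>(steps) + 1);
    std::vector<double> buf_b(static_cast<std::size_t>(steps) + 1);
    double* cur = buf_a.data();
    double* next = buf_b.data();

    // Payoff at expiry for every terminal node (vectorizable).
    #pragma omp simd
    for (int j = 0; j <= steps; ++j) {
        const double s = o.spot * std::exp(log_up * (2 * j - steps));
        cur[j] = std::max(s - o.strike, 0.0);
    }

    // Backward induction: the inner loop over nodes is the SIMD dimension.
    for (int level = steps; level > 0; --level) {
        #pragma omp simd
        for (int j = 0; j < level; ++j) {
            next[j] = w_up * cur[j + 1] + w_down * cur[j];
        }
        std::swap(cur, next);
    }
    return cur[0];
}

// Threading: independent options are distributed across cores.
std::vector<double> price_portfolio(const std::vector<EuropeanCall>& options,
                                    int steps) {
    std::vector<double> prices(options.size());
    const long count = static_cast<long>(options.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; ++i) {
        prices[i] = price_binomial(options[i], steps);
    }
    return prices;
}
```

The two levels are independent by construction: each thread owns whole options, and within an option every node at a given level can be updated without reference to its neighbours' new values. That independence is what lets the thread-level and SIMD-level speedups compound.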
Resources
Vectorization, Kirill Rogozhin, Intel, March 2017
Vectorization of Performance Dies for the Latest AVX SIMD, Kevin O’Leary, Intel, Aug 2016
A Guide to Vectorization with Intel® C++ Compilers, Intel, Nov 2010
Vectorization Codebook, Intel, Sep 2015
The Free Lunch Is Over – A Fundamental Turn Toward Concurrency in Software, Herb Sutter, March 2005
Recipe: Using Binomial Option Pricing Code as Representative Pricing Derivative Method, Shuo-li, Intel, June 2016