Accelerating the Performance of Large-scale XVA Workloads

February 24, 2021

In the post-crisis world, an increasing number of banks have set up a centralized XVA desk. With the introduction of new regulations to ensure banks are adequately capitalized, it has become common practice to include certain costs in the pricing of OTC derivatives that, in many cases, had previously been ignored. To assist in the pricing for the cost of dealing with a counterparty in a derivative transaction, the markets have developed various metrics including CVA, DVA, FVA, ColVA, KVA, and MVA—collectively known as XVAs.

One of the key challenges of XVAs is that adjustments need to be calculated on a portfolio basis rather than trade by trade. This requires orders of magnitude more calculations to produce accurate results. The calculation of XVAs is highly complex, combining the intricacies of derivative pricing with the computational challenges of simulating a full universe of risk factors. Given the strategic importance of XVA, banks require enhanced capabilities and modern infrastructures to calculate the required credit, funding, and capital adjustments. As banks look to reduce, mitigate, and optimize XVA and other capital charges, they are investing in XVA capabilities to meet this computational challenge.
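To illustrate why a portfolio-level calculation differs from a trade-by-trade one, the short sketch below compares the two on a single simulated scenario. The trade values are invented for illustration, and the sketch assumes all three trades fall under one netting agreement with the counterparty.

```python
# Hypothetical mark-to-market values of three trades with one counterparty
# on a single simulated path and date (illustrative numbers only).
trade_values = [120.0, -80.0, 35.0]

# Trade by trade: every positive value counts as exposure on its own.
standalone_exposure = sum(max(v, 0.0) for v in trade_values)

# Portfolio basis (assuming one netting set): values offset each other
# before the floor at zero is applied.
netted_exposure = max(sum(trade_values), 0.0)

print(standalone_exposure)  # 155.0
print(netted_exposure)      # 75.0
```

Because the netted figure depends on every trade in the set, it cannot be obtained by summing independently computed per-trade adjustments.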

Another key challenge is how to efficiently calculate XVA sensitivities. While sensitivities have always been an important component of XVA desk risk management, the FRTB-CVA framework published by the Basel Committee in 2015 has made managing regulatory capital a priority for banks globally. This has further driven the demand for calculation of sensitivities. Banks that are unable to calculate the CVA capital charge using the sensitivity-based FRTB approach will have to use the rather punitive formula-based basic approach.

Why is it important to make XVA calculations faster?

XVAs are simulation-based calculations. The market standard is to use Monte Carlo (MC) simulations, with each covering thousands of paths across a large number of future time steps.

For example, an MC simulation of 2000 paths across 78 time steps for a portfolio of 40,000 trades requires up to 6.24 billion calculations that on average produce over 10GB of compressed result data. Sensitivities, stress tests, and attribution calculations can increase the number of calculations by an order of magnitude. XVA-related calculations are by far the most computationally resource-intensive for a bank. Hence banks are looking for quantitative as well as technology-based solutions designed to optimize performance.
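The 6.24 billion figure follows directly from multiplying the three dimensions of the simulation. A minimal sketch of the arithmetic, assuming one valuation per trade, per path, per time step:

```python
# Figures taken from the example above.
paths = 2_000
time_steps = 78
trades = 40_000

# Upper bound: one valuation per trade on every path at every time step.
valuations = paths * time_steps * trades
print(f"{valuations:,}")  # 6,240,000,000

# "An order of magnitude" is read here as a factor of 10 (an assumption).
with_sensitivities = valuations * 10
print(f"{with_sensitivities:,}")  # 62,400,000,000
```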
 

Calculating XVA

The demand for higher performance has highlighted the need to get the most out of the latest generation of software. A distributed architecture that supports the heavy demands of big data provides a number of benefits when dealing with large, complex portfolios. The main benefits include scalability, reliability, and resilience. However, the use of distributed computing for calculating XVA also presents a number of challenges, mainly in regard to I/O performance and central processing unit (CPU) processing.

While distributing the workload increases the calculation performance, it comes at a cost of transporting and persisting results to the data store. Furthermore, to compute the results the calculations reference data loaded from the data store. Measuring XVA is a highly complex process that requires having to save and manipulate large amounts of data. Having access to large-scale distribution and big data technology to minimize I/O is important. XVA calculations are also time-critical, which requires a high-performance CPU to handle the workload.

A number of factors influence the speed of XVA computations, including the size of the portfolio, the amount of market data, and the configuration of the Monte Carlo simulations themselves. Collectively these increase the number of calculations that need to be performed. Simulation configuration increases the complexity of each individual calculation, which means each calculation takes longer to complete and generate results.

The number of simulated market variables, the number of time steps to simulate, and the number of simulation paths are the configuration settings that have the greatest impact on performance. Simulations are configured with a number of market variables that evolve over time. The number of time steps is the number of steps into the future over which the market variables are evolved, whereas the number of simulation paths is the number of times the market variables are simulated over time. Naturally, as any of these is increased, the calculations become more complex. The most expensive factor in the performance of these simulations is the number of simulation paths.
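The three settings can be pictured as the dimensions of a nested loop. The sketch below is a toy illustration only: the function name, the driftless lognormal dynamics, the volatility, and the monthly step size are all assumptions made for the example, not a description of any production model.

```python
import math
import random

def simulate(num_variables, num_steps, num_paths, dt=1 / 12, vol=0.2, seed=42):
    """Toy simulation: each market variable follows a driftless lognormal walk."""
    rng = random.Random(seed)
    paths = []
    for _ in range(num_paths):                 # number of simulation paths
        levels = [1.0] * num_variables
        path = []
        for _ in range(num_steps):             # number of time steps
            levels = [                         # number of market variables
                x * math.exp(-0.5 * vol**2 * dt
                             + vol * math.sqrt(dt) * rng.gauss(0.0, 1.0))
                for x in levels
            ]
            path.append(levels)
        paths.append(path)
    return paths

scenarios = simulate(num_variables=3, num_steps=12, num_paths=100)
simulated_values = len(scenarios) * len(scenarios[0]) * len(scenarios[0][0])
print(simulated_values)  # 3600
```

The amount of simulated data is the product of the three settings, so raising any one of them scales the work proportionally before any trade has been priced.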

All these factors result in XVA calculations being very resource-intensive, even when using a distributed computing architecture. On a set-up with an average number of simulation paths and an average portfolio size, a full set of XVA calculations (simulations, aggregation, and sensitivities) can take well over 30 minutes to complete and generate over 200GB of result data to be persisted.

Quantifi is built on a modern microservices architecture using a distributed computation system backed by a data store. XVA calculations are set up to take advantage of the distributed computation system, as it allows results to be calculated quickly in parallel. The distributed architecture also has the benefit of being able to scale up to meet increased workloads.
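The idea of splitting a simulation across workers and aggregating the partial results can be sketched as follows. This is a generic illustration, not Quantifi's implementation: `simulate_chunk` is a hypothetical stand-in for a pricing task, and a thread pool stands in for a distributed compute grid (in CPython, threads would not speed up CPU-bound work, so a real system would use separate processes or machines).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_chunk(seed, num_paths):
    """Hypothetical pricing task: sum of positive values over a chunk of paths."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(0.0, 1.0), 0.0) for _ in range(num_paths))

def average_positive_value(total_paths, workers):
    chunk = total_paths // workers  # assumes total_paths divides evenly
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker simulates its own chunk; only partial sums travel back.
        partial_sums = pool.map(simulate_chunk, range(workers), [chunk] * workers)
        return sum(partial_sums) / (chunk * workers)

estimate = average_positive_value(total_paths=20_000, workers=4)
print(round(estimate, 2))
```

Each worker uses its own seed so the chunks are independent, and the aggregation step is the point at which results must be transferred back from the workers.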


Quantifi Compute Engine Architecture

CPU versus I/O

For the Quantifi XVA use case, performance analysis can be broken down into two main components: CPU performance (time spent on calculations) and I/O performance (time taken to transfer and store the data involved in the calculations). In an ideal scenario, Quantifi XVA calculations would be limited almost entirely by CPU performance, since the majority of the workload is due to the large set of computations.

Inherent to the performance of Quantifi XVA is a push-and-pull relationship between CPU and I/O. As calculations complete faster, the time spent on I/O to transfer and save the results increases. This challenge is more pronounced in a distributed architecture, as adding more computing resources also increases the volume of results to be transferred and stored at the same time. Similarly, if calculations become more complex, then the size of the results to store and transfer will increase and affect how quickly I/O can complete.
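A simplified model makes this push and pull concrete. The sketch assumes that compute time scales inversely with the number of cores, that writing results takes a fixed time set by the data store's bandwidth, and that the two phases do not overlap. All input numbers are hypothetical.

```python
def io_share(compute_core_seconds, cores, result_bytes, write_bytes_per_sec):
    """Fraction of total runtime spent on I/O in a non-overlapping toy model."""
    compute_time = compute_core_seconds / cores      # shrinks as cores are added
    io_time = result_bytes / write_bytes_per_sec     # fixed by the data store
    return io_time / (compute_time + io_time)

# Hypothetical workload: 60,000 core-seconds of compute, 50 GB of results,
# and a data store that sustains 500 MB/s of writes.
shares = {
    cores: io_share(60_000, cores, 50e9, 500e6)
    for cores in (176, 248, 320)
}
for cores, share in shares.items():
    print(f"{cores} cores: {share:.0%} of runtime spent on I/O")
```

In this model the I/O share grows with every core added, even though the absolute I/O time is unchanged.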

To determine the impact on Quantifi XVA calculations, a sample environment was created. A dummy portfolio was used to profile the performance of the system with a sample set of XVA calculations. While the calculations completed, system-wide performance metrics were collected to serve as a set of baseline numbers to be examined and compared with those of subsequent tests.
 

Diagram 1: % CPU time used with 1st Gen CPU persisting results to disk

 

Diagram 1 demonstrates the existence of I/O bottlenecks during the XVA calculations. The circled sections mark where the CPUs should be performing calculations but are instead running at lower utilization while they wait to persist result data to disk on the data store. The circled sections account for 38 percent of the overall duration.

Diagram 1 highlights the percentage of CPU time used across the distributed computation system. This metric is used to determine the times when the CPU is running calculations versus when it is idle. Ideally, CPU performance would reach 100 percent and remain at that level for the duration of the calculations. The troughs in the middle of the graph illustrate the presence of I/O, causing the CPU to go into an idle state.
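One way to quantify such troughs from a utilization trace is to measure the share of samples that fall below a chosen threshold. The sketch below uses made-up samples and an arbitrary 90 percent threshold; it is not the method used to produce the figures in this paper.

```python
def low_utilization_share(samples, threshold=90.0):
    """Fraction of equally spaced CPU samples below the threshold."""
    below = sum(1 for s in samples if s < threshold)
    return below / len(samples)

# Hypothetical % CPU readings taken at a fixed interval during a run.
cpu_percent = [98, 99, 97, 45, 30, 52, 96, 99, 40, 35, 98, 99]

share = low_utilization_share(cpu_percent)
print(f"{share:.0%}")  # 42%
```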

The initial analysis indicated that the first step to improve performance was to minimize the time spent on I/O. From the baseline performance statistics and the accompanying graphs, it appears that XVA calculations are being stopped because of the I/O required to transfer and store the results.

To compare performance, another set of calculations was completed, with the results persisted to memory. The expectation was that by eliminating the step of persisting the results back to the data store, the time spent on I/O between calculations should all but disappear.
 

| Runtime with persistence (mm:ss) | Runtime without persistence (mm:ss) | % Decrease |
|---|---|---|
| 9:01 | 6:59 | 23 |

 

Table 1: XVA calculation durations with and without persisting results to disk

 

Diagram 2: CPU time used without persisting results to disk
 

Diagram 2 illustrates the best-case I/O performance using the existing hardware. The circled regions in Diagram 2 show that the dips in CPU utilization are much smaller than those in Diagram 1, particularly in the regions most affected by I/O. The circled regions in Diagram 2 are also 38 percent shorter than they were in Diagram 1.

Comparing the durations in Table 1 with and without persisting, the results highlight that approximately 23 percent of the total duration is spent on the I/O required to save the results. This shows that the I/O related to persisting result data can be improved. The first step was to improve the write I/O performance.
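The 23 percent figure can be reproduced from the two runtimes in Table 1. A minimal sketch, with helper names chosen for illustration:

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' runtime string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def percent_decrease(before, after):
    """Reduction in runtime as a percentage of the 'before' runtime."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (b - a) / b

# Runtimes from Table 1: with and without persisting results.
io_share_pct = percent_decrease("9:01", "6:59")
print(round(io_share_pct))  # 23
```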

Intel recommendations and test results

For the next test, in an effort to improve the write I/O performance and reconcile the difference between the durations collected with and without persisting to the data store, the CPU on the data store was upgraded to Intel’s latest 2nd Generation Xeon Scalable processor.

Intel’s latest CPUs boast a number of features designed to improve I/O performance, including expanded I/O through 48 lanes of PCIe 3.0 bandwidth, integrated Intel QuickAssist Technology (Intel QAT), and access to the Intel Intelligent Storage Acceleration Library (Intel ISA-L). The expanded I/O bandwidth could directly improve the I/O performance out of the box, as more lanes allow more results to be written to the data store in parallel. Integrated Intel QAT promises efficient, enhanced data transport capabilities. Intel ISA-L is a feature that will become more useful as more software adopts it to optimize and improve storage operation performance.

Intel’s new CPUs also support Intel Optane Persistent Memory, an innovative storage technology in which specialized memory provides a faster alternative to traditional disk storage. As a first step, Quantifi replaced the existing 1st Generation Xeon Scalable processor on the data store but kept the storage technology constant and collected the same metrics as the previous test.

 

| 1st Gen CPU runtime (mm:ss) | 2nd Gen CPU runtime (mm:ss) | % Decrease |
|---|---|---|
| 9:01 | 7:11 | 20 |

 

Table 2: XVA calculation durations on 1st Gen CPU vs 2nd Gen CPU

 

| 2nd Gen CPU runtime with persistence (mm:ss) | 1st Gen CPU runtime without persistence (mm:ss) | % Decrease |
|---|---|---|
| 7:11 | 6:59 | 2 |

Table 3: XVA calculation durations on 2nd Gen CPU persisting results to disk vs 1st Gen CPU without persisting results to disk

 

Diagram 3: CPU time used with 2nd Gen CPU persisting results to disk

 

Diagram 3 closely resembles Diagram 2: the circled areas are a similar size. This indicates that the change in CPU has a demonstrable impact on I/O performance. Switching to the 2nd Generation CPU and persisting results to disk yielded similar performance to using Intel’s 1st Generation CPU and persisting results to memory. The circled regions in Diagram 3 are also 40 percent smaller than those in Diagram 1.

Comparing this test to the previous test where results were persisted to memory shows that the change in hardware alone improved the I/O performance of writes considerably. In terms of total duration, the gap in performance between a set of calculations with persistence and a set of calculations without persistence has dramatically reduced using a 2nd Generation CPU. The change in CPU resulted in better write performance, so repeating the test without persisting results with the 2nd Generation CPU was a useful next step to determine whether the I/O performance could be further improved.

| 2nd Gen CPU runtime with persistence (mm:ss) | 2nd Gen CPU runtime without persistence (mm:ss) | % Decrease |
|---|---|---|
| 7:11 | 6:37 | 8 |

 

Table 4: Total duration on 2nd Gen CPU with and without persisting results to disk
 

Table 4 demonstrates that the change from the 1st Generation to the 2nd Generation Xeon Scalable processor reduced the share of the total duration spent on I/O from 23 percent to 8 percent. This is a significant boost to I/O performance from just a CPU change. Quantifi conducted another test, this time using the new Intel Optane persistent memory for the data store. The goal was to determine whether Optane could further enhance I/O performance with the 2nd Generation CPU, given that its read and write speeds are considerably faster than those of the non-volatile memory express (NVMe) drive previously used.

The first results using Optane were not very promising in terms of the overall runtime. The runtimes between NVMe and Optane were very similar, and across all samples fell within the margin of error for elapsed time. As such, the conclusion from this set of results was that performance remained consistent for 2nd Generation CPU with NVMe versus 2nd Generation CPU with Optane.

A more positive outcome was the dramatically improved throughput of the I/O, having switched from NVMe to Optane. The runtimes were very similar from NVMe to Optane, which is likely due to the result data not being large enough to max out the NVMe write speed for long. In a more expensive environment using more computing resources or a larger number of simulation paths, there could be a greater performance benefit from using the Optane drives.

 

| 2nd Gen CPU + NVMe (bytes/sec) | 2nd Gen CPU + Optane (bytes/sec) | % Increase |
|---|---|---|
| 567,351,788 | 1,377,171,632 | 143 |


Table 5: Maximum recorded write pressure on 2nd Gen CPU using the NVMe and Optane

 

The table above, comparing the maximum recorded disk writes in bytes per second, shows that Optane writes data about 143 percent faster than NVMe. Intel claimed that if the size of the results grew larger than NVMe’s max throughput, then the Optane calculation durations would stay consistent while the NVMe durations would become larger. This highlights that Optane provides greater system scalability. When combined with the other I/O performance benefits seen with the 2nd Generation processors, these new CPUs appear to be very attractive in the long term for large-scale XVA calculation workloads.
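To put the two peak write rates in perspective, the sketch below derives the relative difference from Table 5 and estimates how long each device would need to write a large result set. The 200 GB volume is borrowed from the full XVA run described earlier, and the estimate assumes, unrealistically, that the peak rate is sustained for the whole write.

```python
# Maximum recorded write rates from Table 5 (bytes per second).
nvme_peak = 567_351_788
optane_peak = 1_377_171_632

# Relative increase of Optane over NVMe.
increase_pct = 100.0 * (optane_peak - nvme_peak) / nvme_peak
print(round(increase_pct))  # 143

# Assumption: 200 GB (decimal) of results written at a sustained peak rate.
result_bytes = 200e9
nvme_seconds = result_bytes / nvme_peak
optane_seconds = result_bytes / optane_peak
print(f"NVMe: {nvme_seconds:.0f} s, Optane: {optane_seconds:.0f} s")
```

The gap only matters once the result volume is large enough to keep the drive at its limit, which is consistent with the similar runtimes observed on the smaller test workload.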

 

| Number of simulation paths | Average disk writes (bytes/sec) | Duration (mm:ss) | % increase in average disk writes from 2000 paths | % increase in duration from 2000 paths |
|---|---|---|---|---|
| 2000 | 73,053,585 | 7:09 | - | - |
| 3000 | 83,965,349 | 9:17 | 15 | 30 |
| 4000 | 95,406,641 | 11:10 | 31 | 56 |
| 5000 | 108,578,910 | 12:54 | 49 | 80 |

Table 6: Average I/O throughput comparison using a variable number of simulation paths
 

Additional testing was carried out to determine the effects of running with a greater amount of computing resources and an increased number of simulation paths on the system. Metrics collected on a consistent hardware set-up demonstrate the effect of increasing the number of simulation paths by increments of 1000. As the number of simulation paths increases, both the average disk writes and duration jump by a sizeable margin.

In a similar fashion, increasing the amount of computing resources for these calculations also has a direct impact on the average disk writes. Raising the amount of distributed computing resources increases the number of writers persisting results to the data store. The calculations also generally complete faster using more compute cores, up to the point where the I/O becomes the limiting factor.
 

| Number of compute cores | Average disk writes (bytes/sec) | % increase in average disk writes from 176 cores |
|---|---|---|
| 176 | 77,651,854 | - |
| 248 | 85,761,755 | 10 |
| 320 | 93,116,821 | 20 |

 

Table 7: I/O throughput comparison using variable amount of compute node cores
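The percentage column in Table 7 is the change in average disk writes relative to the 176-core baseline. A short sketch of that calculation, using the figures from the table:

```python
# Average disk writes (bytes/sec) by number of compute cores, from Table 7.
avg_writes = {176: 77_651_854, 248: 85_761_755, 320: 93_116_821}

baseline = avg_writes[176]
increase_pct = {
    cores: 100.0 * (writes - baseline) / baseline
    for cores, writes in avg_writes.items()
}

for cores, pct in increase_pct.items():
    print(f"{cores} cores: +{pct:.0f}% average disk writes")
```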

Conclusions

The various tests carried out in this whitepaper demonstrate that leveraging Intel’s latest hardware can accelerate the performance of large-scale XVA workloads by increasing performance of the CPU and improving the efficiency of I/O.

The tests also revealed that the use of Optane persistent memory over more traditional storage offers greater scalability if the XVA workload becomes more expensive. Furthermore, the number of simulations, the amount of computing resources, and portfolio size all stress the system’s I/O performance when writing to the data store. This makes upgrading to Intel’s latest generation processor much more compelling, as it provides the ability to scale with portfolio size, computing resources, and calculation complexity.

 
