The Use of Data Science
Data science applications are used across multiple industries. Obvious power users are high-tech web-based firms like Google, Netflix, Uber and Amazon. Bricks-and-mortar industries like Big Pharma and Logistics firms are also heavy users.
Aside from businesses, the COVID-19 pandemic and concepts like “flattening the curve” and R-naught have suddenly, and tragically, brought data science applications to the forefront of our lives. Applications that are designed to collect and cleanse data, pipe it through models and visualize model outputs are all powered by data science tools. For example, Palantir, a data firm, is helping the NHS in the UK cleanse its data and merge it with other datasets to help plan the response to the pandemic. Similarly, the Johns Hopkins dashboard, which delivers up-to-date information in real time, is powered by Solace, an event streaming and management platform. Likewise, the predictions that are constantly being discussed in the news are powered by forecasting models from MIT, IHME, Columbia and a few others.
The table below outlines a number of use cases across markets and trading, banking, investment management and non-financial risk activities that can benefit from the use of some of these environments.
In the finance and banking industry there has been adoption on the core banking side, for example, to model customer behaviour, assess credit for borrowers etc. Within capital markets, and specifically in risk management, some of the big banks have invested heavily over the past few years to build their own proprietary platforms. Tier 2 and Tier 3 banks, along with smaller buy-side institutions, are primarily where we see an opportunity for third-party solutions.
A data science powered risk analytics platform for these firms would have three different components.
The first is the data component, which includes integrated security master, portfolio risk and financial data. On the data management side, this would involve on-demand normalization and curation of the data. Data no longer resides in a database somewhere; it is streaming over the cloud and needs to be normalized in real time. The second is the analysis component, which includes cross-asset financial model libraries as well as AI and machine learning tools. The final component is BI and visualization, which includes third-party tools such as Power BI and Tableau.
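As a rough illustration of what normalizing streamed data on arrival can look like, the sketch below maps records from two hypothetical feeds onto one common schema. The field aliases, the sample records and the target schema (ticker, price, UTC timestamp) are assumptions made for this example and are not taken from any particular platform.

```python
from datetime import datetime, timezone

# Hypothetical mapping from vendor-specific field names to a common schema.
FIELD_ALIASES = {
    "sym": "ticker", "symbol": "ticker",
    "px": "price", "last": "price",
    "ts": "timestamp", "time": "timestamp",
}

def normalize(record):
    """Return a copy of one raw record using the common field names and types."""
    out = {}
    for key, value in record.items():
        name = key.lower()
        out[FIELD_ALIASES.get(name, name)] = value

    out["ticker"] = str(out["ticker"]).strip().upper()
    out["price"] = float(out["price"])

    ts = out["timestamp"]
    if isinstance(ts, (int, float)):
        # Numeric timestamps are assumed to be epoch seconds.
        ts = datetime.fromtimestamp(ts, tz=timezone.utc)
    else:
        ts = datetime.fromisoformat(ts)
        if ts.tzinfo is None:
            # Naive timestamps are assumed to already be in UTC.
            ts = ts.replace(tzinfo=timezone.utc)
    out["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    return out

def normalized_stream(records):
    """Normalize records lazily, one at a time, as they arrive."""
    for record in records:
        yield normalize(record)

# Two made-up records that describe the same instrument in different shapes.
raw_feed = [
    {"Sym": " ibm ", "PX": "121.5", "ts": 1588325400},
    {"symbol": "IBM", "last": 121.7, "time": "2020-05-01T10:30:01+01:00"},
]
rows = list(normalized_stream(raw_feed))
for row in rows:
    print(row)
```

Using a generator means each record is normalized as it is consumed, which is closer to a streaming source such as a message queue than loading a full table first.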
All of this needs to be implemented in a development environment or a platform that provides a fast cycle of model development, from experimentation to production, while also enforcing a strong governance structure. This eliminates the fragmentation between the production system and the analysis systems, typically Excel workbooks, that traders, analysts and quants use across the organization. It also makes it easier to combine internal data sets with external structured and unstructured data sets, and provides the agility required for experimentation within a production environment.
There are a number of open source tools, as well as third-party applications, available in the market that are designed to support the data science process:
- Ingesting Data – Files, Data Feeds, SQL, HDFS and Kafka
- Wrangling Data – Refine and Python
- Modelling – Python, Jupyter, R, RStudio, VS Code along with the Financial & Risk Model Libraries that will be required
- Testing – DataOps
- Publishing – Python, Dash or third-party applications like Power BI and Tableau
How is Quantifi Leveraging Data Science?
Quantifi has stayed ahead of the competition by continuing to make smart investments in emerging technologies and next-generation approaches including data science. A common use case that Quantifi typically sees from clients is leveraging third-party visualization tools to report on portfolio and risk data. This essentially involves publishing data, not just from Quantifi models, in a format that can be processed by a third-party reporting tool.
A more interesting use case that Quantifi has encountered is back-testing. Here a client might have mixed datasets from diverse sources, coupled with open source tools (Python, Jupyter, RStudio, etc) and market standard financial model libraries. This creates an ideal platform for back-testing analysis. Back-testing involves large amounts of data along with the financial models that one would use on top of that data. One example of back-testing includes portfolio and product structuring. If a portfolio manager constructs a portfolio or a trader structures a product, before they execute on the portfolio or product they back-test it against current historical data or stressed historical data to anticipate how the portfolio or product would perform.
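The portfolio back-test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: fixed weights rebalanced daily, no transaction costs, and two small made-up return scenarios standing in for historical and stressed data. The function names are hypothetical and do not refer to any Quantifi API.

```python
def portfolio_returns(weights, asset_returns):
    """Daily portfolio return for fixed weights (assumes daily rebalancing)."""
    return [sum(w * r for w, r in zip(weights, day)) for day in asset_returns]

def cumulative_return(returns):
    """Compound a series of simple returns into a total return."""
    value = 1.0
    for r in returns:
        value *= 1.0 + r
    return value - 1.0

def max_drawdown(returns):
    """Largest peak-to-trough fall in portfolio value, as a positive fraction."""
    value, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# Made-up scenarios: each tuple holds one day of returns for two assets.
historical = [(0.010, -0.002), (-0.020, 0.004), (0.015, 0.001), (-0.005, 0.003)]
stressed = [(-0.080, 0.010), (-0.050, 0.020), (0.030, -0.010), (-0.040, 0.005)]
weights = (0.6, 0.4)  # candidate portfolio to test before execution

for label, scenario in (("historical", historical), ("stressed", stressed)):
    rets = portfolio_returns(weights, scenario)
    print(label, round(cumulative_return(rets), 4), round(max_drawdown(rets), 4))
```

In practice the scenarios would come from the mixed datasets mentioned above and the valuation would use proper financial model libraries rather than simple weighted returns.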
Trading strategies like algorithmic trading would be an obvious example. Another example would be correlation trading, where you are taking a view on certain risk factors like correlation. In this case you build a structure that is essentially hedging all the other risk factors and only exposing you to correlation risk. This sort of strategy is driven by risk-neutral hedges but traders also need to assess how it would perform in a real-world environment. By back-testing it they can see if the performance based on the risk-neutral strategy is closely replicated in the real world. Hedging strategies are common on the buy-side as well as the sell-side, and a good example would be FX hedge balancing on an intraday or end-of-day basis.
Lastly, there are use cases that are driven by regulation. If a bank qualifies for a model-based regulatory capital approach (as opposed to a standard approach), its models need to be validated regularly. The regulators will ensure that the bank has proper processes in place to validate those models. For example, with FRTB, there are simulation-based measures like VaR and Expected Shortfall that would be used if a bank wants to avail itself of the advanced model-based approach. In this instance, one of the conditions is that banks need to regularly back-test those models to ensure that they perform as expected. The same applies to measures such as Potential Future Exposure.
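A simple way to picture this kind of model back-test is to compare each day's realised loss with the VaR the model predicted from the preceding window, and to count the days on which the loss exceeded it. The sketch below uses historical-simulation VaR on synthetic P&L. The 250-day window, the 99% and 97.5% confidence levels and the quantile convention are assumptions chosen for illustration; the actual regulatory tests have their own detailed rules.

```python
import random

def historical_var(pnl_window, confidence=0.99):
    """Historical-simulation VaR, returned as a positive loss figure."""
    losses = sorted(-p for p in pnl_window)  # losses in ascending order
    # One of several possible quantile conventions (assumption).
    index = min(int(confidence * len(losses)), len(losses) - 1)
    return losses[index]

def expected_shortfall(pnl_window, confidence=0.975):
    """Average of the losses at or beyond the VaR level."""
    losses = sorted(-p for p in pnl_window)
    index = min(int(confidence * len(losses)), len(losses) - 1)
    tail = losses[index:]
    return sum(tail) / len(tail)

def count_exceptions(pnl, window=250, confidence=0.99):
    """Count days whose loss exceeded the VaR estimated from the prior window."""
    exceptions = 0
    for t in range(window, len(pnl)):
        var_t = historical_var(pnl[t - window:t], confidence)
        if -pnl[t] > var_t:
            exceptions += 1
    return exceptions

# Synthetic daily P&L for illustration only, not real data.
rng = random.Random(42)
pnl = [rng.gauss(0.0, 1.0) for _ in range(750)]

n_tested = len(pnl) - 250
n_exceptions = count_exceptions(pnl)
print(f"{n_exceptions} exceptions over {n_tested} test days")
print("latest ES estimate:", round(expected_shortfall(pnl[-250:]), 3))
```

If the model is well calibrated, the share of exception days should be close to one minus the confidence level; a materially higher count suggests the model understates risk.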
Portfolio Construction and Optimization
Portfolio construction and optimization is another area in which artificial intelligence (AI) and machine learning algorithms are frequently used. By leveraging novel optimization techniques and multiple structured and unstructured data sources, firms can make better investment decisions. For example, with trading strategies based on price, Quantifi has collaborated with a firm that uses AI and machine learning to forecast bond prices based on data analysis. This AI firm uses Quantifi models for the risk metrics on their platform.
Alternatively, event based trading strategies involve forecasting defaults, earnings, corporate actions, and then structuring portfolios to take advantage of possible arbitrage opportunities when such events occur.
Another option is a weight optimization strategy, in which portfolios that track benchmarks or model portfolios are weighted using risk metrics such as variance, returns, Sharpe ratios and duration, with AI and machine learning algorithms used to make better investment decisions.
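To make the idea of weight optimization concrete, the sketch below searches for the split between two assets that gives the highest Sharpe ratio on a sample of returns. It is a deliberately simple stand-in: a long-only grid search over two made-up return series, with a per-period (not annualised) Sharpe ratio and a zero risk-free rate by default. It does not represent the AI or machine learning techniques referred to above.

```python
from statistics import mean, stdev

def sharpe_ratio(returns, risk_free=0.0):
    """Per-period Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free for r in returns]
    sd = stdev(excess)
    # The ratio is undefined for zero volatility; this sketch ranks it last.
    return mean(excess) / sd if sd > 0 else float("-inf")

def best_weight(returns_a, returns_b, steps=100):
    """Grid-search the long-only weight on asset A that maximises the Sharpe ratio."""
    best_w, best_s = 0.0, float("-inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(returns_a, returns_b)]
        s = sharpe_ratio(blended)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Made-up return series for two assets.
asset_a = [0.020, -0.010, 0.030, -0.020, 0.025, 0.010]
asset_b = [0.004, 0.006, -0.002, 0.008, 0.001, 0.005]

weight_a, sharpe = best_weight(asset_a, asset_b)
print(f"weight on A: {weight_a:.2f}, weight on B: {1 - weight_a:.2f}")
print(f"Sharpe ratio of the blend: {sharpe:.3f}")
```

A production optimizer would handle many assets, constraints such as duration or tracking error limits, and estimation error in the inputs; the grid search only shows the shape of the problem.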
To complement and enable these strategies, firms are increasingly seeking out alternative data sources. There is a wealth of data that is digitized and easily available and more firms are using web scraping, crowd-sourced data, and social media along with image recognition and natural language processing.
We are currently undergoing a shift towards new data science tools. The current COVID-19 pandemic is likely to accelerate this trend given the need to generate forward-looking insights and support business decisions in a collaborative and time-compressed manner. As previously mentioned in Part 1 of this blog, ‘How is Data Science Transforming Banking and Capital Markets’, Excel is a tool for ad-hoc analysis and forward-looking simulations. Whilst Excel is not going away anytime soon, there has been a concerted effort to move away from a heavy reliance on it, due to its limitations in performance and data volumes and its lack of collaborative features.
Secondly, while the application of data science approaches holds significant promise, there are several caution points and considerations that financial institutions must take into account. Chief among them is that strong data and model management foundations are still very much required. As with any technology solution, data science is not a panacea for all the complex data and analytical challenges that firms may face. The implementation of some of these platforms assumes that there are relatively established data quality processes in place. Most data science tools and platforms are not meant to address fundamental data governance and assurance activities, even though they contain facilities for data handling, data wrangling and management.
As more advanced analytics and AI-based models are deployed, we expect regulators and business stakeholders to require a more appropriate, fit-for-purpose model risk governance process. This will not only cover conventional models but also AI-based algorithms, especially those that are financially material when embedded within a firm’s business decisions. We expect there to be an increased focus on explainable AI as end users and clients will want to understand the nature and insights of a smart algorithm, the underlying data sets and the potential for bias before employing them within their own organizations. Taking this into consideration when designing models will be beneficial as it will save retrofitting costs and provide a platform that can facilitate transparency requirements.
Thirdly, there are non-conventional data sets that will increasingly be used in conjunction with existing structured financial and market data. However, at present, there is still a fair degree of friction around data ingestion and wrangling of new and alternative data sets, as they do not originate from the financial sector, so the taxonomies may be different. The efficacy of some of these new data sets and the correlation with prevailing trading and investment patterns will need to be tested and analysed. This must be done in a relatively time compressed manner in order to reduce information commoditization and to prevent time related value decay.
Overall, this is still an early stage of development, and data science, machine learning technologies and IT practices continue to change rapidly. Python, Java and other languages used for machine learning dominate the discussion today, but other languages on the fringe, such as Julia, are also being explored. Over the next few years, we expect some of these tools and languages to become less clunky, more industrialized and better streamlined compared to what we have today. These tools will develop more intuitive user interfaces, collaborative features and workflows, as well as AI-based data quality routines.
In the coming years we anticipate that firms will look for converged, open data science offerings that can integrate new tools and new languages and allow different development stacks to coexist. These offerings also provide the opportunity to shield quants and data scientists from low-level features and infrastructure administration activities. For example, some quants are involved with certain administrative aspects of AWS infrastructure, although that may not be core to what they do. Implementing a platform that shields them from some of these lower-level activities would help productivity.
With packaged platforms, the net effect is to lower the operational risks and barriers to embracing data science and machine learning deployments. These platforms also better enable firms to scale up and we expect this to play out in the next few years as some of the tools and languages mature and there are more packaged and end-to-end offerings in the marketplace.