Richard Skeggs

Streaming Big Data in Finance
Date: July 2017
No: 2017-21

Exploring data. Enhancing knowledge. Empowering society.
W: www.BLGdataresearch.org

Abstract

The buzz word in business for the last couple of years has been Big Data. There is a wealth of presentations, papers and products within this area, and it grows year on year. Most of these talk about ingestion, map reduce and data mining. The growth of streaming data produces its own problems and requires a new approach. According to Forrester Research, streaming big data analytics is defined as "Tools that allow a business to process and act on massive amounts of information while it's still moving, as opposed to waiting for data to come to rest in a data warehouse or Hadoop. The technology is being used increasingly as new sources of data become common, such as streaming sensor data from the Internet of Things, streaming social media data like Twitter, and streaming mobile information from apps." The traditional approach to big data has been to store the data and then process it. The processing can be performed in multiple ways and blended with other data sources to extract more meaning. This approach is not always possible with data streams. The questions then become:

- What meaning can we extract from the stream, and how?
- Can we blend this with other datasets?
- Do we need to store the data for historical reasons?

Introduction

When looking at the use of streaming data within finance, perhaps the most obvious scenario is the streaming of stock and financial instrument prices from the various stock exchanges. Though it is the obvious scenario, it is far from the only one: streaming data analytics is also important within fraud detection for bank accounts and credit cards, as is the real-time analysis of sales data within large high street chains. The insight provided by real-time analytics can make a real difference to a business, whether that is an improvement to the customer experience or the ability to take immediate action based on a discovered insight. This paper will start off by looking at some of the tools and techniques that can be used to process data streams at, near to, or in real time. There are existing solutions from the Apache Foundation, as well as solutions from Amazon heavily based around their cloud platform; Microsoft and Google are also producing tools capable of handling streaming data. The paper will also look at the use of visualisation, the blending of data and the benefits that machine learning can provide, as well as the techniques used by the fintech sector to gain an advantage.

Existing Solutions

The Apache Foundation has a number of tools within its stable that can handle the streaming of data. These can be split between the traditional approach of batching the data prior to processing, and the processing of data as it is streamed. Apache Spark is a low-latency in-memory processing engine: data is loaded into memory and can be queried repeatedly.


The memory load process creates small batches that are run in near real time. At the core of Spark is an advanced directed acyclic graph (DAG) engine supporting cyclic data flow; a Spark job creates a set of DAG tasks to be performed on the cluster. Spark can be used for the processing of sensor and log data.

Apache Storm is a distributed real-time computation system designed to process streaming data. Processing within Storm follows a typical DAG dataflow, and Storm's integration with any programming language allows developers to build custom distributed processing engines. There are similarities between Storm topologies and map reduce jobs, but the biggest difference is that a map reduce job will eventually terminate whereas a Storm process does not. Storm can be used for the cleaning, normalisation and analysis of high-volume data with low latency.

Apache Flink is yet another stream processing framework, providing a high-performance, high-availability processing engine. Flink offers a data streaming API that allows the data stream to be transformed by filtering, clustering and state manipulation. Data is processed in parallel across the distributed platform, and the application can handle unreliable data where records are received out of sequence.

Apache Apex is a YARN-native platform that unifies stream and batch processing, designed to process big data in motion. Its API allows developers to write the business logic behind the application. Processing is performed in chunks in memory within the cluster, making real-time batch processing possible.

Apache Kafka is another open source stream processing platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Data is partitioned into topics before being indexed and stored with a timestamp. Kafka is a good solution for large-scale message processing applications, and can be used with Apache Storm to build data pipelines for high-speed filtering and pattern matching on the fly.

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security and resource management. Streams are divided into partitions, each an ordered sequence with a unique ID. Samza supports batching and is typically used with Hadoop's YARN and Apache Kafka.

The approaches to processing streaming data can be split into two camps: the pure streaming of data, or the real-time batch processing of data. The real-time batch processing engines Spark, Apex and Flink take the stream, chunk the data and process it; Spark also has the capacity to work as a pure stream processing engine. Apex is reliant on Hadoop as it integrates with YARN. Storm and Kafka concentrate more on streaming data. Samza, like Apex, relies on Hadoop through its YARN integration, but also works in conjunction with Kafka.

Amazon Kinesis reads data from a stream, which can be used to aggregate data that in turn populates a dashboard. The types of data used include IT infrastructure log data, application logs, social media, market data feeds and web clickstream data. Because the response time for data intake and processing is in real time, the processing is typically lightweight.
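To make the micro-batch model concrete, the sketch below shows how a Spark Streaming job might compute a per-symbol average price over each batch. It is a minimal illustration only: the socket source, the port and the "SYMBOL,PRICE" message format are assumptions, not part of any platform described above.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PriceTicker")
    ssc = StreamingContext(sc, batchDuration=5)  # chunk the stream into 5-second micro-batches

    # Assumed message format: "SYMBOL,PRICE" arriving on a local socket.
    lines = ssc.socketTextStream("localhost", 9999)
    prices = lines.map(lambda l: l.split(",")).map(lambda f: (f[0], float(f[1])))

    # Average price per symbol within each micro-batch.
    averages = (prices.mapValues(lambda p: (p, 1))
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                      .mapValues(lambda s: s[0] / s[1]))
    averages.pprint()

    ssc.start()
    ssc.awaitTermination()

Each five-second batch is processed as a small, self-contained job, which is exactly the "chunk the data and process" pattern described above.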
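For the pure-streaming camp, a consumer can instead act on each message as it arrives. The sketch below uses the kafka-python client; the "trades" topic name, the JSON message shape and the threshold are all hypothetical.

    import json
    from kafka import KafkaConsumer

    # Subscribe to an assumed "trades" topic on a local broker.
    consumer = KafkaConsumer(
        "trades",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Act on each message individually, rather than batching first.
    for message in consumer:
        trade = message.value  # e.g. {"symbol": "VOD", "amount": 1250.0}
        if trade["amount"] > 1000:
            print("large trade:", trade)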


Programming languages such as Java and Python have, within their libraries, tools to help with the analysis of big data. Within Python, the Pandas library allows for the manipulation of data within a dataset, while Tweepy allows access to the Twitter stream. Most of the Hadoop stack has an API accessible via Java.

Existing Techniques

The traditional approach has been to stream data onto a big data platform such as Hadoop and then process it from there. The processing has often meant the production of some sort of visualisation or dashboard. The benefit of visualisations is that they can provide a clear and concise view of a large dataset, but that does not always provide a good insight into the data. If the data has been stored, processed and then presented, does that mean the visualisation is out of date? Is the time delay in the processing of the data acceptable to the user?

Visualisation

The art of data mining can help users to deal with the flood of information that is synonymous with this age of data, and the advantage of visual data exploration is that the user is directly involved in the data mining process [1]. Presenting data in an easy-to-understand pictorial format can help unlock patterns and trends that may not be apparent when looking at raw data. The use of visualisations provides an interesting insight into data through a dynamic graphical representation, and with streaming data the ability to understand and identify patterns becomes even more important. The more traditional approach to dealing with live, dynamic data streams has been to employ a programming language such as Java, Python or R together with a suitable API such as the Google Visualisation API. Graphics packages such as Datawatch, Alooma and Zoomdata can also provide an integration with Apache Spark. Popular tools like Microsoft Power BI, Qlik and Tableau can be used to create visualisations without the need to know a programming language; these tools can perform some limited or simplistic data blending through point and click with a mouse. As the tools become more dynamic in their ability to create advanced interactive graphics, analysts can now generate dashboards based on dynamically streamed data.

Fintech

The financial technology industry (fintech) describes how technology and finance have combined, and is best defined as "a new financial industry that applies technology to improve financial activities" [2]. Fintech companies are utilising the software and tools available to extract meaning from the large volumes of data they are faced with. Within this sector the ability to process data quickly and accurately can give a major boost to organisations, and to facilitate this boost fintech companies are using an array of existing tools and techniques, as the sketch below illustrates.
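One such tool is the Tweepy library mentioned earlier, which can tap the live Twitter stream. The sketch below assumes the pre-v4 Tweepy API of the time and placeholder credentials; the tracked keyword is purely illustrative.

    import tweepy

    # Print each matching tweet as it arrives from the stream.
    class PrintListener(tweepy.StreamListener):
        def on_status(self, status):
            print(status.text)

    # Placeholder credentials - real keys come from a Twitter developer account.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    tweepy.Stream(auth=auth, listener=PrintListener()).filter(track=["fintech"])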


The use of social media can give a fintech a surprising insight into customer behaviour and, as a result, may provide a more accurate guide to credit risk than traditional routes. This view was reinforced by a 2015 study [3] which found that "scores can become more accurate as a result of modifications in social networks". Using a traditional dataset in an untraditional manner can give unforeseen benefits. There is also the risk, from the customer's side, that what was meant as an unguarded comment could potentially have a significant side effect. The use of social media just underlines the business adage of "know thy customer".

Data is the new currency, and fintechs are taking this to heart [4]. Articles have been highlighting this fact for a while, and social media companies have built businesses on this very fact. A 2017 study [5] found that a "majority of users become reactant if they are consciously deprived of control over their personal data". With the use of machine learning in business decision making becoming more prevalent, there is a potential that customers may be less willing to supply data.

Machine Learning

The use of machine learning by fintechs covers a range of tasks, from better understanding the customer to the security concerns around fraud. In 2016, Financial Fraud Action UK estimated losses in the UK totalled £768.8m. Fintechs are using machine learning to tackle the rise in financial fraud [6]. The machine learning tools available to fintechs include classification, regression, clustering, prediction, outlier detection and visualisation. In 2013 the British Bankers' Association estimated that customers in the UK used internet banking 800,000 times an hour; processing this volume of data for fraud detection alone requires the ability to process streaming data. The use of Hadoop as part of a fraud detection solution has been researched [7]; the platform also encompasses machine learning tools, while Spark and Storm provide a platform for analysing streamed data.

Normally, when talking about the processing of streaming data within the financial sector, the analysis of financial markets is probably the most obvious topic, and here fintechs are also making waves, with the latest tools from the big data arena making an appearance. A 2015 paper [9] looks at the use of a Hadoop Map-Reduce and neural network solution to predict patterns in the stock market, while a 2009 paper [8] uses textual analysis of stock market news to predict movements in stock prices. The use of machine learning tools within the fintech sector to predict stock price movement is growing; searching any academic library will show the growth of research in this field. Banks and other financial institutions are always looking for an edge, whether that is to maximise profit or to identify fraudulent transactions.
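As an illustration of the outlier detection mentioned above, the sketch below flags anomalous transactions with scikit-learn's IsolationForest. The features (amount, hour of day) and the synthetic training data are hypothetical; a real fraud model would be trained on historical transactions.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Synthetic "normal" behaviour: modest amounts, daytime hours.
    rng = np.random.RandomState(0)
    history = np.column_stack([rng.normal(50, 20, 1000),   # amount
                               rng.normal(14, 3, 1000)])   # hour of day

    model = IsolationForest(contamination=0.01, random_state=0).fit(history)

    # Score incoming transactions as they arrive from the stream.
    incoming = np.array([[45.0, 15.0],     # routine purchase
                         [9500.0, 3.0]])   # large amount at 3am
    print(model.predict(incoming))         # 1 = normal, -1 = flagged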


Natural Language Processing

As has been seen earlier [8], natural language processing (NLP) can provide an insight into news stories that may have an impact on the movement of stock prices; being able to predict that movement and react quickly can give a fintech an edge. NLP can be useful in the searching of highly contextual textual documents [10], aided by the fact that the search could potentially span multiple data sources [11], which could include streamed data. Fintechs are also looking at utilising natural language interfaces to databases (NLIDB) [11] alongside natural language processing. These techniques provide an insight into the requirements of customers based on their interactions with systems. An example could be a customer asking "How much am I spending on pet food?"; NLIDB tools can convert this to a query providing an actual financial cost. Alternatively, through the analysis of customer queries, organisations can get a stronger view of customer requirements, which comes back to "know thy customer".

Blending Data

Using multiple datasets in a blended or merged fashion can provide an interesting insight that a single data source may not. The problem arises when the multiple datasets are all streamed: a timing issue between the differing data streams may prevent the data from being blended successfully. With streaming data it is both possible and desirable to blend it with a static data source. The benefits of blending data are well known, as it allows analysts insights that are not always available from a single dataset. Given the disparate nature of data, blending data from the likes of social media and email with other traditional data formats like JSON and XML can be problematic. Traditionally this has been overcome by converting the streams into a standard format using programming tools within Python or R, through the use of data frames. The R extract below combines two data frames by joining on the columns V4 and V2:

    merged.data <- merge(zoopla, companieshouse, by.x = "V4", by.y = "V2")

The tools within R for handling streaming data that is not Twitter based are still limited, and the R stream package does not lend itself to blending with other data sources. This is where more powerful languages such as Java, or in-memory engines such as Hive or Spark, come into their own, though there is then the risk that the data becomes processed as a batch rather than a stream. Natural language processing can be used to convert a thin data stream into a thick stream through blending: data from the stream can be extracted and then used as input for querying external data sources.
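A minimal sketch of the thin-to-thick blending described above, joining a micro-batch from a stream against a static reference table with the Pandas library mentioned earlier. The file name and column names are illustrative assumptions.

    import pandas as pd

    # Static reference data, loaded once (for example, a company register).
    companies = pd.read_csv("companies.csv")  # assumed columns: company_id, name, sector

    # A small micro-batch of records arriving from the stream.
    batch = pd.DataFrame({"company_id": [101, 202],
                          "price": [210.5, 99.0]})

    # Enrich the thin stream with the static attributes - a thicker stream out.
    enriched = batch.merge(companies, on="company_id", how="left")
    print(enriched)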


[Figure: an input stream passes through a blend process, drawing on external database repositories, to produce an enriched output stream]

The diagram above gives an overview of the process that blends streams with external repositories.

Conclusion

Traditionally the approach has been to store data on a platform for processing in a batch environment. The use of stream processing allows the data to be processed as it is received, which raises the question: does the processed data need to be stored for further use? The batch processing approach is good if the delay in processing does not prove too costly. In the world of finance, where delays can make a difference financially, the ability to process data in a quick and timely fashion is important. It is becoming increasingly apparent that streaming data processing matters, with stream transform load (STL) superseding extract transform load (ETL), although with stream processing there may be no load step at all.

References

1. Keim, D.A., 2002. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(1), pp.1-8.

2. Schueffel, P., 2017. Taming the Beast: A Scientific Definition of Fintech. Journal of Innovation Management, 4(4), pp.32-54. ISSN 2183-0606.

3. Wei, Y., Yildirim, P., Van den Bulte, C. and Dellarocas, C., 2015. Credit scoring with social network data. Marketing Science, 35(2), pp.234-258.

4. Hardy, Q., 2012. Just the Facts. Yes, All of Them. The New York Times, 24 March 2012. http://query.nytimes.com/gst/fullpage.html?res=9A0CE7DD153CF936A15750C0A9649D8B63&pagewanted=all (last accessed 1 January 2015).


5. Spiekermann, S. and Korunovska, J., 2017. Towards a value theory for personal data. Journal of Information Technology, 32(1), pp.62-84.

6. Ngai, E.W.T., Hu, Y., Wong, Y.H., Chen, Y. and Sun, X., 2011. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), pp.559-569.

7. Halvaiee, N.S. and Akbari, M.K., 2014. A novel model for credit card fraud detection using Artificial Immune Systems. Applied Soft Computing, 24, pp.40-49.

8. Schumaker, R.P. and Chen, H., 2009. Textual analysis of stock market prediction using breaking financial news: The AZFinText system. ACM Transactions on Information Systems (TOIS), 27(2), p.12.

9. Dubey, A.K., Jain, V. and Mittal, A.P., 2015. Stock market prediction using Hadoop Map-Reduce ecosystem. In: Computing for Sustainable Global Development (INDIACom), 2015 2nd International Conference on, pp.616-621. IEEE.

10. Dhar, V. and Stein, R.M., 2016. FinTech Platforms and Strategy.

11. Skeggs, R., Lauria, S. and Swift, S., 2017. Federated Searching with Natural Language Interface to Database.

