IBM Research - Tokyo

StreamWeb: Real-Time Web Monitoring with Stream Computing

Toyotaro Suzumura (1,2) and Tomoaki Oiki (2)
(1) IBM Research – Tokyo
(2) Tokyo Institute of Technology

IEEE ICWS 2011 (International Conference on Web Services) 11/07/07

© International Business Machines Corporation 2011


Executive Summary
§  We propose a real-time web monitoring system called “StreamWeb” that handles the large amounts of streaming social data available from the Web and analyzes that data in real time on top of a stream computing platform.


Outline
§  Background and Motivation
§  Stream Computing and System S
§  Real-Time Web Monitoring System
§  System Evaluation
§  Concluding Remarks and Future Work


Background – Growth of Streaming Social Data
§  A major recent trend is Web services that offer streaming APIs, which allow end users or partners to retrieve the real-time streaming data published by those services.
§  Examples include the Twitter Streaming API and the Facebook Open Stream API. This trend will greatly affect the world and lead to innovative services.


Motivation: Real-Time Web Monitoring
§  We need a real-time web monitoring system that handles the large amounts of streaming data available from the Web and analyzes that data in real time, for applications such as real-time pandemic prediction, marketing, and economic indicators (GDP, Consumer Price Index, …).


Problem Statement

Prior art is built as monolithic, special-purpose applications.

§  Social Web Monitoring
   –  Google Flu Trends (http://www.google.org/flutrends/)
      •  Ginsberg (Google), Detecting influenza epidemics using search engine query data, Nature 2008
   –  Earthquake Real-time Monitoring from Twitter [Sakaki, WWW’10]
   →  Built as a monolithic architecture and special-purpose application
§  MapReduce Programming Model [Dean, OSDI’04]
   –  We focus on real-timeliness and response time as well as throughput, and MapReduce and Hadoop are unsatisfactory in those respects.

[Sakaki, WWW’10] Earthquake shakes Twitter users: real-time event detection by social sensors
[Dean, OSDI’04] MapReduce: Simplified Data Processing on Large Clusters


Stream Computing and System S
§  System S: a stream computing middleware developed by IBM Research (now productized as “InfoSphere Streams”)
§  A middleware platform that processes massive amounts of data in memory, rather than first storing the data on disk as in the traditional model
[Figure: Traditional Computing (fact finding with data at rest) vs. Stream Computing (insights from data in motion)]


System S Programming Model
[Figure: application programming in SPADE with source adapters, an operator repository, and sink adapters, followed by platform-optimized compilation]


SPADE: Advantages of Stream Processing as Parallelization Model
§  A stream-centric programming language dedicated to data stream processing
§  Streams as first-class entities
   –  Explicit task and data parallelism
   –  Intuitive way to exploit multiple cores and multiple nodes
§  Operator and data source profiling for better resource management
§  Reuse of operators across stored and live data
§  Support for User-Defined OPerators (UDOPs) implemented in either C/C++ or Java


A SPADE Example
[Figure: data flow Source → Aggregate → Functor → Sink]

[Application]
SourceSink trace

[Nodepool]
Nodepool np := ("host1", "host2", "host3")

[Program]
// virtual schema declaration
vstream Sensor (id : id_t, location : Double, light : Float,
                temperature : Float, timestamp : timestamp_t)

// a source stream is generated by a Source operator – in this case tuples come from an input file
stream SenSource ( schemaof(Sensor) )
    := Source( ) [ "file:///SenSource.dat" ] {}
    -> node(np, 0)

// this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input
stream SenAggregator ( schemaof(Sensor) )
    := Aggregate( SenSource ) [ id . location ]
       { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) }
    -> node(np, 1)

// this intermediate stream is produced by a functor operator
stream SenFunctor ( id: Integer, location: Double, message: String )
    := Functor( SenAggregator ) [ log(temperature,2.0)>6.0 ]
       { id, location, "Node "+toString(id)+ " at location "+toString(location) }
    -> node(np, 2)

// result management is done by a sink operator – in this case produced tuples are sent to a socket
Null := Sink( SenFunctor ) [ "udp://192.168.0.144:5500/" ] {}
    -> node(np, 0)
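For readers unfamiliar with SPADE, the data flow of this example can be loosely mirrored with Python generators. This is only a rough analogue under stated assumptions, not how System S executes the program: the fixed-size batch in `aggregate` stands in for a window that the slide does not specify, `log(temperature,2.0)` is read as a base-2 logarithm, and the names `SensorTuple`, `aggregate`, and `functor` are hypothetical.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class SensorTuple(NamedTuple):
    """Mirrors the fields of the Sensor schema in the SPADE example."""
    id: int
    location: float
    light: float
    temperature: float
    timestamp: float


def aggregate(source, batch):
    """Group every `batch` tuples by (id, location) and emit one summary per group.

    Assumption: a fixed-size batch replaces the unspecified SPADE window.
    A trailing partial batch is dropped.
    """
    buf = []
    for item in source:
        buf.append(item)
        if len(buf) < batch:
            continue
        groups = defaultdict(list)
        for t in buf:
            groups[(t.id, t.location)].append(t)
        for (id_, location), items in groups.items():
            yield SensorTuple(
                id_,
                location,
                max(t.light for t in items),                    # Max(light)
                min(t.temperature for t in items),              # Min(temperature)
                sum(t.timestamp for t in items) / len(items),   # Avg(timestamp)
            )
        buf = []


def functor(stream):
    """Keep tuples whose base-2 log of temperature exceeds 6 and build a message."""
    for t in stream:
        if math.log(t.temperature, 2.0) > 6.0:
            yield (t.id, t.location, f"Node {t.id} at location {t.location}")
```

Chaining the stages as `functor(aggregate(source, batch=100))` corresponds to the Source → Aggregate → Functor path; the `-> node(np, k)` placement has no counterpart in this single-process sketch.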


InfoSphere Streams Runtime
§  The optimizing scheduler assigns operators to processing nodes and continually manages resource allocation.
[Figure: Processing Element Containers running over the Streams Data Fabric transport on a cluster of x86, FPGA, and Cell blades]


StreamWeb: Real-Time Web Monitoring with Stream Computing
§  We propose a real-time web monitoring system called “StreamWeb” that handles the large amounts of streaming social data available from the Web and analyzes that data in real time on top of a stream computing platform.


System Requirements for StreamWeb
§  Generality and Extensibility
   –  The system needs to add and monitor additional data sources as new data sources become available.
   –  The system needs to support various analytics algorithms and two kinds of Web services: push-based Web services (e.g. the Twitter Streaming API) and pull-based Web services (e.g. the Twitter Search Service).
§  Programmability and Software Productivity
   –  The system needs to provide an easy-to-use programming model that allows end users to write new analytical algorithms without worrying about performance and scalability issues.
§  Performance and Scalability
   –  The system should scale as the volume of data grows.
   –  The system should handle major surges dynamically, since the number of messages varies with the time of day and the situation, such as when a special event is taking place.


Overall StreamWeb Architecture
[Figure: StreamWeb architecture.
 Visualization Tier: Web browser; Web applications (e.g. visualization via a map, or display of statistics only).
 Real-time Analytics Tier: Real-time Analytics Engines; Streaming Data Collectors; Streaming Translator; Web scraping.
 Data sources: streaming Web services with a streaming API (push); Web services with a REST API or RSS (pull); Web sites without an API (pull).
 External Web services shown: maps (e.g. Google Map, Yahoo Map), SNS (e.g. Facebook), photo sharing (e.g. Flickr).]


Real-Time Analytics Tier
§  This tier comprises the Streaming Data Collector (SDC) and the Real-Time Analytics Engine (RAE).
§  Implementation on top of SPADE
   –  Both components are implemented in SPADE and run on top of System S.
   –  Thanks to System S’s design, both components can scale with the incoming traffic as the data volume increases.
§  Support for various Web services
   –  Type I: services that already support a streaming API
   –  Type II: services that provide data access via REST or SOAP
   –  Type III: existing websites without special APIs
The SDC is only responsible for handling continuous data from external components, so Type II and Type III sources need extra translation components.
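The translation component for a pull-based (Type II) source can be pictured as a poller that turns repeated responses into a stream of new items. The sketch below is a minimal illustration, not the StreamWeb implementation: `translate_pull_to_stream`, the `fetch` callable, and the `id` field are all hypothetical, and a real translator would also wait between polls and bound the size of the `seen` set.

```python
from typing import Callable, Dict, Iterator, List


def translate_pull_to_stream(
    fetch: Callable[[], List[Dict]],
    polls: int,
    key: str = "id",
) -> Iterator[Dict]:
    """Poll a pull-based source and emit each item exactly once, in arrival order.

    `fetch` is an assumed callable that returns the current result page,
    for example the parsed response of one search request.
    """
    seen = set()
    for _ in range(polls):
        for item in fetch():
            if item[key] in seen:
                continue  # already emitted by an earlier poll
            seen.add(item[key])
            yield item
```

Downstream operators can then consume the generator exactly as they would a push-based (Type I) stream, which is the point of placing the translator in front of the SDC.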


Sample Real-Time Analytics Engine
§  The scenario is that the system obtains streaming messages from the Twitter service, monitors for specified keywords, and then maps the messages including those keywords onto Google Maps in real time.
§  To realize this service, we used two Web services available in Twitter.
   –  One is a traditional Web service called the Twitter Search Service, which returns a list of messages containing the keywords specified in the HTTP request.
   –  The other is the Twitter Streaming API.


Streaming Data Collector
§  For the XML format:
   –  We built our own parser, the Streaming XML Parser, dedicated to incoming XML messages.
   –  Existing XML parsers such as Xerces assume that the parser retrieves the XML data from a file. For streaming data, however, we should avoid storing the incoming XML data in any file.
§  For the JSON format:
   –  We also created a dedicated SPADE operator for parsing JSON-format data using the C++-based JSON Parser.
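The idea of parsing XML as it arrives, without first writing it to a file, can be illustrated with the incremental `XMLPullParser` from Python's standard library. This is a sketch of the general technique only, not the authors' Streaming XML Parser; the element name `status` and the helper `parse_xml_stream` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def parse_xml_stream(chunks, tag="status"):
    """Yield one dict per completed <tag> element as network chunks arrive.

    `chunks` is any iterable of str or bytes fragments; element boundaries
    do not need to line up with chunk boundaries.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield {child.tag: child.text for child in elem}
                elem.clear()  # release the children of the finished element
```

Because the parser is never closed, the enclosing root element may stay open indefinitely, which matches a long-lived streaming connection.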


How to realize sample services with the Twitter Search Service / Streaming API?

Twitter Streaming API
The system obtains all of the posted messages from the Twitter Streaming API and filters them against the specified keywords. These results include the user profile data.

Twitter Search Service
1.  Retrieve a list of posted messages from the Twitter Search Service:
    –  The system sends an HTTP request with the target keywords for monitoring and receives a list of messages. This is pull-based, so it repeatedly sends requests to the search service.
2.  Retrieve the user profiles via the Twitter API (since the returned messages include only the user names).
3.  Each returned user profile includes a user location. Some users with iPhones also publish their exact locations, in which case Step 2 can be skipped. For Japanese users, the system uses the morphological analysis tool MeCab to get the name of the city from the location data in the profile data.
4.  The internal dictionary identifies the latitude and longitude for the user location.

Twitter Streaming API: http://dev.twitter.com/pages/streaming_api_methods
Twitter Search Service: http://search.twitter.com/api/
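Steps 3 and 4 amount to resolving a post to coordinates: use the exact location when the post carries one, otherwise find a known city name in the profile location and look it up in a dictionary. The sketch below assumes hypothetical field names (`geo`, `profile_location`) and a tiny dictionary with approximate coordinates for illustration only; a plain substring match stands in for the morphological analysis used for Japanese text.

```python
# Hypothetical dictionary; coordinates are approximate and for illustration only.
CITY_COORDS = {
    "tokyo": (35.68, 139.77),
    "osaka": (34.69, 135.50),
}


def resolve_location(post, dictionary=CITY_COORDS):
    """Return (latitude, longitude) for a post, or None if it cannot be located."""
    # Exact coordinates published with the post take priority.
    if post.get("geo"):
        return tuple(post["geo"])
    # Otherwise look for a known city name inside the free-text profile location.
    location = (post.get("profile_location") or "").lower()
    for city, coords in dictionary.items():
        if city in location:
            return coords
    return None
```

Posts that resolve to `None` would simply not be plotted on the map.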


Implementation (1/2)
[Figure: SPADE program data flow.
 Twitter Streaming API path: SourceConnector, Functor, Split, then parallel chains of PostParser, PostFilter, and GeometryCoder, joined by a Barrier that feeds the Visualization Tier.
 Twitter Search Service path: Post Retrieval, PostParser, Filter, GeometryCoder.
 Node assignment on physical compute nodes: streams01 to streams07.]


Implementation (2/2)
§  SourceConnector connects to a Twitter Streaming Server via the Twitter Streaming API, and then continuously fetches the posted messages in JSON format.
§  PostParser parses the incoming JSON messages with a JSON parser implemented in C++.
§  PostFilter receives the posted messages from PostParser and forwards only the messages that contain the specified keywords.
§  GeometryCoder returns a list of messages annotated with geographic information (latitude and longitude).
[Figure: the SPADE program data flow and node assignment from Implementation (1/2), repeated]
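The division of labor among PostParser, PostFilter, and GeometryCoder can be sketched as three chained generators. This is an illustrative single-process analogue, not the SPADE operators themselves: the function names, the `text` field, the one-JSON-document-per-line framing, and the `locate` callback are assumptions made for the example.

```python
import json


def post_parser(lines):
    """Parse one JSON document per line; skip blank lines and malformed input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(post, dict):
            yield post


def post_filter(posts, keywords):
    """Forward only the posts whose text contains at least one keyword."""
    lowered = [k.lower() for k in keywords]
    for post in posts:
        text = post.get("text", "").lower()
        if any(k in text for k in lowered):
            yield post


def geometry_coder(posts, locate):
    """Attach latitude and longitude using `locate`, an assumed lookup callback."""
    for post in posts:
        coords = locate(post)
        if coords is not None:
            yield {**post, "lat": coords[0], "lon": coords[1]}
```

Because each stage only consumes the previous stage's output, several copies of the chain can run side by side behind a splitter, which is the arrangement the figure shows.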


Visualization Tier Example
This user interface displays Twitter messages at the locations where they were posted. We used the Google Maps API and Ajax components in JavaScript to asynchronously connect to the Web server (implemented using the Python Twisted library) and retrieve the posted messages and the location data.


Other Screenshot …


Performance Evaluation
§  Experiment I tested the system scalability while monitoring various numbers of keywords.
§  Experiment II tested the performance optimization with System S’s node re-allocation.
§  The first operator, called “SourceConnector”, parses JSON data using the C++ JSON Parsing Library. This parsing is the heaviest processing, so we distributed this work among the nodes from streams01 to streams06.


Experimental Environment
Total Nodes: 7 nodes (streams01 – streams07)
Spec. for Each Node: 2.7-GHz AMD Athlon 1640B uniprocessor, 1 GB memory, CentOS 5.2 (Linux Kernel 2.6.18-92)
Network: 1 Gbps network
Software: InfoSphere Streams 1.1
Data Access Method: spritzer level of the Twitter Streaming API
Data: 41,237 posted messages (50,432 KB) for the 1 hour from 0:00 to 0:59 on 2009/10/18
Experimental Setting: The emulation for reusing the messages via the Twitter Streaming API was handled by the first UDOP operator, the “SourceConnector”, which avoids a bottleneck in sending the posted messages from a file or network socket by storing the messages in memory.
[Figure: SPADE program data flow and node assignment on physical compute nodes streams01 to streams07, as in Implementation (1/2)]
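A little arithmetic on the dataset figures shows why the captured messages are replayed from memory rather than consumed at their live rate. The calculation below uses only the numbers on this slide; the variable and function names are illustrative.

```python
MESSAGES = 41_237      # posted messages captured in one hour
TOTAL_KB = 50_432      # total size of the capture in KB
DURATION_S = 60 * 60   # capture length in seconds

avg_message_kb = TOTAL_KB / MESSAGES    # about 1.22 KB per message
live_rate = MESSAGES / DURATION_S       # about 11.5 messages per second


def replay_seconds(throughput_per_second):
    """Time needed to push the whole capture through at a given throughput."""
    return MESSAGES / throughput_per_second
```

The live feed averages only about 11.5 messages per second, so a system able to process on the order of 25,000 messages per second (the figure reported in Experiment I) would drain the whole hour of data in under two seconds. Replaying from memory is what lets the experiments drive the operators to saturation.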


Experiment I: Throughput as the number of nodes increased
§  The system could process more than 25,000 messages per second with 3 nodes. The throughput saturated at around 3 nodes due to a bottleneck in the Split operation, which distributes the data to the multiple nodes.
§  When the number of monitored keywords is increased from 1 to 1024, which adds computational load per message, the throughput improves linearly up to 6 nodes.
[Figure: two plots of throughput (messages per second, axis 0 to 25,000) against the number of nodes (1 to 6), one for 1 word and one for 1024 words]
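The two behaviors reported here, early saturation with 1 keyword and linear scaling with 1024 keywords, are what a simple two-stage model predicts when a single Split stage feeds parallel workers. The sketch below is a toy model, not a fit to the measurements: only the 25,000 messages per second ceiling comes from the slide, while the per-node rates of 9,000 and 3,000 are hypothetical values chosen to show the two regimes.

```python
def pipeline_throughput(nodes, per_node_rate, split_rate):
    """Throughput of one splitter feeding `nodes` identical workers.

    The slower of the two stages limits the pipeline.
    """
    return min(split_rate, nodes * per_node_rate)


SPLIT_RATE = 25_000  # ceiling taken from the slide

# Hypothetical per-node rates: light work (1 keyword) vs. heavy work (1024 keywords).
light = [pipeline_throughput(n, 9_000, SPLIT_RATE) for n in range(1, 7)]
heavy = [pipeline_throughput(n, 3_000, SPLIT_RATE) for n in range(1, 7)]
```

With light per-message work the workers outrun the splitter by the third node, so adding nodes no longer helps; with heavy per-message work the workers remain the limiting stage and throughput keeps growing with the node count.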


Experiment II: Optimizing Throughput by Changing Node Allocation
§  6 nodes (streams01 to streams06) are used as compute nodes, and 1 node (streams07) is used for the Socket, Functor, and Split operations.
§  The node streams07 becomes the bottleneck since it is busy trying to send a sufficient number of requests to all the compute nodes.
§  By re-allocating the Functor and Split operators to streams07 and the Socket operator to streams06, the throughput improved from 20,000 to around 25,000 messages per second.

[Figure: two plots of throughput (messages per second, axis 0 to 30,000) against the number of words to be monitored (1 to 4096), one of them labeled Node Re-allocation]

Concluding Remarks and Future Work
§  Concluding Remarks
   –  We proposed a real-time Web monitoring system called “StreamWeb”, built on top of a stream computing platform, System S.
   –  Our first application of the StreamWeb system tracks vast amounts of streaming Twitter messages and displays them on Google Maps according to their originating locations.
   –  We showed only one instance of a streaming data source, but our architecture is general and flexible, so we could build other innovative applications to find new knowledge in real time.
§  Future Work
   –  We will use data sources other than Twitter, build more applications and more complex analytics, and explore other performance optimizations.


Questions?

Thank You
