2011 23rd International Symposium on Computer Architecture and High Performance Computing

Watershed: A High Performance Distributed Stream Processing System

Thatyene Louise Alves de Souza Ramos, Rodrigo Silva Oliveira, Ana Paula de Carvalho, Renato Antônio Celso Ferreira and Wagner Meira Jr.
Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
{thatyene, rsilva, anapc, renato, meira}@dcc.ufmg.br

Abstract—The task of extracting information from datasets that grow larger on a daily basis, such as those collected from the web, is an increasing challenge, but it also enables more interesting insights and analyses. Current analyses have gone beyond content and now focus on tracking and understanding users’ relationships and interactions. Such computation is intensive both in terms of the processing demand imposed by the algorithms and of the sheer amount of data that has to be handled. In this paper we introduce Watershed, a distributed computing framework designed to support the analysis of very large data streams online and in real time. Data are obtained from streams by the system’s processing components, transformed, and directed to other streams, creating large flows of information. The processing components are decoupled from each other and their connections are strictly data-driven. They can be dynamically inserted and removed, providing an environment in which it is feasible for different applications to share intermediate results or cooperate toward a common goal. Our experiments demonstrate the flexibility of creating a set of data analysis algorithms and composing them into a powerful stream analysis environment.

Keywords—Distributed systems, Data-driven architectures, Stream processing, High-performance computing, Dynamic application topology.

1550-6533/11 $26.00 © 2011 IEEE   DOI 10.1109/SBAC-PAD.2011.31

I. INTRODUCTION

Data analysis is a fundamental stage in understanding and solving many problems, and it enables the acquisition of information that can support decision-making. There are several data sources available for analysis, and the Internet is nowadays one of the most important. The task of extracting information from datasets that grow larger on a daily basis is an increasing challenge, but it also enables more interesting insights and analyses. With the web turning more and more into an important means of user interaction, current analyses have gone beyond content and now focus on tracking and understanding users’ relationships and interactions. In such a scenario, we can see the web as a very large “social sensor” that collects excerpts of users’ thoughts, possibly geographically and temporally referenced, from which a significant amount of information may be extracted. The dynamic nature of this source requires efficient and effective processing, since the extracted information loses relevance as time goes by. Such computation is intensive both in terms of the processing demand imposed by the algorithms and of the sheer amount of data from the web. This implies the need for massive computational power, memory and I/O resources, which can be provided by distributed processing architectures. The low cost of processing components, the evolution of network technologies and the popularity of multi-core processors allow massive parallel processing with efficient data communication among the computational resources, or nodes, of a distributed system. When there is a temporal notion in this communication, or when it is necessary to process the data as soon as they arrive in the system, the communication is called stream-oriented [1], [2] and the system becomes a stream processing system. Examples of data streams include network traffic statistics, monitoring measures collected by sensors, and data generated by social networks and other Internet services. A stream processing system is composed of modules that process data in parallel and communicate with each other through channels. These modules are divided into three classes: sources collect external data, filters perform some processing over these data, and synchronizers provide the transformed data as the output of the system. A scheme of a stream processing system is presented in Figure 1.

Figure 1. A typical stream processing architecture.

In this paper we introduce Watershed, a distributed computing framework designed to allow online analysis of very large data streams. It provides an abstraction for developing distributed stream-based applications. Watershed is inspired by the data-flow model: data are obtained from streams by the system’s modules, transformed, and directed to other

streams, creating large flows of information. The modules are decoupled from each other and their connections are strictly data-driven. The multiple modules leverage task parallelism in the analysis system as a whole and, on top of that, each module can also be instantiated on multiple nodes, enabling data parallelism. In Watershed, modules can be inserted or removed at execution time. We believe this dynamic nature makes the platform well suited to stream processing scenarios, providing an environment in which it is feasible for different applications to share intermediate results or cooperate toward a common goal. We present a prototype of such a system, and our experiments show the flexibility of creating a set of data analysis algorithms and composing them into a powerful stream analysis environment. Though the experiments were run in a small-scale laboratory setting, they are enough to highlight the flexibility and the efficiency of the proposed environment.

II. RELATED WORK

A way to obtain efficiency in large-scale data processing is to use a set of computers to solve the problems related to a given application. Several approaches have been proposed to achieve this efficiency and to provide abstractions that ease the work of parallel application developers, motivating the emergence of many distributed systems for stream processing. Simulations of physical phenomena made by scientists and engineers are a source of large volumes of data. In this context, Beynon et al. [3] proposed a middleware infrastructure called DataCutter. DataCutter provides a set of core services on which developers can implement other services related to their applications, or combine them with existing services in a grid environment, for example. Applications are divided into filters, written in the filter-stream programming model, that may be executed at any network node. Another filter-stream based system is Anthill [4]. It was developed as a library that provides an API that allows developers to create distributed applications in an easy and intuitive way. The provided abstractions make it possible to exploit both data and task parallelism. Moreover, it introduces a third type of parallelism: asynchrony between iterations of loop applications. Applications in Anthill are often decomposed as a cyclic graph of computation, and the execution may consist of iterations over this graph. MapReduce [5] is a programming model for processing and generating large datasets. In this model, users specify map functions, which process a (key, value) pair and produce an intermediate set of pairs as a result, and reduce functions, which join all the values indexed by the same key in the intermediate set, generating a list of values as a result. The relevance of this model is that many real applications can be expressed as map and reduce functions, increasing its adoption in different areas. In Google’s MapReduce implementation, all data traffic between functions goes through the Google File System [6], which is convenient because simple operations are performed on large amounts of data without the need for intensive message exchange. Isard et al. [7] proposed Dryad, a general-purpose distributed execution environment for applications with coarse-grained data parallelism. An application written for this environment can be seen as an acyclic graph of data streams, combining processing vertices and communication edges. Communication is done in different ways, such as files, TCP channels and shared memory, depending on the application requirements. Applications are decomposed into multiple processing stages, and the environment allows as many stages as needed, something that is not supported by MapReduce. A limitation of Dryad is that cyclic applications cannot be represented. System S, nowadays called InfoSphere Streams, is a large-scale, high-performance computing platform developed at IBM Research, in which it is possible to build applications that respond quickly to events and changing requirements, adapt to changing workloads, and analyze real-time data at high rates. It can execute a large number of jobs in the form of data-flow graphs described in its stream-application language called SPADE (Stream Processing Application Declarative Engine) [8], [2]. The Cayuga System [9], [10] is a high-performance system for complex event processing (CEP). It combines a simple language for composing queries with a scalable query processing engine. Cayuga is able to scale both with the arrival rate of events in the streams and with the number of queries being processed. In this system, each stream has a fixed relational schema and events in the stream are treated as relational tuples, whereas the query language maps algebra operators into an SQL-like syntax. Cayuga supports resubscription; in other words, the output stream of one query can be used as the input stream of one or more other queries. This feature makes it feasible to express very complex event pattern queries. The latest demands of data processing require that results be generated as soon as the input data are available, creating the notion of continuous queries over these streams. Although there are systems that work under this paradigm, few of them offer the possibility of multiple applications sharing intermediate results while processing components are dynamically added and removed. In this paper we propose a distributed processing system, named Watershed, that supports these features.

III. WATERSHED OVERVIEW Watershed is a dynamic distributed platform that provides an abstraction to develop distributed applications that aim to process massive data streams. This platform exploits the


<!DOCTYPE config SYSTEM "startup.dtd">
<ompi prefix="..."></ompi>
<server name="ws-manager" home="..."
        <!-- Watershed running directory -->
        running_dir="...">
  <hostdec>
    <host name="host1" database_server="true">
      <resource name="TwitterAPI"/>
      <resource name="Database"/>
    </host>
    <host name="host2" database_server="false">
      <resource name="WebAccess"/>
    </host>
  </hostdec>
</server>

<!DOCTYPE filter SYSTEM "module.dtd">
<module
  name="..."
  <!-- Module type (batch | stream) -->
  type="..."
  <!-- Path to the module shared library -->
  library="..."
  <!-- Input stream -->
  flow_in="..."
  <!-- Data receipt policy -->
  input_policy="..."
  <!-- Output stream -->
  flow_out="..."
  <!-- Number of instances -->
  instances="..."
  <!-- Module arguments -->
  arguments="-i arg1 -o arg2">
  <!-- Demanded resources -->
  <demands>
  </demands>
</module>

Figure 2. The XML configuration files. (Left) Watershed configurator. (Right) Processing module configurator.

available computational resources in order to achieve maximum efficiency for those applications. In this platform, an application follows the filter-stream model and is composed of a chain of processing modules, exploiting task parallelism. Each stream has one or more producer modules that send messages to consumer modules. Each processing module can be transparently replicated as a set of identical instances, which may run on different nodes, supporting data parallelism. Applications may run in stream mode, in which there is no termination concept; in other words, the application modules run indefinitely until the user decides to remove them. The other application mode is batch mode, in which the modules terminate as soon as they process their last message. In this case, Watershed detects the termination of an application module and notifies the next module in the chain. Once this module has received the notification, it can terminate after processing its last received message. This action is repeated until all application modules terminate. Watershed provides a mechanism to add and remove processing modules at execution time. This feature allows multiple applications to run at the same time and share intermediate results. The modules may consume or produce data, and they are dynamically linked with their data producers and consumers. The reuse of these modules yields a performance gain, because it eliminates repeated computations over the same data. In order to deploy the system on a cluster, it is necessary to provide an XML (Extensible Markup Language) file that describes the environment settings, including some configuration information as well as a list of machines and their available resources, as shown in Figure 2 (Left). Using this

file information, Watershed is loaded on all machines of the cluster and is ready to run applications. Users can then interact with the system by adding processing modules or removing them when applicable. New modules are created as specialized C++ classes, in which the programmer has to implement only the method responsible for processing data when they are delivered to the module. The base class is provided by Watershed through a dynamic shared library and implements the control methods of a module. Moreover, it exports some methods that may be invoked by the user’s classes:
• int GetNumberInstances(void) — Returns the number of instances of the module.
• int GetRank(void) — Returns the instance identifier of the caller.
• string GetArgument(string arg_name) — Users can pass arguments to their modules through a field in the configuration file. This method retrieves a parameter value given its identifier.
• void Send(Message* output_message) — Sends a message to the consumers according to one of the following policies: i) Round Robin, messages from a producer instance are sent to the consumer instances in round-robin fashion; ii) Broadcast, messages from a producer instance are sent to all consumer instances; iii) Labeled Stream, messages from a producer instance are sent to consumer instances according to a hash function applied to the message data. This hash function must be provided by the programmer of the consumer module.
• void Process(Message* message) — Implemented by the module programmer. Receives and processes a message from the producer modules.
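To make the programming model concrete, the sketch below shows what a user module might look like. The Message struct and ModuleBase class are simplified stand-ins of our own so that the example is self-contained (the real base class lives in the Watershed shared library, and its Send implements the delivery policies above); only the Process/Send pattern mirrors the API just described.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Watershed-provided types. The real base
// class also implements the module control methods and the Send()
// delivery policies; here Send() just records the outgoing messages.
struct Message {
    std::string data;
    explicit Message(std::string d) : data(std::move(d)) {}
};

class ModuleBase {
public:
    virtual ~ModuleBase() = default;
    void Send(Message* m) { sent.push_back(m->data); }
    virtual void Process(Message* message) = 0;  // implemented by the user
    std::vector<std::string> sent;
};

// A user module in the spirit of the paper's Hashtag Extractor: it scans
// the tweet text and emits every "#tag" token it finds.
class HashtagExtractor : public ModuleBase {
public:
    void Process(Message* message) override {
        const std::string& text = message->data;
        size_t i = 0;
        while ((i = text.find('#', i)) != std::string::npos) {
            size_t j = i + 1;
            while (j < text.size() &&
                   (std::isalnum((unsigned char)text[j]) || text[j] == '_'))
                ++j;
            if (j > i + 1) {            // ignore a lone '#'
                Message tag(text.substr(i, j - i));
                Send(&tag);
            }
            i = j;
        }
    }
};
```

In the real platform, the Send call at the end would route each extracted hashtag to the consumer instances according to the module's configured delivery policy.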


The module code must be linked with its base class library, generating a new shared library that can be loaded by the system. To add a module, users must provide an XML file containing its information. Figure 2 (Right) illustrates this file, with comments about each field. To remove a module from Watershed, users provide its name as an argument to the console and the system takes care of it, removing all instances of the module.

IV. WATERSHED ARCHITECTURE

The Watershed architecture, as seen in Figure 3, includes:
• A system console, an interface through which users can send commands to the system.
• A set of manager daemons, one per cluster node, which control the execution of the applications.
• A set of database daemons, responsible for matching streams and providing the information needed for dynamic module linking.
• A communication layer, based on the Message-Passing Interface [11] and built over Open MPI [12].
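The communication layer also implements credit-based flow control between producers and consumers, described in detail in Section IV-D. Its bookkeeping can be illustrated with a small sketch; the class and method names below are ours, not Watershed's:

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative sketch of credit-based flow control. The consumer owns a
// fixed-size receive buffer, measured in messages, and splits it evenly
// among its current producers; a producer spends one credit per message
// and must ask for a refill when it reaches zero.
class CreditConsumer {
public:
    explicit CreditConsumer(int buffer_size) : buffer_size_(buffer_size) {}

    // A producer joins: it receives an equal share of the buffer.
    void AddProducer(const std::string& p) {
        int share = Share(credits_.size() + 1);
        credits_[p] = share;
    }

    int CreditsOf(const std::string& p) const { return credits_.at(p); }

    // Producer-side bookkeeping: returns false when out of credits,
    // meaning the producer must request a refill before sending again.
    bool TrySend(const std::string& p) {
        if (credits_[p] == 0) return false;
        --credits_[p];
        return true;
    }

    // Refill request: the consumer recomputes the share from its
    // *current* number of producers and announces it back.
    int GrantCredits(const std::string& p) {
        int share = Share(credits_.size());
        credits_[p] = share;
        return share;
    }

private:
    int Share(size_t producers) const { return buffer_size_ / (int)producers; }
    int buffer_size_;
    std::map<std::string, int> credits_;
};
```

Recomputing the share only at grant time keeps the scheme robust when producers join or leave between refills.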

Figure 4. The user/Watershed interaction diagram.

B. Manager Daemons

The manager daemons are responsible for controlling each processing module during its lifetime, that is, its activation, execution, and deactivation. When a manager daemon receives a command to include a new processing module, it becomes the module’s owner. In this stage, the manager reads the module’s XML descriptor and creates an internal configurator. The scheduling decision is based on the module’s resource demands and the required number of instances. The demands define which nodes are able to run the instances. Given the eligible nodes, the scheduler assigns the instances to them in round-robin fashion. When the descriptor file does not provide the number of instances, one instance is assigned to each eligible node. During the module’s execution, the owner manager daemon provides the information the module needs to connect to a database daemon and to be dynamically linked to its data consumers and producers.

C. Database Daemons

Figure 3. Watershed architecture.

One feature still to be implemented in Watershed is stream persistence. Since applications are composed of chains of processing modules, their outputs are combinations of intermediate results. It is interesting to keep track of this derivation process in order to offer a data provenance mechanism. When a new module is added to the system, it could then consume either real-time streams or data from the storage system. The database daemons are responsible for maintaining information about the modules’ input and output streams and for persisting them in a storage system. The information about streams is used in the module linkage process, in which the output streams of one module are matched to the input streams of other modules, and vice versa. The stream persistence mechanism is work in progress and is not yet integrated into the system. The database daemons will reside on the nodes where persistence takes place, which may be a subset of the cluster nodes.

The architecture is detailed in the following sections.

A. System Console

The system console is a program that allows users to interact with Watershed. Users can start the entire environment and add or remove processing modules. The system can be halted at any time through a console command, which starts a shutdown process that removes all the running modules and stops the daemons. The user/system interaction is depicted in Figure 4.
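The instance placement rule performed by the manager daemons (Section IV-B) can be sketched as follows; the function and its types are illustrative stand-ins, not the actual scheduler code:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of the placement rule: a node is eligible when it offers every
// resource the module demands, and the requested instances are then
// dealt out round robin over the eligible nodes. A non-positive count
// stands in for "not provided", which maps to one instance per node.
std::vector<std::string> PlaceInstances(
    const std::vector<std::pair<std::string, std::set<std::string>>>& nodes,
    const std::set<std::string>& demands,
    int instances) {
    std::vector<std::string> eligible;
    for (const auto& n : nodes) {
        bool ok = true;
        for (const auto& d : demands)
            if (!n.second.count(d)) { ok = false; break; }
        if (ok) eligible.push_back(n.first);
    }
    std::vector<std::string> placement;
    if (eligible.empty()) return placement;
    if (instances <= 0) instances = (int)eligible.size();  // one per node
    for (int i = 0; i < instances; ++i)
        placement.push_back(eligible[i % eligible.size()]);
    return placement;
}
```

The resource names mirror those in the configuration file of Figure 2 (Left).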


D. Communication Layer

Watershed uses the MPI standard for communication between processes. We chose MPI because it has some desirable advantages over other message-passing platforms: it is a mature and widely supported standard and it provides a high degree of portability. Watershed implements a communication layer over the Open MPI library that provides the basic functions to send and receive messages, as well as functions for process management and linkage. We mapped the architecture components onto MPI concepts; we describe this mapping next, and the interaction between the components is shown in Figure 5. The Watershed daemons, the console and the modules’ instances are implemented as MPI processes. The system console communicates only with the manager daemons, to which it sends user commands. This communication follows a client/server model, where each manager daemon opens a port to receive connections from the console. Manager daemons are created and placed into a logical MPI group. They communicate with each other using an intra-communicator. When a manager daemon is requested to add a new processing module, it spawns the module instances, also logically grouped, and becomes their owner. From that moment on, this manager is responsible for sending any control message to the module, including its removal from the system. The manager group creates the database daemons according to the configuration file and puts them in a separate logical group. The communication between these daemons is done using an intra-communicator, with the purpose of exchanging information about the active streams. The database group opens a port to accept connections from module groups. When spawned by a manager daemon, a module group connects to the database group using the opened port and opens its own port to receive connections from other modules. After that, it receives from Watershed a list containing the names and ports of its data producers and consumers.
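The stream matching that backs this linkage, in which the database daemons pair the producers and consumers of each named stream, might be sketched as below; the class is an illustrative stand-in, not the actual daemon code:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative registry: each stream name maps to the modules that
// produce it and the modules that consume it, so a newly added module
// can be linked to both sides of its flow_in and flow_out streams.
class StreamRegistry {
public:
    // Register a module's output stream (its flow_out) ...
    void AddProducer(const std::string& stream, const std::string& module) {
        producers_[stream].push_back(module);
    }
    // ... and its input stream (its flow_in).
    void AddConsumer(const std::string& stream, const std::string& module) {
        consumers_[stream].push_back(module);
    }
    // Who must a new consumer of `stream` connect to?
    std::vector<std::string> ProducersOf(const std::string& stream) const {
        auto it = producers_.find(stream);
        return it == producers_.end() ? std::vector<std::string>{} : it->second;
    }
    std::vector<std::string> ConsumersOf(const std::string& stream) const {
        auto it = consumers_.find(stream);
        return it == consumers_.end() ? std::vector<std::string>{} : it->second;
    }
private:
    std::map<std::string, std::vector<std::string>> producers_, consumers_;
};
```

In Watershed, the answer to these lookups is what a spawned module group receives as its list of producer and consumer names and ports.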
With this information, the module group can connect to them, becoming part of the application chain. The new module then starts processing, receiving and sending messages. All communication between process groups is done using MPI inter-communicators. When the removal of a processing module is requested, Watershed finds the corresponding owner manager, which sends a termination message to all instances of that module. After that, the module group disconnects from the database group and from its producers and consumers. Once disconnected, all instances within the module group are terminated. If the module is running in batch mode, Watershed is responsible for propagating the termination notification. One of the issues in designing distributed systems such as Watershed is flow control, which is necessary to avoid that

Figure 5. Interaction between processes in Watershed.

a producer that is faster than some consumer overwhelms the latter. Although Open MPI provides buffering on the consumer side, which could be used as a solution to this problem, the buffers may grow so much that the host memory becomes insufficient. To avoid this behavior, Watershed controls the communication between two processing modules through a credit-based scheme. The consumer’s receive buffer has a fixed size (a maximum number of messages), which the consumer divides equally among its producers. Whenever a producer sends out a message, its credit count is decremented. When the credit reaches zero, the producer sends a special message asking for more credits and waits for a new announcement from the consumer. By the time the consumer receives a request for new credits, it has already finished processing that producer’s messages. It then computes the new credit based on its current number of producers and sends it to the requester.

V. EXPERIMENTAL EVALUATION

In this section we describe a sample application on Watershed and present some experiments we conducted in order to investigate the system’s flexibility and performance. The evaluation was performed on a cluster of 12 nodes, each with an Intel Core 2 CPU 6420 @ 2.13 GHz and 2 GB of RAM. The application is based on a scenario in which many data analysis tools process a large collection of data coming from the web, inferring relevant information. The data source chosen was Twitter, and each data analysis tool was written as a processing module. The application workflow is shown in Figure 6. There is a module named Collector that uses the Twitter API (http://dev.twitter.com/) to collect tweets about a given subject. A user provides the query to this module through a parameter and adds it to the system. This module then collects tweets


and sends them to two other modules, Text Extractor and Hashtag Extractor. The extractor modules parse each object representing a tweet, extracting its text and any existing hashtags. Hashtag Extractor sends the processed data to Hashtag Counter, which counts the hashtags found and outputs the results, finalizing one of the application’s computations. This application branch can be used to determine trending topics via hashtags, for instance. Text Extractor sends the processed data to a module named Stopwords Remover. This module removes the stop words present in a tweet’s text and sends the result to Cleaner and Stemmer. The Cleaner module extracts links and citations from the received data. The results are sent to the system output, finalizing another application branch. The Stemmer module reduces the words remaining in the tweet text to their stems, that is, their root forms. This processing can be used in data mining, for example. It sends the stemmed tweet text to a module named Word Counter, which counts the number of occurrences of the words and outputs the result. This last module may be used to maintain word clouds.

Figure 6. Application workflow.

Since each module performs only light work on a single tweet, we decided to inject an additional load into the Stopwords Remover, making it a bottleneck for the modules that receive data from it. This additional load consists of a random computation of trigonometric functions. In order to evaluate Watershed’s behavior, we gradually increased the number of instances of the Stopwords Remover module, observing the performance of all synchronizer modules. Whenever a new tweet arrives in Watershed, it receives an incremental identifier and an arrival timestamp. When it leaves the environment through one of the synchronizer modules, the departure timestamp is recorded in a file together with its identifier and arrival timestamp. Thus, it was possible to measure the throughput at each application output. The Collector was implemented so that tweets are sent at a rate of 10 tweets/s, despite input rate variations. We varied the number of instances of Stopwords Remover from 1 to 24 and kept one instance of each of the other modules. For each configuration, we let the application run for 2 minutes. In the transitions between configurations we completely removed Stopwords Remover and added it again with its new number of instances. This process allowed us to simulate a dynamic scheduling mechanism that will be implemented in the future. The graph in Figure 7 shows the throughput measured at all application outputs.

Figure 7. Throughput measured in Hashtag Counter, Cleaner and Word Counter outputs.
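For concreteness, the kind of per-tweet work done in the Stopwords Remover → Word Counter branch can be sketched as two small functions (our own illustrative code, not the experiment's implementation):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Stopwords Remover: keep only the words not in the stop-word set.
std::vector<std::string> RemoveStopwords(const std::string& text,
                                         const std::set<std::string>& stopwords) {
    std::vector<std::string> kept;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        if (!stopwords.count(word)) kept.push_back(word);
    return kept;
}

// Word Counter: number of occurrences of each remaining word.
std::map<std::string, int> CountWords(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (const auto& w : words) ++counts[w];
    return counts;
}
```

In the deployed application each function would live in its own module, with the intermediate word list flowing between them as Watershed messages.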

It is possible to observe that increasing the number of Stopwords Remover instances reduces the bottleneck experimentally imposed on its consumers, making their throughput grow quickly. When the number of instances reaches the number of nodes, the overhead impacts the system throughput because the modules begin to compete for the limited computational resources, as expected. Nevertheless, Watershed keeps the application performance at a satisfactory level without abrupt degradation. The graph also shows that Watershed is very flexible in adding and removing modules: the throughput of Hashtag Counter does not vary despite these operations. The low throughput observed in this module is a consequence of the fact that hashtags are not present in all tweets, and messages not containing this element are filtered in advance and not even sent to Hashtag Counter. Another experiment was made taking some of the previously defined modules, creating a new application and


Figure 8. Execution time and speedup measured for the batch application after processing 10,000,000 items. (a) Execution time. (b) Speedup.
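For reference, the speedup plotted in Figure 8(b) is the usual ratio of single-instance to n-instance execution time, which for a fixed-size workload equals the ratio of measured throughputs:

```cpp
#include <cassert>
#include <cmath>

// Speedup from execution times; for a fixed item count this equals the
// ratio of the n-instance throughput to the single-instance throughput.
double SpeedupFromTimes(double t1, double tn) { return t1 / tn; }
double SpeedupFromThroughputs(double r1, double rn) { return rn / r1; }
```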

executing it in batch mode in order to verify its scalability on Watershed. We employed the Collector, Cleaner, Stopwords Remover and Word Counter modules as a pipelined application, depicted next:

DATABASE → Collector → Cleaner → Stopwords Remover → Word Counter → OUT

The workload for this experiment was created as follows. We previously collected 10,000,000 tweets and inserted their texts in a database. We then replaced Twitter with this database as the data source in order to evaluate Watershed’s performance under a heavy load condition, which was not possible using Twitter because of the low data arrival rate of our queries. We were also able to eliminate the extra computation added to Stopwords Remover, since we no longer needed to impose a bottleneck on the application as a whole. More specifically, we kept one instance of Collector during this whole experiment. This module reads the items from the database and sends them to the next module at the highest possible rate. We varied the number of instances of the other modules from 1 to 7, exploiting all available resources in the cluster. For each configuration, the execution time for processing the entire database was measured and the speedup was computed. Figure 8(a) shows the execution time, in minutes, for each configuration, whereas Figure 8(b) shows the respective speedup. From these curves it is possible to see that the application scales properly with the replication of its modules. The observed speedup is very close to the linear reference, which means that Watershed does not impose significant overhead when handling many copies of the application’s processing modules. Small oscillations can be seen along the speedup curve, which can be explained by the absence of repetitions of the experiment. Computing the overall throughput for this experiment, we find about 121 items/s for the configuration using 1 instance per module and about 739 items/s using 7 instances per module. These values can certainly be improved through optimizations in the application code, which we did not pursue because it is outside the scope of this work. Although we have presented preliminary results, the ability of Watershed to deal with multiple applications processing online data was demonstrated, and the dynamic mechanism worked properly under heavy-load scenarios. Moreover, we achieved good speedups when executing a simple application, which is very promising given Watershed’s requirements.

VI. ONGOING AND FUTURE WORK

The Watershed platform was designed to be a general distributed stream processing system, but it has specific features for running online applications that process massive data from the web. In this context, we are assessing the possibility of using Watershed as an underlying layer for the data mining algorithms running in a project named Observatório da Web (http://www.observatorio.inweb.org.br), a free tool dedicated to monitoring important facts, events and entities from the web in real time. Watershed is under construction, and several important features are being developed and added to it. One of these features is the persistence of streams. For now, when a module is added, messages that existed in its input stream beforehand are not delivered to it; in other words, the module loses those messages. Therefore, an important ongoing development is the implementation of a mechanism for stream persistence, which will allow a newly added module to process past data stored in persistent databases.


Another important feature being implemented in Watershed is a dynamic scheduling mechanism, which will provide more flexibility and efficiency in the use of computational resources. This mechanism will monitor the whole system load and the cluster’s resources. Based on the derived statistics, Watershed will be able to decide where and how many copies of a module should run and, most importantly, it may decide to move modules between machines at execution time when necessary to improve the applications’ performance. To support dynamic scheduling, a global state maintenance strategy is being designed, in which it will be possible to save the modules’ partial states and compose them into Watershed’s global state. This allows modules to be moved, suspended and resumed when necessary. The global state maintenance will also be part of a fault tolerance mechanism to deal with process failures.

[3] M. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz, “DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems,” in Proceedings of the 17th IEEE Symposium on Mass Storage Systems, 2000, pp. 119–133.

VII. CONCLUSIONS

In this paper we presented Watershed, a dynamic distributed stream processing system that supports the online analysis of very large data streams. Watershed is inspired by the dataflow model: an application is composed of a chain of decoupled processing modules, exploiting task parallelism. Each module may be transparently replicated as a set of identical instances, providing data parallelism. Its communication layer is based on the Message-Passing Interface [11], and users can interact with the system by adding or removing processing modules at execution time, using a very simple programming model. The main contributions of Watershed are the mechanism for adding and removing processing modules at execution time and the scalability of the applications running on top of it. These features allow multiple applications to execute at the same time and to share intermediate results, while sustaining high performance. We described a prototype of Watershed, and our results demonstrate the flexibility of creating a set of data analysis algorithms and composing them into a powerful stream analysis environment. Though the experiments were run in a small-scale laboratory setting, the results are promising, highlighting the flexibility and efficiency of the system.

[6] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” Operating Systems Review (ACM SIGOPS), vol. 37, no. 5, pp. 29–43, 2003.

[4] R. A. Ferreira, W. Meira Jr., D. Guedes, L. M. A. Drummond, B. Coutinho, G. Teodoro, T. Tavares, R. Araujo, and G. T. Ferreira, “Anthill: A scalable run-time environment for data mining applications,” in Proceedings of the 17th International Symposium on Computer Architecture on High Performance Computing. Washington, DC, USA: IEEE Computer Society, 2005, pp. 159–167. [5] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” in Proceedings of the 6th Symposium on Operating System Design and Implementation, San Francisco, California, USA, December 2004, pp. 137– 150.

[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems. New York, NY, USA: ACM, 2007, pp. 59–72. [8] L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani, “SPC: A distributed, scalable platform for data mining,” in Proceedings of the 4th International Workshop on Data Mining Standards, Services and Platforms. New York, NY, USA: ACM, 2006, pp. 27– 37. [9] L. Brenna, A. Demers, J. Gehrke, M. Hong, J. Ossher, B. Panda, M. Riedewald, M. Thatte, and W. White, “Cayuga: a high-performance event processing engine,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data, ser. SIGMOD ’07. New York, NY, USA: ACM, 2007, pp. 1100–1102. [Online]. Available: http://doi.acm.org/10.1145/1247480.1247620 [10] A. J. Demers, J. Gehrke, B. Panda, M. Riedewald, V. Sharma, and W. M. White, “Cayuga: A general purpose event monitoring system,” in Online Proceedings of The Third Biennial Conference on Innovative Data Systems Research, ser. CIDR 2007, Asilomar, CA, USA, 2007, pp. 412–422.

ACKNOWLEDGMENTS This research was partially funded by FAPEMIG, CNPq, CAPES, FINEP and the Brazilian National Institute for Science and Technology of the Web — InWeb (MCT/CNPq 573871/2008-6).

[11] M.P.I. Forum, “MPI: A message-passing interface sandard.” University of Tennessee, Tech. Rep. UT-CS-94-230, 1994, http://www.mpi-forum.org/docs/docs.html. [12] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Proceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.

R EFERENCES [1] R. Stephens, “A survey of stream processing,” Acta Informatica, vol. 34, pp. 491–541, 1997. [2] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, “SPADE: The System S declarative stream processing engine,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2008, pp. 1123–1134.
