Enhance Exploring Temporal Correlation for Data Collection in WSNs

Ngoc Duy Pham, Trong Duc Le, and Hyunseung Choo
School of Information and Computer Engineering, Sungkyunkwan University, Korea
Email: {phmngocduy, letrongduc}@skku.edu, [email protected]

Abstract—Continuous data collection applications in wireless sensor networks require sensor nodes to continuously sample the surrounding physical phenomenon and return the data to a processing center. Battery-operated sensors must limit the use of their wireless radio, so they compress the sensed time series instead of transmitting it in raw form. One of the most commonly used compression methods is piecewise linear approximation. Previously, Liu et al. proposed a greedy PLAMLiS algorithm that approximates the time series by a number of line segments in $\Theta(n^2 \log n)$ time; however, this cost is too high for processing on the sensors. We therefore propose an alternative algorithm that obtains the same result with a shorter running time. Theoretical analysis and comprehensive simulations show that the proposed algorithm has a competitive computational cost of $\Theta(n \log n)$ and also reduces the number of line segments, so it decreases the overall radio transmission load and saves energy at the sensor nodes.

978-1-4244-2379-8/08/$25.00 (c)2008 IEEE

I. INTRODUCTION

One of the main objectives of wireless sensor networks (WSNs) is to collect environmental sensor readings [1]; that is, each sensor node periodically collects local measures of interest such as illumination, temperature and humidity, and then transmits them back to the sink node. Typically, each node measures the environmental parameters at a fixed time interval, and the time-ordered sequence of samples constitutes a time series. Because of the nature of the physical phenomenon, there is significant temporal correlation within the time series of sensor readings: the sensed data is quite similar over a short period of time, so future values can be predicted from previous measurements. The correlations can be captured by mathematical models such as wavelet transforms or linear models (see the example in Fig. 1). The time series can therefore be approximated with a suitable mathematical model, and the resulting approximation data is usually much smaller than the whole data series. Transferring compressed data instead of raw data can significantly reduce the energy consumed by communication in the network [10].

Fig. 1. The time series (a) and its piecewise linear representation (b).

There already exist a number of research efforts to exploit temporal correlation [2], [9], [11], and to classify, cluster and index time series [5], [7]. Some of these are meant to be executed on a server, which has enough computational resources for mining large time series online or offline. Unfortunately, sensor nodes in WSNs are very limited in
computational and energy resources, making existing methods not very effective. One of the most noteworthy approaches is the Energy Efficient Data Collection (EEDC) framework proposed by Liu et al. [3], [4]. It introduced a greedy approximation technique with a computation cost of $\Theta(n^2 \log n)$, where $n$ is the length of the time series. In general, the purpose of the algorithm is to find the minimum number of line segments that approximate the time series such that the difference between any approximation value and the actual value is less than a given error bound. However, EEDC and most existing studies either investigate only the theoretical aspects of the correlation or provide an approximation algorithm whose computing cost is relatively high for sensor nodes.

In this paper, to exploit the temporal correlation, an alternative piecewise linear approximation algorithm is adopted to approximate the time series by a sequence of line segments. The advantage of this approach is a shorter running time of $\Theta(n \log n)$. Moreover, the experimental results indicate that the proposed algorithm requires fewer line segments. Accordingly, for continuous data collection applications in WSNs, the new algorithm saves energy in both the computation and the communication needed at every individual sensor node.

The remainder of this paper is organized as follows: Section II briefly discusses previous work in this area, Section III presents the new algorithm for approximating the time series of sensed data, Section IV describes the performance evaluation using simulation results, and Section V presents the conclusions of the research.

II. EXPLOITING TEMPORAL CORRELATION

In continuous data collection applications, sensor nodes obtain measurement samples periodically and store the collected readings in a buffer. When the buffer is full, the node treats the buffered data as a time series and transmits it back to the sink (processing center).
However, transmitting all the raw data from each sensor node to the sink often causes a large energy consumption via radio transmission. In order to reduce the volume of the transmitted data, the observed time series is approximated within a given error bound. Much work has been carried out on compacting time series data, such as compression or wavelet transformation, but the complexity of these techniques is quite high and they are not appropriate for processing at simple sensor nodes. Due to the limited processing ability of most sensor nodes, the EEDC framework uses a piecewise linear approximation technique which approximates the time series by a number of line segments. Only the end points of every line segment, rather than the whole time series, need to be transmitted to the sink node. To facilitate understanding, Fig. 2 shows a possible scenario for the piecewise linear approximation, where three line segments ($x_1$, $x_2$), ($x_2$, $x_6$), and ($x_6$, $x_7$) could be used to approximate the time series. The problem can be modeled as follows.

Fig. 2. An example of the piecewise linear approximation (value versus time for the points $x_1, \ldots, x_7$).

The problem. For a given time series and a given error bound, find the minimum number of line segments to approximate the time series such that the difference between any approximation value and its actual value is less than the error bound. The end points of the line segments must be points in the time series.

In [3], it has been shown that the PLAMLiS problem (Piecewise Linear Approximation with Minimum number of Line Segments) can be solved in polynomial time with a worst-case complexity of $\Theta(n^3)$. In [4], the main reference paper for this study, a greedy PLAMLiS algorithm that runs in $\Theta(n^2 \log n)$ is proposed. The approach converts the PLAMLiS problem into a set-covering problem as follows. Suppose a time series $X$ consists of $n$ points $x_1, x_2, \ldots, x_n$. Each point $x_i$ is associated with the point $x_j$ ($j > i$) that is farthest from $x_i$ such that the line segment ($x_i$, $x_j$) meets the given error bound. Let $F_i$ denote the subset consisting of all the points covered by that line segment, $x_i, x_{i+1}, \ldots, x_j$. Eventually, we obtain a set $F$ with $n$ subsets ($F_1, F_2, \ldots, F_n$). The PLAMLiS problem is thereby converted to the problem of picking the least number of subsets from $F$ that cover all the elements of the set $X$. The well-known greedy algorithm is used, and the final result is a collection of subsets from $F$ that correspond to the line segments being sought.

TABLE I. THE NOTATION USED IN THIS PAPER
    X            A time series in the form $x_1, x_2, \ldots, x_n$
    X[a : b]     The subsection of series X from a to b: $x_a, x_{a+1}, \ldots, x_b$
    Segs         A piecewise linear approximation of a time series of length n with K segments; individual segments can be addressed with Segs(i)
    max_error    The given error bound

Regarding the running time complexity of the algorithm, calculating the set $F$ in the first step requires $\Theta(n^2 \log n)$ time, while in the second step the greedy algorithm requires $\Theta(n^2)$
running time. On the whole, the greedy PLAMLiS algorithm runs in $\Theta(n^2 \log n)$ time. However, this computational complexity is quite high for sensor nodes with limited processing abilities; it overloads the nodes and results in high energy consumption. This is the motivation for the proposed algorithm: an alternative that is simpler to implement yet retains the same output results.

III. OUR APPROACH

The approach aims at improving upon the previous work in [4] by proposing a top-down approximation algorithm that reduces the complexity of the greedy PLAMLiS algorithm while keeping the same output results. In the following, the new algorithm is presented and its computational complexity is analyzed. For ease of reference, the notation used in this paper is summarized in Table I.

A. The proposed algorithm

Assume the time series consists of $n$ points $x_1, x_2, \ldots, x_n$. In [4], a graph was built with the data points as the vertices, and the problem was solved by converting it to a set-covering problem. In contrast, this study follows a top-down method in which the time series is recursively partitioned until a stopping criterion is met. Suppose first that a single line segment connects $x_1$ and $x_n$ for the series of $n$ points. This line segment ($x_1$, $x_n$) is checked against the error bound; that is, the difference between the approximation value and the actual value of every $x_k$ ($1 < k < n$) must not be larger than the given error bound. Note that the approximation value of $x_k$ is the intersection point of the line
segment ($x_i$, $x_j$) and the vertical line at time index $k$. The series is scanned from left to right in order to find the first point that does not meet the error bound. If all of the points meet the given error bound, then the required line segment has been found and the algorithm stops. Otherwise, the line segment is partitioned into two parts at this point, and the algorithm is applied to each part until all of the sub-segments meet the error bound.

Fig. 3. An example of the top-down piecewise linear approximation algorithm, panels (a) to (d).

The following is the pseudo code for the algorithm.

    Approximate(X, Segs)
        for k = 2 to length(X) − 1
            if calculate_error(X, k) > max_error
                // Recursively split the left section of data
                Approximate(X[1 : k], Segs)
                // Recursively split the right section of data
                Approximate(X[k : length(X)], Segs)
                return
            end
        end
        // No point violates the error bound: keep this line segment
        append segment (X[1], X[length(X)]) to Segs
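The top-down procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `calculate_error` and `approximate` mirror the pseudo code, but the signatures, the zero-based indices, and the representation of a segment as a pair of end-point indices are assumptions made for this example. The recursion depth can reach the length of the series in the worst case, which is acceptable for buffers of a few hundred readings.

```python
def calculate_error(series, lo, hi, k):
    """Vertical distance between series[k] and the line through points lo and hi."""
    slope = (series[hi] - series[lo]) / (hi - lo)
    approx = series[lo] + slope * (k - lo)
    return abs(series[k] - approx)


def approximate(series, max_error, lo=0, hi=None, segs=None):
    """Top-down split at the FIRST interior point that violates max_error.

    Returns a list of (start_index, end_index) pairs (hypothetical format).
    """
    if hi is None:
        hi = len(series) - 1
    if segs is None:
        segs = []
    for k in range(lo + 1, hi):
        if calculate_error(series, lo, hi, k) > max_error:
            approximate(series, max_error, lo, k, segs)  # left section
            approximate(series, max_error, k, hi, segs)  # right section
            return segs
    segs.append((lo, hi))  # every interior point is within the bound
    return segs


readings = [0, 0, 0, 10, 0, 0, 0]
print(approximate(readings, max_error=1))
# prints [(0, 1), (1, 2), (2, 3), (3, 4), (4, 6)]
```

Only the end points of each returned pair would need to be transmitted to the sink.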

In particular, the function calculate_error(X, k) in the pseudo code calculates the difference between the approximating value and the actual value at point $k$ of the time series $X = (x_1, x_2, \ldots, x_n)$. First, one line segment is calculated whose two end points are $x_1$ and $x_n$. Then, for each $k$, the distance between the approximating value and the actual value at $k$ is calculated.

Fig. 3 illustrates the algorithm for a series of 6 data points. First, one line segment is assumed for all 6 points; this line connects the first point and the sixth (final) point of the series. This line segment is then checked against the error bound. The series is scanned from left to right, and the scan stops at the fourth point because it does not meet the error bound. The line segment is then split at this point into two sub-segments ($x_1$, $x_4$) and ($x_4$, $x_6$). The algorithm is applied recursively to each sub-segment until all of them meet the error bound. In this example, the first sub-segment needs to be split one more time while the second does not. The final result is three line segments that approximate the series of six points.

After running the algorithm above, a set of line segments has been obtained, each of which meets the given error bound. However, some adjacent line segments can be grouped into one line segment that still meets the required error bound. The benefit of merging is that the number of line segments is reduced, so the number of data points that need to be communicated decreases. Transmission in wireless sensors uses considerable energy, so reducing the number of data points transmitted has the added advantage of extending the lifetime of the sensor nodes. The following pseudo code presents the algorithm for merging the nearby line segments.

    Merging(Segs)
        for i = 1 to length(Segs) − 1
            for j = i + 1 to length(Segs)
                if not meet_error_bound(Segs(i), Segs(j))
                    merge(Segs(i), Segs(j − 1))
                    break
                end
                i = j
            end
        end
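One possible reading of the merging step is sketched below in Python. It is an interpretation, not the authors' code: it assumes that the scan extends a run of adjacent segments for as long as the single line from the start of the run to the end of the candidate segment still meets the error bound, and that scanning resumes after the merged run. The names `meets_error_bound` and `merge_segments`, and the (start_index, end_index) segment format, are hypothetical.

```python
def meets_error_bound(series, lo, hi, max_error):
    """True if every point strictly between lo and hi is within max_error of line (lo, hi)."""
    for k in range(lo + 1, hi):
        approx = series[lo] + (series[hi] - series[lo]) * (k - lo) / (hi - lo)
        if abs(series[k] - approx) > max_error:
            return False
    return True


def merge_segments(series, segs, max_error):
    """Greedily merge runs of adjacent segments, scanning left to right."""
    merged = []
    i = 0
    while i < len(segs):
        start, end = segs[i]
        j = i + 1
        # Extend the run while one line from `start` to the end of segs[j] still fits.
        while j < len(segs) and meets_error_bound(series, start, segs[j][1], max_error):
            end = segs[j][1]
            j += 1
        merged.append((start, end))
        i = j  # resume after the merged run
    return merged


series = [0, 1, 2, 3, 10]
segs = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(merge_segments(series, segs, max_error=0.5))
# prints [(0, 3), (3, 4)]
```

In this example the first three unit segments lie on one straight line, so they collapse into a single segment and only three end points remain to be transmitted instead of five.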

The function meet_error_bound(Segs(i), Segs(j)) checks whether the line segment formed by merging the $j - i + 1$ line segments from Segs(i) to Segs(j) meets the error bound. The new line segment is created in the function merge(Segs(i), Segs(j)) by connecting the first point of Segs(i) to the last point of Segs(j).

Variations on the top-down algorithm were independently introduced in several fields in the early 1970s. For example, in cartography it is known as the Douglas-Peucker algorithm [6], but the break point there is the vertex furthest from the line segment, not the first vertex that violates the error bound as in our algorithm. In image processing, it is known as Ramer's algorithm [12], which produces a small number of vertices that lie on a given curve. Lavrenko et al. use the top-down algorithm to support the concurrent mining of text and time series [13] in an attempt to discover the influence of news stories on the financial markets. All of these previous works adapt the algorithm for general cases or for specific purposes such as image processing, data mining, or financial markets. The proposed algorithm is intended for sensed data in sensor networks, which has significant temporal correlation. Because of the temporal correlation, the values in the data series are quite similar, so approximating them by line segments is beneficial and the number of line segments obtained is likely to be small. The main motivation for this research is to reduce the complexity of the algorithm as much as possible, as this has a large effect on the performance of the sensor nodes. We present the computational complexity analysis of the proposed algorithm next.

B. Time complexity analysis

Let $T(n)$ be the time required by the proposed algorithm to approximate a data series of length $n$.
Then the following recurrence equations express $T(n)$:

$$T(n) = T(i) + T(n - i) + i, \qquad T(1) = \Theta(1)$$

The parameter $i$ is the number of points on the left of the series that meet the required error bound before the first point that does not. The worst case is $i = n - 1$, when the first point that does not meet the error bound is the point immediately before the right end point of the line segment. In that case, $T(n) = T(n - 1) + \Theta(1) + n - 1$, yielding $T(n) = \Theta(n^2)$.

However, the performance evaluation suggests that the average complexity is far better. Although performing a worst case analysis on an algorithm is usually easy, it often results in a very pessimistic estimate of its performance. The known alternative performance measures seem more useful, but deriving their asymptotic behavior is significantly more difficult. One such measure is the average complexity under a certain probability distribution of the input.

It is hard to determine the average value of $i$; clearly it can be $2, 3, \ldots, n - 1$. Let $T_a(n)$ denote the average time of the algorithm for X[1 : n]. The possible cases are:

$$\begin{aligned}
T(n) &= T(2) + T(n - 2) + 2 \\
T(n) &= T(3) + T(n - 3) + 3 \\
&\;\;\vdots \\
T(n) &= T(n - 1) + T(1) + n - 1
\end{aligned}$$

The average value of $T(n)$ is then the sum of all the possible values divided by $n - 1$:

$$T_a(n) < \frac{2}{n - 1}\bigl(T(1) + T(2) + \cdots + T(n - 1) + 2 + 3 + \cdots + (n - 1)\bigr)$$

It can safely be assumed that each recursive call takes some average time, and so we have:

$$\begin{aligned}
T_a(n) &= \frac{2}{n - 1}\bigl(T_a(1) + T_a(2) + \cdots + T_a(n - 1) + 2 + 3 + \cdots + (n - 1)\bigr) \\
(n - 1)\,T_a(n) &< 2\bigl(T_a(1) + T_a(2) + \cdots + T_a(n - 1) + n + n + \cdots + n\bigr) \\
(n - 1)\,T_a(n) &= 2\bigl(T_a(1) + T_a(2) + \cdots + T_a(n - 1)\bigr) + c n^2
\end{aligned}$$

Substituting $n - 1$ for $n$ across the board gives:

$$(n - 2)\,T_a(n - 1) = 2\bigl(T_a(1) + T_a(2) + \cdots + T_a(n - 2)\bigr) + c(n - 1)^2$$

Subtracting this from the previous equation, and using $c n^2 - c(n - 1)^2 = 2cn - c$:

$$\begin{aligned}
(n - 1)\,T_a(n) &= 2\,T_a(n - 1) + 2cn - c + (n - 2)\,T_a(n - 1) \\
(n - 1)\,T_a(n) &= n\,T_a(n - 1) + 2cn - c < n\,T_a(n - 1) + 2cn
\end{aligned}$$

Dividing both sides by $n(n - 1)$ gives:

$$\frac{T_a(n)}{n} < \frac{T_a(n - 1)}{n - 1} + \frac{2c}{n - 1}$$

Define $X(n) = \dfrac{T_a(n)}{n}$. Then this gives:

$$\begin{aligned}
X(n) &< X(n - 1) + \frac{2c}{n - 1} \\
X(n - 1) &< X(n - 2) + \frac{2c}{n - 2} \\
&\;\;\vdots \\
X(2) &< X(1) + 2c
\end{aligned}$$

Adding and canceling we arrive at:

$$X(n) < X(1) + 2c\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n - 1}\right)$$

The harmonic sum on the right grows as $\log n$. Therefore, $T_a(n) = n\,X(n) = O(n \log n)$, which is the $n \log n$ average-case cost claimed for the proposed algorithm.

Regarding the complexity of the merging algorithm, the worst case is also $\Theta(n^2)$. The average case is more difficult to analyze, but it is expected to be $\Theta(n \log n)$. The performance evaluation in the next section compares the proposed algorithm with the target one.

IV. PERFORMANCE EVALUATION

To analyze the efficiency of the proposed scheme, the performance of the proposed algorithm was compared with that of the target greedy PLAMLiS algorithm in EEDC. Because the proposed algorithm is about reducing the complexity of the previous one, running time is the main metric of comparison; both algorithms were therefore implemented in C++ for ease of evaluation. In the simulations, the time series data is the illumination information collected from one sensor node ($0 \le \text{value} \le 255$). The sensor is fixed and collects the illumination during a day; each series of 500 data values (points) is approximated by a number of line segments using the two algorithms presented above. The running time of each algorithm and the number of line segments it outputs are then recorded for comparison. The simulations cover one day, meaning that the input data follows the change in illumination from morning to evening.

Fig. 4. The full time series data of the experiment.

The main objective of the proposed algorithm is to reduce the complexity of the existing algorithms. Lowering the complexity of the processing at the sensor node allows energy savings to be made there, so the running time of the algorithm is used as the first metric of comparison. However, a reduction in complexity must not come at the cost of more line segments, because more segments mean a larger volume of data to transmit, and heavier radio transmission leads to excessive energy consumption at the sensor nodes. Therefore, the number of line segments produced by the two algorithms is also compared to verify the performance.
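The experimental setup described above can be mimicked with a small harness such as the following Python sketch. It is only an illustration under stated assumptions: the measured illumination trace is not available here, so `synthetic_day` generates a hypothetical stand-in (a smooth rise and fall with seeded noise, clipped to 0..255), and the number of error checks is used as a machine-independent proxy for running time. The function names and the iterative, stack-based form of the top-down split are choices made for this example, and no greedy PLAMLiS baseline is included.

```python
import math
import random


def count_topdown(series, max_error):
    """Return (segments, error_checks) for the top-down first-violation split."""
    segments = 0
    checks = 0
    stack = [(0, len(series) - 1)]
    while stack:
        lo, hi = stack.pop()
        split = None
        for k in range(lo + 1, hi):
            checks += 1
            approx = series[lo] + (series[hi] - series[lo]) * (k - lo) / (hi - lo)
            if abs(series[k] - approx) > max_error:
                split = k
                break
        if split is None:
            segments += 1  # the whole range fits one line segment
        else:
            stack.append((split, hi))  # right section, processed later
            stack.append((lo, split))  # left section, processed first
    return segments, checks


def synthetic_day(n=500, noise=2.0, seed=0):
    """Hypothetical stand-in for one buffer of illumination readings in [0, 255]."""
    rng = random.Random(seed)
    values = []
    for t in range(n):
        base = 255.0 * math.sin(math.pi * t / (n - 1))  # dark, bright, dark
        values.append(min(255.0, max(0.0, base + rng.uniform(-noise, noise))))
    return values


if __name__ == "__main__":
    series = synthetic_day()
    for max_error in (2.0, 5.0, 10.0):
        segs, checks = count_topdown(series, max_error)
        print(f"max_error={max_error}: {segs} segments, {checks} error checks")
```

Comparing the reported number of error checks with $n \log_2 n$ and $n^2$ for $n = 500$ gives a rough feel for where a particular input falls between the average and worst cases analyzed in Section III.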
