www.globalbigdataconference.com Twitter : @bigdataconf
By Dr. Shyam Sundar Sarkar and Ayush Sarkar (AyushNet)
CALSTATDN Model
Source: Aggarwal, C: Managing and Mining Sensor Data. Springer Science and Business Media, New York, 2013.
• Broadly, there are two major approaches to data acquisition: pull-based and push-based (see figure). In the pull-based sensor data acquisition approach, the user defines the interval and frequency of data acquisition. Pull-based systems follow only the user's requirements and pull sensor values as defined by the queries. For example, using the SAMPLE INTERVAL clause of the query in the figure, users can specify the number of samples and the frequency at which the samples should be acquired.
• In push-based approaches, on the other hand, the sensors autonomously decide when to communicate sensor values to the base station (see figure). Here, the base station and the sensors agree on an expected behavior of the sensor values, which is expressed as a model. If the sensor values deviate from this expected behavior, the sensors communicate only the deviating values to the base station.
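As a hedged illustration (not code from the talk), the push-based idea can be sketched in a few lines of Python: the sensor and base station share a simple model (here, a constant expected value), and the sensor transmits only readings that deviate from the model by more than a threshold. The function names and the threshold are illustrative assumptions.

```python
# Hypothetical sketch of push-based acquisition: a sensor transmits a
# reading only when it deviates from the agreed model by more than eps.

def push_readings(readings, expected, eps=0.5):
    """Return the (index, value) pairs a sensor would push to the base station."""
    pushed = []
    for i, value in enumerate(readings):
        if abs(value - expected) > eps:   # deviation from the shared model
            pushed.append((i, value))
    return pushed

def reconstruct(n, expected, pushed):
    """Base station reconstructs the series from the model plus pushed values."""
    series = [expected] * n
    for i, value in pushed:
        series[i] = value
    return series

readings = [20.0, 20.1, 22.3, 20.2, 19.1]
pushed = push_readings(readings, expected=20.0, eps=0.5)
print(pushed)                                   # only the deviating readings
print(reconstruct(len(readings), 20.0, pushed)) # approximate series at the base station
```

Only two of the five readings cross the threshold, so only those two values travel over the radio; the base station fills the rest in from the model.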
Intel Berkeley Lab With 54 Sensors
Source: http://db.csail.mit.edu/labdata/labdata.html
Download Data Files for Analysis
The x and y coordinates of the sensors (in meters, relative to the upper right corner of the lab) are given in a location file; its three columns are mote id, x location, and y location. The main file is the original log of about 2.3 million readings collected from these sensors (34MB gzipped, 150MB uncompressed). The schema is as follows:
date : yyyy-mm-dd
time : hh:mm:ss.xxx
epoch : int
moteid : int
temperature : real
humidity : real
light : real
voltage : real
Here, epoch is a monotonically increasing sequence number from each mote; two readings with the same epoch number were produced by different motes at the same time. Some epochs are missing from this data set. Mote ids range from 1 to 54; data from some motes may be missing or truncated. Temperature is in degrees Celsius. Humidity is temperature-corrected relative humidity, ranging from 0 to 100%. Light is in Lux (1 Lux corresponds to moonlight, 400 Lux to a bright office, and 100,000 Lux to full sunlight). Voltage is in volts, ranging from 2 to 3; the batteries were lithium ion cells, which maintain a fairly constant voltage over their lifetime.
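A minimal sketch of loading this log with pandas, assuming the documented whitespace-separated column order; the sample lines below only mimic the format and are not real records from the dataset.

```python
# Sketch of loading the log with pandas. The sample mimics the documented
# whitespace-separated format: date time epoch moteid temperature humidity light voltage.
import io
import pandas as pd

sample = """\
2004-03-31 03:38:15.757551 2 1 19.9884 37.0933 45.08 2.69964
2004-03-31 03:38:46.002832 3 1 19.3024 38.4629 45.08 2.68742
"""

cols = ["date", "time", "epoch", "moteid",
        "temperature", "humidity", "light", "voltage"]

# For the real 150MB file, replace io.StringIO(sample) with the path to the log.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=cols, header=None)

# Keep only readings with a valid mote id (1-54) and battery voltage (2-3 V),
# per the schema notes above.
df = df[df["moteid"].between(1, 54) & df["voltage"].between(2, 3)]
print(len(df))
```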
Example of Sensor Data from Lab (2.3 million records from 54 sensors)
Consider Heat Equation (Calculus)
For a function u(x, y, z, t) of three spatial variables (x, y, z) and the time variable t, the heat equation is

∂u/∂t = α (∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)

where α is a constant and u is the function being considered; often it is temperature.
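As a hedged illustration (not code from the talk), the spatial second derivative in the heat equation can be approximated by a central finite difference, and the equation can then be stepped forward in time; the grid, step sizes, and boundary treatment below are illustrative assumptions.

```python
# Central-difference approximation of the second derivative d²u/dx²
# on a uniform grid: (u[i-1] - 2*u[i] + u[i+1]) / h².

def second_derivative(u, h):
    return [(u[i - 1] - 2 * u[i] + u[i + 1]) / (h * h)
            for i in range(1, len(u) - 1)]

# One explicit Euler step of the 1-D heat equation du/dt = alpha * d²u/dx²,
# holding the two boundary values fixed.
def heat_step(u, h, dt, alpha):
    d2 = second_derivative(u, h)
    return [u[0]] + [u[i + 1] + dt * alpha * d2[i] for i in range(len(d2))] + [u[-1]]

u = [0.0, 1.0, 4.0, 1.0, 0.0]   # temperature profile with a hot spot in the middle
print(heat_step(u, h=1.0, dt=0.1, alpha=1.0))
```

One step spreads the hot spot outward, which is exactly the diffusive behavior the equation encodes.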
Consider K-means Clustering (Statistics)
• K-means clustering is a popular method for cluster analysis in machine learning and data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find

arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖²

where μ_i is the mean of points in S_i.
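The WCSS objective itself is only a few lines of code; this is an illustrative sketch, not code from the slides.

```python
# Within-cluster sum of squares (WCSS): for each cluster, sum the squared
# Euclidean distance from each point to its cluster mean, then total over clusters.

def mean(points):
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def wcss(clusters):
    total = 0.0
    for points in clusters:
        mu = mean(points)
        total += sum(sum((p[j] - mu[j]) ** 2 for j in range(len(mu)))
                     for p in points)
    return total

# Two tight clusters of two 2-D points each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(wcss(clusters))
```

K-means searches over partitions S to make this number as small as possible.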
K-means Algorithm
Given an initial set of k means m_1^(1), …, m_k^(1) (see below), the algorithm proceeds by alternating between two steps:
(1) Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)

S_i^(t) = { x_p : ‖x_p − m_i^(t)‖² ≤ ‖x_p − m_j^(t)‖² for all j, 1 ≤ j ≤ k }

where each x_p is assigned to exactly one S_i^(t), even if it could be assigned to two or more of them.
(2) Update step: Calculate the new means to be the centroids of the observations in the new clusters:

m_i^(t+1) = (1 / |S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j

Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
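The two alternating steps above can be sketched as a minimal Lloyd's algorithm in Python; the initial means, points, and iteration count are illustrative assumptions.

```python
# Minimal Lloyd's algorithm: alternate the assignment step and the update step.

def kmeans(points, means, iters=10):
    clusters = [[] for _ in means]
    for _ in range(iters):
        # Assignment step: each point goes to the nearest mean
        # (ties broken by lowest cluster index, so each point lands in exactly one set).
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, means[i])))
            clusters[i].append(p)
        # Update step: each mean becomes the centroid of its cluster
        # (empty clusters keep their old mean).
        means = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else m
                 for pts, m in zip(clusters, means)]
    return means, clusters

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 9.0)]
means, clusters = kmeans(points, means=[(0.0, 0.0), (10.0, 10.0)])
print(means)   # centroids of the two well-separated groups
```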
Computational Complexity of K-means Algorithm
If k and d (the dimension) are fixed, the problem can be solved exactly in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
Consider Database Normalization (Data Normalization)
In this project, the objective is to apply normalization to sensor data sets to reduce redundancy and improve partitioning, and then to efficiently apply a parallel machine learning algorithm for analysis.
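A hypothetical sketch of the idea, not the project's actual schema: a denormalized table repeats each mote's fixed (x, y) coordinates on every reading, so normalization splits it into a small locations table (one row per mote) and a readings table.

```python
# Hypothetical normalization step: split a denormalized table of
# (moteid, x, y, temperature) rows into a locations table and a readings table,
# so the repeated x/y coordinates are stored only once per mote.

raw = [
    (1, 0.5, 1.0, 19.9),
    (1, 0.5, 1.0, 20.1),   # same mote, coordinates redundantly repeated
    (2, 3.0, 2.5, 21.4),
]

locations = {}             # moteid -> (x, y), one entry per mote
readings = []              # (moteid, temperature), coordinates factored out
for moteid, x, y, temp in raw:
    locations[moteid] = (x, y)
    readings.append((moteid, temp))

print(locations)           # coordinates stored once per mote
print(len(readings))       # all readings retained, without the redundancy
```

Beyond saving space, the normalized tables can be partitioned by mote or by spatial cluster, which is what lets the later machine learning steps run in parallel.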
Complete Workflow of Sensor Data Analytics System
Five sensor clusters computed by K-means clustering applied to the fifty-four sensors' x and y coordinates within the Intel Berkeley lab
Point graphs of the second derivatives of temperature w.r.t. the x and y coordinates, and their K-means values, for sensors located in physical clusters 0 and 1
Point graphs of the second derivatives of temperature w.r.t. the x and y coordinates, and their K-means values, for sensors located in physical clusters 2 and 3
Point graphs of the second derivatives of temperature w.r.t. the x and y coordinates, and their K-means values, for sensors located in physical cluster 4
Point graphs of the first derivatives of temperature w.r.t. time, computed using the heat equation, for physical clusters 0, 1, 2, and 3. The continuous graphs of temperature w.r.t. time for the clusters correspond to Runge-Kutta integration over these derivatives (using the Berkeley Madonna tool)
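A minimal sketch of the classical fourth-order Runge-Kutta scheme used for this kind of integration; the right-hand side below is an illustrative Newton-cooling law standing in for the heat-equation derivatives, not the actual values from the sensor data.

```python
# Classical RK4 step for dT/dt = f(t, T). The integration scheme is standard;
# the cooling() right-hand side is an illustrative stand-in for the
# heat-equation derivative values.

def rk4_step(f, t, T, dt):
    k1 = f(t, T)
    k2 = f(t + dt / 2, T + dt * k1 / 2)
    k3 = f(t + dt / 2, T + dt * k2 / 2)
    k4 = f(t + dt, T + dt * k3)
    return T + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def cooling(t, T, ambient=20.0, k=0.1):
    """Newton's law of cooling: the temperature relaxes toward ambient."""
    return -k * (T - ambient)

T, t, dt = 30.0, 0.0, 1.0
for _ in range(5):
    T = rk4_step(cooling, t, T, dt)
    t += dt
print(round(T, 3))   # after 5 time units, T has decayed toward ambient
```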
Point graphs of the first derivatives of temperature w.r.t. time, computed using the heat equation, for physical cluster 4. The continuous graph of temperature w.r.t. time for the cluster corresponds to Runge-Kutta integration over these derivatives (using the Berkeley Madonna tool)
Performance gain of CALSTATDN Model over the Control Variable
Conclusions
A new model (CALSTATDN) for data normalization (DN) based on calculus (CAL) and statistics (STAT) allows for:
• Data normalization leading to efficient data partitioning of very large sensor datasets;
• Applying parallel, distributed, statistical machine learning algorithms on the normalized and partitioned sensor datasets;
• Improving performance by 7.2 times over the control variable, which is the "raw" (denormalized) single sensor dataset stored in one big table.
The computational complexity O(n^(dk+1) log n) of the K-means algorithm played a major role in demonstrating the reduction of execution time using the CALSTATDN model. The model partitions the dataset so that n, d, and k are all reduced for each parallel execution, enabling a massive reduction in total computational complexity compared to the original control variable.
Thank You!
E-mail of Shyam Sarkar: [email protected]
E-mail of Ayush Sarkar: [email protected]