Labeling Sensing Data for Mobility Modeling

Jesse Read, Indrė Žliobaitė, Jaakko Hollmén

Helsinki Institute for Information Technology HIIT, Aalto University School of Science, Department of Information and Computer Science, PO Box 15400, FI-00076 Aalto, Espoo, Finland
{jesse.read,indre.zliobaite,jaakko.hollmen}@aalto.fi

Abstract

In urban environments, sensory data can be used to create personalized models for predicting efficient routes and schedules on a daily basis, and also at the city level to manage and plan more efficient transport, and to schedule maintenance and events. Raw sensory data is typically collected as time-stamped sequences of records, with additional activity annotations by a human; but in machine learning, predictive models view data as labeled instances and depend upon reliable labels for learning. In real-world sensor applications, human annotations are inherently sparse and noisy. This paper presents a methodology for preprocessing sensory data for predictive modeling, in particular with respect to creating reliable labeled instances. We analyze real-world scenarios and the specific problems they entail, and experiment with different approaches, showing that a relatively simple framework can ensure quality labeled data for supervised learning. We conclude the study with recommendations to practitioners and a discussion of future challenges.

Keywords: sensory data, sensor fusion, hidden Markov models, multi-label

1. Introduction

The availability and penetration of smart mobile devices is increasing; smartphone penetration in Europe is already more than 50% [1], and is forecast to continue growing at a double-digit annual rate through to the end of 2017. Mobile sensing systems are finding their way into many application areas, such

Preprint submitted to Information Systems, Special Issue on Mining Urban Data. July 31, 2015

as monitoring human behavior, social interactions, commerce, health, traffic, and the environment [2]. The pervasiveness of mobile phones and the fact that they are equipped with many sensor modalities make them ideal sensing devices. Since mobile phones are personal devices, we can use mobile sensing to probe the owner of the phone and the environment in which the user is moving. Our general interest is to use mobile phones to learn about the mobility patterns of people, and to reason about and predict their mobility in urban traffic environments.

The idea of using mobile phones as sensors is not new: mobile phones were used for context recognition (e.g. [3]) and for measuring social interactions (e.g. [4]) in complex social systems already about a decade ago. Nowadays, smartphones are equipped with a wide range of sensors, including motion, location and environment sensors, that allow collecting rich observational data about human mobility in urban areas. Various predictive modeling tasks can be formulated based on such data. For example, one may be interested in recognizing the current activity of a person [5], their levels of stress or depression [6] or other metrics of health, predicting their next location [7], or predicting a trajectory of movement [8, 9].

In this study, we present a methodology for preprocessing such sensory data for machine learning purposes, and its use for analyzing, modeling and predicting human mobility in urban areas. Note that although our experiments involve activity recognition, solving this particular task is not our focus; there is already considerable literature on this topic (see e.g., [5, 10, 11]). Rather, we focus on cleaning partially-labeled data, and on general analytics and classification of this data, in particular with respect to the manual annotations. The goal is to ensure a degree of reliability such that the data can be used by supervised learning algorithms.
The main contributions of this study are: a survey of tasks involved with mobile sensing in urban environments; identification, via a case study, of issues that arise in this domain; the formulation of a methodology for preprocessing and cleaning sensory data for predictive modeling, in particular for creating reliable labeled instances; and the highlighting of important questions for future research. We focus on the need to automate the process of cleaning and pre-processing, rather than relying on human analysis. This paper extends the preliminary report of [12].

We continue the paper with Section 2, giving an overview of our methodological approach. The sections following are organized with respect to the

plates of Fig. 1b: Section 3 deals with preprocessing for aggregation and fusion of both the input data and the output data (the latter case we term simply ‘labeling’), to form a set of time-indexed instances. Section 4 outlines a general methodology for the intermixed process of cleaning and classification of data. Section 5 deals with some of the analytical and modeling issues that can be approached once reliable labeled data is available. Section 6 discusses overall results obtained from the experiments throughout the paper, offers recommendations to practitioners, and comments on future work. Finally, Section 7 provides conclusions.

2. Preprocessing Methodology

We begin by presenting our methodological approach at a conceptual level. In the following sections we then discuss the corresponding algorithmic techniques to be used at the different steps of the preprocessing process.

2.1. Challenges in preprocessing

The task of data preprocessing in mobile sensing is not trivial. Data from sensors is collected as a sequence of time-stamped observation records, but these records are not equally spaced in time. Moreover, the timestamps of records from different sensors do not match, i.e., they are not aligned. In addition, observation records can be of different types: recordings of discrete events (e.g. battery charger plugged in), of processes realised over a period of time (e.g. acceleration), or static measurements (e.g. current temperature) of continuous fluctuations. For example, consider the battery sensor data (left) with accelerometer data (right):

timestamp,temperature,voltage
1371211281,330,4191
1371211281,330,4191
1371211292,330,4190
1371211293,330,4190
1371211293,330,4190
1371211300,330,4119
1371211300,330,4119
1371211301,330,4152
1371211341,330,4190

timestamp,X,Y,Z
1371211283,-3.027305,7.893985,6.5144534
1371211283,-3.027305,6.1312504,7.817344
1371211283,-3.027305,6.1312504,7.8556643
1371211283,-3.1039455,6.1695704,7.7790236
...
1371211283,-3.027305,6.207891,7.664063
1371211283,-3.1422658,6.207891,7.664063
1371211284,-3.1039455,6.207891,7.817344
1371211284,-3.180586,6.09293,7.7790236
1371211284,-3.027305,6.246211,7.664063
1371211284,-3.027305,6.09293,7.6257424
1371211284,-3.1039455,6.1312504,7.664063
1371211284,-3.027305,6.246211,7.7790236
1371211284,-2.9889846,6.09293,7.7407036

Standard machine learning approaches for predictive modeling require data to be represented as instances. An instance (or example, case, or record)

is a single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction) [13]. The main data preprocessing task is to aggregate data from timestamped records and convert it into time-indexed instances for machine learning.

The problem of sensory data preprocessing is not new, although typically in the literature an arbitrary data aggregation approach is chosen and briefly mentioned (or not reported at all). There is a lack of dedicated studies focusing on the problem of preprocessing itself. Furthermore, the existing literature on preprocessing of mobile sensing data mainly deals with feature extraction from one kind of sensor (e.g. accelerometer or GPS signal) [14, 15], which is only one side of the problem.

Aside from data records from sensors, human annotations also need to be turned into labels associated with instances. The annotations come in the form of timestamped start and stop tags. In our mobility analysis case, annotations denote one of k activities, for example start walk, stop walk. They are inevitably incomplete, imprecise and generally noisy, for a number of reasons: simple delays, forgetting to stop an annotation, stopping an activity which was not yet started¹, or being retrospectively unsure when an activity began and annotating with a guess. In an informal mobility-analysis situation, a user will only give an annotation when it is convenient or appropriate (e.g., not in social situations where using a mobile phone is inappropriate or difficult, like in a work meeting or while riding a bike). Even in highly controlled expert settings, noisy annotations are often unavoidable. For example, in healthcare, even domain experts (doctors) do not always know when a disease has started to manifest in a body, and can only approximate this time [16], or may miss it entirely. The act of annotation itself can also feed back into the accelerometer measurements (i.e., introduce noise).
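To make the aggregation step concrete, the following is a minimal sketch (in Python, with hypothetical names; not the tool used in our data collection) of binning raw timestamped records, such as the battery dump shown above, into one averaged instance value per second:

```python
from collections import defaultdict

def aggregate_per_second(records):
    """Aggregate (timestamp, value) records into one averaged value per second.

    `records` is an iterable of (unix_timestamp, float) pairs, possibly
    several per second and unevenly spaced, as in the raw sensor dumps above.
    Returns a dict mapping each second t to the mean of its readings.
    """
    bins = defaultdict(list)
    for ts, value in records:
        bins[int(ts)].append(value)          # bin each record into its second
    return {t: sum(vs) / len(vs) for t, vs in sorted(bins.items())}

# Toy example in the spirit of the battery dump: repeated, uneven timestamps.
records = [(1371211281, 4191.0), (1371211281, 4191.0), (1371211292, 4190.0)]
instances = aggregate_per_second(records)
# instances == {1371211281: 4191.0, 1371211292: 4190.0}
```

Gaps (seconds with no readings) simply produce no instance here; interpolation across such gaps is discussed in Section 3.1.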
Since supervised learning relies on well-labeled training data, it is fundamental to process annotations into instances properly.

Activities may also be multi-labeled. A person may walk on an escalator, be lecturing at work, or talking on the phone while on a bus. An unfortunate patient may suffer from various health issues at the same time. The multi-label case is increasingly tackled in the supervised machine learning literature (involving hundreds of papers in recent years – see [17]

¹ A strict program interface can avoid this type of error, but many – such as the one we employed for data collection – do not.



Figure 1: The overall methodology from raw data to predictive models and analysis. See Table 1 for details of notation. Dashed lines represent irregular arrival of data, solid lines represent time-segmented data into t = 1, 2, . . .. The plates in Fig. 1b partition the sequence into processes which we deal with section by section in the paper.

for a recent review of several important methods), but it is still not commonly addressed in the mobile sensing and activity-recognition literature.

2.2. Methodological approach

Our approach to preprocessing mobile sensing data for predictive modeling purposes is summarized by Figure 1a, which outlines the full process from raw sensor data to a predictive model. Table 1 outlines the relevant accompanying notation. In brief, the (possibly overlapping) steps are:

1. data aggregation/fusion into instances, from both
   • input sensor data, and
   • annotations (i.e., labeling);

2. instance cleaning;
3. model building.

We obtain sensor data from d different sensors (for example, GPS, accelerometer, battery level, temperature). Readings arrive at different times and at different sampling rates. This input is converted into instances at regular time points t, i.e., observations xt = [x1, . . . , xd] over time t = 1, . . . , T.

Table 1: Notation. Note that the t in xt represents time stamp t = 1, . . . , T, whereas the i in xi represents feature index (i = 1, . . . , d). Where needed, x̃i(r) represents the i-th feature value at time stamp r ∈ R, and xi(t) the value at time step t ∈ {1, . . . , T}. Otherwise, it should be assumed that raw data (x̃ and ã) implicitly has time in it; namely unevenly spaced in continuous time, later segmented into an instance at time t.

Symbol – description
x̃1, . . . , x̃d – raw sensor data (e.g., GPS, accelerometer)
xt = [x1, . . . , xd] – aligned data vector from d sensors, at time t, missing values dealt with
ã1, . . . , ãk – annotations: ãj = +1 (start) or ãj = −1 (stop), for k activities
ỹt = [ỹ1, . . . , ỹk] – label-vector representation of annotations at time t
xt, ỹt – labeled training instance, at time t
xt, yt – label-corrected training instance

Each time point t denotes an instance. In our main domain of interest activities can change within a few seconds, and hence we consider time points in seconds, but different units may be more suitable in other domains. For example, a patient’s health may be measurable from hour to hour or even day to day. Obviously, a larger time step also produces smaller data files. In any case, it is fundamental to note the difference we make between a time point (or time step), which serves as an instance ID, and a time stamp, which represents any moment in time. Thus, we create the data set x1, . . . , xT.

Meanwhile, the sensor stream may be annotated by one of k possible labels, where ãj = +1 indicates the start of the j-th activity, i.e., that the current and following input data is to be annotated with the j-th label, until ãj = −1 indicates stop for this same activity j. Note that, for generality, we do not discard the possibility that two ãj = +1 arrive without an ãj = −1 (the same activity j is started twice). The goal is to create a label vector yt = [y1, . . . , yk], where yj = 1 if the j-th label is relevant to time t, corresponding to the sensor input xt.

When input xt and output yt are paired together to form a data set (also viewable as a data stream), we can learn a predictive model h such that yt = h(xt) estimates the relevance of the set of activities (yj = 1 indicates that the j-th activity is relevant). This is useful in a wide range of sensor applications, for example to determine whether a user is currently working or in a meeting given the measurements from their mobile phone sensors, or whether a patient has developed a particular disease, given sensor measurements monitoring

their health.

To summarize the challenges:

1. Raw sensor observations x̃ do not fall naturally into instances t = 1, 2, . . . , T, and may contain missing and/or corrupted values.
2. Different sensors work at different frequencies (the GPS receiver may take a measurement every 10 seconds, whereas accelerometer values may arrive as a batch of several dozen in one second), whereas all input needs to be framed into instances on a uniform time scale.
3. Manual annotations may be missing, time shifted (delayed with respect to the input data), or simply incorrect.
4. The act of annotating (physical handling of the device) may feed back into the input data.
5. Annotations may overlap in time, and therefore a simple multi-class model trained to predict one of k values is not always appropriate; a subset of activities may be relevant simultaneously.

In the following sections we discuss the processing tasks and relevant algorithmic techniques in each step of our proposed methodological framework. Along with recommendations, we experimentally illustrate the performance of these algorithmic techniques on an experimental dataset, covering a broad range of activities recorded over a six-month period.

2.3. Experimental dataset

The dataset has been collected using contextLogger3² [18], an open source software tool for smartphone data collection and annotation based on the Funf open-sensing framework [19]. The data used in this study was collected during the period from 7 February 2013 to 27 January 2014, using a Sony Ericsson Xperia Active phone with Android OS, v2.3 (Gingerbread). Summarizing our data collection up to January 2014, we have over 300 million timestamped records, resulting in approximately 13 GB of data. The capacity of the battery is 1 200 mAh. Table 2 provides details on sampling period and duration. Period indicates how often a given sensor is activated, and duration indicates for how long the sensor stays activated.
For example, if the period is 120 and the duration is 30, it means that the accelerometer is activated every 120 sec, and is

² https://github.com/apps8os/contextlogger3


Table 2: Data collection rates: sampling period and sampling duration (in sec.)

Sources: AccelerometerSensorProbe, ActivityProb, AndroidInfoProbe, BatteryProbe, BluetoothProbe, CellProbe, GravitySensorProbe, GyroscopeSensorProbe, HardwareInfoProbe, LightSensorProbe, LocationProbe, MagneticFieldSensorProbe, OrientationSensorProbe, RotationVectorSensorProbe, RunningApplicationsProbe, TemperatureSensorProbe, WifiProbe, DataUploadPeriod, Activity annotation.

Until June 14, 2013 – Period / Duration values: 86 400, 1 800, 600, 30, 120, 86 400, 1 800, 30, 1 800, 30, 1 800, 30, 120, 30, 60, 1 800, 600, 3 600, manual

After June 14, 2013 – Period / Duration values: 30, 30, 30, 30, 86 400, 1 800, 300, 300, 30, 30, 30, 30, 86 400, 120, 30, 120, 30, 120, 30, 30, 30, 30, 30, 30, 300, 3 600, manual

collecting data for 30 sec. On June 14 the settings for the data collection rates were changed.

From this full set we have also used a particular subset of about an hour and a half, as the user wanders through a terminal at Frankfurt Airport, either walking, riding the escalator, both simultaneously (walking on the escalator), or neither. We have made this section of the data publicly available for research³. A summary related to aggregation and fusion of readings is given in the following section. Unlike our earlier work [12], this paper focuses on labelling, which is why we took a specific manageable subset (in terms of visual appreciation of results) for the empirical sections. However, we include overall details and discussion of the full dataset to provide an idea of the scale and the type of data.

3. Data Fusion/Aggregation and Labeling

We start our discussion with the first steps of the data preprocessing methodology from Figure 1b – namely fusion/aggregation and labeling – covered in Sections 3.1 and 3.2, respectively. We deal with these steps in the same section on account of their relatedness: sensor fusion involves data alignment, and labeling is simply data alignment in the label space; both processes output a time-indexed vector. However, they are complex for different reasons, as we detail in the following.

3.1. Fusion and Alignment: from raw sensor readings to instances

The general task of integrating data from several different sources is known as data fusion. When these sources are sensors, this is known specifically as sensor fusion. The idea is to fuse and aggregate data from different sensors. Comprehensive reviews are given in [20, 21, 22, 16]. Prior to fusion, the data must be aligned so as to produce instances with features. General options for alignment and feature creation are detailed in [14] (with respect to accelerometer data), including moving averages, splines, fractional delay filters, and nearest-neighbour interpolation.
Interpolation is a typical option. Consider irregularly spaced (raw) points

³ http://users.ics.aalto.fi/jesse/data/

{x̃(r)} at times r ∈ R, where the set R contains all the time points ‘close to’ time t, e.g., R = {t − 1 < r ≤ t + 1} (within one time step of discrete t); then a new point x(t) can be interpolated as

    x(t) = (1/|R|) · Σ_{r ∈ R} x̃(r)

This is often called nearest neighbour smoothing, and can be seen as a simple moving average over a window. In typical sensor applications, e.g., on mobile phones, the data readings are much more frequent than the labels change, so extending the window forward such that some r > t are included can be totally acceptable. More advanced techniques involve kernel smoothing, where

    x(t) = Σ_{r ∈ R} x̃(r) w(r, t) / Σ_{r′} w(r′, t)

and w(r, t) is a weight determined by some kernel. Fig. 2a demonstrates this technique with a Gaussian kernel on a simple toy example, using R = {t − 5 < r ≤ t + 5}. We find the simpler nearest neighbour smoother adequate for the tasks we investigate, but with R = {t − 1 < r ≤ t + 1} instead, since many readings are available for each second. These alignment methods have the additional effect of reducing noise by smoothing raw sensor readings; this is the method that we used in the final experiments.

Fig. 2b shows sensor fusion, where multiple sensor readings are fused into a single value. Note that this fused reading gives a much clearer visual interpretation of change points compared to the original individual readings. In this case, the sensor values are related (they are the three axes of the same accelerometer), but in combination with alignment and a choice of fusion function, quite different sensors could also be combined. As the focus of this paper is on labelling, we do not look further into the specifics of data fusion, although we attach Appendix 8 to show how sensor fusion may be carried out with battery data. Advanced methods for data fusion (as opposed to simply aligning and aggregating) are Kalman filter smoothers and Monte Carlo simulation-based techniques [23, 20, 21], which can be applied to deal with the noise and inconsistency associated with sensor readings. These methods are widely used on the input, but less attention has been paid to the possible outputs, which we address specifically in the following.
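As a concrete illustration of the smoothing formulas above, the following sketch implements the Gaussian-kernel weighted average; nearest neighbour smoothing is the special case of constant weights. The window and bandwidth values here are illustrative assumptions, not the parameters of our experiments:

```python
import math

def gaussian_smooth(readings, t, half_window=5.0, bandwidth=2.0):
    """Kernel-smoothed value x(t) from irregularly spaced (r, x) readings.

    Implements x(t) = sum_r x(r)*w(r,t) / sum_r w(r,t) over readings whose
    timestamp r falls within `half_window` of t, using the Gaussian kernel
    w(r,t) = exp(-(r-t)^2 / (2*bandwidth^2)).
    """
    window = [(r, x) for r, x in readings if t - half_window < r <= t + half_window]
    if not window:
        return None  # no reading near t: leave this instance missing
    weights = [math.exp(-((r - t) ** 2) / (2 * bandwidth ** 2)) for r, _ in window]
    return sum(w * x for w, (_, x) in zip(weights, window)) / sum(weights)

# Hypothetical readings: three near t=1 are averaged; the one at r=7 is ignored.
readings = [(0.2, 1.0), (0.9, 3.0), (1.4, 2.0), (7.0, 10.0)]
x1 = gaussian_smooth(readings, t=1.0, half_window=1.0, bandwidth=1.0)
```

Setting every weight to 1 recovers the nearest neighbour smoother x(t) = (1/|R|) Σ x̃(r) used in our final experiments.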

(a) Sensor alignment    (b) Sensor fusion

Figure 2: An illustration of sensor alignment (2a) and sensor fusion (2b). Using a weighted average, values of two sensors x ˜1 , x ˜2 with irregularly and differently spaced measurements are interpolated, and new time indexed values can be taken at points t = 1, 2, . . . , 20 to create instances xt = [x1 , x2 ]. The variance of the magnitude of 3 readings is used to fuse the readings of multiple sensors ([x1 , x2 , x3 ]) together into a single series of values (indicated by solid black line). We used these techniques in our experimental section.
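The variance-of-magnitude fusion illustrated in Fig. 2b can be sketched as follows; the window radius and example values are hypothetical:

```python
import statistics

def fuse_magnitude_variance(xyz, radius=1):
    """Fuse three accelerometer axes into one series, as in Fig. 2b.

    `xyz` is a list of (x, y, z) triples, one per time step. Each triple is
    first fused into a single magnitude, and the rolling (population)
    variance of the magnitude over the window t-radius..t+radius is
    returned, with None at the edges where the window is incomplete.
    """
    mags = [(x * x + y * y + z * z) ** 0.5 for x, y, z in xyz]
    out = []
    for t in range(len(mags)):
        if t - radius < 0 or t + radius >= len(mags):
            out.append(None)
        else:
            out.append(statistics.pvariance(mags[t - radius:t + radius + 1]))
    return out

# Toy triples with magnitudes [1.0, 1.0, 5.0, 2.0]; the variance is
# defined only at the two interior time steps.
series = fuse_magnitude_variance([(0, 0, 1), (0, 1, 0), (3, 4, 0), (0, 0, 2)])
```

The magnitude makes the measure orientation-invariant, and its local variance is high exactly where the signal changes, which is why change points stand out in the fused series.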



Figure 3: Alignment of the time ranges.

3.2. Labeling: from human annotations to instance labels

Data alignment is particularly challenging with regard to human annotations. We shall refer to this task as labeling. In the alignment of sensor data we are typically dealing with binning data into the nearest time slot, and perhaps filling in missing data and excluding corrupted data; but annotations have the additional unreliability of human input: they are inherently sparse and frequently incorrect. In a typical scenario, such as the one we deal with, annotations provide the start and the end time stamps of activities. Starts and ends are not necessarily paired, i.e., it may happen that there is a start but no end, or that there are three starts in a row and then one end of the same activity. Note that the annotation data consists of discrete values.

3.3. Experiment: aggregation and labeling

By way of an experiment we illustrate data preprocessing challenges in modeling the accelerometer data collection rate for different user activities. Accelerometer data is available only from June 14 in the full data set; hence, we use only that period of data in this experiment. Having two sets of recordings – event annotations and accelerometer records – we first find the minimum (earliest) and the maximum (latest) time stamps in both sets, and discard the records from the non-overlapping parts, as illustrated in Figure 3.

The main challenge in data preparation in this experiment is to extract activity labels from the event annotations. We process annotations in sequence. If there is a start, we consider an activity to be happening (no matter how many other starts of the same activity follow) until any of the following three triggers appears: an annotation stop, an invalidate, or more than 6 hours having passed since the start. The latter rule is

chosen for this particular domain, based on the fact that activities typically carried out in northern Europe take at most around 6 hours at a time, as reported in [24] (except sleeping, which is not one of the possible activities in our study, as it does not relate to mobility). In other domains different time limits may be more sensible; for example in the medical domain, a disease may progress for some months, or even years. In any case, we only enact this rule to catch exceptions, and therefore it is unlikely that varying it will have any significant effect on results.

We encode labels in a T × k label matrix Y, where T is the number of seconds from the beginning of data recording to the end, and k is the number of distinct activities recorded. After alignment and fusion (as described and exemplified in the previous subsection) we obtain a matrix X where each t-th row represents the entry for the t-th second. Figure 4a gives an idea of the amount of data recorded over the full time of the study, and Fig. 4b gives a detailed view of a small section (namely the Frankfurt airport section we deal with later) with different kinds of sensor fusion (all aggregated into seconds): the standardized coordinates of the accelerometer, the number of records per second (which is the output we deal with in the following), and the average variance of the magnitude as exemplified in Fig. 2b (which we follow up with detailed analysis in Section 4). The figure also shows a line to demonstrate how the activities can be separated by this fusion measure.

Given the label matrix Y and record vector X, we can obtain estimates of the average number of records per second for each activity. There is an important modeling decision to be made here. If two or more activities take place at the same time, how does this affect the number of records? Suppose activity j = 1 generates n1 records per second, and activity j = 2 generates n2 records.
We could assume that if activities 1 and 2 take place at the same time, n1 + n2 records are generated. Alternatively, we can assume that if both these activities take place at the same time, max(n1, n2) records are generated. In our experimental study we take the latter approach. Following the first assumption, the data collection rate could be modeled as a linear regression, where the inputs are binary indicators of activities, and the output is the number of records generated. If the second assumption is adopted, each activity is modeled independently, as

    r̄i = Σ_{j=1..T} y_{ji} x_j / Σ_{j=1..T} y_{ji},

(a) Number of records per second over full period

(b) Detailed records and fusion over short period. Data has been standardized.

Figure 4: Accelerometer records over time.


where i denotes the i-th activity, y_{ji} is the j-th entry for activity i in matrix Y, and x_j is the j-th entry of vector X. Note that this approach automatically excludes the periods when the phone was off and no data was collected, since in those cases x_j = 0. With this experimental setup we anticipate that different activities generate different numbers of accelerometer records. Raw sensor data in Android is acquired by monitoring sensor events. A sensor event occurs every time a sensor detects a change in the parameters it is measuring [25]. We expect different activities to have different acceleration patterns, which in turn result in different data collection rates.

Figure 5 shows the resulting estimates of data collection rates for each activity. Data aggregated in such a way can be used, for instance, as a feature for activity recognition. While this feature alone would not be enough to separate all the activities, certain activities could be well distinguished, for instance walking. We see that walking produces the most records per time period, while the at home and in the office activities produce the least. These results make intuitive sense: at home or in the office the phone would typically lie still on the table, so there is not much motion involved. Moreover, we can see that conceptually similar activities appear close together, presenting similar numbers of records. For example, elevator is very close to escalator and funicular, where we would expect smooth, not too fast movement following a straight path. At the other end of the scale, train and tram appear nearby; both are means of transportation over rail.

From this pilot experiment we can conclude that this preprocessing approach achieves the desired effect of converting raw annotations into regularly time-spaced instances, to be used as inputs in machine learning tasks. On the other hand the labels, although now time indexed, may still be considerably noisy.
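The per-activity estimate r̄i above can be computed directly from the label matrix and record counts. The following sketch uses a list-of-rows representation (Y[t][i] = 1 iff activity i is active at second t), which is an implementation choice for illustration rather than our exact code:

```python
def avg_records_per_activity(Y, x):
    """Average records per second for each activity: the r-bar_i estimate.

    Y is a T x k 0/1 label matrix and x is the length-T vector of record
    counts. Seconds where x[t] == 0 (e.g., phone off) add nothing to the
    numerator, matching the remark in the text.
    """
    T, k = len(Y), len(Y[0])
    rbar = []
    for i in range(k):
        num = sum(Y[t][i] * x[t] for t in range(T))   # sum_t y_ti * x_t
        den = sum(Y[t][i] for t in range(T))          # sum_t y_ti
        rbar.append(num / den if den else 0.0)
    return rbar

# Toy data: activity 0 active in seconds 0-1, activity 1 in seconds 1-2;
# second 1 is multi-labeled, second 3 has no activity and no records.
Y = [[1, 0], [1, 1], [0, 1], [0, 0]]
x = [10, 20, 30, 0]
rbar = avg_records_per_activity(Y, x)  # [15.0, 25.0]
```

Note that the multi-labeled second contributes its full count to both activities, consistent with the max(n1, n2) assumption rather than splitting records between overlapping activities.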
In the next section we look at the task of cleaning the data.

4. Data Cleaning

The task of data cleaning involves taking the noisily-labeled input instances X, Ỹ and producing a clean version. We can also view the data as a stream {xt, ỹt}, t = 1, . . . , T (where possibly T = ∞), on account of the strong time context, and because in our analysis we consider incremental algorithms that work


Figure 5: Average number of accelerometer records per second for each activity.

either incrementally online or inside a moving window. Processing can rarely be totally offline in ongoing mobile sensor applications. As we discussed earlier, labels based on human annotation are typically unreliable. Even after filtering through correcting constraints (Section 3.3), labels will be precise only to within a few seconds of the actual event, and typically much worse, so cleaning is a fundamental step. Possible solutions are to:

1. align labels using semi-supervised machine learning methods; or
2. use an unsupervised method; labels can be ‘matched’ with clusters afterwards.

Aligning labels in a semi-supervised fashion has been investigated in e.g., [26] in the context of Hidden Markov Models (HMMs), which we use as inspiration for our methodology. There are many existing unsupervised algorithms which could be applied to the task, such as clustering – in particular time series clustering – and concept-drift detection [27]. We note that unsupervised time series clustering can also be done within an HMM framework using the Baum-Welch algorithm (a version of the expectation-maximization

algorithm) [28]. However, although labeling can be very noisy, we show that it can be cleaned adequately in a semi-supervised manner.

4.1. General methodology for cleaning and classification of instances

Generally, we can consider the task of noisy semi-supervised labeling in either a stream or a batch context. We choose a Hidden Markov Model (HMM) framework to present a general methodology – outlined in Algorithm 1 – for data cleaning and classification, two tasks which overlap considerably. Already a decade ago, [26] noted that HMMs are well suited for modeling spatio-temporal data such as network traffic, in either a semi- or fully supervised setting. Other HMM-based work includes [29, 30, 31]. Recently, [32] presented an extension to account for long-range dependencies. Typically a two-stage process is used, in which a standard classifier is 'corrected' by a Markov process that takes the time component into account. This approach has also surfaced numerous times in the general time-series literature, e.g., [33]. In Algorithm 1 we show a single input x_t and output y_t value, but shortly we explain how this can encompass the multi-label case. Although Algorithm 1 can be seen as implying a probabilistic framework (as is usually the case with HMMs), any kind of classifier can be used, as demonstrated in [34] (in generic data streams with temporal dependence): by training model h directly with an additional column [ỹ_{−1}, . . . , ỹ_{T−1}]ᵀ appended to X. We use linear discriminant analysis. Internally, learning should approximate that of an HMM, since

    p(y_t, x_t, y_{t−1}) ∝ p(x_t | y_t) · p(y_t | y_{t−1}) ≈ φ_{y_t}(x_t) · θ_{y_t | y_{t−1}}.

Note that in fact multiple columns can be stacked into X, up to a column [ỹ_{−ℓ}, . . . , ỹ_{T−ℓ}]ᵀ, thereby approximating a second-, third-, and up to ℓ-order Markov process. For φ, we consider in this experiment a simple Gaussian model, such that (in the single-dimensional case)

    φ_j(x_t) := exp{ −(x_t − μ̂_j)² / (2σ̂_j²) }    (3)

where μ̂_j and σ̂_j are the empirical mean and standard deviation of the inputs x_t with respect to the labels ỹ_t = j; thus making g(·) from Algorithm 1 with Eq. (3) effectively linear discriminant analysis⁴.

⁴With multi-dimensional input, the covariance matrix would, in this case, be assumed constant over all labels j.
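The emission model of Eq. (3) can be sketched in a few lines. The following is an illustrative implementation under our own naming (fit_emissions and phi are not from the paper's code), assuming a one-dimensional fused input and integer labels 0, . . . , k−1:

```python
import numpy as np

def fit_emissions(x, y, k):
    """Empirical mean and standard deviation of x for each label j, as in Eq. (3)."""
    mu = np.array([x[y == j].mean() for j in range(k)])
    sigma = np.array([x[y == j].std() + 1e-9 for j in range(k)])  # avoid division by zero
    return mu, sigma

def phi(x_t, mu, sigma):
    """Unnormalized emission scores phi_j(x_t) = exp(-(x_t - mu_j)^2 / (2 sigma_j^2))."""
    return np.exp(-(x_t - mu) ** 2 / (2.0 * sigma ** 2))
```

Classifying a point by argmax_j φ_j(x_t) then corresponds to step 1 of Algorithm 1.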


Algorithm 1: HMM-inspired approach for cleaning data via reclassification.

1. Learn model g : X → Y from the noisy data {(x_t, ỹ_t)}_{t=1}^{T}, such that

       ŷ_t = g(x_t) = argmax_{j ∈ {1,...,k}} φ_j(x_t)    (1)

   where φ_j(x) := p(y_t = j | x).

2. Obtain the transition probability matrix

       θ = [ θ_{1|1} · · · θ_{k|1} ]
           [    ⋮     ⋱     ⋮    ]
           [ θ_{1|k} · · · θ_{k|k} ]

   such that θ_{j|j′} := p(y_t = j | y_{t−1} = j′), by either

   • empirical counts of y_t = j | y_{t−1} from the training data;
   • the confusion matrix resulting from reclassifying the data with ŷ_t = g(x_t), described in [33]; or
   • a human expert with domain knowledge of the problem.

   Or, add offset ỹ_{t−1}-columns directly to each x_t.

3. Reclassify the labeled data by plugging θ and φ together into model h,

       ŷ_t = h(x_t, y_{t−1}) = argmax_{j ∈ {1,...,k}} φ_j(x_t) · θ_{j|y_{t−1}}    (2)
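Step 3 of Algorithm 1 can be sketched as a simple forward pass. This is a minimal, illustrative version (the names are ours, not the paper's code), assuming the Gaussian emissions of Eq. (3) and given mu, sigma, and a transition matrix theta with theta[i, j] ≈ p(y_t = j | y_{t−1} = i):

```python
import numpy as np

def reclassify(x, mu, sigma, theta):
    """Forward reclassification: y_hat_t = argmax_j phi_j(x_t) * theta[y_hat_{t-1}, j]."""
    phi = lambda x_t: np.exp(-(x_t - mu) ** 2 / (2.0 * sigma ** 2))
    y_hat = [int(np.argmax(phi(x[0])))]      # no predecessor at t = 0
    for x_t in x[1:]:
        y_hat.append(int(np.argmax(phi(x_t) * theta[y_hat[-1]])))
    return y_hat
```

Because θ is 'sticky' (large diagonal entries), isolated mislabeled seconds tend to be smoothed over by the product in Eq. (2).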

Note that marginal inference on HMMs can be run in an online fashion (typically called filtering, done in a forward pass) or retrospectively (smoothing, with a backward pass). The most likely state sequence (MAP estimate) can be obtained by the Viterbi algorithm.

Although the basic components of this methodology exist in the literature, we point out that most related work is not specific to scenarios dealing with sensor data and does not deal comprehensively with a few important aspects that are relevant to many real-world contexts, namely:

• the extent of unreliable labeling present (as opposed to simple measurement noise in the input);
• perturbations caused by the actual annotation (i.e., physically handling the phone for this purpose);
• the relative advantages of working in an online, offline, or batch setting;
• the multi-label case (where annotations are not necessarily mutually exclusive).

We have already discussed several of the above challenges. In the following we deal with the multi-label case.

4.2. Learning with multi-labeled data

Typically, existing work considers the mutually exclusive case for labels, where y_t ∈ {1, . . . , k} and y_t = h(x). Although certain domains such as traffic-mode recognition can be mutually exclusive (it is not possible to be on the bus and the train at the same time), this is not the case for more general annotations. For example, walking and escalator may be simultaneous annotations. Particular pairs will always be mutually exclusive, while others co-occur with varying frequency; for example, walking and bus will be much more infrequent than sitting and bus or standing and bus. In the medical domain, a patient can easily suffer from multiple health issues at the same time, where some are more likely to be diagnosed together than others.
There is a wealth of literature in the area of multi-label classification, see, e.g., [17, 35, 36], where it is well established that modeling dependence among labels is highly desirable, as it can improve modeling performance, albeit at a computational cost. Given k activities, the multi-labeling task


gives us a total space of 2^k activity combinations. Thus,

    y_t = h(x) = [y_1, . . . , y_k] ∈ {0, 1}^k.

Briefly we review two of the common approaches from a probabilistic perspective. The label powerset method (LP) treats label vectors as single labels in a multi-class problem. Typically, the class space is the set of unique labelings found in the training set. In this method,

    y_t = h_LP(x) = argmax_{y ∈ Y_train} p(y | x)    (4)

where any off-the-shelf multi-class classifier can be plugged in as h_LP. Methods have been created to deal with the inherent complexity O(2^k). For example, [35] proposes ensembles of LP on subsets of labels (activities, in this case). A large set of activities could, in this way, be modelled in subsets of at most k′ at a time (where k′ < k), thereby incurring O(m · 2^{k′}) complexity for an ensemble of m models. Furthermore, these subsets can be pruned such that only the top k″ < k′ activity combinations are considered, reducing complexity further to O(m · k″). [37] puts this into a general framework, and showed up to several orders of magnitude speedup over standard LP (depending on the configuration of k′ and k″) with little or no degradation of predictive performance on large multi-labeled datasets.

The binary relevance method (BR) treats a k-label problem as k independent binary problems, thus

    y_t = [h_1(x), . . . , h_k(x)] = [ argmax_{y_1 ∈ {0,1}} p(y_1 | x), . . . , argmax_{y_k ∈ {0,1}} p(y_k | x) ]    (5)

where any off-the-shelf binary classifier can model h_j independently for each label j = 1, . . . , k. In this basic formulation it is considered a baseline (since it assumes independence among labels), but many modifications have been developed to take dependence into account without a major degradation of its scalability, e.g., [36], which includes the predicted label ŷ_{j−1} as additional input to the j-th binary model.

Note that these are so-called data transformation methods, which means that the multi-label problem is cast to, and solved as, either several binary problems or a multi-class problem, and thus the methodology of Algorithm 1 does not need to be modified: either Eq. (4) or Eq. (5) can be plugged in as g(·) (Eq. (1)),

(a) original dataset, k = 2:

         y1  y2  y3  y4
    W     1   0   1   0
    E     0   1   1   1

(b) label powerset transformation, k′ = 3:

          y1  y2  y3  y4
    W,E    2   1   3   1

(c) binary relevance transformation, k′ = 2:

         y1  y2  y3  y4
    W     1   0   1   0

         y1  y2  y3  y4
    E     0   1   1   1

Figure 6: Example of data transformation for multi-label classification, into a multi-class problem via the label powerset method (Fig. 6b), and into two binary problems via the binary relevance method (Fig. 6c). Note that, in the multi-class case, y ∈ {1, 2, 3} ≡ {[0, 1], [1, 0], [1, 1]}. In reference to the Frankfurt Airport data, we can imagine that the first label is Walk (W) and the second Escalator (E).
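The two transformations of Fig. 6 amount to a few lines of code. The sketch below (our own illustrative naming, not the paper's code) reproduces the toy W/E example; the class coding 2·W + E matches the caption's y ∈ {1, 2, 3} ≡ {[0, 1], [1, 0], [1, 1]}, with 0 reserved for [0, 0]:

```python
import numpy as np

# Rows are instances y1..y4; columns are the labels W and E (cf. Fig. 6a).
Y = np.array([[1, 0],   # y1: W
              [0, 1],   # y2: E
              [1, 1],   # y3: W and E
              [0, 1]])  # y4: E

def label_powerset(Y):
    """Map each label vector [W, E] to a single class id 2*W + E (Fig. 6b)."""
    return 2 * Y[:, 0] + Y[:, 1]

def binary_relevance(Y):
    """Split the k-label problem into k independent binary target vectors (Fig. 6c)."""
    return [Y[:, j] for j in range(Y.shape[1])]
```

Any multi-class classifier can then be trained on label_powerset(Y), or one binary classifier per vector returned by binary_relevance(Y).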

or h(·) (Eq. (2)). Of course, there are important changes to k in both cases. This is best illustrated with an example: Fig. 6 shows a transformation for two labels. We shall refer to the number of classes in the transformed problem as k′. Note that k′ = 3 for the transformed multi-class problem, and that there are k instantiations under binary relevance (with k′ = 2 in each case). Hence, different data transformation strategies lead to different performance, both in terms of accuracy and of scalability.

4.3. Experiment: cleaning up noisy labeling

For this experiment, we take the Frankfurt Airport section of the data, where two annotations are relevant: walking and escalator. From 87 minutes of data, we created 1008 instances of labeled sensor observations (each instance t represents one second – but because the sensors are not recording constantly, only about one fifth of real seconds were turned into instances). We created the sensor observations x_t as simply the variance of the length of the 3D accelerometer vector over the previous three time steps, i.e., we fuse the accelerometer variables together as one. This is a basic approach, but this experiment focusses on the labeling rather than the input space. From the 38 annotations a_j we created the 1008 labels for the instances, as vectors y_t, although with the label powerset transformation this can equally be represented as y_t ∈ {0, 1, 2, 3} ≡ {[0, 0], [0, 1], [1, 0], [1, 1]} ≡ {∅, E, W, WE} for Walking and Escalator; see also Fig. 6. The empirical prior distribution of this variable is

    π = [0.06, 0.04, 0.84, 0.06]

of nothing, escalator, walking, and walking on the escalator (in that order). In other words, most of the recorded activity was only walking, with roughly equal time spent riding the escalator, walking on the escalator, and neither (in which case sensor observations are available, but no label was provided). We learn classifier g for each of these classes, using the empirical means with respect to x_t,

    μ̂ = [1.70, 0.52, 3.08, 2.96]

indicating that it is difficult to distinguish between walking and walking on the escalator, as one might intuitively expect. We also note that more activity coincides with 'nothing' than with escalator. This indicates that the null label may in fact be more like 'unknown' or 'unlabeled' rather than 'nothing'. We can also calculate the transition matrix from counts on the training data,

    θ = [ 0.75  0.08  0.17  0.00 ]
        [ 0.02  0.95  0.02  0.00 ]
        [ 0.01  0.00  0.98  0.01 ]
        [ 0.00  0.00  0.08  0.92 ]

This tells us that the activities take place in segments, since the diagonal entries (which indicate the probability that the same activity continues in the next time step) are so high. If the user is walking (third label), it is highly likely that they will still be walking during the following seconds. Values are only shown to two decimal places, but in any case we can say that transitioning from 'nothing' at time t to walking on the escalator at time t + 1 never happens – in fact, in the app we used, this is only possible if the user sets both annotations within the same second. It also seems that there is a preference to walk off the escalator, rather than to stop walking and continue riding. As mentioned earlier, the Android application does not record continuously; a necessary limitation to preserve battery charge⁵. In most cases some annotation covers the gaps, but for lack of input information we choose to ignore these instances. Therefore y_t and y_{t+1} may in fact be separated by more than one second. In the future we intend to look at different ways of dealing with this issue.

⁵Since the time of initial submission, Android has introduced a new batch mode for sensors which helps mitigate battery drain from continuous sensing, https://source.android.com/devices/sensors/batching.html
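The empirical prior π and transition matrix θ above are simple count statistics; a sketch under our own naming (labels coded 0–3 for {∅, E, W, WE}):

```python
import numpy as np

def empirical_prior(y, k=4):
    """pi_j = fraction of instances with label j."""
    return np.bincount(y, minlength=k) / len(y)

def empirical_transitions(y, k=4):
    """theta[i, j] ~= p(y_t = j | y_{t-1} = i), from consecutive label pairs."""
    counts = np.zeros((k, k))
    for prev, cur in zip(y[:-1], y[1:]):
        counts[prev, cur] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
```

Note that, as discussed above, consecutive instances may be separated by more than one second, so these counts only approximate the one-second transition probabilities.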


Table 3: Methods used. 'Vit.' indicates that we use the Viterbi algorithm (batch setting).

    key        model                                                      classes
    BR         p(y1^(t) | xt), p(y2^(t) | xt)                             y1 ∈ {0,1}, y2 ∈ {0,1}
    LP         p(yt | xt)                                                 yt ∈ {0,1,2,3}
    BRt        p(y1^(t) | xt, y1^(t−1)), p(y2^(t) | xt, y2^(t−1))         y1 ∈ {0,1}, y2 ∈ {0,1}
    LPt        p(yt | xt, yt−1)                                           yt ∈ {0,1,2,3}
    LPt-Vit.   p(yt | x1, . . . , xT)                                     yt ∈ {0,1,2,3}
    Combi.     majority vote of BRt, LPt, LPt-Vit.
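The LPt-Vit. entry of Tab. 3 performs MAP decoding over the whole sequence. The sketch below is a generic textbook Viterbi implementation (not the paper's code), taking a T × k matrix of emission scores φ_j(x_t), a transition matrix θ, and a prior π:

```python
import numpy as np

def viterbi(emissions, theta, prior):
    """MAP state sequence given emission scores (T, k), transitions theta, prior pi."""
    T, k = emissions.shape
    log_theta = np.log(theta + 1e-12)
    delta = np.log(prior + 1e-12) + np.log(emissions[0] + 1e-12)
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_theta      # scores[i, j]: best path ending in i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(emissions[t] + 1e-12)
    path = [int(delta.argmax())]                 # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]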

Fig. 7 shows samples of 60 instances at different points in the data (these instances cover around twice as much real time). The blue dots represent the feature input x_t; the squares represent the labels. The top line of squares represents the original annotations (∈ {∅, E, W, WE}). Tab. 3 lists the methods that we looked at for cleaning the data. We trained the classifiers (by taking empirical counts) and reclassified on the same data. This reclassification is also shown (for small sections) in the figure. We make the following conclusions from the figure (which are representative of the entire scenario):

• The original (manually constrained) labels are noisy. For example, in Fig. 7 (top left) there is a visible gap of several seconds after t = 55, where the user is presumably changing the label from escalator to walking.

• The models that ignore transition context (BR, LP) perform very poorly, defaulting to the majority class (walking) most of the time; LP favours learning the null label (although this could be useful for identifying transition points).

• Taking the time context into account, BRt and LPt perform well, subjectively speaking, and create a smoother labeling, where null labels are overwritten. LPt is closest to the original.

• When we use the Viterbi algorithm for decoding observations (rather than simple forward classification), LPt should be the most powerful. This is not clear, although it does provide the most distinctive cleaning compared with the others.

The variance in the results led us to try an ensemble combination of different approaches, with a simple majority vote (keyed Combi.). Tab. 4

[Figure 7 about here: four panels titled "Frankfurt Airport / Cleaning" for t ∈ {3, ..., 144}, {65, ..., 235}, {1285, ..., 1483}, and {1700, ..., 1813}; each panel plots the input x and the labellings (E, W, W+E) produced by Orig., BR, LP, BRt, LPt, LPt-Vit., and Combi.]
Figure 7: Samples from Frankfurt Airport. Each sample is taken over the range specified in the title of each plot, one entry per second, where gaps within this range indicate that no sensor data is recorded.

shows the similarity of the result produced by each method to the original aggregated label space. Similarity is defined as

    Similarity := (1/T) Σ_{t=1}^{T} 1[y_t = ŷ_t]

where ŷ_t is the reclassified label (and y_t the original) at time index t. This is a subjective analysis, since no ground truth is available; hence we would expect results to be close to, but not exactly the same as, the original. In Fig. 8 we show the results of a simulation, using the transition matrix from the real data (see above) but different emission functions (the true emission function is unknown) based on Eq. (3), with random μ_j s but varying σ_j s, to obtain an idea of the robustness of labelling missing segments. The

Table 4: Average similarity to the original label sequence after training and reclassification on Frankfurt Airport. Similarity does not necessarily represent good performance.

    Method     Similarity
    LP         0.51
    BR         0.49
    BRt        0.89
    LPt        0.92
    LPt-Vit.   0.76
    Combi.     0.86

data is generated by rolling the transition matrix forward (choosing y_t = j given y_{t−1} = j′ with probability θ_{j|j′}) and creating the observation x ∼ N(μ_j, σ_j) accordingly. The results emphasise how important the selection of sensor fusion is: a poorer separation can lead to markedly different results. Fig. 9 shows that LDA performs best given a good separation of concepts, perhaps unsurprisingly given that this classifier is based on a Gaussian distribution, which is also how the data is generated. However, as the transition probability becomes more important, other classifiers become more competitive, such as SVMs. Fig. 10 instead varies the problem transformation method. As expected (and documented in the multi-label literature), the LP transformation offers some advantage.

5. Modeling and Analytics

In this section we review the classification and prediction tasks of interest that can be carried out with cleaned data.

5.1. Classification of the present

The immediate task of interest is to semi- or fully automate the annotation of activities. In an online setting, we may want to annotate new points as they arrive (either as a suggestion to the user, or as an automatic label). At the present time point, a straightforward classification is ŷ_t = h(x_t, ŷ_{t−1}). Note that the estimate ŷ_{t−1} is used wherever the true value is not available.


[Figure 8 about here: accuracy (%) versus σ, one curve per randomly generated mean vector μ = [μ1, μ2, μ3, μ4] (seven settings).]
Figure 8: Performance of LPt-Vit. on simulated data, using the same transition matrix as in the real data. Each line in the plot represents a different set of generated means (μ = [μ1, μ2, μ3, μ4]). The horizontal axis represents increasing values of the variance σ_j², which applies to all curves. Each point is the average of 10 results of filling a random gap of size 100 over 10000 data points generated using parameters μ_j, σ_j².
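The simulation behind Fig. 8 – rolling the transition matrix forward and emitting x ∼ N(μ_j, σ_j) for the active state j – can be sketched as follows; function and parameter names are illustrative, not taken from the paper's code:

```python
import numpy as np

def simulate(theta, mu, sigma, T, seed=None):
    """Generate T steps: roll theta forward, then emit x_t ~ N(mu[y_t], sigma^2)."""
    rng = np.random.default_rng(seed)
    k = len(mu)
    y = [int(rng.integers(k))]                   # random initial state
    for _ in range(T - 1):
        y.append(int(rng.choice(k, p=theta[y[-1]])))
    y = np.array(y)
    x = rng.normal(mu[y], sigma)                 # Gaussian emission per active state
    return x, y
```

Blanking out a random stretch of the generated labels and measuring reconstruction accuracy against the known y then yields curves like those in Fig. 8.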

5.2. Prediction of the future

In other scenarios, we may want to look ahead, i.e., predict ŷ_{t+t1} = h(x_t, ŷ_t) for some future time point t1. This task is particularly difficult unless the user's activities are regular over time and time context is an input, i.e., ŷ_{t+t1} = h(x_t, ŷ_t, t1). The following task encompasses this idea.

5.3. Classification of the past: filling in the gaps

We may want to annotate a stretch of time y_{t1}, . . . , y_{t2} between times t1 and t2, in the past and possibly up to the current time. Possible approaches depend on whether

1. there is no sensor data available, for example the battery ran out and the phone was off between times t1 and t2; or

2. the sensor data is available, for example the phone and app were running, but the user did not supply any manual annotation.

[Figure 9 about here: accuracy (%) versus σ for base classifiers LR, LDA, SVM, and DT, with μ = [0.0, 0.4, 0.8, 1.2].]
Figure 9: Performance of LPt-Vit. on simulated data for different base classifiers: logistic regression (LR), linear discriminant analysis (LDA), support vector machines (SVM) and decision trees (DT). As in Fig. 8, variance is varied along the horizontal axis. For clarity, only a single mean (evenly separated, µ = [0.0, 0.4, 0.8, 1.2]) is used.

In the first case, we rely heavily on prior probabilities, such as the Markovian transition probabilities encoded in θ. If y_{t1−1} = work and y_{t2+1} = work, and θ_{work|work} is the highest entry of diag(θ), then most probably y_{t1}, . . . , y_{t2} = work. This works well for small gaps, but the probability decays rapidly: even if θ_{work|work} = 0.99, after 60 seconds the probability that the user has continued working is only 0.99^60 ≈ 0.55. Specifically,

    y_t = argmax_{j′} θ_{j′|y_{t1}} · θ_{j′|j′}^τ · θ_{y_{t2}|j′}

where τ = t2 − t1; meaning that if the label is bus at t1 and t2, then y_{t1+1}, . . . , y_{t2−1} = work may be more probable under this model (assuming that more time is spent at work than on a bus). Clearly, long-term dependencies should help, particularly where there is little variance in the length of an activity, e.g., if the activity work typically lasts 8 ± 0.25 hours. Furthermore, time context can be used as input, as well as weather, e.g., p(y_t = bike | tuesday, 8am, sunny). The unavailable sensor data (GPS, accelerometer, and so on) can meanwhile simply be treated as missing data (a

[Figure 10 about here: accuracy (%) versus σ for the LP and BR problem transformations, with μ = [0.0, 0.4, 0.8, 1.2].]

Figure 10: Performance of LP versus BR as problem transformation methods on the data, with LDA as a base classifier in both cases.

common scenario, for which there are existing algorithms, e.g., [38]), and then the scenario can be dealt with as if input data were available (with several components simply missing). In the case that input data is available (the phone was on, sensors were recording, but there are missing annotations), the framework of Algorithm 1 (Section 4.1) should be ample. The Viterbi algorithm, for example, is well suited to inferring labels for missing segments. If running purely incrementally online, then classification of the present (as in Section 5.1) will work.

5.4. Experiment 3: classification of missing segments

We take the same data as in Section 4.3, but with the data cleaned by LPt-Vit. as ground truth. Before (re)training, we place random gaps in the labelling, such that the gaps occupy 50% of the data (i.e., a 50/50 train/test split): for random t1, . . . , t2 we create a gap and hold this aside as testing data, and train on the remaining (training) data. We repeat this a number of times. Figure 11 shows the results on the same samples; Tab. 5 quantifies them. Unlike the earlier cleaning experiment, we are looking for a close-as-possible reconstruction, and it seems that Viterbi performs better than

[Figure 11 about here: four panels titled "Frankfurt Airport / Filling Gaps" for t ∈ {3, ..., 144}, {65, ..., 235}, {1285, ..., 1483}, and {1700, ..., 1813}; each panel plots the input x and the rows Clean/'Orig.', Missing, Fix./LPt, and Fix./LPt-Vit.]
Figure 11: The cleaned data (under the LPt-Vit. model of Fig. 7) is used here as ground truth. In each view, one random continuous section is 'blanked out' (set to 0, unlabeled) and our approach is trained to fill in these gaps. The 'Clean/Orig.' rows thus correspond to the LPt-Viterbi labelling of Fig. 7.

a single online forward classification pass; the 'fix' by Viterbi matches the original perfectly in 3/4 cases, compared to 1/4 for the forward pass only. There are many real-world scenarios where working within a time window is feasible, especially when that window is less than a minute (as in these samples). Let us point out the strong connection between cleaning data and classifying it.

6. Discussion: Recommendations and Open Challenges

Our study addressed different challenges with data preprocessing, cleaning, and modeling. Our methodology for cleaning and learning from partially-


Table 5: Average similarity to the cleaned space after creating random gaps, retraining, and filling in the gaps. Unlike in Tab. 4, similarity is a representation of accuracy.

    Method          Similarity
    LPt (forward)   0.80
    LPt Viterbi     0.93

labeled sensor data from an urban environment was able to deal with most challenges presented by the case studies we looked at. It was flexible enough to incorporate a range of classification schemes, and thus to allow a combination of several of the best practices from the literature. The approach was relatively simple, but it proved effective, and can work in either a purely online or a batch-incremental setting.

6.1. Recommendations for practitioners

Manually annotated data is typically very scarce and noisy, but if well cleaned, it can be used to formulate effective supervised models. Simple intuition-based rules (such as our 6-hour cut-off rule) are an important initial step, but a classification framework is necessary to smooth out further noise. This can be done in a purely online preprocessing step with incremental classifiers, and also within a batch/window setting. When dealing with overlapping activities (the multi-label case), any of the major data transformation approaches will allow the application of an existing single (exclusive) label framework. We did not find any major difference between the binary relevance and label powerset approaches; it is worth trying both. Modern improvements to these approaches mean that scalability is not an issue for most practical applications (of less than, say, 1000 activity labels). We do remark that it is fundamental to take time context into account for cleaning data. Although we used a rather primitive input space for our experiments, even well-engineered features will not be able to distinguish well between activities such as home and work without treating the data as a time series. The same methodology used for cleaning labels can also be used for classification and prediction. Pieces of missing data could successfully be filled in. In our experiments, it appeared that working in batches may produce more accurate results than working totally online. This is not necessarily an issue,

since major applications include filling in past information where the sensing device (e.g., phone) was off, or labeling was unavailable – i.e., offline labeling. It is clear that external data, such as day, time, and weather context, can be extremely useful, although we leave this particular aspect for future investigation.

6.2. Open challenges

Setting an appropriate aggregation time step presents one challenge for future investigation. The smaller the step, the faster the reaction time. However, the accuracy of the analysis may suffer if the step is too small to present an informative summary of what is happening. On the other hand, an excessively large time step slows down the reaction time in real-time applications (e.g., a person starts walking, but recognition is delayed). Moreover, a large time step may capture heterogeneous data, for example a mixture of several activities (which are not all active at all times during the time step, and therefore not covered by the usual multi-label case). The performance of HMM-inspired strategies may vary considerably. A further evaluation of what exactly the 'null label' means will be useful in detailed analysis. In some contexts it will be necessary to learn to distinguish between it being associated with periods of inactivity, an unlabeled activity, and transition points. At transition points, the act of manual labeling may cause perturbations and general feedback to the sensors themselves. We simply attempted to smooth this area, but further work may be able to identify (by way of the null labels) and neutralise the effect on sensor readings, i.e., clean and smooth the labels and the sensor inputs together. Another important open challenge is how to distinguish periods of inactivity from periods when no data is being collected. In this study we assumed that if there are no accelerometer records, then there is no activity. This is a crude approximation.
The accelerometer sensor may be off, or the accelerometer sampling rate may be set to a very large value (e.g., sample every 10 min). Failing to distinguish periods of inactivity from periods when no data is being collected introduces noise into the resulting computational models. Such noise could be ignored if there were only a few periods of inactivity or no data collection. However, when analyzing human mobility there are typically many more inactive periods than active periods. Unless a person is, for instance, a taxi driver, a typical working day contains several spans of movement and quite a lot of inactive periods, when the phone is resting in a bag or on a table.

Therefore, reliable methods for filtering out the periods of no data collection and disambiguating the periods of inactivity need to be developed. We have already mentioned the importance of addressing long-range dependencies, and the possibility of using weather, time, and date information for cleaning and classification. A further step would be to model patterns in the user's daily activity. Combined with sensor information, this would help future prediction tremendously. For example, at a certain time of day, when the sensor readings indicate a high probability of bus, we could additionally obtain a good estimate of when the traveller is likely to get off the bus, not to mention signalling anomalies when this activity takes longer or shorter than usual (e.g., in the case of traffic disruption). Ground truth data (i.e., verification of when the original data is correct) would help gauge the effectiveness of algorithms, but this goes against the very premise of having a lot of noisily-labeled data to learn from, rather than a small and artificially clean dataset, as is often assumed by studies.

7. Summary and Conclusions

We presented a methodological framework for preprocessing sensory data for predictive modeling, and explored various possibilities for aggregating and cleaning this data, and the challenges presented, from a machine learning point of view – in particular dealing with inherently sparse and noisy human labeling. Our methodology for cleaning and learning from partially-labeled sensor data was able to deal with most challenges presented by the case studies we looked at, involving activity annotation in an urban environment. The methodology was flexible enough to incorporate a range of classification schemes. We chose relatively simple ones, several of which proved capable in the scenarios upon which we tested it.
It functioned both to clean the data and as an automated labeller for data where no annotations were available, working in either a purely online or a window-incremental setting.

8. Acknowledgements

This work was supported by the Aalto University AEF research programme (http://energyefficiency.aalto.fi/en/), and Academy of Finland grant 118653 (ALGODAN).


References

[1] Mobile Economy Europe 2014, Report, GSMA (2014).

[2] W. Z. Khan, Y. Xiang, M. Y. Aalsalem, Q.-A. Arshad, Mobile phone sensing systems: A survey, IEEE Communications Surveys and Tutorials 15 (1) (2013) 402–427.

[3] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmäki, H. Toivonen, Time series segmentation for context recognition in mobile devices, in: Proc. of the 2001 IEEE Int. Conf. on Data Mining, IEEE ICDM, 2001, pp. 203–210.

[4] N. Eagle, A. Pentland, Reality mining: sensing complex social systems, Personal and Ubiquitous Computing 10 (2006) 255–268.

[5] J. R. Kwapisz, G. M. Weiss, S. A. Moore, Activity recognition using cell phone accelerometers, SIGKDD Explor. Newsl. 12 (2) (2011) 74–82.

[6] R. Wang, F. Chen, Z. Chen, T. Li, G. Harari, S. Tignor, X. Zhou, D. Ben-Zeev, A. T. Campbell, StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones, in: Proc. of the 2014 ACM Int. Joint Conf. on Pervasive and Ubiquitous Computing, UbiComp, 2014, pp. 3–14.

[7] H. Gao, J. Tang, H. Liu, Mobile location prediction in spatio-temporal context, in: Proceedings of the Mobile Data Challenge by Nokia Workshop at the 10th Int. Conf. on Pervasive Computing, 2012.

[8] A. Monreale, F. Pinelli, R. Trasarti, F. Giannotti, WhereNext: A location predictor on trajectory pattern mining, in: Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD, 2009, pp. 637–646.

[9] O. Mazhelis, I. Žliobaitė, M. Pechenizkiy, Context-aware personal route recognition, in: Proc. of the 14th Int. Conf. on Discovery Science, DS, 2011, pp. 221–235.

[10] L. Chen, J. Hoey, C. Nugent, D. Cook, Z. Yu, Sensor-based activity recognition, IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42 (6) (2012) 790–808.

[11] T. Plötz, N. Y. Hammerla, P. Olivier, Feature learning for activity recognition in ubiquitous computing, in: Proc. of the 22nd Int. Joint Conf. on Artificial Intelligence, IJCAI, 2011, pp. 1729–1734.

[12] I. Žliobaitė, J. Hollmén, Mobile sensing data for urban mobility analysis: A case study in preprocessing, in: Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference (EDBT/ICDT 2014), Vol. 1133 of CEUR Workshop Proceedings, 2014, pp. 309–314.

[13] R. Kohavi, F. Provost, Glossary of terms. Editorial for the special issue on applications of machine learning and the knowledge discovery process, Machine Learning 30 (2/3).

[14] D. Figo, P. C. Diniz, D. R. Ferreira, J. M. P. Cardoso, Preprocessing techniques for context recognition from accelerometer data, Personal Ubiquitous Comput. 14 (7) (2010) 645–662.

[15] J. Zhang, J. Xu, S. S. Liao, Aggregating and sampling methods for processing GPS data streams for traffic state estimation, IEEE Transactions on Intelligent Transportation Systems 14 (4) (2013) 1629–1641.

[16] C. C. Aggarwal (Ed.), Managing and Mining Sensor Data, Springer, 2013.

[17] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Trans. on Knowledge and Data Engineering 26 (8) (2014) 1819–1837.

[18] P. Mannonen, K. Karhu, M. Heiskala, An approach for understanding personal mobile ecosystem in everyday context, in: Proc. of the 15th Int. Conf. on Electronic Commerce, ICEC, 2013, pp. 135–146.

[19] N. Aharony, W. Pan, C. Ip, I. Khayal, A. Pentland, Social fMRI: Investigating and shaping social mechanisms in the real world, Pervasive and Mobile Computing 7 (6) (2011) 643–659.

[20] B. Khaleghi, A. Khamis, F. O. Karray, S. N. Razavi, Multisensor data fusion: A review of the state-of-the-art, Information Fusion 14 (1) (2013) 28–44.


[21] H. Durrant-Whyte, T. Henderson, Multisensor data fusion, in: B. Siciliano, O. Khatib (Eds.), Springer Handbook of Robotics, Springer Berlin Heidelberg, 2008, pp. 585–610.

[22] F. Castanedo, A review of data fusion techniques, The Scientific World Journal 2013.

[23] G. N. Bifulco, L. Pariota, F. Simonelli, R. D. Pace, Real-time smoothing of car-following data through sensor-fusion techniques, Procedia – Social and Behavioral Sciences 20 (2011) 524–535.

[24] European Commission, How Europeans Spend Their Time: Everyday Life of Women and Men: Data 1998–2002, 2004.

[25] Android developer's guide, http://developer.android.com/guide/topics/sensors/sensors_overview.html.

[26] M. Dunham, Y. Meng, J. Huang, Extensible Markov model, in: Proc. of the 4th Int. Conf. on Data Mining, IEEE ICDM, 2004, pp. 371–374.

[27] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Comput. Surv. 46 (4) (2014) 44:1–44:37.

[28] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.

[29] L. Liao, D. J. Patterson, D. Fox, H. Kautz, Learning and inferring transportation routines, Artif. Intell. 171 (5-6) (2007) 311–331.

[30] P. Widhalm, P. Nitsche, N. Brandie, Transport mode detection with realistic smartphone sensor data, in: Proc. of the 21st Int. Conf. on Pattern Recognition, ICPR, 2012, pp. 573–576.

[31] S. Reddy, M. Mun, J. Burke, D. Estrin, M. Hansen, M. Srivastava, Using mobile phones to determine transportation modes, ACM Trans. Sen. Netw. 6 (2) (2010) 13:1–13:27.

[32] U. Avci, A. Passerini, Improving activity recognition by segmental pattern mining, IEEE Trans. on Knowledge and Data Engineering 26 (4) (2014) 889–902.

[33] B. Esmael, A. Arnaout, R. Fruhwirth, G. Thonhauser, Improving time series classification using hidden Markov models, in: Proc. of the 12th Int. Conf. on Hybrid Intelligent Systems, HIS, 2012, pp. 502–507.

[34] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, G. Holmes, Evaluation methods and decision theory for classification of streaming data with temporal dependence, Machine Learning (2014) 1–28.

[35] G. Tsoumakas, I. Katakis, I. Vlahavas, Random k-labelsets for multi-label classification, IEEE Trans. on Knowledge and Data Engineering 23 (7) (2011) 1079–1089.

[36] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Machine Learning 85 (3) (2011) 333–359.

[37] J. Read, A. Puurula, A. Bifet, Multi-label classification with meta labels, in: Proc. of IEEE Int. Conf. on Data Mining, IEEE ICDM, 2014, pp. 941–946.

[38] I. Žliobaitė, J. Hollmén, Optimizing regression models for data streams with missing values, Machine Learning (2014) 1–27.

Appendix

8.1. Experiment: Estimating the rate of change from static measurements

Sensors record static measurements; however, our interest is sometimes in estimating dynamic characteristics. Examples include estimating the speed of a moving object from GPS coordinates, estimating energy consumption from battery level indications, and estimating flow rates from observed liquid levels. The task in this experiment is to estimate how much energy is consumed during data collection, given observations of the battery level that are unequally spaced in time. The main challenges are: deriving the conversion equations, filtering out uninformative observations, and identifying and handling periods with missing information (when the data collection application is off).


8.1.1. Methodology

For energy rate estimation we use the level, voltage and status information from the Battery Probe. Level indicates the percentage of battery charge remaining. Voltage indicates the current battery voltage. Status indicates whether the phone is charging, discharging, or the battery is full. All three fields are recorded with the same time stamps. Energy consumption E in watt-hours (Wh) is computed as

E = Q × V / 1000,

where Q is the electric charge in milliampere-hours (mAh) and V is the voltage in volts (V). Given data recorded by ContextLogger2, the electric charge during the i-th time period, which starts at time t_i and ends at time t_{i+1}, can be estimated as

Q_i = Q_battery × (L_i − L_{i+1}),

where L_i and L_{i+1} are the battery levels (in percent) at the start and the end of the period. However, there are two challenges. Firstly, the data records are not equally spaced in time. As a result, time period i is not necessarily of the same length as time period i+1 and, hence, Q_i is not directly comparable to Q_{i+1}. Secondly, battery levels are reported at low granularity (rounded to whole percent). As a result, the estimate becomes stepped: for several records the estimated energy consumption is zero (because L_i = L_{i+1}), then it suddenly jumps, then becomes zero again. The first challenge can be overcome by estimating the rate of energy consumption instead of the amount of energy consumed. The rate of consumption is power P (in watts), which during time period i can be computed as

P_i = E_i × 3600/(t_{i+1} − t_i),

where E_i is the energy (in Wh) consumed during the period.

Here t is measured in seconds. The second challenge can be addressed by discarding every battery-level record whose level is the same as in the preceding record. This way we get fewer time intervals to consider, while the intervals themselves are longer.
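The procedure above can be sketched in code. The following Python sketch is not the authors' implementation: the record format (tuples of timestamp in seconds, level in percent, voltage in volts) and the nominal battery capacity value are illustrative assumptions. Since levels are reported in percent, the level difference is divided by 100 before multiplying by the capacity.

```python
Q_BATTERY_MAH = 2100.0  # assumed nominal battery capacity in mAh (illustrative)

def estimate_power(records):
    """Return a list of (t_start, t_end, watts) estimates.

    `records` is a time-ordered list of (t_seconds, level_percent, voltage_v)
    tuples. Records whose battery level is unchanged from the preceding kept
    record are discarded first, so the remaining intervals are longer and the
    level difference is never zero.
    """
    # Keep the first record, plus every record where the level changed.
    kept = [records[0]]
    for rec in records[1:]:
        if rec[1] != kept[-1][1]:
            kept.append(rec)

    estimates = []
    for (t0, l0, v0), (t1, l1, _) in zip(kept, kept[1:]):
        q_mah = Q_BATTERY_MAH * (l0 - l1) / 100.0  # charge used (mAh)
        e_wh = q_mah * v0 / 1000.0                 # energy E_i = Q_i * V / 1000 (Wh)
        p_w = e_wh * 3600.0 / (t1 - t0)            # power P_i = E_i * 3600 / dt (W)
        estimates.append((t0, t1, p_w))
    return estimates
```

A charging phone yields a negative level difference and hence a negative power estimate, matching the negative values discussed in the results below.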

8.1.2. Results and observations

Figure 12 plots the resulting energy consumption rate over time. We can see that most of the time the energy consumption with ContextLogger is around 5 W. Negative power values appear when the phone is plugged in for charging.

Figure 12: Estimated power consumption over time.

There are occasional higher power peaks, which may be due to ContextLogger being switched on and off while the phone is partially charged. In order to estimate the energy more exactly at these points, we would need to know or detect when ContextLogger is switched on and off; currently this information is not available from the logs. Overall, from this pilot experiment we can conclude that it is possible to estimate the distribution of dynamic characteristics, such as energy consumption, from static sensor observations. However, this kind of preprocessing requires some domain knowledge as input (e.g. knowing from physics how energy is defined). Nevertheless, we anticipate that it is possible to define a generic model form for such estimates for any sensor. This remains a subject of future investigation.

