
2013 IEEE International Conference on Big Data

Hierarchical Feature Learning from Sensorial Data by Spherical Clustering

Bonny Banerjee and Jayanta K. Dutta
Institute for Intelligent Systems, and Dept. of Electrical & Computer Engineering
The University of Memphis, Memphis, TN 38152, USA

Abstract: Surveillance sensors are a major source of unstructured Big Data. Discovering and recognizing spatiotemporal objects (e.g., events) in such data is of paramount importance to the security and safety of facilities and individuals. What kind of computational model is necessary for discovering spatiotemporal objects at the level of abstraction they occur? Hierarchical invariant feature learning is the crux to the problems of discovery and recognition in Big Data. We present a multilayered convergent neural architecture for storing repeating spatially and temporally coincident patterns in data at multiple levels of abstraction. A node is the canonical computational unit consisting of neurons. Neurons are connected in and across nodes via bottom-up, top-down and lateral connections. The bottom-up weights are learned to encode a hierarchy of overcomplete and sparse feature dictionaries from space- and time-varying sensorial data by recursive layer-by-layer spherical clustering. The model scales to full-sized high-dimensional input data and also to an arbitrary number of layers thereby having the capability to capture features at any level of abstraction. The model is fully-learnable with only two manually tunable parameters. The model is general-purpose (i.e., there is no modality-specific assumption for any spatiotemporal data), unsupervised and online. We use the learning algorithm, without any alteration, to learn meaningful feature hierarchies from images and videos which can then be used for recognition. Besides being online, operations in each layer of the model can be implemented in parallelized hardware, making it very efficient for real world Big Data applications.

Keywords: learning hierarchical representations; repeating coincidences; spherical clustering; Hebbian rule

I. INTRODUCTION

Surveillance sensors (e.g., video surveillance cameras) that monitor a facility (e.g., airport) or an individual (e.g., as in assisted living residences and nursing homes for dementia patients) round-the-clock generate a formidable amount of unstructured data. Discovering and recognizing spatiotemporal objects (e.g., visual objects, actions, events) in such data is of paramount importance to the security and safety of the facility or individual. Determining appropriate features is the crux to the problems of discovery and recognition. In recent years, there has been a surge of interest in learning feature hierarchies from data using multilayered or deep learning models largely motivated by the layered organization of the neocortex. From an application standpoint, a good feature learning algorithm, if developed, would save a significant amount of human effort usually invested in manually studying the data and coming up with appropriate features for accurate classification in each domain or application. Manually encoding features for detection, recognition or prediction is not always feasible given the 5Vs (Volume, Velocity, Variety, Value, Veracity) of sensorial Big Data.

The contribution of this article is a brain-inspired multilayered neural model where each layer selectively clusters its input which is the output of the lower layer. Recursive layer-by-layer clustering is not a new idea (see [1] for example); however, our model stands out due to its computational properties that help cope with the 5Vs of Big Data. The properties may be summarized as follows.

1. Unsupervised and online operation to cope with volume, velocity and value; no personal information is compromised as the data is never stored;
2. Invariant feature learning from spatiotemporal data to cope with veracity by discovering objects, actions and events from repeating coincident patterns;
3. Scaling to full-sized high-dimensional input data to cope with volume and variety;
4. Scaling to an arbitrary number of layers thereby having the capability to capture features at any level of abstraction to discover objects, actions and events;
5. Hierarchical feature learning from data in multiple modalities with the learning rule derived from the same objective function to cope with variety; no modality-specific assumption is made for data from different kinds of sensors;
6. Fully-learnable with only two manually tunable parameters, the learning rate and threshold decay parameter, to cope with volume and variety.

We present results by deploying the model for learning hierarchies of rich meaningful features from sensorial data in multiple modalities: images and videos. Results from this brain-inspired model that support findings in neuroscience are highlighted.

978-1-4799-1293-3/13/$31.00 ©2013 IEEE

A. Objective function

Formulating an objective function helps to understand a model's global behavior. Two approaches to feature learning using deep models are prevalent. In one approach, an objective or energy function is minimized by iteratively updating connection weights with respect to an error signal that is propagated backwards. In case of supervised learning, this signal is often the gradient of the classification error which results in learning discriminative features while in the unsupervised case, it is the gradient of the reconstruction error which results in learning generative features. An appropriate regularization term is often used to avoid overfitting and induce sparsity. Variants of deep belief networks [2], convolutional neural networks [3], [4], and sparse/denoising autoencoders [5] are trained using this approach (also see [6], [7]).

In the other approach, unsupervised learning can be conceptualized as capturing the distribution of recurring coincident patterns in the data. Clustering is an example of this approach though not the only one. Variants of Neocognitron [8], HMAX [9] and Hierarchical Temporal Memory [10] are trained using this approach. While a systematic comparison between the two approaches is missing in the literature, it has been shown that the deep belief network is outperformed by very simple mixture models at capturing the statistics of natural images; adding layers does not affect its performance if the first layer is trained well enough [11].

Our model learns feature hierarchies from recurring coincidences in the data in an unsupervised and online manner, minimizing the following objective function on convergence:

$$\ell(X, W) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j \in N(i)} \|x_j - w_i\|^2 \qquad (1)$$

where $X = \{x_1, x_2, \ldots, x_N\}$ and $W = \{w_1, w_2, \ldots, w_n\}$ are the sets of d-dimensional data points and features respectively, $N(i)$ is the set of data points in the neighborhood of $w_i$, $|\bigcup_{i=1}^{n} N(i)| < N$, and $|\cdot|$ denotes the cardinality of a set. Each data point and feature is normalized to have unit norm. Each layer in our model learns a set of nonorthogonal features that soft-partitions a subset of the normalized input space; this subset, given by $\bigcup_{i=1}^{n} N(i)$, does not contain outliers. Such a formulation may be construed as soft-clustering on the surface of a hypersphere of unit radius (a.k.a. spherical clustering [12]) where the outliers are not allowed to influence the cluster centers. Orthogonal matching pursuit (OMP) [13], an iterative algorithm widely used to compute the coefficients for the features in a generative model, is a generalization of coefficient computation in spherical clustering as, for an input, the latter requires only one feedforward pass of OMP to determine the coefficient. Thus, spherical clustering is computationally more efficient than generative models employing OMP-like iterative algorithms.

B. Online learning

By online learning, we refer to two capabilities in a model: it should be able to learn from instances where each instance is seen only once in a lifetime, and it should be able to learn and infer simultaneously. In order to learn and infer simultaneously, we use activations and states of neurons: learning depends on the state of a neuron while inference is based on activation strength; the state changes for a winner neuron if its activation crosses an adaptive threshold unique for each neuron (section II-C). Our weight update rule (section II-D) has been well-studied and is shown to converge to a stable state for sufficiently small learning rate. Empirically, our model learns features from different kinds of data that remain stable over prolonged periods of online learning.

C. Scaling to full-size high-dimensional data

Algorithms have been proposed that learn useful features from small patches of data (e.g., 10 × 10 pixel image patch). See [14] for example. Scaling them to realistic-sized data (e.g., 1024 × 1024 pixel image) in a computationally tractable way is a non-trivial problem [4]. An ensemble of neurons, called a node, is the canonical computational unit in our architecture. Each node learns from a small patch of data while each neuron in a node learns to respond to a unique feature. After seeing a large number of patches, all nodes in a layer will have learned the same set of features. Hence, instead of learning at each node independently, we share the learning in a node throughout a layer. This has three benefits. First, learning occurs much faster in all nodes as each gets the opportunity to learn from all the patches in each data instance. Second, the invariant properties (e.g., translation invariance) learned in one node are shared throughout the layer. Third, all nodes behave uniformly thereby making the model's behavior tractable. See section II and Fig. 1.

D. Scaling to arbitrary number of layers

When designing multilayered architectures, a pertinent question to ask is: how many layers are necessary or sufficient for hierarchical feature learning? No clear answer could be found in the literature. We present an architecture where the process of feature learning at any layer from its lower layer activations may be continued in an arbitrary number of layers thereby learning arbitrarily large and complex features. However, features will be learned in a layer only if a subset of neurons in the lower layer are repeatedly active together. If such a subset does not exist, features in a higher layer will neither be stable nor meaningful. Thus, the number of layers necessary and sufficient for hierarchical feature learning is driven by the data [1]. Indeed, we will show in section III-A that three layers of features can be learned from one set of data but only two layers can be learned from another. Being able to scale a model to arbitrary number of layers is crucial for discovering the largest possible features which are closer to symbolic/linguistic representations.

E. Common learning algorithm for different modalities

The common cortical algorithm hypothesis states that the objectives of learning algorithms operating in the different perceptual cortices are very similar [15]–[18]. We present a minimal model that can be fully learned with only two manually tunable parameters and is noncommittal to any modality-specific processing of the data. The features learned by this model from images and videos are in accordance with well-established findings in neuroscience. We claim that this model is a promising candidate for verifying the common cortical algorithm hypothesis, and that the hypothesis indeed seems plausible and is worth more investigation.

In the next section, we describe the model followed by experimental results and conclusions.
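As a rough illustration of the objective in equ. 1 and of the one-pass coefficient computation described in section I-A, the following sketch evaluates both on toy data. The function names, the toy vectors and the hand-picked neighborhoods are hypothetical and are not taken from our implementation; how neighborhoods are actually formed is governed by the threshold mechanism of section II.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (data and features lie on the unit hypersphere)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def one_pass_coefficient(x, features):
    """Return (index, coefficient) of the best-matching feature for a unit-norm input x.

    Only one round of dot products is needed; no iterative refinement is performed.
    """
    scores = [dot(x, w) for w in features]
    best = max(range(len(features)), key=lambda i: scores[i])
    return best, scores[best]

def objective(data, features, neighborhoods):
    """Equ. 1: half the summed squared distance between each feature and its neighborhood.

    neighborhoods[i] lists indices into data; points listed nowhere act as outliers.
    """
    total = 0.0
    for i, members in enumerate(neighborhoods):
        for j in members:
            total += sum((xc - wc) ** 2 for xc, wc in zip(data[j], features[i]))
    return 0.5 * total

# Hypothetical toy data: three 2-D points and two features.
data = [normalize(p) for p in ([1.0, 0.1], [0.1, 1.0], [-1.0, -1.0])]
features = [normalize(w) for w in ([1.0, 0.0], [0.0, 1.0])]
neighborhoods = [[0], [1]]  # assumption: the third point is treated as an outlier

best, coefficient = one_pass_coefficient(data[0], features)
loss = objective(data, features, neighborhoods)
```

Because every vector has unit norm, the squared distance between a point and a feature equals 2 minus twice their dot product, so reducing the objective for a point is the same as increasing its dot product with the assigned feature.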

II. NETWORK MODEL

A. Architecture

Our network architecture consists of a hierarchy of layers of nodes (see Fig. 1). A node is a canonical computational unit consisting of an ensemble of densely connected neurons. In the general case, neurons in a node are sparsely connected to the neurons in the neighboring nodes in the same layer, one layer above and one layer below by lateral, bottom-up (or feedforward) and top-down connections respectively. The first (or lowest) layer in the hierarchy receives input from external spatiotemporal data (e.g., images, videos). The architecture is convergent, i.e. multiple neurons in a layer report to one neuron in the immediate higher layer. The receptive field (RF) of a neuron is the spatial region in which the presence of a stimulus may influence its activation. In our hierarchical model, the RF for a neuron is determined by the neurons in the immediate lower layer reporting to it. The RF size is the same for all neurons in a layer and increases as we ascend the hierarchy. In this paper, we will concentrate on learning feature hierarchies using the feedforward connections only.

[Figure 2: panels (i) to (vi)]

Figure 2. Operation of our architecture is based on the simple cell wiring model of Hubel & Wiesel [20]. In (i), a 3 × 3 grid of nine partially overlapping on-center-off-surround ganglion cells are shown that connect to a simple cell in the higher layer (primary visual cortex or V1). If the three cells along the main diagonal are repeatedly active together while the other six are not, the corresponding ideal weight matrix for the connections and the receptive field of simple cell would be as shown in (ii) and (iii) respectively. We replicate this wiring model in each layer in our architecture. Thus, an example of a receptive field obtained by wiring the simple cells arranged in a 3 × 3 grid is shown in (vi) which corresponds to the receptive field of a cell in the secondary visual cortex (or V2).

Figure 1. Architecture used in our model is shown. The circles in each layer denote nodes each of which contains an ensemble of neurons. See section II for details.

B. Operation

At each sampling instant, our model accepts spatial data as input through the first layer which is passed on to higher layers in the form of activations. The goal of computations in each node is to selectively cluster the data into groups [19]. Over time, each neuron in a node gets tuned to a unique feature which represents a cluster center. Functionally, a node is a bag of filters all of which are applied to each patch of the input data. The output (or state, see section II-C) of a layer is the input to the next higher layer. The same operation is executed in each node in any layer. Thus, the feedforward weights in this hierarchical model are learned by recursive layer-by-layer spherical clustering. Fig. 2 depicts the operation of this model.

Notation. $N^{(l)}(i)$ is the set of neurons in layer $l$ that connect to the $i$th neuron in some layer. This is referred to as the neighborhood in layer $l$ of the $i$th neuron. $W_{ji}^{(k,l)}(t)$ is the weight or strength of connection from the $j$th neuron in layer $k$ to the $i$th neuron in layer $l$ at time $t$. $A_i^{(l)}(t)$, $S_i^{(l)}(t)$ and $\theta_i^{(l)}(t)$ are the activation, state and threshold of the $i$th neuron in layer $l$ at time $t$. $l = 0$ for the input (or lowest) layer.

C. Neuron

In our model, the activation of a neuron is given by

$$A_i^{(l)}(t) = \sum_{j \in N^{(l-1)}(i)} W_{ji}^{(l-1,l)}(t) \times A_j^{(l-1)}(t) \qquad (2)$$

In matrix form, $A^{(l)} = A^{(l-1)} \times W^{(l-1,l)}$ where the neighborhood information is implicit. Since each feature in $W^{(l-1,l)}$ and $A^{(l-1)}$ are normalized, $A^{(l)}$ is the normalized dot product of the input with each feature. This allows a neuron to act as a suspicious coincidence detector [21], responding with high activation if the input pattern matches the feature encoded in its receptive field. For a given input, all neurons in a node receive activations. The maximally activated neuron in a node is the "winner". While we compute the winner using a max operation, it is more biologically plausible to consider lateral connections within a node using which neurons inhibit each other at a faster time scale eventually settling at some stable state. Lateral inhibition has been used for similar purposes in many models, such as in [14], in the form of V-cells in Neocognitron [8], and in the LISSOM model [22]. The state of a neuron is binary and is given by

$$S_i^{(l)}(t) = \begin{cases} 1, & \text{if } A_i^{(l)}(t) > A_j^{(l)}(t),\ \forall j \neq i, \text{ and } A_i^{(l)}(t) > \theta_i^{(l)}(t) \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The threshold θ is adaptive and unique for each neuron. Only the winner in a node is assigned the state 1 if its threshold is crossed. This is how our model implements the winner-take-all mechanism which allows only the neuron of highest activity to learn. We say a neuron has fired if its state reaches unity. Thus, a neuron integrates all inputs over its RF until it reaches its threshold when it fires if it is the winner. As soon as it fires or if it fails to fire, it discharges and then starts integrating again. The discharge from a neuron inhibits neighboring neurons in its own layer. As in [14], it may be assumed that this lateral inhibition is proportional to a neuron's total accumulated charge (or activation) and operates at a faster time scale. The inhibition is required to ensure that neurons in a layer do not get tuned to the same feature. The inhibition influences a neuron's activation which in turn influences its inhibition. This cycle ensues until a stable state is reached. In most practical cases, this inhibition is observed to be strong enough to drive all neurons close to their baseline activation. In our implementation, we assume this baseline to be zero which does not affect our features qualitatively.

D. Learning: Updating weights and thresholds

Feedforward weights to neuron $j$ in layer $l$ with $S_j^{(l)}(t) = 1$ are updated following Hebbian rule.

$$W_{ij}^{(l-1,l)}(t+1) = (1-\alpha) \times W_{ij}^{(l-1,l)}(t) + \alpha \times S_i^{(l-1)}(t) \qquad (4)$$

where α is the learning rate that decreases with time for finer convergence, $0 < \alpha < 1$, $S^{(0)} = A^{(0)}$. This weight update rule is obtained by applying gradient descent on the objective function in equ. 1 in an online setting. Feedforward weights leading to each neuron are initialized to ones and normalized to have unit norm, which allows all neurons in a layer to compete on an equal footing. A new neuron is not recruited unless the incoming pattern is more similar to the initialized feature than to any of the learned features. After each update, weights to each neuron are normalized to have unit norm. Thus, feedforward connections from a presynaptic neuron ($i$) to a postsynaptic one ($j$) that fire together are strengthened while the rest (to $j$) are weakened. The weakening of connections is crucial for robustness as it helps remove infrequent coincident patterns from memory which are probably noise. The threshold is updated as follows:

$$\theta_i^{(l)}(t+1) = \begin{cases} A_i^{(l)}(t), & \text{if } S_i^{(l)}(t) = 1 \\ (1-\eta) \times \theta_i^{(l)}(t), & \text{otherwise} \end{cases} \qquad (5)$$

where η is the threshold decay parameter, a constant, $0 < \eta < 1$. Due to the threshold, only a subset of stimuli can trigger learning. If η = 1, all stimuli are used in learning as in traditional clustering algorithms. If η = 0, no stimulus can cross the threshold, hence learning does not occur. Size of the set of effective stimuli reduces with reduction in the value of η. The threshold decay mechanism ensures that the size of the effective subset remains fixed throughout the learning process, thereby maintaining the plasticity of the network. The winner-take-all mechanism along with the threshold favor neurons with sparsely distributed activity.

In the proposed model, a neuron always passes on its activations to its neighboring neurons in all layers irrespective of whether it fires or not. This is crucial for online operation where learning and inferencing proceed simultaneously and not in distinct phases. If a pattern has been learned and a part of it is shown, a partial pattern of activations will stimulate the remaining neurons of the pattern to become active thereby completing the whole pattern. However, the strength of connections will not be altered unless enough of the pattern has been seen (as determined by θ) and the RFs of the presynaptic neurons are the best match to the incoming pattern in their respective nodes to fire the postsynaptic neuron in the higher layer.

III. EXPERIMENTAL RESULTS

The proposed model was deployed for learning feature hierarchies from data in different modalities in an unsupervised and online manner with the learning rule derived from the same objective function as in equ. 1. The feedforward weights were learned layer by layer with $\alpha(t) = \alpha(t-1)/(1 + t/10^6)$, $\alpha(0) = 0.1$. θ were initialized to ones. Overlap between patches for adjacent nodes was 75% and 25% in the first and second layers respectively. The top layer had only one node. The number of nodes in each layer is a function of the % overlaps and the RF sizes of neurons.

Features for the second and third layers were reconstructed as follows. For a neuron in the second layer, a neuron in each first layer node that was most strongly connected to it was chosen. The features represented by these neurons were weighted by the connection strengths and spatially organized taking into consideration the % overlaps among nodes. Once the second layer features were constructed, the same procedure was carried out for each third layer neuron to construct their features.

To reconstruct unknown data, a winner neuron was computed in each node in the highest layer. A neuron in each node in the lower layers was chosen based on strongest connection to the winner. The chosen lowest layer features, each multiplied by the norm of the corresponding input data patch and spatially organized based on the % overlaps among nodes, reconstructed the input.
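The learning procedure of section II with the settings listed above can be sketched for a single first-layer node as follows. This is a minimal sketch under stated assumptions, not our implementation: the class name `Node`, its method names and the toy inputs are hypothetical, ties between equally activated neurons are broken by lowest index (an assumption, since equ. 3 is strict), the learning-rate schedule is read as dividing by 1 + t/10^6, and lateral inhibition is replaced by the max operation mentioned in section II-C.

```python
import math

def normalize(v):
    """Scale a vector to unit norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

class Node:
    """One node: a set of neurons competing for each unit-norm input patch."""

    def __init__(self, n_neurons, n_inputs, eta, alpha0=0.1):
        # Weights start at ones and are normalized; thresholds start at one.
        self.w = [normalize([1.0] * n_inputs) for _ in range(n_neurons)]
        self.theta = [1.0] * n_neurons
        self.eta = eta
        self.alpha = alpha0
        self.t = 0

    def activations(self, x):
        # Equ. 2: dot product of the input with each neuron's weights.
        return [sum(wj * xj for wj, xj in zip(w, x)) for w in self.w]

    def step(self, x):
        """Process one input; return (activations, index of the firing neuron or None)."""
        self.t += 1
        self.alpha = self.alpha / (1.0 + self.t / 1e6)  # assumed schedule, see lead-in
        x = normalize(x)
        a = self.activations(x)
        winner = max(range(len(a)), key=lambda i: a[i])  # ties go to the lowest index
        fired = winner if a[winner] > self.theta[winner] else None  # equ. 3
        for i in range(len(self.w)):
            if i == fired:
                # Equ. 4 followed by renormalization, then the firing branch of equ. 5.
                self.w[i] = normalize([(1 - self.alpha) * wj + self.alpha * xj
                                       for wj, xj in zip(self.w[i], x)])
                self.theta[i] = a[i]
            else:
                self.theta[i] *= (1 - self.eta)  # decay branch of equ. 5
        return a, fired

# Hypothetical usage: two alternating orthogonal patterns, two neurons.
node = Node(n_neurons=2, n_inputs=2, eta=0.5)
for _ in range(100):
    node.step([1.0, 0.0])
    node.step([0.0, 1.0])
```

In this toy run no neuron can fire on the first input because every threshold starts at one; learning begins only after the thresholds have decayed, and each neuron then drifts toward one of the two patterns. Activations are returned on every step whether or not a neuron fires, which mirrors the separation between inference (activation) and learning (state).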

Figure 4. A hierarchy of features was learned from handwritten numerals in MNIST dataset in first, second and third layers with receptive field sizes 10 × 10, 16 × 16 and 28 × 28 respectively. 400, 150 and 50 features from first (top left), second (top right) and third (bottom) layers are shown.

A. Images

Our model was trained with three layers on natural images (downloaded from Google images). The images were converted to grayscale, and convolved with a Laplacian of Gaussian filter to crudely highlight edges (performed by center-surround cells before the signal reaches V1). The features learned in the first layer were edges/bars in different orientations and phases, similar to RFs of simple cells in V1 [20]. The features learned in the second layer were different combinations of these edges, similar to RFs found in V2 [23], [24] (see Fig. 3). Features learned in the third layer were unstable and did not show any coherent pattern.

Our model also learned three layers of features from 60,000 images of ten handwritten numerals {0, 1, ..., 9} from the MNIST dataset [3]. As shown in Fig. 4, parts of numerals were learned in first layer, larger parts in the second layer, and whole numerals in the third layer. The grayscale intensity denotes the strength of a feature. η was chosen as $10^{-4}$ for natural images and $10^{-1}$ for MNIST as there are many more outliers in the former data set compared to the latter. Thus, the same model could learn three layers of features from the MNIST data but only two layers from natural images due to the absence of recurring coincidences among second layer features in the latter case.
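The preprocessing of natural images described above (Laplacian of Gaussian filtering, followed by unit-norm patches as required in section I-A) could look roughly like the sketch below. The filter width `sigma`, the kernel `radius` and all function names are hypothetical choices made for illustration; the values actually used are not stated in this paper.

```python
import math

def log_kernel(sigma, radius):
    """Sampled Laplacian-of-Gaussian kernel, shifted so that its entries sum to zero."""
    k = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = (x * x + y * y) / (2.0 * sigma * sigma)
            row.append(-(1.0 - r2) * math.exp(-r2) / (math.pi * sigma ** 4))
        k.append(row)
    mean = sum(map(sum, k)) / (2 * radius + 1) ** 2
    return [[v - mean for v in row] for row in k]

def filter2d(image, kernel):
    """'Valid' 2-D correlation of a list-of-lists image with a square kernel.

    The kernel above is symmetric, so correlation and convolution coincide.
    """
    n = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(n) for b in range(n))
             for j in range(w - n + 1)]
            for i in range(h - n + 1)]

def unit_norm_patch(image, top, left, size):
    """Flatten a size x size patch and scale it to unit norm (zero patches stay zero)."""
    flat = [image[top + a][left + b] for a in range(size) for b in range(size)]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Hypothetical 9 x 9 grayscale image with a vertical step edge.
edge_image = [[0.0] * 4 + [1.0] * 5 for _ in range(9)]
kernel = log_kernel(sigma=1.0, radius=2)
filtered = filter2d(edge_image, kernel)
patch = unit_norm_patch(filtered, 0, 0, 3)
```

A zero-sum kernel gives no response on uniform regions, so only locations near intensity changes contribute to the patches that the first layer sees.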

Figure 3. Features of size 10 × 10 and 20 × 20 were learned from natural images in first (left) and second layers (right). 49 out of 150 and 70 out of 100 features from first and second layers are shown.

B. Videos

Spatiotemporal video features were learned in our model from 3D voxels where time is the third dimension. Such features have often been learned from voxels for computer vision and machine learning applications, particularly for action recognition (see for example, [25], [26]). When our model was exposed to videos of ten actions (e.g., walking, waving) performed by nine subjects from the Weizmann dataset [27] with $\eta = 10^{-2}$, the first layer neurons with RF size 10 × 10 × 5 learned edges in different orientations and moving in different directions. That is, they developed orientation- and direction-selective RFs as in complex cells in V1 [20] (see Fig. 5). Consequently, they respond to static edges/bars in a particular orientation in different locations within their RFs, and therefore, have learned position-invariant features.

Figure 5. 30 out of 100 features learned in first layer from action videos (e.g., walking, waving) are shown. Each row is a spatiotemporal feature with spatial RF size 10 × 10, temporal RF size 5, and direction from left to right.

C. Clustering

As stated in section I-A, our learning algorithm may be construed as a special case of clustering. We compared its clustering performance to that of three algorithms with interesting properties. First, the k-means is one of the most widely used clustering algorithms and its performance will serve as a benchmark. Second is the algorithm proposed by Einhäuser et al. [14] for learning features from natural videos. It has two distinct properties: division by past trace for achieving translation or viewpoint invariance, proposed by Földiák [21], and lateral inhibition for determining the winner. Third is the topology adaptive self-organizing neural network or TASONN [28] for skeletonization of data sets. It belongs to the class of algorithms known as growing neural gas [29] which start with a very few neurons and strategically add neurons and connections with learning until a stopping criterion is met. Hence, the final result is immune to bad initializations.

Five datasets from the UCI machine learning repository [30] were used in our experiments (see Table I). Table II shows the performance (mean µ ± std. dev. σ) of four unsupervised and two supervised algorithms over 1000 trials on each of the datasets. The advantage of TASONN and our model over k-means for initialization is revealed by the σ. On average over all datasets, the classification accuracies of Einhäuser et al.'s model and TASONN were 45%, k-means and our model were 64%, and the two supervised algorithms were 74%. For measuring similarity, k-means and TASONN use Euclidean distance while Einhäuser et al.'s and our models use dot product. Among the four unsupervised algorithms, our model performed with highest accuracy and lowest σ. Fig. 6 shows the variation in performance of our model for different values of η for each dataset. The best performance is achieved at $\eta = 10^{-2}$; however, for natural data with many more outliers, $\eta = 10^{-4}$ performs better.

Table I
BENCHMARK UCI DATASETS

Name of dataset   No. of points   No. of dimensions   No. of classes
Iris              150             4                   3
Wine              178             13                  3
Glass             214             9                   6
Vehicle           846             18                  4
Segment           2310            19                  7

Table II
PERFORMANCE OF ALGORITHMS ON THE UCI DATASETS (% accuracy)

                  Unsupervised (µ ± σ)                                                                       Supervised (µ)
Name of dataset   k-means (Matlab)   TASONN model [28]   Einhäuser et al.'s model [14]   Our model (η = 0.01)   SVM (Matlab)   Mean Classifier
Iris              82.7 ± 12.4        90.3 ± 1.3          47.1 ± 4.3                      71.4 ± 2.9             76.7           93.3
Wine              95.0 ± 4.0         42.8 ± 2.6          59.4 ± 1.7                      89.6 ± 1.9             99.4           97.2
Glass             43.2 ± 2.8         45.7 ± 4.0          46.6 ± 2.3                      48.7 ± 1.6             59.8           51.4
Vehicle           37.0 ± 0.7         27.7 ± 1.3          30.1 ± 0.6                      39.1 ± 2.0             73.9           45.3
Segment           60.2 ± 6.8         18.7 ± 0.7          37.0 ± 1.8                      67.2 ± 2.4             60.3           84.2

[Figure 6: plot of % accuracy against the threshold decay parameter η (in log scale), with curves for Iris, Wine, Glass, Vehicle, Segment and Mean]

Figure 6. The influence of η on the performance of our model on five UCI datasets is shown. The errorbars indicate standard deviations.

IV. CONCLUSIONS

Manually studying and encoding features for discovery, recognition or prediction is not always feasible given the 5Vs of sensorial data. In the past decade, a number of relatively complex models that require long training time to tune a large number of hyperparameters have been proposed for learning feature hierarchies. We have shown that a fully-learnable model, with only two manually tunable parameters, can learn rich meaningful feature hierarchies from repeating coincidences in spatiotemporal data in different modalities in an online and unsupervised manner. We used the McCulloch-Pitts neuron model with a variable threshold that is unique for each neuron and adaptive to the data. A constant parameter was used to decay this threshold such that the influence of outliers on learning may be controlled. This is crucial for using the same model for learning from data with different proportion of outliers, such as, natural images with a lot of outliers and clean handwritten numerals, as in MNIST data set, with very few outliers.

Learning was facilitated by the Hebbian rule and winner-take-all mechanism. Generative models employing iterative coefficient computation offer more explanatory power but are computationally less efficient than the winner-take-all mechanism; efficiency is of primary importance when dealing with Big Data at the lower levels of abstraction.

Our network architecture consisted of a hierarchy of layers of nodes, a node being a canonical computational unit consisting of an ensemble of neurons. We showed how our algorithm, without any alteration, could learn the feedforward weights in this architecture which embody the features, from images and videos. Neurons, when exposed to spatiotemporal data (videos), got tuned to position-invariant and direction-selective features. The model scales to realistic-sized high-dimensional data and arbitrary number of layers. Operations in each layer of the model can be implemented in parallelized hardware, making it very efficient for real world Big Data applications. We conclude that the proposed model is a promising candidate for verifying the common cortical algorithm hypothesis as well as for discovering spatiotemporal objects in real world Big Data.

ACKNOWLEDGMENT

Research reported in this paper was partially supported by NSF CISE Grant No. 1231620. We thank Ravi P. Kasani for simulating the algorithms on video data.

REFERENCES

[1] L. Zhu, C. Lin, H. Huang, Y. Chen, and A. Yuille, “Unsupervised structure learning: Hierarchical recursive composition, suspicious coincidence and competitive exclusion,” in Proc. 10th European Conf. Computer Vision, 2008, pp. 759–773.
[2] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Intl. Conf. Machine Learning, 2009, pp. 609–616.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Intl. Conf. Machine Learning, 2008, pp. 1096–1103.
[6] M. Ranzato, Y. Boureau, S. Chopra, and Y. LeCun, “A unified energy-based framework for unsupervised learning,” J. Machine Learning Res., vol. 2, pp. 371–379, 2007.
[7] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” J. Machine Learning Res., vol. 10, pp. 1–40, 2009.
[8] K. Fukushima, “Neocognitron for handwritten digit recognition,” Neurocomputing, vol. 51, no. 1, pp. 161–180, 2003.
[9] T. Serre, A. Oliva, and T. Poggio, “A feedforward architecture accounts for rapid categorization,” Proc. Natl. Acad. Sci., vol. 104, no. 15, pp. 6424–6429, 2007.
[10] D. George, “How the brain might work: A hierarchical and temporal model for learning and recognition,” Ph.D. dissertation, Stanford University, CA, 2008.
[11] L. Theis, S. Gerwinn, F. Sinz, and M. Bethge, “In all likelihood, deep belief is not enough,” J. Machine Learning Res., vol. 12, pp. 3071–3096, 2011.
[12] I. S. Dhillon and D. S. Modha, “Concept decompositions for large sparse text data using clustering,” Machine Learning, vol. 42, no. 1-2, pp. 143–175, 2001.
[13] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in 27th Asilomar Conf. Signals, Systems and Computers. IEEE, 1993, pp. 40–44.
[14] W. Einhäuser, C. Kayser, P. König, and K. P. Körding, “Learning the invariance properties of complex cells from their responses to natural stimuli,” European J. Neurosci., vol. 15, no. 3, pp. 475–486, 2002.
[15] M. Constantine-Paton and M. I. Law, “Eye-specific termination bands in tecta of three-eyed frogs,” Science, vol. 202, no. 4368, pp. 639–641, 1978.
[16] C. Metin and D. O. Frost, “Visual responses of neurons in somatosensory cortex of hamsters with experimentally induced retinal projections to somatosensory thalamus,” Proc. Natl. Acad. Sci., vol. 86, no. 1, pp. 357–361, 1989.
[17] L. von Melchner, S. L. Pallas, and M. Sur, “Visual behaviour mediated by retinal projections directed to the auditory pathway,” Nature, vol. 404, no. 6780, pp. 871–876, 2000.
[18] P. Bach-y-Rita and S. W. Kercel, “Sensory substitution and the human-machine interface,” Trends in Cognitive Sci., vol. 7, no. 12, pp. 541–546, 2003.
[19] J. K. Dutta and B. Banerjee, “Learning features and their transformations by spatial and temporal spherical clustering,” arXiv:1308.2350, 2013.
[20] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex,” J. Physiology, vol. 160, pp. 106–154, 1962.
[21] P. Földiák, “Forming sparse representations by local anti-Hebbian learning,” Biological Cybernetics, vol. 64, pp. 165–170, 1990.
[22] J. Sirosh and R. Miikkulainen, “Topographic receptive fields and patterned lateral interaction in a self-organizing model of the primary visual cortex,” Neural Computation, vol. 9, no. 3, pp. 577–594, 1997.
[23] J. Hegde and D. C. Van Essen, “Selectivity for complex shapes in primate visual area V2,” J. Neurosci., vol. 20, no. 5, pp. RC61 1–6, 2000.
[24] M. Ito and H. Komatsu, “Representation of angles embedded within contour stimuli in area V2 of macaque monkeys,” J. Neurosci., vol. 24, no. 13, pp. 3313–3324, 2004.
[25] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” in Intl. Conf. Machine Learning, 2010, pp. 221–231.
[26] Q. V. Le, W. Zou, S. Yeung, and A. Y. Ng, “Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis,” in IEEE Conf. Computer Vision Pattern Recog., 2011, pp. 3361–3368.
[27] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, 2007.
[28] A. Datta, S. K. Parui, and B. B. Chaudhuri, “Skeletonization by a topology-adaptive self-organizing neural network,” Pattern Recognition, vol. 34, no. 3, pp. 617–629, 2001.
[29] B. Fritzke, “A growing neural gas network learns topologies,” in Advances in Neural Information Processing Systems 7. MIT Press, 1995, pp. 625–632.
[30] C. L. Blake and C. J. Merz, “UCI repository of machine learning databases,” 1998, University of California Irvine, Available at www.ics.uci.edu/~mlearn.
