Active Perception based on Multimodal Hierarchical Dirichlet Processes

Tadahiro Taniguchi, Toshiaki Takano
Department of Human and Computer Intelligence, Ritsumeikan University
Shiga, Japan
{taniguchi, takano}@em.ci.ritsumei.ac.jp

Ryo Yoshino
Graduate School of Information Science and Engineering, Ritsumeikan University
Shiga, Japan
[email protected]

Abstract

We describe an optimal active perception method for recognizing multimodal object categories formed using the multimodal hierarchical Dirichlet process (MHDP). The MHDP is a multimodal categorization method that enables a robot to organize object categories from multimodal information, such as audio, visual, and haptic information, in an unsupervised manner. In general, a robot requires a certain amount of time to sense a target object and obtain each modality's information. Therefore, a robot should select appropriate actions to recognize a target object efficiently within a limited amount of time. For this purpose, we introduce an action selection method for multimodal object category recognition based on the MHDP and an information gain criterion. Its optimality, defined via the Kullback-Leibler divergence between a final recognition state and the current recognition state, is also proved. Furthermore, we show that the criterion is submodular by virtue of the graphical model of the MHDP. Sequential action selection methods, namely a greedy algorithm and a lazy greedy algorithm, are proposed based on this submodular property. We conduct experiments using an upper-torso humanoid robot and synthetic data. The results show that the method enables the robot to actively select actions and efficiently recognize target objects.

1 Introduction

Humans form object categories using multimodal information such as audio, visual, and haptic information [1]. In human cognitive systems, object categories are not learned from externally provided class labels, as in pattern recognition problems; in supervised learning, a pattern recognizer is required to predict labels that are provided as part of a training set. In contrast, our cognitive system essentially forms object categories on the basis of multimodal information in an unsupervised manner [2]. Various models for multimodal object category formation have been studied [3–14]. The multimodal hierarchical Dirichlet process (MHDP) proposed by Nakamura et al. is one of the prominent candidate computational models for multimodal object categorization [9]. Moreover, Nakamura et al. proposed a series of multimodal categorization methods in related work [5–10, 15] and conducted many experiments showing that the MHDP and its variants enable robotic systems to form object categories, much as humans do, using audio, visual, and haptic information in an unsupervised manner. They also integrated a nonparametric Bayesian language model into the multimodal categorization model and achieved unsupervised lexical acquisition from continuous speech signals [10].

[Figure 1 graphics: photographs of sensing actions — visual image (looking around), haptic information (grasping), auditory input (hitting), and auditory input (shaking) — and the MHDP graphical model with variables λ, γ, β, π_j, α_0^m, t_jn^m, k_jt, θ_k^m, and x_jn^m over plates N_j^m, J, M, and ∞.]
Figure 1: The robot used in the experiment (left), and a graphical representation of the MHDP with M modalities corresponding to actions for perception (right) [16].

However, in practice, humans and robots cannot obtain all of the multimodal information related to a target object at the same time. Sensory information that can be obtained passively is limited; most sensory information must be obtained actively, i.e., by executing certain sensing behaviors. For example, audio information about maracas can be obtained by shaking them, and tactile information can be obtained by grasping them. Some objects may require many actions, e.g., grasping, shaking, and hitting, to obtain their sensory information, and obtaining all of the possible multimodal information may require an excessive amount of time. Therefore, autonomous cognitive systems, such as humans and robots, should select appropriate actions to efficiently recognize a target object within a limited amount of time. Active exploration and perception behavior is essential for autonomous cognitive systems such as humans and animals. In this study, we provide a theoretical action selection method for multimodal object category recognition based on the MHDP. Owing to our unsupervised learning approach, the optimality of the proposed method is proved using the Kullback-Leibler divergence between a final recognition state and the current recognition state as the evaluation criterion. Furthermore, the submodularity of the action selection criterion is proved on the basis of the graphical model of the MHDP. Using the submodular property of the information gain criterion, we propose cost-efficient sequential action selection methods. We conduct experiments using an upper-torso humanoid robot and synthetic data. The results show that the method enables the robot to actively select actions and efficiently recognize target objects.

2 Active Perception for Multimodal Object Category Recognition

2.1 Multimodal hierarchical Dirichlet process

The graphical model of the MHDP is shown in Fig. 1. The MHDP is a multimodal extension of the hierarchical Dirichlet process (HDP) proposed by Teh et al. [17]. The HDP is, in turn, a nonparametric Bayesian extension of latent Dirichlet allocation (LDA), which was originally proposed for document-word clustering [18]. The MHDP assumes that a robot obtains sensory information for each modality by executing an action corresponding to that modality. Sensory information for each modality is represented by a bag-of-features (Fig. 1). Topics in LDA correspond to object categories in the MHDP, and words in LDA correspond to features in the MHDP. A Gibbs sampling procedure enables the MHDP to estimate the latent variables in the model.
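The bag-of-features encoding for one modality can be sketched as follows. The codebook, feature vectors, and nearest-neighbor quantization below are illustrative assumptions, not the paper's actual feature extractors:

```python
# Sketch: bag-of-features encoding for one modality (illustrative codebook).
def quantize(feature, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean)."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, entry)) for entry in codebook]
    return dists.index(min(dists))

def bag_of_features(features, codebook):
    """Histogram of codebook-entry counts: the modality's bag-of-features."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[quantize(feature, codebook)] += 1
    return counts

# Hypothetical 2-D haptic features and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.1, 1.0)]
print(bag_of_features(features, codebook))  # → [1, 2, 1]
```

Each modality's histogram is what the MHDP treats as the observed "words" of that modality.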

2.2 Active perception using information gain

Obtaining information from all modalities is not necessary for the MHDP to calculate the posterior distribution over object categories when a robot attempts to recognize a target object. However, insufficient information degrades the accuracy of the posterior distribution, i.e., the recognition result. For example, the grasping behavior alone may not be sufficient to distinguish a radish from a carrot; in such a case, color information may help the robot tell them apart. Active perception can thus be formulated as the selection of a subset A ⊂ M of the set of modalities M.

In the MHDP, the recognition state is represented by the posterior distribution P(z_j | X_j^{m_j^o ∪ A}), where z_j = {{k_jt}_{1≤t≤T_j}, {t_jn^m}_{m∈M, 1≤n≤N_j^m}} is a latent variable representing the j-th object's topic information, i.e., its object category, and m_j^o ⊂ M is the set of modalities the robot has already observed. The final recognition state, after the information from all modalities M is obtained, is represented by P(z_j | X_j^M). The MHDP is an unsupervised learning method; it uses no outside knowledge. Therefore, P(z_j | X_j^M) is regarded as the true recognition result, and a posterior distribution close to P(z_j | X_j^M) is considered a good recognition result. We propose using the information gain criterion, i.e., the expected Kullback-Leibler (KL) divergence, to select the modalities to sense, A ∈ F_L^{m_j^o}, where F_L^{m_j^o} is the family of subsets of M \ m_j^o with no more than L elements. It can be shown that the proposed method is optimal when the distance between the final recognition state and the recognition state after executing the selected actions is measured by the KL divergence.
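Closeness between recognition states is measured by the KL divergence between posteriors over z_j. For discrete category posteriors this reduces to a one-line sum; the two example distributions below are invented for illustration:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical posteriors: final recognition state vs. current state.
p_final = [0.7, 0.2, 0.1]
p_current = [0.4, 0.4, 0.2]
print(kl_divergence(p_final, p_current))  # small when the states are close
```

The divergence is zero exactly when the current state already matches the final state.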

Theorem 1. A set of next actions A ∈ F_L^{m_j^o} that maximizes the expected KL divergence between the posterior distribution over z_j after executing the set of actions A and the current posterior distribution over z_j also minimizes the expected KL divergence between the posterior distribution over z_j after the information of all modalities is observed and that after executing the actions A, i.e.,

    argmin_{A ∈ F_L^{m_j^o}} E_{X_j^{M \ m_j^o} | X_j^{m_j^o}} [ KL( P(z_j | X_j^M), P(z_j | X_j^{m_j^o ∪ A}) ) ]
      = argmax_{A ∈ F_L^{m_j^o}} E_{X_j^A | X_j^{m_j^o}} [ KL( P(z_j | X_j^{A ∪ m_j^o}), P(z_j | X_j^{m_j^o}) ) ].   (1)

Proof. To prove this theorem, we refer to [16].

Define the function IG(X; Y | Z) = KL( P(X, Y | Z), P(X | Z) P(Y | Z) ). IG(X; Y | Z) is the information gain of Y for X, calculated using probability distributions commonly conditioned on Z. Using IG, the active perception strategy when #(A) = 1 is

    m_j^* = argmax_{m ∈ M \ m_j^o} IG( z_j ; X_j^m | X_j^{m_j^o} ).   (2)

In other words, the robot selects the action m_j^* that obtains the X_j^m maximizing the information gain for the recognition state z_j, under the condition that the robot has already observed X_j^{m_j^o}. Straightforward calculation of (2) is computationally inefficient and even impossible in practice. However, an efficient Monte Carlo approximation can be derived as follows [16]:

    IG( z_j ; X_j^m | X_j^{m_j^o} ) ≈ (1/K) Σ_k log [ P( X_j^{m[k]} | z_j^{[k]}, X_j^{m_j^o} ) / ( (1/K) Σ_{k'} P( X_j^{m[k]} | z_j^{[k']}, X_j^{m_j^o} ) ) ].   (3)
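A minimal sketch of the Monte Carlo estimate in (3), with `sample` and `likelihood` as placeholders for the model-specific MHDP computations (drawing from P(z_j, X_j^m | X_j^{m_j^o}) and evaluating P(X_j^m | z_j, X_j^{m_j^o}), respectively):

```python
import math

def monte_carlo_ig(sample, likelihood, K):
    """Estimate IG(z_j; X_j^m | observed) as in Eq. (3).

    sample(k)        -> (z[k], x[k]) drawn from P(z_j, X_j^m | observed)
    likelihood(x, z) -> P(X_j^m = x | z, observed)
    """
    draws = [sample(k) for k in range(K)]
    total = 0.0
    for z_k, x_k in draws:
        numer = likelihood(x_k, z_k)
        # Marginal likelihood of x_k, averaged over the sampled z's.
        denom = sum(likelihood(x_k, z_kp) for z_kp, _ in draws) / K
        total += math.log(numer / denom)
    return total / K

# Degenerate demo: if the likelihood ignores z, the estimated gain is zero.
est = monte_carlo_ig(lambda k: (k % 2, k), lambda x, z: 1.0, K=8)
print(est)  # → 0.0
```

Intuitively, each term compares how well the sampled category explains its own sample against the sample-averaged likelihood; when observing X_j^m tells us nothing about z_j, every term vanishes.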

2.3 Sequential decision making based on submodularity

When the number of modalities, i.e., of types of sensing behaviors, becomes large, the number of candidate subsets A that may maximize the IG criterion grows combinatorially. Selecting actions is thus a costly combinatorial search problem. However, by virtue of the graphical model of the MHDP, the IG criterion can be shown to be submodular and non-decreasing [19]. This provides an efficient approximate solution to the search problem.

Theorem 2. The evaluation criterion for multimodal active perception, IG( z_j ; X_j^A | X_j^{m_j^o} ), is a submodular and non-decreasing function with respect to X_j^A.

Proof. To prove this theorem, we refer to [16].

Nemhauser et al. proved that the greedy algorithm selects a subset that is at most a constant factor of (1 − 1/e) worse than the optimal set if the evaluation function F(A) is submodular, non-decreasing, and F(∅) = 0, where F(·) is a set function and A is a set [20]. A lazy greedy algorithm,

Algorithm 1 Greedy algorithm for multimodal active perception [16]

Require: The MHDP is trained using a training data set. The j-th object is found. m_j^o is initialized, and X_j^{m_j^o} is observed.
for l = 1 to L do
    for all m ∈ M \ m_j^o do
        for k = 1 to K do
            Draw ( z_j^{[k]}, X_j^{m[k]} ) ~ P( z_j, X_j^m | X_j^{m_j^o} )
        end for
        IG_m ← (1/K) Σ_k log [ P( X_j^{m[k]} | z_j^{[k]}, X_j^{m_j^o} ) / ( (1/K) Σ_{k'} P( X_j^{m[k]} | z_j^{[k']}, X_j^{m_j^o} ) ) ]
    end for
    m* ← argmax_m IG_m
    Execute the m*-th action on the j-th target object and obtain X_j^{m*}.
    m_j^o ← m_j^o ∪ {m*}
end for
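The outer loop of Algorithm 1 can be sketched generically as follows; `info_gain` stands in for the Monte Carlo IG estimate, `execute` for the physical sensing action, and the toy gains are invented for illustration:

```python
def greedy_active_perception(modalities, observed, info_gain, execute, L):
    """Greedily select and execute up to L sensing actions by IG."""
    observed = set(observed)
    for _ in range(L):
        candidates = [m for m in modalities if m not in observed]
        if not candidates:
            break
        best = max(candidates, key=lambda m: info_gain(m, observed))
        execute(best)            # sense the object with action `best`
        observed.add(best)       # its modality is now observed
    return observed

# Toy run with fixed (hypothetical) gains; highest-gain actions come first.
gains = {"look": 0.1, "grasp": 0.6, "hit": 0.9, "shake": 0.4}
executed = []
result = greedy_active_perception(
    gains, {"look"}, lambda m, obs: gains[m], executed.append, L=2)
print(executed)  # → ['hit', 'grasp']
```

In the real system, `info_gain` would re-run the Monte Carlo estimate after each newly observed modality, so the ranking of remaining actions can change between iterations.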

[Figure 2 graphic: a table of the seventeen target objects with visual images, estimated categories, and object IDs — Category 1 Soft ball (Vinyl), Category 2 Plastic bottle (Empty), Category 3 Plastic bottle (Containing bells), Category 4 Can (Steel), Category 5 Cup (Plastic), Category 6 Cup (Metal), and Category 7 Hard ball (Polyethylene).]
Figure 2: Target objects used in the experiment and their categorization results [16].

which makes the greedy algorithm more efficient, can be used as well [21]. Finally, the greedy algorithm shown in Algorithm 1 is obtained for sequential active perception. The performance of the method is supported theoretically. The lazy greedy algorithm can also be derived (see [16]).
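The lazy greedy variant exploits submodularity: marginal gains can only shrink as the selected set grows, so stale gains stored in a max-heap remain valid upper bounds, and most re-evaluations can be skipped. A generic sketch, not the paper's implementation (a set-coverage objective is used as a stand-in submodular function):

```python
import heapq

def lazy_greedy(items, gain, L):
    """Select L items maximizing a submodular gain(item, selected) lazily."""
    selected = []
    # Heap of (-stale_gain, item); gains only shrink as `selected` grows.
    heap = [(-gain(item, []), item) for item in items]
    heapq.heapify(heap)
    while heap and len(selected) < L:
        neg_g, item = heapq.heappop(heap)
        fresh = gain(item, selected)
        if not heap or fresh >= -heap[0][0]:
            selected.append(item)   # stale bound still the best: take it
        else:
            heapq.heappush(heap, (-fresh, item))  # re-insert with fresh gain
    return selected

# Toy coverage objective: marginal number of newly covered elements.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
coverage_gain = lambda s, chosen: len(sets[s] - {e for c in chosen for e in sets[c]})
print(lazy_greedy(sets, coverage_gain, 2))  # → ['c', 'a']
```

For the IG criterion, `gain` would be the Monte Carlo estimate of the marginal information gain of each remaining action, which is exactly where skipping re-evaluations saves sampling time.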

3 Experiments

3.1 Experiment using a robotic system

Figure 1 shows the robotic system used in the experiment. The robot was equipped with a camera, a microphone, and encoders to obtain visual, audio, and haptic information, respectively. Four sensing actions were prepared: looking around, grasping, hitting, and shaking an object. Seventeen objects were prepared as target objects (Fig. 2). Based on the multimodal information obtained using the four types of sensing behaviors, the robot formed object categories, as shown in Fig. 2, in an unsupervised manner. The number of object categories was determined automatically by the HDP. After the robot obtained visual information, it selected the next action. The proposed method was evaluated using the KL divergence. Figure 3 shows the KL divergence after executing an action selected by the IG criterion. The KL divergence between the recognition state after executing the second action and the final recognition state was calculated for all objects, as shown in the box plot in Fig. 3. This plot shows that an action with greater information gain brings the recognition state closer to the final recognition state. Moreover, IG.max clearly reduced the uncertainty about the target objects. We also determined that sequential action selection using the greedy algorithm and the lazy greedy algorithm exhibits approximately the same performance as optimal action selection in the

[Figure 3 plots: left, box plots of KL divergence (0.0–2.5) for the action-selection criteria v_only, IG.min, IG.mid, and IG.max; right, KL divergence of the final state (0.00–0.75) at each step for Worst.case, Average, Lazy.greedy, Greedy, and Best.case.]

Figure 3: Reduction in the KL divergence when executing an action selected using the IG criterion (left), and the KL divergence of the final state at each step for each sequential action selection procedure (right). Note that the line for the greedy algorithm overlaps the line for the lazy greedy algorithm [16].

[Figure 4 plots: posterior probability (0.0–0.6) of categories C1–C14 at each step, for the greedy (left) and random (right) selection procedures.]

Figure 4: Time series of the posterior probability of the category for object 51 during sequential action selection based on the greedy algorithm (left), and the random selection procedure (right) [16].

sequential decision making task (Fig. 3). In our experimental setting, the evaluation of IG_m took less than one second, which is far shorter than the duration required to perform an action. This means that the proposed method is practical for real-time applications.

3.2 Experiment using synthetic data

Our robotic experiment was too limited to fully evaluate the potential of the proposed method. Therefore, we also conducted an experiment using synthetic data, i.e., a virtual robotic environment. In this experiment, synthetic data including 21 object types, 63 objects, and 20 actions or modalities were prepared. The experimental results showed that our method behaves as the theory predicts. A characteristic example of the results is shown in Fig. 4. We intentionally prepared an object that can be classified into two categories at the same time, i.e., whose categorization is somewhat ambiguous. In this case, the appropriate inference result is not to determine one class deterministically, but to output a proper posterior distribution with high probability for both classes, i.e., to exhibit a "confused" recognition state. Even for such a target object, our method worked correctly: Figure 4 shows that the proposed method promptly estimated that the target object is ambiguous, i.e., assigned high probabilities to the two classes. Further experimental results can be found in [16].

4 Conclusion

An active perception method for the MHDP was proposed based on the information gain criterion. We proved that maximizing IG is the optimal criterion for active perception because an action that reduces the expected KL divergence between a final recognition state and the current recognition state can be selected using the criterion. Moreover, we showed that the IG criterion has the submodular and non-decreasing properties by virtue of the graphical model of the MHDP. These properties guarantee that the greedy and lazy greedy algorithms work effectively. Two experiments were conducted, and the validity of the proposed method was evaluated.

To develop embodied autonomous cognitive systems, multimodal object categorization, computationally modeled using a multimodal machine learning architecture, is a fundamental and indispensable capability. Just as human cognitive systems gradually develop on the basis of antecedently developed cognitive capabilities, various intelligent functionalities can be developed on top of multimodal object categorization, e.g., lexical acquisition and active perception [10, 16]. Multimodal machine learning is vital for a constructive approach toward human developmental intelligence [2].

References

[1] Lawrence W. Barsalou. Perceptual symbol systems. Behavioral and Brain Sciences, 22(04):1–16, 1999.
[2] Tadahiro Taniguchi, Takayuki Nagai, Tomoaki Nakamura, Naoto Iwahashi, Tetsuya Ogata, and Hideki Asoh. Symbol emergence in robotics: A survey, 2015. arXiv:1509.08973.
[3] Hande Celikkanat, Guner Orhan, Nicolas Pugeault, Frank Guerin, Sahin Erol, and Sinan Kalkan. Learning and using context on a humanoid robot using latent Dirichlet allocation. In Joint IEEE International Conferences on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 201–207, 2014.
[4] Jivko Sinapov and Alexander Stoytchev. Object category recognition by a humanoid robot using behavior-grounded relational learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 184–190, 2011.
[5] Takaya Araki, Tomoaki Nakamura, Takayuki Nagai, Shogo Nagasaka, Tadahiro Taniguchi, and Naoto Iwahashi. Online learning of concepts and words using multimodal LDA and hierarchical Pitman-Yor language model. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1623–1630, 2012.
[6] Yoshiki Ando, Tomoaki Nakamura, Takaya Araki, and Takayuki Nagai. Formation of hierarchical object concept using hierarchical latent Dirichlet allocation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2272–2279, 2013.
[7] Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Multimodal object categorization by a robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2415–2420, 2007.
[8] Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Grounding of word meanings in multimodal concepts using LDA. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3943–3948, 2009.
[9] Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Multimodal categorization by hierarchical Dirichlet process.
In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1520–1525, 2011.
[10] Tomoaki Nakamura, Takayuki Nagai, Kotaro Funakoshi, Shogo Nagasaka, Tadahiro Taniguchi, and Naoto Iwahashi. Mutual learning of an object concept and language model based on MLDA and NPYLM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'14), pages 600–607, 2014.
[11] Shane Griffith, Jivko Sinapov, Vladimir Sukhoy, and Alexander Stoytchev. A behavior-grounded approach to forming object categories: Separating containers from noncontainers. IEEE Transactions on Autonomous Mental Development, 4(1):54–69, 2012.
[12] Naoto Iwahashi, Komei Sugiura, Ryo Taguchi, Takayuki Nagai, and Tadahiro Taniguchi. Robots that learn to communicate: A developmental approach to personally and physically situated human-robot conversations. In Dialog with Robots: Papers from the AAAI Fall Symposium, pages 38–43, 2010.
[13] Deb K. Roy and Alex P. Pentland. Learning words from sights and sounds: A computational model. Cognitive Science, 26(1):113–146, 2002.
[14] Jivko Sinapov, Connor Schenck, Kerrick Staley, Vladimir Sukhoy, and Alexander Stoytchev. Grounding semantic categories in behavioral interactions: Experiments with 100 objects. Robotics and Autonomous Systems, 62(5):632–645, 2014.

[15] Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Bag of multimodal LDA models for concept formation. In IEEE International Conference on Robotics and Automation, pages 6233–6238, 2011.
[16] Tadahiro Taniguchi, Toshiaki Takano, and Ryo Yoshino. Active perception for multimodal object category recognition using information gain, 2015. arXiv:1510.00331.
[17] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[18] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3(1):993–1022, 2003.
[19] Andreas Krause and Carlos E. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 2005.
[20] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978.
[21] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234–243. Springer, 1978.

