Joint Learning from Video and Caption

Tingran Wang
Peking University

Nishant Shukla, Caiming Xiong
University of California, Los Angeles
[email protected]
{nxs,cxiong}@ucla.edu
Abstract
We propose a multimedia knowledge acquisition framework for a robot learning from demonstration (LfD). Unlike previous LfD systems, which rely mainly on video input, our system utilizes both video and text for more tractable and accurate learning. We show that unifying text commands with video demonstrations helps a robot ground hierarchical knowledge and provides natural human-robot interaction.

1. Introduction

Personal robotics focuses on building robots that collaborate with humans at home or in the workplace. Because tasks differ between users according to their respective needs and unfamiliar environments, it is unrealistic to build personal robots merely by hand-coding behaviors or invoking a simple pre-trained model. Consequently, learning from demonstration (LfD) [1] or learning through dialogue [2] becomes a more reasonable approach to training a robot. Previous LfD systems yield strong results on tasks including action structure detection [5], pick-and-place [3], and grasping [8]. However, these systems focus mainly on trajectory learning or action pattern discovery and lack a hierarchical understanding of the task. Shukla et al. [4] adopt a hierarchical spatial, temporal, and causal knowledge representation, but their structure is partly manually labeled. Natural language, on the other hand, is full of rich structural information and is therefore useful for action representation and human-robot dialogue. We follow Xiong et al.'s work [7] and incorporate text information and dialogue into their system. The robot gains incremental knowledge through text commands, which improves the generalizability and accuracy of the learned knowledge. The contributions of this paper are the following:
1. We demonstrate how to learn from joint video and text input and represent the learned spatial, temporal, and causal knowledge in a uniform hierarchical structure.
2. We incorporate interactive dialogue into the LfD system for more self-motivated learning.

2. Method

We adopt a slightly different approach to learning from demonstration than previous work. Instead of relying only on a video recording of a human performing a task, we allow a caption for each segment. To encapsulate a hierarchical and stochastic understanding of the world, we adopt the Spatial, Temporal, and Causal And-Or Graph (STC-AoG) [6] as our knowledge representation. We first parse the video and text separately, then join their parsing results to form a hierarchical STC-AoG. Finally, we incorporate a dialogue system so the robot can communicate with a human to acquire more specific knowledge according to its needs.
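To make the overall flow concrete, the following is a minimal sketch of the joint parsing loop described above, written by us for illustration; the node class and the stubbed parsers (parse_video_stub, parse_text_stub) are hypothetical placeholders, not the system's actual implementation.

    # Illustrative sketch of the joint video/text learning loop.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AoGNode:
        name: str
        kind: str                                   # "abstract", "concrete", or "example"
        children: List["AoGNode"] = field(default_factory=list)

    def parse_video_stub(segment: dict) -> AoGNode:
        # Placeholder: real video parsing tracks trajectories and key points (Sec. 2.1).
        return AoGNode(segment.get("action", "unknown_action"), "example")

    def parse_text_stub(caption: str) -> AoGNode:
        # Placeholder: real text parsing uses the semantic parser of Tu et al. [6] (Sec. 2.2).
        return AoGNode(caption.split()[0].lower(), "abstract")

    def learn_from_demonstration(segments: List[dict], captions: List[str]) -> AoGNode:
        root = AoGNode("task", "abstract")
        for segment, caption in zip(segments, captions):
            text_node = parse_text_stub(caption)
            text_node.children.append(parse_video_stub(segment))  # ground the text in the video
            root.children.append(text_node)
        return root

    demo = learn_from_demonstration([{"action": "fold_right_sleeve"}],
                                    ["Fold the right sleeve to the middle of cloth"])
    print(demo.children[0].name)   # -> "fold"

In the real system, the parsers of Sections 2.1 and 2.2 replace the stubs, and the grounding step of Section 2.3 decides which video example each text concept is attached to.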
2.1. Video Parsing

To parse the video, we track the action trajectory and the spatial configuration of objects to define the causal dependencies of each action. After watching a few human demonstrations, the robot builds an STC-AoG to represent the knowledge, as per [4].
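One simple way to picture the causal part of this step is to record, for each tracked action, how a symbolic spatial state changes between the start and the end of the motion. The sketch below is our own illustration under that assumption, not the parsing procedure of [4].

    # Illustrative only: derive a simple cause-effect record for an action
    # by diffing the spatial state observed before and after it.
    def causal_effect(state_before: dict, state_after: dict) -> dict:
        """Return the fluents whose values changed across the action."""
        return {fluent: (state_before[fluent], state_after[fluent])
                for fluent in state_before
                if fluent in state_after and state_before[fluent] != state_after[fluent]}

    # Example: a tracked "fold" segment changes one fluent.
    before = {"right_sleeve_folded": False, "cloth_on_table": True}
    after  = {"right_sleeve_folded": True,  "cloth_on_table": True}
    print(causal_effect(before, after))   # {'right_sleeve_folded': (False, True)}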
2.2. Text Parsing

To understand the text, we first use the semantic parsing result produced by Tu et al. [6]. We interpret their parsing result and generate an And-Or Graph (AoG) containing spatial and temporal knowledge. Spatial knowledge covers objects detected or undetected in the given video; temporal knowledge covers the actions needed for the task. For example, in the sentence "Fold the right sleeve to the middle of cloth", two regions of the t-shirt, "sleeve" and "middle", represent the spatial knowledge we want to understand, and "fold" is the primary action. We differentiate nodes in the AoG into three categories: abstract concepts, concrete concepts, and examples. Abstract concepts are ideas of objects and actions, such as "sleeve" and "fold". Concrete concepts are abstract concepts with constraints, and examples are instances detected in the video. This taxonomy enables us to link video and text and to learn more general knowledge through the dialogue system, as shown in the sections below.
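A minimal sketch of how this three-level taxonomy could be stored is shown below; the class and field names are our own illustration rather than the paper's data structures, and the "patient"/"locationTo" parameters follow the example sentence above.

    # Illustrative node taxonomy for the text-derived AoG: abstract concepts,
    # concrete (constrained) concepts, and video-grounded examples.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Concept:
        name: str
        level: str                                        # "abstract" | "concrete" | "example"
        constraints: dict = field(default_factory=dict)   # only used by concrete concepts
        parent: Optional["Concept"] = None
        children: List["Concept"] = field(default_factory=list)

        def add_child(self, child: "Concept") -> "Concept":
            child.parent = self
            self.children.append(child)
            return child

    # "fold" as an abstract action, specialized by its arguments, grounded in a clip.
    fold = Concept("fold", "abstract")
    fold_sleeve = fold.add_child(Concept(
        "fold", "concrete",
        constraints={"patient": "right sleeve", "locationTo": "middle of cloth"}))
    fold_sleeve.add_child(Concept("demo_clip_03", "example"))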
2.3. Knowledge Grounding

In order for a robot to understand the meaning of spatial and temporal knowledge rather than merely treat it as text, we ground the knowledge in the real world. Action phrases from the text can be linked to the part of the video to which they belong, and this correspondence lets the causal knowledge detected from the video use the text parsing result and extend the accumulating STC-AoG. Within our spatial knowledge structure, we store the spatial key points of each object and link them to the text graph. To resolve this correspondence, chronological and historical information are essential. We assume that the order of the detected key points remains consistent with the order of object names in the text representation. Moreover, if a point on the t-shirt is frequently detected when the phrase "right sleeve" appears in the text, there is a high probability that the two refer to the same object. We use such information to ground names of objects with real-world samples. An example of the final spatial and temporal parse graph is shown in Figure 1, where the causal knowledge learned from video parsing is hidden for clarity.

Figure 1. Spatial and temporal And-Or Graph representation of the sentence "Fold the right sleeve to the middle of cloth". The concrete action "fold" requires two parameters, "patient" and "locationTo". It is linked to a video example, which in turn is linked to two points on the t-shirt, learned as "right sleeve" and "middle of cloth".
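The co-occurrence idea behind this grounding step can be sketched as follows; the data layout and the frequency-based matching are our own simplification under the stated assumption, not the system's exact procedure.

    # Illustrative grounding by co-occurrence: count how often each tracked
    # key point is detected in segments whose caption mentions a given phrase,
    # then link the phrase to the most frequently co-occurring point.
    from collections import Counter

    def ground_phrase(phrase: str, segments: list):
        counts = Counter()
        for seg in segments:
            if phrase in seg["caption"].lower():
                counts.update(seg["detected_points"])
        if not counts:
            return None                      # phrase never co-occurred with a detection
        point, _ = counts.most_common(1)[0]
        return point

    demos = [
        {"caption": "Fold the right sleeve to the middle of cloth",
         "detected_points": ["point_3", "point_7"]},
        {"caption": "Grip the right sleeve",
         "detected_points": ["point_3"]},
    ]
    print(ground_phrase("right sleeve", demos))   # -> "point_3"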
Agent | Content
Robot | In the step of "Fold the right sleeve to the middle of cloth", how to fold?
Human | First, move to the right sleeve. (Robot: OK.)
Human | Then, grip it. (Robot: Understand.)
Human | Next, move to the middle of the cloth and release your hand.
Robot | All right. Got it.

Table 1. Example human-robot dialogue generated by the system.
2.4. Dialogue System

The dialogue system has three key parts: the question generator, the question cost function, and the natural language parser. Currently, our system may ask two kinds of questions according to its video input: identification confirmation and action specifics. When the robot has low confidence in linking an abstract concept with its concrete example, it is more likely to ask the human for help. When the robot discovers a new but useful action, it tries to obtain a more detailed decomposition of the action for better understanding. Each question is generated from a template and carries a cost determined by its question type, the number of questions asked, and the information expected from the answer. Since human responses are either commands or a few predictable forms such as acknowledgment or negation, we continue to use the parsing framework outlined in the previous section; the difference is that we add a detector for these simple responses and a memory structure to resolve recurring pronouns such as "it" and "there". By incorporating dialogue, the robot not only learns the specific details of an action but also exploits the hierarchical knowledge representation to gain a general understanding of the procedure. The new knowledge learned after the interaction in Table 1 is shown in Figure 2.

Figure 2. General knowledge learned after the dialogue in Table 1. Our system learns a generalizable way to model the knowledge via the parametrization of the "fold" action with "patient" and "locationTo".
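To illustrate the template-plus-cost idea, the sketch below scores candidate questions by expected information gain minus an asking cost; the templates, weights, and scoring rule are invented for illustration, since the paper does not specify its cost function.

    # Illustrative question generation: pick the unresolved item whose expected
    # information gain best justifies the cost of asking the human.
    TEMPLATES = {
        "identification": 'Is "{concept}" the object I marked in the video?',
        "action_specifics": 'In the step of "{step}", how to {action}?',
    }
    COST = {"identification": 1.0, "action_specifics": 3.0}

    def best_question(candidates):
        """candidates: dicts with keys 'type', 'fields', and 'expected_gain'."""
        scored = [(c["expected_gain"] - COST[c["type"]], c) for c in candidates]
        score, choice = max(scored, key=lambda pair: pair[0])
        if score <= 0:
            return None                      # not worth interrupting the human
        return TEMPLATES[choice["type"]].format(**choice["fields"])

    print(best_question([
        {"type": "identification", "expected_gain": 0.5,
         "fields": {"concept": "right sleeve"}},
        {"type": "action_specifics", "expected_gain": 5.0,
         "fields": {"step": "Fold the right sleeve to the middle of cloth",
                    "action": "fold"}},
    ]))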
3. Conclusion

In this paper, we proposed a joint learning system that unifies video and text information in the same representation. By using the hierarchical STC-AoG structure, we are able to link the video and text parsing results and learn more general knowledge from demonstration and dialogue.
References

[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.
[2] L. S. Lopes and A. Teixeira. Human-robot interaction through spoken language dialogue. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2000.
[3] M. Mühlig, M. Gienger, and J. J. Steil. Interactive imitation learning of object movement skills. Autonomous Robots, 32(2):97-114, 2012.
[4] N. Shukla, C. Xiong, and S. C. Zhu. A unified framework for human-robot knowledge transfer. In AAAI Fall Symposium on AI for Human-Robot Interaction, 2015.
[5] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131-157, 2014.
[6] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S. C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42-70, 2014.
[7] C. Xiong, N. Shukla, W. Xiong, and S. C. Zhu. Robot learning with a spatial, temporal and causal and-or graph. In ICRA, 2015.
[8] Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos. Robot learning manipulation action plans by "watching" unconstrained videos from the World Wide Web. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.