Joint Learning from Video and Caption

Tingran Wang
Peking University
[email protected]

Nishant Shukla, Caiming Xiong
University of California, Los Angeles
{nxs,cxiong}@ucla.edu

Abstract

We propose a multimedia knowledge acquisition framework for a robot learning from demonstration (LfD). Unlike previous LfD systems, which rely mainly on video input, our system utilizes both video and text for more tractable and accurate learning. We show that unifying text commands with video demonstrations helps a robot ground hierarchical knowledge, as well as provide natural human-robot interaction.

1. Introduction

Personal robotics focuses on building robots that collaborate with humans at home or in the workplace. As tasks differ between users due to their respective needs and unfamiliar environments, it is unrealistic to build personal robots by merely hand-coding behaviors or invoking a simple pre-trained model. Consequently, learning from demonstration (LfD) [1] or learning through dialogue [2] becomes a more reasonable approach to training a robot. Previous LfD systems yield strong results on tasks including action structure detection [5], pick-and-place [3], and grasping [8]. However, these systems focus mainly on trajectory learning or action pattern finding and lack a hierarchical understanding of the task. Shukla et al. [4] adopt a hierarchical spatial, temporal, and causal knowledge representation, but their structure is partly manually labeled. On the other hand, natural language is full of rich structural information, and is therefore useful for action representation and human-robot dialogue. We follow Xiong et al.'s work [7] and incorporate text information and dialogue into their system. The robot gains incremental knowledge through text commands, which improves the generalizability and accuracy of the learned knowledge. The contributions of this paper include the following:

1. We demonstrate how to learn from joint video and text input and represent the learned spatial, temporal, and causal knowledge in a unified hierarchical structure.

2. We incorporate interactive dialogue into the LfD system for more self-motivated learning.

2. Method

We adopt a slightly different approach to learning from demonstration than previous work. Instead of only a video recording of a human performing tasks, we also allow a caption for each segment. To encapsulate a hierarchical and stochastic understanding of the world, we adopt the Spatial, Temporal, and Causal And-Or Graph (STC-AoG) [6] as our knowledge representation. We first parse video and text separately, then join their parsing results to form a hierarchical STC-AoG. Finally, we incorporate a dialogue system so that the robot can communicate with a human to acquire more specific knowledge according to its needs.
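At a high level, the stages fit together as a simple pipeline. The sketch below is only a minimal illustration of that flow; the function names and the dictionary-based graph are our own placeholders, not the system's actual implementation:

```python
# A minimal, self-contained sketch of the pipeline above. All names and the
# dict-based graph are illustrative placeholders, not the actual STC-AoG code.

def parse_video(video):
    # Sec. 2.1: trajectories, spatial configurations, causal dependencies.
    return {"causal": [], "spatial_examples": []}

def parse_caption(caption):
    # Sec. 2.2: a spatial and temporal AoG fragment for one caption.
    return {"spatial": [], "temporal": [], "caption": caption}

def ground_and_merge(video_graph, text_graphs):
    # Sec. 2.3: ground text nodes in video examples and extend the STC-AoG.
    return {"video": video_graph, "text": text_graphs}

def learn_from_demonstration(video, captions):
    stc_aog = ground_and_merge(parse_video(video),
                               [parse_caption(c) for c in captions])
    # Sec. 2.4: a dialogue step would further refine stc_aog by asking the
    # human about low-confidence groundings and unfamiliar actions.
    return stc_aog
```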

2.1. Video Parsing

To parse the video, we track action trajectories and spatial configurations to define the causal dependencies associated with each action. After watching a few human demonstrations, the robot develops an STC-AoG to represent the knowledge, as per [4].
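As one way to picture the causal part of this step, the sketch below tabulates which state changes co-occur with each detected action; the fluent names and the majority-vote rule are our illustrative assumptions, not the authors' exact procedure:

```python
# A minimal sketch of tabulating per-action state changes from parsed
# demonstrations. Fluent names and the majority-vote rule are assumptions.
from collections import Counter, defaultdict


def causal_effects(observations):
    """observations: iterable of (action, pre_fluents, post_fluents) triples,
    where the fluents are frozensets of symbolic state facts."""
    effects = defaultdict(Counter)
    for action, pre, post in observations:
        for added in post - pre:
            effects[action][("add", added)] += 1
        for removed in pre - post:
            effects[action][("delete", removed)] += 1
    # Keep the most frequent state change as the learned effect of each action.
    return {action: counts.most_common(1)[0][0]
            for action, counts in effects.items()}


obs = [
    ("fold", frozenset({"sleeve_flat"}), frozenset({"sleeve_folded"})),
    ("fold", frozenset({"sleeve_flat", "cloth_on_table"}),
             frozenset({"sleeve_folded", "cloth_on_table"})),
]
print(causal_effects(obs))   # e.g. {'fold': ('add', 'sleeve_folded')}
```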

2.2. Text Parsing

To understand the text, we first use the semantic parsing result produced by Tu et al. [6]. We interpret their parsing result and generate an And-Or Graph (AoG) containing spatial and temporal knowledge. Spatial knowledge covers objects detected or undetected in the given video; temporal knowledge covers the actions needed for the task. For example, in the sentence "Fold the right sleeve to the middle of cloth", two regions of the t-shirt, "sleeve" and "middle", represent the spatial knowledge we want to capture, and "fold" is the primary action. We differentiate nodes in the AoG into three categories: abstract concepts, concrete concepts, and examples. Abstract concepts are ideas of objects and actions, like "sleeve" and "fold". Concrete concepts are abstract concepts with constraints, and examples are instances detected in the video. This taxonomy enables us to link video and text and to learn more general knowledge through the dialogue system, as shown in the sections below.
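To make the three node categories concrete, a minimal data-structure sketch is shown below; the class and attribute names are illustrative placeholders and do not reflect the authors' actual code:

```python
# A minimal data-structure sketch of the three node categories in the
# text-derived AoG. Class and attribute names are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AbstractConcept:
    """An idea of an object or action, e.g. "sleeve" or "fold"."""
    name: str


@dataclass
class ConcreteConcept:
    """An abstract concept plus constraints, e.g. "fold" with a patient
    and a target location."""
    abstract: AbstractConcept
    constraints: Dict[str, str] = field(default_factory=dict)


@dataclass
class Example:
    """An instance detected in the video that realizes a concrete concept."""
    concept: ConcreteConcept
    video_segment: str            # identifier of the demonstration segment
    keypoints: List[Tuple[float, float]] = field(default_factory=list)


# "Fold the right sleeve to the middle of cloth" would yield roughly:
fold = AbstractConcept("fold")
fold_right_sleeve = ConcreteConcept(
    fold, constraints={"patient": "right sleeve",
                       "locationTo": "middle of cloth"})
demo = Example(fold_right_sleeve, video_segment="demo_01_seg_03")
```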

Figure 1. Spatial and temporal And-Or graph representation of the sentence "Fold the right sleeve to the middle of cloth". The concrete action "fold" requires two parameters, "patient" and "locationTo". It is linked to a video example, which is in turn linked to two nodes on the t-shirt, learned as "right sleeve" and "middle of cloth".

2.3. Knowledge Grounding

In order for the robot to understand the meaning of spatial and temporal knowledge rather than merely treat it as text, we ground the knowledge in the real world. Each action phrase from the text can be linked to the part of the video to which it belongs. This correspondence enables the causal knowledge detected from the video to use the text parsing result and to extend the accumulating STC-AoG. Within our spatial knowledge structure, we store the spatial key points of each object and link them to the text graph. To resolve this correspondence, chronological and historical information are essential. We assume that the appearance order of the detected points remains consistent with the order of the object names in the text representation. Moreover, if a point on the t-shirt is frequently detected when the phrase "right sleeve" appears in the text, there is a high probability that the two refer to the same object. We use such information to ground object names with real-world samples. An example of the final spatial and temporal parse graph is shown in Figure 1, where the causal knowledge learned from video parsing is hidden for clarity.
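A minimal sketch of this co-occurrence idea is given below, under our assumption (not stated explicitly in the paper) that each caption segment is already aligned with the key points detected in the corresponding video segment:

```python
# A minimal sketch of co-occurrence-based grounding. The segment alignment
# and the most-frequent-point rule are our illustrative assumptions.
from collections import Counter, defaultdict


def ground_object_names(aligned_segments):
    """aligned_segments: iterable of (object_phrases, detected_point_ids)
    pairs, one pair per demonstration segment."""
    cooccur = defaultdict(Counter)
    for phrases, point_ids in aligned_segments:
        for phrase in phrases:
            for point_id in point_ids:
                cooccur[phrase][point_id] += 1
    # Ground each phrase to the key point it co-occurs with most often.
    return {phrase: counts.most_common(1)[0][0]
            for phrase, counts in cooccur.items()}


segments = [
    (["right sleeve", "middle of cloth"], ["p3", "p7"]),
    (["right sleeve"], ["p3", "p5"]),
    (["left sleeve", "middle of cloth"], ["p1", "p7"]),
]
print(ground_object_names(segments))
# e.g. {'right sleeve': 'p3', 'middle of cloth': 'p7', 'left sleeve': 'p1'}
```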

2.4. Dialogue System

There are three key parts of the dialogue system: the question generator, the question cost function, and the natural language parser. Currently, our system may ask two kinds of questions based on its video input: identification confirmation and action specifics. When the robot has low confidence in linking an abstract concept with its concrete example, it is more likely to ask the human for help. When the robot discovers a new but useful action, it tries to obtain a more detailed decomposition of that action for better understanding. Each question is generated from a template and carries a cost determined by its question type, question amount, and the information expected from the answer. Since human responses are either commands or take a few predictable forms such as acknowledgement or negation, we continue to use the parsing framework outlined in the previous section; the only difference is that we add a detector for these simple responses and a memory structure to resolve recurring pronouns such as "it" and "there". By incorporating dialogue, the robot not only learns specific details of an action but also exploits the hierarchical knowledge representation to obtain a general understanding of the procedure. The new knowledge learned after the interaction through the dialogue shown in Table 1 is given in Figure 2.

Agent | Content
Robot | In the step of "Fold the right sleeve to the middle of cloth", how to fold?
Human | First, move to the right sleeve. (Robot: OK.)
Human | Then, grip it. (Robot: Understand.)
Human | Next, move to the middle of the cloth and release your hand.
Robot | All right. Get it.

Table 1. Example human-robot dialogue generated by the system.

Figure 2. General knowledge learned after the dialogue in Table 1. Our system learns a generalizable model of the acquired knowledge from the parametrization of the "fold" action with "patient" and "locationTo".
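As an illustration of the question cost function described in Section 2.4, the sketch below ranks candidate questions by expected information per unit cost; the weights and the scoring rule are our own assumptions, since the paper only states that the cost depends on question type, question amount, and the information expected from the answer:

```python
# A minimal sketch of cost-based question selection. The cost weights and the
# gain-per-cost scoring rule are assumptions, not the authors' actual values.
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    kind: str                   # "identification" or "action_specifics"
    text: str
    expected_info_gain: float   # how much the answer is expected to clarify


TYPE_COST = {"identification": 1.0, "action_specifics": 2.0}   # assumed weights


def select_questions(candidates: List[Question], budget: int = 3) -> List[Question]:
    """Ask the most informative questions per unit cost, up to a budget."""
    ranked = sorted(candidates,
                    key=lambda q: q.expected_info_gain / TYPE_COST[q.kind],
                    reverse=True)
    return ranked[:budget]


candidates = [
    Question("identification", "Is this key point the right sleeve?", 0.4),
    Question("action_specifics",
             'In the step of "Fold the right sleeve to the middle of cloth", '
             'how to fold?', 1.5),
]
for q in select_questions(candidates):
    print(q.text)
```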


3. Conclusion

In this paper, we proposed a joint learning system that utilizes video and text information in the same representation. By using the hierarchical STC-AoG structure, we are able to link the video and text parsing results and learn more general knowledge from demonstration and dialogue.

References

[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.
[2] L. S. Lopes and A. Teixeira. Human-robot interaction through spoken language dialogue. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2000.
[3] M. Mühlig, M. Gienger, and J. J. Steil. Interactive imitation learning of object movement skills. Autonomous Robots, 32(2):97-114, 2012.
[4] N. Shukla, C. Xiong, and S. C. Zhu. A unified framework for human-robot knowledge transfer. In AAAI Fall Symposium on AI for Human-Robot Interaction, 2015.
[5] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131-157, 2014.
[6] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S. C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42-70, 2014.
[7] C. Xiong, N. Shukla, W. Xiong, and S. C. Zhu. Robot learning with a spatial, temporal and causal and-or graph. In ICRA, 2015.
[8] Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos. Robot learning manipulation action plans by unconstrained videos from the world wide web. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
