Explainable Image Understanding using Vision and Reasoning Somak Aditya Department of Computer Science, Arizona State University, Tempe, USA

AAAI-17 Doctoral Consortium

Image Understanding Through Text

General Architecture

What is Understanding? (do you understand)
● Ask students questions about a subject; if a student can answer them, he or she "understands" it.
● UNDERSTANDING here is equivalent to Question-Answering.

Quality of Understanding (how much do you understand)
● Increase the difficulty of the questions.
● According to Bloom's Taxonomy [4], the levels are:

Image Understanding through text:
● Gained huge popularity recently.
● Two primary tasks have been designed:
○ Caption Generation
○ Visual Question Answering

○ Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation

Architecture Should:
● Explicitly model connections between Vision, Reasoning, and Knowledge. (Modular or not, reasoning and knowledge have to be modeled/learnt by any Image Understanding system; VQA and Captioning models learn some such knowledge implicitly, but not explicitly.)

Explainability

i. (Knowledge) List the objects in the image.
ii. (Comprehension) What will the man do next?
iii. (Application) How to cut tofu?
iv. (Analysis) Why is the man holding the bowl with his other hand?
v. (Synthesis) Can you propose how else to cut tofu?
vi. (Evaluation) Is there a better way to cut tofu?

Got the Results:
● Why did you do that?
● Why not something else?
● When do you succeed?
● When do you fail?
● When can I trust you?
● How do I correct an error?

How Do You Explain:
● (Customer) Natural language, simple.
● (Manager) Structured, detailed.

Difficulty in Current Architectures

Difficulties of Explainability in End-to-End Learning:
● What to fix? (module/parameter/function)
● Some work exists on understanding learnt models [3].
● Can we explain in natural-language space or symbolic space?
● Are structured explanations possible?
● Largely unexplored.

What Can We Do?
● [Impose a Structure] Use knowledge and probabilistic reasoning to replace the final layers. (Current work)
● [Explanation Interface] Use knowledge and probabilistic reasoning to explain. (Planned)

DeepIU: An Architecture

What Do We Try?
[Examples] Visual Commonsense for Scene Understanding using Perception, Semantic Parsing and Reasoning
[Image Captioning] Image Understanding Through Scene Description Graphs
[Architecture] DeepIU: An Architecture for Image Understanding
[Puzzles/New Challenge] Image Riddles using Vision and Reasoning

An Example: The Implementation Used in SDG. Another Example.

Type of Knowledge Used:
• A Knowledge Graph: combined semantic parses of sentences.
• Bayesian Network: dependencies among objects and scene constituents.

How Did We Store the Knowledge:
• Graph on the file-system with a self-built query engine.
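To make the file-system knowledge store concrete, here is a minimal sketch of a triple store with a naive query engine. The ⟨subject, relation, object⟩ schema, the JSON file format, and the class name are illustrative assumptions, not the poster's actual engine.

```python
# Minimal file-backed triple store with a naive wildcard query engine.
import json

class TripleStore:
    def __init__(self, path=None):
        self.triples = []   # list of (subject, relation, object)
        self.path = path

    def add(self, s, r, o):
        self.triples.append((s, r, o))

    def save(self):
        # Persist the graph to the file-system as JSON.
        with open(self.path, "w") as f:
            json.dump(self.triples, f)

    def query(self, s=None, r=None, o=None):
        # None acts as a wildcard: query(r="is_a") lists all type facts.
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (r is None or t[1] == r)
                and (o is None or t[2] == o)]

kb = TripleStore()
kb.add("knife", "is_a", "artifact")
kb.add("bowl", "is_a", "artifact")
kb.add("tofu", "is_a", "food")
artifacts = kb.query(r="is_a", o="artifact")
# artifacts -> [('knife', 'is_a', 'artifact'), ('bowl', 'is_a', 'artifact')]
```

A linear scan like this is enough for a poster-scale knowledge graph; a real engine would index by subject and relation.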

Reasoning Module Used:
• Probabilistic reasoning using a Bayesian network, and IF-THEN reasoning using the constructed knowledge base.
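A toy version of the Bayesian-network reasoning over objects and scene constituents can be sketched as follows. The network shape (scene → knife, scene → bowl) and all CPT numbers are illustrative, not learned parameters from the system.

```python
# Toy Bayesian network: one scene variable with two object-detection
# children. Posterior over the scene is computed by enumeration.
scenes  = {"kitchen": 0.3, "office": 0.7}    # prior P(scene)
p_knife = {"kitchen": 0.8, "office": 0.05}   # P(knife present | scene)
p_bowl  = {"kitchen": 0.7, "office": 0.10}   # P(bowl present | scene)

def posterior(knife, bowl):
    # Joint P(scene) * P(knife|scene) * P(bowl|scene), then normalize.
    joint = {}
    for s, prior in scenes.items():
        pk = p_knife[s] if knife else 1 - p_knife[s]
        pb = p_bowl[s] if bowl else 1 - p_bowl[s]
        joint[s] = prior * pk * pb
    z = sum(joint.values())
    return {s: v / z for s, v in joint.items()}

post = posterior(knife=True, bowl=True)
# Seeing both a knife and a bowl makes "kitchen" the dominant scene.
```

This mirrors how detected objects can raise the belief in scene constituents that were never directly detected.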

Components Used

Application: Image Riddles

Toy Example
What is the Connecting Word? Answer: Fall.
Why? i) The first image depicts the season Fall, ii) the second one has a waterfall, iii) the third one has rainfall, and iv) a statue is "fall"ing.

The Tofu example:
● A snapshot of a cooking video.
● We detect triplets of the form ⟨subject, verb, object⟩.
● Here, we detect ⟨knife, cutting, bowl⟩.
● Downstream inference: "Knife is cutting bowl" (?)
○ Not possible!
○ Humans use commonsense.
○ The knife is cutting something inside the bowl, i.e., the tofu.

Motivations:

GUR+All is our method. Higher bars on the right mean more correctly solved puzzles and more intelligent (less gibberish) answers.

● Visual Question Answering: the task requires an explicit model of commonsense reasoning. However, a large percentage of the dataset concentrates on "what", "where", and "how many" questions.
● VQA: constrained set of answers.
● Image Riddles: target answers in train and test are mostly unique (zero-shot).
● Lastly, puzzles are fun.

● We Used:
○ Commonsense Knowledge: "artifact cutting artifact" is abnormal.
○ Ontological Knowledge, from the semantic parser (KParser): knife and bowl are artifacts.
○ Reasoning, using Answer Set Programming: easy to conclude that "Knife is cutting bowl" is not true.
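The system performs this check with Answer Set Programming; purely for illustration, the single rule "artifact cutting artifact is abnormal" can be simulated in a few lines of Python over a tiny fact base (the fact names and helper are assumptions of this sketch, not the system's actual rules).

```python
# Python simulation of the commonsense check from the tofu example.
# Facts combine ontological knowledge (from KParser) with the detected
# visual triplet <knife, cutting, bowl>.
facts = {
    ("artifact", "knife"),          # ontological: knife is an artifact
    ("artifact", "bowl"),           # ontological: bowl is an artifact
    ("cutting", ("knife", "bowl")), # visual triplet detection
}

def abnormal(subj, obj):
    # Rule: cutting(X, Y) is abnormal if both X and Y are artifacts.
    return (("cutting", (subj, obj)) in facts
            and ("artifact", subj) in facts
            and ("artifact", obj) in facts)

# "Knife is cutting bowl" is flagged as abnormal, prompting the system
# to look for a plausible patient, e.g. the tofu inside the bowl.
flagged = abnormal("knife", "bowl")   # True
```

In the real system the same rule lives in the ASP program, where a solver derives the abnormality instead of this hand-written check.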

Image Riddles

Interpretable Intermediate Structure
Undirected Labeled Graph:
● Nodes are verbs (actions) and nouns (objects, regions, attributes).
● Verbs are connected to concrete nouns with semantic roles.
● Inferred aspects are connected to a dummy node SCENE.
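The graph described above can be sketched as a small data structure. The node kinds and edge labels follow the description; the specific roles ("agent", "patient") and the example content are assumptions for illustration.

```python
# Sketch of a Scene Description Graph as an undirected labeled graph.
from dataclasses import dataclass, field

@dataclass
class SDG:
    nodes: dict = field(default_factory=dict)  # name -> node kind
    edges: list = field(default_factory=list)  # (node, node, label)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, a, b, label):
        self.edges.append((a, b, label))

sdg = SDG()
sdg.add_node("cut", "verb")
sdg.add_node("man", "noun")
sdg.add_node("tofu", "noun")
sdg.add_node("kitchen", "noun")
sdg.add_node("SCENE", "dummy")               # inferred aspects attach here
sdg.add_edge("cut", "man", "agent")          # verb to concrete noun, with role
sdg.add_edge("cut", "tofu", "patient")
sdg.add_edge("SCENE", "kitchen", "setting")  # inferred scene aspect
```

Because the graph is explicit, downstream utilities (question answering, sentence generation) can traverse it rather than probing an opaque network.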

Visual Detections Used:
• CNN-ILSVRC-trained object recognition [Girshick et al.]
• Scene classification [Zhou et al.]
• Scene constituents: pre-trained CNN (VGG-16) + multi-class SVM

Visual Detections Used: Residual Network (ResNet-200), Clarifai API (commercial, fine-tuned)
Reasoning Module Used: Probabilistic Soft Logic, which uses first-order-logic syntax to define Markov Random Field potentials
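How PSL turns a first-order rule into an MRF potential can be shown with a few lines of arithmetic. PSL relaxes Boolean logic to soft truth values in [0, 1] using Łukasiewicz operators; the example rule and the truth values below are illustrative, not taken from the system.

```python
# Lukasiewicz relaxation used by Probabilistic Soft Logic.
def lukasiewicz_and(a, b):
    # Soft conjunction: max(0, a + b - 1).
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body, head):
    # Hinge-loss potential for rule "body -> head":
    # zero when the rule is satisfied, positive otherwise.
    return max(0.0, body - head)

# Example rule: detected(knife) AND detected(bowl) -> scene(kitchen)
body = lukasiewicz_and(0.9, 0.8)                 # soft detection confidences
potential = distance_to_satisfaction(body, 0.4)  # current scene belief
# body ~= 0.7, so potential ~= max(0, 0.7 - 0.4) = 0.3. MAP inference
# lowers this penalty by raising the truth value of scene(kitchen).
```

Summing such hinge potentials over all ground rules yields the (log-linear) MRF energy that PSL minimizes.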

Utilities: - Question-Answering, Sentence Generation.

Type of Knowledge Used:
● ConceptNet: semi-curated, large vocabulary.
How Did We Create/Store the Knowledge:
● Publicly available APIs

Components Used

https://imageriddle.wordpress.com/imageriddle/

Summary
Drawbacks:
• Explainability: how did you get the result?
• Explainability: do you know what to fix?
• Lack of an interpretable intermediate structure

Scene Description Graph

• (KR/SRL Paradigms) Connecting large-scale knowledge bases and reasoning efficiently
• ASP: not probabilistic.
• Probabilistic Soft Logic: not well documented; inflexible in incorporating phrase semantics

Solutions:
• Modular architecture (performing comparably to a previous deep-learning method)
• Defined SDG (can perform reasoning; can be used in QA and sentence generation)
• Modified and extended Probabilistic Logic semantics; will be made available for use
• Constraint-addition capabilities
• Predicates: not just symbols, but phrases and words (similarity calculated using embeddings)
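The last bullet, matching phrase predicates by embedding similarity, can be sketched with cosine similarity. The 3-d vectors below are made up for illustration; a real system would use pretrained word or phrase embeddings.

```python
# Soft predicate matching via embedding cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

emb = {  # toy 3-d "embeddings", purely illustrative
    "waterfall": [0.9, 0.1, 0.2],
    "fall":      [0.8, 0.2, 0.1],
    "statue":    [0.1, 0.9, 0.7],
}

# "fall" is close to "waterfall" but far from "statue", so the riddle
# predicates can be softly unified instead of requiring exact symbols.
sim_close = cosine(emb["fall"], emb["waterfall"])
sim_far   = cosine(emb["fall"], emb["statue"])
```

Replacing exact symbol matching with a similarity threshold is what lets a rule written over "fall" fire on a detection labeled "waterfall".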

Future Work
• Extending the current Probabilistic Soft Logic.
• Visual Question Answering.
• Using unsupervised semantic parsing to create graphs from scenes.
• Extending the ideas to robotic vision.

References Check out: http://bit.ly/1MMN1wZ

[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, vol. 14, pp. 77-81.
[2] Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., and Darrell, T. 2016. Generating Visual Explanations. In ECCV, pp. 3-19. Springer International Publishing.
[3] Agrawal, A., Batra, D., and Parikh, D. 2016. Analyzing the Behavior of Visual Question Answering Models. arXiv preprint arXiv:1606.07356.
[4] Bloom, B. S. 1956. Taxonomy of Educational Objectives. Vol. 1: Cognitive Domain. New York: McKay, pp. 20-24.
