MINING VISUALNESS

Zheng Xu1*, Xin-Jing Wang2, Chang Wen Chen3

1 University of Science and Technology of China, Hefei, 230027, P.R. China
2 Microsoft Research Asia, Beijing, 100080, P.R. China
3 State University of New York at Buffalo, Buffalo, NY, 14260, U.S.A.
[email protected], [email protected], [email protected]

ABSTRACT

Understanding which concepts are visualizable, and to what extent they can be visualized, is worthwhile for multimedia and computer vision research. Unfortunately, few previous works have touched on these topics. In this paper, we propose a unified model to automatically identify visual concepts and estimate their visual characteristics, or visualness, from a large-scale image dataset. To this end, an image heterogeneous graph is first built to integrate various visual features, and then a simultaneous ranking and clustering algorithm is introduced to generate visually and semantically compact image clusters, named visualsets. Based on the visualsets, visualizable concepts are discovered and their visualness scores are estimated. The experimental results demonstrate the effectiveness of the proposed scheme.

Index Terms: visualness, visualsets, image heterogeneous graph, ranking, clustering

1. INTRODUCTION

Despite decades of successful research on multimedia and computer vision, the semantic gap between low-level visual features and high-level semantic concepts remains a problem. Instead of generating more powerful features or learning more intelligent models, researchers have started to investigate which concepts can be more easily modeled by existing visual features [1, 2, 3]. Understanding to what extent a concept has visual characteristics, i.e. its "visualness", is valuable in many ways. For instance, it can benefit recent research efforts on constructing image databases [4, 5]. These efforts generally attempt to attach images to a pre-defined lexical ontology, yet existing ontologies were built without taking visual characteristics into consideration. Knowing which concepts are more likely to have relevant images will help save labor and control noise in database construction. Visualness estimation is also useful for image-to-text [2, 6] and text-to-image [7] translation; e.g., words denoting more visualizable concepts are potentially better annotations for an image.

* This work was performed at Microsoft Research Asia.

Fig. 1. Examples of simple concepts (left) and compound concepts (right) with estimated visualness: Eagle 1.90, Pink 0.75 (simple); Cordless phone 5.01, Egypt Sphinx 4.06, Offshore oilfield 4.00 (compound).

Despite its usefulness, a general solution for visualness estimation faces many challenges: 1) It is unknown which concepts, or which types of concepts, are visualizable, i.e. whether representative images can be found to visualize their semantics. For instance, "dignity" and "fragrant" are both abstract words, but the former is more difficult to visualize, as "fragrant" is closely related to visual concepts such as flowers and fruits. 2) Different visual concepts have diverse visual compactness and consistency, especially collective nouns (e.g., "animal") and ambiguous concepts (e.g., "apple", which may denote a kind of fruit or a company). 3) Even if a concept is highly visualizable, it may still be difficult to capture its visual characteristics due to the semantic gap, e.g., "tiny teacup chihuahua".

Few previous works in the literature have touched on this research topic. To the best of our knowledge, Yanai et al. [1] were the first to propose the visualness of a concept; they estimate the visualness of adjectives with image region entropy. Zhu et al. [7] measure the "picturability" of a keyword by the ratio of the number of images to the number of webpages retrieved by commercial search engines. Lu et al. [2] attempt to identify a concept lexicon with small semantic gaps by clustering images based on visual and textual features and then extracting the most frequent keywords occurring in the image clusters. Berg et al. [8] utilize SVM classifiers to measure the visualness of attributes.


Fig. 2. The proposed framework of visualness mining: a) build an image heterogeneous graph from web images with multi-type features; b) mine visualsets from the image heterogeneous graph by iteratively performing i) ranking nodes, ii) posterior estimation, and iii) cluster refinement; c) estimate visualness with the visualsets.

Sun et al. [9] investigate whether a tag well represents its annotated images in the social tagging scenario, by leveraging the visual compactness of the annotated images and their visual distinctness from the entire image database. Jeong et al. [3] quantify the visualness of complex concepts of the form "adjective (attribute) + noun (concept)" by integrating intra-cluster purity, inter-cluster purity, and the entropy of images retrieved from commercial search engines.

There are two major disadvantages of previous works: 1) the concept lists are pre-defined, and 2) concepts are simply assumed to be prototypical. Two notable works are reported in [2] and [3]. Lu et al. [2] mine concepts with small semantic gaps automatically from an image dataset, but assume each concept is prototypical. In contrast, Jeong et al. [3] focus on "complex concepts" that imply reliable prototypical concepts, but their concept vocabulary is pre-defined. In fact, many concepts are semantically ambiguous (e.g., "apple" as a fruit or a company) or have multiple senses (e.g., "snake" as a noun meaning the reptile or a verb suggesting a snake-like pattern).

In this paper, we attempt to discover and quantify the visualness of concepts automatically from a large-scale dataset. The quantitative measure of a concept is based on visual and semantic synsets (which we call "visualsets"), rather than on a single image cluster or keyword as in previous works. Visualsets disambiguate the semantics of a concept and ensure visual compactness and consistency; this is inspired by the synsets in ImageNet [4] and Visual Synsets [6]. In our approach, a visualset is a group of visually similar images (which we call "member images") and related words, both scored by their membership probabilities. Visualsets contain prototypical visual cues as well as prototypical semantic concepts. Given the visualsets, the visualness of a concept is modeled as a mixture distribution over its corresponding visualsets. Moreover, we discover both simple concepts (keywords) and compound concepts (combinations of unique keywords) simultaneously from the generated visualsets; see Fig. 1.

2. VISUALNESS MINING

In this section, we describe our approach to quantifying the visualness of concepts.

2.1. The Framework

Our approach contains three steps: 1) build an image heterogeneous graph with attribute nodes generated from multi-type features; 2) mine visualsets from the heterogeneous graph with an iterative ranking-clustering algorithm; and 3) estimate the visualness of concepts with the visualsets. Fig. 2 illustrates the entire process.

Given a (noisily) tagged image dataset such as a web image collection, we connect the images into a graph to facilitate the clustering approach for visualsets mining, as shown in Fig. 2(a). Specifically, we extract multiple types of visual features and textual features from the images to generate attribute nodes (see Fig. 2(a1)). The edges of the graph are defined by links between images and attribute nodes (see Fig. 2(a2)) instead of by image similarities, which are generally adopted in previous works [2]. That is, images are implicitly connected if they share some attribute nodes. Then, an iterative ranking-clustering approach is applied to form visual and textual synsets, i.e. visualsets, as shown in Fig. 2(b). In each iteration, we start with a guess of the image clusters (see Fig. 2(b1)). Based on this guess, we score and rank each image as well as the attribute nodes in each visualset (Fig. 2(b2)). Images are then mapped into the feature space defined by the visualset mixture model (Fig. 2(b3)). Clusters are refined based on the estimated posteriors, which gives the guess of the image clusters for the next iteration. After the ranking-clustering approach converges, we estimate the visualness of (simple and compound) concepts from the visualsets based on the final scores of images and attribute nodes, as shown in Fig. 2(c).

2.2. Building An Image Heterogeneous Graph

Given images $\mathcal{X} = \{x_1, x_2, \ldots, x_{|\mathcal{X}|}\}$, the visual features GIST, color histogram, and SIFT are extracted for each image.

Fig. 3. Examples of image clusters generated by different visual features: (a) color histogram clusters; (b) GIST clusters. They suggest how different visual features favor distinct visual properties: color histogram captures intensity and contrast, while GIST prefers structure and edge.

Algorithm 1 The Visualsets Mining Algorithm
Input: $G = \langle V, \Omega \rangle$
Output: $\mathcal{C}$
Initialization: get $\{C_k^{(0)}\}_{k=1}^{K}$ by randomly partitioning the images
Iteration: for each $t \in [1, N]$ do
    Ranking: compute $P(S|C_k^{(t)})$, $P(A|C_k^{(t)})$ by Eq. 2, Eq. 3; compute $P(\mathcal{X}|C_k^{(t)})$ by Eq. 4
    Clustering: estimate $p(C_k^{(t)})$ and $p(C_k^{(t)}|x_i)$ by Eq. 7; refine $C_k^{(t)}$ by Section 2.3.3
end for
return $\mathcal{C} = \{C_k^{(N)} \,|\, k = 1, \ldots, K\}$

Fig. 4. A heterogeneous sub-graph centered on one image. Each image has only one set-based attribute node for each type of global feature (h, g), and multiple word-based attribute nodes from local features or textual features (w, t).

We exploit multiple types of visual features as they capture different perspectives of visual properties and compensate for each other (Fig. 3). The proposed visualness mining method is independent of the visual features selected. We apply k-means clustering to the three types of visual features respectively and treat the cluster centers as attribute nodes. We represent each image with one attribute node from GIST and one from the color histogram (the so-called set-based attribute nodes), and with a bag of words over SIFT and texts respectively (the so-called word-based attribute nodes). Fig. 4 illustrates the star structure centered on an image.

Denote the attribute nodes from GIST, color histogram, SIFT, and texts as $\mathcal{G} = \{g_1, g_2, \ldots, g_{|\mathcal{G}|}\}$, $\mathcal{H} = \{h_1, h_2, \ldots, h_{|\mathcal{H}|}\}$, $\mathcal{W} = \{w_1, w_2, \ldots, w_{|\mathcal{W}|}\}$, and $\mathcal{T} = \{t_1, t_2, \ldots, t_{|\mathcal{T}|}\}$ respectively. The set of attribute nodes is $\mathcal{A} = \mathcal{G} \cup \mathcal{H} \cup \mathcal{W} \cup \mathcal{T}$. We define the image heterogeneous graph $G = \langle V, \Omega \rangle$ as a weighted undirected graph, where $V = \mathcal{X} \cup \mathcal{A}$ are the nodes and $\Omega = \{\omega_{x_i n_j} \,|\, x_i \in \mathcal{X}, n_j \in V\}$ defines the edges. We have

\[
\omega_{x_i n_j} =
\begin{cases}
1, & n_j \in \mathcal{G} \cup \mathcal{H},\; x_i \sim n_j \\
c, & n_j \in \mathcal{W} \cup \mathcal{T},\; \mathrm{freq}(x_i, n_j) = c \\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]

where $x_i \sim n_j$ denotes a link between $x_i$ and $n_j$, and $\mathrm{freq}(x_i, n_j)$ represents the frequency with which (visual) word $n_j$ occurs in image $x_i$.
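To make Eq. (1) concrete, the following is a minimal sketch (ours, not the authors' code) of how the edge weights $\omega$ could be assembled from precomputed attribute-node assignments; the input formats and the function name build_edge_weights are illustrative assumptions.

```python
from collections import Counter

def build_edge_weights(set_nodes, word_bags):
    """Edge weights of the image heterogeneous graph, following Eq. (1).

    set_nodes: dict image_id -> list of set-based node ids (the image's GIST
               and color-histogram cluster centers), each linked with weight 1.
    word_bags: dict image_id -> list of word-based node ids (SIFT visual
               words and tags), weighted by their frequency in the image.
    Returns a dict (image_id, node_id) -> weight; absent pairs have weight 0.
    """
    omega = {}
    for x, nodes in set_nodes.items():
        for n in nodes:                      # n_j in G ∪ H and x_i ~ n_j
            omega[(x, n)] = 1.0
    for x, words in word_bags.items():
        for n, c in Counter(words).items():  # n_j in W ∪ T, freq(x_i, n_j) = c
            omega[(x, n)] = float(c)
    return omega

# Toy usage with hypothetical node ids:
omega = build_edge_weights(
    set_nodes={"img1": ["g17", "h3"]},
    word_bags={"img1": ["w5", "w5", "t_flower"]},
)
print(omega)
```

In this representation, images are never linked to each other directly; they become implicitly connected whenever they share an attribute node, exactly as described above.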

2.3. Mining Visualsets

Denote the set of $K$ visualsets as $\mathcal{C} = \{C_k \,|\, k = 1, \ldots, K\}$. Topologically, $C_k \subset G$ is a subgraph defined on the member images of $C_k$ and the related attribute nodes. Each node of $C_k$ is associated with a score indicating its importance to $C_k$. The images in $C_k$ are assumed to be consistent in visual appearance, while the texts in $C_k$ form a synset with disambiguated semantics; together they constitute a visually and semantically compact and disambiguated visualset. $\mathcal{C}$ defines a generative mixture model for images on which visualness can be measured.

We adopt the ranking-based clustering framework [10] to generate $\mathcal{C}$. The framework provides a method that can simultaneously partition a heterogeneous graph and weight the nodes in each cluster, and hence is suitable for visualsets mining. Meanwhile, addressing the authority of nodes by ranking during image clustering helps handle noisy web data. We choose a representative image instead of the mean as the center in cluster refinement to further facilitate the process. Algorithm 1 outlines our method.

We initialize the algorithm with a random partition of $\mathcal{X}$ and assign a uniform score to each image $x_i \in C_k^{(0)}$. We propagate the scores via the edges $\Omega$ of the graph $G$ to rank the attribute nodes $n_j \in \mathcal{A}$, which in turn updates the scores of the $x_i$ in each visualset $C_k$. This is the ranking step. Images are then mapped into the measurement space defined by the mixture model of visualsets, and the partition of $\mathcal{X}$ is refined according to the posteriors. This is the clustering step. The process iterates until convergence, which gives the final visualsets $\mathcal{C}$. We detail the iteration process below.

2.3.1. Ranking nodes

Denote $\mathcal{X}_{C_k}$ as the member images of $C_k$ and $P(\mathcal{X}_{C_k})$ as the authority of the member images in $C_k$.

Denote the set-based attribute nodes as $S \in \{\mathcal{G}, \mathcal{H}\}$ and the word-based attribute nodes as $A \in \{\mathcal{W}, \mathcal{T}\}$. Let $P(S|C_k)$ and $P(A|C_k)$ be the authority of the elements of $S$ and $A$ in $C_k$. Let $W_{S C_k}$ and $W_{A C_k}$ be the weight matrices of the edges between the attribute nodes $S$, $A$ and the member images $\mathcal{X}_{C_k}$ respectively. We have $W_{S C_k} = W_{C_k S}^{T}$ and $W_{A C_k} = W_{C_k A}^{T}$.

Assuming that the different types of attribute nodes are independent, we first compute the authority scores of each type of attribute nodes respectively. We adopt a HITS-like [11] iterative algorithm on the bipartite graph between each type of word-based attribute nodes $A$ and $\mathcal{X}_{C_k}$ to get $P(A|C_k)$ as in Eq. 2. We introduce a smoothing term $\lambda_A$ in $P(A|C_k)$ to tolerate errors in the clustering used to generate the attribute nodes.

\[
\begin{cases}
P(\mathcal{X}_{C_k}) = \dfrac{W_{C_k A} \cdot P(A|C_k)}{\lVert W_{C_k A} \cdot P(A|C_k)\rVert_1} \\[2ex]
P(A|C_k) = (1-\lambda_A)\,\dfrac{W_{A C_k} \cdot P(\mathcal{X}_{C_k})}{\lVert W_{A C_k} \cdot P(\mathcal{X}_{C_k})\rVert_1} + \lambda_A \dfrac{1}{|A|}
\end{cases}
\tag{2}
\]

On the other hand, the authority $P(S|C_k)$ of each type of set-based attribute nodes $S$ is estimated with the help of the word-based attribute nodes $A$, bridged by $\mathcal{X}_{C_k}$. We involve $A$ to rank $S$ because word-based attribute nodes are either semantic keywords or visual words that indicate semantic and visual similarity. The scoring function is given in Eq. 3 below.

\[
\begin{cases}
P(A|C_k) = \dfrac{W_{A C_k} \cdot W_{C_k S} \cdot P(S|C_k)}{\lVert W_{A C_k} \cdot W_{C_k S} \cdot P(S|C_k)\rVert_1} \\[2ex]
P(S|C_k) = (1-\lambda_S)\,\dfrac{W_{S C_k} \cdot W_{C_k A} \cdot P(A|C_k)}{\lVert W_{S C_k} \cdot W_{C_k A} \cdot P(A|C_k)\rVert_1} + \lambda_S \dfrac{1}{|S|}
\end{cases}
\tag{3}
\]

We then estimate the authority of each image $x_i \in \mathcal{X}$ in $C_k$, i.e. $p(x_i|C_k)$. Since the types of attribute nodes $Q \in \{\mathcal{G}, \mathcal{H}, \mathcal{W}, \mathcal{T}\}$ are assumed independent, we have

\[
p(x_i|C_k) = \frac{1}{Z} \prod_{Q} p(x_i|Q, C_k)
\tag{4}
\]

where

\[
p(x_i|Q, C_k) = \frac{\sum_{n_j \in Q} \omega_{x_i n_j}\, p(n_j|C_k)}{\sum_{n_j \in Q} \omega_{x_i n_j}}
\tag{5}
\]

and $Z = \sum_{x_i \in \mathcal{X}} \prod_{Q} p(x_i|Q, C_k)$ is the normalization factor.

2.3.2. Posterior estimation

We then estimate the posterior $p(C_k|x_i)$, $C_k \in \mathcal{C}$, $x_i \in \mathcal{X}$, based on which the image clusters are refined. We use the EM algorithm to maximize the likelihood (Eq. 6) of generating the images from the mixture model built on the visualsets:

\[
\log L = \sum_{i=1}^{|\mathcal{X}|} \log p(x_i) = \sum_{i=1}^{|\mathcal{X}|} \log\Big( \sum_{k=1}^{K} p(x_i|C_k)\, p(C_k) \Big)
\tag{6}
\]

Introducing $p(C_k)$ as hidden variables, $p(C_k|x_i)$ and $p(C_k)$ are estimated iteratively as

\[
\begin{cases}
p(C_k|x_i) = \dfrac{p(x_i|C_k)\, p(C_k)}{\sum_{k=1}^{K} p(x_i|C_k)\, p(C_k)} \\[2ex]
p(C_k) = \dfrac{1}{|\mathcal{X}|} \displaystyle\sum_{i=1}^{|\mathcal{X}|} p(C_k|x_i)
\end{cases}
\tag{7}
\]

2.3.3. Cluster refinement

After posterior estimation, each image $x_i$ is represented as a $K$-dimensional vector $p(x_i) = \big(p(C_1|x_i), p(C_2|x_i), \ldots, p(C_K|x_i)\big)$. Cluster refinement reassigns $x_i$ to the nearest visualset as a member image. The distance of $x_i$ to visualset $C_k$ is measured against the cluster center $x_{C_k}$ by

\[
\mathrm{dist}\big(p(x_i), p(x_{C_k})\big) = 1 - \frac{p(x_i)^{T} \cdot p(x_{C_k})}{\lVert p(x_i)\rVert_2 \cdot \lVert p(x_{C_k})\rVert_2}
\tag{8}
\]

where $x_{C_k}$ is the most representative image of $C_k$, chosen according to the posterior by

\[
x_{C_k} = \arg\max_{x \in \mathcal{X}_{C_k}} p(C_k|x)
\tag{9}
\]

We then have $x_i \in C_k$ if $\forall k' \in \{1, 2, \ldots, K\}$, $k' \neq k$: $\mathrm{dist}\big(p(x_i), p(x_{C_k})\big) < \mathrm{dist}\big(p(x_i), p(x_{C_{k'}})\big)$.

2.4. Visualness estimation

After the visualsets converge, we compute the final scores of the textual words $\mathcal{T}$ by

\[
P(\mathcal{T}|C_k) = \frac{W_{\mathcal{T}\mathcal{X}} \cdot P(\mathcal{X}|C_k)}{\lVert W_{\mathcal{T}\mathcal{X}} \cdot P(\mathcal{X}|C_k)\rVert_1}
\tag{10}
\]

$P(\mathcal{T}|C_k)$ measures the importance of words in a semantically and visually compact visualset. We then discover concepts and estimate their visualness in the following way. We form a concept as $t = \{t_l\}_{l=1}^{L}$, $t_l \in \mathcal{T}$; $L = 1$ indicates a simple concept, whereas $L > 1$ identifies a compound concept. The associated score of concept $t$ with visualset $C_k$ is $p(t|C_k) = \min_{t_l \in t} p(t_l|C_k)$. The visualness $V(t)$ of concept $t$ is measured on highly relevant visualsets, i.e. those with $p(t|C_k)$ larger than a threshold $\tau$. Define the indicator function as

\[
I(t, C_k) =
\begin{cases}
1, & p(t|C_k) > \tau \\
0, & \text{otherwise}
\end{cases}
\tag{11}
\]

The visualness score $V(t)$ of concept $t$ is a combination of the associated scores over the relevant visualsets:

\[
V(t) = \frac{\sum_{k=1}^{K} p(t|C_k)\, p(C_k)\, I(t, C_k)}{\sum_{k=1}^{K} I(t, C_k)}
\tag{12}
\]
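As a concrete illustration of Eqs. 11-12, the sketch below (our own, with hypothetical data structures, not the authors' code) computes the visualness of a simple or compound concept from per-visualset word scores $p(t_l|C_k)$ and priors $p(C_k)$.

```python
import numpy as np

def concept_visualness(word_scores, priors, concept, tau=0.04):
    """Sketch of Eqs. 11-12: visualness of a (simple or compound) concept.

    word_scores: list of dicts, one per visualset C_k, mapping a tag word
                 t_l to its score p(t_l | C_k) (cf. Eq. 10).
    priors:      array of visualset priors p(C_k) (cf. Eq. 7).
    concept:     tuple of tag words t = (t_1, ..., t_L); L = 1 is a simple
                 concept, L > 1 a compound concept.
    """
    num, den = 0.0, 0
    for p_words, p_ck in zip(word_scores, priors):
        # Associated score p(t | C_k) = min over the concept's words
        # (a word missing from the visualset scores 0).
        p_t_ck = min(p_words.get(t, 0.0) for t in concept)
        if p_t_ck > tau:            # indicator I(t, C_k), Eq. 11
            num += p_t_ck * p_ck    # numerator of Eq. 12
            den += 1                # denominator of Eq. 12
    return num / den if den > 0 else 0.0

# Toy usage with hypothetical word scores for two visualsets:
word_scores = [{"red": 0.3, "flower": 0.4}, {"car": 0.5, "red": 0.2}]
priors = np.array([0.6, 0.4])
print(concept_visualness(word_scores, priors, ("red", "flower")))
```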

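The clustering step of Sections 2.3.2-2.3.3 can likewise be sketched in a few lines. This is our schematic rendering under stated assumptions, not the authors' implementation: the image scores are arranged in a matrix, and for brevity the argmax of Eq. 9 is taken over all images rather than only the current member images of each visualset.

```python
import numpy as np

def clustering_step(p_x_given_c, priors):
    """One clustering step on a score matrix.

    p_x_given_c: array of shape (N, K) with p(x_i | C_k) from Eq. 4.
    priors:      array of shape (K,) with the current p(C_k).
    Returns (posteriors, new_priors, assignments).
    """
    # Eq. 7: posteriors p(C_k | x_i) and updated priors p(C_k).
    joint = p_x_given_c * priors                  # p(x_i|C_k) p(C_k)
    posteriors = joint / joint.sum(axis=1, keepdims=True)
    new_priors = posteriors.mean(axis=0)

    # Eq. 9: representative image of each visualset = argmax_i p(C_k | x_i).
    # (A faithful version would restrict the argmax to the current member
    # images X_{C_k}; here it is taken over all images for brevity.)
    centers = posteriors[np.argmax(posteriors, axis=0)]    # shape (K, K)

    # Eq. 8: cosine distance between p(x_i) and each center, then reassign.
    def l2norm(a):
        return a / np.linalg.norm(a, axis=1, keepdims=True)
    dist = 1.0 - l2norm(posteriors) @ l2norm(centers).T    # shape (N, K)
    assignments = np.argmin(dist, axis=1)
    return posteriors, new_priors, assignments

# Toy usage with random scores for N = 6 images and K = 2 visualsets:
rng = np.random.default_rng(0)
scores = rng.random((6, 2))
post, pri, assign = clustering_step(scores, np.array([0.5, 0.5]))
print(assign)
```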
Fig. 5. Examples of visualsets; the top-ranked images and terms are shown. Top terms per visualset: 59 (Alley 7.79, Street 2.78, Building 2.20, Light 2.15, Italy 1.49); 530 (Ceiling 12.77, Glass 12.63, British 10.59, Museum 10.45, London 10.45); 1223 (Giant 10.67, Panda 10.46, Zoo 10.44, California 10.04, Bear 9.59); 91 (Canyon 17.59, Antelope 16.25, Arizona 12.21, Page 8.11, Slot 5.51); 886 (Red 5.40, Orange 2.20, Bravo 1.34, Light 1.04, Wall 0.95); 1338 (Gymnastics 85.40, Senior 4.76, Sports 0.87, Gymnast 0.86, Floor 0.43); 406 (Butterfly 17.43, Flower 6.78, Jesters 5.26, Nature 4.73, Mariposa 3.84); 1120 (Foals 22.65, Horses 17.41, Horse 5.97, Pony 3.81, Small 3.73); 1404 (Herd 12.91, Horse 12.35, Gaucho 11.59, Criollo 11.41, Argentina 10.69). [Images omitted.]

3. EXPERIMENT

3.1. Dataset

The NUS-WIDE dataset [12], containing 269,648 images and 5,018 unique tag words from Flickr, is used in our experiments. Two types of global features, a 64-D color histogram and 512-D GIST, are extracted. Each type of global feature is further clustered into 2000 clusters by k-means clustering, whose centers form the set-based attribute nodes of the image heterogeneous graph. Local SIFT features are also extracted and clustered into 2000 visual words by k-means clustering, from which the word-based attribute nodes are generated.
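As a rough sketch of this setup (ours; the paper does not specify an implementation), the attribute nodes could be generated with scikit-learn's k-means as follows. The feature matrices are assumed to be precomputed, and the variable names and reduced cluster counts are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_features(features, n_clusters=2000, seed=0):
    """Cluster descriptors with k-means and return the cluster centers
    (the attribute nodes) plus the assignment of each row to its nearest center."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = km.fit_predict(features)
    return km.cluster_centers_, labels

# Hypothetical usage: 64-D color histograms and 512-D GIST give one
# set-based node per image; SIFT descriptors (128-D, many per image)
# give a visual-word vocabulary for the word-based nodes.
color_hist = np.random.rand(1000, 64)    # placeholders for real features
gist = np.random.rand(1000, 512)
sift_desc = np.random.rand(50000, 128)

_, color_nodes = quantize_features(color_hist, n_clusters=200)  # smaller K for toy data
_, gist_nodes = quantize_features(gist, n_clusters=200)
_, visual_words = quantize_features(sift_desc, n_clusters=200)
```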

3.2. Visualsets

First, we evaluate the performance of our visualsets generation approach presented in Section 2.3. Setting the number of visualsets to K = 3000 and the smoothing parameters to λS = 0.3 for set-based attribute nodes and λA = 0.1 for word-based attribute nodes (the optimal parameters found experimentally), visualsets mining finishes in fewer than 30 iterations.

Fig. 5 shows some randomly selected visualsets; in each visualset, the top-ranked images and words are displayed. Generally, the images in a visualset are visually compact and consistent, and are semantically related to the top-ranked words. Meanwhile, the associated word scores illustrate the effectiveness of visualsets. For example, the five words of visualset 530 are highly relevant to the member images and have close scores, whereas there is a big jump between the scores of the top two words in visualset 1338, as "Gymnastics" is clearly much more relevant to the images than "senior". Moreover, visualsets 1120 and 1404 suggest that learning synsets is valuable: both are about horses, but 1120 focuses on foals whereas 1404 is about a herd of horses. Visualsets also reveal relationships between concepts. For example, visualset 59 may mean that an "Alley" is a narrow "street" between "buildings", and visualset 91 turns out to capture "antelope canyon" in "Arizona". Visualset 406, on the other hand, implies frequent co-occurrence of "butterfly" and "flower", while visualset 1223 has very close scores for "giant panda", which suggests the compound concept "giant panda" for the panda images.

Fig. 6. Visualness of simple concepts. Top-ranked: Saw 11.01, Tollbooth 9.18, Foal 8.61, Cordless 8.28, Cheerleader 8.19, Cellphone 8.06, Firefighter 7.57, Tennis 6.90, Donkey 6.83, Pavilion 6.40, Telephone 6.33, Beard 6.30, Gymnastics 6.03, Foxhole 5.88, Camel 5.77. Bottom-ranked: Asian 3.10E-2, October 2.64E-2, Colorful 2.53E-2, Ladybug 4.89E-2, Available 1.01E-4. [Images and per-visualset term lists omitted.]

3.3. Visualness

Now we evaluate the effectiveness of discovering simple and compound concepts as well as estimating their visualness. We set the threshold τ = 0.04 and normalize all visualness scores.

3.3.1. Simple concepts

We find that 2,424 simple concepts out of the 5,018 unique tag words in NUS-WIDE are visualizable, i.e., their visualness scores exceed τ = 0.04. Fig. 6 illustrates the top and bottom simple concepts, their visualness scores, and the top images and words from the corresponding visualsets. A concept may relate to multiple visualsets, e.g. the two visualsets of "tollbooth" show its semantics in daytime and at night respectively. From Fig. 6, we can see that concepts with more compact visualsets generally have higher visualness scores; for example, "Asian" has a very low visualness score because its relevant visualsets (we show only two due to space limitations) are quite different in their visual and textual features.

Fig. 7. Visualness of animals: Camel 5.77, Goat 4.02, Moose 4.02, Rabbit 3.58, Bear 3.35, Giraffe 3.31, Lion 3.12, ..., Antelope 2.63, Animal 2.59. [Images and per-visualset term lists omitted.]

Fig. 8. Visualness of compound concepts: London graffiti 5.82, Cordless phone 5.01, Wedding ceremony 4.66, Wedding bride 4.24, Egypt sphinx 4.06, Offshore oilfield 4.00, Sky flags 3.86, Car pavilions 3.76, Sport tennis 3.68, Horses foals 3.61, Snow avalanche 3.57, ..., Cute animal 0.98, Vacation outdoors 2.30E-2, Pretty things 1.01E-4, Surprise things 1.01E-4. [Images and per-visualset term lists omitted.]

Though "ladybug" is intuitively a highly visualizable concept, it is scored low because there are few ladybug images in the NUS-WIDE dataset.¹ We also found that the following concepts are not visualizable: "positive", "planar", "reflected", "fund", "sold", "vernacular", "grief". Dataset effects aside, abstract concepts are generally difficult to visualize.

Fig. 7 shows more results on animals. A specific concept such as "moose" is more visualizable than the general concept "animal". "Antelope" is ranked in between, as it is an ambiguous concept that can denote either a landmark or an animal.

3.3.2. Compound concepts

We discovered 26,378 visualizable compound concepts from NUS-WIDE. Fig. 8 shows a few examples; each compound concept is a combination of two unique tag words. The top visualizable compound concepts are composed of closely related terms rather than following the restricted "adjective+noun" pattern studied in [3]; e.g., the landmark "Sphinx" is closely related to the country where it is located, "Egypt", whereas the event "wedding" co-occurs with the character "bride".

¹ Our method is able to take the popularity of concepts into consideration, owing to the priors p(Ck) of the visualsets in Eq. 12. This is reasonable because, in the web scenario, a concept is less visualizable if it has fewer relevant images.

A compound concept may also relate to multiple visualsets, e.g., "London graffiti". The bottom-ranked compound concepts are generally combinations of general or abstract words; for instance, it is difficult to capture a compact visual property for "pretty things" or "surprise things".

4. CONCLUSION

To answer the question "how well can a concept be represented in the real visual world?", we have presented a novel visualness mining method to automatically discover and measure visualizable concepts. The method consists of three steps: 1) build an image heterogeneous graph with multiple types of visual and textual features; 2) generate visualsets with a simultaneous ranking and clustering algorithm; and 3) discover visualizable simple and compound concepts with visualness estimation based on the visualsets. Evaluated on a large-scale benchmark dataset, our method achieves promising results.

5. REFERENCES

[1] K. Yanai and K. Barnard, "Image region entropy: a measure of 'visualness' of web images associated with one concept," in ACM Multimedia, 2005.
[2] Y. Lu, L. Zhang, Q. Tian, and W.Y. Ma, "What are the high-level concepts with small semantic gaps?," in CVPR, 2008.
[3] J.W. Jeong, X.J. Wang, and D.H. Lee, "Towards measuring the visualness of a concept," in CIKM, 2012.
[4] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[5] X.J. Wang, Z. Xu, L. Zhang, C. Liu, and Y. Rui, "Towards indexing representative images on the web," in ACM Multimedia Brave New Idea Track, 2012.
[6] D. Tsai, Y. Jing, Y. Liu, H. A. Rowley, S. Ioffe, and J. M. Rehg, "Large-scale image annotation using visual synset," in ICCV, 2011.
[7] X. Zhu, A. B. Goldberg, M. Eldawy, C. R. Dyer, and B. Strock, "A text-to-picture synthesis system for augmenting communication," in AAAI, 2007.
[8] T. L. Berg, A. C. Berg, and J. Shih, "Automatic attribute discovery and characterization from noisy web data," in ECCV, 2010.
[9] A. Sun and S. S. Bhowmick, "Quantifying tag representativeness of visual content of social images," in ACM Multimedia, 2010.
[10] Y. Sun, Y. Yu, and J. Han, "Ranking-based clustering of heterogeneous information networks with star network schema," in KDD, 2009.
[11] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, no. 5, pp. 604–632, 1999.
[12] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: a real-world web image database from National University of Singapore," in CIVR, 2009.
