Bioingenium Research Group National University of Colombia
Technical Report UN-BI-2009-01 March, 2009
Content-based Medical Image Retrieval Using a Kernel-based Semantic Annotation Framework Juan C. Caicedo,
[email protected] Fabio A. Gonzalez,
[email protected] Eduardo Romero,
[email protected]
Cra 30 # 45 - 03. Ciudad Universitaria. Bogotá - Colombia www.bioingenium.unal.edu.co
Abstract
Medical images are an important information modality for health care and the clinical decision-making process. However, accessing a medical image collection to find particular and useful samples is very difficult because of the large amount of visual content and the lack of automated tools to retrieve relevant results. Modeling the semantics of image contents and calculating similarity measures to automatically identify relevant images in a large collection is an important open research problem. This paper presents a novel approach to retrieve medical images based on their semantic contents. The proposed framework first analyzes visual image contents to automatically associate semantic labels. This is achieved using several kernel functions, each associated with an individual low-level visual feature, which are optimally combined according to the kernel-target alignment measure. The new adapted kernel generates an image representation that is more discriminative in the annotation process. Automatically generated annotations for each image in the database are used to search for similar images under a query-by-example paradigm. Experimental results show that the proposed framework is between 10% and 50% more effective than other standard kernel methods. Moreover, retrieval effectiveness more than doubles with the semantic kernel-based framework, compared with a retrieval system that uses only visual information.
Keywords: Medical Image Retrieval; Kernels; Kernel Alignment; Automatic Image Annotation; Histology
1 Introduction
Ordinary clinical routine requires images to support physicians’ decisions, and because of this modern hospitals produce hundreds of them daily. The use of information technologies such as Picture Archiving and Communication Systems (PACS) is widely spread in health centers to manage medical images in routine activities. Images are usually acquired and stored in these systems to support the immediate need, but they remain archived in the long term. These large image collections are a potential source of information and knowledge, but their
exploitation requires effective and efficient methods to find and retrieve images that satisfy the user's information needs.
Systems able to find and retrieve images according to their information content are known as content-based image retrieval (CBIR) systems. In a medical context, a CBIR system should find images to support the decision-making process in routine activities. Suppose a physician is evaluating a new case using the available PACS in the hospital. The physician is observing the patient’s image using a standard user interface that includes an additional option called “Find similar images”. When the physician picks this option, a set of previously evaluated cases is displayed. This functionality would allow experts to take advantage of the entire image collection and the information associated with it, providing a tool for evidence-based medicine and case-based reasoning. Clinical benefits of CBIR systems in medicine are discussed in [19]. In addition, other activities such as research and teaching in medical schools may take advantage of these systems.
One of the main problems in implementing a useful CBIR system is obtaining a good image similarity measure. Similarity measures based on low-level visual features (e.g. color histograms, texture descriptors, etc.) are not always the best alternative, since they are unable to capture the semantic notion of similarity. This problem is known as the semantic gap. In the medical context, this is a very important issue, since specialists look for very particular patterns in images, and two images that are very different from a visual point of view may be closely related from a higher-level semantic perspective. This problem may be partially overcome by having doctors manually attach annotations that capture the semantic content of the images in the repository. Nevertheless, this approach becomes prohibitive even for moderately large repositories. A better approach is to automatically identify semantic concepts within images.
Semantic concepts, instead of low-level visual features, are then the basis for calculating a semantic similarity measure. This is the approach followed in the system presented in this work.
This paper presents a system to archive and retrieve histopathology images by content. The most remarkable characteristic of this system is that it uses a high-level semantic representation of images that greatly improves the retrieval performance. The system has been built on a general framework for organizing the image collection using automatically generated annotations. This framework uses state-of-the-art machine learning algorithms to evaluate image contents and predict a set of labels that represents image semantics. The set of possible labels is defined by domain experts, and for each of those labels a Support Vector Machine (SVM) classifier is trained. The input space to the SVM is composed of multiple visual features, which are optimally combined using a kernel-target alignment strategy [7]. The main contribution of this work is the proposed semantic annotation and representation framework itself, which may be easily adapted to other image collections. The wide applicability of the framework is due to two main reasons: first, images are characterized by global features, so no attempt is made to identify particular objects in the image; second, the annotation system is trained with a set of examples annotated by an expert.
Experimentation has shown that the optimally combined kernel improves the system response by between 10% and 50% when compared with standard kernel methods, and more than doubles it when compared with a system that searches images using only visual features instead of semantic information. The system was tested with an image collection acquired in the Pathology Lab at the National University of Colombia, which provides clinical services, offers courses to undergraduate and graduate students and develops fundamental research. The image collection focuses on a particular pathology, basal-cell carcinoma [29], a common skin disease in white populations whose diagnosis is confirmed using histopathology slides. The purpose of the implemented system is to support activities at the Pathology Lab by allowing users to search the historical image database that has been collected over years of service and research.
This paper is organized as follows: Section 2 presents a review of previous works related to medical image retrieval and histology image retrieval. Section 3 introduces the proposed system to retrieve histopathology images using automatic semantic annotations. Section 4 shows the methods for image processing, kernel adaptation, classifier training and semantic feature production. Section 5 describes the image collection used to perform experiments. The experimental setup, annotation and retrieval results are presented in Section 6. Finally, Section 7 presents the concluding remarks and future work.
2 Content-based Medical Image Retrieval
The medical image retrieval problem has been approached from both general and specific perspectives. When the image collection is not restricted to a modality or specific body organ, the system has to manage a high content variability. On the other hand, for specific application domains, it is possible to design strategies that model the particular knowledge for extracting features or regions of interest from images. Either for specific contexts or general domains, the image content representation is the main issue. CBIR systems such as FIRE [10] and MedGIFT [17] use low-level features to model image contents, i.e., these systems do not attempt to find regions of interest or to model spatial relationships. Low-level features may be useful to detect visual differences between image modalities in a heterogeneous medical image collection. However, the use of low-level features alone yields poor performance because of the semantic gap [26]. The ASSERT system [24] aims to bridge the gap by directly involving the physician, who identifies regions of interest (ROIs) on HRCT images. The authors claim that relevant regions cannot be identified using available computer-vision techniques, so human-drawn ROIs are processed to extract visual descriptors. Advanced image processing techniques are also used to model image contents. For example, the National Library of Medicine has worked on spine x-ray images to represent vertebrae as a set of points [30]. To identify the appropriate set of points, they have proposed a robust algorithm that models the shape of
vertebrae and also supports accurate morphometric measurements. However, this method is very limited, since it addresses only this specific kind of spine x-ray image. The design of image content representations is the result of a trade-off between method generality and accurate semantics. The ImageCLEFmed challenge is a research event that promotes the discussion of models for content-based medical image retrieval [18]. It has been carried out since 2005, providing a common image dataset to evaluate retrieval performance among all participating groups. The image collection is highly heterogeneous and contains both images and associated texts that can be used to solve queries proposed by the event organizers. For this dataset, visual-only approaches have shown lower performance than those using multimodal data [18]. The use of text annotations improves the system response, since they are a more explicit source of semantics. However, textual annotations are not always available, and images without annotations are not searchable under a text-dependent scheme. Since available text may improve the retrieval performance in a medical CBIR system, automatic image annotation has become a popular approach. The task is to assign a set of correct words to unlabeled images through content analysis, using a controlled vocabulary that may be generated using automatic text analysis or may be given by an expert. An effort to design such a controlled vocabulary was made by the IRMA project [15], which has proposed an alphanumeric code to annotate medical images, including information about modality, orientation, examined body organ and biological system. The ImageCLEFmed challenge includes a task for automatic annotation of medical images using this code, and other works have also adopted and tested it.
2.1 Histology Image Retrieval
The study of image retrieval systems for histology applications is not as common as for radiology. There are a few systems that are designed specifically for histology image retrieval [31, 28], and some others for histology image classification or automatic disease detection [14, 8]. Zheng et al. [31] present a system for retrieving pathology images using low-level features such as colors, textures, Fourier coefficients and wavelet coefficients. Low-level feature similarity is calculated using cosine similarity. Global image similarity corresponds to a linear combination of these low-level similarities. The system was validated using a hierarchical cluster analysis under the assumption that relevant images are close under the defined similarity. Although results show consistency between obtained clusters and domain knowledge, it is not clear how the weights were adjusted, even though those parameters can substantially reduce the semantic gap when using low-level features. Tang et al. [28] proposed a system for semantic histology image retrieval. It uses machine learning techniques to automatically annotate a particular type of histology image from the gastrointestinal tract. That framework models images as features in a regular grid. Each block in the grid is classified using semantic categories that explain image patterns. This semantic analyzer assigns labels
to each block in the image grid, performing inferences for evaluating global coherence of local labels. This is achieved using three coupled neural network classifiers, requiring a great effort to train the semantic analyzer because of the required number of examples with different levels of annotations for each arrangement. Similarly to Tang’s work, the system proposed in the present paper uses a high-level semantic representation for images. However, our proposed system only uses global visual features, and therefore it does not require labeling individual regions in the training set. This makes the proposed system simpler in terms of structure and function. It should be stressed that the two systems cannot be compared directly, since the authors of that work did not perform a systematic evaluation of the system retrieval performance using appropriate measures.
Fig. 1: Proposed system overview
3 A System for Semantic Image Retrieval
The system proposed in this paper was built to solve the concrete problem of efficiently storing and accessing a collection of histopathology images. It was conceived as a general image storage and retrieval system. A typical usage scenario is the following: a user uploads a set of images; the system indexes and stores them; the user queries the system using a sample image (query by example, QBE); the system searches the database for the images most similar to the query; finally, the system outputs these images as the result. This process is illustrated in Figure 1. The figure also shows the system’s two main modules: the image storage
module and the image retrieval module. The Image Storage module has two processing components and two storage components. Once an image is uploaded, it is directly stored in the image database in JPG format. This image is processed using several algorithms to extract low-level visual features that characterize the image. The set of features includes color, texture and edge distributions, among others. These features are fed to the semantic annotation component, which was devised to analyze low-level features and predict concepts that are present in the image. These concepts are defined by domain experts. The other main module, shown in Figure 1, is the Image Retrieval module. This module receives user queries, under the QBE paradigm, and delivers similar images. The query is processed to extract its low-level features; these features are then used by the semantic annotation submodule to generate high-level semantic tags, which are associated with the image content. The semantic representation is used by a similarity function that evaluates how many concepts are shared between two images. The critical task of this system is the high-level semantic image annotation, accomplished by a semantic annotator submodule. The process to train the semantic annotator is shown in Figure 2. At the beginning, a collection of training images is selected to be tagged by experts. The number and type of concepts depend on the particular application, and they must be specified by a domain specialist. In general, an image may exhibit one or more concepts, so concepts group images into overlapping (non-disjoint) classes.
Fig. 2: Training of the semantic annotator
Images are processed to extract a set of low-level features that, together with the expert annotations, constitute the training dataset. A learning algorithm is applied several times to the training dataset to learn a classifier for each selected concept. Each classifier receives as input the low-level features of a new image, evaluates the classification function and determines whether the image contains the concept or not. With all outputs from individual classifiers, the semantic annotator builds a semantic feature vector for the input image, which indicates a degree of presence of each concept in the test image.
4 Semantic Image Annotation and Retrieval Based on Kernel Methods
Previous works have shown that image representation is a key issue in a successful image classification system [4]. In this work, image kernel functions [2] are used to generate a rich image representation space. Image kernel functions are defined for different low-level visual features. For each concept, individual kernel functions are optimally combined to obtain an image kernel function, which is adapted to the particularities of the concept. These kernels are used to train different support vector machines (SVM), one per high-level concept. The overall process is discussed in detail in the following subsections.
4.1 Image features
Feature extraction is an important task for image analysis and understanding. There are different approaches to address this problem [4]. Global features for characterizing whole scenes have been proposed using color histograms [27] and MPEG7 features [21]. Likewise, global descriptors such as textures and down-scaled representations have been evaluated in medical imaging [12]. One important advantage of using a global image description strategy is that it does not require a specific model of the type of objects or regions that images may contain. This makes it easier to apply the proposed framework to different types of images. Within this framework each feature is a probability distribution function represented as a histogram which contains global image information. Histograms are a simple non-parametric approach to estimate image contents. The histogram divides the feature space $X$ in $M$ regions $\chi^{(m)} \subset X$ such that
\[ \bigcup_{m=1}^{M} \chi^{(m)} = X \tag{1} \]
and
\[ \chi^{(m)} \cap \chi^{(n)} = \emptyset \quad \forall m \neq n \tag{2} \]
Once the feature space partition has been defined, the histogramization procedure counts the number of pixel occurrences in each sub-space $\chi^{(m)}$. Frequency
values are normalized to obtain the probability distribution function. The result is a description of the image given by that probability distribution function. An image can be understood as a complex mix of many low-level features such as edges, textures, colors and orientations. An image is mapped to a feature space through a filter or transformation. Image data is then distributed in the feature space according to the image content. It is well known that visual systems are broadly sensitive to three basic characteristics: color, edges and orientation. Six feature spaces have been selected to analyze different image content properties.
1. Gray-scale histogram: A 256-bin histogram is used.
2. Color histogram: Histograms are represented using partitions of the color space as described in [25]. The RGB space is partitioned in 8 × 8 × 8 = 512 bins.
3. Local Binary Patterns: This is a texture feature that has been used in some image-retrieval systems [3]. For each pixel P in the image, its 8 neighbors are examined, comparing their intensity values with the intensity of the pixel P. If the intensity of the neighbor pixel is greater, a 1 value is assigned to the corresponding neighbor position; otherwise a 0 value is assigned. The calculated values are used to build an 8-position binary string per pixel. The binary string can take 256 different values, so a histogram with 256 bins is calculated.
4. Tamura texture histogram: There are 6 different Tamura features: coarseness, contrast, directionality, linelikeness, regularity, and roughness. In this work, the first three are used, as they are strongly correlated with human perception. In [9], Tamura’s original formulation was adapted to calculate each feature per pixel; this is the approach used in this work. As with the color histogram, the space generated by the three features is partitioned in 8 × 8 × 8 = 512 bins to calculate the texture histogram.
5. Sobel histogram: The Sobel operator is one of the most popular image-processing operators for edge detection [16]. It calculates the intensity difference on the neighborhood of a pixel in the horizontal and vertical directions; this difference may be interpreted as the derivative of a function representing the image at that point. In this implementation, a 3 × 3 operator is used to analyze the 8-neighborhood of each pixel. A 512-bin histogram is calculated using the intensity-change measures produced by the operator.
6. Invariant feature histogram: This technique models features that are invariant under different transformations, such as those induced by projecting an object on an image plane. In this work, the distribution of invariant features is calculated according to [25]. In particular, the integration method based on the kernel function X(8, 0)X(0, 16) is used; this may be interpreted as a rotation analysis with two circles, with radii of 8 and 16 pixels respectively, where the support of the local transformation is calculated.
Intuitively, each histogram is a vector of frequencies in which each value is proportional to the probability of finding a pixel in the corresponding feature sub-space. All those vectors may be concatenated to obtain a single image representation with 2,560 features.
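The histogram construction described above can be sketched for three of the six features. This is a minimal illustration, not the report's implementation: the equal-width 8 × 8 × 8 binning, the neighbor ordering of the binary pattern, and the decision to skip border pixels are all assumptions, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn bin counts into a discrete probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts] if total > 0 else [0.0] * len(counts)


def gray_histogram(gray: Sequence[Sequence[int]]) -> List[float]:
    """256-bin histogram: one bin per gray level."""
    counts = [0] * 256
    for row in gray:
        for value in row:
            counts[value] += 1
    return normalize(counts)


def rgb_histogram(rgb: Sequence[Sequence[Pixel]]) -> List[float]:
    """8 x 8 x 8 = 512-bin color histogram (equal-width bins are an assumption)."""
    counts = [0] * 512
    for row in rgb:
        for r, g, b in row:
            # Integer division by 32 maps 0..255 onto 8 ranges numbered 0..7.
            counts[(r // 32) * 64 + (g // 32) * 8 + (b // 32)] += 1
    return normalize(counts)


def lbp_histogram(gray: Sequence[Sequence[int]]) -> List[float]:
    """256-bin histogram of 8-neighbor binary patterns (interior pixels only)."""
    # Clockwise from the top-left neighbor; the ordering is an arbitrary choice.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    counts = [0] * 256
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            code = 0
            for dy, dx in offsets:
                # A neighbor brighter than the center pixel contributes a 1 bit.
                bit = 1 if gray[y + dy][x + dx] > gray[y][x] else 0
                code = (code << 1) | bit
            counts[code] += 1
    return normalize(counts)
```

Because every function returns a normalized vector, the outputs can be concatenated directly into a single feature vector or passed separately to histogram kernels.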
4.2 Kernel functions
Kernel methods represent a relatively new approach to perform machine learning [23]. One of the main distinctive characteristics of these methods is that they do not emphasize the representation of objects as feature vectors. Instead, objects are characterized implicitly by kernel functions that measure the similarity between two objects. A kernel function induces an implicit high-dimensional feature space where, in principle, it is easier to find patterns. In the context of the present work, kernel methods have attractive characteristics: on one hand, they provide state-of-the-art learning algorithms with good theoretical support and competitive performance; on the other hand, they make it possible to deal with complex objects, in this case images, by designing similarity functions (kernels) that capture the semantics of the problem domain. Informally, a kernel function measures the similarity of two objects. Formally, a kernel function, $k : X \times X \to \mathbb{R}$, maps pairs $(x, y)$ from a set of objects $X$, the problem space, to reals. A kernel function implicitly generates a map, $\Phi : X \to F$, where $F$ corresponds to a Hilbert space, called the feature space. The dot product in $F$ is calculated by $k$, specifically $k(x, y) = \langle \Phi(x), \Phi(y) \rangle_F$. This section introduces the kernel functions used in our framework, specifically to deal with histogram data. The presented kernel functions provide a notion of similarity between a pair of histograms. These functions exploit the probability distribution properties of a histogram to take advantage of the full histogram information. The following notation conventions are used in this subsection: Let $A$ and $B$ be two frequency histograms. Both histograms have $m$ bins, with $a_i$ and $b_i$ being the frequency values in the $i$-th bin respectively, for $i = 1, \ldots, m$. In general, histograms are assumed to be normalized, $\sum_{i=1}^{m} a_i = \sum_{i=1}^{m} b_i = 1$, so they can be interpreted as discrete probability distribution functions.
4.2.1 Identity kernel
One can deal with histograms as simple data vectors, regardless of their probability distribution properties. In that sense, we can calculate the dot product between histograms treating them as high dimensional feature vectors, as follows:
\[ k_I(A, B) = \langle A, B \rangle = \sum_{i=1}^{m} a_i b_i \tag{3} \]
Using the identity kernel function for histograms is equivalent to keeping the feature data in the original m-dimensional feature space.
4.2.2 Histogram Intersection
The histogram intersection is a similarity function devised to calculate the common area between histograms as follows:
\[ k_\cap(A, B) = \sum_{i=1}^{m} \min\{a_i, b_i\} \tag{4} \]
It was originally used for image similarity search [9]. Previous works on image classification have attempted to design kernels for histogram data, including Gaussian and Laplacian RBF functions [1]. However, using the histogram intersection similarity as a kernel for support vector machines has shown better representation and performance. Moreover, this similarity measure has been shown to satisfy Mercer’s conditions [2].
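Both histogram kernels reduce to one-line sums over bins. The sketch below assumes histograms are plain sequences of equal length; the function names and the `gram_matrix` helper are illustrative and do not come from the report.

```python
from typing import Callable, List, Sequence

Histogram = Sequence[float]


def identity_kernel(a: Histogram, b: Histogram) -> float:
    """Dot product of two histograms treated as plain vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def intersection_kernel(a: Histogram, b: Histogram) -> float:
    """Sum of bin-wise minima: the area shared by the two histograms."""
    return sum(min(ai, bi) for ai, bi in zip(a, b))


def gram_matrix(
    histograms: Sequence[Histogram],
    kernel: Callable[[Histogram, Histogram], float],
) -> List[List[float]]:
    """Evaluate a kernel on every pair of histograms in a sample."""
    return [[kernel(a, b) for b in histograms] for a in histograms]
```

For normalized histograms the intersection kernel of a histogram with itself is 1, whereas the identity kernel of a histogram with itself depends on how concentrated its mass is.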
4.3 Linear combination of kernels
Subsection 4.1 presented six different histogram features, each corresponding to a different view of the image content. Given two images, a similarity measure may be calculated by applying one of the kernel functions discussed in the previous subsection to a pair of feature histograms. This produces 12 alternative kernels (6 feature histograms combined with 2 basic kernel functions). The problem is how to use these alternatives to calculate an overall similarity measure for images. The new similarity measure would correspond to a kernel function that induces a new image representation space. The aim of this work is to design a representation space that emphasizes different characteristics such as color, edges and texture, depending on the target high-level concept to be identified. In this work, the problem is solved by defining a kernel, $k_\alpha$, which is a linear combination of $n$ individual histogram kernels:
\[ k_\alpha(x, y) = \sum_{i=1}^{n} \alpha_i k_i(x, y) \tag{5} \]
The weights $\alpha_i$ parameterize the kernel, giving higher or lower importance to each individual feature. Each $k_i$ corresponds to a histogram kernel applied to a particular image feature histogram. The problem is thus to find a vector of weights $\alpha$ that maximizes the performance of the kernel $k_\alpha$ in an image classification task. In the particular case of histopathology images, different concepts require different classifiers that emphasize different visual features. Herein we use the kernel alignment concept [7] to build an adapted kernel function for each concept. Each adapted kernel function is expected to emphasize those visual features
that better reveal the presence (or absence) of the corresponding concept in a given image. Kernel-target alignment [7] measures how appropriate a kernel function is for solving a specific classification problem. Specifically, the alignment of two kernels with respect to a sample $S$ is defined as:
\[ A_S(k_1, k_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \, \langle K_2, K_2 \rangle_F}} \tag{6} \]
where $k_1, k_2$ are kernel functions; $K_1, K_2$ are the matrices corresponding to the evaluation of the kernel functions on the sample $S$; and $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, defined as $\langle A, B \rangle_F = \sum_i \sum_j A_{ij} B_{ij}$. Given a target function $y : X \to \{-1, 1\}$, where $X$ is the problem space, the target kernel $k^*$ is defined as $k^*(x, z) = y(x) y(z)$. The target kernel is the optimal kernel for solving the given classification task. The goodness of a given kernel $k$ is measured in terms of how much it aligns with the target kernel in a training sample. Formally this is expressed as
\[ A_S^*(k) = A_S(k, k^*) \tag{7} \]
The problem of finding appropriate weights for $k_\alpha$ then becomes the problem of finding the weights $\alpha$ that maximize the target alignment $A_S^*(k_\alpha)$. In [13], this problem is solved by transforming it into an equivalent quadratic programming problem, which is the strategy followed in this work.
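To make the alignment criterion concrete, the following sketch evaluates the kernel-target alignment of Gram matrices and of a weighted combination of them. It only scores a given weight vector; it does not implement the quadratic programming solution of [13], and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def frobenius(a: Matrix, b: Matrix) -> float:
    """Frobenius inner product: sum of element-wise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(k1: Matrix, k2: Matrix) -> float:
    """Alignment between two Gram matrices computed on the same sample."""
    return frobenius(k1, k2) / math.sqrt(frobenius(k1, k1) * frobenius(k2, k2))


def target_kernel(labels: Sequence[int]) -> Matrix:
    """Target Gram matrix y(x) * y(z) for labels in {-1, +1}."""
    return [[yi * yj for yj in labels] for yi in labels]


def target_alignment(k: Matrix, labels: Sequence[int]) -> float:
    """Alignment of a Gram matrix with the target kernel of the labels."""
    return alignment(k, target_kernel(labels))


def combine(kernels: Sequence[Matrix], alphas: Sequence[float]) -> Matrix:
    """Weighted sum of Gram matrices, one weight per base kernel."""
    n = len(kernels[0])
    return [
        [sum(a * k[i][j] for a, k in zip(alphas, kernels)) for j in range(n)]
        for i in range(n)
    ]
```

A candidate weight vector can be scored with `target_alignment(combine(grams, alphas), labels)`; a higher value indicates that the combined kernel groups same-label images more consistently on that sample.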
4.4 SVM Classifiers
Support Vector Machines (SVM) [22] are linear classifiers devised to operate in a feature space defined by a kernel function. The decision function is a hyperplane in the feature space that separates the vectors that are in the class from other vectors. This decision function for classifying a new vector $x$ is as follows:
\[ \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b \right) \tag{8} \]
where $x_i$ is the $i$-th support vector, $y_i$ is its label, $\alpha_i$ is a weight value to be found by the learning algorithm, and $k(\cdot, \cdot)$ is the kernel function. The problem of finding such a hyperplane is an optimization problem that minimizes an expression that depends on the training data, with a parameter to control the generalization ability of the classifier. The generalization problem deals with finding a classification function that is robust to variation in the input data. It is well known that SVM generalization performance depends on a good setting of both the model complexity and the kernel [22]. The complexity is controlled by a regularization parameter. In this work, the specific regularization parameter for each SVM is experimentally determined using cross validation to reach an effective and general model.
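Once training has produced the support vectors, their weights and the bias, evaluating the decision function is a weighted sum of kernel values. The sketch below covers evaluation only and takes those quantities as given; the support vectors in the usage note are hand-picked for illustration, not learned, and the tie-breaking rule for a zero score is an assumption.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]
# One support vector: (x_i, y_i, alpha_i) with y_i in {-1, +1}.
SupportVector = Tuple[Vector, int, float]


def intersection_kernel(a: Vector, b: Vector) -> float:
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(ai, bi) for ai, bi in zip(a, b))


def svm_score(
    x: Vector, support: Sequence[SupportVector], bias: float, kernel: Kernel
) -> float:
    """Real-valued SVM output, i.e. the argument of sgn in the decision function."""
    return sum(y_i * alpha_i * kernel(x_i, x) for x_i, y_i, alpha_i in support) + bias


def svm_predict(
    x: Vector, support: Sequence[SupportVector], bias: float, kernel: Kernel
) -> int:
    """Class label in {-1, +1}; a score of exactly zero is mapped to +1 here."""
    return 1 if svm_score(x, support, bias, kernel) >= 0 else -1
```

For example, with `support = [([1.0, 0.0], 1, 1.0), ([0.0, 1.0], -1, 1.0)]` and zero bias, a histogram concentrated in the first bin receives a positive score and one concentrated in the second bin a negative score.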
4.5 The Semantic Image Annotator
The goal of the image annotation system is to analyze visual image contents to produce a semantic interpretation. This interpretation corresponds to the assignment of zero or more concept labels. The system has a restricted vocabulary, and for each image the aim is to identify which of those keywords are most appropriate according to the visual content. However, in this framework, binary labels are not assigned to the image, as explained below. The image annotation system requires two training steps for each target concept. The first step is to identify a kernel function that optimally combines all individual histogram kernels, as described in Section 4.3. The second step is to train an SVM using the corresponding adapted kernel to recognize the particular class. Semantic annotations are built using the output of all SVMs. However, instead of using binary labels that indicate whether an image contains a concept or not, a degree of presence or absence is modeled for each possible concept. Each image is assigned a semantic feature vector in $\mathbb{R}^n$, where $n$ is the number of concepts. Each component of the semantic feature vector is generated by applying a sigmoid function to the output of the corresponding SVM:
\[ f(x) = \frac{1}{1 + e^{-a(x + b)}} \tag{9} \]
The function produces a number in the range [0, 1]. The position and the slope of the function are controlled by the parameters $b$ and $a$ respectively. The shape of the function has an important effect on the sensitivity of the semantic annotation process. Specifically, the sigmoid function parameters affect the trade-off between precision and recall. Appropriate values are found experimentally using the training data.
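The mapping from SVM outputs to a semantic feature vector can be sketched as follows. The sketch assumes that the raw (unthresholded) SVM output is used and that a single pair of parameters is shared by all concepts; the text does not say whether they are set per concept, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def presence_degree(svm_output: float, a: float, b: float) -> float:
    """Sigmoid of the SVM output: a sets the slope, b shifts the position."""
    return 1.0 / (1.0 + math.exp(-a * (svm_output + b)))


def semantic_vector(
    svm_outputs: Sequence[float], a: float = 1.0, b: float = 0.0
) -> List[float]:
    """One presence degree per concept, in the same order as the SVM outputs."""
    return [presence_degree(s, a, b) for s in svm_outputs]
```

An output of exactly -b maps to 0.5, so b decides which SVM margin counts as an undecided concept, while a larger a pushes values toward 0 or 1 and makes the annotation closer to a binary label.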
4.6 Semantic Image similarity
As discussed in Section 3, the retrieval system looks for the set of images most similar to the query. The semantic similarity of two images is calculated by applying the Tanimoto coefficient to the semantic feature vectors describing the images. Given two semantic vectors $x$ and $y$, the Tanimoto coefficient is defined as:
\[ T(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y} \tag{10} \]
The Tanimoto coefficient is a generalization of the Jaccard similarity for continuous values. It evaluates the degree of coincidence between two vectors, which, in this context, is related to the common concepts of the two images being compared.
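A query-by-example search over semantic vectors then amounts to scoring every stored image with the Tanimoto coefficient and sorting. The sketch below is a brute-force illustration; the dictionary-based database, the image identifiers and the zero-vector convention are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def tanimoto(x: Sequence[float], y: Sequence[float]) -> float:
    """Tanimoto coefficient between two semantic feature vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    denom = sum(xi * xi for xi in x) + sum(yi * yi for yi in y) - dot
    # Two all-zero vectors give 0/0; returning 0.0 is a convention chosen here.
    return dot / denom if denom > 0 else 0.0


def rank_by_similarity(
    query: Sequence[float], database: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (image id, similarity) pairs, most similar first."""
    scored = [(image_id, tanimoto(query, vec)) for image_id, vec in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For binary vectors the coefficient equals the Jaccard index: two images sharing one concept out of two distinct concepts in total score 0.5.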
5 Histopathology Image Test Dataset
Images in this work have been used to diagnose a special kind of skin cancer known as basal-cell carcinoma. Basal-cell carcinoma is the most common skin disease in white populations and its incidence is growing worldwide [29]. It has different risk factors and its development is mainly due to ultraviolet radiation exposure. Pathologists confirm whether or not the disease is present after a biopsied tissue is evaluated under the microscope. In this evaluation, physicians aim to recognize some characteristic patterns in tissues to determine whether the carcinoma is present or absent. This process is mainly achieved by a visual analysis to identify cell structures and tissue organization. In [11], the structural patterns that characterize basal-cell carcinoma are described. The whole histopathology collection is composed of 5,995 images at 1,280×1,024 pixels, acquired under a Nikon microscope at the Pathology Lab. In the acquisition process a set of clinical cases was selected, related to different patients. Each patient’s sample is put under the microscope and, after a visual inspection, some images are captured at different zoom levels. Images were stored in a common directory in JPG format for later analysis. A subset of 1,502 images was studied and annotated by a pathologist to describe its contents. The annotation process included the following steps:
1. The individual analysis of each image to determine the kind of content was done by an expert pathologist.
2. The enumeration of the main visual patterns present in images. This step produced a list of 30 visual arrangements, associated with tissue and cell properties.
3. The revision of the pattern list was done to study the relationships between them. Many patterns do not appear alone; instead, they are present together with other patterns, building configurations that are understandable for physicians.
4. The definition of visual configurations with the most semantic relevance in histopathology was achieved, based on the 30-pattern list. This led to a list of 18 main concepts associated with basal-cell carcinoma images.
As a result, this process yields a dataset with images and descriptions of their related concepts. One image may contain several concepts, and the annotations are useful to automatically validate whether search results are relevant to the user's information needs. Throughout this work, all the experimentation is based on the concept list with 18 entries, which is detailed in Table 1. The first list of 30 visual structures was used only as a reference to understand the domain knowledge and to interpret images. One of the histopathology concepts reported in Table 1 is N-P-C. It is a convention for Nodule, Palisading cells and Clefts (N-P-C), which is a typical sign of basal-cell carcinoma, indicated not by the presence of any of them individually but by the visualization of all three visual patterns together. Those annotations
6 Experimental Evaluation
14
associated to images are the pathologist view point about image contents and hence the best way to assess the system effectiveness. This image collection has been previously used to test two different retrieval strategies: one that uses only visual similarity [6] and another that uses a semantic representation approach based on SVM classifiers with basic kernels [5].
6 Experimental Evaluation
The kernel-based annotation and retrieval framework described in Section 4 was implemented and tested on a histopathology image database. The experimental evaluation presented in this Section has two main goals: first, to evaluate the performance of the proposed kernel-based annotation framework on real histopathology images; second, to determine the impact on retrieval performance of using semantic annotations instead of only low-level visual features. The image database is a collection of basal-cell carcinoma images acquired under the microscope. A dataset of 1,502 images was annotated by an expert pathologist, using a list of 18 histopathology concepts. One image may contain several of those concepts. This dataset was divided into training (80%) and test (20%) sets, using stratified sampling.
6.1 Kernel alignment
The first task in the training phase is to combine the kernel functions associated with low-level visual features. According to Subsection 4.3, an optimal combination can be found for a particular binary classification problem. In this work, a new kernel is adapted for each of the 18 histopathology concepts. Since each class may be better described by some particular features, the optimization problem is expected to find weights that favor those features. The best feature combination is identified for each class, balancing a trade-off between generalization and good alignment. A 10-fold cross validation was performed on the training dataset to estimate the most appropriate parameters of the optimization problem. With good parameters, one obtains a linear combination of kernels that is better adapted to recognize the target concept in new, unseen image sets.
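To make the notion of alignment concrete, the sketch below evaluates the kernel-target alignment of Cristianini et al. [7], that is, the normalized Frobenius inner product between a Gram matrix and the ideal kernel y y^T, for two toy kernels and for a weighted sum of them. It does not reproduce the optimization of Subsection 4.3: the weights, the labels and both Gram matrices are hypothetical values chosen only for illustration.

```python
from math import sqrt


def frobenius(a, b):
    """Frobenius inner product of two square matrices given as nested lists."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(k, labels):
    """Kernel-target alignment between Gram matrix k and the ideal kernel y y^T."""
    target = [[yi * yj for yj in labels] for yi in labels]
    return frobenius(k, target) / sqrt(frobenius(k, k) * frobenius(target, target))


def combine(kernels, weights):
    """Weighted sum of Gram matrices: K = sum_i w_i * K_i."""
    n = len(kernels[0])
    return [[sum(w * k[r][c] for k, w in zip(kernels, weights)) for c in range(n)]
            for r in range(n)]


# Toy problem: 4 images, labels +1/-1 for a single concept (hypothetical).
labels = [1, 1, -1, -1]
k_good = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]  # follows the classes
k_flat = [[1, 1, 1, 1]] * 4  # constant kernel, carries no class information
k_mix = combine([k_good, k_flat], [0.8, 0.2])  # hypothetical weights
```

In this toy case the uninformative kernel has zero alignment, so any weight assigned to it lowers the alignment of the combination. This is the behavior that a weight optimization is expected to exploit when it favors the most descriptive features of each class.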
Tab. 1: Histopathology image concepts and weights assigned to each feature. The largest weight per concept is marked with bold and double underline. The second largest weight is marked with italic and single underline. Feature conventions are: LBP: Local Binary Patterns, SOB: Sobel, TAM: Tamura Texture, GRA: Gray Levels, RGB: Colors and INV: Invariants.

Concept                        LBP   SOB   TAM   GRA   RGB   INV
Cystic change                  0.17  0.28  0.36  0.00  0.01  0.18
Eccrine glands                 0.46  0.15  0.16  0.23  0.00  0.00
Elastosis                      0.81  0.09  0.10  0.00  0.00  0.00
Fibrosis                       0.52  0.16  0.12  0.20  0.00  0.00
Lymphocyte infiltrate          0.77  0.01  0.11  0.11  0.00  0.00
Micronodules                   0.50  0.11  0.16  0.12  0.11  0.00
Morpheaform pattern            0.49  0.11  0.09  0.12  0.03  0.00
N-P-C, elastosis               0.60  0.16  0.20  0.04  0.00  0.00
N-P-C, fibrosis                0.50  0.20  0.16  0.12  0.01  0.00
N-P-C, infiltrate              0.43  0.32  0.21  0.00  0.04  0.00
N-P-C, pilosebaceous annexa    0.49  0.13  0.17  0.14  0.08  0.00
N-P-C, trabeculae              0.51  0.15  0.17  0.12  0.06  0.00
Necrosis                       0.40  0.17  0.13  0.19  0.10  0.00
Perineural invasion            0.52  0.16  0.15  0.16  0.01  0.00
Pilosebaceous annexa           0.80  0.15  0.00  0.05  0.00  0.00
Rod trabeculae                 0.62  0.12  0.12  0.14  0.00  0.00
Sanguineous vessel             0.29  0.19  0.21  0.26  0.05  0.00
Ulceration                     0.48  0.16  0.18  0.17  0.02  0.00
Table 1 shows the list of histopathology concepts with the associated weights for each feature. The largest weight for each concept is marked with bold and double underline, followed by the second largest weight, which is marked with italic and single underline. In general, Local Binary Patterns (LBP) receive the largest weight for almost all histopathology concepts. This indicates that texture is an important feature for discriminating this specific set of concepts. Other important features are Sobel (edges), Tamura textures and gray level histograms, which receive more or less weight depending on the underlying concept. On the other hand, colors and invariant features were determined to be of little value for discriminating histopathology concepts. This may be due to the low color variability in the collection and the highly textured structure of histopathology images. Reported results have been obtained after the kernel combination process using the histogram intersection kernel, since it performs better in the classification and retrieval tasks.
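For reference, the histogram intersection kernel mentioned above sums, bin by bin, the smaller of the two histogram values. The following sketch shows that computation and the resulting Gram matrix. The 4-bin histograms are hypothetical, and the helper names are not taken from the system described in this report.

```python
def histogram_intersection(h1, h2):
    """K(h1, h2) = sum_i min(h1[i], h2[i]) for two histograms with the same bins."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def gram_matrix(histograms, kernel=histogram_intersection):
    """Pairwise kernel values for a list of histograms."""
    return [[kernel(a, b) for b in histograms] for a in histograms]


# Hypothetical normalized 4-bin histograms of two images.
h_a = [0.5, 0.25, 0.25, 0.0]
h_b = [0.25, 0.25, 0.25, 0.25]
gram = gram_matrix([h_a, h_b])
```

With normalized histograms the kernel value of an image with itself is 1, and values closer to 1 indicate a larger overlap between the two distributions.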
Tab. 2: Classification results on the test dataset using different kernel functions and the base-line model.

Strategy             Kernel Function     Precision  Recall  F-measure
Optimal Combination  Hist. Intersection  0.70       0.38    0.48
Optimal Combination  Identity Kernel     0.36       0.23    0.27
Simple Combination   Hist. Intersection  0.68       0.35    0.45
Simple Combination   Identity Kernel     0.34       0.15    0.20
Metafeatures         RBF Kernel          0.51       0.47    0.47
6.2 Automatic Image Annotation
The semantic image annotator is composed of 18 SVM classifiers that evaluate image contents under a kernel-based framework. The classification strategy is one-against-all, i.e. a classifier is learned to separate each class from all others. This is especially useful since each image can be annotated with several labels, that is to say, each image can be classified into many classes. The classification system is first trained using 10-fold cross validation to estimate good parameters for each classifier. The parameter values are chosen to maximize the f-measure per class, since we want to correctly annotate as many images as possible with high precision. Reported performance measures are precision, recall and f-measure, since others such as accuracy or error rate may be biased by the large per-class imbalance in the dataset. In addition, reported measures are weighted-average scores over all classes, weighted by the number of images in each class.

The experimentation includes the evaluation of three different strategies for building kernel functions:

1. Optimal combination of kernels using the kernel-target alignment framework.
2. Simple combination of kernels, adding them with equal weights.
3. An RBF kernel calculated on a reduced histogram representation using basic statistical moments.

The first strategy is based on the proposed methods presented in Section 4.3. The second strategy is used as a baseline to assess the performance improvement of an optimal combination of kernels over the direct sum of all of them. The third strategy is another baseline, which allows a comparison with a more classic kernel construction according to the scheme presented in [5]. In addition, strategies 1 and 2 are evaluated using two underlying kernel functions: the identity kernel and the histogram intersection kernel. This makes it possible to distinguish the contribution of using specific kernel functions that exploit the natural structure of the data.
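The weighted-average scores described earlier in this subsection could be computed along the following lines. This is a minimal sketch: the per-class counts are hypothetical, and weighting each class by its number of positive images is an assumed reading of "the number of images in each class".

```python
def prf(true_positive, false_positive, false_negative):
    """Precision, recall and F-measure for one one-against-all classifier."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def weighted_average(per_class_counts):
    """Average per-class scores, weighting each class by its positive images."""
    supports = [tp + fn for tp, _, fn in per_class_counts]
    scores = [prf(*counts) for counts in per_class_counts]
    total = sum(supports)
    return tuple(sum(s[i] * n for s, n in zip(scores, supports)) / total
                 for i in range(3))


# Hypothetical (TP, FP, FN) counts for a frequent and a rare concept.
counts = [(30, 10, 10), (2, 2, 8)]
precision, recall, f_measure = weighted_average(counts)
```

Because the F-measure is averaged per class rather than recomputed from the averaged precision and recall, the three reported numbers need not satisfy the harmonic-mean relation exactly.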
Experimental results are presented in Table 2, which shows the performance measures of all three evaluated strategies. The best overall performance is obtained by the optimal combination of features using the histogram intersection kernel as the underlying kernel function. The histogram intersection kernel largely outperforms the identity kernel under both the optimal and the simple combination strategies. Besides, the optimal combination strategy shows a clear improvement in annotation performance regardless of the underlying kernel function. In the histogram intersection kernel case, this improvement is about 7% in f-measure (0.48 against 0.45). In terms of recall, the best model is outperformed by the metafeatures model, which is able to annotate more positive images. However, the precision of that model is about 50%, indicating that, for each correct annotation, another wrong annotation is also generated. In terms of f-measure, both models have an equivalent performance. One outperforms the other in terms of precision, but the situation is reversed when recall is evaluated. With this evidence, it is not possible to declare a clear winner; doing so requires evaluating the retrieval performance of both models, which is the objective of the next subsection.

Fig. 3: Recall vs Precision graph with the response of the low-level features using histogram metrics to rank results. (Axes: Recall, Precision. Series: Sobel, RGB, Tamura, Invariants, Gray, LBP.)
6.3 Image Retrieval
To evaluate the performance of the retrieval module, images in the test set are used to query the system, which amounts to approximately 160 different queries. Standard performance measures are used to evaluate the system response. Müller et al. [20] suggest a framework to evaluate and report the performance of image retrieval systems, which is followed in this experimentation. Evaluation measures include recall, precision, Mean Average Precision (MAP), rank of the first relevant image, average rank, and recall vs. precision plots. Reported values are averages over the 160 test queries. The evaluation of the image retrieval system covers two main strategies to search for similar images: using low-level visual features and using semantic annotations. For each strategy, different methods are evaluated to determine the best configuration. The following subsections present the results of this evaluation.

6.3.1 Visual-Feature Retrieval Performance

A baseline model using similarity functions for low-level image features is included to compare experimental results. The model based on low-level features calculates the similarity between histograms to produce an image ranking. Individual features have been evaluated using the most appropriate metric according to previous evaluations on the same dataset [6]. Figure 3 shows the recall vs. precision plot for these configurations, suggesting that Sobel and Color Histograms offer a better response. In contrast, Gray Histogram and Local Binary Patterns have the poorest results. In general, the precision of all visual features decreases very fast as they return more images. This means that relevant images are not located in the first result positions, but are mixed with many irrelevant images.

Tab. 3: Performance measures of the system response using low-level features to rank results. Best scores are in bold.

Measure         SOB     TAM     RGB     INV     GRA     LBP
MAP             0.101   0.098   0.096   0.095   0.094   0.077
Rank1           11.71   10.95   11.80   11.86   12.26   19.55
Rank (average)  588.59  600.48  608.66  610.87  601.67  713.73
P(n = 1)        0.56    0.44    0.48    0.35    0.34    0.19
P(n = 20)       0.26    0.21    0.20    0.19    0.18    0.12
P(n = 100)      0.15    0.14    0.13    0.13    0.12    0.09
R(n = 20)       0.06    0.05    0.05    0.04    0.04    0.02
R(n = 100)      0.16    0.14    0.13    0.13    0.13    0.09

Table 3 presents specific measures to compare the response of the low-level features. The best MAP score, about 0.101, is reached by the Sobel Histogram. The table also supports the main tendency of the plot, showing that the Sobel Histogram has the best response for almost all performance measures.
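The ranking-based measures used in this evaluation can be sketched as follows for ranked result lists in which each position is flagged as relevant or not. The function names and the two toy rankings are hypothetical, and the average precision shown here assumes that every relevant image appears somewhere in the ranked list, as is the case when the whole collection is ranked.

```python
def average_precision(relevance):
    """Average precision of a ranked list of booleans (True = relevant)."""
    hits, total = 0, 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / position  # precision at this relevant position
    return total / hits if hits else 0.0


def precision_at(relevance, n):
    """P(n): fraction of the first n results that are relevant."""
    return sum(relevance[:n]) / n


def recall_at(relevance, n):
    """R(n): fraction of all relevant images found in the first n results."""
    relevant = sum(relevance)
    return sum(relevance[:n]) / relevant if relevant else 0.0


def first_relevant_rank(relevance):
    """1-based position of the first relevant result, or None if there is none."""
    return next((i for i, r in enumerate(relevance, start=1) if r), None)


def mean_average_precision(rankings):
    """MAP: mean of the average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries, each with a ranked list of 4 results.
rankings = [
    [True, False, True, False],
    [False, True, False, False],
]
map_score = mean_average_precision(rankings)
```

Averaging each per-query value over all test queries gives figures comparable in kind to those reported in the tables of this section.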
Fig. 4: Recall vs Precision graph showing the system response of all semantic-based models and the best visual-based model. (Axes: Recall, Precision. Series: Histogram Intersection, MetaFeatures, Identity Kernel, Sobel.)

6.3.2 Semantic Retrieval Performance
The proposed semantic retrieval framework is based on automatic image annotations, in which no human intervention is needed to describe image contents. The system starts by annotating the query image and then retrieves images with similar annotations. Both the query images and the database images have been automatically annotated by the system. Since these annotations rely on the kernel function used for classification, the retrieval system is evaluated according to the kernel strategy that generates them. In this case, three kernel functions are evaluated: the histogram intersection kernel, the identity kernel and the RBF kernel for metafeatures. The first two kernel functions have been used to generate the image representation under an optimal combination scheme.

The retrieval response for all semantic models is presented in Figure 4 using recall vs. precision plots. As a baseline for all semantic-based evaluations, the retrieval performance of Sobel features is included, since it has been the best ranked among all visual-only strategies. The Figure shows the clear difference between semantic and visual strategies. The identity kernel, which is the worst semantic model, shows an improvement of about 50% in MAP with respect to Sobel features. The main difference between semantic and visual models is that the precision of the latter drops very early. On the contrary, semantic models maintain high precision values, which decrease very smoothly until the final stage of the retrieval process. This indicates that semantic strategies usually rank relevant images in the first positions of the result list. The model with the best performance is the optimal combination of histogram intersection kernels; it clearly outperforms the other retrieval models. This is consistent with the automatic annotation results presented in the previous Subsection, in which this strategy obtained the best results. So, the better the semantic annotations, the better the retrieval response.

Tab. 4: Performance measures for all semantic-based models. (K∩: histogram intersection kernel, Meta: RBF kernel on metafeatures, KI: identity kernel, SOB: Sobel visual baseline.)

Measure         K∩     Meta   KI     SOB
MAP             0.23   0.21   0.15   0.10
Rank1           16.14  21.55  22.28  11.71
Rank (average)  256.3  298.2  376.0  588.59
P(n = 1)        0.59   0.50   0.46   0.56
P(n = 20)       0.53   0.46   0.34   0.26
P(n = 100)      0.39   0.35   0.27   0.15
R(n = 20)       0.13   0.11   0.08   0.06
R(n = 100)      0.40   0.37   0.28   0.16

Detailed performance measures are shown in Table 4 for the semantic models and the visual baseline. Again, the histogram intersection kernel obtains the best results for all these performance measures except Rank1, where the Sobel baseline places the first relevant image earlier (11.71 against 16.14). One of the most significant measures, often used to compare the performance of information retrieval systems, is MAP, in which the proposed strategy scores more than twice as high as the visual strategy. Moreover, compared with the RBF kernel and the identity kernel, the histogram intersection kernel shows an improvement in MAP of about 10% and 53% respectively. Other measures show the same tendency.

An example of the results obtained using visual features and semantic annotations is illustrated in Figure 5. The query image is the first from left to right, and it is used to search for images exhibiting the lymphocyte infiltrate concept. The top five results are presented immediately after the query image, marked with blue squares if they are relevant or red squares if they are not. Observe that the results obtained using Sobel features as the retrieval strategy are visually homogeneous and very similar to the query. However, the last three results are not relevant because they do not exhibit the target concept.
On the other hand, the results obtained with the semantic annotations are all relevant, although their appearance differs from that of the query. In summary, the response of the retrieval system is more appropriate when it is configured to search images using semantic annotations than when it uses only low-level features. In addition, the histogram intersection kernel produces semantic annotations that are more accurate for finding relevant images than those produced by the baseline models. The proposed kernel framework to annotate and retrieve images has proven to be effective.

Fig. 5: Illustration of a content-based query. (a) Results obtained when the system uses only visual information. (b) Results obtained when the system uses semantic annotations. The query is the first image from left to right. The top-5 results are shown in order of relevance from left to right. Results are marked with blue if they are relevant and with red if they are not. The query image is used to search for images with lymphocyte infiltrate.
7 Conclusions and Future Work
The main contribution of this paper is a framework for automatic image annotation and content-based image retrieval in the medical domain. The framework was successfully used as the basis for building a histopathology content-based image retrieval system. This kind of system facilitates the task of searching and retrieving specific information in a large image database. This is especially useful in the medical context, where decisions are increasingly supported by evidence.

One of the most important characteristics of the proposed framework is that it is based on state-of-the-art machine learning techniques. The framework includes a process to optimally combine different visual features in a kernel function that generates a new feature space to represent images. In the induced feature space, SVM classifiers are learned to recognize application domain concepts in unlabeled images. A similarity measure is evaluated using the generated annotations to estimate the degree of likeness between two given images. This semantic similarity measure is used to retrieve relevant results for particular queries.

The proposed kernel-based framework has been shown to be effective for automatically annotating histopathology images. The combination of kernels carrying information from different visual features has proven useful as a technique to improve annotation accuracy. This model has been compared with two baseline approaches: semantic annotations generated by SVMs with standard Gaussian kernels, and a similarity measure using only visual information. The proposed approach offers a system response that outperforms the baseline methods. The retrieval performance depends on the quality of the automatic annotations, so the better the classification system, the higher the precision of search results. We argue that the good performance of the proposed annotation system is due to two main aspects: (1) the use of appropriate kernel functions for the particular data structure representing image features, i.e. histograms, and (2) the optimal combination of all kernel functions associated with visual features.

In our experiments, visual features are histograms with global information about image contents. However, the proposed framework could be extended to use other kinds of visual features such as segmented regions, shapes or even tree representations, as long as an appropriate kernel is used to evaluate the similarity between those structures. Our framework will then adapt a new kernel function that optimally weights all available image representations according to the particular characteristics of each class. The proposed system can be easily adapted to manage other kinds of medical images. Our future work will explore this.

An important challenge is to automate the labeling of an initial set of images, which is required by the training phase of the semantic annotation model. This can be done by automatically identifying important concepts in medical texts such as medical articles or health records, evaluating synonyms or using ontologies. This is also part of our future work.

Content-based image retrieval systems could be a valuable tool in daily clinical routines, research activities and medical schools. The implementation and adoption of effective systems in real-world scenarios may promote the use of the large medical image collections that are now available. Hospital Information Systems have often been oriented to deliver the required information at the right time, in the right place, to the right persons, to support quality and efficiency in health-care practice [19], and CBIR systems would allow that for an important part of such information: medical images.
References
[1] A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, 3:513–16, 2003.
[2] Annalisa Barla, Emanuele Franceschi, Francesca Odone, and Alessandro Verri. Image kernels. Pattern Recognition with Support Vector Machines, LNCS 2388:617–628, 2002.
[3] Andrew P. Berman and Linda G. Shapiro. A flexible image database system for content-based retrieval. Computer Vision and Image Understanding, 75, 1999.
[4] Anna Bosch, Xavier Muñoz, and Robert Martí. Which is the best way to organize/classify images by content? Image and Vision Computing, 25(6):778–791, June 2007.
[5] Juan C. Caicedo, Fabio A. Gonzalez, and Eduardo Romero. A semantic content-based retrieval method for histopathology images. Information Retrieval Technology, LNCS 4993:51–60, 2008.
[6] Juan C. Caicedo, Fabio A. Gonzalez, Edwin Triana, and Eduardo Romero. Design of a medical image database with content-based retrieval capabilities. Advances in Image and Video Technology, LNCS 4872:919–931, 2007.
[7] Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems 14, volume 14, pages 367–373, 2002.
[8] M. Datar, D. Padfield, and H. Cline. Color and texture based segmentation of molecular pathology images using hsoms. In Biomedical Imaging: From Nano to Macro, 2008. ISBI 2008. 5th IEEE International Symposium on, pages 292–295, 2008.
[9] Thomas Deselaers. Features for Image Retrieval. PhD thesis, RWTH Aachen University. Aachen, Germany, 2003.
[10] Thomas Deselaers, Daniel Keysers, and Hermann Ney. Fire - flexible image retrieval engine: Imageclef 2004 evaluation. Multilingual Information Access for Text, Speech and Images, pages 688–698, 2005.
[11] Christopher D. M. Fletcher. Diagnostic Histopathology of tumors. Elsevier Science, 2003.
[12] Mark O. Gueld, Daniel Keysers, Thomas Deselaers, Marcel Leisten, Henning Schubert, Hermann Ney, and Thomas M. Lehmann. Comparison of global features for categorization of medical images. Medical Imaging, 5371:211–222, 2004.
[13] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Optimizing kernel alignment over combinations of kernel. Technical report, Department of Computer Science, Royal Holloway, University of London, UK, 2002.
[14] R. W. K. Lam, H. H. S. Ip, K. K. T. Cheung, L. H. Y. Tang, and R. Hanka. A multi-window approach to classify histological features. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 2, pages 259–262, 2000.
[15] Thomas M. Lehmann, Henning Schubert, Daniel Keysers, Michael Kohnen, and Berthold B. Wein. The irma code for unique classification of medical images. Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation, 5033(1):440–451, 2003.
[16] Mark S. Nixon and Alberto S. Aguado. Feature Extraction and Image Processing. Elsevier, 2002.
[17] H. Müller, C. Lovis, and A. Geissbuhler. The medgift project on medical image retrieval. In Proceedings of First International Conference on Medical Imaging and Telemedicine, Wuyi Mountain, China, 2005.
[18] Henning Müller, Thomas Deselaers, Thomas Deserno, Paul Clough, Eugene Kim, and William Hersh. Overview of the imageclefmed 2006 medical retrieval and medical annotation tasks. In Evaluation of Multilingual and Multi-modal Information Retrieval, pages 595–608. Springer, 2007.
[19] Henning Müller, Nicolas Michoux, David Bandon, and Antoine Geissbuhler. A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics, 73(1):1–23, February 2004.
[20] Henning Müller, Wolfgang Müller, David M. Squire, Stéphane Marchand-Maillet, and Thierry Pun. Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognition Letters, 22(5):593–601, April 2001.
[21] Xiaojun Qi and Yutao Han. Incorporating multiple svms for automatic image annotation. Pattern Recognition, 40(2):728–741, February 2007.
[22] Bernhard Schölkopf and Alexander Smola. Learning with kernels. Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, 2002.
[23] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[24] Chi-Ren Shyu, Carla Brodley, Avi Kak, Akio Kosaka, Alex M. Aisen, and Lynn S. Broderick. Assert: A physician-in-the-loop content-based retrieval system for hrct image databases. Computer Vision and Image Understanding, 75:111–132, 1999.
[25] Sven Siggelkow. Feature Histograms for Content-Based Image Retrieval. PhD thesis, Albert-Ludwigs-Universität Freiburg im Breisgau, 2002.
[26] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349–1380, December 2000.
[27] M. Szummer and R. W. Picard. Indoor-outdoor image classification. Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on, pages 42–51, 1998.
[28] H. L. Tang, R. Hanka, and H. H. S. Ip. Histological image retrieval based on semantic content analysis. Information Technology in Biomedicine, IEEE Transactions on, 7(1):26–36, 2003.
[29] C. S. M. Wong, R. C. Strange, and J. T. Lear. Basal cell carcinoma. BMJ, 327:794–798, 2003.
[30] Xiaoqian Xu, Dah-Jye Lee, S. Antani, and L. R. Long. A spine x-ray image retrieval system using partial shape matching. Information Technology in Biomedicine, IEEE Transactions on, 12(1):100–108, 2008.
[31] Lei Zheng, A. W. Wetzel, J. Gilbertson, and M. J. Becich. Design and analysis of a content-based pathology image retrieval system. Information Technology in Biomedicine, IEEE Transactions on, 7(4):249–255, 2003.