
Saliency-Guided Unsupervised Feature Learning for Scene Classification

Fan Zhang, Bo Du, Member, IEEE, and Liangpei Zhang, Senior Member, IEEE

Abstract—Due to the rapid development of various satellite sensors, a huge volume of high-resolution image data sets can now be acquired. How to efficiently represent and recognize the scenes from such high-resolution image data has become a critical task. In this paper, we propose an unsupervised feature learning framework for scene classification. By using a saliency detection algorithm, we extract a representative set of patches from the salient regions in the image data set. These unlabeled data patches are exploited by an unsupervised feature learning method to learn a set of feature extractors which are robust and efficient and do not need elaborately designed descriptors such as the scale-invariant-feature-transform-based algorithm. We show that the statistics generated from the learned feature extractors can characterize a complex scene very well and can produce excellent classification accuracy. In order to reduce overfitting in the feature learning step, we further employ a recently developed regularization method called "dropout," which has proved to be very effective in image classification. In the experiments, the proposed method was applied to two challenging high-resolution data sets: the UC Merced data set containing 21 different aerial scene categories with a submeter resolution and the Sydney data set containing seven land-use categories with a 60-cm spatial resolution. The proposed method obtained results that were equal to or even better than the previous best results with the UC Merced data set, and it also obtained the highest accuracy with the Sydney data set, demonstrating that the proposed unsupervised-feature-learning-based scene classification method provides more accurate classification results than the other latent-Dirichlet-allocation-based methods and the sparse coding method.

Index Terms—Autoencoder, saliency detection, scene classification, unsupervised feature learning.

Manuscript received March 1, 2014; revised June 17, 2014 and August 6, 2014; accepted September 3, 2014. This work was supported in part by the National Basic Research Program of China (973 Program) under Grants 2011CB707105 and 2012CB719905 and in part by the National Natural Science Foundation of China under Grants 41431175 and 61471274. F. Zhang and L. Zhang are with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]). B. Du is with the School of Computer Science, Wuhan University, Wuhan 430072, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TGRS.2014.2357078

I. INTRODUCTION

SATELLITE imaging sensors can now acquire images with a spatial resolution of up to 0.41 m. These images, which are usually called very high resolution (VHR) images, have abundant spatial and structural patterns. However, due to the huge volume of the image data, it is difficult to directly access

the VHR data containing the scenes of interest. Due to the complex composition and the large number of land-cover types, efficient representation and recognition of the scenes from VHR data have become challenging problems, which have drawn great interest in the remote sensing field [1].

In order to recognize and analyze scenes from VHR images, various scene classification methods have been proposed over the years. One particular method, the bag-of-visual-words (BoVW) model [2], has been successfully utilized for scene classification. The basic BoVW approach can be divided into two parts: feature learning and feature encoding [1]. Feature learning consists of clustering the local patches and using the resulting clusters as their representatives, which are commonly known as "visual words." This set of visual words then forms a codebook which can be used for the image encoding. In the feature encoding step, the images are finally represented by the unordered collections of the visual words and the histograms of the visual-word occurrences [3]. Although the BoVW approach is highly efficient, it does not consider the spatial and structural information in the VHR image, which severely limits its ability to handle complex image scenes.

The spatial pyramid matching kernel (SPMK), introduced by Lazebnik et al. [4], is a simple and computationally efficient extension of the BoVW-based image representation approach. SPMK computes the local visual-word histograms at different scales and concatenates the spatial bins defined by the spatial pyramid representation, to produce better scene representations. Recently, Yang and Newsam [5] computed the co-occurrence of the visual words and combined this with the BoVW approach, and they reported a higher classification accuracy than the traditional BoVW and SPMK for their extended spatial co-occurrence kernel (SPCK++) approach [4]. However, both SPMK and SPCK++ need to be used with the intersection kernel and the chi-squared kernel to achieve a good performance; these are nonlinear Mercer kernels and result in high computational complexities compared to linear kernels.

Differing from SPMK and SPCK++, some authors have applied latent Dirichlet allocation (LDA) [6]–[8], a hierarchical probabilistic model, for unsupervised learning of the topic allocation, based on the visual-word distribution. Under these models, each image is represented as a finite mixture of an intermediate set of topics, which are expected to summarize the scene information [9]. More recently, there have been several LDA extensions which consider the LDA as an unsupervised topic feature learning method. One extension is to apply a classifier, such as a support vector machine (SVM), to the topic representation of the original image data set [10], [11]. We refer to


these as discriminant extensions, and we refer to the combination of SVM with LDA as SVM-LDA. However, all of the aforementioned approaches rely on k-means clustering to map the local patches of the images to the visual words, and they are limited in their feature representation for classification [1].

In order to overcome the limitations of the traditional methods, the most recent work in image classification has been focused on the unsupervised learning of good feature representations from unlabeled input data. Hinton and Salakhutdinov [12] used deep neural networks to learn low-dimensional feature representations to reduce the dimensionality of the data. Sparse coding is another famous unsupervised feature learning method, which is highly effective for scene classification when compared to the traditional BoVW-based approaches [13], [14]. Sparse coding generates a set of basis functions from the unlabeled data, and the image scenes are encoded in terms of the basis functions to generate sparse feature representations. Recently, Cheriyadat [1] proposed a method combining scale-invariant-feature-transform (SIFT)-based feature descriptors and sparse coding (Sift + SC), which has outperformed most of the other state-of-the-art methods; however, the method still needs low-level feature measurements such as oriented filter responses or SIFT feature descriptors, which are elaborately and manually designed and ignore the characteristics of the original data set. How to achieve the unsupervised learning of good feature representations from input data, particularly from a large amount of unlabeled data, is still a critical task for scene analysis in VHR images.

In this paper, we propose a new unsupervised feature learning framework to automatically learn feature descriptors from the unlabeled data. In order to extract more efficient and robust features to classify the image scenes, we also combine the framework with a new technique called "dropout" and with data augmentation to overcome data overfitting. Compared to the traditional BoVW-based approaches using a random sampling strategy, we introduce a saliency-guided sampling strategy to sample a more representative set of patches from the image, which considers the salient regions in the image that contain the major scene information.

The major contributions of this paper are as follows.

1) We propose a saliency-guided sampling strategy to extract a representative set of patches from a VHR image, so that the salient parts of the image, which contain the representative information in the VHR image, can be explored. This differs from the traditional random sampling strategy.

2) We introduce an unsupervised feature learning method called the sparse autoencoder to VHR image classification, and we perform extensive experiments to evaluate its efficiency and accuracy with diverse data sets, including an aerial image and a satellite image.

3) We explore the new "dropout" technique and data augmentation in the feature learning procedure to reduce data overfitting and generate good feature descriptors that do not need elaborately designed descriptors such as the SIFT-based algorithm.

The rest of this paper is organized as follows. In Section II, we briefly introduce the saliency-guided sampling strategy for a high-resolution image scene. In Section III, we describe the unsupervised feature learning approach in detail. Section IV provides the overall classification framework and the preprocessing method used to train the feature extractor. The details of our experiments and the results are presented in Section V. Finally, Section VI concludes this paper with a discussion of the results and our ideas for future work.

II. SALIENCY-GUIDED SAMPLING STRATEGY

BoVW is a very popular approach for content-based image classification because of its simplicity and good performance [15]. BoVW evolved from texture analysis and relies on the stationary property of natural images, which means that the statistics of one part of the image are the same as those of any other part. We can therefore treat the image scene as a loose collection of independent patches which have different structures and textural information, and we sample a representative set of patches from the image scene. It is a critical task for any bag-of-features method to sample a representative set of patches or to generate visual words from the image. Naturally, we should focus attention on the image regions that are most informative for representing the scene information in the image.

Many authors have used randomly or densely sampled patches, which assumes that the different regions of the image carry the same amount of information. Instead of using randomly or densely sampled patches, as in [16], we adopt a saliency detection algorithm to guide the sampling task. Saliency detection assumes that a region of interest is generally salient in an image. We use a new type of saliency detection, context-aware saliency, which directly gives the probability of salient regions in the image that are distinctive with respect to both their local and global surroundings. This method unifies local and global saliency by measuring the similarity between each image patch and the other image patches, both locally and globally, enabling us to extract the representative patches from the image.

Let $d_{color}(x_i, x_j)$ denote the Euclidean distance between the patches $x_i$ and $x_j$ in the International Commission on Illumination (CIE) color space, normalized to the range [0, 1], and let $d_{position}(x_i, x_j)$ be the Euclidean distance between the positions of patches $x_i$ and $x_j$, normalized by the corresponding image dimension, width or height, to the range [0, 1]. Based on the aforementioned observations, we define the dissimilarity measure between a pair of patches as

$$ d(x_i, x_j) = \frac{d_{color}(x_i, x_j)}{1 + c \cdot d_{position}(x_i, x_j)} \qquad (1) $$

where c is a constant (c = 3 in this paper). A patch $x_i$ is considered salient when $d(x_i, x_j)$ is high with respect to the other patches. Hence, for every patch $x_i$, we search for the K most similar patches in the image (if even the most similar patches are highly different from $x_i$, then clearly all image patches are highly different from $x_i$). The saliency value of $x_i$ is therefore defined as

$$ S^i = 1 - \exp\left\{ -\frac{1}{K} \sum_{k=1}^{K} d\left(x^i, x^k\right) \right\}. \qquad (2) $$
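As an illustration, a minimal NumPy sketch of (1) and (2) is given below. It assumes that the patch colors and positions have already been normalized to [0, 1] as described above; the function names and the choice of K = 64 most similar patches are purely illustrative, and the multiscale extension of [17] is omitted.

```python
import numpy as np

def patch_dissimilarity(color_i, color_j, pos_i, pos_j, c=3.0):
    """Dissimilarity d(x_i, x_j) of (1): normalized CIE color distance divided by
    one plus c times the normalized positional distance."""
    d_color = np.linalg.norm(np.asarray(color_i) - np.asarray(color_j))
    d_position = np.linalg.norm(np.asarray(pos_i) - np.asarray(pos_j))
    return d_color / (1.0 + c * d_position)

def saliency_value(i, colors, positions, K=64, c=3.0):
    """Single-scale saliency S^i of (2): one minus the exponential of the negative
    mean dissimilarity between patch i and its K most similar patches."""
    d = np.array([patch_dissimilarity(colors[i], colors[j], positions[i], positions[j], c)
                  for j in range(len(colors)) if j != i])
    k_most_similar = np.sort(d)[:K]   # smallest dissimilarities = most similar patches
    return 1.0 - np.exp(-k_most_similar.mean())
```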


Fig. 1. Saliency detection results for five satellite image scenes.

Furthermore, we also use multiscale saliency to measure the saliency of a patch in a multiscale image. In this paper, we use four scales: 100%, 80%, 50%, and 30% of the original image. More details can be found in [17].

We compute the salient regions in the VHR image data set. Fig. 1 shows the saliency detection results for six VHR images. From Fig. 1, we can see that, in the saliency map, different regions such as buildings, ships, and airplanes all correspond to the high-value regions. In addition, for the airport and the harbor images, the salient regions naturally correspond to the parts which carry the most representative information, such as the airplane or the large white ship. This illustrates the benefit of using saliency detection: It helps to identify the regions and parts with the most representative information and guides the sampling task to sample a more representative set of patches.

Although the salient regions usually correspond to the scenes, not all image categories satisfy this assumption. For example, for the agricultural image in Fig. 1, the salient regions are mainly found in the bottom-right area, and we can see that the agricultural scene has a highly repeated texture, so it is difficult to define the salient regions. Furthermore, in the residential image, although the salient regions mainly cover regions such as the different rectangular buildings, which are highly related to "residential," they cannot capture the complete scene information as a whole because the "residential" image is a mixture of different patterns, including roads, buildings, and trees. To avoid missing the nonsalient regions corresponding to the scenes, we also randomly sample some image patches from the nonsalient regions. The nonsalient regions are often relatively uniform, with less variation in the image scene (as in the airplane image, in which the nonsalient regions mainly correspond to the relatively uniform runway), so a smaller number of patches are sampled from these regions. As shown in Fig. 2, each bag of patches constructed in this way consists of patches from both the salient and nonsalient regions in each image.

Thus, for P images, we can extract m patches, where m is the number of patches sampled from the whole image data set. We first randomly select one image from the P images, and we then extract one patch at a time from the image until we obtain enough patches. Each patch has a dimension of w × w and has three bands (R, G, and B), with w referred to as the "receptive field size." Each w × w patch can be represented as a vector in $\mathbb{R}^N$ of the pixel intensity values, with N = w × w × 3. A data set $X \in \mathbb{R}^{N \times m}$ can thus be constructed, where each column denotes a patch $x^i \in \mathbb{R}^N$.
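The saliency-guided sampling described above can be sketched as follows. This is only an illustrative implementation under stated assumptions: the images are RGB arrays with values in [0, 1], one saliency map per image is available, the 0.7/0.3 saliency thresholds and the 75% share of salient patches follow the experimental setup in Section V-B, and the function name and the fallback to a random location are our own choices.

```python
import numpy as np

def sample_patches(images, saliency_maps, m, w=10, salient_frac=0.75, seed=0):
    """Build the patch matrix X of size (N, m), N = w*w*3, by drawing w-by-w RGB
    patches whose top-left corners fall in salient regions (saliency > 0.7) with
    probability `salient_frac`, and in nonsalient regions (saliency < 0.3) otherwise."""
    rng = np.random.default_rng(seed)
    N = w * w * 3
    X = np.empty((N, m))
    for t in range(m):
        idx = rng.integers(len(images))                  # randomly pick one image
        img, sal = images[idx], saliency_maps[idx]       # img: (H, W, 3), sal: (H, W)
        region = sal[: img.shape[0] - w + 1, : img.shape[1] - w + 1]
        mask = region > 0.7 if rng.random() < salient_frac else region < 0.3
        ys, xs = np.nonzero(mask)
        if ys.size == 0:                                  # fall back to a random location
            ys = rng.integers(region.shape[0], size=1)
            xs = rng.integers(region.shape[1], size=1)
            k = 0
        else:
            k = rng.integers(ys.size)
        y, x = ys[k], xs[k]
        X[:, t] = img[y:y + w, x:x + w, :].reshape(-1)    # each column is a vectorized patch
    return X
```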

Fig. 2. Sampling salient patches from the image scene.

III. UNSUPERVISED FEATURE LEARNING

An unsupervised feature learning algorithm can be used to discover features in unlabeled data. In detail, the features can be learned from the representative set of patches sampled from the image, and we can then use these features across the whole image. In this paper, we view an unsupervised feature learning algorithm as a "feature extractor" that takes the bag of patches X and outputs a function $f: \mathbb{R}^N \to \mathbb{R}^K$ that maps an input vector $x^i$ to a new feature vector of K features, where K is a parameter of the algorithm.

A. Sparse Autoencoder

An autoencoder is a symmetrical neural network that is used to learn the features of a data set in an unsupervised manner [18]. This is done by minimizing the reconstruction error between the input data at the encoding layer and its reconstruction at the decoding layer. During the encoding step, an input vector $x^i \in \mathbb{R}^N$ is processed by applying a linear mapping and a nonlinear activation function to the network

$$ \alpha^i = f(x^i) = g(W_1 x^i + b_1) \qquad (3) $$

where $W_1 \in \mathbb{R}^{K \times N}$ is a weight matrix with K features, $b_1 \in \mathbb{R}^K$ is the encoding bias, and $g(x)$ is the logistic sigmoid function $(1 + \exp(-x))^{-1}$. We decode a vector using a separate linear decoding matrix

$$ z^i = W_2^T \alpha^i + b_2 \qquad (4) $$

where $W_2 \in \mathbb{R}^{K \times N}$ is a weight matrix and $b_2 \in \mathbb{R}^N$ is the decoding bias. The feature extractors are learned by minimizing the following cost function, in which the first term is the reconstruction error term and the second term is a regularization term (also called a weight decay term in a neural network):

$$ J(X, Z) = \frac{1}{2} \sum_{i=1}^{m} \left\| x^i - z^i \right\|^2 + \frac{\lambda}{2} \| W \|^2 \qquad (5) $$


where X and Z are the training and reconstructed data, respectively. Recall that α denotes the activation of the hidden units in the autoencoder. Thus, when the network is provided with the inputs $x^i$ from $X \in \mathbb{R}^{N \times m}$, let $\hat{\rho}_j = (1/m) \sum_{i=1}^{m} \alpha_j^i$ be the average activation of hidden unit j over the training set. We want to approximately enforce the constraint $\hat{\rho}_j = \rho$, where ρ is the sparsity parameter, which is typically a small value close to zero. In other words, we want the average activation of each hidden neuron to be close to zero, so that the hidden unit activations are mostly inactive. To achieve this, the objective in sparse autoencoder learning is to minimize the reconstruction error with a sparsity constraint, i.e., a sparse autoencoder [19]–[23]

$$ J(X, Z) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (6) $$

where β is the weight of the sparsity penalty, K is the number of features in the weight matrix, and KL(·) is the Kullback–Leibler divergence. The Kullback–Leibler divergence [24] is given by

$$ \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}. \qquad (7) $$

This penalty function has the property that $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$; otherwise, it increases monotonically as $\hat{\rho}_j$ diverges from ρ, which acts as the sparsity constraint. The algorithm is trained by optimizing the cost function (6) with respect to $W_1$, $W_2$, $b_1$, and $b_2$, where we use backpropagation [25] and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method [26] to train the model. For the cost function optimization problem (6), the learning rules for $W_1$, $W_2$, $b_1$, and $b_2$ are

$$ \Delta W_1 = \left[ W_2^T \left(z^i - \alpha^i\right) + \beta \left( -\frac{\rho}{\hat{\rho}} + \frac{1 - \rho}{1 - \hat{\rho}} \right) \right] \cdot f'\left(z^i\right) \cdot \left(x^i\right)^T \qquad (8) $$

$$ \Delta W_2 = -\left(z^i - x^i\right) \cdot \left(\alpha^i\right)^T \qquad (9) $$

$$ \Delta b_1 = \left[ W_2^T \left(z^i - \alpha^i\right) + \beta \left( -\frac{\rho}{\hat{\rho}} + \frac{1 - \rho}{1 - \hat{\rho}} \right) \right] \cdot f'\left(z^i\right) \qquad (10) $$

$$ \Delta b_2 = -\left(z^i - x^i\right). \qquad (11) $$
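For completeness, a compact NumPy/SciPy sketch of the sparse autoencoder objective (3)-(7), trained with L-BFGS, is given below. The gradient is written in the standard backpropagation form, which corresponds to the updates (8)-(11) up to notation; the weight decay on both $W_1$ and $W_2$ and the values of λ and β are illustrative assumptions, since only the sparsity parameter ρ and the feature size K are reported in Section V.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sparse_ae_cost(theta, X, K, lam, rho, beta):
    """Cost (5) plus the sparsity penalty (6)-(7) on X (N x m), with its gradient."""
    N, m = X.shape
    W1 = theta[:K * N].reshape(K, N)
    W2 = theta[K * N:2 * K * N].reshape(K, N)
    b1 = theta[2 * K * N:2 * K * N + K].reshape(K, 1)
    b2 = theta[2 * K * N + K:].reshape(N, 1)

    A = sigmoid(W1 @ X + b1)                     # hidden activations alpha^i, eq. (3)
    Z = W2.T @ A + b2                            # reconstructions z^i, eq. (4)
    rho_hat = A.mean(axis=1, keepdims=True)      # average activation of each hidden unit

    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))     # eq. (7)
    cost = (0.5 * np.sum((X - Z) ** 2)
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
            + beta * kl)                                             # eqs. (5) + (6)

    d_out = Z - X                                                    # N x m
    sparse_grad = beta / m * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    d_hid = (W2 @ d_out + sparse_grad) * A * (1 - A)                 # K x m

    grad = np.concatenate([
        (d_hid @ X.T + lam * W1).ravel(),
        (A @ d_out.T + lam * W2).ravel(),
        d_hid.sum(axis=1),
        d_out.sum(axis=1),
    ])
    return cost, grad

def train_sparse_autoencoder(X, K=1000, lam=3e-3, rho=0.4, beta=3.0, iters=400):
    """Learn W1, b1 (the feature extractor f of Section III) from the patch matrix X."""
    N = X.shape[0]
    r = np.sqrt(6.0 / (N + K + 1))
    theta0 = np.concatenate([np.random.uniform(-r, r, 2 * K * N), np.zeros(K + N)])
    res = minimize(sparse_ae_cost, theta0, args=(X, K, lam, rho, beta),
                   jac=True, method="L-BFGS-B", options={"maxiter": iters})
    W1 = res.x[:K * N].reshape(K, N)
    b1 = res.x[2 * K * N:2 * K * N + K]
    return W1, b1
```

The learned weights, reshaped to w × w × 3, correspond to the feature extractors visualized in Figs. 8 and 12.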

It is generally accepted that the classification performance is improved by increasing the number of learned features [18], and the effect of the number of features on the classification performance when using single-layer networks has been studied in more detail in [27].

B. Feature Extraction

According to the sparse autoencoder algorithm, transforming an input patch $x^i \in \mathbb{R}^N$ into a new feature representation $\alpha^i = f(x^i) = g(W_1 x^i + b_1)$, with $\alpha^i \in \mathbb{R}^K$, yields the feature extractor function $f: \mathbb{R}^N \to \mathbb{R}^K$. This section explores the use of this feature extractor for our (labeled) training and test images, to extract new representative features for classification.

Fig. 3. Illustration showing feature extraction using a w-by-w feature extractor and a stride of s. We first extract w-by-w patches, each separated by s pixels, and then map them to K-dimensional feature vectors to form a new image representation. These vectors are then pooled over 16 quadrants of the image to form a feature vector for classification.

1) Image Convolution: Given a w-by-w image patch, we can now extract a representation $\alpha^i \in \mathbb{R}^K$ for that patch by using the learned feature extractor $f: \mathbb{R}^N \to \mathbb{R}^K$. We then define a new representation of the entire image by applying this feature extractor to each image. Specifically, given an image of n-by-n pixels (with three channels: R, G, and B), we can define an (n − w + 1)-by-(n − w + 1) representation (with K channels) by computing the representation $\alpha^i \in \mathbb{R}^K$ for each w-by-w "subpatch" of the input image. More formally, we denote by $\alpha^{(ij)}$ the K-dimensional feature extracted from location (i, j) of the input image. For computational efficiency, we can also convolve our n-by-n image with a step size (or "stride") greater than one across the image. This is illustrated in Fig. 3.

2) Feature Pooling: After feature extraction, the new feature representation for an image scene will usually have a very high dimensionality. For computational efficiency and storage reasons, it is standard practice to use a max-pooling strategy to reduce the dimensionality of the image representation. For a stride of s = 1, our feature mapping produces an (n − w + 1)-by-(n − w + 1)-by-K representation. We can reduce this by finding the maximum of $\alpha^{(ij)}$ over local regions. This procedure is commonly used (in many variations) in computer vision [4], as well as in deep feature learning [28]. In the proposed method, we use a simple form of max pooling. Specifically, we split the feature map into 16 equal-sized quadrants and compute the maximum of $\alpha^{(ij)}$ in each quadrant. This yields a reduced (K-dimensional) representation of each quadrant, for a total of 16K features, which we can use for classification.

IV. SCENE CLASSIFICATION VIA SVM

This section describes the proposed method for scene classification. The method that we use consists of four main steps, as illustrated in Fig. 4: 1) patch sampling; 2) unsupervised feature learning; 3) feature extraction; and 4) classification.

Fig. 4. Overall architecture of the proposed method.

A. Overall Architecture

We now describe the overall architecture of the proposed method. As depicted in Fig. 4, the method consists of four parts.

1) First, we extract the patches using saliency detection, which takes into account the information in the dominant regions in each image. Each patch has a dimension of

w × w and has three bands (R, G, and B), with w referred to as the "receptive field size." Each w × w patch can be represented as a vector in $\mathbb{R}^N$ of the pixel intensity values, with N = w × w × 3. A data set of m sampled patches can thus be constructed. The data set is then fed into a K-hidden-unit network, which is used for the unsupervised learning of K feature extractors, according to the sparse autoencoder. Furthermore, using a recently introduced technique called "dropout," the network can learn more robust and efficient feature extractors. Unlike other scene classification frameworks that use a SIFT-based algorithm to extract features from the image and then adopt unsupervised feature learning on the extracted features [1], our method automatically learns feature extractors from the image without sophisticated manually designed feature extractors such as the SIFT-based algorithm.

2) After the unsupervised feature learning, we can extract features from the training and test images using the learned feature extractor. By using the max-pooling method, we can dramatically reduce the feature number and decrease the computational cost.

3) Finally, the proposed method is combined with a linear SVM (linear kernel) to predict the scene label. In the case of multiclass predictions, we adopt a one-against-one strategy, where multiple binary classifiers are trained on the data from each pair of classes. We use the LIBSVM implementation for the SVM classification [29]. In addition, the regularization parameter of the linear SVM classifier is determined by fivefold cross-validation over the grid $[2^{-2}, 2^{-1}, \ldots, 2^{11}, 2^{12}]$.

B. Reducing Overfitting

The proposed method has thousands of feature extractors to train, which results in the new feature representation for each

image having a very high dimensionality. Consequently, it is difficult to learn so many feature extractors without considerable data overfitting, and it is not easy to train the SVM classifier with limited training samples. We describe in the following the two primary ways in which we combat overfitting.

1) Data Augmentation: Using label-preserving transformations to artificially enlarge the data set is the easiest and most common method of reducing the overfitting of image data [30], [31]. We employ two distinct forms of data augmentation, both of which produce transformed images from the original images with very low computational cost and complexity. We therefore do not need to store the transformed images on disk, so these data augmentation schemes are, in effect, computationally free. The data augmentation consists of generating image translations and rotations. Supposing that the training image is n by n, we do this by extracting five 0.8n-by-0.8n subimages (the four corner patches and the center patch) from the original image and rotating the subimages by 90°, 180°, and 270°. This increases the size of the training set by a factor of 20, although the resulting training examples are highly correlated. Without this scheme, our SVM classifier suffers from substantial overfitting, which would have forced us to use a much smaller number of feature extractors. At testing time, the network makes a prediction by extracting the same five subimages (the four corner patches and the center patch) as well as their rotations (hence, 20 subimages in all) and averaging the predictions made by the SVM on the 20 subimages.

2) Dropout: Combining the predictions of many different models or classifiers is a very successful way to increase the test accuracy [32]–[34]; however, it is too expensive for the proposed framework, as it would take several hours to train different networks which contain different feature extractors and SVM classifiers. The recently developed "dropout" technique, which consists of setting to zero the output of each hidden neuron with a probability of 50% [35], is a very efficient model combination method that only costs about a factor of two during training. The neurons which are "dropped out" in this way do not participate in the forward pass and do not contribute to the backpropagation. The neural network samples a different architecture every time a new input is presented, but all these architectures share the same weights. This technique reduces the complex coadaptations of neurons, since a neuron cannot rely on the presence of other particular neurons [36]; it is, therefore, forced to learn more robust feature extractors that are useful in combination with many different random subsets of the other feature extractors. In the testing procedure, we use all of the feature extractors to extract new feature representations from the training and test images.
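The two measures can be sketched as follows; the helper names are hypothetical, the 0.8 crop fraction and the 50% dropout rate follow the description above, and the usual test-time rescaling of dropout is omitted since, as described above, all feature extractors are simply used at test time.

```python
import numpy as np

def augment(image, crop_frac=0.8):
    """Generate the 20 training variants described above: five crops
    (four corners + center) of an n-by-n image, each at its four rotations
    (0, 90, 180, and 270 degrees)."""
    n = image.shape[0]
    c = int(round(crop_frac * n))
    offsets = [(0, 0), (0, n - c), (n - c, 0), (n - c, n - c),
               ((n - c) // 2, (n - c) // 2)]
    variants = []
    for y, x in offsets:
        crop = image[y:y + c, x:x + c]
        for k in range(4):                       # 0/90/180/270 degree rotations
            variants.append(np.rot90(crop, k))
    return variants                               # 5 crops x 4 rotations = 20 variants

def dropout(activations, p=0.5, rng=None):
    """Training-time dropout: zero each hidden activation with probability p."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p     # dropped neurons do not contribute
    return activations * mask
```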

V. EXPERIMENTAL SETUP AND RESULTS

In this section, we first describe the data sets used for the experiments and the parameter settings of the proposed method. The results obtained for the scene recognition of a benchmark high-resolution satellite image and an aerial image are then discussed.


TABLE I
TRAINING AND TEST SAMPLES FOR THE SYDNEY DATA SET

Fig. 5. Example images associated with the 21 land-use categories in the UC Merced data set. (1) Agricultural. (2) Airplane. (3) Baseballdiamond. (4) Beach. (5) Buildings. (6) Chaparral. (7) Denseresidential. (8) Forest. (9) Freeway. (10) Golfcourse. (11) Harbor. (12) Intersection. (13) Mediumresidential. (14) Mobilehomepark. (15) Overpass. (16) Parkinglot. (17) River. (18) Runway. (19) Sparseresidential. (20) Storagetanks. (21) Tenniscourt.

Fig. 6. (a) Whole image for scene classification. (b) Example images associated with the seven land-use categories from the image: (1) Industrial, (2) ocean, (3) meadow, (4) river, (5) airport, (6) residential, and (7) runway.

Fig. 7. Effect of the sparsity parameter value and the feature extractor size on the classification accuracy with the UC Merced data set. (a) Feature extractor size varied over a wide range of different sizes. (b) Sparsity parameter value varied over a wide range to generate feature extractors with different degrees of sparseness.

A. Description of the Data Sets

Two different image data sets were used in the experiments. The first data set chosen for investigation was the UC Merced data set [37]. Fig. 5 shows a few example images representing the various aerial scenes that are included in this data set. The images have a resolution of 1 ft/pixel and are 256 × 256 pixels. The data set contains 21 challenging scene categories, with 100 samples per class, and includes highly overlapping classes such as denseresidential, mediumresidential, and sparseresidential, which mainly differ in the density of structures. Following the experimental setup in [1], we randomly selected 80% of the samples from each class for training and used the remaining images for testing.

The other image data set was constructed from a large satellite image of Sydney, Australia, which was acquired from Google Earth [38]. The spatial resolution of this image is about 0.5 m. The large image to be annotated was 18 000 × 14 000 pixels, as shown in Fig. 6(a). There were seven classes of training images: residential, airport, meadow, river, ocean, industrial, and runway. Fig. 6(b) shows some examples of such images. We divided the original image into 500 × 500 pixel subimages, whereby each subimage was supposed to contain a certain scene, giving a total of 1008 nonoverlapping subimages. The data set consists of not only the seven defined classes but also some other classes that have not been learned,

Fig. 8. Feature extractor $W \in \mathbb{R}^{300 \times 1000}$ obtained from the UC Merced data set. Each of the column vectors $W_j$ of length 300 is reshaped to form a 10 × 10 × 3 feature extractor.

such as the bridges and the main roads. We manually labeled 60% of the image to obtain a data set of 613 subimages. The training set for each class contained 25% of the labeled images of that class, and the remaining images were used for testing, as shown in Table I.
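For reference, the nonoverlapping tiling described above (36 × 28 = 1008 tiles of 500 × 500 pixels from the 18 000 × 14 000 image) can be produced with a few lines; the function name is illustrative.

```python
import numpy as np

def tile_image(image, tile=500):
    """Cut a large scene (e.g., 18 000 x 14 000 x 3) into nonoverlapping
    tile x tile subimages; 36 x 28 = 1008 tiles for the Sydney image."""
    rows, cols = image.shape[0] // tile, image.shape[1] // tile
    return [image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            for r in range(rows) for c in range(cols)]
```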


TABLE II
COMPARISON WITH THE PREVIOUSLY REPORTED ACCURACIES WITH THE UC MERCED DATA SET

Fig. 9. Confusion matrix showing the classification performance with the UC Merced data set: (a) Nonsaliency detection and (b) with-saliency detection. The rows and columns of the matrix denote the actual and predicted classes, respectively. The class labels are assigned as follows: 1 = Agricultural, 2 = airplane, 3 = baseballdiamond, 4 = beach, 5 = buildings, 6 = chaparral, 7 = denseresidential, 8 = forest, 9 = freeway, 10 = golfcourse, 11 = harbor, 12 = intersection, 13 = mediumresidential, 14 = mobilehomepark, 15 = overpass, 16 = parkinglot, 17 = river, 18 = runway, 19 = sparseresidential, 20 = storagetanks, and 21 = tenniscourt. The with-saliency detection feature extractor produces the best classification performance. The vertical color bar indicates the proportions of samples over the actual class total.
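The confusion matrices and producers' accuracies reported in Figs. 9, 10, 14, and 15 can be derived from the actual and predicted labels as in the following sketch (rows are actual classes and columns are predicted classes, matching the figure captions).

```python
import numpy as np

def confusion_and_producers_accuracy(y_true, y_pred, n_classes):
    """Confusion matrix (rows = actual, columns = predicted) and per-class
    producer's accuracy (correct predictions / actual class total); assumes
    every class appears at least once in y_true."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    producers = np.diag(cm) / cm.sum(axis=1)
    return cm, producers
```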

B. Experimental Setup

A representative set of patches was sampled from the data set. We set the regions with a saliency probability larger than 0.7 as the salient regions and the regions with a saliency probability smaller than 0.3 as the nonsalient regions. The extracted patches were of 10 × 10 × 3 pixels, so that each feature extractor was also of size 10 × 10 × 3. For the convolution step, we set the stride s = 1, which was shown to be the most accurate setting by Coates et al. [27]. We kept the same parameter settings for all experiments.

For learning the SVM classification model, we randomly selected a subset of images from the data set to form the training set. We then tested the learned SVM classification model on the remaining images to measure the performance. This process was repeated five times, and we recorded the average classification accuracy and the standard deviation. With the UC Merced data set, we extracted 200 000 patches from the data set, of which 75% were taken from salient regions, and we randomly selected 80% of the samples from each class to initialize the training set. For the Sydney data set, we extracted 100 000 patches from the data set, of which 75% were taken from salient regions, and we randomly selected 25% of the samples from each class to initialize the training set. For comparison, the nonsaliency method extracted the same number of patches from the data set randomly.

C. UC Merced Data Set

To measure the scene classification performance with the UC Merced data set, we first compared the classification accuracies with different sparsity parameter values and feature extractor

Fig. 10. Producers’ accuracies with the UC Merced data set for the proposed method.

sizes. In order to study the sensitivity of the sparsity parameter and feature extractor size, we varied their values over a wide range. Fig. 7 shows the classification performance with different sparsity parameter values and feature extractor sizes. The results showed that there was a wide range of sparseness values for which the classification performance was consistent, and the best classification performance was obtained at a sparsity value close to 0.4. Based on this analysis, for all the experiments with this data set, we set the sparsity value to 0.4 to generate the feature extractors. To evaluate the classification performance under different feature sizes, we measured the overall classification accuracy with the UC Merced data set for feature sizes ranging from 400 to 1200. The experimental analysis showed that a feature size of 1000 produced an excellent accuracy with this data set. For visualization of the learned feature extractors, we show the feature extractors generated from the sampled patches in Fig. 8. Here, it can be seen that the proposed method tends to learn edge and textural feature extractors.

Then, to compare the scene classification performance of the proposed approach with SPMK [4], the spatial extension of BoVW (SPCK++) reported in [5], the Sift + SC approach described in [1], and the traditional SVM-LDA method, we measured the classification performance with the challenging UC Merced data set. Of the four strategies that we tested, our saliency-detection-based method produced the best performance, as shown in Table II. We also compared the classification performances with and without saliency detection. The results illustrated that using saliency detection is an efficient way to increase the scene classification accuracy. When saliency detection was performed, we found that the proposed method yielded the best accuracy and outperformed the other scene classification approaches. The confusion matrices and overall accuracies are reported in Figs. 9 and 10, respectively. The confusion matrix generated for the saliency detection method in Fig. 9(b) shows that the classification errors were mainly from scenes that share similar structures, such as buildings, sparseresidential, and storagetanks. The proposed method


TABLE III
OVERALL ACCURACY ON THE SYDNEY DATA SET FOR THE DIFFERENT METHODS

Fig. 11. Effect of the sparsity parameter value and feature extractor size on the classification accuracy with the Sydney data set. (a) Feature extractor size varied over a wide range of different sizes. (b) Sparsity parameter value varied over a wide range to generate feature extractors with different degrees of sparseness.

Fig. 13. Classification map of a large satellite image into seven scene classes, namely, residential, airport, meadow, industrial, ocean, river, and runway, for the three methods: (a) With-saliency detection, (b) nonsaliency detection, and (c) SVM-LDA.

Fig. 12. Feature extractor $W \in \mathbb{R}^{300 \times 200}$ obtained from the Sydney data set. Each of the column vectors $W_j$ of length 300 is reshaped to form a 10 × 10 × 3 feature extractor.

also showed the highest accuracy for the classification of the agricultural, chaparral, mobilehomepark, and runway scenes, which have a regular textural and spatial structure.

D. Sydney Data Set

We first compared the classification accuracies for varied sparsity parameter values and feature extractor sizes with the Sydney data set, in the same way as before. Fig. 11 shows the classification performance at different sparsity parameter values and feature extractor sizes. The results showed that a sparsity value close to 0.2 generated the best accuracy. Based on this analysis, for all the experiments with this data set, we set the sparsity value equal to 0.2 to generate the feature extractors. To evaluate the classification performance with different feature sizes, we measured the overall classification accuracy with the Sydney data set for feature sizes ranging from 200 to 800. The experimental analysis showed that a feature size of 200 produced an excellent accuracy with this data set. We also visualize the learned feature extractors, in the same way as before, in Fig. 12, which shows similar edge and textural feature extractors.

Compared to the UC Merced data set, large values of the sparsity parameter and feature extractor size did not result in a high accuracy. This is mainly because the Sydney data set has only seven classes, so it does not need a large feature size to model the local spatial and structural information in the scene. The values of the sparsity parameter and the feature extractor size should therefore be adjusted according to the complexity of the scenes in the image.

We then compared the final classification accuracies for the saliency detection strategy and the traditional SVM-LDA method. Table III shows the average overall accuracies for the three methods. The results confirmed that using saliency detection is an efficient way to increase the classification accuracy. When saliency detection was performed, we found that the proposed method yielded the best accuracy and outperformed the SVM-LDA scene classification approach. We also obtained the classification map for the large satellite image, as shown in Fig. 13. The confusion matrices and overall accuracies are reported in Figs. 14 and 15, respectively. As expected, the majority of the confusion occurred between the airport and industrial classes, as both scenes are dominated by similar kinds of structures, such as large buildings, parking lots, and roads. This point is reflected in the accuracy results in Figs. 14 and 15.

VI. CONCLUSION

Compared to the traditional BoVW-based methods, considering the salient regions is a useful and efficient way to extract the representative patches from a VHR image data set. Unlike previous studies that have focused on elaborately


Fig. 14. Confusion matrix showing the classification performances with the Sydney data set for the three different methods: (a) SVM-LDA, (b) nonsaliency detection, and (c) with-saliency detection. The rows and columns of the matrix denote the actual and predicted classes, respectively. The class labels are assigned as follows: 1 = Residential, 2 = airport, 3 = meadow, 4 = river, 5 = ocean, 6 = industrial, and 7 = runway. Note that using saliency detection produces the best classification. The vertical color bar indicates the proportions of samples over the actual class total.

Fig. 15. Producers' accuracies with the Sydney data set for the three methods.

designed low-level feature descriptors, we have proposed a method that directly models the extracted patches to learn the feature extractors from the image scene by exploiting the local spatial and structural information. The feature extractors are generated in an unsupervised manner, and they are directly learned from the image data set. The experiments showed the following: 1) Saliency detection is a useful and efficient way to extract the representative patches from a VHR image; 2) unsupervised feature learning algorithms such as the sparse autoencoder can extract high-quality features from the VHR image, with an accuracy equal to or even greater than that of the SIFT-based descriptors, which need to be manually designed; and 3) by using the recently introduced "dropout" technique, we can learn more robust and efficient feature extractors, and the extracted features employed with a linear SVM classifier outperform the existing methods in terms of the classification accuracy.

In our future research, we plan to extend this method to learn hierarchical features of the image content, from low- to high-level feature representations, since, to model a more complex image scene, it will be necessary to generate high-level features from the images.

REFERENCES

[1] A. M. Cheriyadat, "Unsupervised feature learning for aerial scene classification," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 439–451, Jan. 2014.
[2] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. 9th IEEE Int. Conf. Comput. Vis., Oct. 13–16, 2003, vol. 2, pp. 1470–1477.


[3] G. Csurka, C. Dance, L. Fan, and C. Bray, "Visual categorization with bags of keypoints," in Proc. ECCV Workshop Stat. Learn. Comput. Vis., 2004, vol. 1, pp. 1–22.
[4] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2006, vol. 2, pp. 2169–2178.
[5] Y. Yang and S. Newsam, "Spatial pyramid co-occurrence for image classification," in Proc. IEEE ICCV, Nov. 6–13, 2011, pp. 1465–1472.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[7] M. Lienou, H. Maitre, and M. Datcu, "Semantic annotation of satellite images using latent Dirichlet allocation," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 1, pp. 28–32, Jan. 2010.
[8] C. Vaduva, I. Gavat, and M. Datcu, "Latent Dirichlet allocation for spatial analysis of satellite images," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 5, pp. 2770–2786, May 2013.
[9] N. Rasiwasia and N. Vasconcelos, "Latent Dirichlet allocation models for image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2665–2679, Nov. 2013.
[10] A. Bosch, A. Zisserman, and X. Muoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[11] P. Quelhas, F. Monay, J. Odobez, D. Gatica-Perez, and T. Tuytelaars, "A thousand words in a scene," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1575–1589, Sep. 2007.
[12] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
[13] J. Wright et al., "Sparse representation for computer vision and pattern recognition," Proc. IEEE, vol. 98, no. 6, pp. 1031–1044, Jun. 2010.
[14] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proc. IEEE Conf. CVPR, Jun. 13–18, 2010, pp. 2559–2566.
[15] E. Nowak, F. Jurie, and B. Triggs, "Sampling strategies for bag-of-features image classification," in Proc. ECCV, 2006, pp. 490–503.
[16] S. Singh, A. Gupta, and A. A. Efros, "Unsupervised discovery of mid-level discriminative patches," in Proc. ECCV, 2012, pp. 73–86.
[17] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 10, pp. 1915–1926, Oct. 2012.
[18] H.-C. Shin, M. R. Orton, D. J. Collins, S. J. Doran, and M. O. Leach, "Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1930–1943, Aug. 2013.
[19] R. Marc'Aurelio, L. Boureau, and Y. LeCun, "Sparse feature learning for deep belief networks," in Proc. Adv. Neural Inf. Process. Syst., 2007, vol. 20, pp. 1185–1192.
[20] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, and U. Montreal, "Greedy layer-wise training of deep networks," in Proc. Adv. Neural Inf. Process. Syst., 2007, vol. 19, pp. 153–160.
[21] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th ICML, 2008, pp. 1096–1103.
[22] H. Lee, C. Ekanadham, and A. Ng, "Sparse deep belief net model for visual area V2," in Proc. Adv. Neural Inf. Process. Syst., 2008, vol. 20, pp. 873–880.
[23] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," J. Mach. Learn. Res., vol. 10, pp. 1–40, 2009.
[24] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, Mar. 1951.
[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986.
[26] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, Dec. 1989.
[27] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proc. Int. Conf. Artif. Intell. Stat., 2011, pp. 215–223.
[28] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep. 29–Oct. 2, 2009, pp. 2146–2153.
[29] C. Chang and C. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, Apr. 2011.


[30] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proc. 7th Int. Conf. Doc. Anal. Recog., 2003, vol. 2, pp. 958–962.
[31] D. C. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," arXiv preprint arXiv:1202.2745, 2012.
[32] X. Huang and L. Zhang, "An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 1, pp. 257–272, Jan. 2013.
[33] L. Zhang, L. Zhang, D. Tao, and X. Huang, "On combining multiple features for hyperspectral remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 50, no. 3, pp. 879–893, Mar. 2012.
[34] D. Tao, L. Jin, Z. Yang, and X. Li, "Rank preserving sparse learning for Kinect based scene classification," IEEE Trans. Cybern., vol. 43, no. 5, pp. 1406–1417, Oct. 2013.
[35] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[36] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, vol. 25, pp. 1106–1114.
[37] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proc. ACM Int. Conf. Adv. Geograph. Inf. Syst., 2010, pp. 270–279.
[38] B. Du and L. P. Zhang, "Target detection based on a dynamic subspace," Pattern Recognit., vol. 47, no. 1, pp. 344–358, Jan. 2014.

Fan Zhang received the B.S. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2012, where he is currently working toward the Ph.D. degree in the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing. His research interests include high-resolution image and hyperspectral image classification, machine learning, and computer vision in remote sensing applications.

Bo Du (M’10) received the B.S. degree from Wuhan University, Wuhan, China, in 2005 and the Ph.D. degree in photogrammetry and remote sensing from the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, in 2010. He is currently an Associate Professor with the School of Computer Science, Wuhan University. His major research interests include pattern recognition, hyperspectral image processing, and signal processing.

Liangpei Zhang (M’06–SM’08) received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998. He is currently the Head of the Remote Sensing Division, State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University. He is also a “Chang-Jiang Scholar” Chair Professor appointed by the Ministry of Education of China. He is currently a Principal Scientist for the China State Key Basic Research Project (2011–2016) appointed by the Ministry of National Science and Technology of China to lead the remote sensing program in China. He edits several conference proceedings, issues, and geoinformatics symposiums. He also serves as Associate Editor of the International Journal of Ambient Computing and Intelligence, International Journal of Image and Graphics, International Journal of Digital Multimedia Broadcasting, Journal of Geo-spatial Information Science, and Journal of Remote Sensing. He has more than 300 research papers. He is the holder of five patents. His research interests include hyperspectral remote sensing, highresolution remote sensing, image processing, and artificial intelligence. Dr. Zhang is a Fellow of the Institution of Electrical Engineers, an executive member (Board of Governor) of the China National Committee of International Geosphere–Biosphere Programme, an executive member of the China Society of Image and Graphics, etc. He regularly serves as Cochair of the series Society of Photo-Optical Instrumentation Engineers Conferences on Multispectral Image Processing and Pattern Recognition, Conference on Asia Remote Sensing, and many other conferences. He also serves as Associate Editor of the IEEE T RANSACTIONS ON G EOSCIENCE AND R EMOTE S ENSING.
