Feature Word Selection by Iterative Top-K Aggregation for Classifying Recommended Shops

Heeryon Cho and Sang Min Yoon
School of Computer Science, Kookmin University, Seoul, South Korea
{heeryon, smyoon}@kookmin.ac.kr

Abstract— We propose a feature word selection method for classifying recommended shops using Yelp customer reviews. TextRank keywords are extracted from the customer reviews to construct sorted positive and negative keyword lists based on each keyword's summed TextRank scores. The top-K keywords are then aggregated iteratively by multiples of K to construct the positive and negative keyword frequency lists. The negative keyword frequency list is then subtracted from the positive keyword frequency list, and the resulting list is standardized to generate the final positive and negative keyword lists. The performance of our feature selection method is evaluated using Naïve Bayes classifiers, and the binary classification accuracy of the selected feature words is 77.94%, which is better than the baseline χ² feature word selection.

Keywords— Feature word selection, iterative keyword aggregation, Naïve Bayes, Yelp customer reviews, TextRank algorithm.

I. INTRODUCTION

The explosion of social media data has opened new avenues for data-driven analyses of online reviews in recent years. Consequently, a large body of research has been conducted to tackle the problem of mining opinions and detecting sentiment from online reviews [1,2]. In this paper, we tackle the problem of sentiment classification, where the goal is to automatically classify a given review's sentiment as positive or negative. However, instead of classifying the sentiment of an unseen review, we classify the review target, bars that serve alcoholic beverages, as 'recommended' or 'not recommended' based on Yelp customer reviews. We propose a feature word selection method that separately aggregates representative positive/negative keywords from the positive/negative customer reviews. We focus on feature word selection because it can potentially eliminate noisy features that adversely affect classification, and it may provide greater insight into the more important features [3]. The TextRank algorithm [4] is employed to extract keywords from the Yelp bar reviews, and the top-K keywords are iteratively selected to construct feature word lists that yield better classification performance. Existing research on feature word selection for opinion classification has proposed an entropy-weighted genetic algorithm that incorporates the information-gain heuristic for feature selection [5] and a log likelihood ratio for feature reduction [6]. In this paper, we propose a method of selecting feature words from the ranked positive and negative keyword lists. We reduce the feature word vector space from 115,659 to 258 words and improve the binary bar classification accuracy from 74.48% to 77.94%. Our method also performs better than the baseline χ² feature selection method, whose best accuracy is 74.94%.

II. DATA & METHOD

A. Yelp Dataset

The bar review dataset was constructed from the Yelp dataset challenge data¹. A total of 2,596 businesses containing 'Bars' in their business category were selected, and a total of 213,278 English reviews for these 2,596 bars were collected. Each bar has a real-valued star rating between 1 and 5. We defined the bars with a star rating of 4 or higher as 'recommended bars' and those with a rating below 4 as 'not recommended bars'. Each bar had an average of 82 customer reviews, and each review has an integer star rating between 1 and 5. The size of the bar and review data is summarized in Table I. We divided the bar and review data into three datasets: feature selection, train, and test data; note that we did not include 3-star rated reviews in the feature selection data (a minimal labeling sketch is given after Table I).

TABLE I. SIZE OF BAR & REVIEW DATA

Number of Bars / Number of Reviews | Feature Selection | Train Data | Test Data | Total
Recommended bars                   | 323               | 392        | 358       | 1,073
Not recommended bars               | 542               | 473        | 508       | 1,523
Total bars                         | 865               | 865        | 866       | 2,596
5-star & 4-star reviews            | 57,696            | 44,141     | 44,875    | 146,712
1-star & 2-star & 3-star reviews   | 15,910 a          | 24,204     | 26,452    | 66,566
Total reviews                      | 73,606            | 68,345     | 71,327    | 213,278

a. In the case of the feature selection data, the 3-star rated reviews were excluded.
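As an illustration of this labeling step, the following minimal Python sketch (pandas) assigns the 'recommended' label from the business star rating; the file name is hypothetical, and the stars and categories fields follow the Yelp dataset challenge business records.

    import json
    import pandas as pd

    # Load the Yelp business records (file name is hypothetical).
    with open("yelp_academic_dataset_business.json", encoding="utf-8") as f:
        businesses = [json.loads(line) for line in f]
    bars = pd.DataFrame(businesses)

    # Keep only businesses whose category list contains 'Bars'.
    bars = bars[bars["categories"].apply(lambda c: c is not None and "Bars" in c)]

    # Recommended = star rating of 4 or higher; not recommended otherwise.
    bars["label"] = (bars["stars"] >= 4).map({True: "recommended", False: "not recommended"})
    print(bars["label"].value_counts())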

B. Feature Word Selection

Two keyword lists, positive and negative, were constructed from the positive/negative reviews in the feature selection data. The positive reviews all have a 4- or 5-star rating, and the negative reviews all have a 1- or 2-star rating. For each review, we removed the English stopwords and extracted keywords using the TextRank algorithm.
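The keyword extraction itself relies on a textacy implementation of TextRank (footnote 2); as a library-agnostic sketch of the same idea, the snippet below scores the words of a single review with PageRank over a word co-occurrence graph after stopword removal. The window size, tokenization, and function name are illustrative choices, not the paper's exact configuration.

    import networkx as nx
    from nltk.corpus import stopwords  # assumes the NLTK stopword corpus is available

    STOPWORDS = set(stopwords.words("english"))

    def textrank_keywords(review_text, window=2, topn=10):
        """Score the words of one review with PageRank over a co-occurrence graph."""
        tokens = [w for w in review_text.lower().split()
                  if w.isalpha() and w not in STOPWORDS]
        graph = nx.Graph()
        # Link words that co-occur within a small sliding window.
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + 1 + window]:
                if word != other:
                    graph.add_edge(word, other)
        scores = nx.pagerank(graph)  # TextRank-style importance score per word
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:topn]  # list of (keyword, score) pairs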

1 https://www.yelp.com/dataset_challenge

This research was supported by the CRC program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2015R1A5A7037615). Sang Min Yoon was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002890).

The TextRank algorithm calculates the importance of words in a document using a graph-based ranking algorithm [4]. We used a Python implementation of the TextRank algorithm² to extract keywords from each of the positive/negative reviews and placed the extracted keywords in the positive/negative keyword lists. Once the positive/negative keyword lists were constructed from the positive/negative reviews, the TextRank scores of the unique keywords in each list were summed, and the keywords were sorted in descending order of the summed TextRank score in each list. The top-K ranking positive/negative keywords were then selected iteratively from each keyword list by increasing the selection size in multiples of K (i.e., K multiplied by 1, 2, and so on, until it reaches the maximum size of n×K). For example, if we define the size of K as 50 and the maximum K as 500, we iteratively aggregate the top-50, top-100, top-150, ..., top-500 keywords from each of the positive/negative keyword lists. We evaluate various values of K and the maximum K in our feature word evaluation experiment. Once the positive/negative keyword frequency lists were created via iterative aggregation, the two lists were merged by subtracting the negative keyword list from the positive keyword list based on keyword frequency. The resulting single list of keyword-frequency pairs was then standardized based on the keyword frequencies. Finally, the keywords having positive/negative standardized values were selected as the feature words for 'recommended/not recommended' bar classification. Note that through this standardization process, domain-specific stopword-like keywords with a zero standardized value were automatically removed (e.g., restaurant, food, etc.). The overall process of the proposed feature word selection method is summarized in Fig. 1.
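A minimal sketch of the aggregation, subtraction, and standardization steps, assuming pos_ranked and neg_ranked are the positive and negative keyword lists of (keyword, summed TextRank score) pairs already sorted in descending order of score:

    from collections import Counter
    import statistics

    def select_feature_words(pos_ranked, neg_ranked, k=500, max_k=500):
        """Iterative top-K aggregation, subtraction, and standardization."""
        pos_freq, neg_freq = Counter(), Counter()
        # Aggregate the top-K, top-2K, ..., top-maxK keywords of each list,
        # so that higher-ranked keywords are counted more often.
        for top in range(k, max_k + 1, k):
            pos_freq.update(w for w, _ in pos_ranked[:top])
            neg_freq.update(w for w, _ in neg_ranked[:top])
        # Subtract the negative frequencies from the positive frequencies.
        merged = {w: pos_freq.get(w, 0) - neg_freq.get(w, 0)
                  for w in set(pos_freq) | set(neg_freq)}
        # Standardize the merged frequencies.
        mean = statistics.mean(merged.values())
        std = statistics.pstdev(merged.values())
        z = {w: (v - mean) / std for w, v in merged.items()}
        # Keywords with a zero standardized value (domain stopwords) drop out.
        positive_words = [w for w, s in z.items() if s > 0]
        negative_words = [w for w, s in z.items() if s < 0]
        return positive_words, negative_words

Under this sketch, K = 500 with Max K = 500 involves a single aggregation round, so the selection keeps exactly the keywords that appear in only one of the two top-500 lists.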

III. EXPERIMENT & RESULT

A. Experiment

We constructed various feature word lists by setting the size of K to 50, 100, 250, and 500 and the maximum K to 500, 1,000, 2,500, 5,000, 7,500, and 10,000 (Tables II & III and Fig. 2), and additionally constructed shorter feature word lists by setting K to 10, 25, 50, and 100 and the maximum K to 100, 200, 300, 400, and 500 (Table IV). We also constructed feature word lists using the χ² feature selection method³ with different top-N sizes for the baseline evaluation (Table III, rightmost column). The top-N word size for the χ² method was set equal to the feature word sizes of the K=500 column of Table II. The train and test data for the recommended/not recommended bar classification experiments were constructed as follows: the reviews of each bar were merged to form one long review per bar. Since each bar has a 'recommended' or 'not recommended' label, the bar's label was inherited by the corresponding merged long review. The train and test data (i.e., long reviews with 'recommended/not recommended' labels per bar) were then turned into matrices of feature word counts by selecting only the feature words obtained with our proposed feature selection method or the baseline χ² feature selection method. Once the various feature word matrices were constructed, multinomial Naïve Bayes classifiers⁴ were built using the train data, and the recommended/not recommended bar classification accuracies were measured on the test data.
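A compact sketch of this evaluation setup, assuming train_texts and test_texts each hold one merged long review per bar, y_train and y_test hold the recommended/not recommended labels, and feature_words is the list produced by the proposed selection; scikit-learn (footnotes 3 and 4) is used for the counting, the χ² baseline, and the classifiers.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_selection import SelectKBest, chi2

    # Proposed method: restrict the count matrix to the selected feature words.
    vectorizer = CountVectorizer(vocabulary=feature_words)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    nb = MultinomialNB().fit(X_train, y_train)
    print("proposed features:", nb.score(X_test, y_test))

    # Baseline: top-N words chosen by the chi-squared statistic on the full vocabulary.
    full_vec = CountVectorizer()
    X_train_full = full_vec.fit_transform(train_texts)
    X_test_full = full_vec.transform(test_texts)
    selector = SelectKBest(chi2, k=len(feature_words)).fit(X_train_full, y_train)
    nb_chi2 = MultinomialNB().fit(selector.transform(X_train_full), y_train)
    print("chi2 baseline:", nb_chi2.score(selector.transform(X_test_full), y_test))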

Figure 1. Overall process of our feature word selection method.

TABLE II. SIZE OF POSITIVE+NEGATIVE FEATURE WORDS FOR DIFFERENT K × MAXIMUM K COMBINATIONS

       | Max K=500   | Max K=1,000   | Max K=2,500       | Max K=5,000       | Max K=7,500       | Max K=10,000
K=50   | 284+249=533 | 600+534=1,134 | 1,556+1,387=2,943 | 3,199+2,897=6,096 | 4,922+4,507=9,429 | 6,735+6,328=13,063
K=100  | 240+221=461 | 549+500=1,049 | 1,483+1,341=2,824 | 3,115+2,823=5,938 | 4,833+4,444=9,277 | 6,644+6,262=12,906
K=250  | 179+174=353 | 453+426=879   | 1,343+1,224=2,567 | 2,938+2,676=5,614 | 4,644+4,281=8,925 | 6,444+6,086=12,530
K=500  | 129+129=258 | 353+333=686   | 1,178+1,064=2,242 | 2,707+2,469=5,176 | 4,386+4,048=8,434 | 6,173+5,842=12,015

B. Result

Fig. 2 and Table III display the classification accuracies on the test data for the different combinations of K and maximum K of our method and for the baseline χ² method. For our method, the K=500 and Max K=500 combination yields the best accuracy of 77.94%. The baseline χ² method's best accuracy is 74.94%, obtained with a top-N word size of 2,242. The accuracy difference between the two methods' best results is statistically significant at the 99% confidence level using McNemar's test. Hence, we conclude that our feature selection method is better than the baseline χ² feature selection method.
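The significance check can be reproduced along the following lines, assuming per-bar prediction arrays pred_ours and pred_chi2 for the two best models and the true test labels y_test; the statsmodels implementation of McNemar's test is used here as one possible choice (the paper does not state which implementation was used).

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    correct_ours = np.asarray(pred_ours) == np.asarray(y_test)
    correct_chi2 = np.asarray(pred_chi2) == np.asarray(y_test)

    # 2x2 table of (dis)agreements between the two classifiers' correctness.
    table = [[np.sum(correct_ours & correct_chi2), np.sum(correct_ours & ~correct_chi2)],
             [np.sum(~correct_ours & correct_chi2), np.sum(~correct_ours & ~correct_chi2)]]

    result = mcnemar(table, exact=True)
    print("McNemar p-value:", result.pvalue)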

2 http://textacy.readthedocs.io/en/latest/api_reference.html
3 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
4 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

We also performed another baseline evaluation using all 115,659 unique words in the train data: the classification accuracy on the test data was 74.48%. With our feature selection method, we were able to reduce the feature word space from 115,659 to 258 words and also improve the classification accuracy from 74.48% to 77.94%. How should we set the size of K and the maximum K? Note that as the size of K increases, more keywords obtain a zero standardized value and are removed for the same maximum K, so shorter feature word lists are constructed for larger K (Table II). Based on the experiments (Table III), we can see that a larger K combined with a smaller maximum K generally yields better performance. We conducted additional experiments for smaller K and maximum K (Table IV); the best result was 76.44% for the K=50 and Max K=100 combination, which reduced the feature word space further to 82 words. Although the accuracy did not increase for smaller K and maximum K, we were able to obtain a shorter feature word list that still works for classification. Let us now look at which keywords are selected as feature words. Table V shows the top-20 positive/negative keywords for the K=500 and maximum K=500 combination of our method (OURS), the general TextRank keyword lists based on the TextRank scores (TextRank), and the baseline χ² method (χ²). In the case of the top-20 scoring feature words calculated using the χ² method, the feature words are not explicitly divided into positive/negative words (Table V, χ²). We see that our method's feature words include more distinct positive/negative keywords.

IV. CONCLUSION & FUTURE WORK

We proposed a feature selection method based on iterative aggregation of TextRank keywords to improve the classification performance of bars using Yelp customer reviews. The shorter feature word lists obtained by our method allowed us to gain insight into which words were actually effective in the classification. These feature words can be utilized as keyword-in-context (KWIC) words for qualitative investigation of customer reviews. Since our current method only utilizes the final selected feature words, in the future we seek to devise a method that utilizes the selected keywords' TextRank scores to construct a more sophisticated, weighted feature word matrix for improving the classification.

REFERENCES

[1] K. Ravi and V. Ravi, “A survey on opinion mining and sentiment analysis: Tasks, approaches and applications,” Knowledge-Based Systems, vol. 89, pp. 14-46, November 2015.
[2] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” Found. Trends Inf. Retr., vol. 2, no. 1-2, pp. 1-135, January 2008.
[3] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, pp. 1157-1182, March 2003.
[4] R. Mihalcea and P. Tarau, “TextRank: Bringing order into texts,” in Proc. of the Conf. on Empirical Methods in Natural Language Processing, pp. 404-411, 2004.
[5] A. Abbasi, H. Chen, and A. Salem, “Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums,” ACM Trans. Inf. Syst., vol. 26, no. 3, Article 12, June 2008.
[6] M. Gamon, “Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis,” in Proc. of the 20th Int’l Conf. on Computational Linguistics, pp. 841-847, 2004.

Figure 2. Classification accuracy of test data.

TABLE III. CLASSIFICATION ACCURACY OF TEST DATA (FOR χ²: TOP-N FEATURE WORDS)

Max K  | K=50   | K=100  | K=250  | K=500  | χ² (top-N)
500    | 74.83% | 75.75% | 75.64% | 77.94% | 73.56% (258)
1,000  | 74.71% | 74.94% | 75.52% | 76.91% | 73.90% (686)
2,500  | 74.48% | 74.60% | 74.83% | 76.33% | 74.94% (2,242)
5,000  | 74.71% | 74.02% | 75.17% | 76.33% | 74.60% (5,176)
7,500  | 73.44% | 73.90% | 74.48% | 75.29% | 73.79% (8,434)
10,000 | 73.44% | 74.13% | 74.60% | 75.06% | 74.02% (12,015)

TABLE IV. CLASSIFICATION ACCURACY OF SMALLER K AND MAX K (SIZE OF FEATURE WORDS)

Max K | K=10         | K=25         | K=50         | K=100
100   | 74.94% (116) | 75.40% (103) | 76.44% (82)  | 75.29% (64)
200   | 74.71% (232) | 74.48% (217) | 75.28% (188) | 74.02% (149)
300   | 74.02% (349) | 75.06% (332) | 74.48% (297) | 75.06% (241)
400   | 74.48% (480) | 74.48% (456) | 74.48% (415) | 75.17% (352)
500   | 74.13% (602) | 74.13% (577) | 74.83% (533) | 75.75% (461)

TABLE V. COMPARISON OF TOP-20 KEYWORDS USING OUR METHOD, TEXTRANK KEYWORD LIST, AND χ² METHOD

OURS (+)   | OURS (–)     | TextRank (+) | TextRank (–) | χ² (1-10) | χ² (11-20)
perfect    | rude         | great        | food         | great     | restaurant
fantastic  | horrible     | food         | place        | wine      | menu
wonderful  | terrible     | place        | service      | good      | like
love       | poor         | good         | good         | food      | pan
yummy      | mediocre     | service      | time         | place     | favorite
date       | awful        | time         | bar          | delicious | try
reasonable | min          | bar          | drink        | amazing   | dinner
live       | bland        | beer         | bad          | best      | time
bruschetta | disappointed | drink        | table        | love      | ordered
unique     | overpriced   | best         | great        | service   | steak
