Issues in Visualizing Intercultural Dialogue Using Word2Vec and t-SNE Heeryon Cho Sang Min Yoon HCI Laboratory College of Computer Science Kookmin University Seoul, 02707, South Korea Email: {heeryon, smyoon}@kookmin.ac.kr

Abstract—One way to visualize an intercultural dialogue is to plot keywords jointly used by the intercultural speakers to see how the keywords locate relatively to each other, with the position of the keywords signifying some kind of a similarity relationship. We processed a Japanese transcription of a Korean-Japanese dialogue using Word2Vec and t-SNE algorithm to generate various 2D plots of the noun words jointly used by the Korean and Japanese speakers. Through this visualization process, we tracked down some of the issues involved in generating a meaningful visualization of the noun words jointly used by the intercultural speakers.

I. I NTRODUCTION Suppose that we are to build a system that dynamically visualizes how well a dialogue among intercultural speakers is taking place. One way to do this is to selectively plot speakers’ spoken nouns, which express key concepts of the dialogue, onto a 2D space, and analyze whether the noun words jointly spoken by the intercultural speakers are used in a similar context. Recent development in distributed vector representations of words and dimensionality reduction techniques has laid a useful foundation for visualization of word relationships [1], but many issues must be solved in order to realize meaningful visualization of intercultural dialogue using these techniques. Figure 1 shows two 2D plots of the same twenty nouns used in an intercultural dialogue between Korean (marked in blue ‘x’) and Japanese (marked in red ‘o’) speakers. A Japanese transcript of the intercultural dialogue 1 was processed using Word2Vec [2] and t-distributed Stochastic Neighbor Embedding (t-SNE) [3] algorithm to generate the 2D plots. Because the Japanese transcript included a Japanese translation of the Korean dialogue, all noun words were plotted in Japanese with matching English words. Notice how the plots (Figs. 1 (a), (b)) change as the t-SNE learning rates are adjusted from 800 to 1,000. Many choices, e.g., the selection of words, the size of the word embedding dimension and t-SNE perplexity values, etc., must be made to generate a meaningful visualization. This paper aims to identify some of the issues involved in visualizing the spoken nouns shared by the intercultural 1 The transcript is available at http://www2.jiia.or.jp/pdf/resarch/H25 JapanKoreaDialogue/04 Proceeding.pdf

(a) t-SNE learning rate = 800

(b) t-SNE learning rate = 1000 Figure 1. 2D plots of the same twenty nouns jointly spoken by the Japanese and Korean speakers during an intercultural dialogue. The same meaning nouns spoken by the two countries’ speakers are linked with colored lines. Different t-SNE learning rates induce different visualizations as shown in Fig. 1 (a) and Fig. 1 (b), but some words are positioned relatively similarly to each other as in the case of ‘military’ and ‘criticism’. Country names such as ‘U.S.’, ‘China’, ‘Japan’, and ‘Korea’, and ‘history’ and ‘economy’ are also located near each other regardless of different t-SNE learning rates.

speakers. To do so, we review the steps taken to generate the plots in Fig. 1, and clarify several issues involved in the visualization. Although the processing of an intercultural dialogue generally accompanies multilingual text processing, we assume that a monolingual transcript, which includes a human translation of the other language(s) of the dialogue, is given in order to focus on the issues regarding visualization.

II. C ASE S TUDY A total of thirty four Korean and Japanese academics and journalists participated in a two-day program hosted by the Korea Foundation and The Japan Institute of International Affairs to discuss the present and future of Korea–Japan relations in October 21 & 22, 2013. Several Korean and Japanese students also participated in the program. The program consisted of 10+ hours of presentation and discussion, and the Japanese transcript of the Korean–Japanese dialogue was created. A. Data & Method The PDF transcript was first converted to a text file, and the text file was manually split into three separate files containing utterances of the Japanese speakers, Korean speakers, and the Japanese or Korean speakers in which the nationality was unspecified. From each file, the word tokens and the parts of speech were obtained using an open source Japanese morphological analyzer, JUMAN++ [4]. Then, the word embeddings of the extracted word tokens were obtained by applying the Word2Vec algorithm. We used an open source Word2Vec generation tool, Gensim [5], to construct the 100-dimension word vectors. Finally, t-SNE algorithm was applied to generate the 2D word vectors. An open source machine learning library, scikit-learn [6], was used for dimensionality reduction using t-SNE. A total of 1,488, 1,544, and 312 sentences spoken by the Japanese, Korean, and unspecified-nationality speakers were obtained. B. Issues Trial and error was conducted to reach the plots in Fig. 1. Three issues were identified in the visualization process. • Word2Vec generation: Initially, we generated each country’s word vectors using only the corresponding speaker’s text, i.e., Japanese for Japanese and Korean for Korean. This was done so that the differences in the noun word usage, if present, would be more distinctly highlighted. However, the resulting word vectors of the two countries’ utterances lead to wildly varying plots when fed into t-SNE, rendering a meaningful comparison of the two countries’ spoken nouns difficult. The remedy was to utilize the word vectors generated using the unspecified-nationality speakers’ text as pivot vectors, and update each country’s embeddings through the retraining of the pivot vectors. Note that all parts of speech was used when generating the pivot word vectors whereas only the values of the noun words were updated during the retraining of each country’s word vectors. Moreover, various word vector dimensions, e.g., 2, 10, 100, were tested. • t-SNE parameter selection: Not only the learning rate (recall Fig. 1 (a), (b)), but also the perplexity value of t-SNE affected the visualization outcome. It has been advised that the perplexity value should be



smaller than the number of the plotting points [7]. In the case of Fig. 1, we set the perplexity value to 10 for each country’s results. Keyword selection: The selection of jointly used noun words for visualization was another issue. While automatic identification of the jointly used words was possible by taking the intersection of the two countries’ unique word lists, the selection of the meaningful and interesting words for visualization was subjective and difficult. In our experiment, we sorted the noun words according to their frequency and selected the more frequently occurring nouns.

III. C ONCLUSION & F UTURE W ORK We tracked down some of the issues involved in the visualization of the shared noun words uttered in an intercultural dialogue using Word2Vec and t-SNE algorithm. Based on the findings summarized in this paper, we plan to devise a way to automatically select, visualize, and highlight the common keywords that have different understanding and/or usage among the intercultural speakers in the future. ACKNOWLEDGMENT This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) grants funded by the Korean Ministry of Science, ICT & Future Planning (NRF-2017R1A2B4011015) and Korean Ministry of Education (NRF-2016R1D1A1B04932889). R EFERENCES [1] H. Heuer, “Text comparison using word vector representations and dimensionality reduction,” in Proc. EuroSciPy, 2015, pp. 13–16. [2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013, pp. 3111–3119. [3] L. Van Der Maaten, “Accelerating t-SNE using tree-based algorithms,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3221– 3245, Jan. 2014. [4] H. Morita, D. Kawahara, and S. Kurohashi, “Morphological analysis for unsegmented languages using recurrent neural network language model,” in Proc. EMNLP, 2015, pp. 2292– 2297. ˇ uˇrek and P. Sojka, “Software Framework for Topic [5] R. Reh˚ Modelling with Large Corpora,” in Proc. LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50. [6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011. [7] M. Wattenberg, F. Vi´egas, and I. Johnson, “How to use t-SNE effectively,” Distill, 2016.

Issues in Visualizing Intercultural Dialogue Using ...

Seoul, 02707, South Korea. Email: 1heeryon, [email protected] .... automatically select, visualize, and highlight the common keywords that have different ...

416KB Sizes 1 Downloads 125 Views

Recommend Documents

Issues in evaluating semantic spaces using word ...
Abstract. The offset method for solving word analo- gies has become a standard evaluation tool for vector-space semantic models: it is considered desirable for a space to repre- sent semantic relations as consistent vec- tor offsets. We show that the

Real user evaluation of spoken dialogue systems using ...
Mechanical Turk (AMT), provides an efficient way of recruit- ing subjects. ... aimed to evaluate the overall performance of the dialogue system. 3. Remote evaluation. The evaluation of ... number to call, using either a land line or a mobile phone an

Dialogue Systems in Education
Jun 17, 2009 - Tutorial dialogue systems .... design of Instructional Systems”, Lawrence Erlbaum. Bloom, B. S. ... Artificial Intelligence and Education, in press.

Dialogue Systems in Education
Jun 17, 2009 - Dialogue Systems in. Education ... Intelligent Tutoring System (Sleeman et al. 1982). • Computer .... data(reinforcement learning). 6/17/2009. 15.

Visualizing Your Data Using Real-World Business ...
Real-World Business Scenarios Full Download ePub. Online ... The Truthful Art: Data, Charts, and Maps for Communication · Data Visualisation · SHOW ME ...

[Ebook] The Big Book of Dashboards: Visualizing Your Data Using ...
If signed companies would be required to inform users of how they’re using the location data they collect if the users decides to share it Pok 233 ...

Social Issues- Brainstorming and Speaking - Using English
Social Issues- Brainstorming and Speaking ... Find social issues in the list of collocations below: Ageing marriages .... Should be covered more by the media.