Issues in Visualizing Intercultural Dialogue Using Word2Vec and t-SNE Heeryon Cho and Sang Min Yoon HCI Lab., College of Computer Science, Kookmin University, Seoul, 02707, South Korea
[email protected],
[email protected]
Culture & Computing 2017, 10-12 September 2017, Doshisha University, Japan ►ABSTRACT◄ One way to visualize an intercultural dialogue is to plot the keywords jointly used by the intercultural speakers and observe how they are positioned relative to one another, with the positions signifying some kind of similarity relationship. We processed a Japanese transcription of a Korean-Japanese dialogue using the Word2Vec and t-SNE algorithms to generate various 2D plots of the nouns jointly used by the Korean and Japanese speakers. Through this visualization process, we identified some of the issues involved in generating a meaningful visualization of the nouns jointly used by intercultural speakers.
►ACKNOWLEDGMENT◄ This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) grants funded by the Korean Ministry of Science, ICT & Future Planning (NRF-2017R1A2B4011015) and Korean Ministry of Education (NRF-2016R1D1A1B04932889).
Figure 1. 2D plots of the same twenty nouns jointly spoken by the Japanese and Korean speakers during an intercultural dialogue. Nouns with the same meaning spoken by the two countries’ speakers are linked with colored lines. Different t-SNE learning rates induce different visualizations, as shown in the left and right panels of Fig. 1, but some words are positioned relatively similarly to each other, as in the case of ‘military’ and ‘criticism’. Country names such as ‘U.S.’, ‘China’, ‘Japan’, and ‘Korea’, as well as ‘history’ and ‘economy’, are also located near each other regardless of the t-SNE learning rate. Left: t-SNE learning rate = 800; right: t-SNE learning rate = 1000.
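A hypothetical sketch of the linked-pair plot described in the caption: same-meaning nouns from the Korean (KO) and Japanese (JA) speakers are joined by a line. The coordinates are random stand-ins for the actual t-SNE output, and matplotlib is an assumed choice (the poster does not name a plotting library).

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

words = ["military", "criticism", "history", "economy"]
rng = np.random.default_rng(42)
ko_xy = rng.uniform(-1, 1, size=(len(words), 2))  # KO speaker positions
ja_xy = rng.uniform(-1, 1, size=(len(words), 2))  # JA speaker positions

fig, ax = plt.subplots()
for word, (kx, ky), (jx, jy) in zip(words, ko_xy, ja_xy):
    ax.plot([kx, jx], [ky, jy], alpha=0.5)  # one colored link per word pair
    ax.annotate(f"{word} (KO)", (kx, ky))
    ax.annotate(f"{word} (JA)", (jx, jy))
fig.savefig("joint_nouns.png")
```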
►RESEARCH SUMMARY◄
The Problem
How well is an intercultural dialogue taking place?
►ISSUES & FINDINGS◄
Keyword Selection
- Jointly spoken nouns were automatically identified by taking the intersection of the two speaker groups’ noun sets
- However, selecting meaningful and interesting words was subjective and difficult to automate
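The intersection step above can be sketched as a simple set operation; the noun lists here are toy English stand-ins for the Japanese transcript’s actual tokens.

```python
# Toy stand-ins for each speaker group's extracted noun tokens.
ko_nouns = {"military", "criticism", "history", "economy", "apology"}
ja_nouns = {"military", "criticism", "history", "economy", "treaty"}

# Jointly spoken nouns: the set intersection, sorted for stable output.
joint = sorted(ko_nouns & ja_nouns)
print(joint)  # ['criticism', 'economy', 'history', 'military']
```

Deciding which of these joint nouns are actually meaningful or interesting is the part the poster found hard to automate.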
One Solution
Visualize the intercultural dialogue by plotting nouns jointly spoken by the participants
Word2Vec Generation
- Initially, we generated each country’s word embedding vectors using country-specific texts
- The result was wildly varying word embedding vectors that were difficult to compare
- We therefore used the unknown-speakers’ text to first generate ‘pivotal’ vectors, and then updated only the noun word embeddings using the country-specific texts
Ultimate Goal
Build a dynamic system that visualizes intercultural dialogue by tracking jointly spoken keywords
Case Study
- A Japanese transcript of a Korean-Japanese dialogue discussing the present and future of Korea-Japan relations was processed, and key nouns were plotted in a 2D space using NLP technology
- We identified several issues in visualization by reviewing the step-by-step visualization process we took
t-SNE Parameter Selection
- The t-SNE learning rate and perplexity value affected the visualization outcome
- The perplexity value was adjusted to be smaller than the number of plotted points (i.e., words)
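A sketch of the dimensionality-reduction step, assuming scikit-learn’s TSNE (scikit-learn is cited on the poster); random vectors stand in for the 100-D Word2Vec embeddings of the twenty plotted nouns.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 100))  # stand-in: 20 nouns x 100-D vectors

tsne = TSNE(n_components=2,
            perplexity=5,        # must stay below the 20 plotted words
            learning_rate=800,   # the poster compares 800 vs. 1000
            init="random",
            random_state=0)
coords = tsne.fit_transform(vectors)
print(coords.shape)  # (20, 2)
```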
►VISUALIZATION PROCESS◄
Japanese transcript of the Korean-Japanese dialogue
① PDF-to-text conversion → text-format Korean-Japanese dialogue
② Manually split the text into three files: KO / JA / UNKNOWN speakers
③ Obtain word tokens and part-of-speech tags
④ Obtain 100D word embeddings (Word2Vec)
⑤ Reduce dimensions for 2D visualization (t-SNE)
[Diagram: Word2Vec CBOW architecture. The context words w(t-2), w(t-1), w(t+1), w(t+2) at the INPUT layer are summed (SUM) at the PROJECTION layer to predict the center word w(t) at the OUTPUT layer.]
►REFERENCES◄
[1] H. Heuer, “Text comparison using word vector representations and dimensionality reduction,” in Proc. EuroSciPy, 2015, pp. 13–16.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013, pp. 3111–3119.
[3] L. van der Maaten, “Accelerating t-SNE using tree-based algorithms,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3221–3245, Jan. 2014.
[4] H. Morita, D. Kawahara, and S. Kurohashi, “Morphological analysis for unsegmented languages using recurrent neural network language model,” in Proc. EMNLP, 2015, pp. 2292–2297.
[5] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proc. LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[7] M. Wattenberg, F. Viégas, and I. Johnson, “How to use t-SNE effectively,” Distill, 2016.