Clustering and Visualization of Online Chat


Lung-Hsiang Wong and Chee-Kit Looi
National Institute of Education, Nanyang Technological University, 1 Nanyang Walk, Singapore 637616
E-mail: {lhwong, cklooi}@nie.edu.sg

Abstract: Online chat has become very popular as the Internet permeates every aspect of our lives. In online chat, discussions tend to be unstructured, and participants may switch topics frequently and randomly. This paper reports an agent-based chat analysis tool (CAT) that tracks a chat session so that chatters can easily determine both the previous content and the current status of the session. We developed an algorithm for CAT that divides chat content into topic-based “clusters”. We then propose a novel way of visualizing the chat content as clusters (represented by bars) on a timeline, so that the dynamics of the discussion topics can be observed over time. We have developed a prototype of the CAT Agent in Java and used it to cluster the chat transcripts of 10 education-oriented and/or theme-based chat sessions.

Keywords: Intelligent Agents; Online Chat Clustering & Visualization; Information Extraction; Computer-aided Learning; e-Learning

1. Introduction

The popularity of online chatting has increased significantly over the past decade. Online chat sessions may be highly structured and targeted at a particular topic, or they may be unstructured, noisy and out of focus. Here, “noise” refers to off-topic content in a topic-based chat session. Chatters may switch discussion topics frequently and randomly. Some chat systems allow users to log a chat transcript, but the transcript is generally saved as a flat archive file rather than organized in, e.g., a threaded structure that can be easily indexed and searched.

Such spontaneity and randomness in conventional online chat sessions pose several problems for chatters. In the context of e-Learning, these inherent problems also make this potentially useful collaborative learning tool unattractive. First, a latecomer to a chat session, or an existing chatter who leaves the session and later returns, may have difficulty tracing the previous discussion, so it is hard for such a person to (re)join the conversation. Second, individual chatters who enter a chat room when it is empty need a way of determining the current status of the chat session. Even when a chat room includes a logging feature that records the discussion as, for example, a chat transcript, browsing through the transcript to find a quality exchange can be tedious and time-consuming. Third, in topic-based discussions (especially learning-oriented ones), chatters are likely to initiate off-topic discussions unless there is a mechanism that can detect the switching of topics and provide a warning or alert about the switch, either directly or indirectly.

A need therefore exists for an online chat analysis tool that tracks a chat session so that chatters may enter and leave the conversation at any time and still easily determine both the previous content and the current status of the session. Additionally, topic-based chat sessions need a tool that monitors the direction of the discussion in real time so that topic switching is avoided or reduced.

This paper proposes a chat agent, CAT (Chat Analysis Tool), to analyze chat content. The CAT Agent extracts keywords from the chat utterances, associates each extracted keyword with a topic, and identifies content clusters in the chat session by organizing the content according to the topics. The agent outputs a graphical representation of the chat content, in the form of a bar chart of the time-based segments of the conversation. A CAT Agent can either log on to a chat room as a normal user (provided the agent programmer knows the communication protocol of the chat room, e.g., IRC) and perform online chat analysis, or cluster the utterances in a text-based chat log transcript.

Wong, L.-H., & Looi, C.-K. (2006). Clustering and visualization of online chat. Proceedings of Global Chinese Conference on Computers in Education 2006, Beijing, China.

2. Clustering Chat Sessions

The primary function of CAT is to group the chat utterances in a chat session (an utterance being what a chatter “says” within her turn during the online chat) into “clusters” by topic. Sometimes two or more clusters overlap, reflecting concurrent threads in the chat session. Figure 1a shows a sample dialogue dissection with the keywords detected and the clusters identified by CAT.

| Utterance/Action | Keyword(s) | Cluster(s) |
|---|---|---|
| Dorothy: Louis XIV was very old when he died? | Louis XIV | history |
| Marvin enters the room. | | |
| Adam: 70+ years old. considered a very old man at his time 'coz people usually died in their 40s. | | |
| Carl: Which king was beheaded during French Revolution? | French Revolution | history |
| Edmund: Yes, Louis XV was his grandson, you know? | Louis XV | history |
| Brenda: Hey, anybody wants to see the musical “Phantom of the Opera”? | musical / Phantom of the Opera | musical |
| Nelly leaves the room. | | |
| Dorothy: Any good about Louis XIV? | Louis XIV | history |
| Edmund: Andrew Lloyd Weber!! When? | Andrew Lloyd Weber | musical |

Figure 1a: A sample dialogue dissection with the keywords and clusters detected by CAT

[Timeline display: legend entries “now”, “history” and “musical”; controls for zoom (-, +) and timeline scrolling (<<, <, >, >>)]

Figure 1b: The chat timeline display according to the dialogue dissection in Figure 1a

As Figure 1a shows, chat clusters may be separated, for example, by “noise”. User-configurable threshold parameters (see Section 3) determine the tolerance for such “noisy utterances”. Conversely, an utterance may involve more than one topic; for example, a discussion of “information technology in education” involves two topics: “information technology” and “education”. In addition to clusters with specific topics, CAT can also detect “socializing clusters”, including “looking for chatters”, “greeting” and “separation”.

Figure 1b depicts the chat timeline constructed from the clustering result of the dialogue dissection in Figure 1a. The chat timeline is a visual, temporal representation of the chat topics in the chat session. On the timeline, each cluster is represented as one or more colored lines, and all clusters that relate to a particular topic share the same color. The chat timeline can be updated in real time during a chat session, or generated after the session has completed if the chat room incorporates a logging feature.

Figure 2 depicts the minimal software architecture of CAT. It consists of a chat analyzer, a dictionary, a GUI/configurator and a chat room interface. The chat analyzer clusters the chat content by referring to the dictionary for keyword-topic mappings. The GUI/configurator is a CAT agent's direct interface with its human owner: it generates the timeline display and provides a dialog box for the user to configure the agent and the clustering algorithm. The chat room interface enables the CAT agent to log on to a specific chat room as an ordinary chatter in order to perform real-time chat clustering, and to retrieve chat logs from the chat room for off-line analysis. This module therefore needs to know the chat room's protocol (e.g., IRC), where to retrieve the chat logs (e.g., in a web directory) and the format of the chat logs.

[Architecture diagram: a CAT agent containing a dictionary, a chat analyzer, a chat room interface and a GUI/configurator; the chat room interface connects to chat room ABC]

Figure 2: Minimal software architecture of CAT

The CAT agent can reside either on the same host machine as the chat room server or on a remote computer that is allowed to establish a TCP/IP connection with the server. It can even be implemented as a mobile agent if migration between chat servers is desired. In a separate embodiment, different CAT agents can communicate with each other through a “global server” in order to exchange information about different chat rooms on the Internet (probably with different protocols and hosted by different servers). This facilitates “chat room matchmaking”, which brings together chatters in different chat rooms who share the same interest (Wong & Looi 2003).

3. Chat Clustering Process

Clustering is an ongoing process. As each chat utterance is analyzed, CAT tries to incorporate it into a recent cluster or to create a new cluster that includes it. Hence, existing clusters are likely to grow gradually as the clustering process goes on. The process is summarized below:

(1) Pick up the next chat utterance and extract its keywords by applying standard text parsing and morphology algorithms.
(2) Associate each identified keyword with one or more topics listed in CAT's topic glossary (a lookup table). If the keyword maps to more than one topic, CAT picks the topic that has been included in a recent chat cluster. The identified topic is added to the topic list of the utterance.
(3) If the topic list of the utterance includes the topic of a recent cluster, the utterance is added to that cluster. If not, CAT tries to form a new cluster by looking for recent utterances that share the same topic. Sometimes both attempts fail for a particular utterance, in which case no new cluster is created and no recent cluster is expanded.
(4) Update the timeline display.
(5) Loop back to (1).

How do we quantify “recent”? This is where the following user-configurable parameters come in:

- Utterance Count Threshold (UCT): the minimum number of utterances needed to form a cluster;
- Utterance Proximity Threshold (UPT): the maximum number of off-topic utterances allowed between the current utterance and the last utterance of a cluster for the cluster to be expanded to the current utterance;
- Time Threshold (TT): the maximum time gap allowed between the current utterance and the last utterance of a cluster for the cluster to be expanded to the current utterance.
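Steps (1) and (2) above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: plain phrase matching stands in for the text parsing and morphology algorithms, the `GLOSSARY` entries and the function names `extract_keywords` and `pick_topics` are hypothetical, and keeping every candidate topic when none of them is currently active is an assumption, since the paper does not specify that case.

```python
import re

# Hypothetical keyword-to-topic glossary; the entries are illustrative only.
GLOSSARY = {
    "louis xiv": {"history"},
    "french revolution": {"history"},
    "phantom of the opera": {"musical", "film"},
}


def extract_keywords(utterance, glossary):
    """Step (1), simplified: return glossary phrases found in the utterance."""
    text = re.sub(r"\s+", " ", utterance.lower())
    return [kw for kw in glossary
            if re.search(r"\b" + re.escape(kw) + r"\b", text)]


def pick_topics(keywords, glossary, recent_topics):
    """Step (2): map keywords to topics, preferring topics of recent clusters."""
    topics = set()
    for kw in keywords:
        candidates = glossary[kw]
        preferred = candidates & recent_topics
        # Assumption: if no candidate topic is recent, keep all candidates.
        topics |= preferred if preferred else candidates
    return topics


keywords = extract_keywords(
    'Hey, anybody wants to see the musical "Phantom of the Opera"?', GLOSSARY)
print(keywords)                                      # ['phantom of the opera']
print(pick_topics(keywords, GLOSSARY, {"musical"}))  # {'musical'}
```

Here the ambiguous keyword resolves to “musical” only because a musical-related cluster is assumed to be recent; with an empty set of recent topics, both candidate topics would be kept.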

Sizes of individual clusters are therefore influenced by the user's specification of UCT, UPT and TT. Larger threshold values generally produce larger clusters, which may contain more noise. Besides noise, this mechanism may also address the commonly occurring problems of “phantom adjacency pairs” (two utterances that are posted sequentially and look like an adjacency pair, although the times of their construction reveal that they are unrelated) and “phantom responsiveness” (two adjacent utterances that appear to relate to each other but in fact do not), as explored in (Smith, Cardiz & Burkhalter 2000) and (Garcia & Jacobs 1998). In other words, these parameters decide how fine-grained the clustering should be. CAT ensures that each cluster satisfies the constraints imposed by all of these parameters.

Example: The following example illustrates the process of clustering chat content.

| Utterance Index | Timestamp (hr:min:sec) | Topic(s) Detected |
|---|---|---|
| 1 | 1:00:01 | A |
| 2 | 1:00:02 | A |
| 3 | 1:00:04 | A, B |
| 4 | 1:00:04 | A, B |
| 5 | 1:00:04 | B |
| 6 | 1:00:09 | B |
| 7 | 1:00:11 | |
| 8 | 1:00:11 | |
| 9 | 1:00:12 | A |
| 10 | 1:00:13 | A |
| 11 | 1:00:17 | A, B, C |
| 12 | 1:00:18 | B |
| 13 | 1:00:19 | |
| 14 | 1:00:21 | |
| 15 | 1:00:29 | C |
| 16 | 1:00:29 | C |
| 17 | 1:00:31 | C |

Case 1: UCT = 3; UPT = 5; TT = 10 seconds

Clusters formed:

| Topic | Range (Utterance Indices) | Remarks |
|---|---|---|
| A | 1-11 | The cluster has a gap between 5-8. The gap is taken as part of the cluster because it covers only 4 utterances (< UPT) and spans 8 seconds (< TT). |
| B | 3-12 | The cluster has a gap between 7-10. The gap is taken as part of the cluster because it covers only 4 utterances (< UPT) and spans 8 seconds (< TT). |
| C | 15-17 | Utterance 11 is not taken as part of the cluster because the gap between 11 and 15 would cover 3 utterances (< UPT) but span 12 seconds (> TT). |

The first two clusters overlap with each other.


Case 2: UCT = 3; UPT = 3; TT = 12 seconds

Clusters formed:

| Topic | Range (Utterance Indices) | Remarks |
|---|---|---|
| A | 1-4 | |
| A | 9-11 | This cluster cannot merge with the first cluster (1-4, Topic A) because the merge would create a gap (5-8) that spans 8 seconds (< TT) but covers 4 utterances (> UPT). |
| B | 3-6 | Utterances 11 and 12 cannot be included in this cluster because that would create a gap (7-10) that spans 8 seconds (< TT) but covers 4 utterances (> UPT). Nor can utterances 11 and 12 make up a separate cluster, because they are only 2 utterances with the same topic (< UCT). |
| C | 11-17 | The cluster has a gap between 12-14. The gap is taken as part of the cluster because it covers only 3 utterances (not exceeding UPT) and spans 12 seconds (not exceeding TT). |
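Both cases can be checked with a short Python sketch. It is a simplified, offline, per-topic reading of the process, not the authors' implementation: the names `Utterance` and `cluster` are hypothetical, each topic is tracked independently, and UPT and TT are treated as inclusive maxima (an assumption based on their definitions as "maximum" values). Timestamps are given as seconds past 1:00:00, taken from the example table.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    index: int
    seconds: int       # seconds past 1:00:00
    topics: frozenset


SECONDS = [1, 2, 4, 4, 4, 9, 11, 11, 12, 13, 17, 18, 19, 21, 29, 29, 31]
TOPICS = ["A", "A", "AB", "AB", "B", "B", "", "", "A", "A",
          "ABC", "B", "", "", "C", "C", "C"]
EXAMPLE = [Utterance(i + 1, s, frozenset(t))
           for i, (s, t) in enumerate(zip(SECONDS, TOPICS))]


def cluster(utterances, uct, upt, tt):
    """Return (topic, first index, last index) for every cluster formed."""
    open_runs = {}   # topic -> utterances of the cluster currently being grown
    clusters = []

    def close(topic):
        run = open_runs.pop(topic)
        if len(run) >= uct:              # UCT: minimum on-topic utterances
            clusters.append((topic, run[0].index, run[-1].index))

    for u in utterances:
        for topic in sorted(u.topics):
            run = open_runs.get(topic)
            if run is not None:
                last = run[-1]
                off_topic = u.index - last.index - 1
                # UPT and TT bound the gap back to the cluster's last utterance.
                if off_topic > upt or u.seconds - last.seconds > tt:
                    close(topic)
                    run = None
            if run is None:
                open_runs[topic] = [u]
            else:
                run.append(u)

    for topic in list(open_runs):
        close(topic)
    return sorted(clusters, key=lambda c: (c[1], c[0]))


print(cluster(EXAMPLE, uct=3, upt=5, tt=10))
# [('A', 1, 11), ('B', 3, 12), ('C', 15, 17)]
print(cluster(EXAMPLE, uct=3, upt=3, tt=12))
# [('A', 1, 4), ('B', 3, 6), ('A', 9, 11), ('C', 11, 17)]
```

The output lists clusters by starting utterance, so the order differs from the Case 2 table, but the ranges are the same.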

In Case 2, the 2nd cluster (Topic A, utterances 9-11) and the 4th cluster (Topic C, utterances 11-17) overlap with each other.

After a cluster is created, it is refined. “Refining” refers to a cluster B being absorbed by a cluster A when all of the following conditions are met:

(1) Clusters A and B overlap; AND
(2) Cluster B's topic is a sub-category of cluster A's topic (e.g., “Internet” is a sub-category of “Computers”); AND
(3) The length (i.e., utterance count) of cluster B is less than 1/5 of the length of cluster A.

The list of clusters identified in a particular chat session is stored in a database. When the moderator of the chat room changes the values of UCT, UPT and TT, the system deletes the current cluster list and builds a new one. Before modifying the UCT, UPT and TT values, the moderator can save the current cluster list.
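The three refinement conditions can be expressed as a small predicate. The following Python sketch is illustrative only: the `Cluster` record, the `PARENT` sub-category table and the function names are hypothetical, only direct parent-child topic relations are checked, and an absorbed cluster is simply dropped from the list because the paper does not say how cluster A changes after absorbing cluster B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    topic: str
    start: int    # index of the first utterance
    end: int      # index of the last utterance
    length: int   # utterance count of the cluster


# Hypothetical sub-category table: child topic -> parent topic.
PARENT = {"Internet": "Computers"}


def overlaps(a, b):
    """Condition (1): the utterance ranges of the two clusters intersect."""
    return a.start <= b.end and b.start <= a.end


def absorbs(a, b):
    """True when cluster b should be absorbed by cluster a."""
    return (overlaps(a, b)
            and PARENT.get(b.topic) == a.topic   # condition (2)
            and b.length * 5 < a.length)         # condition (3): less than 1/5


def refine(clusters):
    """Drop every cluster that some other cluster absorbs."""
    return [b for b in clusters
            if not any(absorbs(a, b) for a in clusters if a is not b)]


big = Cluster("Computers", 1, 40, 30)
small = Cluster("Internet", 10, 14, 5)       # 5 < 30 / 5, so it is absorbed
borderline = Cluster("Internet", 20, 26, 6)  # 6 is not less than 30 / 5, kept
print(refine([big, small, borderline]) == [big, borderline])  # True
```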

4. Additional CAT or Chat Room Features Leveraging Chat Clustering

The proposed chat clustering process facilitates new online chat features that are otherwise impossible in “conventional” chat rooms. For example, the timeline display can be made clickable so that the following features can be incorporated:

(1) A pop-up topic information window that displays the cumulative statistics and information of ALL clusters associated with a particular topic, for example, total time elapsed, active chatters, regular patterns (say, “Discussions on this topic usually occur between 9pm and 11pm on Fridays”), and the start and end time of each cluster.
(2) A pop-up cluster information window that displays statistics and other information related to the cluster, e.g., time elapsed, a list of all participants and active chatters, and an indication of the quality of the conversation, i.e., a measure of how much noise, or off-topic discussion, is included in a chat session.
(3) Access to an annotation tool that allows a user to add annotations/comments to a particular cluster. The annotation window is presented as a single-threaded online message board. If the user changes the values of UCT, UPT and TT (so that the chat session is re-clustered), the system automatically attaches the annotations to the new cluster(s) that cover the same portion of the dialogue with the same topic.
(4) Launch of and access to a private chat room in which the chatters in the main chat room can engage in a topic-specific discussion on the topic reflected by the selected cluster. Each cluster can have one related private chat room.

The process also supports a search engine embodiment that allows a user to search a chat log for specific topics. When the user specifies a topic, CAT lists the clusters that relate to the topic. CAT may also list the sub-categories that relate to the topic. If the resulting list is too large, the user can narrow it down by specifying more conditions, such as the keywords of a cluster, the time period of a cluster, or the participants in a chat session or in a specific cluster of a chat session.

If a “clean” chat log (i.e., one with off-topic exchanges filtered out) is desired, CAT can generate a chat log that keeps the on-topic clusters (as specified by individual users) and removes the off-topic ones. Conversely, in a topic-specific chat session, if the chatters have been off-topic for longer than a period pre-specified by the chat moderator (say, three minutes), a CAT Agent that logs on to the chat room as a user can detect this and give a “verbal” warning to all the chatters. Such a feature is especially helpful for a teacher in a learning environment where several hundred small project groups communicate, as she certainly cannot facilitate so many concurrent chat discussions. A CAT Agent could assist her in facilitating online conversations, monitoring individual contributions, and alerting her whenever “authoritative” intervention is needed (for example, in cases of spamming, vulgarities or off-task behavior).
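The off-topic warning described above can be sketched as follows. This is one possible reading rather than CAT's actual mechanism: the function name `off_topic_alerts` is hypothetical, the check works on per-utterance topic sets instead of finished clusters, the first utterance is used as the starting reference point, and at most one warning is issued per off-topic stretch.

```python
def off_topic_alerts(utterances, session_topic, limit_seconds):
    """Return the times at which a warning would be issued.

    utterances: (time in seconds, set of detected topics) pairs,
    in chronological order.
    """
    alerts = []
    last_on_topic = None   # time of the most recent on-topic utterance
    warned = False         # suppress repeated warnings in one stretch
    for seconds, topics in utterances:
        if last_on_topic is None:
            last_on_topic = seconds      # assumed reference point
        if session_topic in topics:
            last_on_topic = seconds
            warned = False
        elif not warned and seconds - last_on_topic >= limit_seconds:
            alerts.append(seconds)
            warned = True
    return alerts


log = [(0, {"history"}), (60, {"musical"}), (150, set()),
       (200, {"musical"}), (260, {"history"}), (300, set()),
       (500, {"musical"})]
print(off_topic_alerts(log, "history", limit_seconds=180))  # [200, 500]
```

With a three-minute limit, the warning fires at the first off-topic utterance that arrives at least 180 seconds after the last on-topic one, and the counter resets as soon as the discussion returns to the session topic.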

5. Related Work

Ever since online chat became popular, researchers and developers all over the world have been working on innovative features and user interfaces to address the underlying problems and improve on “conventional” chat mechanisms. We present three systems that are strongly related to the underlying concept or technique of CAT.

Inspired by online threaded discussion forums and Usenet newsgroups, Smith, Cardiz & Burkhalter (2000) developed threaded text chat as a way to solve the adjacency pair problem in real-time chat (i.e., the two utterances of a pair are often separated by a potentially large number of intervening utterances). A side benefit of such a system is that relevant utterances are grouped together under various threads manually (by individual chatters' judgement). The CAT Agent does not aim to eliminate this problem in real-time chat (as Smith's system does), but it is able to group relevant utterances together automatically through the chat clustering process.

Conversation Landscape (Donath, Karahalios & Viegas 1999) is a visualization of the flow of a conversation vis-à-vis the different participants. It is a 2-D model of the conversation in which the participants, identified by colour, are arrayed along the x axis and the y axis represents time. Utterances are shown as horizontal lines: the wider the line, the longer the utterance. The rhythm of the conversation is revealed in the 2-D visualization. Similarly, the CAT Agent provides a temporal visualization of the conversation, but it goes one step further by depicting the evolution of discussion topics in terms of clusters and the length of each cluster, instead of simply indicating the length of individual utterances.

Another system for visualizing the chat conversational space is AIDE (Mase, Sumi & Nishimoto 1998; Sumi, Nishimoto & Mase 1997). The discussion viewer of AIDE provides all participants with a common information space by dynamically visualizing the semantic structure of the discussion. It displays the spatial arrangement of the keywords and the utterances in a 2-D space. Two utterances with shared keywords are located closer together, and these common keywords are mapped between the pair. In a sense, it is an alternative approach to chat clustering that groups relevant utterances in terms of keywords (whereas CAT groups utterances in terms of topics) and does not record the temporal proximity of these utterances (which CAT does).

6. Conclusion

The essence of chat clustering is to divide the content of a chat session (that is, to group adjacent utterances) into clusters according to the temporal sequence of the discussion topics. By contrast, conventional clustering methods are typically applied to a set of documents or items, which are grouped together by topic alone. The chat clustering process may be executed either in real time, i.e., as a chat session is happening, or after a chat session has completed.

In this paper, we present a novel chat clustering and visualization approach that groups adjacent utterances into clusters by topic and visualizes the clusters on a timeline. Such a mechanism enables general chatters to visualize the evolution of the chat topics and conveniently extract useful exchanges and information from chat transcripts, and enables chat room moderators to achieve greater control over chat sessions. The mechanism may be useful in reducing “repairs for misunderstood prior turns” (Garcia & Jacobs 1998), resulting in a neater chat. It is especially useful for education-oriented chat sessions (e.g., online project meetings among students from various schools) or chat sessions where a focus on a specific theme is desired. The “Conversation Facilitation Agents” described in (Looi 2001) are one application to which our approach can be applied.

We have developed a prototype of the CAT Agent in Java and have used it to cluster chat transcripts generated during 10 education-oriented and/or theme-based chat sessions.

7. References

Donath, J., Karahalios, K., & Viegas, F. (1999), Visualizing conversation, Journal of Computer-Mediated Communication, Vol.4, No.4, June 1999.
Garcia, A. & Jacobs, J.B. (1998), The interactional organization of Computer Mediated Communication in the college classroom, Qualitative Sociology, Vol.21, No.3, 1998.
Looi, C.K. (2001), Supporting conversations and learning in online chat, Proceedings of the 2001 Conference on Artificial Intelligence in Education, San Antonio, Texas, May 2001; also in Artificial Intelligence in Education, IOS Press, pp.142-153.
Mase, K., Sumi, Y., & Nishimoto, K. (1998), Informal conversation environment for collaborative concept formation, Toru Ishida (Ed.), Community Computing: Collaboration over Global Information Networks, John Wiley & Sons, 1998, pp.165-205.
Smith, M., Cardiz, J.J. & Burkhalter, B. (2000), Conversation trees and threaded chats, Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pp.97-105.
Sumi, Y., Nishimoto, K., & Mase, K. (1997), Personalizing shared information in creative conversations, IJCAI-97 Workshop on Social Interaction and Communityware, Nagoya, Aug. 1997.
Wong, L.H. & Looi, C.K. (2003), ChatterWeb – Building a web of online chat rooms, internal report, Kent Ridge Digital Labs., Singapore.
