Automatic identification of dynamic virtual communities of practice based on browsing patterns

Bhaskar Mitra, Radha C, Bharati Raghavan
Dept. of Computer Science, People's Education Society Institute of Technology
100 Feet Ring Road, Bangalore, India
E-mail: {bhaskar.mitra, radha.cr, bharati.raghavan}@gmail.com

Abstract Highly interactive computer-mediated communication has facilitated the formation of various forms of virtual learning communities on the Web, including communities of practice and Internet discussion groups engaged in interactive sharing and learning processes. However, the process of community formation or registration based on users' interest structures is still fundamentally a manual one. In this paper we present a novel approach to identifying virtual communities of practice that exist on the Web by examining the browsing patterns of on-line users with a neural network model.

Introduction Several Web-based technologies and services have transformed the World Wide Web into an ideal meeting place for its users. The Web has progressively advanced beyond being an information space to also being a social space [1]. Highly interactive computer-mediated communication (CMC) has facilitated the formation of virtual communities [2] and the transformation of the Web into a context for learning through community-supported collaborative construction [3]. Various forms of virtual learning communities (VLC) have existed over the years, including communities of practice (CoP) [4] and Internet discussion groups engaged in interactive sharing and learning processes [5]. However, the process of community formation or registration based on users' interest structures is still fundamentally a manual process. Users browse for Internet discussion forums of interest before joining, and after joining most members are found to be active only for a short time, after which they become passive recipients of messages (lurkers) with little or no contribution most of the time [2]. This accentuates the need for relatively short-term knowledge-sharing interactions instead: small CoPs marked by intense participation, dissolved once the needs of their users are met. For such short-term interactions, initiation or registration processes with the overhead of manually searching for compatible candidates with analogous interests are highly undesirable.

This overhead can be avoided by automatically identifying possible CoPs existing on the Web, ideally with minimal user involvement. To accomplish this we develop an on-line collaboration project called the Horde Project, which automatically identifies commonality of interests and possible CoPs from users' document access patterns (Web usage) using a neural network information retrieval model. The next section provides further background. Subsequent sections concentrate on metrics for measuring vicinity between Web users and on the design of the Horde Project in more detail. We also discuss the practical applicability of such an on-line tool in a later section.

Background Lave and Wenger [4] essentially define a CoP as a loosely bound, distributed group of people with shared interests who benefit from legitimate peripheral participation (LPP), a process in which new members participate peripherally in the community and are gradually promoted as they gain experience from the association [6]. Numerous case studies have been performed to date on CoPs, such as those involving an online community of journalists [7] and distributed design teams [8]. Though in this paper we concentrate only on CoPs that can be identified amongst the millions of daily Web surfers, it is interesting to note the inference drawn from the online community of journalists by Millen and Dray [7], which suggests a general feeling of openness in asking for assistance and providing answers within the CoP, contributing to shared information. Many diverse approaches have been adopted over the years to accurately cluster users into CoPs, including statistical clustering techniques, ontology networks [9, 10] and symbolic machine learning [11]. In ontology network analysis (ONA), a hierarchical data structure containing all the relevant entities and their relationships is formulated, and the strength of the relationships between the entities is measured to provide metrics of connectedness. The challenges lie in the construction of the ontology networks

and the correct estimation of the strength of relationships between entities [9]. Machine learning techniques have alternatively been employed for user modeling tasks by monitoring system usage. An important issue in machine learning is the choice of learning method. Two unsupervised learning methods are conceptual clustering and cluster mining. A conceptual clustering algorithm generates a hierarchy of entities, which can be used to select communities at different levels of generality. It generates disjoint communities, mapping each user uniquely to a single community, which makes it unsuitable for systems such as the Web where users commonly belong to more than one community. For such cases Paliouras et al. advocate the cluster mining method, involving graph construction, as more appropriate [11]. Many of these approaches incorporate either profiling of user navigation paths [12] or explicit collection of profile data from users. Browsing patterns usually provide a good indication of a user's interests. Chan discusses a procedure to capture a user's interest profile by monitoring the user's access behavior [13]. It is possible, though not comprehensively exploited, to establish relationships between the millions of users on the Web based on their browsing patterns [14]. In contrast, the relationships between the millions of hyperlinked documents on the Web have been well explored and employed in the form of massive indexes for searching (for example, [15]).

Vicinity metrics

Based on the nomenclature proposed by Lottermoser et al. [16], the Horde Project is in effect a dynamic, non-disjunctive, non-symmetric, vicinity-based presence awareness service for the Web. Lottermoser et al. define vicinity as "a set of locations which are adjacent in terms of awareness", such that a presentity occupying a location within the vicinity of another is aware of the other presentity. In our context a presentity maps to an on-line user, and users are said to be in each other's vicinity if they exhibit satisfactory levels of similarity of interests. There has been some research on metrics for measuring distances between users in cyberspace. According to Froitzheim and Schubert [17], Web pages, and all other types of network-accessible documents, can be seen as locations in the virtual world. These virtual locations correspond to places in the real world like rooms, street corners, and stores. People are moving - browsing - between virtual locations via hypertext references... Humans, and other active entities (e.g. robots), are acting in the space spanned by URLs. One action is movement through the virtual world... The second action is communication between active entities, e.g. humans or agents. Froitzheim and Schubert advocate hypertext references as the obvious choice of metric but also cite document content overlap as a possible alternative [17, 18].

Sidler, Scott and Wolf see the vicinity metric as a combination of four simpler metrics - space, semantic, time and user interest - which they visualize as orthogonal axes in the vicinity space [19]. The space metric is similar to that advocated by Froitzheim and Schubert, being based on hypertext references. The semantic metric is based on document semantics, using "weighted hyperlinks or by using content-based information from the documents". The time metric is based on the temporal overlap of users' browsing sessions. In the human-centered interest metric, vicinity is defined "based on shared interest, the ability to speak a common language, membership of the same cultural group, and many other definable characteristics". The Horde Project defines user vicinity based on the semantic and time metrics. Whereas the time metric is an essential requirement for visibility of another presentity, by itself it does not indicate any commonality of interests or practice. The semantic metric is thus the primary factor in determining vicinity between surfers on the Web. It is applied to both the content and the context (as indicated by the text of hyperlinks pointing towards the document) of the documents accessed by the users. Based on temporal analysis of the predominant semantic patterns in the documents browsed by the users, we try to find resemblances between users' interests.

The Horde Project approach

A neural network is an information processing technology within the framework of soft computing and computational intelligence, well suited for information retrieval tasks [20]. It presents a suitable form of knowledge representation for information retrieval applications, wherein nodes usually represent IR objects such as keywords, documents and links, and bi-directional connections between the nodes represent their weighted relevance associations. Learning algorithms can be applied to adjust connection weights so that the network can predict or classify unknown examples correctly [21]. Connectionist algorithms and neural networks have been applied over the years to finding nebulous relationships and patterns between real-world entities (for example, topic spotting in natural language documents [22]). In the Horde Project we use a neural network model whose learning process involves weight computations approximating the strength of the relationships that exist between users, documents, links and words in the real world, from data collected by crawling the Web and monitoring users' document access patterns. From these identified relationships we determine the vicinity between millions of Web users. We present the structure of the neural network and the weight derivation process in greater detail in the following sections, and also describe the processes of user access monitoring and Web crawling.

Types of layers The neural network model used in the Horde Project is composed of layers of four types, as shown in figure 1. The input and output layers of the neural network are user layers; the link layers, document layers and word layers are its hidden layers. Detailed information on each layer in the model follows.

User layer This layer comprises nodes representing each uniquely identified user on the Web. Users are identified by unique usernames or screen-names. Only users who are on-line at the current instant are mapped to nodes. The user layer nodes are connected to the nodes in the link and document layers.

Link layer Each node in this layer represents a unique hyperlink in the crawled subset of the Web. A link is representative both of the content of the document it points to and of the context of the document in the form of the hyperlink text.

Document layer The nodes in the document layer represent individual documents or web pages that have been discovered while crawling. The document nodes, in essence, represent the contents of the corresponding documents.

Word layer Nodes in this layer represent words identified from the documents crawled on the Web. Stop words are not mapped, and different forms of the same root share a single node: for example, "dyes", "dyeing" and "dyed" are all reduced to the node for "dye". Word identification and processing (stop word removal, stemming etc.) are described later. The innermost two hidden layers of the neural network are word layers.

Weight assignment In this section we discuss the weight assignment policies for the connections between the nodes of the different layers of the neural network model. The weights represent the strength of the relationship between the entities represented by the corresponding nodes.

Input/Output layers

As discussed earlier, the input and output layers of the Horde Project are user layers. The weights between the user layers and the corresponding link and document layers are derived dynamically from the user's document access patterns. The data collected in this fashion is dynamic, in contrast to stereotype models where prior information about users is assumed to be available [11]. As part of the Horde Project client we develop a browser extension for Internet Explorer [23] that monitors every link the user clicks and every URL he/she visits and reports them in real time to the centralized Horde server via a web service. The choice of browser is immaterial to the Horde Project, and the same extension can also be developed for other browsers such as Firefox [24]. On the server side every URL visited and hyperlink clicked by the user is logged along with a timestamp. Here we propose the semantics of a user as the collection of documents (or links) he has recently visited (or clicked), augmented with the following perceptions: (a) the more frequently u_k accesses d_j, the more significant d_j is in characterizing the interests of u_k (the document access frequency assumption); (b) the more users access document d_j, the smaller the contribution of the document in characterizing any user u_k accessing it (the inverse user frequency assumption); (c) the more recent the timestamp of an access, the higher its influence in characterizing the user u_k. Based on the above insight we compute a function, document frequency-inverse user frequency, dfiuf(d_j, u_k), for document d_j and user u_k as follows,

$$\mathrm{dfiuf}(d_j, u_k) = \mathrm{df}(d_j, u_k) \cdot \log \frac{|U|}{\#U(d_j)}$$

where #U(d_j) denotes the number of distinct users in the user set U who have visited d_j at least once, and

$$\mathrm{df}(d_j, u_k) = \begin{cases} 1 + \log \#(d_j, u_k) & \text{if } \#(d_j, u_k) > 0 \\ 0 & \text{otherwise} \end{cases}$$

And #(d_j, u_k) is given as

$$\#(d_j, u_k) = \sum_{i=1}^{n} \begin{cases} M - e^{k t_i} & \text{if } M \ge e^{k t_i} \\ 0 & \text{otherwise} \end{cases}$$

where n is the total number of times the user u_k has accessed d_j in the last T_max seconds, M is one greater than the maximum weight contribution possible by a single access of a document, and t_i is the difference between the timestamp of the ith access and the current time. The relationship between M, k and T_max is

$$e^{k T_{\max}} = M$$

Here we have assumed that the significance of an access decays exponentially with time. The final weight is obtained by cosine normalization, finally yielding

$$w_{jk} = \frac{\mathrm{dfiuf}(d_j, u_k)}{\sqrt{\sum_{s=1}^{|c|} \mathrm{dfiuf}(d_s, u_k)^2}}$$

where the sum runs over the documents d_s in the collection c of documents accessed by u_k.

Note that access records whose timestamps are older than the current time by more than Tmax (generally 3 hours) are periodically removed from the database.
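The access-weighting scheme above can be sketched in a few lines. The function names, the dictionary-based interface, and the choice M = 10 are illustrative assumptions; only the formulas themselves (#, df, dfiuf, the relation e^(k T_max) = M, and the cosine normalization) come from the text.

```python
import math
import time

M_DEFAULT = 10.0   # assumed: one greater than the max single-access contribution
T_MAX = 3 * 3600   # access records older than T_max (3 hours) are discarded

def access_weight(t_access, now, M=M_DEFAULT, T_max=T_MAX):
    """Contribution of one access: M - e^(k*t), with k chosen so that
    e^(k*T_max) = M, i.e. the contribution decays to zero at age T_max."""
    k = math.log(M) / T_max
    t = now - t_access                                     # age of the access in seconds
    return max(M - math.exp(k * t), 0.0)

def dfiuf(access_times, total_users, users_of_doc, now=None):
    """dfiuf(d_j, u_k) for one document d_j and user u_k.
    access_times: timestamps of u_k's recent accesses to d_j.
    total_users: |U|; users_of_doc: #U(d_j), distinct users who visited d_j."""
    now = time.time() if now is None else now
    n = sum(access_weight(t, now) for t in access_times)   # #(d_j, u_k)
    df = 1 + math.log(n) if n > 0 else 0.0                 # df(d_j, u_k)
    return df * math.log(total_users / users_of_doc)

def cosine_normalize(weights):
    """Final weights w_jk: scale a user's dfiuf values to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {doc: w / norm for doc, w in weights.items()}
```

A fresh access contributes M - 1 (here 9), and an access exactly T_max old contributes nothing, matching the decay assumption above.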

Hidden layers The weights between the nodes in the hidden layers of the neural network are derived directly from the data collected by crawling the Web. The Horde Project crawler starts from a small initial database of URLs. It fetches each document or page and extracts all the hyperlinks present in the document. Each URL discovered from the hyperlinks is added to the database for subsequent crawling. The content of each fetched document then undergoes a preprocessing phase: stop words are removed by checking against a static list, and all words are reduced to their roots by applying a modified version of Porter's stemming algorithm [25], as suggested by Yamout et al. [26]. Subsequent processing determines the relationships between the hidden layer nodes in terms of weights, as described next.
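The preprocessing phase might be sketched as follows. The tiny stop-word list and the crude suffix stripper are stand-ins only: the actual system uses a full static stop-word list and the modified Porter stemmer [25, 26].

```python
import re

# Stand-in for the static stop-word list used by the crawler.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def crude_stem(word):
    """Illustrative suffix stripping only -- the real pipeline applies
    Porter's algorithm (or the Yamout et al. variant) instead."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and reduce remaining words to roots."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]
```

For example, "The dyes of fabric" reduces to the root tokens for "dye" and "fabric".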

Link to document. Each link node has a single edge connecting it to the document node for the target document of the link. The weight on this edge is a constant (γ) for all link-document node pairs. The constant γ represents the probability that the user's decision to access the document was based on its actual content only; 1-γ therefore represents the probability that the decision was based on the link text (the document context). Here we assume that if a user is aware of the contents of a document, the link text has little effect on his decision whether to access it; conversely, if he is unaware of the actual content, the context is his only source of information on which to base the decision. The user's decision is therefore based on either the document context or the content, but not both. Note that the connections from document nodes back to link nodes carry the same weights symmetrically.

Link to word. Apart from the document nodes, the link nodes also connect to the word nodes for the words that appear in the link text. The weights are calculated as for document-to-word connections (described later), with the words of the link text treated as equivalent to title text in the document-to-word weight calculation. The weights thus calculated are further multiplied by a factor of 1-γ, the probability that the document was accessed based on the hyperlink text only. The weights from word nodes back to link nodes are exactly the same as in the reverse direction.

Document to word. The document-to-word weighting is based on the document occurrence representation (DOR) widely adopted in the information retrieval community [27]. The document frequency - inverse term frequency, dfitf(d_k, t_j), for document d_k and word t_j is computed as per the following relation

$$\mathrm{dfitf}(d_k, t_j) = \mathrm{df}(d_k, t_j) \cdot \log \frac{|D|}{\#D(d_k)}$$

where #D(d_k) denotes the number of distinct terms in the dictionary D which occur at least once in d_k, and

$$\mathrm{df}(d_k, t_j) = \begin{cases} 1 + \log \#(d_k, t_j) & \text{if } \#(d_k, t_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where #(d_k, t_j) is given as

$$\#(d_k, t_j) = \sum_{i=1}^{n} f(t_{ji})$$

The function f(t_ji) returns a value between 0 and 1 proportional to the emphasis on the ith occurrence of term t_j. For example, if the ith occurrence of t_j is in the title of the web page then f(t_ji) returns 1; f(t_ji) is likewise higher for occurrences in bold face or in a larger font size. The final weight obtained by cosine normalization is as follows

$$w_{kj} = \frac{\mathrm{dfitf}(d_k, t_j)}{\sqrt{\sum_{s=1}^{|c|} \mathrm{dfitf}(d_k, t_s)^2}}$$

where the sum runs over the terms t_s of the collection c of terms occurring in d_k.
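A minimal sketch of the document-to-word weighting follows. Two assumptions are made explicit: cosine normalization is taken to run over the terms of the document, and the emphasis values for f other than title occurrences (which the text fixes at 1) are invented for illustration.

```python
import math

# Assumed emphasis values for f(t_ji); only the title value of 1 is given in the text.
EMPHASIS = {"title": 1.0, "bold": 0.8, "large": 0.8, "plain": 0.4}

def document_to_word_weights(doc_terms, dict_size, distinct_terms_in_doc):
    """doc_terms: {term: [emphasis tag per occurrence in d_k]}.
    dict_size: |D|; distinct_terms_in_doc: #D(d_k).
    Returns cosine-normalized weights from document node d_k to word nodes."""
    idf = math.log(dict_size / distinct_terms_in_doc)
    raw = {}
    for term, occurrences in doc_terms.items():
        count = sum(EMPHASIS[tag] for tag in occurrences)  # #(d_k, t_j)
        df = 1 + math.log(count) if count > 0 else 0.0     # df(d_k, t_j)
        raw[term] = df * idf                               # dfitf(d_k, t_j)
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}
```

After normalization the document's outgoing weight vector has unit length, and a title occurrence outweighs a plain-text one.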

Word to document. The computation of word-to-document node weights is similar to that of document-to-word weights, obtained by multiplying the earlier result by a factor ρ, the PageRank of the document in question. Brin and Page calculate the rank of a document as

PR(A) = (1 - d) + d × (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where PR(T) is the PageRank of document T, C(T) is the number of links going out of document T, T_1...T_n are the documents with links pointing towards A, and d is a damping factor between 0 and 1 (usually set to 0.85) [15].

Word to word. The weighting between word nodes is based on the term co-occurrence representation (TCOR) [27]. The term frequency - inverse term frequency takes the form

$$\mathrm{tfitf}(t_k, t_j) = \mathrm{tf}(t_k, t_j) \cdot \log \frac{|D|}{\#D(t_k)}$$

where #D(t_k) denotes the number of terms in the dictionary D which co-occur with t_k in at least one document, and

$$\mathrm{tf}(t_k, t_j) = \begin{cases} 1 + \log \#(t_k, t_j) & \text{if } \#(t_k, t_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where #(t_k, t_j) denotes the number of documents in which t_k and t_j co-occur. The weights thus obtained are likewise normalized by cosine normalization, yielding

$$w_{kj} = \frac{\mathrm{tfitf}(t_k, t_j)}{\sqrt{\sum_{s=1}^{|c|} \mathrm{tfitf}(t_k, t_s)^2}}$$

where the sum runs over the terms t_s co-occurring with t_k.
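The unnormalized term-to-term (TCOR) weight can be sketched as follows. Representing each document as a set of terms is sufficient here, since both #(t_k, t_j) and #D(t_k) count documents rather than within-document occurrences.

```python
import math

def tfitf(docs, t_k, t_j):
    """docs: list of per-document term sets. Returns the unnormalized
    term-to-term weight tfitf(t_k, t_j) from co-occurrence counts."""
    dictionary = set().union(*docs)                              # D
    co = sum(1 for d in docs if t_k in d and t_j in d)           # #(t_k, t_j)
    tf = 1 + math.log(co) if co > 0 else 0.0                     # tf(t_k, t_j)
    partners = {t for d in docs if t_k in d for t in d} - {t_k}  # terms co-occurring with t_k
    if not partners:                                             # guard: #D(t_k) = 0
        return 0.0
    return tf * math.log(len(dictionary) / len(partners))        # log(|D| / #D(t_k))
```

Terms that never co-occur receive zero weight, so the corresponding word-to-word connections carry no signal.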

Execution The Horde Project server comprises the neural network described in the previous sections, exposed through Web services. The weights in the neural network model are determined as already explained. The Horde Project browser extension on the client machine, in addition to reporting user access patterns to the server, periodically polls for a list of other users with compatible levels of similarity in browsing interests. On the server a poll request is transformed into one execution of signal flow through the entire neural network. The signal flow starts at the user layer node corresponding to the user from whom the poll request was received: this node is given an input signal of unit strength while the other nodes receive none. The signal then flows through the different layers of the neural network, weighted by the inter-layer connections described above.

The weights between the user layer and the document and link layers associate the user with the document context and content most prominent in his access pattern. The connections to the word layer identify the key words most recurrent in the documents and links accessed. The next layer of word-layer connections recognizes other words that frequently appear on the Web alongside the already identified words. The entire array of words thus identified is representative of the concepts of interest to the user. The rest of the neural network is almost symmetrical to the first half: the array of words is mapped back to documents and links in which those words occur frequently and in more prominent styles (bold face, larger font size, etc.). In the final phase, the users who display the highest level of interest in the identified documents and links are themselves identified and ranked by the strength of the signal received at their corresponding nodes in the output layer. The n top-ranking users are reported back to the client as compatible with the client's perceived interests, and are presented to the user by the Horde Project browser extension.
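The signal-flow execution described above amounts to repeated matrix-vector products through the chain of layer weight matrices (user to link/document, to word, to word, to document/link, back to user). A pure-Python sketch, in which the list-of-rows matrix representation and the top-n interface are assumptions:

```python
def rank_similar_users(layers, user_index, top_n=5):
    """layers: chain of weight matrices, each a list of rows, whose product
    maps the user input layer to the user output layer. A unit signal at the
    polling user's node is propagated through every layer; output users are
    ranked by the signal strength received at their nodes."""
    signal = [0.0] * len(layers[0])
    signal[user_index] = 1.0                       # unit input at the polling user
    for w in layers:
        cols = len(w[0])
        signal = [sum(signal[i] * w[i][j] for i in range(len(w)))
                  for j in range(cols)]            # weighted propagation to next layer
    ranked = sorted(range(len(signal)), key=lambda j: signal[j], reverse=True)
    return [j for j in ranked if j != user_index][:top_n]
```

With a single symmetric 3-user layer whose strongest connection from user 0 is to user 2, the poll for user 0 ranks user 2 first.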

Application Recognizing CoPs among on-line users on the Web provides a platform for collaboration and knowledge sharing. The results of the described tool can be used to build innovative applications that enable quick information exchange in areas of common interest. Social networking software can adopt the results to suggest compatible candidates for networking to its users. The results can also transform browsing from a lonely search for information into a collaborative process: document recommendations can be provided to users based on what other users with comparable interests are reading, or users can be allowed to correspond more closely and interactively through suitable collaboration frameworks. Examples of such collaboration efforts include Collaborative Web Browsing [28] and CoBrow [19].

Conclusion With the massive increase of information content on the Web, a dire need is felt for quick extraction of pivotal information with minimal searching overhead. In the physical world we often prefer to ask someone with prior familiarity with the subject of interest rather than explore unwieldy quantities of literature. The Horde Project is an endeavor to bring together on-line users with similar interests for quick and effective flow of knowledge. The Horde Project is currently being developed on the Microsoft .NET Framework 2.0 [29], with Microsoft SQL Server 2005 [30] as the database. The current version of the server is designed using the model-view-controller pattern [31] and exposes its functionality through web services and an admin console. The Horde Project uses the described neural network model to identify relationships between users based on their browsing patterns. Equipped with such knowledge of potential CoPs, and with suitable framework support for easy collaboration, the act of browsing can become a truly collaborative experience.

Acknowledgement We thank Dr. Kavi Mahesh, Dright Ho and Cohan Sujay Carlos for their reviews, constructive criticisms and suggestions that helped us write this paper and Annapurna PS for her involvement in the implementation of the Horde Project.

References
[1] Donath, J., and Robertson, N. 1994. "The Sociable Web". Proceedings of the Second International WWW Conference, Chicago, IL, October 1994.
[2] Jones, S. G., ed. 1996. Cybersociety: Computer-Mediated Communication and Community. New Media Cultures. Canadian Journal of Communication.
[3] Bruckman, A. 1997. MOOSE Crossing: Construction, Community, and Learning in a Networked Virtual World for Kids. PhD Thesis, MIT.
[4] Lave, J. and Wenger, E. 1991. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press.
[5] Zhang, Y., and Tanniru, M. 2005. An Agent-Based Approach to Study Virtual Learning Communities. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05), Track 1, Volume 01 (January 03-06, 2005). IEEE Computer Society, Washington, DC, 11.3.
[6] Gourlay, S. 2003. Communities of Practice: A new concept for the millennium, or the rediscovery of the wheel? Available at http://ktru-main.lancs.ac.uk.
[7] Millen, D. R. and Dray, S. M. 2000. Information sharing in an online community of journalists. Aslib Proceedings: new information perspectives.
[8] Pemberton-Billing, J., Cooper, R., Wootton, A. B. and Andrew, N. W. North. 2003. Distributed Design Teams as Communities of Practice. Proceedings of the 5th European Academy of Design Conference.
[9] Alani, H., Dasmahapatra, S., O'Hara, K. and Shadbolt, N. 2003. Identifying Communities of Practice through Ontology Network Analysis. IEEE Intelligent Systems 18(2), pp. 18-25.
[10] Davies, J., Duke, A. and Sure, Y. 2003. OntoShare - An Ontology-based Knowledge Sharing System for Virtual Communities of Practice. Proceedings of I-KNOW.
[11] Paliouras, G., Papatheodorou, C., Karkaletsis, V. and Spyropoulos, C. D. 2002. Discovering User Communities on the Internet using Unsupervised Machine Learning Techniques. Interacting with Computers, Vol. 14(6), pp. 761-791, January 2002.
[12] Shahabi, C., Zarkesh, A. M., Adibi, J. and Shah, V. 1997. Knowledge Discovery from Users Web-Page Navigation. RIDE 1997.
[13] Chan, P. K. 2000. Constructing Web User Profiles: A Non-Invasive Learning Approach. In Revised Papers from the International Workshop on Web Usage Analysis and User Profiling, B. M. Masand and M. Spiliopoulou, Eds. Lecture Notes in Computer Science, vol. 1836. Springer-Verlag, London, 39-55.
[14] Fu, Y., Sandhu, K. and Shih, M.-Y. 1999. Clustering of Web Users Based on Access Patterns. Proceedings of WEBKDD (Workshop on Web Usage Analysis and User Profiling), 1999.
[15] Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh International Conference on World Wide Web (Brisbane, Australia).
[16] Lottermoser, B. G., Plaice, J., Kropf, P. and Slonim, J. 2003. Distributed Communities on the Web. Springer.
[17] Froitzheim, K. and Schubert, P. 2004. Presence in Communication Spaces. 19th International CODATA Conference.
[18] EU Telematics Project CoBrow: "System Specification", Project Deliverable 4.3, Feb. 1997.
[19] Sidler, G., Scott, A. and Wolf, H. 1997. Collaborative browsing in the World Wide Web. Proceedings of the 8th Joint European Networking Conference, Edinburgh, May 12-15, 1997.
[20] Mandl, T. 2000. Tolerant and Adaptive Information Retrieval with Neural Networks. Available at http://unihildesheim.de.
[21] Chen, H. 1995. Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. Journal of the American Society for Information Science, Volume 46, Issue 3, pp. 194-216.
[22] Wiener, E., Pedersen, J. O. and Weigend, A. S. 1995. A Neural Network Approach to Topic Spotting. Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval.
[23] Microsoft Corp., Internet Explorer. Available at http://www.microsoft.com.
[24] Mozilla Corp., Firefox. Available at http://www.mozilla.com.
[25] Porter, M. F. 2002. "Developing the English Stemmer". http://snowball.tartarus.org/.
[26] Yamout, F., Demachkieh, R., Hamdan, G. and Sabra, R. 2004. Further Enhancement to the Porter's Stemming Algorithm. Workshop on Text-based Information Retrieval TIR-04.
[27] Lavelli, A., Sebastiani, F. and Zanoli, R. 2004. An Experimental Comparison of Term Representations for Term Management Applications. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM-2004), Washington, USA, 9-11 November 2004.
[28] Esenther, A. W. 2002. Instant Co-Browsing: Lightweight Real-Time Collaborative Web Browsing. In Proceedings of WWW, 2002.
[29] Microsoft Corp., Microsoft Visual Studio 2005. Available at http://msdn.microsoft.com/vstudio.
[30] Microsoft Corp., Microsoft SQL Server 2005. Available at http://www.microsoft.com/sql/default.mspx.
[31] Freeman, E., Bates, B. and Sierra, K. 2004. Head First Design Patterns. O'Reilly Media, Inc., pp. 526-549.
