
A Project Report on

The Horde Project: Collaborative Browsing Framework for Dynamic Online Communities of Practice

by
Annapurna P S - 1PI02CS013
Bharati Raghavan - 1PI02CS024
Bhaskar Mitra - 1PI02CS025
Radha C - 1PI02CS075

Guide
Dr. Kavi Mahesh

Feb’06 - June’06

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PES INSTITUTE OF TECHNOLOGY
100 FEET RING ROAD, BSK-III STAGE, BANGALORE – 560 085

PES Institute of Technology 100 Feet Ring Road, BSK- III Stage, Bangalore – 560085 DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

Certified that the project work entitled The Horde Project: Collaborative Browsing Framework for Dynamic Online Communities of Practice is a bonafide work carried out by Annapurna P S, Bharati Raghavan, Bhaskar Mitra and Radha C in partial fulfillment for the award of degree of Bachelor of Engineering in Computer Science and Engineering of the Visveswariah Technological University, Belgaum during the year 2006. It is certified that all corrections and suggestions indicated for internal assessment have been incorporated in the report deposited in the Departmental Library. The Project Report has been approved as it satisfies the academic requirements in respect of the project work prescribed for the Bachelor of Engineering Degree.

Signature of the Guide Dr. Kavi Mahesh

Signature of the HOD Prof. Nitin V. Pujari

Signature of the Principal Dr. K. N. B. Murthy

Name of the Student: Annapurna P S University Seat Number: 1PI02CS013

Name of the Student: Bharati Raghavan University Seat Number: 1PI02CS024

Name of the Student: Bhaskar Mitra University Seat Number: 1PI02CS025

Name of the Student: Radha C University Seat Number: 1PI02CS075

External Viva Name of the Examiners

Signature with date

1. ……………………….

……………………………….

2. …………………………

……………………………….

Acknowledgement

We are extremely grateful to our institution PESIT for providing us with facilities that have made this project a success. We express our heartfelt gratitude to Prof. Nitin V. Pujari, Head of the Computer Science and Engg. Department, whose support and guidance were invaluable. We would like to express our gratitude to Dr. K. N. Balasubramanya Murthy, Principal, PESIT and Prof. D. Jawahar, Director, PESIT for providing us a congenial environment to work in. We are extremely grateful to our faculty-in-charge Dr. Kavi Mahesh for his guidance, advice and continuous encouragement. We would also like to express our gratitude to our project coordinator Mr. V. Badri Prasad for his guidance.

Index

1 Introduction
2 Problem Definition
3 Literature Survey
4 Project Requirement Definition
  4.1 User Characteristics
  4.2 Assumptions and Dependencies
  4.3 Requirement Specification
    4.3.1 Functional Requirements
    4.3.2 Operational Requirements
    4.3.3 Non-Functional Requirements
    4.3.4 Security Requirements
    4.3.5 User Interface Description
    4.3.6 Reporting Requirements
5 System Requirement Definition
  5.1 Scope and Feasibility of the Software
  5.2 Software Requirements
  5.3 Hardware Requirements
6 GANTT Chart
7 System Design
  7.1 System Overview
    7.1.1 Neural Network Design
    7.1.2 High Level Architecture
8 Detailed System Design
  8.1 Database Design
  8.2 User Interface Details
  8.3 Classes
9 Implementation
  9.1 Interaction between objects of various classes
10 Integration
11 Testing
  11.1 Unit and Functional Testing
  11.2 Integration Testing
12 Conclusion
13 Future Enhancements

Abstract

Highly interactive computer mediated communication has facilitated the formation of various forms of virtual communities. However, community formation remains fundamentally a manual process. The aim of the Horde project is to dynamically identify virtual communities of practice that exist on the Web by examining the browsing patterns of online users with a neural network based information retrieval model, and then to provide framework support for the realization of true collaboration in Web browsing.

The dynamic identification of virtual communities of practice involves constructing a neural network from two sources of data: information obtained by crawling the Web and processing the results, and data obtained by monitoring users’ browsing patterns. The end result of this stage is an ordered list of users or documents that are related to the user initiating the query. Each entry in the list carries a factor indicating its degree of relation to the user initiating the query. These results are then used to construct framework support for collaborative browsing, which includes text messaging, hooked browsing etc.

The Horde Project: Collaborative Browsing Framework for Dynamic Online Communities of Practice

1 Introduction

Several Web based technologies and services have transformed the World Wide Web into an ideal meeting place for its users. The Web has progressively advanced beyond being an information space to become a social space as well. Highly interactive computer mediated communication (CMC) has facilitated the formation of virtual communities and the transformation of the Web into a context for learning through community-supported collaborative construction. Various forms of virtual learning communities (VLC) have existed over the years, including communities of practice (CoP) and Internet discussion groups engaged in interactive sharing and learning processes.

However, the process of community formation or registration based on a user’s interest structure is still fundamentally a manual process. Users browse for Internet discussion forums of interest before joining. After joining, most members are active for a short time, after which they become passive recipients of messages (lurkers) with little or no contribution most of the time. This accentuates the need for relatively short-term knowledge sharing interactions instead, marked by intense participation in the form of small CoPs that are dissolved once the need of the user is met. For such short-term interactions, initiation or registration processes that carry the overhead of manually searching for compatible candidates with analogous interests are highly undesirable. This overhead can be avoided by automatically identifying possible CoPs existing over the Web, ideally with minimal user involvement. To accomplish this we develop an on-line collaboration project called the Horde Project, which automatically identifies commonality of interests and possible CoPs based on users’ document access patterns or Web usage, using a neural network based information retrieval model.

Department of CS

Feb’06 – June’06

-1-


The Horde Project also provides means for such users with commonality of interests to communicate with each other through simple text messages. Cobrowsing, a process in which users transit from one page to another collectively as a group, is also supported. This allows users to share their knowledge in the fields of common interest with each other.


2 Problem Definition

The aim of the Horde project is to dynamically identify virtual communities of practice that exist on the Web by examining the browsing patterns of online users using a neural network based information retrieval model, and then to provide framework support for the realization of true collaboration in Web browsing. The collaboration framework should provide means for the users to communicate with each other and to browse the Web collaboratively.


3 Literature Survey

The initial literature survey involved the study of previous works and related concepts in the form of technical papers, books and other online and offline publications. The referenced publications and the context in which they have proven useful are specified in the references section. This section gives some insight into the various topics that became a part of our literature survey.

Communities of Practice

A Community of Practice (CoP) can be defined as a loosely bound distributed group of people with shared interests benefiting from legitimate peripheral participation (LPP), which is a process of peripheral participatory association of the fresh members of the community and their consequent promotion with experience gained from the association. Numerous case studies have been performed to date on CoPs, such as those involving an online community of journalists and distributed design teams. They suggest the existence of a general feeling of openness in asking for assistance and providing answers within the CoP, which contributes to shared information.

Many diverse approaches have been adopted over the years to accurately cluster users into CoPs, including statistical clustering techniques, ontology networks and symbolic machine learning. In ontology network analysis (ONA), a hierarchical data structure containing all the relevant entities and their relationships is formulated, and the strength of the relationships between the entities is measured to provide metrics of connectedness. The challenge lies in the construction of the ontology networks and the correct estimation of the strength of relationships between entities.


Machine learning techniques have alternatively been employed for user modeling tasks by monitoring system usage. An important issue in machine learning techniques is the choice of the learning method. Two of the unsupervised learning methods are conceptual clustering and cluster mining. A conceptual clustering algorithm generates a hierarchy of entities, which can be used to select communities at different levels of generality. It generates disjoint communities, mapping each user uniquely to a community, and is therefore unsuitable for systems such as the Web where users commonly belong to more than one community.

Many of these approaches incorporate either profiling of user navigation paths or explicit profile data collection from users. Browsing patterns usually provide a good indication of a user’s interests. It is possible, though not yet comprehensively exploited, to establish a relationship between the millions of users on the Web based on their browsing patterns. In contrast, the relationship between millions of hyperlinked documents on the Web has been well explored and employed in the form of massive indexes for searching.

While clustering users into CoPs, an important issue to be taken care of is the choice of vicinity metrics. Vicinity can be defined as “a set of locations which are adjacent in terms of awareness” such that a presentity occupying a location within the vicinity of another is aware of the other presentity. In our context presentities map to on-line users, who are said to be in each other’s vicinity if they exhibit satisfactory levels of similarity of interests. There has been some research on metrics for measuring distances between users in cyberspace. According to Froitzheim and Schubert,

Web pages, and all other types of network accessible documents, can be seen as locations in the virtual world. These virtual locations correspond to places in the real world like rooms, street corners, and stores. People are moving – browsing – between virtual locations via hypertext references… Humans, and other active entities (e.g. robots) are acting in the space spanned by URLs. One action is movement through the virtual world... The second action is communication between active entities, e.g. humans or agents.

Froitzheim and Schubert advocate hypertext references as the obvious choice of metric but also cite document content overlap as another possible alternative. Sidler, Scott and Wolf see the vicinity metric as a combination of four simpler metrics (space, semantic, time and user interest), which they visualize as orthogonal axes in the vicinity space. The space metric is similar to that advocated by Froitzheim and Schubert, being based on hypertext references. The semantic metric is based on document semantics, using “weighted hyperlinks or by using content-based information from the documents”. The time metric is based on the temporal overlap of users’ browsing sessions. In the human-centered interest metric, vicinity is defined “based on shared interest, the ability to speak a common language, membership of the same cultural group, and many other definable characteristics”.

Information Retrieval

Information retrieval (IR) is the art and science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.

Since the 1940s the problem of information storage and retrieval has attracted increasing attention. It is simply stated: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is
never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. In libraries, many of which certainly have an information storage and retrieval problem, some of the more mundane tasks, such as cataloguing and general administration, have successfully been taken over by computers. However, the problem of effective retrieval remains largely unsolved. In principle, information storage and retrieval is simple. Suppose there is a store of documents and a person (user of the store) formulates a question (request or query) to which the answer is a set of documents satisfying the information need expressed by his question. He can obtain the set by reading all the documents in the store, retaining the relevant documents and discarding all the others. In a sense, this constitutes 'perfect' retrieval. This solution is obviously impracticable. A user either does not have the time or does not wish to spend the time reading the entire document collection, apart from the fact that it may be physically impossible for him to do so. When high-speed computers became available for non-numerical work, many thought that a computer would be able to 'read' an entire document collection to extract the relevant documents. It soon became apparent that using the natural language text of a document not only caused input and storage problems (it still does) but also left unsolved the intellectual problem of characterizing the document content. It is conceivable that future hardware developments may make natural language input and storage more feasible. But automatic characterization in which the software attempts to duplicate the human process of 'reading' is a very sticky problem indeed. 
More specifically, 'reading' involves attempting to extract information, both syntactic and semantic, from the text and using it to decide whether each document is relevant or not to a particular request. The difficulty is not only knowing how to extract the information but also how to use it to decide relevance. The comparatively slow progress of modern linguistics on the
semantic front and the conspicuous failure of machine translation show that these problems are largely unsolved. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. When the characterization of a document is worked out, it should be such that when the document it represents is relevant to a query, it will enable the document to be retrieved in response to that query. Human indexers have traditionally characterized documents in this way when assigning index terms to documents. The indexer attempts to anticipate the kind of index terms a user would employ to retrieve each document whose content he is about to describe. Implicitly he is constructing queries for which the document is relevant. When the indexing is done automatically it is assumed that by pushing the text of a document or query through the same automatic analysis, the output will be a representation of the content, and if the document is relevant to the query, a computational procedure will show this. Intellectually it is possible for a human to establish the relevance of a document to a query. For a computer to do this we need to construct a model within which relevance decisions can be quantified. It is interesting to note that most research in information retrieval can be shown to have been concerned with different aspects of such a model.


The diagram above illustrates a typical IR system. It shows three components: input, processor and output.

We start with the input side. The main problem here is to obtain a representation of each document and query suitable for a computer to use. It should be emphasized that most computer-based retrieval systems store only a representation of the document (or query), which means that the text of a document is lost once it has been processed for the purpose of generating its representation. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated. When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback.

Secondly, there is the processor, the part of the retrieval system concerned with the retrieval process. The process may involve structuring the information in some appropriate way, such as classifying it. It will also involve performing the actual retrieval function, that is, executing the search strategy in response to a query. In the diagram, the documents have been placed in a separate box to emphasize the fact that they are not just input but can be used during the retrieval process in such a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an operational system the story ends here. However, in an experimental system it leaves the evaluation to be done.

Artificial Neural Networks

An artificial neural network can be defined as a data processing system consisting of a large number of simple, highly interconnected processing elements called artificial neurons, in an architecture inspired by the structure of the cerebral cortex of the brain. These processing elements are organized into a sequence of layers with full or random connections between the layers. This arrangement is shown in the figure below.

[Figure: layered neural network with an input buffer, a middle (hidden) layer and an output layer, labelled as the ith, jth and kth layers; node labels xl, xh, xm and yl, yq, yr.]

The input layer is a buffer that presents data to the network. This input layer is not a neural computing layer because its nodes have no input weights and no activation functions. The top layer is the output layer, which presents the output response to a given input. The other layer (or layers) is called the intermediate or hidden layer because it usually has no connections to the outside world.

Each processing element in the neural network receives input signals represented as x0, x1, …, xn. These inputs are modified by synaptic weights, whose function is analogous to that of the synaptic junction in a biological neuron. The weights can be positive or negative, corresponding to the acceleration or inhibition of the flow of electrical signals. Each processing element consists of two parts: the first part aggregates the weighted inputs into a single quantity, and the second part is a nonlinear filter, called the activation function, through which the combined signal flows. Various activation functions are possible, and one is chosen depending on the application.
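The two-part processing element described above can be illustrated with a short Python sketch. It is only an illustration and not code from the project: the function names are hypothetical, and the logistic (sigmoid) function is assumed as the activation function simply because it is a common choice.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Logistic activation (an assumed choice): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(
    inputs: Sequence[float],
    weights: Sequence[float],
    activation: Callable[[float], float] = sigmoid,
) -> float:
    """Compute the output of a single processing element."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Part 1: aggregate the weighted inputs into a single quantity.
    net = sum(x * w for x, w in zip(inputs, weights))
    # Part 2: pass the combined signal through the nonlinear activation function.
    return activation(net)
```

With this activation, a positive weight on an active input pushes the output above 0.5 and a negative weight pushes it below 0.5, which mirrors the accelerating and inhibiting roles of the weights.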


4 Project Requirement Definition

4.1 User Characteristics

• The most common users of this product will be general Internet surfers. The Communities of Practice are dynamically identified with minimal user involvement. A user-friendly GUI is provided which allows online users to communicate and collaborate with each other effectively.

• The other class of users will be the administrators of the system, who will monitor and control the server through a console.

4.2 Assumptions and Dependencies

• The system assumes the existence of a reliable network connection and hence does not provide any functionality to cater to a faulty or unreliable network connection.

• The client is developed as an extension to the Internet Explorer browser. Although the system could be implemented for other browsers such as Firefox, they have not been catered to.

• The software does not provide any security services in addition to those provided by the communication protocols employed by the users' systems.


4.3 Requirement Specification

4.3.1 Functional Requirements

• The user shall be able to log in to the system with a screen name and a password.

• The client shall monitor the URLs visited and the links clicked by the user and report them to the server in real time.

• The server shall receive these details and log them along with a timestamp.

• The URLs thus logged shall be used to build a neural network model whose output is a list of similar interest users and similar context documents.

• The system should support collaborative browsing.

• The system should allow similar interest users to communicate with each other through text messaging.

• The users should be able to communicate with multiple users simultaneously.
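The reporting and logging requirements above can be pictured with a small in-memory sketch. It is written in Python purely for illustration: the project itself targets C#.Net, a web service and SQL Server, and every name below (`AccessRecord`, `AccessLog`, `report`, `accesses_by`) is a hypothetical stand-in rather than part of the Horde server.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AccessRecord:
    """One reported browsing event (hypothetical record layout)."""
    screen_name: str
    url: str
    link_clicked: bool  # True for a clicked hyperlink, False for a visited URL
    timestamp: datetime


@dataclass
class AccessLog:
    """In-memory stand-in for the server-side log table."""
    records: List[AccessRecord] = field(default_factory=list)

    def report(
        self,
        screen_name: str,
        url: str,
        link_clicked: bool = False,
        now: Optional[datetime] = None,
    ) -> AccessRecord:
        # The server, not the client, stamps each event on arrival.
        stamp = now if now is not None else datetime.now(timezone.utc)
        record = AccessRecord(screen_name, url, link_clicked, stamp)
        self.records.append(record)
        return record

    def accesses_by(self, screen_name: str) -> List[AccessRecord]:
        """Return all logged events for one user, in arrival order."""
        return [r for r in self.records if r.screen_name == screen_name]
```

Keeping one timestamped record per event, rather than a running counter per URL, preserves the information a recency-sensitive weighting would need later.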

4.3.2 Operational Requirements

• SQL Server should be up and running.

• A SQL database driver is required for the Horde server to be able to connect to the database.

• A reliable Internet connection should be available and up.


4.3.3 Non-Functional Requirements

• The system must operate in real time.

• Data collection from the Web must be dynamic and must be performed frequently enough to keep the information on users' browsing patterns up to date.

• The system should support a high level of concurrent usage, as it is intended to be used all over the world simultaneously.

• The system must provide a high degree of reliability and availability.

• The database maintained at the server must be backed up, as critical information is stored in it.

• The privacy of the information pertaining to users' browsing patterns shall be maintained, and the information shall not be made available to other users unless they have similar interests.

4.3.4 Security Requirements

This is a Web-based application and hence no additional security measures need to be taken. The security mechanisms and services provided by the underlying protocols are sufficient for the application.

4.3.5 User Interface Description

• The system provides a console with a set of commands to monitor and control the server.

• The system should provide a login screen that enables the user to enter a screen name and password to log in to the system.

• As the user visits URLs, viable candidates for collaboration are presented to the user as a simple list on the Internet Explorer sidebar.

• Along with the list of similar interest users, a list of similar context documents is also provided in the sidebar.

• After the user selects a similar interest user from the list provided, a text messaging window is opened, which enables quick information exchange.

• The user can thus communicate with a number of users simultaneously. A separate window pops up for each user name selected for communication.

• When the user selects any of the documents listed in the sidebar, the user is redirected to the URL of the document.

• Collaborative browsing is also provided through a similar interface.

• Online help should be provided to the user for ease of use.

4.3.6 Reporting Requirements

• All actions performed by the Horde Server must be logged.

• Descriptive failure and error messages must be provided to the administrators for easy configuration and maintenance.

• Invalid inputs from the user, such as an invalid username or password, must be reported to the user in a user-friendly and descriptive manner.


5 System Requirement Definition

The System Requirement Definition converts the project requirement specification above into a formal document, which forms the basis for software development. This chapter specifies the features of the software to be developed in formal terms.

5.1 Scope and Feasibility of the Software

The software to be developed is a practical solution to the problem of dynamically identifying online communities of practice. It shall provide an efficient and convenient mechanism for communication and collaboration between the members of such dynamic communities. The software shall be implemented using C#.Net, which provides the collection of APIs required for obtaining the required information and performing the necessary computations. C#.Net also provides for interoperability with the Component Object Model (COM) APIs, which can be used to provide the various functionalities of the software.

5.2 Software Requirements

• Operating System - Windows 2000 Professional / Windows XP
• Database - SQL Server 2005
• Framework - Microsoft .NET 2.0
• Internet Explorer 6.0
• XML Web Services


5.3 Hardware Requirements

• Pentium processor, 600 MHz minimum (1 GHz recommended)
• 192 MB of RAM minimum (256 MB recommended)


6 GANTT Chart

[Figure: Gantt chart of the project schedule. Legend: Requirement Analysis, System Design, Implementation, Integration and Testing.]


7 System Design

7.1 System Overview

The Horde Project employs a neural network based information retrieval model to find relationships between the millions of users who are online at any instant. The first sub-section discusses the design of this neural network.

7.1.1 Neural Network Design

In the Horde Project we make use of a neural network model whose learning process involves understanding the relationships that exist between users, documents, links and words, using data collected by crawling the Web and monitoring users’ document access patterns. From the study of these relationships we determine the vicinity between millions of Web users. The various layers and the weight assignment process are described below.

Types of Layers

The neural network model used in the Horde Project is composed of layers of four types, as shown in the figure below. The input and output layers of the neural network are user layers, while the link layers, document layers and word layers are its hidden layers. Each layer in the model is described in detail as follows.


User layer

This layer comprises nodes representing each uniquely identified user on the Web. Users are identified using unique usernames or screen names. Only users who are online at the current instant are mapped to nodes. The user layer nodes are connected to the nodes in the link and document layers.

Link layer

Each node in this layer represents a unique hyperlink in the crawled subset of the Web. A link is representative of both the content of the document it points to and the context of the document in the form of the hyperlink text.

Document layer

The nodes in the document layer represent individual documents or web pages that have been discovered while crawling. The document nodes, in essence, represent the contents of the corresponding documents.

Word layer

Nodes in this layer represent words identified from the documents crawled on the Web. Stop words are not mapped, nor are different forms of the same root. For example, only “dye” is mapped to a node, whereas “dyes”, “dyeing” and “dyed” are not. The innermost two hidden layers of the neural network are word layers.

Weight Assignment

In this section we discuss the weight assignment policies for the connections between the nodes of the different layers of the neural network model. The weights represent the strength of the relationship between the entities represented by the corresponding nodes.

Input/Output layers

As discussed earlier, the input and output layers of the Horde Project are user layers. The weights between the user layers and the corresponding link layers and document layers are derived dynamically from the user’s document access patterns. The data collected in this fashion is dynamic, in contrast to stereotype models where prior information about users is assumed to be available. As a part of the Horde Project client we develop a browser extension for Internet Explorer that monitors every link that the user clicks and every URL that he/she visits and reports it in real time to the centralized Horde server via a web service. On the server side every URL visited and hyperlink clicked by the user is logged along with a timestamp.
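The word-layer rule described earlier (stop words are dropped, and inflected forms share the node of their root) can be sketched as follows. This is a toy illustration in Python, not the project's implementation: the stop-word set and the root lookup table are tiny hypothetical stand-ins for whatever resources the real system uses, and the function name is invented for this example.

```python
import re
from typing import Dict, Set

# Hypothetical, deliberately tiny resources used only for this illustration.
STOP_WORDS: Set[str] = {"the", "a", "an", "is", "was", "and", "of", "to", "in"}
ROOT_OF: Dict[str, str] = {"dyes": "dye", "dyeing": "dye", "dyed": "dye"}


def word_layer_nodes(text: str) -> Set[str]:
    """Return the set of word nodes that a piece of document text maps to."""
    tokens = re.findall(r"[a-z]+", text.lower())
    nodes: Set[str] = set()
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop words are never mapped to nodes
        # Inflected forms collapse onto the node of their root word.
        nodes.add(ROOT_OF.get(token, token))
    return nodes
```

For example, `word_layer_nodes("The cloth was dyed and the dyes of dyeing")` yields the two nodes `cloth` and `dye`.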


Here we propose the semantics of a user as the collection of documents (or links) he has recently visited (or clicked), augmented with the following perceptions: (a) the more frequently uk accesses dj, the more significant dj is in characterizing the interest of uk (the document access frequency assumption); (b) the more users access document dj, the smaller the contribution of that document in characterizing any user uk who accesses it (the inverse user frequency assumption); (c) the more recent the timestamp of an access, the higher its influence in characterizing the user uk. Based on these insights we compute a function, document frequency-inverse user frequency, dfiuf(dj, uk), for document dj and user uk as follows,

where #U(dj) denotes the number of distinct users in the user set U who have visited dj at least once, and

and #(dj, uk) is given by

where n is the total number of times the user uk has accessed dj in the last Tmax seconds, M is one greater than the maximum weight contribution possible from a single access of a document, and ti is the difference between the current time and the timestamp of the ith access. The relationship between M, k and Tmax is


Here we have assumed that the significance of an access reduces exponentially with time. The final weight is obtained by cosine normalization, finally yielding
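As a rough illustration of the user-to-document weighting described above, the sketch below assumes one concrete form that is consistent with the stated definitions. The specific formulas are assumptions of this sketch and are not taken from the report: an access of age t contributes M - exp(k·t) with k = ln(M)/Tmax, the score is the tf-idf-style product of that decayed count and log(|U| / #U(dj)), and cosine normalization is applied over the documents of one user. The function names are hypothetical.

```python
import math
from typing import Dict, List


def access_count(ages: List[float], m: float, t_max: float) -> float:
    """Time-decayed access count #(d_j, u_k); requires m > 1.

    ASSUMPTION: an access of age t contributes m - exp(k * t), with
    k = ln(m) / t_max, so a fresh access contributes m - 1 and an
    access of age t_max contributes 0.
    """
    k = math.log(m) / t_max
    return sum(m - math.exp(k * t) for t in ages if 0 <= t <= t_max)


def dfiuf(count: float, n_users: int, n_users_visited: int) -> float:
    """ASSUMPTION: decayed count times log(|U| / #U(d_j))."""
    if count <= 0 or n_users_visited == 0:
        return 0.0
    return count * math.log(n_users / n_users_visited)


def cosine_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale one user's scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {d: (v / norm if norm else 0.0) for d, v in scores.items()}
```

Under these assumptions a document visited by every user scores zero for all of them, which matches the inverse user frequency assumption: such a document says nothing about what distinguishes one user from another.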

Hidden layers
The weights between the nodes in the hidden layers of the neural network are derived directly from data collected by crawling the Web. The Horde Project crawler starts from a small initial database of URLs. It fetches each document or page and extracts all the hyperlinks present in it. Each URL discovered from the hyperlinks is added to the database for subsequent crawling. The content of each fetched document then undergoes a preprocessing phase. Stop words are removed by checking against a static list. All remaining words are then reduced to their roots using Porter’s stemming algorithm. Subsequent processing determines the relationships between the hidden layer nodes, expressed as weights, as described next.

Link to Document
Each link node has a single edge connecting it to the document node corresponding to the target document of the link. The weight on this edge is a constant (γ) for all link and document node pairs. The constant γ represents the probability that the user’s decision to access the document was based on its actual content only. Therefore 1-γ represents the probability that the user’s decision was based on the link text (the document context). Here we assume that if a user is aware of the contents of a document, the link text has little effect on his decision to access it. Conversely, if he is unaware of the actual content, then the context is the only source of information on which he can base his decision. Therefore the user’s decision is based on either the document context or the content, but not both. Note that the connections from document nodes to link nodes carry the same weights.

Link to word
Apart from the document nodes, the link nodes also connect to the word nodes corresponding to the words that appear in the link text. The weights are calculated as for document-to-word connections (described later); in this calculation the words in the link text are treated as equivalent to the title text of a document. The weights thus calculated are further multiplied by a factor of 1-γ, the probability that the document was accessed based on the hyperlink text only. The weights from word nodes to link nodes are exactly the same as in the reverse direction.

Document to word
The document to word node weighting is based on the document occurrence representation (DOR) adopted in the information retrieval community. The document frequency – inverse term frequency, dfitf(dk, tj), for the document dk and the word tj is computed as per the following relation

where #D(dk) denotes the number of distinct terms in the dictionary D which occur at least once in dk, and

where #(dk, tj) is given as

The function f(tji) returns a value between 0 and 1 proportional to the emphasis on the ith occurrence of term tj. The final weight obtained by cosine normalization is as follows

Word to word
The weighting between word nodes is based on the term co-occurrence representation (TCOR). The term frequency – inverse term frequency has the form

where #D(tk) denotes the number of terms in the dictionary D which co-occur with tk in at least one document, and

where #(tk, tj) denotes the number of documents in which tk and tj co-occur. The weights thus obtained are also normalized by cosine normalization.
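The word-to-word weighting can be sketched in a few lines. The co-occurrence count #(tk, tj) and the neighbour count #D(tk) follow the definitions given above. The damping tf = 1 + log #(tk, tj) and the factor log(|D| / #D(tk)) are assumptions borrowed from the usual tf-idf form, not formulas taken from the report. The function names are hypothetical, and each document is assumed to be already reduced to its set of stemmed, non-stop-word terms.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cooccurrence_counts(docs: List[Set[str]]) -> Dict[Tuple[str, str], int]:
    """#(t_k, t_j): number of documents in which both terms occur."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)


def tfitf_weights(docs: List[Set[str]]) -> Dict[str, Dict[str, float]]:
    """Cosine-normalized word-to-word weights, one row per term t_k."""
    counts = cooccurrence_counts(docs)
    dictionary: Set[str] = set().union(*docs)
    # #D(t_k): number of terms that co-occur with t_k in some document.
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for tk, tj in counts:
        neighbours[tk].add(tj)
    weights: Dict[str, Dict[str, float]] = {}
    for tk, others in neighbours.items():
        itf = math.log(len(dictionary) / len(others))  # ASSUMED form
        row = {tj: (1.0 + math.log(counts[(tk, tj)])) * itf for tj in others}
        norm = math.sqrt(sum(v * v for v in row.values()))
        weights[tk] = {tj: (v / norm if norm else 0.0) for tj, v in row.items()}
    return weights
```

Because a term can co-occur with at most |D| - 1 other terms, the logarithmic factor stays positive, and terms that co-occur with almost everything receive uniformly small raw scores before normalization.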


2.1.2 High Level Architecture
The following diagram illustrates the high-level design of the Horde Project server.

The design is based on the Model-View-Controller pattern. The View layer exposes the functionality of the server through a web service and through a console for the administrator. The Controller layer consists of a single module, which controls the Neural Network model as well as the Collaboration model. The Neural Network model consists of three modules:

• Data Collection Engine – collects data from the Web by crawling it periodically, collects data about the user’s browsing patterns from the client, and logs both into the database.

• Learning Engine – scans through the collected data and assigns weights between the various layers of the neural network.

• Execution Engine – builds and executes the neural network to obtain results whenever a request is received from the client.
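The crawl step of the Data Collection Engine (parse a fetched page, extract its hyperlinks with their inner text, and queue newly discovered URLs) can be sketched with the Python standard library. This is only an illustration under stated assumptions: `LinkExtractor` and `crawl_step` are hypothetical names, the page is assumed to have been downloaded already, and stop-word removal and stemming are omitted.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, inner text) for every <a href=...> element."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            inner_text = " ".join("".join(self._text).split())
            self.links.append((self._href, inner_text))
            self._href = None


def crawl_step(url: str, html: str, known_urls: Set[str]) -> List[Tuple[str, str]]:
    """Parse one fetched page and add every discovered URL to known_urls."""
    parser = LinkExtractor(url)
    parser.feed(html)
    parser.close()
    for target, _ in parser.links:
        known_urls.add(target)
    return parser.links
```

Returning the inner text together with the target URL matters here because the link layer uses the hyperlink text as the context of the target document.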

The Collaboration Model consists of modules that let users communicate and collaborate with each other. It includes a Collaboration Framework Server, which acts as the interface between the client and the neural network model. The Horde client is an Internet Explorer extension in the form of a vertical sidebar. It lists users with similar interests and documents with similar context, and acts as the interface between the user and the server during query and communication.


3 Detailed System Design

3.1 Database Design
The Horde Project operates using the following tables:

LoginInfo
Maintains the user details and session identifiers.

Name | Description | Constraint | Data Type
UserName | Name of the user | Primary Key | Varchar(255)
Password | User supplied password | | Varchar(255)
Email | User mail address | | Varchar(255)
SecurityQuestion | User supplied security question | | Varchar(255)
SecurityAnswer | User supplied security answer | | Varchar(255)
City | User City | | Varchar(255)
Country | User Country | | Varchar(255)
SessionID | A randomly generated id to represent the user’s session with the server | | Varchar(255)


UserResourceVisitLog
Contains the list of links and URLs visited by the users.

Name | Description | Constraint | Data Type
Username | Username of the user | Primary Key | Varchar(255)
ResourceType | Link/Page | Primary Key | Varchar(255)
ResourceID | Link ID/URL | Primary Key | Varchar(255)
Timestamp | Time of visit | Primary Key | Datetime
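Since all four columns of UserResourceVisitLog are marked as primary key, the table effectively has a composite key: the same user may visit the same resource many times, but not twice at the same timestamp. The sketch below illustrates this using SQLite from the Python standard library. SQLite is an assumption made for the sake of a runnable example (the report does not name the database server), and `log_visit` is a hypothetical helper.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for the project's database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE UserResourceVisitLog (
        Username     VARCHAR(255) NOT NULL,
        ResourceType VARCHAR(255) NOT NULL,  -- 'Link' or 'Page'
        ResourceID   VARCHAR(255) NOT NULL,  -- link ID or URL
        Timestamp    DATETIME     NOT NULL,
        PRIMARY KEY (Username, ResourceType, ResourceID, Timestamp)
    )
    """
)


def log_visit(username: str, resource_type: str,
              resource_id: str, timestamp: str) -> bool:
    """Insert one visit; return False if an identical row is already logged."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO UserResourceVisitLog VALUES (?, ?, ?, ?)",
                (username, resource_type, resource_id, timestamp),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Including the timestamp in the key is what lets the log keep every repeat visit as a separate row, which the access-frequency weighting depends on.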

MessageList
Contains the list of messages exchanged between users; it is used by the collaboration framework.

Name | Description | Constraint | Data Type
From | Sender’s username | Primary Key | Varchar(255)
To | Receiver’s username | Primary Key | Varchar(255)
Type | Chat or navigation message type | | Varchar(6)
Timestamp | Time when message was sent | Primary Key | Datetime
Content | Content of the message | | Varchar(255)
Delivered | A marker indicating whether the message has been delivered to the receiver | | int

UrlList
Maintains a list of URLs that have been found.

Name | Description | Constraint | Data Type
Url | The URL of the page | Primary Key | Varchar(255)
Visited | An integer representing whether the page has been parsed or not | | int

LinkList
Contains the links traversed and their details.

Name | Description | Constraint | Data Type
Link | A randomly generated string to represent the link | Primary Key | Varchar(255)
Source | Source page of the link | | Varchar(255)
Page | Link URL | | Varchar(255)
Inner Text | Inner Text of the Link | | Varchar(255)


SignalValueDump
Maintains the query IDs and the corresponding output signal values after execution of the neural network.

Name | Description | Constraint | Data Type
QID | A randomly generated query ID | Primary Key | Varchar(255)
Entity | The output node for which signal is being calculated | Primary Key | Varchar(255)
EntityType | The type of layer | Primary Key | Varchar(2)
SignalValue | The output signal value | | float

A set of weight tables is used to list the neural network nodes and the weights associated with the edges connecting them.

UserToLinkWeightList
Maintains the weights between the user and the link layers.

Name | Description | Constraint | Data Type
User | Username | Primary Key | Varchar(255)
Link | Link ID | Primary Key | Varchar(255)
DFIUF | Document frequency inverse user frequency representing the strength of relationship between user and link | | Float
Weight | Normalized weight representing the strength of relationship between user and link | | Float
Updated | A temporary marker to indicate whether the weight for this user-link pair is calculated or not | | Int

UserToPageWeightList
Maintains the weights between the user and the document layers.

Name | Description | Constraint | Data Type
User | Username | Primary Key | Varchar(255)
Page | URL of the Page | Primary Key | Varchar(255)
DFIUF | Document frequency inverse user frequency representing the strength of relationship between user and page | | Float
Weight | Normalized weight representing the strength of relationship between user and page | | Float
Updated | A temporary marker to indicate whether the weight for this user-page pair is calculated or not | | Int

WordToLinkWeightList
Maintains the weights between the word and the link layers.

Name | Description | Constraint | Data Type
Word | Word in the link inner text | Primary Key | Varchar(255)
Link | Link ID | Primary Key | Varchar(255)
DF | Represents the number of times the word occurs in the link text | | Float
DFITF | Document Frequency inverse Term Frequency representing the strength of relationship between document and word | | Float
Weight | Normalized weight representing the strength of relationship between link and word | | Float
Updated | A temporary marker to indicate whether the weight for this word-link pair is calculated or not | | Int

WordToPageWeightList
Maintains the weights between the word and the document layers.

Name | Description | Constraint | Data Type
Word | Word in the web page | Primary Key | Varchar(255)
Page | URL of the Page | Primary Key | Varchar(255)
DF | Represents the number of times the word occurs in the page | | Float
DFITF | Document Frequency inverse Term Frequency representing the strength of relationship between document and word | | Float
Weight | Normalized weight representing the strength of relationship between document and word | | Float
Updated | A temporary marker to indicate whether the weight for this word-page pair is calculated or not | | Int


WordToWordWeightList
Maintains the weights between the word layers.

Name | Description | Constraint | Data Type
Word1 | First word among the co-occurring words | Primary Key | Varchar(255)
Word2 | Second word among the co-occurring words | Primary Key | Varchar(255)
TFITF | Term frequency inverse term frequency representing the strength of relationship between the two words | | Float
Weight | Normalized weight representing the strength of relationship between the two words | | Float
Updated | A temporary marker indicating whether the weight for this word-word pair is calculated or not | | Int


3.2 User Interface Details
The Horde Client forms the user interface of the Horde Project. It takes the form of a vertical sidebar in the Internet Explorer browser and presents the user with a list of similar interest users and similar context documents. When the user clicks on a name in the list, a window pops up through which the user can communicate with the selected user. When the user clicks on a document in the list, he is redirected to that document.

3.3 Classes

3.3.1 View Layer Classes

• Service – the web service class. It contains methods to interact with the controller for logging users in and out, registering new users and getting results from the server.

• Shell – provides methods that allow the administrator to interact with the controller and control the working of the server.

3.3.2 Controller Layer Classes

• Controller – provides methods to control the working of the server. It acts as the interface between the view layer classes and the core model layer classes.

3.3.3 Neural Network Classes

Data Collection Engine

• WebDataProcessorSupervisor – supervises the web data collection (crawling and parsing of web pages).

• PageDownloader – has methods to download and parse web pages.

Learning Engine

• LearningEngineSupervisor – the main supervisor, which controls all the weight updates.

• UserToResourceWeightUpdaterSupervisor – controls the UserToResourceWeightUpdater class, which performs the weight calculation between the user and the resources accessed by him.

• WordToResourceWeightUpdaterSupervisor – controls the WordToResourceWeightUpdater class, which performs the weight calculation between the link/page and the word layers.

• WordToWordWeightUpdaterSupervisor – controls the WordToWordWeightUpdater class, which performs the weight calculation between co-occurring words.

Execution Engine

• UserQuery – contains methods to perform the signal value calculation for each query obtained from the client.

3.3.4 Collaboration Framework Classes

• CollaborationServer – contains methods for user collaboration.


4 Implementation

4.1 Interaction between objects of various classes
The client classes interact with the server through the web service classes and send messages entered by the user to the server through the ClientMessageHandler and ServerMessageHandler classes. The web service in turn invokes the controller methods in order to perform its functions. The controller class invokes the appropriate methods in the CollaborationServer class or the neural network classes to perform the requested action. The interaction between the objects of these classes is illustrated in the following diagram:


5 Integration
System integration involves taking independently developed sub-systems and putting them together to make the complete system. The individually tested modules are added to the main system code, and the entire system is tested again to ensure that it functions correctly and conforms to the specified requirements.

Integration can be performed in two ways. The first is the ‘Big Bang’ approach, in which all the sub-modules are added at once and the entire system is then tested. The flaw in this approach is that debugging and code management become difficult. This approach is followed in the waterfall model of software development. The second is the incremental approach, in which small modules are built and added one by one, with testing performed after each module is added. This helps isolate errors and bugs to the most recently added module, and thus makes debugging and code management easier. This approach is employed in the Agile software development methodology.

We followed the incremental method for integration during the development of the project. We divided the project into two parts, the Horde Server and the Horde Client, and started with the development of the server.

Horde Server Development and Integration
We started with the development of the Web Data Processor module, which crawls the Web and downloads and parses web pages. Then we developed the Learning Engine module, which calculates and assigns weights to the edges connecting the nodes in the neural network.


Finally, the Execution Engine module was developed, which builds and executes the neural network to obtain the required results. Each of these modules was tested separately and then added to the main code. This helped in debugging and made code management easier for us.

Horde Client Development and Integration
The Horde client was also implemented incrementally. We started with the development of the Internet Explorer extension, which displays the list of similar interest users and document suggestions. Then we continued with the collaboration framework for the exchange of messages between users. Connecting the client to the collaboration framework was the next step. At each step the modules were executed and tested extensively before they were integrated with the main code.


6 Testing
Verification and validation is the generic name given to the checking processes that ensure that the software conforms to its specification and meets the needs of its users. The system should be verified and validated at each stage of the software development process; verification and validation is therefore a continuous process. Defect testing and debugging is the predominant verification and validation technique. Testing involves exercising the program with data close to the real data it will process. The existence of program defects or inadequacies is inferred from unexpected outputs. Testing may be carried out during the implementation phase, to verify that the software behaves as intended by its designer, and again after the implementation is complete. The later testing phase checks conformance with requirements and assesses the reliability of the system.

6.1 Unit and Functional Testing
We employed an incremental approach for testing the software. Each module was tested for defects and for conformance to its specification before being integrated into the main code. Each time a change was made to a module, it was tested for bugs before being re-integrated.

6.2 Integration Testing
As the individual modules were extensively tested before integration, integration testing consisted only of ensuring that the software still worked correctly after the addition of each new module. If errors appeared after a new module was integrated, they could be isolated to that module, which was then modified to ensure proper functionality.


7 Conclusion
The main aim of the Horde Project was to make the Web a more enjoyable place and to make browsing a richer and more productive experience. This goal has been reached to a very large extent. The software allows users to communicate synchronously with other users who are likely to be able to answer their questions immediately, and hence makes them more productive at work.


8 Future Enhancements
Many enhancements can be made to the Horde Project in order to make a user’s browsing experience more effective and productive. Some of the enhancements that we have thought of are listed below:

• Voice communication – Allow similar interest users not only to communicate through short messages but also to talk to each other over the Internet.

• File transfer – Allow users to transfer files to each other through the collaboration framework. This would result in more efficient knowledge sharing.

• Partially hidden browsing – Allow users to configure the client so that only some of the links and URLs they access are exposed to the Horde Server. This would give the online user more privacy.

• Web page tagging – Allow users to tag web pages, in part or in whole, as being of interest for certain topics. This would further enhance the document suggestion feature.


References

Books

• Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Chapters 1, 5 and 7.

• Rojas, R. Neural Networks: A Systematic Introduction. Chapters 1, 4 and 7.

Papers

Communities of Practice

• Paliouras, G., Papatheodorou, C., Karkaletsis, V. and Spyropoulos, C. D. 2002. Discovering User Communities on the Internet using Unsupervised Machine Learning Techniques. Interacting with Computers, Vol. 14(6), pp. 761-791, January 2002.

• Fu, Y., Sandhu, K. and Shih, M.-Y. 1999. Clustering of Web Users Based on Access Patterns. Proceedings of WEBKDD (Workshop on Web Usage Analysis and User Profiling).

• Lottermoser, B. G., Plaice, J., Kropf, P. and Slonim, J. 2003. Distributed Communities on the Web. Springer.

• Gourlay, S. 2003. Communities of Practice: A new concept for the millennium, or the rediscovery of the wheel?

• Bruckman, A. 1997. MOOSE Crossing: Construction, Community, and Learning in a Networked Virtual World for Kids. PhD Thesis, MIT.

• Pemberton-Billing, J., Cooper, R., Wootton, A. B. and Andrew, N.W. North. 2003. Distributed Design Teams as Communities of Practice. Proceedings of 5th European Academy of Design Conference.

• Alani, H., Dasmahapatra, S., O'Hara, K. and Shadbolt, N. 2003. Identifying Communities of Practice through Ontology Network Analysis. IEEE IS 18(2), pp. 18-25.

• Davies, J., Duke, A. and Sure, Y. 2003. OntoShare – An Ontology-based Knowledge Sharing System for Virtual Communities of Practice. Proceedings of I-KNOW.

• Zhang, Y. and Tanniru, M. 2005. An Agent-Based Approach to Study Virtual Learning Communities. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 1 - Volume 01 (January 03 - 06, 2005). IEEE Computer Society, Washington, DC, 11.3.

Collaborative Browsing

• Sidler, G., Scott, A. and Wolf, H. 1997. Collaborative browsing in the World Wide Web. Proceedings of the 8th Joint European Networking Conference, Edinburgh, May 12-15, 1997.

• Esenther, A. W. 2002. Instant Co-Browsing: Lightweight Real-Time Collaborative Web Browsing.

• Sun, M., Bakis, N. and Watson, I. 1999. Intelligent agent-based Collaborative Construction Information. International Journal of Construction Information Technology.

• Cohen, A. L., Maglio, P. P. and Barrett, R. 1998. The Expertise Browser: How to Leverage Distributed Organizational Knowledge. Workshop on Collaborative Information Seeking at CSCW.

• Lieberman, H. 1995. Letizia: An Agent That Assists Web Browsing. IJCAI (1).

Information Retrieval, Machine Learning and Neural Networks

• Shahabi, C., Zarkesh, A. M., Adibi, J. and Shah, V. 1997. Knowledge Discovery from Users Web-Page Navigation. RIDE 1997.

• Chen, H. 1995. Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms. Journal of the American Society for Information Science, Volume 46, Issue 3, pp. 194-216.

• Lave, J. and Wenger, E. 1991. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press.

• Chan, P. K. 2000. Constructing Web User Profiles: A non-invasive Learning Approach. In Revised Papers from the International Workshop on Web Usage Analysis and User Profiling, B. M. Masand and M. Spiliopoulou, Eds. Lecture Notes in Computer Science, vol. 1836. Springer-Verlag, London, pp. 39-55.

• Froitzheim, K. and Schubert, P. 2004. Presence in Communication Spaces. 19th International CODATA Conference.

• Mandl, T. 2000. Tolerant and Adaptive Information Retrieval with Neural Networks.

• Porter, M. F. 2002. Developing the English Stemmer. http://snowball.tartarus.org/.

• Yamout, F., Demachkieh, R., Hamdan, G. and Sabra, R. 2004. Further Enhancement to the Porter’s Stemming Algorithm. Workshop on Text-based Information Retrieval TIR-04.