Intelligent Sensemaking of Scientific Literature Through Citation Scent
Bi Chen*, Baojun Qiu†, Yan Qu††, John Yen*, Xiaolong Zhang*
*College of Information Sciences and Technology, †Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
††College of Information Studies, University of Maryland, College Park, MD 20742, USA
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
extent these papers are important to the field, and what other relevant topics might be. Search engines cannot answer these questions.
ABSTRACT
We propose a design based on natural language processing to discover “scent” information for citation analysis.
• Scenario 2: An experienced scientist writing a review To write a thorough review, the scientist needs to locate all relevant papers and, more importantly, synthesize them in a structured way. She must read papers, extract related information, and finally organize and integrate that information. In this process, the scientist has to find information such as the motivations of citations and positive or negative comments on other papers, so that she can know what kinds of research work have been done, where unaddressed issues may lie, how research directions and topics are related to each other, and what papers have made key contributions to the field.
Author Keywords
Sensemaking, Information Scent, Citation Scent

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: User interfaces

INTRODUCTION
Understanding existing research papers lays a foundation for scientists to develop their own research agendas, communicate with others, and keep up with the latest research. Scientists need tools to quickly locate and analyze relevant papers. Such tools may benefit the whole research community and improve research productivity. Many systems can support scientific literature search and preliminary organization of search results. For example, scientific literature search engines like CiteSeer [4] and Google Scholar [5] can help search for research papers and provide scientists with important information such as citations and citation context for further analysis.
Both of these scenarios involve advanced sensemaking tasks that current search engines cannot easily support. Although search engine results provide useful meta-data about papers (e.g., categories, keywords, citations), to achieve the goals in the above scenarios people must first extract information that is embedded in individual papers and then integrate these isolated information pieces together. To help scientists with such advanced sensemaking tasks by collecting useful “scents” in citations, we propose an approach based on natural language processing (NLP) to support deep citation analysis. Our approach goes beyond current NLP-based techniques for highlighting review sentences [4] or content summarization [21] [14], and focuses on collecting scent information distributed inside papers.
Often, researchers also need to gain insight into their research fields beyond what search engines can provide. Take the following two scenarios as examples: • Scenario 1: A graduate student developing her dissertation topic The student needs help to gain a basic understanding of her field. She starts from search engines to get some papers. However, after reading these papers, she still has no idea whether their topics are still current, to what
In this paper, we will first review relevant work and then describe our design.

RELATED WORK
Sensemaking is the process of making sense of ambiguous or unfamiliar situations. The cost structure of sensemaking comes from iteratively creating sophisticated representations and fitting information into these representations [16]. Citation networks in scientific literature are an important tool for literature sensemaking. Submitted for review to CHI 2009.
Currently, we target three different types of citations: core-citation, comment-citation, and margin-citation.
Making sense of citations involves identifying the reasons for citations, tracing citations, analyzing co-authorship, summarizing a paper based on comments from other papers, and finding inadequacies in existing work. Great efforts have been made to support these tasks. Citation and co-author networks [8] help researchers identify major research topics, co-authorships, and research trends. Cat-a-Cone [7] provides researchers with a novel user interface that integrates searching and browsing of very large category hierarchies with their associated text collections. BiblioViz [18] helps researchers create custom visualizations of bibliography data. Butterfly [11] helps researchers integrate searching, browsing, and accessing tasks. KonneXSALT [6] and ClaiMaker [19] collect claims from research papers and organize them semantically. CiteSense [23] offers an integrated environment for paper search, paper analysis, paper collection, and paper structure building. Although these systems are of great help to researchers, they remain manual in nature, requiring researchers to spend considerable effort and time to extract useful citation scents from papers.
• core-citation: a citation to a paper that builds a basis for the current paper. Example in [23]: “Our research developed some core tasks related to information seeking and sensemaking in research literature review by drawing on existing research on seeking, organizing, and making sense of information [9, 4, 2, 10, 16, 19].”
• comment-citation: a citation including sentiment comments on another paper. Example in [20]: “We therefore consider a far wider range of markables than Kim and Webber (2006), who only consider the pronoun ‘they’.”
• margin-citation: a citation that is neither a core-citation nor a comment-citation, simply presenting a normal statement about another paper. Example in [23]: “Some efforts have been made to help such tasks, including Citation and Co-Author Network [15], Cat-a-Cone [12], BiblioViz [21], Butterfly [16], and so on.”
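As a minimal illustration of these three types, the sketch below represents citations with their context sentence and type (the class and field names are our own hypothetical choices, not part of any existing system; the example sentences are the ones quoted above):

```python
from dataclasses import dataclass
from enum import Enum

class CitationType(Enum):
    CORE = "core-citation"        # cited work builds a basis for the citing paper
    COMMENT = "comment-citation"  # citation carries a sentiment comment
    MARGIN = "margin-citation"    # neutral mention, neither core nor comment

@dataclass
class Citation:
    sentence: str      # citation context sentence extracted from the paper
    cited_paper: str   # the reference this sentence points to
    ctype: CitationType

# Examples drawn from the paper's own illustrations
examples = [
    Citation("We therefore consider a far wider range of markables than "
             "Kim and Webber (2006), who only consider the pronoun 'they'.",
             "Kim and Webber (2006)", CitationType.COMMENT),
    Citation("Some efforts have been made to help such tasks, including "
             "Cat-a-Cone and Butterfly.", "Cat-a-Cone", CitationType.MARGIN),
]

comment_citations = [c for c in examples if c.ctype is CitationType.COMMENT]
print(len(comment_citations))  # -> 1
```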
To reduce researchers' manual labor, various machine learning approaches have been proposed for citation analysis. Teufel et al. [21] and Qazvinian et al. [14] use citation identification and classification to improve single-document summarization. Teufel, Siddharthan et al. [22] analyzed the reasons for citations to improve text summarization and build more informative citation indexers. Nanba et al. [12] used citations as features to classify scientific papers into topics. Piao et al. [13] proposed a system to create a network of opinion polarity relations between citations using existing semantic lexical resources and NLP tools. Siddharthan et al. [20] used machine learning algorithms to perform a new reference task: deciding scientific attribution. These projects have shown the potential of applying language features in machine learning to improve citation analysis. However, the algorithms in these projects are largely based on selected features and classification, and ignore the relationships among terms in natural language, which may provide more information for sensemaking of citations.
Distinguishing these three types of citations requires NLP-based tools, because the meta-data of a paper usually does not offer the information needed to distill citation scent. In [22] [12] [14], only language features are used to classify citations, and the dependency relationships between terms are ignored. A sentence in natural language expresses its meaning not only through the meanings of individual terms but also through the order in which those terms appear. In addition to natural language features, our algorithm will therefore also consider the sequence order of terms by applying the CRF algorithm [9], a framework for building probabilistic models to segment and label sequence data. Because of limited space, we will not introduce the CRF algorithm here; rather, we explain our algorithm with an example. Consider this sentence found in a paper: We therefore consider a far wider range of markables than (Kim and Webber (2006)), who only consider the pronoun “they”.
CITATION SCENT: A TOOL FOR DEEP CITATION SENSEMAKING
What we propose here is a system to support citation sensemaking by considering the dependency relationships between terms in citations, using a sequence data analysis model such as the Conditional Random Field (CRF) model [9]. We call the system Citation Scent, because it is aimed at discovering language “scents” for citation analysis from papers.
As shown, the above citation is a comment-citation that gives a negative comment on the paper mentioned. If we directly use the terms in the sentence as features to train the model, as was done in [22], we face a sparsity problem, because many perfectly valid word sequences do not appear in the training corpus.
Our design focuses on using NLP techniques to collect relevant information distributed inside papers and then, based on such scent information, to provide users with knowledge structures that are important to sensemaking [16]. One key issue we are trying to address is identifying the importance and relevance of citations to a paper by extracting and analyzing review comments. Knowing to what extent individual citations are important and relevant to a paper helps scientists better understand the evolution and progress of a research topic.
We group terms in the sentence into several types to reduce the problem of sparsity. The first type is cue phrases, as in [1]: phrases with a semi-fixed form and clear semantics but with syntactic and lexical variations that are unlikely to appear in the training data. We will use the algorithm described in [1] to detect cue phrases. In this example, the cue phrase is “we consider ...”. The second type is function terms. In this example, “than” is a function word that makes this a comparison sentence. Such function words
• Author Social Network Based on co-authorship relationships, we will organize authors into a large social network for easy navigation.
can be found using a grammar parser, such as the Link Grammar Parser [10]. The third type is sentiment words. In this case, “wider” has a light positive meaning while “only” has a light negative meaning. We can build a sentiment vocabulary based on the algorithm in [3]. The fourth type is facts pointing to the cited paper. Here, the cited paper is “Kim and Webber (2006)”.
CONCLUSION
In this paper, we propose a design that uses citation scent to support sensemaking of research papers. Our approach goes beyond the meta-data of papers: it uses NLP-based techniques to collect relevant citation comments from papers and then builds knowledge structures for sensemaking.
The example sentence is thus simplified into the form: cue phrase + light positive + than + cited paper + light negative. This sentence will be marked as negative to train the CRF model. Thus, not only is the problem of sparsity avoided, but the sequence is also preserved. If we did not consider the sequence, it would be hard for the model to classify the sentence as positive or negative, because it contains one positive word and one negative word.
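The simplification step above can be sketched as follows. This is a minimal Python illustration: the mini-lexicons are hand-set assumptions for this one example, whereas a real system would induce cue phrases with the bootstrapping method in [1] and the sentiment vocabulary with the method in [3].

```python
import re

# Hypothetical mini-lexicons, hand-set for illustration only.
CUE_PHRASES = ["we therefore consider"]
SENTIMENT = {"wider": "light_positive", "only": "light_negative"}
FUNCTION_WORDS = {"than"}
CITED = re.compile(r"\(?[A-Z][a-z]+ and [A-Z][a-z]+ ?\(\d{4}\)\)?")

def simplify(sentence):
    """Map concrete terms onto term types, preserving their order --
    the order preservation is what lets the CRF exploit the sequence."""
    s = CITED.sub("CITEDPAPER", sentence)
    for cue in CUE_PHRASES:
        s = re.sub(cue, "CUEPHRASE", s, flags=re.IGNORECASE)
    seq = []
    for tok in re.findall(r"[A-Za-z]+", s):
        if tok == "CUEPHRASE":
            seq.append("cue_phrase")
        elif tok == "CITEDPAPER":
            seq.append("cited_paper")
        elif tok.lower() in SENTIMENT:
            seq.append(SENTIMENT[tok.lower()])
        elif tok.lower() in FUNCTION_WORDS:
            seq.append(tok.lower())
        # all other tokens are dropped in this simplified sketch
    return seq

sentence = ('We therefore consider a far wider range of markables than '
            '(Kim and Webber (2006)), who only consider the pronoun "they".')
print(simplify(sentence))
# -> ['cue_phrase', 'light_positive', 'than', 'cited_paper', 'light_negative']
```

The resulting type sequence, rather than the raw word sequence, is what would be labeled negative and fed to the CRF for training.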
REFERENCES
1. Abdalla, R. and Teufel, S. (2006). A Bootstrapping Approach to Unsupervised Detection of Cue Phrase Variants. In ACL'06.
2. Chi, E. H., Pirolli, P., Chen, K., and Pitkow, J. (2001). Using Information Scent to Model User Information Needs and Actions on the Web. In Proceedings of CHI 2001, Seattle, Washington, USA, pp. 490-497.
SYSTEM DESIGN
We are building a system prototype on top of our CiteSense system [23]. The previous CiteSense system was largely based on paper meta-data. We will integrate the NLP algorithms into the system and provide more comprehensive tools for research paper sensemaking.
3. Gamon, M. and Aue, A. (2005). Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms. In Proceedings of ACL 2005 Workshop on Feature Engineering for Machine Learning in Natural Language Processing, pp. 57-64.
The system will include six components: a crawler, a paper database, an NLP module, a knowledge network, a sentiment module, and an author social network. • Crawler This component downloads research articles from the Internet. Because most papers are in PDF format, we need to convert them into text before storing them in the database.
4. Giles, C., Bollacker, K., Lawrence, S. (1998). CiteSeer: an Automatic Citation Indexing System. In Proceedings of 3rd ACM Conference on Digital Libraries, pp. 89-98. 5. Google Scholar. http://scholar.google.com/.
• Paper Database This component stores papers. Each paper will be segmented into five parts: title, authors, abstract, main content, and references. Each reference will be further segmented into authors, title, conference/journal, year, and so on. We will use a Hidden Markov Model [15] to segment references into different parts.
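As an illustration of HMM-based reference segmentation, the sketch below runs Viterbi decoding over a toy model. The states, transition probabilities, and shape-based emission model are hand-set assumptions for illustration, not parameters learned as in [15].

```python
import math

STATES = ["AUTHOR", "YEAR", "TITLE", "VENUE"]
START = {"AUTHOR": 0.90, "YEAR": 0.05, "TITLE": 0.03, "VENUE": 0.02}
TRANS = {
    "AUTHOR": {"AUTHOR": 0.60, "YEAR": 0.30, "TITLE": 0.08, "VENUE": 0.02},
    "YEAR":   {"AUTHOR": 0.01, "YEAR": 0.04, "TITLE": 0.90, "VENUE": 0.05},
    "TITLE":  {"AUTHOR": 0.01, "YEAR": 0.04, "TITLE": 0.70, "VENUE": 0.25},
    "VENUE":  {"AUTHOR": 0.01, "YEAR": 0.09, "TITLE": 0.10, "VENUE": 0.80},
}

def emit(state, token):
    """Crude emission probabilities based only on token shape."""
    core = token.strip("().,")
    if core.isdigit() and len(core) == 4:            # "(1989)."
        return 0.90 if state == "YEAR" else 0.01
    if len(core) == 1 and core.isupper():            # author initial "L."
        return 0.70 if state == "AUTHOR" else 0.10
    if token.endswith(",") and token[0].isupper():   # "Rabiner,"
        return 0.55 if state == "AUTHOR" else 0.15
    if token == "In" or core.isupper():              # venue cues
        return 0.55 if state == "VENUE" else 0.15
    return 0.40 if state == "TITLE" else 0.20

def viterbi(tokens):
    """Standard Viterbi decoding in log space."""
    V = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            row[s] = V[-1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, tok))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

ref = "Rabiner, L. (1989). A Tutorial on Hidden Markov Models. In IEEE."
tokens = ref.split()
print(list(zip(tokens, viterbi(tokens))))
```

Even with crude shape-based emissions, the transition structure keeps the labeling coherent, e.g. a lone capital such as "A" mid-reference stays inside the TITLE segment.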
6. Groza, T., Handschuh, S., Möller, K. and Decker, S. (2008). Facts about KonneXSALT: First Steps Towards a Semantic Claim Federation Infrastructure. In ESWC'08.
7. Hearst, M. and Karadi, C. (1997). Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results using a Large Category Hierarchy. In Proceedings of the 20th Annual International ACM SIGIR'97.
• NLP Module For each reference, we need to locate the corresponding citations in the paper. We use a sentence model to extract citations, phrase detection [1] to find theory or method terms, and shallow parsing [17] to identify constituents (noun groups, verbs, verb groups, etc.). Using the constituents and their relationships in a sentence as features, we can train our CRF model [9] and then classify citations into the three pre-defined types.
8. Ke, W., Borner, K., and Viswanath, L. (2004). Major Information Visualization Authors, Papers and Topics in the ACM Library. In IEEE INFOVIS’04. 9. Lafferty, J., McCallum, A., Pereira, F.(2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of 18th International Conference on Machine Learning, Morgan Kaufmann, pp. 282-289.
• Knowledge Network Based on core-citation relationships, we can organize papers into a large knowledge network, where users can easily see how ideas flow between papers.
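The idea-flow view could be sketched as a directed graph over core-citation edges. This is a minimal Python illustration; the paper names and edges are hypothetical, and in the envisioned system the edges would come from the NLP module's citation classifier.

```python
from collections import defaultdict

# Hypothetical core-citation edges (citing -> cited).
core_citations = [
    ("paper_C", "paper_B"),
    ("paper_C", "paper_A"),
    ("paper_B", "paper_A"),
]

graph = defaultdict(list)
for citing, cited in core_citations:
    graph[citing].append(cited)

def idea_lineage(paper):
    """Follow core-citation edges transitively to trace which earlier
    papers a given paper's ideas build on."""
    seen, stack = set(), [paper]
    while stack:
        p = stack.pop()
        for cited in graph.get(p, []):
            if cited not in seen:
                seen.add(cited)
                stack.append(cited)
    return seen

print(sorted(idea_lineage("paper_C")))  # -> ['paper_A', 'paper_B']
```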
10. Lafferty, J., Sleator, D. and Temperley, D. (1992). Grammatical Trigrams: A Probabilistic Model of Link Grammar. In Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language.
• Sentiment Module Based on comment-citations, we can train a sentiment classifier to label comments as positive or negative, and then summarize papers based on these sentiments.
11. Mackinlay, J.D., Rao, R., and Card, S.K. (1995). An organic user interface for searching citation links. In ACM CHI'95, pp. 67-73.
12. Nanba, H., Kando, N. and Okumura, M. (2000). Classification of Research Papers using Citation Links and Citation Types: Towards Automatic Review Article Generation. In American Society for Information Science SIG Classification Research Workshop: Classification for User Support and Learning, pp. 117-134.
13. Piao, S., Ananiadou, S., Tsuruoka, Y., Sasaki, Y. and McNaught, J. (2006). Mining Opinion Polarity Relations of Citations. In International Workshop on Computational Semantics (IWCS-7).
14. Qazvinian, V. and Radev, D. (2008). Scientific paper summarization using citation summary networks. In COLING 2008, Manchester, UK.
15. Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of the IEEE, pp. 257-286.
16. Russell, D. M., Stefik, M. J., Pirolli, P., and Card, S. K. (1993). The Cost Structure of Sensemaking. In ACM INTERCHI'93, pp. 269-276.
17. Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL 2003, Edmonton, Canada, pp. 134-141.
18. Shen, Z., Ogawa, M., Teoh, S.T., and Ma, K. (2006). BiblioViz: a system for visualizing bibliography information. In Proc. of Asia-Pacific Symposium on Information Visualization.
19. Shum, S., Uren, V., Li, G., Sereno, B. and Mancini, C. (2006). Modeling naturalistic argumentation in research literatures: Representation and interaction design issues. In Special Issue: Computational Models of Natural Argumentation, International Journal of Intelligent Systems, pp. 17-47.
20. Siddharthan, A. and Teufel, S. (2007). Whose idea was this? Deciding attribution in scientific literature. In Proc. of the 6th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC'07).
21. Teufel, S. and Moens, M. (2002). Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409-445.
22. Teufel, S., Siddharthan, A. and Tidhar, D. (2006). Automatic classification of citation function. In Proceedings of EMNLP'06, pp. 103-110.
23. Zhang, X., Qu, Y., Giles, L. and Song, P. (2008). CiteSense: Supporting Sensemaking of Research Literature. In CHI'2008.