RESEARCH STATEMENT
Ramesh Nallapati ([email protected])

The last decade has witnessed the emergence of enormous socially generated text content such as Wikipedia, blogs and social networks. While significant progress has been made by the Information Retrieval (IR) community in retrieving relevant pieces of information from these large datasets, there is still a vast amount of hidden knowledge about human behavior in these datasets that remains largely unexplored. Unlike ever before in the history of mankind, there is now a unique opportunity for Artificial Intelligence (AI) systems to take a leap forward and automatically learn from these datasets the answers to sociological questions such as the following: Are there communities within social media that tend to interact with each other more often than with others? Who are the most influential entities on specific topics? How does the social discourse on a topic evolve with time? What do certain user-defined tags in social data mean? How do ideas propagate in social networks? Answers to such questions have significant implications for sociologists, information analysts and online advertisers.

One of the significant challenges in answering these questions is building trustworthy models that can automatically mine patterns from these large and sometimes noisy textual corpora. In my post-doctoral research at CMU and Stanford, I have focused precisely on this challenge, building novel unsupervised probabilistic graphical models called Latent Topic Models. These models are ideally suited for exploratory analysis of large text collections since they are highly interpretable and require no human supervision. While previous work on Latent Topic Modeling focused only on modeling text, the novel models I proposed enable learning from the diverse sources of information present in these modern datasets, such as hyperlinks, tags, time-stamps and venues.
I have shown that modeling such diverse sources of information reveals many interesting and useful patterns in these corpora, besides improving the models' predictive performance.

My recent work on Topic Modeling builds on my past experience in Information Retrieval (IR) as a Ph.D. candidate at the University of Massachusetts Amherst. This work yielded significant results on its own. During this time, I proposed novel probabilistic Language Models for IR that improved the performance of existing models on several tasks such as query-based retrieval, document classification and topic tracking. I also introduced a Machine Learning based paradigm for query-based retrieval that is now a highly cited work in the sub-field of 'Learning to Rank for IR' and is part of the graduate course curriculum at Stanford University.

The rest of the statement is organized as follows. In Section 1, I elaborate on my Topic Modeling contributions that enable better analysis of modern text datasets. I describe my past work in applying Machine Learning techniques to Information Retrieval in Section 2. Section 3 presents my vision for future research.
1 Latent Topic Models for Analyzing Large Text Corpora
Latent Topic Modeling is an unsupervised technique proposed by David Blei et al. in 2003 that assumes the documents in a corpus are generated by a pre-specified probabilistic generative process, which lends the model high interpretability. The basic Topic Model, called Latent Dirichlet Allocation (LDA), automatically learns meaningful clusters of words called topics or themes based on the co-occurrence patterns of words in documents. It also automatically soft-indexes each document into these topics, allowing us to search for documents by topic. LDA has been widely adopted by researchers and practitioners as an exploratory and search tool for the analysis of large textual datasets. In combination with standard TF-IDF word-matching, it has been shown to produce state-of-the-art performance on several IR tasks such as query-based retrieval and text classification. The basic LDA model uses only the words in documents as input. In the work I describe below, I showed that jointly modeling rich contextual data such as hyperlinks, time-stamps, tags and locations helps answer a number of broad sociological questions that LDA cannot, besides improving the performance of these models.

Joint Topic Models for Text and Hyperlinks: A pair of documents connected by a hyperlink is more likely to discuss the same set of topics than a pair that is not. To capture these correlations, I introduced one of the earliest joint topic models for citations (hyperlinks) and text (ICWSM'08, KDD'08). I showed that by learning these topical correlations across hyperlinks, the new model is able to predict hyperlinks for an unseen document significantly better than LDA. As shown in Fig. 1(a), the model also reveals the rates at which communities of different topics cite each other, a pattern the basic LDA model cannot detect.
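As an illustration, LDA's generative process can be sketched in a few lines of Python. The corpus dimensions and the symmetric Dirichlet hyperparameters below are arbitrary toy choices, not values from any of the papers discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and symmetric Dirichlet hyperparameters (arbitrary values)
n_topics, vocab_size, n_docs, doc_len = 3, 20, 5, 30
alpha, beta = 0.5, 0.1

# Each topic is a multinomial distribution over the vocabulary
topics = rng.dirichlet([beta] * vocab_size, size=n_topics)

documents = []
for _ in range(n_docs):
    theta = rng.dirichlet([alpha] * n_topics)  # per-document topic mixture
    doc = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=theta)                 # choose a topic
        doc.append(rng.choice(vocab_size, p=topics[z]))   # choose a word from it
    documents.append(doc)

print(len(documents), len(documents[0]))
```

Inference in LDA runs this process in reverse: given only the observed words, it recovers the topic distributions and per-document mixtures that most plausibly generated them.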
http://www.cs.stanford.edu/~nmramesh
[Figure 1(b) panels: three topics labeled "Machine Translation", "Syntactic Parsing" and "Supervised Learning", each listing its most likely stemmed words, together with its most influential papers, including "The Mathematics of Statistical Machine Translation: Parameter Estimation" (Brown et al., 1993), with influence-flow scores along citations to papers on sentence alignment and machine translation.]
Figure 1: (a) A visualization of citation rates across topics on blogs data as learned by the Joint Topic Model for Text and Citations (KDD'08). Each box displays the most likely words in a specific topic, and the numbers on the directed arrows are proportional to the citation rates across the corresponding topic pairs. We see strong citation rates between the topics "Iraq War" and "London Bombings" due to their topical relatedness, but not between "CIA Leak" and "London Bombings". (b) Left: a selection of three topics and their most influential documents identified by the TopicFlow model (AISTATS'11, NIPS Workshop'10) on the ACL Anthology corpus. Right: a visualization of the flows on the topic of "Machine Translation" through the citations of the topic's most influential paper (marked with a broken red border).
Hyperlinks can also be used as votes to determine the influence of a document, as popularized by the PageRank algorithm. PageRank captures the general influence of a document, but oftentimes we need to model its influence in the context of a given topic. For example, a document that is highly influential on politics may be completely irrelevant to sports. To capture this notion of topic-specific influence, we need to count only topically relevant hyperlinks, but this information is often not available. Recently, I proposed the TopicFlow model (NIPS Workshop'10, AISTATS'11), which combines Topic Modeling with ideas from Network Flow to automatically learn the topical relevance of citations and thereby model the topic-specific global influence of documents. To the best of my knowledge, this is the first model of its kind that is completely unsupervised. Our experiments show that the TopicFlow model outperforms several state-of-the-art baselines in predicting the topical influence of documents. In addition, the model provides a powerful visualization of the spread of a topic's influence across the citation network, as shown in Fig. 1(b).1

Associating Topics with Tags: A fundamental problem in interpreting the topics discovered by LDA is that they are not labeled with names. Since modern social data is often labeled by users with tags, I proposed, along with my colleagues at Stanford, a new Topic Model called Labeled LDA (EMNLP'09) that directly maps topics to the tags of documents, thus making the topics as well as the tags much more interpretable. Using the machinery of LDA, the model is also able to discover which parts of a document are most associated with each of its tags. We showed that it improves upon state-of-the-art SVMs in identifying topically relevant subtexts in a document. An interesting visualization of this effect is shown in Fig. 2(a).
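The TopicFlow model itself couples topic inference with network flow; as a much simpler hedged illustration of the underlying intuition of counting only topically relevant links, the sketch below weights a toy citation graph's edges by a hypothetical, pre-computed per-document topic relevance before running a PageRank-style power iteration. The graph, relevance scores and damping value are all invented for illustration:

```python
import numpy as np

# Hypothetical 4-document citation graph: A[i, j] = 1 if document i cites document j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# Invented per-document relevance to a single topic; in a real system this
# would come from a topic model rather than being hand-set
topic_rel = np.array([0.9, 0.1, 0.8, 0.5])

# Weight each citation by the topical relevance of its target, so that only
# topically relevant links transfer influence
W = A * topic_rel[None, :]
P = W / W.sum(axis=1, keepdims=True)  # row-stochastic; every row cites something here

n = len(A)
d = 0.85                      # damping factor, as in standard PageRank
r = np.full(n, 1.0 / n)
for _ in range(100):          # power iteration to the stationary influence vector
    r = (1 - d) / n + d * P.T @ r

print(np.round(r, 3))
```

Documents that receive many topically relevant citations accumulate a larger share of the influence mass, while citations from off-topic documents contribute little.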
Modeling Evolution of Topics with Time: Another dimension that is very useful in modeling topics in a large corpus is time. Documents on the same topic that are temporally close together tend to have more similar word usage than those that are farther apart. My work on Topic Tomography (KDD'07), which was a best paper finalist, exploits the time-stamp information of documents and models the evolution of topical discourse at various scales of time resolution. This model is very useful to information analysts studying the dynamics of a topic in terms of its evolving word usage as a function of time. A typical output of the model is shown in Fig. 2(b).

1 A demo of this visualization can also be found at the home page of my Stanford colleague Jason Chuang, who is a visualization expert: http://hci.stanford.edu/~jcchuang/topic-flow/.

Figure 2: (a) The output of Labeled LDA (EMNLP'09) on a passage from the book "The Elements of Style" tagged with four different tags on delicious.com: Style, Grammar, Reference and Education. The model learns the correlations between words and tags from a large corpus of tagged documents and automatically detects the words in a document relevant to each of its tags. (b) The output of the Multi-scale Topic Tomography model on articles from Science ranging over 120 years, on the topic of "Particle Physics". The model captures the variation of topic discourse with time at various resolutions of time-scale. Each level in the tree represents a resolution level, with the root representing the average view of the topic over the 120-year period and the leaves representing the finest scale. Notice that words such as "heat" and "gas" are more popular in the 1890s, while words such as "electron" and "laser" gain traction in the 1990s.

Modeling Topic-Specific Lead/Lag of Online Communities: With the growing popularity of social media, the flow of information from one community to another has become viral. Understanding how information spreads across communities is of interest to social scientists, information analysts and online advertisers. An easy solution to this problem is to track propagation using hyperlink information, but oftentimes this information is missing. In such cases, a promising alternative is to trace the usage of words within topics across different corpora. As a case in point, I applied Topic Models to investigate whether blogs lead or lag behind news outlets in disseminating information on specific topics (ICWSM'11, NIPS Computational Social Sciences Workshop'10). The output of the model is displayed in Fig. 3.
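The lead/lag machinery in the papers above is not reproduced here; as a hedged illustration of the underlying idea, one can estimate which of two outlets leads on a topic by maximizing the cross-correlation of their topic-attention time series. The two synthetic series below are made up, with the news series lagging the blogs series by exactly 15 time steps:

```python
import numpy as np

t = np.arange(200)
blogs = np.exp(-((t - 80) ** 2) / 200.0)   # blogs' attention peaks at t = 80
news = np.exp(-((t - 95) ** 2) / 200.0)    # news' attention peaks 15 steps later

def best_lag(a, b, max_lag=50):
    """Return the shift (positive: b trails a) maximizing Pearson correlation."""
    def corr(lag):
        if lag >= 0:
            x, y = a[: len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[: len(b) + lag]
        return np.corrcoef(x, y)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr)

print(best_lag(blogs, news))  # 15: news lags blogs by 15 time steps
```

In the actual model, such per-topic series come from topic proportions inferred over time-stamped documents from each outlet, rather than from hand-built curves.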
Figure 3: Number of seconds by which news lags behind blogs, resolved by topic. Topics are described by their top 10 most likely terms. News leads blogs on topics such as Sports (bottom-most bar) and Politics (second bar from the bottom), but lags behind blogs on topics such as Adult content (first and fourth bars from the top) and Business (third bar from the top).

2 Machine Learning for Information Retrieval
During my doctoral research, my main contributions were in IR, a field that also deals with mining information from a large corpus, but in the context of a user's information need, typically expressed as a keyword query. At the time, IR was mostly dominated by heuristic approaches. My main goal was to exploit the machinery of Machine Learning to make IR models more principled and powerful.

Learning to Rank for IR: One inadequacy I saw in heuristic IR models is the difficulty of modeling the relative importance of arbitrary features of documents. This becomes particularly important in the web search context, where the relevance of a document to a user's query is influenced by several query-independent features such as its hyperlink connectivity, anchor text, author's authenticity, etc. To address this issue, I proposed applying supervised Machine Learning techniques such as Support Vector Machines to automatically learn the weights of these features from past query and relevant-document pairs (SIGIR'04), and to apply the learned weights to new queries. Experiments on the query-based retrieval task on a web collection showed that SVMs achieve a significant boost in performance compared to baseline language models. This work is now regarded as highly influential in applying supervised Machine Learning to IR, as is evident from its 124 citations2. The paper is also part of the graduate course material for the Information Retrieval and Web Search course at Stanford University3.

Probabilistic Language Models for IR: Language modeling is a modern probabilistic approach to IR that measures the relevance of a document to a query in terms of the likelihood that the document 'generates' the query. The most effective language model for IR is the unigram model, which makes the mathematically simplifying but unrealistic assumption that the words in a document occur conditionally independently. I relaxed this assumption using an adaptive dependency-tree technique that captures the most significant dependencies among the words in a sentence, and applied it to the task of tracking news stories on a given topic. The new model showed improvements over the unigram approach (CIKM'02, SIGIR Workshop'03). Another key limitation of the unigram language model is that it fails to predict the heavy-tailed distribution of term occurrences in documents. In my thesis work (Thesis'06), I proposed a novel distribution for text called the Smoothed Dirichlet (SD) distribution, which I showed to be a significantly better predictor of term occurrence patterns, and therefore a more consistent performer across a range of IR tasks. One of the main theoretical contributions of this work is to show that the SD distribution explains the successful but poorly understood KL-divergence ranking function used in the Language Modeling approach.
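To make the unigram query-likelihood idea concrete, here is a hedged toy sketch using Dirichlet smoothing, a standard smoothing scheme in this family; the two-document corpus, the query and the smoothing parameter are invented for illustration:

```python
import math
from collections import Counter

# Toy corpus and query (illustrative only)
docs = {
    "d1": "the machine translation model aligns words across languages".split(),
    "d2": "the parser builds a tree over the words of the sentence".split(),
}
query = "machine translation model".split()

# Collection language model used for smoothing unseen terms
coll = Counter(w for d in docs.values() for w in d)
coll_len = sum(coll.values())
mu = 10.0  # Dirichlet smoothing parameter (illustrative value)

def log_likelihood(doc, q):
    """Log probability that the document's unigram model generates the query."""
    tf, n = Counter(doc), len(doc)
    return sum(
        math.log((tf[w] + mu * coll[w] / coll_len) / (n + mu))
        for w in q
    )

ranking = sorted(docs, key=lambda d: log_likelihood(docs[d], query), reverse=True)
print(ranking)  # ['d1', 'd2']: d1 contains the query terms, so it ranks first
```

Each query term contributes its smoothed document-model probability independently; this conditional-independence assumption is exactly the one the dependency-tree and Smoothed Dirichlet work above set out to relax.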
3 Future Work
My short-term plan is to facilitate better analysis of social media through Latent Topic Modeling. My longer-term vision is to continue making contributions towards more expressive and powerful Machine Learning models that will have a wide impact on a range of data mining applications. Some of my plans are as follows.

Short Term:

Cluster Discovery in Networked Textual Data: Most online textual data is connected through hyperlinks, and most analyses suggest a 'small-world' phenomenon of connectedness of entities in such networks. Despite this strong connectedness, an interesting fact is that there are latent clusters in this networked world that are distinctly different from one another. For example, people tend to cluster with others by similarity of interests, professions, religions, etc. Unearthing these latent clusters will take us closer to understanding the information contained in such huge networked textual corpora. Building on my experience in modeling networked textual data, I plan to develop Topic Models to uncover these clusters.

Modeling Diffusion of Ideas through Text: There is immense potential in mining social media text to study how ideas spread from one community to another. Previous work on the flow of information focused mostly on exploiting hyperlink information. Research on exploiting text usage for this purpose is only beginning to emerge. My recent work in this area (ICWSM'11, AISTATS'11) places me at the forefront of this novel research direction, and I plan to expand on it in the near future.

Long Term:

Towards Better and Faster Topic Models: There is still room for improvement in making the Topic Model's output easily interpretable to the user. For example, topics are represented as multinomial distributions over singleton words, which sometimes makes it hard to interpret what a topic is about.
Additional work needs to be done to improve the representation of topics using Natural Language Processing techniques. In the near future, I plan to enhance topics by incorporating syntactic information such as parts of speech, noun-phrase chunks, parse trees and named entities into the models. My expertise in Topic Modeling, combined with my experience in Natural Language Processing (NIPS Transfer Learning Workshop'10, LREC Workshop'10, ACL'08), places me in a good position to pursue these objectives.

Another way to control the output of Topic Models is through regularization. Regularization involves adding extra penalty terms to the objective function of the Topic Model, which allows us to control the behavior of the model. I am currently working on ways to regularize Topic Models to make the topics as distinct from one another as possible, and as succinct in terms of words as possible.

During my research as a post-doctoral fellow at CMU, I encountered difficulties in running the standard implementation of LDA on large corpora due to its high computational costs. As a solution, I built one of the first parallel and distributed implementations of the variational inference algorithm for LDA (ICDM Workshop'07) and achieved significant speed-ups. This experience has aroused in me a keen interest in developing MapReduce-style scalable solutions for advanced Topic Models and other Machine Learning models in general.

I believe these exciting times of social media explosion present the fields of Data Mining and Information Retrieval with a golden opportunity to make a significant difference to the user. I look forward to a research career that allows me to contribute to these critical areas of research in a positive way.

2 According to Google Scholar's estimate as of 11/09/2010.
3 See the entry "Learning to Rank" in http://www.stanford.edu/class/cs276/cs276-2009-syllabus.html
References

[1] Ramesh Nallapati and Christopher Manning. TopicFlow model: Unsupervised learning of topic-specific influences of hyperlinked documents. In AISTATS, 2011.
[2] Ramesh Nallapati, Xiaolin Shi, Jure Leskovec, Dan McFarland, and Dan Jurafsky. LeadLag LDA: Estimating topic-specific leads and lags of information outlets. In ICWSM, 2011.
[3] Ramesh Nallapati and Christopher Manning. TopicFlow model: Unsupervised learning of topic-specific influences of hyperlinked documents. In NIPS Workshop on Machine Learning for Social Computing, 2010.
[4] Ramesh Nallapati, Mihai Surdeanu, and Christopher Manning. Blind domain transfer for Named Entity Recognition using generative latent topic models. In NIPS Workshop on Transfer Learning using Rich Generative Models, 2010.
[5] Xiaolin Shi, Ramesh Nallapati, Jure Leskovec, Dan McFarland, and Dan Jurafsky. Who leads whom: Topical lead-lag analysis across corpora. In NIPS Workshop on Computational Social Science and Wisdom of Crowds, 2010.
[6] Mihai Surdeanu, Ramesh Nallapati, and Christopher Manning. Legal claim identification: Information extraction with hierarchically labeled data. In Workshop on Semantic Processing of Legal Texts, LREC, 2010.
[7] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009.
[8] Ramesh Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In KDD, 2008.
[9] Ramesh Nallapati and William Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In ICWSM, 2008.
[10] Ramesh Nallapati and Christopher Manning. Legal docket classification: Where machine learning stumbles. In EMNLP, 2008.
[11] Andrew Arnold, Ramesh Nallapati, and William Cohen. Exploiting feature hierarchy for transfer learning in Named Entity Recognition. In ACL, 2008.
[12] Ramesh Nallapati, William Cohen, Susan Ditmore, John Lafferty, and Kin Ung. Multi-scale Topic Tomography. In KDD, 2007.
[13] Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized variational EM for Latent Dirichlet Allocation: An experimental evaluation of speed and scalability. In ICDM Workshop on High Performance Data Mining, 2007.
[14] Ramesh Nallapati. Smoothed Dirichlet distribution: Understanding cross-entropy ranking in Information Retrieval. Ph.D. thesis, University of Massachusetts, Amherst, 2006.
[15] Ramesh Nallapati. Discriminative models for Information Retrieval. In SIGIR, 2004.
[16] Ramesh Nallapati and James Allan. An adaptive local dependency language model: Relaxing the Naive Bayes assumption. In SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2003.
[17] Ramesh Nallapati and James Allan. Capturing term dependencies using a sentence tree based language model. In CIKM, 2002.