I am a PhD student working with Prof. Andrew McCallum at UMass.
I am broadly interested in deep learning, reinforcement learning, and meta-learning, with applications to text data, knowledge bases, and multi-agent systems.
Self-supervised pre-training of transformer models has revolutionized NLP applications. Such pre-training with language modeling objectives provides a useful initialization of parameters that generalize well to new tasks with fine-tuning. However, fine-tuning is still data-inefficient: when there are few labeled examples, accuracy can be low. Data efficiency can be improved by optimizing pre-training directly for future fine-tuning with few examples; this can be treated as a meta-learning problem. However, standard meta-learning techniques require many training tasks in order to generalize, and finding a diverse set of such supervised tasks is usually difficult. This paper proposes a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text. This is achieved using a cloze-style objective, but creating separate multi-class classification tasks by gathering the tokens to be blanked from only a handful of vocabulary terms. This yields as many unique meta-training tasks as there are subsets of vocabulary terms. We meta-train a transformer model on this distribution of tasks using a recent meta-learning framework. On 17 NLP tasks, we show that this meta-training leads to better few-shot generalization than language-model pre-training followed by fine-tuning. Furthermore, we show how the self-supervised tasks can be combined with supervised tasks for meta-learning, providing substantial accuracy gains over previous supervised meta-learning.
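To make the task-construction idea concrete, here is a minimal sketch of building one such cloze classification task from unlabeled sentences. The function name, parameters, and the simple whitespace tokenization are illustrative assumptions rather than the paper's actual pipeline; each sampled vocabulary word becomes a class, and sentences containing that word (with the word masked) become its labeled examples.

```python
import random
from collections import defaultdict

def make_cloze_task(sentences, vocab, num_classes=4, examples_per_class=8,
                    mask_token="[MASK]", seed=0):
    """Build one self-supervised classification task from unlabeled text.

    A small subset of vocabulary words is sampled; each word defines a class.
    Sentences containing a class word are turned into examples by masking that
    word, and the label is the index of the hidden word within the subset.
    """
    rng = random.Random(seed)
    class_words = rng.sample(vocab, num_classes)

    # Index masked sentences by which class word they contained.
    by_label = defaultdict(list)
    for sent in sentences:
        tokens = sent.split()
        for label, word in enumerate(class_words):
            if word in tokens:
                masked = " ".join(mask_token if t == word else t for t in tokens)
                by_label[label].append(masked)

    task = []
    for label in range(num_classes):
        pool = by_label[label]
        if len(pool) < examples_per_class:
            return None  # not enough text for this word subset; resample the subset
        for masked in rng.sample(pool, examples_per_class):
            task.append((masked, label))
    rng.shuffle(task)
    return class_words, task
```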
Self-supervised pre-training of transformer models has shown enormous success in improving performance on a number of downstream tasks. However, fine-tuning on a new task still requires large amounts of task-specific labelled data to achieve good performance. We consider this problem of learning to generalize to new tasks with few examples as a meta-learning problem. While meta-learning has shown tremendous progress in recent years, its application is still limited to simulated problems or problems with limited diversity across tasks. We develop a novel method, LEOPARD, which enables optimization-based meta-learning across tasks with different numbers of classes, and evaluate different methods on generalization to diverse NLP classification tasks. LEOPARD is trained with a state-of-the-art transformer architecture and shows better generalization to tasks not seen at all during training, with as few as 4 examples per label. Across 17 NLP tasks, including diverse domains of entity typing, natural language inference, sentiment analysis, and several other text classification tasks, we show that LEOPARD learns better initial parameters for few-shot learning than self-supervised pre-training or multi-task training, outperforming many strong baselines; for example, it yields a 14.5% average relative gain in accuracy on unseen tasks with only 4 examples per label.
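One way to enable optimization-based meta-learning across tasks with different numbers of classes is to generate the task-specific softmax parameters from the support set itself, so the output layer is built on the fly for each task. The sketch below illustrates that general idea; the class name, per-class averaging, and MLP sizes are my own assumptions and this is not LEOPARD's exact parameter-generator architecture.

```python
import torch
import torch.nn as nn

class PerTaskSoftmaxGenerator(nn.Module):
    """Generate a task's softmax parameters from its support set.

    Encoded support examples of each class are averaged and passed through a
    small MLP that emits that class's weight vector and bias, so the classifier
    can have as many output classes as the current task requires.
    """
    def __init__(self, hidden_dim):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim + 1),   # per-class weight vector + bias
        )

    def forward(self, support_reprs, support_labels, num_classes):
        # support_reprs: (num_support, hidden_dim); support_labels: (num_support,) ints
        params = []
        for c in range(num_classes):
            class_mean = support_reprs[support_labels == c].mean(dim=0)
            params.append(self.generator(class_mean))
        params = torch.stack(params)                 # (num_classes, hidden_dim + 1)
        return params[:, :-1], params[:, -1]         # weights, biases

    def classify(self, query_reprs, weights, biases):
        # query_reprs: (num_query, hidden_dim) -> (num_query, num_classes) logits
        return query_reprs @ weights.t() + biases
```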
Understanding the meaning of text often involves reasoning about entities and their relationships. This requires identifying textual mentions of entities, linking them to a canonical concept, and discerning their relationships. These tasks are nearly always viewed as separate components within a pipeline, each requiring a distinct model and training data. While relation extraction can often be trained with readily available weak or distant supervision, entity linkers typically require expensive mention-level supervision, which is not available in many domains. Instead, we propose a model that is trained to simultaneously produce entity linking and relation decisions while requiring no mention-level annotations. This approach avoids the cascading errors that arise from pipelined methods and more accurately predicts entity relationships from text. We show that our model outperforms a state-of-the-art entity linking and relation extraction pipeline on two biomedical datasets and can drastically improve the overall recall of the system.
State-of-the-art models for knowledge graph (KG) completion aim to learn a fixed embedding representation of entities in a multi-relational graph that can generalize to infer unseen entity relationships at test time. This can be sub-optimal, as it requires memorizing and generalizing to all possible entity relationships using these fixed representations. We thus propose a novel attention-based method to learn query-dependent representations of entities, which adaptively combine the relevant graph neighborhood of an entity, leading to more accurate KG completion. The proposed method is evaluated on two benchmark datasets for knowledge graph completion, and experimental results show that it performs competitively with, or better than, existing state-of-the-art methods, including recent methods for explicit multi-hop reasoning. Qualitative probing offers insight into how the model can reason about facts involving multiple hops in the knowledge graph through the use of neighborhood attention.
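A minimal sketch of the query-dependent neighborhood attention idea is below. The scoring function (a dot product against the concatenated entity and query-relation embeddings) and the concatenation-based aggregation are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def query_dependent_entity_repr(entity_emb, query_rel_emb, neighbors):
    """Attention over an entity's graph neighborhood, conditioned on the query relation.

    neighbors: list of (relation_emb, neighbor_entity_emb) pairs for edges
    incident to the entity. Each neighbor is scored against the concatenated
    (entity, query relation) vector, and the neighborhood is aggregated with
    the resulting attention weights.
    """
    query = np.concatenate([entity_emb, query_rel_emb])
    neighbor_reprs = np.stack([np.concatenate([r, e]) for r, e in neighbors])
    scores = neighbor_reprs @ query               # relevance of each neighbor to this query
    attn = softmax(scores)
    context = attn @ neighbor_reprs               # attention-weighted neighborhood summary
    return np.concatenate([entity_emb, context])  # query-specific entity representation
```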
Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally, the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. Such environments also come with a natural curriculum, because for any skill level, an environment full of agents of that level will have the right level of difficulty.
This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environments themselves are relatively simple. The skills include behaviors such as running, blocking, ducking, tackling, fooling opponents, kicking, and defending using both arms and legs. A highlight of the learned behaviors can be found here.
The ability to continuously learn and adapt from limited experience in nonstationary environments is an important milestone on the path towards general intelligence. In this paper, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation strategies. We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.
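For intuition, here is a minimal first-order sketch of gradient-based meta-learning on a toy linear model: the meta-parameters are moved toward whatever a few inner gradient steps on a sampled task's data produce, so that future adaptation from limited experience becomes fast. This is a generic Reptile-style illustration under my own assumptions, not the algorithm developed in the paper, and it omits the RL machinery entirely.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Squared-error loss and its gradient for a linear model y ~ X @ w."""
    err = X @ w - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def meta_train(tasks, dim, inner_lr=0.1, outer_lr=0.05, inner_steps=3,
               meta_iters=500, seed=0):
    """First-order meta-learning sketch: learn an initialization that adapts
    quickly with a few gradient steps on data from a sampled task."""
    rng = np.random.default_rng(seed)
    w_meta = rng.normal(scale=0.1, size=dim)
    for _ in range(meta_iters):
        X, y = tasks[rng.integers(len(tasks))]   # sample one task (e.g., one opponent or epoch)
        w = w_meta.copy()
        for _ in range(inner_steps):             # fast inner-loop adaptation
            _, g = loss_and_grad(w, X, y)
            w -= inner_lr * g
        w_meta += outer_lr * (w - w_meta)        # move the initialization toward the adapted weights
    return w_meta
```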
Extracting typed entity mentions from text is a fundamental component of language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types (such as genes and proteins, or chemicals and diseases), it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on the labels that are present in the data, while filling in "missing labels". This allows us to leverage all the available data within a single model. In experimental results on the BioCreative V CDR (chemicals/diseases), BioCreative VI ChemProt (chemicals/proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.
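The marginal likelihood idea can be illustrated with a small constrained forward-algorithm sketch: positions whose gold label is known are clamped, while positions a given dataset does not annotate are marginalized over all labels. This is a simplified illustration under my own assumptions (dense emission and transition score matrices, full marginalization at unannotated positions), not the exact training objective from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(emissions, transitions, allowed):
    """Forward algorithm over a linear-chain CRF, restricted to `allowed` labels.

    emissions:   (T, L) per-position label scores
    transitions: (L, L) score of moving from label i to label j
    allowed:     length-T list; allowed[t] lists the permitted labels at position t
    """
    T, L = emissions.shape
    neg_inf = -1e30
    alpha = np.full(L, neg_inf)
    alpha[list(allowed[0])] = emissions[0, list(allowed[0])]
    for t in range(1, T):
        nxt = np.full(L, neg_inf)
        for j in allowed[t]:
            nxt[j] = logsumexp(alpha + transitions[:, j]) + emissions[t, j]
        alpha = nxt
    return logsumexp(alpha)

def marginal_log_likelihood(emissions, transitions, partial_labels, num_labels):
    """Marginal likelihood of a partially labeled sequence.

    partial_labels[t] is an int gold label, or None where this dataset does not
    annotate that position; unannotated positions are summed over all labels.
    """
    allowed_gold = [[y] if y is not None else range(num_labels) for y in partial_labels]
    allowed_all = [range(num_labels)] * len(partial_labels)
    return (crf_log_partition(emissions, transitions, allowed_gold)
            - crf_log_partition(emissions, transitions, allowed_all))
```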
We introduce RelNet, a new model for relational reasoning. RelNet is a memory-augmented neural network which models entities as abstract memory slots and is equipped with an additional relational memory which models relations between all memory pairs. The model thus builds an abstract knowledge graph over the entities and relations present in a document, which can then be used to answer questions about the document. It is trained end-to-end: the only supervision to the model is in the form of correct answers to the questions. We test the model on the 20 bAbI question-answering tasks with 10k examples per task and find that it solves all the tasks with a mean error of 0.3%, achieving 0% error on 11 of the 20 tasks.
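As a rough illustration of the slot-plus-pairwise-memory idea, here is a toy sketch: entity slots are updated from each sentence encoding, and a separate memory cell for every pair of slots is updated from the pair and the same sentence. The gating, update functions, and sizes are my own simplifications for illustration and are not RelNet's actual architecture.

```python
import torch
import torch.nn as nn

class ToyRelationalMemory(nn.Module):
    """Toy memory: one slot per entity plus a memory cell for every slot pair."""

    def __init__(self, num_slots, dim):
        super().__init__()
        self.num_slots = num_slots
        self.entity_update = nn.Linear(2 * dim, dim)
        self.rel_update = nn.Linear(3 * dim, dim)

    def forward(self, sentence_encs):
        # sentence_encs: (num_sentences, dim) encodings of the document's sentences.
        n = self.num_slots
        dim = self.entity_update.out_features            # must match the encoding size
        ent = sentence_encs.new_zeros(n, dim)            # entity memory slots
        rel = sentence_encs.new_zeros(n, n, dim)         # relational memory for every slot pair
        for s in sentence_encs:
            s_exp = s.expand(n, -1)
            gate = torch.sigmoid((ent * s_exp).sum(-1, keepdim=True))   # relevance of s to each slot
            ent = ent + gate * torch.tanh(self.entity_update(torch.cat([ent, s_exp], dim=-1)))
            pair = torch.cat([ent.unsqueeze(1).expand(-1, n, -1),
                              ent.unsqueeze(0).expand(n, -1, -1),
                              s.expand(n, n, -1)], dim=-1)
            rel = rel + torch.tanh(self.rel_update(pair))
        return ent, rel                                  # used downstream to answer questions
```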
In textual information extraction and other sequence labeling tasks it is now common to use recurrent neural networks (such as LSTMs) to form rich embedded representations of long-term input co-occurrence patterns. Representation of output co-occurrence patterns is typically limited to a hand-designed graphical model, such as a linear-chain CRF representing short-term Markov dependencies among successive labels. This paper presents a method that learns embedded representations of latent output structure in sequence data. Our model takes the form of a finite-state machine with a large number of latent states per label (a latent variable CRF), where the state-transition matrix is factorized, effectively forming an embedded representation of state transitions capable of enforcing long-term label dependencies, while supporting exact Viterbi inference over output labels. We demonstrate accuracy improvements and interpretable latent structure in a synthetic but complex task based on CoNLL named entity recognition.
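For concreteness, the sketch below shows Viterbi decoding when the transition matrix over latent states is a low-rank product U @ V, with a contiguous block of latent states assigned to each output label. The state grouping, the absence of any training code, and the variable names are my assumptions for illustration; the sketch only demonstrates that exact Viterbi inference still applies with factorized transitions.

```python
import numpy as np

def viterbi_factorized(emissions, U, V, states_per_label):
    """Viterbi decoding for a latent-variable CRF with a factorized transition matrix.

    emissions:        (T, S) scores over S latent states (S = num_labels * states_per_label)
    U, V:             (S, d) and (d, S) low-rank factors; transitions = U @ V
    states_per_label: number of latent states assigned to each output label
    Returns the most likely output label sequence.
    """
    transitions = U @ V                        # embedded representation of state transitions
    T, S = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions    # (S, S): previous state -> next state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    states = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrace through the best path
        states.append(int(back[t, states[-1]]))
    states.reverse()
    return [s // states_per_label for s in states]   # map latent states back to labels
```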
In a variety of application domains the content to be recommended to users is associated with text. This includes research papers, movies with associated plot summaries, news articles, blog posts, etc. Recommendation approaches based on latent factor models can be extended naturally to leverage text by employing an explicit mapping from text to factors. This enables recommendations for new, unseen content, and may generalize better, since the factors for all items are produced by a compactly parametrized model. Previous work has used topic models or averages of word embeddings for this mapping. In this paper we present a method leveraging deep recurrent neural networks to encode the text sequence into a latent vector, specifically gated recurrent units (GRUs) trained end-to-end on the collaborative filtering task. For the task of scientific paper recommendation, this yields models with significantly higher accuracy. In cold-start scenarios, we beat the previous state-of-the-art approaches, all of which ignore word order. Performance is further improved by multi-task learning, where the text encoder network is trained for a combination of content recommendation and item metadata prediction. This regularizes the collaborative filtering model, ameliorating the problem of sparsity of the observed rating matrix.
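A minimal sketch of the core architecture idea follows: item latent factors are produced by a GRU over the item's text, and a user embedding is dotted with that text-derived vector to score the pair, so brand-new items can be recommended from text alone. The class name, layer sizes, and the use of the final hidden state (rather than any pooling the paper may use) are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class GRUTextFactorModel(nn.Module):
    """Collaborative filtering where item factors come from a GRU over item text."""

    def __init__(self, num_users, vocab_size, embed_dim=128, factor_dim=64):
        super().__init__()
        self.user_factors = nn.Embedding(num_users, factor_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, factor_dim, batch_first=True)
        self.item_bias = nn.Linear(factor_dim, 1)

    def encode_item(self, text_ids):
        # text_ids: (batch, seq_len) word indices of each item's text.
        _, h = self.encoder(self.word_embed(text_ids))
        return h.squeeze(0)                              # (batch, factor_dim)

    def forward(self, user_ids, text_ids):
        item_vec = self.encode_item(text_ids)            # text-derived item factors
        user_vec = self.user_factors(user_ids)           # learned user factors
        return (user_vec * item_vec).sum(-1) + self.item_bias(item_vec).squeeze(-1)
```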
We consider the problem of recommending comment-worthy articles such as news and blog posts. An article is comment-worthy for a particular user if that user is interested in leaving a comment on it. Recommending comment-worthy articles calls for eliciting users' commenting interests from the content of both the articles and the users' past comments. We thus propose to develop content-driven user profiles that capture these latent commenting interests and use them to recommend articles for future commenting. The difficulty of modeling comment content and the varied nature of users' commenting interests make the problem technically challenging.
We address this problem by leveraging article and comment content through topic modeling, and the co-commenting pattern of users through collaborative filtering, combined within a novel hierarchical Bayesian modeling approach. Our solution, Collaborative Correspondence Topic Models (CCTM), generates user profiles which are leveraged to provide a personalized ranking of comment-worthy articles for each user. Through these content-driven user profiles, CCTM effectively handles the ubiquitous cold-start problem without relying on additional meta-data. The inference problem for the model is intractable with no off-the-shelf solution, and we develop an efficient Monte Carlo EM algorithm. CCTM is evaluated on three real-world datasets, crawled from two blogs, ArsTechnica (AT) Gadgets (102,087 comments) and AT-Science (71,640 comments), and a news site, DailyMail (33,500 comments). We show average improvements of 14% (warm-start) and 18% (cold-start) in AUC, and 80% (warm-start) and 250% (cold-start) in Hit-Rank@5, over the state of the art.
This paper introduces a novel stick-breaking process, namely the ordered stick-breaking process (OSBP), where the atoms appear in order. The choice of weights on the atoms of OSBP ensures two important things: (1) the probability of adding new atoms decreases exponentially with time, and (2) OSBP, though non-exchangeable, admits predictive probability functions (PPFs). We apply OSBP to Bayesian nonparametric (BNP) models and find that, in a sequential setting where data arrives in mini-batches, OSBP forms a natural prior over mini-batches, facilitating the exchange of relevant statistical information across mini-batches by sharing the atoms of OSBP. One of the major contributions of this paper is SUMO, an MCMC algorithm for solving the inference problem arising from applying OSBP to BNP models. SUMO uses the PPFs of OSBP to obtain a Gibbs-sampling-based, truncation-free algorithm which applies generally to BNP models. For large-scale inference problems, existing algorithms such as Particle Filtering (PF) are not practical, and variational procedures such as TSVI (Wang & Blei, 2012) are the only alternative. SUMO is thus an important addition to the MCMC family which works well empirically. For the Dirichlet process mixture model (DPMM), SUMO outperforms TSVI (Wang & Blei, 2012) on perplexity by 33% on 3 datasets with millions of data points, which are beyond the scope of PF, using only 3GB of RAM.
Commenting is a popular facility provided by news sites. Analyzing such user-generated content has recently attracted research interest. However, in multilingual societies such as India, analyzing this content is hard for several reasons: (1) There are more than 20 official languages, but linguistic resources are available mainly for Hindi. People frequently use romanized text, as it is easy and quick to type on an English keyboard, resulting in multi-glyphic comments, where the texts are in the same language but in different scripts. Such romanized texts remain almost unexplored in machine learning. (2) In many cases, comments are made on a specific part of the article rather than the topic of the entire article. Off-the-shelf methods such as correspondence LDA are insufficient to model such relationships between articles and comments. In this paper, we extend the notion of correspondence to model multi-lingual, multi-script, and inter-lingual topics in a unified probabilistic model called the Multi-glyphic Correspondence Topic Model (MCTM). Using several metrics, we verify our approach and show that it improves over the state of the art.
Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from such a collection of documents drawn from admixtures is NP-hard. Making a strong assumption called separability, Arora et al. (2012) gave the first provable algorithm for inference. For the widely used LDA model, Anandkumar et al. (2012) gave a provable algorithm using clever tensor methods. However, neither Arora et al. (2012) nor Anandkumar et al. (2012) learn topic vectors with bounded \(l_1\) error (a natural measure for probability vectors).
Our aim is to develop a model which makes intuitive and empirically supported assumptions, and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded \(l_1\) error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic-specific Catchwords: a group of words which occur with strictly greater frequency in a topic than in any other topic, and which are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combinations of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near-optimal dependence on \(w_0\), the lowest probability that a topic is dominant, and is better than that of Arora et al. (2012). Empirical evidence shows that on several real-world corpora, both the Catchwords and Dominant admixture assumptions hold, and the proposed algorithm substantially outperforms the state of the art (Arora et al., 2013).
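For a rough sense of the threshold-then-SVD recipe, here is a heavily simplified sketch: zero out per-word frequencies below a per-word cutoff, project documents with a rank-k SVD of the thresholded matrix, cluster them, and read off each topic as the average word distribution of a cluster. The quantile cutoff, the clustering step, and the averaging are my own stand-ins; the actual algorithm's thresholds and catchword-based topic estimation are more careful.

```python
import numpy as np
from sklearn.cluster import KMeans

def recover_topics(word_doc_freq, num_topics, threshold_quantile=0.9, seed=0):
    """Simplified threshold-then-SVD topic recovery.

    word_doc_freq: (vocab_size, num_docs) word frequencies per document.
    Returns a (vocab_size, num_topics) matrix whose columns are estimated topics.
    """
    A = word_doc_freq.astype(float)

    # Per-word thresholding: keep only unusually high occurrences of each word.
    cutoffs = np.quantile(A, threshold_quantile, axis=1)[:, None]
    B = np.where(A >= cutoffs, A, 0.0)

    # Project documents onto the top singular directions of the thresholded matrix.
    U, S, Vt = np.linalg.svd(B, full_matrices=False)
    doc_proj = (S[:num_topics, None] * Vt[:num_topics]).T        # (num_docs, num_topics)

    # Cluster documents; estimate each topic as its cluster's mean word distribution.
    labels = KMeans(n_clusters=num_topics, n_init=10, random_state=seed).fit_predict(doc_proj)
    topics = np.stack([A[:, labels == k].mean(axis=1) for k in range(num_topics)], axis=1)
    return topics / topics.sum(axis=0, keepdims=True)
```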
Understanding user-generated comments in response to news and blog posts is an important area of research. After ignoring irrelevant comments, one finds that a large fraction, approximately 50%, of the comments are very specific and relate to certain parts of the article rather than the entire story. For example, in a recent product review of the Google Nexus 7 on ArsTechnica (a popular blog), the reviewer talks about the prospect of a "Retina-equipped iPad mini" in a few sentences. Interestingly, although the article is about the Nexus 7, a significant number of comments focus on this specific point regarding the iPad. We pose the problem of detecting such comments as the specific comments location (SCL) problem. SCL is an important open problem with no prior work.
SCL can be posed as a correspondence problem between comments and parts of the relevant article, and one could potentially use Corr-LDA-type models. Unfortunately, such models do not give satisfactory performance, as they are restricted to using a single topic vector per article-comments pair. In this paper we go beyond the single-topic-vector assumption and propose a novel correspondence topic model, namely SCTM, which admits multiple topic vectors (MTV) per article-comments pair. The resulting inference problem is quite complicated because of MTV and has no off-the-shelf solution. One of the major contributions of this paper is to show that, using a stick-breaking process as a prior over MTV, one can derive a collapsed Gibbs sampling procedure which empirically works well for SCL.
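For readers unfamiliar with the prior, here is the standard truncated stick-breaking construction of mixture weights. It is shown only as background intuition for how a prior over multiple topic vectors can let their effective number grow with the data; it is not SCTM's collapsed Gibbs sampler.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw mixture weights from a truncated stick-breaking construction.

    Each weight is a piece broken off the remaining stick, so early components
    tend to get larger weights while the tail decays, without fixing in advance
    how many components carry appreciable mass.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

# Example: weights over candidate topic vectors for one article-comments pair.
weights = stick_breaking_weights(alpha=2.0, truncation=10)
```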
SCTM is rigorously evaluated on three datasets, crawled from Yahoo! News (138,000 comments) and two blogs, ArsTechnica (AT) Science (90,000 comments) and AT-Gadget (160,000 comments). We observe that SCTM not only performs better than Corr-LDA on metrics like perplexity and topic coherence, but also discovers more unique topics. This immediately leads to an order-of-magnitude improvement in F1 score over Corr-LDA for SCL.