
Statistics Seminar

Graph-regularized topic modeling: Extending pLSI with document similarity by Claire Donnat

Event Details

Date
Wednesday, November 20, 2024
Time
4-5 p.m.
Description

Abstract: Topic modeling is a popular unsupervised method for uncovering latent structure within text corpora by representing documents as mixtures of topics. Incorporating additional document-level information can, however, significantly improve the estimation of both the topic and the mixture matrices. While recent advances have primarily explored Bayesian methods for integrating document metadata, these methods often lack theoretical guarantees and can be computationally intensive. To address these limitations, in this talk we propose an extension of probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, that incorporates document-level covariates or known similarities between documents using a graph formalism. Modeling documents as nodes in a network with edges denoting similarities, we propose a new estimator based on a fast graph-regularized iterative Singular Value Decomposition (SVD) that encourages similar documents to share similar topic proportions. We further characterize the estimation error of both the topic matrix and the topic assignment matrix under our proposed method by deriving high-probability bounds, and we validate our model through comprehensive experiments on synthetic datasets and real-world corpora.


Cost
Free
