Title: Improving Topic Models with Latent Feature Word Representations
Speaker: Mark Johnson (Macquarie University, Australia)
web.science.mq.edu.au/~mjohnson
Time: 30th July 2015, 15:30
(Re-scheduled, originally announced to be 31st July 15:00. Apologies.)
Venue: Seminar Room (334), Level 3, Building 5,
Institute of Software, Chinese Academy of Sciences (CAS),
4 Zhongguancun South Fourth Street, Haidian District, Beijing 100190
Abstract:
Probabilistic topic models are widely used to discover latent topics
in document collections, while latent feature vector representations
of words learnt from very large external corpora have been used to
improve the performance of many NLP tasks. In this talk I explain how
we extended two existing Dirichlet multinomial topic models by
incorporating latent feature vector representations of words trained
on very large corpora, and show that this improves the word-topic
mapping learnt on much smaller target corpora. Experimental results
show that by using latent feature information from large external
corpora, our new models produce significant improvements on topic
coherence, document clustering and document classification tasks. The
improvement is greatest on datasets with few or short documents,
including social media such as Twitter.
Joint work with Dat Quoc Nguyen, Richard Billingsley and Lan Du.
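
To make the idea concrete, a minimal sketch of how a topic-word distribution might combine a standard Dirichlet-multinomial component with a latent-feature component derived from pre-trained word vectors. All names (lam, topic_vec, word_vecs) and the mixture form are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 5                           # vocabulary size, word-vector dimension
word_vecs = rng.normal(size=(V, D))    # pre-trained word vectors (assumed given)
topic_vec = rng.normal(size=D)         # latent feature vector for one topic
phi = rng.dirichlet(np.ones(V))        # Dirichlet-multinomial topic-word distribution

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

# Latent-feature component: softmax over topic-word vector dot products
lf_component = softmax(word_vecs @ topic_vec)

# Mix the two components; lam is a hypothetical mixture weight
lam = 0.6
p_word_given_topic = lam * lf_component + (1 - lam) * phi

print(np.isclose(p_word_given_topic.sum(), 1.0))
```

Because each component is itself a valid distribution over the vocabulary, any convex combination of the two remains a valid distribution, which is what allows the word-vector information to reshape the topic-word mapping without breaking the probabilistic model.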