Title: Improving Topic Models with Latent Feature Word Representations
Speaker: Mark Johnson (Macquarie University, Australia)
web.science.mq.edu.au/~mjohnson
Time: 30th July 2015, 15:30
(Re-scheduled, originally announced to be 31st July 15:00. Apologies.)
Venue: Seminar Room (334), Level 3, Building 5,
Institute of Software, Chinese Academy of Sciences (CAS),
4 Zhongguancun South Fourth Street, Haidian District, Beijing 100190
Abstract:
Probabilistic topic models are widely used to discover latent topics
in document collections, while latent feature vector representations
of words learnt from very large external corpora have been used to
improve the performance of many NLP tasks. In this talk I explain how
we extended two existing Dirichlet multinomial topic models by
incorporating latent feature vector representations of words trained
on very large corpora, and show that this improves the word-topic
mapping learnt on much smaller target corpora. Experimental results
show that by using latent feature information from large external
corpora, our new models produce significant improvements on topic
coherence, document clustering and document classification tasks. The
improvement is greatest on datasets with few or short documents,
including social media such as Twitter.
Joint work with Dat Quoc Nguyen, Richard Billingsley and Lan Du.
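
To make the idea concrete, a minimal sketch of how a topic-word distribution might combine a standard Dirichlet-multinomial component with a latent-feature component derived from pre-trained word vectors. All names (lam, topic_vec, word_vecs) and the mixture form are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 10, 5                           # vocabulary size, word-vector dimension
word_vecs = rng.normal(size=(V, D))    # pre-trained word vectors (assumed given)
topic_vec = rng.normal(size=D)         # latent feature vector for one topic
phi = rng.dirichlet(np.ones(V))        # Dirichlet-multinomial topic-word distribution

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

# Latent-feature component: softmax over topic-word vector dot products
lf_component = softmax(word_vecs @ topic_vec)

# Mix the two components; lam is a hypothetical mixture weight
lam = 0.6
p_word_given_topic = lam * lf_component + (1 - lam) * phi

print(np.isclose(p_word_given_topic.sum(), 1.0))
```

Because each component is itself a valid distribution over the vocabulary, any convex combination of the two remains a valid distribution, which is what allows the word-vector information to reshape the topic-word mapping without breaking the probabilistic model.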