everyday paper📃

[ACL 2021] Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

saeran 2022. 8. 4. 09:31

🥑 introduction

Topic quality: coherent topics
- Coherence: for the topic "fruit", “apple pear lemon banana kiwi” is more coherent than “apple, knife, lemon, banana, spoon”
Bag-of-Words document representations ignore syntactic and semantic relationships. → The idea is to make up for this with pre-trained word and sentence representations, i.e. contextual representations such as BERT!
The paper builds on Neural ProdLDA. ProdLDA is a SOTA topic model that uses black-box variational inference.

🥑 Neural Topic Models with Language Model Pre-training

The model is named Combined Topic Model (Combined TM)
Components: ProdLDA + SBERT embedded representation
- ProdLDA: a model based on the Variational AutoEncoder (VAE).
  - step1 - encoder: BoW document representation → continuous latent representation
  - step2 - decoder: latent representation → BoW (reconstruct)
  - Approximates the Dirichlet prior using Gaussian distributions (see https://donghwa-kim.github.io/distributions.html)
  - Dirichlet distribution: a multivariate, continuous distribution. It is described by k continuous values, one per continuous random variable, and a sample from it gives k continuous random variables (you can think of it as a k-dimensional vector)
- SBERT: sentence embeddings from BERT
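To make the Dirichlet note above a bit more concrete, here is a minimal sketch in plain Python. It draws one k-dimensional sample from a Dirichlet distribution and, for comparison, one sample made by pushing a Gaussian vector through softmax, which is the general idea behind using Gaussians in place of a Dirichlet prior. The number of topics `k`, the `alpha` values, and the `mu`/`sigma` values are all made-up placeholders, and the sketch does not reproduce how ProdLDA actually chooses its Gaussian parameters.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def softmax(values):
    """Map a list of real numbers to a probability vector."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gaussian_softmax(mu, sigma, rng):
    """Draw an independent Gaussian vector and squash it onto the simplex."""
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return softmax(z)


rng = random.Random(0)
k = 5  # hypothetical number of topics

# Both samples are k-dimensional vectors of non-negative values that sum to 1,
# so either can be read as a document's topic proportions.
theta_dirichlet = sample_dirichlet([0.5] * k, rng)         # alpha is a placeholder
theta_gaussian = sample_gaussian_softmax([0.0] * k, [1.0] * k, rng)  # placeholders

print(theta_dirichlet)
print(theta_gaussian)
```

The Gaussian version is convenient for a VAE because the random draw can be written as `mu + sigma * noise`, which keeps the sampling step differentiable with respect to `mu` and `sigma`.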

🥑 Datasets

20NewsGroups, Wiki20K, Tweets2011, Google News, StackOverflow dataset

🥑 evaluation

normalized pointwise mutual information (NPMI): how related are the top-10 words within a single topic to each other?
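A minimal sketch of the NPMI idea, assuming probabilities are estimated from document-level co-occurrence (a word "occurs" if it appears anywhere in a document). The toy corpus, the helper names `npmi` and `topic_npmi`, and the choice to score never-co-occurring pairs as -1 are assumptions for illustration, not the paper's evaluation setup. For a word pair, NPMI is log(P(a,b) / (P(a)P(b))) divided by -log P(a,b), which ranges from -1 to 1.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words, using document-level co-occurrence counts."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0.0:
        return -1.0  # the pair never co-occurs: use the lower bound
    if p_ab == 1.0:
        return 1.0   # the pair co-occurs in every document: upper bound
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


def topic_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Toy corpus: each document is represented as a set of its words.
docs = [
    {"apple", "pear", "lemon"},
    {"apple", "pear", "banana"},
    {"knife", "spoon", "fork"},
    {"knife", "spoon", "apple"},
]

fruit_topic = ["apple", "pear", "lemon"]
mixed_topic = ["apple", "knife", "lemon"]

print(topic_npmi(fruit_topic, docs))
print(topic_npmi(mixed_topic, docs))
```

On this toy corpus the all-fruit topic scores higher than the mixed one, which matches the coherence intuition from the introduction. In practice the score would be computed over the top-10 words of each topic and a real reference corpus.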

external word embeddings topic coherence: how similar are the words within a single topic?
- the average pair-wise cosine similarity of the top-10 words
- the average of this cosine similarity over all topics: external topic coherence
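The two steps above can be sketched as follows. The 2-dimensional `toy_vectors` are invented for illustration only; a real evaluation would look up pre-trained word embeddings and use the top-10 words of each topic. The function names are hypothetical as well.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_embedding_coherence(top_words, vectors):
    """Average pair-wise cosine similarity of one topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def external_topic_coherence(topics, vectors):
    """Average the per-topic score over all topics."""
    scores = [topic_embedding_coherence(words, vectors) for words in topics]
    return sum(scores) / len(scores)


# Made-up embeddings: fruit words point one way, utensil words another.
toy_vectors = {
    "apple": [0.9, 0.1],
    "pear": [0.8, 0.2],
    "lemon": [0.85, 0.15],
    "knife": [0.1, 0.9],
    "spoon": [0.2, 0.8],
}

coherent = ["apple", "pear", "lemon"]
mixed = ["apple", "knife", "lemon"]

print(topic_embedding_coherence(coherent, toy_vectors))
print(topic_embedding_coherence(mixed, toy_vectors))
print(external_topic_coherence([coherent, mixed], toy_vectors))
```

Because the score depends entirely on the external embedding space, it measures coherence independently of the corpus the topic model was trained on.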
inversed rank-biased overlap (IRBO): how diverse are the topics a single model produced?
- 0: identical topics ~ 1: different topics
- RBO compares the top-10 words of two topics. It uses a weighted ranking, so words that overlap at high ranks are penalized more.
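A rough sketch of the idea, under stated assumptions: this is a simplified, truncated RBO that is normalized by the total weight so identical lists score exactly 1. It is not the extrapolated RBO formula and may not match the exact variant used in the paper. The persistence parameter `p=0.9` and the function names are also assumptions.

```python
from itertools import combinations


def rbo_truncated(list_a, list_b, p=0.9):
    """Simplified rank-biased overlap of two ranked lists (1 = identical)."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    weighted_overlap = 0.0
    weight_total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        # Share of the top-d prefixes that the two lists have in common.
        agreement = len(seen_a & seen_b) / d
        # Geometric weights: shallow depths (high ranks) count the most.
        weight = p ** (d - 1)
        weighted_overlap += weight * agreement
        weight_total += weight
    return weighted_overlap / weight_total


def inverted_rbo(topics, p=0.9):
    """1 minus the mean pair-wise RBO: 0 = identical topics, 1 = disjoint."""
    pairs = list(combinations(topics, 2))
    mean_rbo = sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)
    return 1.0 - mean_rbo


same = [["apple", "pear", "lemon"], ["apple", "pear", "lemon"]]
different = [["apple", "pear", "lemon"], ["knife", "spoon", "fork"]]

print(inverted_rbo(same))
print(inverted_rbo(different))

# A shared word at rank 1 lowers diversity more than one shared at the bottom.
reference = ["apple", "pear", "lemon"]
overlap_at_top = ["apple", "knife", "spoon"]
overlap_at_bottom = ["knife", "spoon", "apple"]
print(rbo_truncated(reference, overlap_at_top))
print(rbo_truncated(reference, overlap_at_bottom))
```

The last two calls show the weighted-ranking behaviour: the same single shared word gives a larger overlap, and so a lower diversity score, when it sits at the top of both lists.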

🥑 compared models

ProdLDA: the model this study builds on
Neural Variational Document Model (NVDM)
ETM
MetaLDA(MLDA)
LDA

🥑 results

Combined TM performed best ^_^
- LDA and NVDM did not come close; both showed low coherence

RoBERTa and BERT embeddings were compared, and RoBERTa did better
- Using better contextualized embeddings improves the topic model's performance!
