LDA Learning Summary
1. What is LDA?
LDA (latent Dirichlet allocation) is a method for inferring the topic distribution of a given document. Note, however, that the model is not necessarily tied to text: the original paper [1] discusses other applications as well. For a gentle overview of the algorithm, see [2].
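As a concrete illustration (a minimal sketch, not the exact setup of [1]): scikit-learn's LatentDirichletAllocation fits this model with variational inference, and its transform output is each document's inferred topic distribution. The toy corpus below is made up for demonstration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely illustrative; real use cases need far more documents.
docs = [
    "machine learning uses data to fit models",
    "deep learning models learn features from data",
    "the stock market fell as investors sold shares",
    "investors watch the market for trading signals",
]

# LDA operates on bag-of-words counts, not raw text.
counts = CountVectorizer().fit_transform(docs)

# Fit a 2-topic model; scikit-learn estimates it with variational inference.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's inferred topic distribution (theta in [1]).
print(doc_topics.round(2))
```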
2. Why does it outperform pLSI?
pLSI is an earlier topic model that also represents a document as a mixture of topics, but it lacks a generative account of the per-document topic mixtures, which LDA supplies. In pLSI, each training document gets its own topic distribution as a free parameter, so the number of parameters grows linearly with corpus size and the model is prone to overfitting. LDA instead draws every document's topic distribution from a shared Dirichlet prior, keeping the parameter count independent of the number of documents. Section 5 of blog [2] explains this point in detail, and the parameter count sketch below makes it concrete.
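Here is a back-of-the-envelope count of free parameters (a sketch; the values of K, V, and D below are arbitrary and only illustrative):

```python
# pLSI fits a free topic mixture p(z|d) for every training document,
# so its parameter count grows linearly with the number of documents D.
# LDA replaces those per-document mixtures with a single K-dimensional
# Dirichlet prior (alpha), so its count depends only on K and the
# vocabulary size V.
def plsi_param_count(D, K, V):
    return D * (K - 1) + K * (V - 1)  # p(z|d) per document + p(w|z)

def lda_param_count(K, V):
    return K + K * V                  # alpha + beta

for D in (1_000, 100_000):
    print(D, plsi_param_count(D, K=50, V=10_000), lda_param_count(K=50, V=10_000))
```

As D grows, the pLSI count explodes while the LDA count stays fixed, which is the overfitting argument in compact form.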
3. Why use variational inference to approximate the posterior distribution?
The optimization method used in [1] is variational EM, which is somewhat more involved (less convenient) than the Gibbs sampling approach. Recall the EM algorithm: we first form the Q function, the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables, and then update the model parameters. The main obstacle to applying EM here is computing that posterior, $p(\theta, z \mid w, \alpha, \beta)$. Because the normalizing term $p(w \mid \alpha, \beta)$ couples $\theta$ and $\beta$, the posterior is intractable, and variational inference is the alternative left for approximation.
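As a sketch of the approximation in [1]: the true posterior is replaced by a factorized (mean-field) family with free variational parameters, a Dirichlet parameter $\gamma$ and per-word multinomial parameters $\phi_n$,

$$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),$$

and Jensen's inequality then yields a tractable lower bound on the log likelihood,

$$\log p(w \mid \alpha, \beta) \ge \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)].$$

Variational EM alternates between tightening this bound per document over $(\gamma, \phi)$ (the E-step) and maximizing it over the model parameters $(\alpha, \beta)$ (the M-step).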
References:
[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[2] LDA概念解析