Given a corpus, we can use topic modelling to gain insight into the structure of the information embedded in its documents. Latent Dirichlet Allocation (LDA) is a topic modelling algorithm that can be used for this purpose.

LDA is a generative model that treats each document as a bag of words: each document is a mixture of topics, and each topic is a discrete probability distribution over words. In LDA the per-document topic distribution is assumed to have a Dirichlet prior, which gives a smoother topic distribution per document.
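The generative story above can be made concrete with a small simulation. The following is only a minimal sketch in base R, not how any package fits the model: the vocabulary, the number of topics, the document length, the hyperparameters alpha and beta, and the helper rdirichlet1 are all illustrative assumptions.

```r
# Minimal sketch of the LDA generative process in base R.
# All sizes, words and hyperparameters below are illustrative assumptions.
set.seed(42)

n_topics <- 3    # number of topics (k)
vocab    <- c("gene", "cell", "dna", "ball", "team", "goal", "vote", "party", "law")
n_words  <- 20   # words in the simulated document
alpha    <- 0.5  # Dirichlet prior on the per-document topic proportions
beta     <- 0.1  # Dirichlet prior on the per-topic word distributions

# Hypothetical helper: one draw from a symmetric Dirichlet,
# obtained by normalising independent gamma variates.
rdirichlet1 <- function(n, conc) {
  g <- rgamma(n, shape = conc)
  g / sum(g)
}

# Each topic is a probability distribution over the vocabulary (one row per topic).
phi <- t(replicate(n_topics, rdirichlet1(length(vocab), beta)))

# The document gets its own mixture of topics.
theta <- rdirichlet1(n_topics, alpha)

# For every word position: pick a topic from theta, then a word from that topic.
z   <- sample(n_topics, n_words, replace = TRUE, prob = theta)
doc <- vapply(z, function(topic) sample(vocab, 1, prob = phi[topic, ]), character(1))

print(round(theta, 3))
print(doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word distributions (phi here) and the per-document topic proportions (theta here).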

In general, topic models try to represent each document in terms of a set of topic vectors.

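library("topicmodels")

# Fit LDA by Gibbs sampling. dtm is a DocumentTermMatrix and k is the number of topics;
# nstart, seed, best, burnin, iter and thin are assumed to be defined beforehand.
ldaObj <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))

# Get the most likely topic for each doc
ldaOut.topics <- as.matrix(topics(ldaObj))

# Get the topic probability distribution for the docs (one row per doc, one column per topic)
gammaDF <- as.data.frame(ldaObj@gamma)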

With the fitted LDA object we can get the top terms for each topic, or the probability distribution over topics for each document.
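As a hedged illustration of both lookups, the sketch below fits a small model on the AssociatedPress data set that ships with topicmodels, then uses terms() and posterior(). The subset of 50 documents, k = 4, and the control values are assumptions chosen only to keep the example quick to run.

```r
library(topicmodels)

# Example document-term matrix bundled with the package
data("AssociatedPress", package = "topicmodels")

# Small illustrative fit; the subset size, k and control values are assumptions
fit <- LDA(AssociatedPress[1:50, ], k = 4, method = "Gibbs",
           control = list(seed = 1, burnin = 100, iter = 200))

# Top 5 terms for each topic (one column per topic)
top_terms <- terms(fit, 5)
print(top_terms)

# Per-document topic probabilities (one row per document, one column per topic)
doc_topics <- posterior(fit)$topics
print(round(head(doc_topics), 3))
```

Each row of doc_topics sums to 1, so it can be read directly as the topic mixture of that document.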

When running LDA, the model needs to be tuned to find the parameter values, such as the number of topics k, that best describe the data.
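One common way to tune the number of topics is to compare held-out perplexity across candidate values of k, where lower is better. The sketch below is one possible approach under stated assumptions, not a definitive recipe: the AssociatedPress data, the 80/20 document split, the candidate values of k, and the control settings are all illustrative.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Small train / held-out split (sizes are illustrative assumptions)
train_dtm <- AssociatedPress[1:80, ]
test_dtm  <- AssociatedPress[81:100, ]

# Candidate numbers of topics to compare (assumed values)
candidate_k <- c(2, 4, 8)

# Fit one model per k and score it on the held-out documents
perp <- sapply(candidate_k, function(k) {
  fit <- LDA(train_dtm, k = k, method = "Gibbs",
             control = list(seed = 1, burnin = 100, iter = 200))
  perplexity(fit, newdata = test_dtm)
})

results <- data.frame(k = candidate_k, perplexity = perp)
print(results)

# Candidate with the lowest held-out perplexity
best_k <- results$k[which.min(results$perplexity)]
```

On a real corpus a single split is noisy, so repeating this over several folds and also checking whether the resulting topics are interpretable is usually more reliable than picking the minimum from one run.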

References