The tm package provides functionality for exploratory analysis of a text corpus.
The nature of a text corpus has a big effect on how the data is modeled. Some of the text corpora that I have worked on in the past include:
Structured news feeds
Unstructured HTML news webpages
Web page comments
Tweets
Posts on news-sharing forums
Emails
After the text preprocessing stage, all of these would be a list of documents. But the nature of the data in each of them varies a lot. HTML pages have to go through rigorous content processing to filter out unwanted content. Comments and tweets are short texts, each with its own semantic keywords. Similarly, a corpus of emails has to be handled in specific ways.
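As a rough illustration of such preprocessing, here is a minimal sketch using tm's standard transformations. The two sample strings are made up for the example, and a real corpus (HTML pages or emails in particular) would need extra source-specific cleaning that is not shown here.

```r
library(tm)

# Hypothetical sample documents standing in for a news headline and a tweet
docs <- c("Breaking NEWS: Markets rally!!",
          "  RT @user: the markets are down   today ")

corpus <- VCorpus(VectorSource(docs))

# Lower-case, strip punctuation, drop English stop words, collapse whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- vapply(corpus,
                  function(d) paste(as.character(d), collapse = " "),
                  character(1))
print(cleaned)
```

The order of the steps matters: lower-casing first means the stop word list, which is lower case, matches words such as "The" as well.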
After the preprocessing stage, we are ready to explore the corpus to find the relationships between the documents and understand their deeper meaning.
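The usual first step in that exploration is to build a term document matrix from the corpus. The following is a minimal sketch with three made-up documents; the control options shown are one reasonable choice, not the only one.

```r
library(tm)

# Hypothetical toy documents, used only for illustration
docs <- c("the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats are pets")

corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents, cells are term counts
tdm <- TermDocumentMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

print(dim(tdm))    # number of terms x number of documents
print(Terms(tdm))  # the vocabulary that survived the filtering
```

Note that by default tm drops terms shorter than three characters, so very short words disappear even without stop word removal.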
Now that we have converted the corpus into a numerical matrix, we can do all kinds of computation on it. The term document matrix is sparse, since many of the words are unique to certain documents. Use the inspect function to look into the details of this matrix:
inspect(tdm)
Some operations you can perform on this matrix:
Since this is a numerical matrix, you can slice and dice it to your heart's content.
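A few typical operations are sketched below on a toy matrix. The documents and the thresholds (`lowfreq`, `corlimit`, `sparse`) are arbitrary values chosen for the example; the functions themselves are part of tm.

```r
library(tm)

# Hypothetical toy corpus
docs <- c("cat sat mat",
          "dog chased cat",
          "dog barked loudly")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Terms that occur at least twice across the whole corpus
freq_terms <- findFreqTerms(tdm, lowfreq = 2)
print(freq_terms)

# Terms whose counts correlate with those of "dog"
print(findAssocs(tdm, "dog", corlimit = 0.5))

# Drop terms that are absent from too many documents
dense <- removeSparseTerms(tdm, sparse = 0.5)
print(Terms(dense))

# Convert to an ordinary matrix and slice it like any other
m <- as.matrix(tdm)
print(sort(rowSums(m), decreasing = TRUE))  # overall term frequencies
print(m[c("cat", "dog"), 1:2])              # two terms, first two documents
```

Converting with `as.matrix` is fine for small examples, but on a large corpus the dense matrix can be very big, so it is usually better to apply `removeSparseTerms` first.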
Some other functionality provided by the tm package
Often we want to do sentence-level analysis of a given corpus. The text could be an article, a book or a corpus where breaking it into sentences and then doing a word-level analysis of the sentence structure could give a better understanding of the document.
For this kind of NLP analysis we use the packages NLP, openNLP and RWeka.
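Sentence and word splitting can be sketched with the annotators from openNLP as shown below. This assumes a working Java installation (openNLP depends on rJava) and uses a made-up sample text.

```r
library(NLP)
library(openNLP)

# Hypothetical sample text; as.String() gives it the NLP String class
text <- as.String(paste(
  "The tm package handles corpora.",
  "The openNLP package splits text into sentences and words."))

sent_annotator <- Maxent_Sent_Token_Annotator()
word_annotator <- Maxent_Word_Token_Annotator()

# Word annotation needs sentence annotations, so run both in order
annotations <- annotate(text, list(sent_annotator, word_annotator))

# Indexing the String with annotations returns the matching substrings
sentences <- text[subset(annotations, type == "sentence")]
words     <- text[subset(annotations, type == "word")]

print(sentences)
print(words)
```

Each annotation records character offsets into the original text, which makes it easy to go from a word back to the sentence that contains it.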
The package openNLPmodels.en contains models for identifying the following entities in text:
Date
Location
Money
Organization
Percentage
Person
Time
Now we can extract the various entities from the document for further analysis.
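A minimal sketch of entity extraction follows. It assumes Java is available and that openNLPmodels.en has already been installed (it is not on CRAN). The sample sentence is made up, and which entities the models actually find depends on the model version, so no particular output is implied.

```r
library(NLP)
library(openNLP)
# Assumption: openNLPmodels.en is installed; it is distributed outside CRAN,
# commonly from the datacube.wu.ac.at repository.

# Hypothetical sample text
text <- as.String(paste(
  "Pierre Vinken joined Elsevier in Amsterdam on Monday.",
  "He was paid 2 million dollars."))

# Entity annotators need sentence and word annotations as input
base <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                            Maxent_Word_Token_Annotator()))

# One annotator per entity kind; other kinds include "date", "money",
# "organization", "percentage" and "time"
person_annotator   <- Maxent_Entity_Annotator(kind = "person")
location_annotator <- Maxent_Entity_Annotator(kind = "location")

all_annotations <- annotate(text,
                            list(person_annotator, location_annotator),
                            base)
entities <- subset(all_annotations, type == "entity")

# Each entity annotation carries its kind in the features list
kinds  <- vapply(entities$features, function(f) f$kind, character(1))
result <- data.frame(kind = kinds,
                     text = as.character(text[entities]),
                     stringsAsFactors = FALSE)
print(result)
```

Collecting the results in a data frame keeps the entity kind next to the matched text, which is convenient for counting or filtering entities later.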