Text Mining of The Complete Works of Jane Austen

Text mining is the extraction of meaningful information from qualitative, unstructured text data. In this document we will perform text mining on The Complete Works of Jane Austen. The text is available from Project Gutenberg.

For this task, we will use the following R packages: tm, SnowballC, slam, wordcloud, reshape2 and ggplot2. So, first let us load these packages (any that are missing must be installed beforehand with install.packages()).

library(tm)
library(SnowballC) 
library(slam) 
library(wordcloud)
library(reshape2) 
library(ggplot2)
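
The library() calls above fail if a package is not yet installed. A minimal sketch of the install step follows, assuming a CRAN mirror is reachable; the helper name missing_packages is our own and is not part of any package.

```r
# Packages used in this document
pkgs <- c("tm", "SnowballC", "slam", "wordcloud", "reshape2", "ggplot2")

# Hypothetical helper: return the packages that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Install only in an interactive session, so scripts never trigger downloads
if (length(to_install) > 0 && interactive()) install.packages(to_install)
```

The interactive() guard is a design choice for this sketch; drop it if the script is meant to set up a fresh machine unattended.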

Now, we will extract the text from the .txt file and read it into R.

jane_text = "C:/Users/Sunaksham/Documents/pg31100.txt"
if (!file.exists(jane_text)) { download.file("http://www.gutenberg.org/cache/epub/31100/pg31100.txt", destfile = jane_text) }
austenjane = readLines(jane_text)
length(austenjane)
## [1] 80476

This data has 80476 lines. We shall check a few lines at the top and bottom of the data.

head(austenjane)
## [1] ""                                                                  
## [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen"
## [3] ""                                                                     
## [4] "This eBook is for the use of anyone anywhere at no cost and with"     
## [5] "almost no restrictions whatsoever.  You may copy it, give it away or" 
## [6] "re-use it under the terms of the Project Gutenberg License included"
tail(austenjane)
## [1] ""                                                                  
## [2] "This Web site includes information about Project Gutenberg-tm,"    
## [3] "including how to make donations to the Project Gutenberg Literary" 
## [4] "Archive Foundation, how to help produce our new eBooks, and how to"
## [5] "subscribe to our email newsletter to hear about new eBooks."       
## [6] ""

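We can see a lot of header and footer text, which we shall get rid of. After checking how many lines are occupied by this metadata, we remove them. Note that the second removal is applied after the first 93 lines have been dropped, so its indices refer to the shortened vector of 80476 - 93 = 80383 lines.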

austenjane[(1:50)]
##  [1] ""                                                                   
##  [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen" 
##  [3] ""                                                                      
##  [4] "This eBook is for the use of anyone anywhere at no cost and with"      
##  [5] "almost no restrictions whatsoever.  You may copy it, give it away or"  
##  [6] "re-use it under the terms of the Project Gutenberg License included"   
##  [7] "with this eBook or online at www.gutenberg.org"                        
##  [8] ""                                                                      
##  [9] ""                                                                      
## [10] "Title: The Complete Project Gutenberg Works of Jane Austen"            
## [11] ""                                                                      
## [12] "Author: Jane Austen"                                                   
## [13] ""                                                                      
## [14] "Editor: David Widger"                                                  
## [15] ""                                                                      
## [16] "Release Date: January 25, 2010 [EBook #31100]"                         
## [17] ""                                                                      
## [18] "Language: English"                                                     
## [19] ""                                                                      
## [20] ""                                                                      
## [21] "*** START OF THIS PROJECT GUTENBERG EBOOK THE WORKS OF JANE AUSTEN ***"
## [22] ""                                                                      
## [23] ""                                                                      
## [24] ""                                                                      
## [25] ""                                                                      
## [26] "Produced by many Project Gutenberg volunteers."                        
## [27] ""                                                                      
## [28] ""                                                                      
## [29] ""                                                                      
## [30] ""                                                                      
## [31] ""                                                                      
## [32] ""                                                                      
## [33] ""                                                                      
## [34] "THE WORKS OF JANE AUSTEN"                                              
## [35] ""                                                                      
## [36] ""                                                                      
## [37] ""                                                                      
## [38] "Edited by David Widger"                                                
## [39] ""                                                                      
## [40] "Project Gutenberg Editions"                                            
## [41] ""                                                                      
## [42] ""                                                                      
## [43] ""                                                                      
## [44] "             DEDICATION"                                               
## [45] ""                                                                      
## [46] "     This Jane Austen collection"                                      
## [47] "         is dedicated to"                                              
## [48] "     Alice Goodson [Hart] Woodby"                                      
## [49] ""                                                                      
## [50] ""
austenjane[(80310:80383)]
##  [1] "1.F.1.  Project Gutenberg volunteers and employees expend considerable"  
##  [2] "effort to identify, do copyright research on, transcribe and proofread"  
##  [3] "public domain works in creating the Project Gutenberg-tm"                
##  [4] "collection.  Despite these efforts, Project Gutenberg-tm electronic"     
##  [5] "works, and the medium on which they may be stored, may contain"          
##  [6] "\"Defects,\" such as, but not limited to, incomplete, inaccurate or"     
##  [7] "corrupt data, transcription errors, a copyright or other intellectual"   
##  [8] "property infringement, a defective or damaged disk or other medium, a"   
##  [9] "computer virus, or computer codes that damage or cannot be read by"      
## [10] "your equipment."                                                         
## [11] ""                                                                        
## [12] "1.F.2.  LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the \"Right"
## [13] "of Replacement or Refund\" described in paragraph 1.F.3, the Project"    
## [14] "Gutenberg Literary Archive Foundation, the owner of the Project"         
## [15] "Gutenberg-tm trademark, and any other party distributing a Project"      
## [16] "Gutenberg-tm electronic work under this agreement, disclaim all"         
## [17] "liability to you for damages, costs and expenses, including legal"       
## [18] "fees.  YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT"       
## [19] "LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE"        
## [20] "PROVIDED IN PARAGRAPH F3.  YOU AGREE THAT THE FOUNDATION, THE"           
## [21] "TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE"   
## [22] "LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR"  
## [23] "INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH"   
## [24] "DAMAGE."                                                                 
## [25] ""                                                                        
## [26] "1.F.3.  LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a"      
## [27] "defect in this electronic work within 90 days of receiving it, you can"  
## [28] "receive a refund of the money (if any) you paid for it by sending a"     
## [29] "written explanation to the person you received the work from.  If you"   
## [30] "received the work on a physical medium, you must return the medium with" 
## [31] "your written explanation.  The person or entity that provided you with"  
## [32] "the defective work may elect to provide a replacement copy in lieu of a" 
## [33] "refund.  If you received the work electronically, the person or entity"  
## [34] "providing it to you may choose to give you a second opportunity to"      
## [35] "receive the work electronically in lieu of a refund.  If the second copy"
## [36] "is also defective, you may demand a refund in writing without further"   
## [37] "opportunities to fix the problem."                                       
## [38] ""                                                                        
## [39] "1.F.4.  Except for the limited right of replacement or refund set forth" 
## [40] "in paragraph 1.F.3, this work is provided to you 'AS-IS' WITH NO OTHER"  
## [41] "WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO"
## [42] "WARRANTIES OF MERCHANTIBILITY OR FITNESS FOR ANY PURPOSE."               
## [43] ""                                                                        
## [44] "1.F.5.  Some states do not allow disclaimers of certain implied"         
## [45] "warranties or the exclusion or limitation of certain types of damages."  
## [46] "If any disclaimer or limitation set forth in this agreement violates the"
## [47] "law of the state applicable to this agreement, the agreement shall be"   
## [48] "interpreted to make the maximum disclaimer or limitation permitted by"   
## [49] "the applicable state law.  The invalidity or unenforceability of any"    
## [50] "provision of this agreement shall not void the remaining provisions."    
## [51] ""                                                                        
## [52] "1.F.6.  INDEMNITY - You agree to indemnify and hold the Foundation, the" 
## [53] "trademark owner, any agent or employee of the Foundation, anyone"        
## [54] "providing copies of Project Gutenberg-tm electronic works in accordance" 
## [55] "with this agreement, and any volunteers associated with the production," 
## [56] "promotion and distribution of Project Gutenberg-tm electronic works,"    
## [57] "harmless from all liability, costs and expenses, including legal fees,"  
## [58] "that arise directly or indirectly from any of the following which you do"
## [59] "or cause to occur: (a) distribution of this or any Project Gutenberg-tm" 
## [60] "work, (b) alteration, modification, or additions or deletions to any"    
## [61] "Project Gutenberg-tm work, and (c) any Defect you cause."                
## [62] ""                                                                        
## [63] ""                                                                        
## [64] "Section  2.  Information about the Mission of Project Gutenberg-tm"      
## [65] ""                                                                        
## [66] "Project Gutenberg-tm is synonymous with the free distribution of"        
## [67] "electronic works in formats readable by the widest variety of computers" 
## [68] "including obsolete, old, middle-aged and new computers.  It exists"      
## [69] "because of the efforts of hundreds of volunteers and donations from"     
## [70] "people in all walks of life."                                            
## [71] ""                                                                        
## [72] "Volunteers and financial support to provide volunteers with the"         
## [73] "assistance they need, are critical to reaching Project Gutenberg-tm's"   
## [74] "goals and ensuring that the Project Gutenberg-tm collection will"
austenjane = austenjane[-(1:93)]
austenjane = austenjane[-(80012:80383)]
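
Hard-coded line numbers break if Project Gutenberg changes the file. As a rough alternative, we could locate the marker lines by pattern. The sketch below assumes the file has a "*** START OF" line (visible in the output above) and a matching "*** END OF" line (our assumption, not shown above); strip_gutenberg is a hypothetical helper name. Unlike the removal above, it keeps the title page and dedication that follow the START marker.

```r
# Hypothetical helper: keep only the lines between the START and END markers
strip_gutenberg <- function(lines) {
  start <- grep("^\\*\\*\\* START OF", lines)[1]
  end <- grep("^\\*\\*\\* END OF", lines)[1]
  # If either marker is missing, return the input unchanged
  if (is.na(start) || is.na(end) || end <= start + 1) return(lines)
  lines[(start + 1):(end - 1)]
}

# Toy input standing in for the real file
toy <- c("header text",
         "*** START OF THIS PROJECT GUTENBERG EBOOK X ***",
         "Chapter 1",
         "It is a truth",
         "*** END OF THIS PROJECT GUTENBERG EBOOK X ***",
         "license text")

body <- strip_gutenberg(toy)
body
```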

Now, we concatenate the lines into a single string with the paste function, separating consecutive lines with a single space.

austenjane = paste(austenjane, collapse = " ")
nchar(austenjane)
## [1] 4353592

The data contains 4353592 characters. Now, we will convert this text data into a corpus. This process uses the “tm” package.

jane_vec <- VectorSource(austenjane)
jane_corpus <- Corpus(jane_vec)
summary(jane_corpus)
##   Length Class             Mode
## 1 2      PlainTextDocument list

Before proceeding further, we convert all the text data into lowercase. Then, we remove punctuation marks, numbers and common English stopwords.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_corpus <- tm_map(jane_corpus, content_transformer(tolower))
jane_corpus <- tm_map(jane_corpus, removePunctuation)
jane_corpus <- tm_map(jane_corpus, removeNumbers)
jane_corpus <- tm_map(jane_corpus, removeWords, stopwords("english"))
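
To see what these steps do to a sentence, here is a small base-R imitation. It is only a sketch: clean_text and the tiny stopword list are our own, and the tm functions may differ in detail (for example, stopwords("english") is a much longer list).

```r
# Hypothetical helper mimicking the cleaning steps on a single string
clean_text <- function(x, stop_words) {
  x <- tolower(x)                        # lowercase
  x <- gsub("[[:punct:]]+", "", x)       # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)       # drop numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stopwords
  paste(words, collapse = " ")
}

# Made-up stopword list for illustration
toy_stop <- c("it", "is", "a", "the", "in", "of")

cleaned <- clean_text("It is a truth, universally acknowledged in 1813!", toy_stop)
cleaned
## [1] "truth universally acknowledged"
```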

Also, we stem the text data to remove affixes from words so that words are reduced to their root form. This process uses the “SnowballC” package. The removals so far leave a lot of extra space between words, which we then strip.

jane_corpus <- tm_map(jane_corpus, stemDocument)
jane_corpus <- tm_map(jane_corpus, stripWhitespace)
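
Stemming explains the odd-looking terms that show up later, such as “everi” and “ladi”. A quick sketch with wordStem from SnowballC, applied to a few words chosen by us for illustration:

```r
library(SnowballC)

# Words whose stems appear in the frequent-term lists further down
words <- c("every", "lady", "little", "nothing", "quite")
stems <- wordStem(words)
stems
## [1] "everi" "ladi"  "littl" "noth"  "quit"
```

Stems are not always dictionary words, which matters when we look terms up by name later.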

We create a wordcloud of the corpus. This process uses the “wordcloud” package.

wordcloud(jane_corpus,scale=c(3,0.1),max.words=60,random.order=FALSE,rot.per=0.40, use.r.layout=FALSE)

[Figure: word cloud of the corpus]
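
A word cloud sizes each word by how often it occurs. The counts behind it can be sketched in base R; the sentence below is made up for illustration.

```r
# Toy text standing in for the cleaned corpus
words <- strsplit("emma said emma must said emma", " ")[[1]]

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
freq
```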

We create a term-document matrix, or TDM. A TDM is a matrix that gives the frequency of each term in each document of a collection. But before that, we should make sure all of our data is in PlainTextDocument form, or else we get the error: “Error: inherits(doc, "TextDocument") is not TRUE”.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_tdm <- TermDocumentMatrix(jane_corpus)
jane_tdm
## <<TermDocumentMatrix (terms: 13313, documents: 1)>>
## Non-/sparse entries: 13313/0
## Sparsity           : 0%
## Maximal term length: 32
## Weighting          : term frequency (tf)

Sparsity: high sparsity means that many terms occur in only one or a few documents. In our case, the text data forms a single document, so every term occurs in it, resulting in 0% sparsity.
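
To make the idea concrete, here is a toy term-document matrix built in base R from three made-up documents. Sparsity is simply the share of zero cells. This is a sketch of the concept, not of how tm builds its matrix.

```r
# Three made-up documents, already tokenised
docs <- list(d1 = c("emma", "likes", "tea"),
             d2 = c("emma", "likes", "walks"),
             d3 = c("anne", "likes", "bath"))

terms <- sort(unique(unlist(docs)))

# One column per document, one row per term
tdm <- vapply(docs,
              function(words) as.integer(table(factor(words, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

sparsity <- mean(tdm == 0)   # 9 zero cells out of 18
sparsity
## [1] 0.5

# With a single document, every listed term occurs in it, so sparsity is 0
single <- tdm[tdm[, "d1"] > 0, "d1", drop = FALSE]
mean(single == 0)
## [1] 0
```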

Next, we try to find frequently occurring terms. Then, we try to find associations of some of those terms with other terms, with 0.01 as the lower correlation limit.

findFreqTerms(jane_tdm, 1000)
##  [1] "can"     "come"    "day"     "even"    "everi"   "feel"    "first"  
##  [8] "friend"  "good"    "great"   "know"    "ladi"    "like"    "littl"  
## [15] "look"    "make"    "may"     "might"   "miss"    "mrs"     "much"   
## [22] "must"    "never"   "noth"    "now"     "one"     "quit"    "said"   
## [29] "say"     "see"     "sister"  "soon"    "thing"   "think"   "though" 
## [36] "thought" "time"    "well"    "will"    "wish"    "without"
findFreqTerms(jane_tdm, 1500)
##  [1] "everi" "know"  "miss"  "mrs"   "much"  "must"  "now"   "one"  
##  [9] "said"  "think" "time"  "will"
findFreqTerms(jane_tdm, 2000)
## [1] "mrs"  "much" "must" "one"  "said" "will"
findAssocs(jane_tdm, "first", 0.01)
## $first
## numeric(0)
findAssocs(jane_tdm, "miss", 0.01)
## $miss
## numeric(0)
findAssocs(jane_tdm, "lady", 0.01)
## $lady
## numeric(0)
findAssocs(jane_tdm, "little", 0.01)
## $little
## numeric(0)
findAssocs(jane_tdm, "never", 0.01)
## $never
## numeric(0)
findAssocs(jane_tdm, "time", 0.01)
## $time
## numeric(0)
findAssocs(jane_tdm, "mrs", 0.01)
## $mrs
## numeric(0)

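As mrs, miss and lady (stemmed to “ladi”) are some of the most commonly used words in Jane Austen’s works, we can infer that her stories largely revolve around female characters. We could not find any associations between the most common terms and other terms. This is expected: findAssocs measures how term counts correlate across documents, and our corpus contains only one document, so no correlation can be computed. Note also that “lady” and “little” appear in the stemmed matrix as “ladi” and “littl”.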
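
Associations only become meaningful once the text is split into several documents, for example one per chapter. The sketch below uses made-up counts for four hypothetical chunks and plain Pearson correlation, which is the idea behind findAssocs as we understand it.

```r
# Made-up term counts across four chunks of text (rows: terms, columns: chunks)
counts <- rbind(emma      = c(5, 0, 3, 0),
                knightley = c(4, 0, 2, 1),
                anne      = c(0, 6, 0, 4))
colnames(counts) <- paste0("chunk", 1:4)

# Correlation of every term with "emma" across the chunks
assoc <- cor(t(counts))["emma", ]
round(assoc, 2)
```

Here “knightley” rises and falls with “emma”, giving a positive correlation, while “anne” moves in the opposite direction.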

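Now we will convert the data into an ordinary matrix. The TDM itself is stored in the sparse format provided by the “slam” package. Sparse storage usually saves space, but since our matrix has no zero entries, the dense matrix turns out to be slightly smaller, as the object sizes below show.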

dim(jane_tdm)
## [1] 13313     1
jane_tdm_matrix <- as.matrix(jane_tdm)
head(jane_tdm_matrix)
##            Docs
## Terms       character(0)
##   abandon              5
##   abash                2
##   abat                 7
##   abbey               70
##   abbeyfor             1
##   abbeyland            1
object.size(jane_tdm)
## 739416 bytes
object.size(jane_tdm_matrix)
## 632208 bytes
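
The size difference can be roughly accounted for. The sketch below assumes the sparse TDM keeps a row index and a column index for every non-zero entry, each stored as a 4-byte integer; this storage layout is our assumption, so treat the result as a plausibility check only.

```r
n_nonzero <- 13313        # non-zero entries reported for jane_tdm
bytes_per_int <- 4        # assumption: indices are stored as 4-byte integers

# Extra bytes for the row index and the column index
index_bytes <- 2 * n_nonzero * bytes_per_int

# Difference between the two object sizes printed above
observed_diff <- 739416 - 632208

c(index_bytes = index_bytes, observed_diff = observed_diff)
## index_bytes observed_diff 
##      106504        107208
```

The two numbers are close, which suggests that the index vectors account for most of the overhead when nothing in the matrix is zero.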

We will convert this matrix into a tidy, long-format data frame using the melt function. This process uses the “reshape2” package.

jane_tdm_matrix= melt(jane_tdm_matrix, value.name = "count")
head(jane_tdm_matrix)
##       Terms         Docs count
## 1   abandon character(0)     5
## 2     abash character(0)     2
## 3      abat character(0)     7
## 4     abbey character(0)    70
## 5  abbeyfor character(0)     1
## 6 abbeyland character(0)     1
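
For readers without reshape2, much the same long format can be obtained in base R. This is a sketch on a small matrix whose three counts are taken from the output above; the document name "doc1" is made up.

```r
# Small matrix with named dimensions, as in a term-document matrix
m <- matrix(c(5, 2, 70), ncol = 1,
            dimnames = list(Terms = c("abandon", "abash", "abbey"),
                            Docs = "doc1"))

# One row per (term, document) pair, with the cell value in "count"
long <- as.data.frame(as.table(m), responseName = "count",
                      stringsAsFactors = FALSE)
long
```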

Finally, we plot the frequencies of terms using the ggplot function. This process uses the ggplot2 package.

jane_plot <- ggplot(jane_tdm_matrix, aes(x= Docs, y= Terms, group = 1))
jane_plot <- jane_plot + geom_line()
jane_plot

[Figure: line plot of Terms against Docs]
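
The plot above maps Docs to the x axis and Terms to the y axis, so the count column is not drawn. If the aim is to compare frequencies, a bar chart of the most frequent terms may be easier to read. A sketch under our own assumptions: top_n_terms is a hypothetical helper and the toy counts are made up; with the real data, pass jane_tdm_matrix instead.

```r
library(ggplot2)

# Hypothetical helper: keep the n rows with the largest counts
top_n_terms <- function(df, n = 20) {
  df <- df[order(df$count, decreasing = TRUE), , drop = FALSE]
  head(df, n)
}

# Made-up counts for illustration only
toy <- data.frame(Terms = c("alpha", "beta", "gamma", "delta"),
                  count = c(12, 40, 7, 25))

top <- top_n_terms(toy, n = 3)

# Horizontal bars, ordered by count
freq_plot <- ggplot(top, aes(x = reorder(Terms, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Count")
freq_plot
```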
