Text Mining of The Complete Works of Jane Austen

Text mining refers to the extraction of meaningful information from qualitative, unstructured text data. In this document we will perform text mining on The Complete Works of Jane Austen, which we can download from Project Gutenberg.

For this task we will use the following R packages: tm, SnowballC, slam, wordcloud, reshape2 and ggplot2. First, let us install and load these packages.

library(tm)
library(SnowballC) 
library(slam) 
library(wordcloud)
library(reshape2) 
library(ggplot2)

Now we will read the text file into R, downloading it from Project Gutenberg first if it is not already present locally.

jane_text <- "C:/Users/Sunaksham/Documents/pg31100.txt"
if (!file.exists(jane_text)) {
  download.file("http://www.gutenberg.org/cache/epub/31100/pg31100.txt", destfile = jane_text)
}
austenjane <- readLines(jane_text)
length(austenjane)
## [1] 80476

The data has 80476 lines. Let us check a few lines at the top and bottom.

head(austenjane)
## [1] ""                                                                  
## [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen"
## [3] ""                                                                     
## [4] "This eBook is for the use of anyone anywhere at no cost and with"     
## [5] "almost no restrictions whatsoever.  You may copy it, give it away or" 
## [6] "re-use it under the terms of the Project Gutenberg License included"
tail(austenjane)
## [1] ""                                                                  
## [2] "This Web site includes information about Project Gutenberg-tm,"    
## [3] "including how to make donations to the Project Gutenberg Literary" 
## [4] "Archive Foundation, how to help produce our new eBooks, and how to"
## [5] "subscribe to our email newsletter to hear about new eBooks."       
## [6] ""

We can see a lot of header and footer text, which we shall get rid of. After checking how many lines the metadata occupies, we remove those lines.

austenjane[1:50]
##  [1] ""                                                                   
##  [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen" 
##  [3] ""                                                                      
##  [4] "This eBook is for the use of anyone anywhere at no cost and with"      
##  [5] "almost no restrictions whatsoever.  You may copy it, give it away or"  
##  [6] "re-use it under the terms of the Project Gutenberg License included"   
##  [7] "with this eBook or online at www.gutenberg.org"                        
##  [8] ""                                                                      
##  [9] ""                                                                      
## [10] "Title: The Complete Project Gutenberg Works of Jane Austen"            
## [11] ""                                                                      
## [12] "Author: Jane Austen"                                                   
## [13] ""                                                                      
## [14] "Editor: David Widger"                                                  
## [15] ""                                                                      
## [16] "Release Date: January 25, 2010 [EBook #31100]"                         
## [17] ""                                                                      
## [18] "Language: English"                                                     
## [19] ""                                                                      
## [20] ""                                                                      
## [21] "*** START OF THIS PROJECT GUTENBERG EBOOK THE WORKS OF JANE AUSTEN ***"
## [22] ""                                                                      
## [23] ""                                                                      
## [24] ""                                                                      
## [25] ""                                                                      
## [26] "Produced by many Project Gutenberg volunteers."                        
## [27] ""                                                                      
## [28] ""                                                                      
## [29] ""                                                                      
## [30] ""                                                                      
## [31] ""                                                                      
## [32] ""                                                                      
## [33] ""                                                                      
## [34] "THE WORKS OF JANE AUSTEN"                                              
## [35] ""                                                                      
## [36] ""                                                                      
## [37] ""                                                                      
## [38] "Edited by David Widger"                                                
## [39] ""                                                                      
## [40] "Project Gutenberg Editions"                                            
## [41] ""                                                                      
## [42] ""                                                                      
## [43] ""                                                                      
## [44] "             DEDICATION"                                               
## [45] ""                                                                      
## [46] "     This Jane Austen collection"                                      
## [47] "         is dedicated to"                                              
## [48] "     Alice Goodson [Hart] Woodby"                                      
## [49] ""                                                                      
## [50] ""
austenjane[80310:80383]
##  [1] "1.F.1.  Project Gutenberg volunteers and employees expend considerable"  
##  [2] "effort to identify, do copyright research on, transcribe and proofread"  
##  [3] "public domain works in creating the Project Gutenberg-tm"                
##  [4] "collection.  Despite these efforts, Project Gutenberg-tm electronic"     
##  [5] "works, and the medium on which they may be stored, may contain"          
##  [6] "\"Defects,\" such as, but not limited to, incomplete, inaccurate or"     
##  [7] "corrupt data, transcription errors, a copyright or other intellectual"   
##  [8] "property infringement, a defective or damaged disk or other medium, a"   
##  [9] "computer virus, or computer codes that damage or cannot be read by"      
## [10] "your equipment."                                                         
## [11] ""                                                                        
## [12] "1.F.2.  LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the \"Right"
## [13] "of Replacement or Refund\" described in paragraph 1.F.3, the Project"    
## [14] "Gutenberg Literary Archive Foundation, the owner of the Project"         
## [15] "Gutenberg-tm trademark, and any other party distributing a Project"      
## [16] "Gutenberg-tm electronic work under this agreement, disclaim all"         
## [17] "liability to you for damages, costs and expenses, including legal"       
## [18] "fees.  YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT"       
## [19] "LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE"        
## [20] "PROVIDED IN PARAGRAPH F3.  YOU AGREE THAT THE FOUNDATION, THE"           
## [21] "TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE"   
## [22] "LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR"  
## [23] "INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH"   
## [24] "DAMAGE."                                                                 
## [25] ""                                                                        
## [26] "1.F.3.  LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a"      
## [27] "defect in this electronic work within 90 days of receiving it, you can"  
## [28] "receive a refund of the money (if any) you paid for it by sending a"     
## [29] "written explanation to the person you received the work from.  If you"   
## [30] "received the work on a physical medium, you must return the medium with" 
## [31] "your written explanation.  The person or entity that provided you with"  
## [32] "the defective work may elect to provide a replacement copy in lieu of a" 
## [33] "refund.  If you received the work electronically, the person or entity"  
## [34] "providing it to you may choose to give you a second opportunity to"      
## [35] "receive the work electronically in lieu of a refund.  If the second copy"
## [36] "is also defective, you may demand a refund in writing without further"   
## [37] "opportunities to fix the problem."                                       
## [38] ""                                                                        
## [39] "1.F.4.  Except for the limited right of replacement or refund set forth" 
## [40] "in paragraph 1.F.3, this work is provided to you 'AS-IS' WITH NO OTHER"  
## [41] "WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO"
## [42] "WARRANTIES OF MERCHANTIBILITY OR FITNESS FOR ANY PURPOSE."               
## [43] ""                                                                        
## [44] "1.F.5.  Some states do not allow disclaimers of certain implied"         
## [45] "warranties or the exclusion or limitation of certain types of damages."  
## [46] "If any disclaimer or limitation set forth in this agreement violates the"
## [47] "law of the state applicable to this agreement, the agreement shall be"   
## [48] "interpreted to make the maximum disclaimer or limitation permitted by"   
## [49] "the applicable state law.  The invalidity or unenforceability of any"    
## [50] "provision of this agreement shall not void the remaining provisions."    
## [51] ""                                                                        
## [52] "1.F.6.  INDEMNITY - You agree to indemnify and hold the Foundation, the" 
## [53] "trademark owner, any agent or employee of the Foundation, anyone"        
## [54] "providing copies of Project Gutenberg-tm electronic works in accordance" 
## [55] "with this agreement, and any volunteers associated with the production," 
## [56] "promotion and distribution of Project Gutenberg-tm electronic works,"    
## [57] "harmless from all liability, costs and expenses, including legal fees,"  
## [58] "that arise directly or indirectly from any of the following which you do"
## [59] "or cause to occur: (a) distribution of this or any Project Gutenberg-tm" 
## [60] "work, (b) alteration, modification, or additions or deletions to any"    
## [61] "Project Gutenberg-tm work, and (c) any Defect you cause."                
## [62] ""                                                                        
## [63] ""                                                                        
## [64] "Section  2.  Information about the Mission of Project Gutenberg-tm"      
## [65] ""                                                                        
## [66] "Project Gutenberg-tm is synonymous with the free distribution of"        
## [67] "electronic works in formats readable by the widest variety of computers" 
## [68] "including obsolete, old, middle-aged and new computers.  It exists"      
## [69] "because of the efforts of hundreds of volunteers and donations from"     
## [70] "people in all walks of life."                                            
## [71] ""                                                                        
## [72] "Volunteers and financial support to provide volunteers with the"         
## [73] "assistance they need, are critical to reaching Project Gutenberg-tm's"   
## [74] "goals and ensuring that the Project Gutenberg-tm collection will"
austenjane <- austenjane[-(1:93)]         # drop the header lines
austenjane <- austenjane[-(80012:80383)]  # drop the footer; indices have shifted down by 93 after the first removal

Now we concatenate the lines into a single string using the paste() function, separating the lines with a single space.

austenjane <- paste(austenjane, collapse = " ")
nchar(austenjane)
## [1] 4353592

The data contains 4353592 characters. Next, we convert this text into a corpus using the “tm” package.

jane_vec <- VectorSource(austenjane)
jane_corpus <- Corpus(jane_vec)
summary(jane_corpus)
##   Length Class             Mode
## 1 2      PlainTextDocument list

Before proceeding further, we convert all the text to lowercase. Then we remove punctuation marks, numbers and common English stopwords.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_corpus <- tm_map(jane_corpus, content_transformer(tolower))
jane_corpus <- tm_map(jane_corpus, removePunctuation)
jane_corpus <- tm_map(jane_corpus, removeNumbers)
jane_corpus <- tm_map(jane_corpus, removeWords, stopwords("english"))

Next, we stem the text to strip affixes so that words are reduced to their root form; this uses the “SnowballC” package. These transformations leave extra whitespace between words, which we then strip.

jane_corpus <- tm_map(jane_corpus, stemDocument)
jane_corpus <- tm_map(jane_corpus, stripWhitespace)

We create a wordcloud of the corpus using the “wordcloud” package.

wordcloud(jane_corpus, scale = c(3, 0.1), max.words = 60, random.order = FALSE, rot.per = 0.40, use.r.layout = FALSE)

(Wordcloud of the 60 most frequent terms)

We create a term-document matrix, or TDM: a matrix giving the frequency of each term across a collection of documents. Before building it, we should make sure all of our data is stored as PlainTextDocument objects, or TermDocumentMatrix() will fail with the error “Error: inherits(doc, "TextDocument") is not TRUE”.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_tdm <- TermDocumentMatrix(jane_corpus)
jane_tdm
## <<TermDocumentMatrix (terms: 13313, documents: 1)>>
## Non-/sparse entries: 13313/0
## Sparsity           : 0%
## Maximal term length: 32
## Weighting          : term frequency (tf)

Sparsity – high sparsity indicates that many terms occur in only one or a few documents. In our case the text forms a single document, so every term occurs in it, resulting in 0% sparsity.

Next, we find frequently occurring terms. Then we look for associations between some of those terms and other terms, with 0.01 as the lower correlation limit.

findFreqTerms(jane_tdm, 1000)
##  [1] "can"     "come"    "day"     "even"    "everi"   "feel"    "first"  
##  [8] "friend"  "good"    "great"   "know"    "ladi"    "like"    "littl"  
## [15] "look"    "make"    "may"     "might"   "miss"    "mrs"     "much"   
## [22] "must"    "never"   "noth"    "now"     "one"     "quit"    "said"   
## [29] "say"     "see"     "sister"  "soon"    "thing"   "think"   "though" 
## [36] "thought" "time"    "well"    "will"    "wish"    "without"
findFreqTerms(jane_tdm, 1500)
##  [1] "everi" "know"  "miss"  "mrs"   "much"  "must"  "now"   "one"  
##  [9] "said"  "think" "time"  "will"
findFreqTerms(jane_tdm, 2000)
## [1] "mrs"  "much" "must" "one"  "said" "will"
findAssocs(jane_tdm, "first", 0.01)
## $first
## numeric(0)
findAssocs(jane_tdm, "miss", 0.01)
## $miss
## numeric(0)
findAssocs(jane_tdm, "lady", 0.01)
## $lady
## numeric(0)
findAssocs(jane_tdm, "little", 0.01)
## $little
## numeric(0)
findAssocs(jane_tdm, "never", 0.01)
## $never
## numeric(0)
findAssocs(jane_tdm, "time", 0.01)
## $time
## numeric(0)
findAssocs(jane_tdm, "mrs", 0.01)
## $mrs
## numeric(0)

As mrs, miss and lady are among the most commonly used words in Jane Austen’s works, we can infer that her stories revolve around female characters. We could not find any associations for the most common terms: findAssocs() correlates term frequencies across documents, and with only one document in the corpus there is no variation to correlate, so it always returns numeric(0).
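Because findAssocs() needs several documents to correlate term counts across, one way to obtain associations would be to split the text into chunks before building the corpus. A minimal sketch, assuming arbitrary 10000-character chunks and hypothetical object names chunk_corpus and chunk_tdm:

```r
# Sketch: cut the single concatenated string into ~10000-character chunks,
# each of which becomes one document, so findAssocs() has cross-document
# variation to correlate.
starts <- seq(1, nchar(austenjane), by = 10000)
chunks <- substring(austenjane, starts, starts + 9999)

chunk_corpus <- Corpus(VectorSource(chunks))
chunk_corpus <- tm_map(chunk_corpus, content_transformer(tolower))
chunk_corpus <- tm_map(chunk_corpus, removePunctuation)
chunk_tdm <- TermDocumentMatrix(chunk_corpus)
findAssocs(chunk_tdm, "miss", 0.3)  # associations can now be computed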

Now we convert the TDM into an ordinary matrix. The “tm” package stores a TDM as a sparse simple_triplet_matrix from the “slam” package; because our matrix has 0% sparsity, the dense representation here is actually slightly more compact, as object.size() shows.

dim(jane_tdm)
## [1] 13313     1
jane_tdm_matrix <- as.matrix(jane_tdm)
head(jane_tdm_matrix)
##            Docs
## Terms       character(0)
##   abandon              5
##   abash                2
##   abat                 7
##   abbey               70
##   abbeyfor             1
##   abbeyland            1
object.size(jane_tdm)
## 739416 bytes
object.size(jane_tdm_matrix)
## 632208 bytes

We reshape this matrix into tidy long format with the melt function from the “reshape2” package.

jane_tdm_matrix <- melt(jane_tdm_matrix, value.name = "count")
head(jane_tdm_matrix)
##       Terms         Docs count
## 1   abandon character(0)     5
## 2     abash character(0)     2
## 3      abat character(0)     7
## 4     abbey character(0)    70
## 5  abbeyfor character(0)     1
## 6 abbeyland character(0)     1

Finally, we plot the terms against the document using the ggplot function. This uses the ggplot2 package.

jane_plot <- ggplot(jane_tdm_matrix, aes(x= Docs, y= Terms, group = 1))
jane_plot <- jane_plot + geom_line()
jane_plot

(Plot of Terms against Docs)
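Plotting every term against a single document is hard to read; a bar chart of the most frequent terms may be more informative. A minimal sketch built from the melted matrix created above (the cutoff of 15 terms is an arbitrary choice):

```r
# Sketch: bar chart of the 15 most frequent (stemmed) terms.
top_terms <- jane_tdm_matrix[order(-jane_tdm_matrix$count), ][1:15, ]
ggplot(top_terms, aes(x = reorder(Terms, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency")
```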

GroupBy summary statistics in R

First we install and load a few packages – “data.table”, “Hmisc” and “doBy”. We use data.table for its data input and manipulation techniques, Hmisc for its data analysis techniques, and doBy for its grouping and summary techniques. After loading the packages, we read the diamonds data with fread() and count the observations with nrow().

library(data.table)
library(Hmisc)

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: ‘Hmisc’
##
## The following objects are masked from ‘package:base’:
##
##     format.pval, round.POSIXt, trunc.POSIXt, units

library(doBy)
diamonds = fread("diamonds.csv")
nrow(diamonds)

## [1] 53940

str() gives the basic structure of objects.

str(diamonds)

## Classes 'data.table' and 'data.frame':   53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr  "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr  "E" "E" "E" "I" ...
##  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  - attr(*, ".internal.selfref")=<externalptr>

head() and tail() show the top six and bottom six observations, respectively.

head(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

tail(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 2:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 3:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 4:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 5:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 6:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

summary() provides the minimum, maximum, mean, median, and 1st and 3rd quartiles for numeric variables, and the length (count) and class for character variables.

summary(diamonds)

##      carat            cut               color             clarity
##  Min.   :0.2000   Length:53940       Length:53940       Length:53940
##  1st Qu.:0.4000   Class :character   Class :character   Class :character
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character
##  Mean   :0.7979
##  3rd Qu.:1.0400
##  Max.   :5.0100
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800

describe() from the Hmisc package provides the number of observations, missing values and unique values, the mean, percentiles in multiples of 5, and the five lowest and five highest values for each numeric variable.

For each character variable, describe() displays the frequency count and frequency percentage of each of its categories.

describe(diamonds)

## diamonds
##
##  10  Variables      53940  Observations
## —————————————————————————
## carat
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     273       1  0.7979    0.30    0.31    0.40    0.70
##     .75     .90     .95
##    1.04    1.51    1.70
##
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## —————————————————————————
## cut
##       n missing  unique
##   53940       0       5
##
##           Fair Good Ideal Premium Very Good
## Frequency 1610 4906 21551   13791     12082
## %            3    9    40      26        22
## —————————————————————————
## color
##       n missing  unique
##   53940       0       7
##
##              D    E    F     G    H    I    J
## Frequency 6775 9797 9542 11292 8304 5422 2808
## %           13   18   18    21   15   10    5
## —————————————————————————
## clarity
##       n missing  unique
##   53940       0       8
##
##            I1   IF   SI1  SI2  VS1   VS2 VVS1 VVS2
## Frequency 741 1790 13065 9194 8171 12258 3655 5066
## %           1    3    24   17   15    23    7    9
## —————————————————————————
## depth
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     184       1   61.75    59.3    60.0    61.0    61.8
##     .75     .90     .95
##    62.5    63.3    63.8
##
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## —————————————————————————
## table
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     127    0.98   57.46      54      55      56      57
##     .75     .90     .95
##      59      60      61
##
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## —————————————————————————
## price
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0   11602       1    3933     544     646     950    2401
##     .75     .90     .95
##    5324    9821   13107
##
## lowest :   326   327   334   335   336
## highest: 18803 18804 18806 18818 18823
## —————————————————————————
## x
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     554       1   5.731    4.29    4.36    4.71    5.70
##     .75     .90     .95
##    6.54    7.31    7.66
##
## lowest :  0.00  3.73  3.74  3.76  3.77
## highest: 10.01 10.02 10.14 10.23 10.74
## —————————————————————————
## y
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     552       1   5.735    4.30    4.36    4.72    5.71
##     .75     .90     .95
##    6.54    7.30    7.65
##
## lowest :  0.00  3.68  3.71  3.72  3.73
## highest: 10.10 10.16 10.54 31.80 58.90
## —————————————————————————
## z
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     375       1   3.539    2.65    2.69    2.91    3.53
##     .75     .90     .95
##    4.04    4.52    4.73
##
## lowest :  0.00  1.07  1.41  1.53  2.06
## highest:  6.43  6.72  6.98  8.06 31.80
## —————————————————————————

summaryBy() is present in the doBy package. It provides summary statistics for numeric variables by group/category. Here, we find the mean of price for each category of cut.

summaryBy(price ~ cut, data = diamonds, FUN = mean)

##          cut price.mean
## 1:      Fair   4358.758
## 2:      Good   3928.864
## 3:     Ideal   3457.542
## 4:   Premium   4584.258
## 5: Very Good   3981.760
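Since diamonds was read with fread() and is therefore already a data.table, the same grouped mean could also be computed directly with data.table syntax; a minimal equivalent sketch:

```r
# Sketch: mean price by cut via data.table's grouped aggregation.
diamonds[, .(price.mean = mean(price)), by = cut]
```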

Next, we find the mean of price and carat for each combination of the cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = mean)

##           cut color price.mean carat.mean
##  1:      Fair     D   4291.061  0.9201227
##  2:      Fair     E   3682.312  0.8566071
##  3:      Fair     F   3827.003  0.9047115
##  4:      Fair     G   4239.255  1.0238217
##  5:      Fair     H   5135.683  1.2191749
##  6:      Fair     I   4685.446  1.1980571
##  7:      Fair     J   4975.655  1.3411765
##  8:      Good     D   3405.382  0.7445166
##  9:      Good     E   3423.644  0.7451340
## 10:      Good     F   3495.750  0.7759296
## 11:      Good     G   4123.482  0.8508955
## 12:      Good     H   4276.255  0.9147293
## 13:      Good     I   5078.533  1.0572222
## 14:      Good     J   4574.173  1.0995440
## 15:     Ideal     D   2629.095  0.5657657
## 16:     Ideal     E   2597.550  0.5784012
## 17:     Ideal     F   3374.939  0.6558285
## 18:     Ideal     G   3720.706  0.7007146
## 19:     Ideal     H   3889.335  0.7995249
## 20:     Ideal     I   4451.970  0.9130291
## 21:     Ideal     J   4918.186  1.0635937
## 22:   Premium     D   3631.293  0.7215471
## 23:   Premium     E   3538.914  0.7177450
## 24:   Premium     F   4324.890  0.8270356
## 25:   Premium     G   4500.742  0.8414877
## 26:   Premium     H   5216.707  1.0164492
## 27:   Premium     I   5946.181  1.1449370
## 28:   Premium     J   6294.592  1.2930941
## 29: Very Good     D   3470.467  0.6964243
## 30: Very Good     E   3214.652  0.6763167
## 31: Very Good     F   3778.820  0.7409612
## 32: Very Good     G   3872.754  0.7667986
## 33: Very Good     H   4535.390  0.9159485
## 34: Very Good     I   5255.880  1.0469518
## 35: Very Good     J   5103.513  1.1332153
##           cut color price.mean carat.mean

Finally, we find the count, mean, median and standard deviation of price and carat for each combination of the cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds,
          FUN = function(x) c(count = length(x), mean = mean(x),
                              median = median(x), sd = sd(x)))

##           cut color price.count price.mean price.median price.sd
##  1:      Fair     D         163   4291.061       3730.0 3286.114
##  2:      Fair     E         224   3682.312       2956.0 2976.652
##  3:      Fair     F         312   3827.003       3035.0 3223.303
##  4:      Fair     G         314   4239.255       3057.0 3609.644
##  5:      Fair     H         303   5135.683       3816.0 3886.482
##  6:      Fair     I         175   4685.446       3246.0 3730.271
##  7:      Fair     J         119   4975.655       3302.0 4050.459
##  8:      Good     D         662   3405.382       2728.5 3175.149
##  9:      Good     E         933   3423.644       2420.0 3330.702
## 10:      Good     F         909   3495.750       2647.0 3202.411
## 11:      Good     G         871   4123.482       3340.0 3702.505
## 12:      Good     H         702   4276.255       3468.5 4020.660
## 13:      Good     I         522   5078.533       3639.5 4631.702
## 14:      Good     J         307   4574.173       3733.0 3707.791
## 15:     Ideal     D        2834   2629.095       1576.0 3001.070
## 16:     Ideal     E        3903   2597.550       1437.0 2956.007
## 17:     Ideal     F        3826   3374.939       1775.0 3766.635
## 18:     Ideal     G        4884   3720.706       1857.5 4006.262
## 19:     Ideal     H        3115   3889.335       2278.0 4013.375
## 20:     Ideal     I        2093   4451.970       2659.0 4505.150
## 21:     Ideal     J         896   4918.186       4096.0 4476.207
## 22:   Premium     D        1603   3631.293       2009.0 3711.634
## 23:   Premium     E        2337   3538.914       1928.0 3794.987
## 24:   Premium     F        2331   4324.890       2841.0 4012.023
## 25:   Premium     G        2924   4500.742       2745.0 4356.571
## 26:   Premium     H        2360   5216.707       4511.0 4466.190
## 27:   Premium     I        1428   5946.181       4640.0 5053.746
## 28:   Premium     J         808   6294.592       5063.0 4788.937
## 29: Very Good     D        1513   3470.467       2310.0 3523.753
## 30: Very Good     E        2400   3214.652       1989.5 3408.024
## 31: Very Good     F        2164   3778.820       2471.0 3786.124
## 32: Very Good     G        2299   3872.754       2437.0 3861.375
## 33: Very Good     H        1824   4535.390       3734.0 4185.798
## 34: Very Good     I        1204   5255.880       3888.0 4687.105
## 35: Very Good     J         678   5103.513       4113.0 4135.653
##           cut color price.count price.mean price.median price.sd
##     carat.count carat.mean carat.median  carat.sd
##  1:         163  0.9201227        0.900 0.4054185
##  2:         224  0.8566071        0.900 0.3645848
##  3:         312  0.9047115        0.900 0.4188899
##  4:         314  1.0238217        0.980 0.4927241
##  5:         303  1.2191749        1.010 0.5482389
##  6:         175  1.1980571        1.010 0.5219776
##  7:         119  1.3411765        1.030 0.7339713
##  8:         662  0.7445166        0.700 0.3631169
##  9:         933  0.7451340        0.700 0.3808900
## 10:         909  0.7759296        0.710 0.3700142
## 11:         871  0.8508955        0.900 0.4327176
## 12:         702  0.9147293        0.900 0.4977162
## 13:         522  1.0572222        1.000 0.5756366
## 14:         307  1.0995440        1.020 0.5371248
## 15:        2834  0.5657657        0.500 0.2993503
## 16:        3903  0.5784012        0.500 0.3125406
## 17:        3826  0.6558285        0.530 0.3745245
## 18:        4884  0.7007146        0.540 0.4106182
## 19:        3115  0.7995249        0.700 0.4868741
## 20:        2093  0.9130291        0.740 0.5537277
## 21:         896  1.0635937        1.030 0.5821001
## 22:        1603  0.7215471        0.580 0.3974635
## 23:        2337  0.7177450        0.580 0.4097847
## 24:        2331  0.8270356        0.760 0.4201959
## 25:        2924  0.8414877        0.755 0.4795344
## 26:        2360  1.0164492        1.010 0.5440777
## 27:        1428  1.1449370        1.140 0.6136041
## 28:         808  1.2930941        1.250 0.6137086
## 29:        1513  0.6964243        0.610 0.3692291
## 30:        2400  0.6763167        0.570 0.3779140
## 31:        2164  0.7409612        0.700 0.3888827
## 32:        2299  0.7667986        0.700 0.4180156
## 33:        1824  0.9159485        0.900 0.5029465
## 34:        1204  1.0469518        1.005 0.5519840
## 35:         678  1.1332153        1.060 0.5559197
##     carat.count carat.mean carat.median  carat.sd

New York Flights Dataset

Let us explore the New York Flights dataset, which is available in the nycflights13 package. If you do not already have it, install it with install.packages("nycflights13").

2015-02-04

Package

> library(nycflights13)

The library() command loads the package.

Data Inspection

> dim(flights)

The dim() command gives the dimensions of the dataset: number of observations × number of variables.

[1] 336776     16
> head(flights)

head gives the first few observations (six by default).

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1 2013     1   1      517         2      830        11      UA  N14228
2 2013     1   1      533         4      850        20      UA  N24211
3 2013     1   1      542         2      923        33      AA  N619AA
4 2013     1   1      544        -1     1004       -18      B6  N804JB
5 2013     1   1      554        -6      812       -25      DL  N668DN
6 2013     1   1      554        -4      740        12      UA  N39463
  flight origin dest air_time distance hour minute
1   1545    EWR  IAH      227     1400    5     17
2   1714    LGA  IAH      227     1416    5     33
3   1141    JFK  MIA      160     1089    5     42
4    725    JFK  BQN      183     1576    5     44
5    461    LGA  ATL      116      762    5     54
6   1696    EWR  ORD      150      719    5     54
> tail(flights)

tail gives the last few observations.

With the help of the head and tail commands, we can check whether the data has loaded properly. Notice the NA values in the last rows: these are flights with missing departure and arrival times, i.e. cancelled flights.
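The same inspection workflow works on any data frame. As a quick self-contained sketch, here it is on base R's built-in mtcars dataset (chosen here purely as an illustration, since it ships with R and needs no extra package):

```r
# Inspect a built-in data frame with the same commands
data(mtcars)        # 32 cars, 11 numeric variables
dim(mtcars)         # dimensions: 32 rows, 11 columns
head(mtcars, 3)     # first 3 observations
tail(mtcars, 3)     # last 3 observations
```

Passing a number as the second argument to head or tail controls how many rows are shown; with no second argument both default to six.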

       year month day dep_time dep_delay arr_time arr_delay carrier
336771 2013     9  30       NA        NA       NA        NA      EV
336772 2013     9  30       NA        NA       NA        NA      9E
336773 2013     9  30       NA        NA       NA        NA      9E
336774 2013     9  30       NA        NA       NA        NA      MQ
336775 2013     9  30       NA        NA       NA        NA      MQ
336776 2013     9  30       NA        NA       NA        NA      MQ
       tailnum flight origin dest air_time distance hour minute
336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
336772           3393    JFK  DCA       NA      213   NA     NA
336773           3525    LGA  SYR       NA      198   NA     NA
336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA
> str(flights)

The str command gives the structure of the dataset: its dimensions, the variable names and variable types, and the first few values of each variable.

Classes 'tbl_df', 'tbl' and 'data.frame':   336776 obs. of  16 variables:
 $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
 $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
 $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier  : chr  "UA" "UA" "AA" "B6" ...
 $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
 $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
 $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
 $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
 $ distance : num  1400 1416 1089 1576 762 ...
 $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
 $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...
> summary(flights)

summary gives the minimum and maximum values, mean, median, and quartiles of every numeric variable, along with a count of NA values wherever data are missing.

      year          month             day           dep_time   
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
 Median :2013   Median : 7.000   Median :16.00   Median :1401  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
                                                 NA's   :8255  
   dep_delay          arr_time      arr_delay          carrier         
 Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776     
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character  
 Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character  
 Mean   :  12.64   Mean   :1502   Mean   :   6.895                     
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000                     
 Max.   :1301.00   Max.   :2400   Max.   :1272.000                     
 NA's   :8255      NA's   :8713   NA's   :9430                         
   tailnum              flight        origin              dest          
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        

    air_time        distance         hour           minute     
 Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
 Median :129.0   Median : 872   Median :14.00   Median :31.00  
 Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76  
 3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
 Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00  
 NA's   :9430                   NA's   :8255    NA's   :8255   
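summary() reports NA counts this way for any data frame with missing values. A minimal sketch using base R's airquality dataset (used here only as an illustration, not part of the original post):

```r
# summary() on a column with missing values prints the usual
# Min/1st Qu./Median/Mean/3rd Qu./Max plus an "NA's" count
data(airquality)
summary(airquality$Ozone)

# The per-column NA counts can also be computed directly:
colSums(is.na(airquality))
```

Counting NAs directly with colSums(is.na(...)) is handy when you want the numbers programmatically rather than in printed output.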

Basic list of R commands

Today, I will give a brief introduction to some basic commands used in R programming –

Memory Allocation – 

ls() – lists the objects that are currently in R’s memory

rm() – removes objects from R's workspace; the memory they occupied is not released immediately but is reclaimed later by garbage collection


memory.size() – reports the amount of memory currently allocated to R (Windows only)

gc() – triggers garbage collection, freeing the memory used by objects that are no longer referenced
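A typical workflow with these commands can be sketched as follows (the object name x is just illustrative):

```r
x <- rnorm(1e6)     # create a large object
ls()                # "x" now appears in the workspace listing
rm(x)               # remove the object from the workspace
exists("x")         # FALSE: the name is gone
invisible(gc())     # run garbage collection to release the memory
```

In everyday use calling gc() explicitly is rarely necessary, since R runs garbage collection automatically when it needs memory.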

System Date and Time –

Sys.time() – gives the current date and time from the system clock. Keep in mind that R commands are case sensitive.

Sys.Date() – gives the current date from the system clock (note the capital D)


Working Directory

getwd() – returns the path of the current working directory

setwd() – sets the working directory to the given path, e.g. setwd("C:/Users/Sunaksham/Documents")

dir() – lists the files in the current working directory
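These three commands can be sketched together; the temporary directory is used here only so the example is self-contained and touches nothing on disk:

```r
old <- getwd()      # remember the current working directory
setwd(tempdir())    # switch to the session's temporary directory
dir()               # list the files in the (new) working directory
setwd(old)          # switch back to where we started
```

Saving the old directory and restoring it afterwards is a good habit whenever a script changes the working directory.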

Installing and Loading Packages –

install.packages("Rcmdr") – installs the package Rcmdr; notice the quotation marks

library(Rcmdr) – loads the package Rcmdr, i.e. makes it available for use (no quotation marks needed)
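A common pattern combines the two so that a script installs a package only when it is missing; a sketch using Rcmdr from the post (any CRAN package name works the same way):

```r
# Install Rcmdr only if it is not already available, then load it
if (!requireNamespace("Rcmdr", quietly = TRUE)) {
  install.packages("Rcmdr")   # downloads from CRAN; needs internet access
}
library(Rcmdr)
```

This keeps scripts shareable: a colleague running the script for the first time gets the package installed automatically instead of hitting an error.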
