Who Could Be Myopic? Use Logistic Regression to Find Out.

Here I use logistic regression to classify myopic and non-myopic students.

Data source: myopia dataset

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score

Loading Dataset

In [2]:
myopia = pd.read_excel("myopia.xls")
In [3]:
myopia['MYOPIC'].value_counts()
Out[3]:
0    537
1     81
dtype: int64
This is an imbalanced dataset: only about 13% of the observations (81 of 618) represent myopic students.
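To see the exact proportions (a quick aside, not in the original notebook; normalize is a standard value_counts argument, though very old pandas versions may lack it):

myopia['MYOPIC'].value_counts(normalize=True)
# 0    0.869   (537/618)
# 1    0.131   (81/618)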
In [4]:
#creating dataframe with intercept column using dmatrices from patsy
y,x = dmatrices('MYOPIC ~ AGE + GENDER + READHR + COMPHR + STUDYHR + TVHR + SPORTHR + DIOPTERHR + MOMMY + DADMY', myopia, return_type="dataframe")
In [5]:
# creating a logistic regression model, and fitting with x and y
logit = LogisticRegression()
logit = logit.fit(x, y)

# checking accuracy
logit.score(x, y)
Out[5]:
0.8689320388349514

Note that this is exactly 537/618, the share of non-myopic students – a first hint that the model may simply be predicting the majority class for everyone.

Model Training

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

#creating logistic regression, fitting training data set : x_train, y_train
logit2 = LogisticRegression()
logit2.fit(x_train, y_train)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [7]:
# predicting using test data set
predicted = logit2.predict(x_test)

# getting probabilities
probs = logit2.predict_proba(x_test)
predicted
Out[7]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])

Checking Accuracy

In [8]:
# generate evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.confusion_matrix(y_test,predicted)
print metrics.precision_score(y_test, predicted) # what proportion of students identified as myopic were actually myopic
print metrics.recall_score(y_test, predicted) # what proportion of students that were myopic were identified as myopic
0.854838709677
[[159   1]
 [ 26   0]]
0.0
0.0

The accuracy score is about 85%. According to the confusion matrix, the model predicts 159 of 160 non-myopic cases correctly, which is what drives the high accuracy. However, it fails to identify a single myopic student. This is clearly a case of class imbalance, as the proportion of non-myopic cases is much higher than that of myopic cases. Class imbalance leads to the 'accuracy paradox': a model with a given level of accuracy may have better predictive power than a model with higher accuracy. The ratio of correct predictions to total cases may look like the natural metric for a predictive model, but under class imbalance it can be useless.

Class imbalance is also the reason for the zero precision and recall scores. Hence, while measuring the performance of a model, we must look at several different performance measures.
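As a rough sketch of one common mitigation (an addition of ours, not part of the original analysis), the imbalance can be counteracted with class weights, and the model judged by recall and ROC AUC instead of raw accuracy. class_weight='balanced' is the spelling in recent scikit-learn releases; older versions used 'auto'.

# Sketch: penalise mistakes on the rare myopic class more heavily,
# then evaluate with imbalance-aware metrics.
logit3 = LogisticRegression(class_weight='balanced')
logit3.fit(x_train, np.ravel(y_train))
predicted3 = logit3.predict(x_test)
print metrics.recall_score(y_test, predicted3)  # typically well above 0 now
print metrics.roc_auc_score(y_test, logit3.predict_proba(x_test)[:, 1])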

Cross Validation

In [9]:
#converting y into one dimensional array
y = np.ravel(y)
In [10]:
# evaluating using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
[ 0.85714286  0.87096774  0.87096774  0.87096774  0.87096774  0.87096774
  0.87096774  0.86885246  0.86885246  0.86885246]
0.86895066858

The cross-validation accuracy is about 86.9%, which is again essentially the majority-class share: aggregating accuracy over ten different train/test splits does not change the fact that the score is dominated by the non-myopic class.
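As a further hedge against the accuracy paradox, the same cross-validation can be scored with ROC AUC, which is not inflated by the majority class (a sketch on our part, not from the original post):

scores_auc = cross_val_score(LogisticRegression(), x, y, scoring='roc_auc', cv=10)
print scores_auc.mean()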

In further blog posts we will use other modelling techniques to tackle this classification problem.

 

 

Text Mining of The Complete Works of Jane Austen

Text mining refers to the extraction of meaningful information from qualitative, unstructured text data. In this document we will perform text mining on The Complete Works of Jane Austen, which we can take from Project Gutenberg.

For the task, we will use these R packages – tm, SnowballC, slam, wordcloud, reshape2 and ggplot2. So, first let us install and load them.

library(tm)
library(SnowballC) 
library(slam) 
library(wordcloud)
library(reshape2) 
library(ggplot2)

Now, we will extract the text from the .txt file and read it into R.

jane_text = "C:/Users/Sunaksham/Documents/pg31100.txt"
if (!file.exists(jane_text)) { download.file("http://www.gutenberg.org/cache/epub/31100/pg31100.txt", destfile = jane_text) }
austenjane = readLines(jane_text)
length(austenjane)
## [1] 80476

This data has 80476 lines. We shall check a few lines at the top and bottom of the data.

head(austenjane)
## [1] ""                                                                  
## [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen"
## [3] ""                                                                     
## [4] "This eBook is for the use of anyone anywhere at no cost and with"     
## [5] "almost no restrictions whatsoever.  You may copy it, give it away or" 
## [6] "re-use it under the terms of the Project Gutenberg License included"
tail(austenjane)
## [1] ""                                                                  
## [2] "This Web site includes information about Project Gutenberg-tm,"    
## [3] "including how to make donations to the Project Gutenberg Literary" 
## [4] "Archive Foundation, how to help produce our new eBooks, and how to"
## [5] "subscribe to our email newsletter to hear about new eBooks."       
## [6] ""

We can see a lot of header and footer text, which we shall get rid of. After checking how many lines the metadata occupies, we remove those lines.

austenjane[(1:50)]
##  [1] ""                                                                   
##  [2] "Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen" 
##  [3] ""                                                                      
##  [4] "This eBook is for the use of anyone anywhere at no cost and with"      
##  [5] "almost no restrictions whatsoever.  You may copy it, give it away or"  
##  [6] "re-use it under the terms of the Project Gutenberg License included"   
##  [7] "with this eBook or online at www.gutenberg.org"                        
##  [8] ""                                                                      
##  [9] ""                                                                      
## [10] "Title: The Complete Project Gutenberg Works of Jane Austen"            
## [11] ""                                                                      
## [12] "Author: Jane Austen"                                                   
## [13] ""                                                                      
## [14] "Editor: David Widger"                                                  
## [15] ""                                                                      
## [16] "Release Date: January 25, 2010 [EBook #31100]"                         
## [17] ""                                                                      
## [18] "Language: English"                                                     
## [19] ""                                                                      
## [20] ""                                                                      
## [21] "*** START OF THIS PROJECT GUTENBERG EBOOK THE WORKS OF JANE AUSTEN ***"
## [22] ""                                                                      
## [23] ""                                                                      
## [24] ""                                                                      
## [25] ""                                                                      
## [26] "Produced by many Project Gutenberg volunteers."                        
## [27] ""                                                                      
## [28] ""                                                                      
## [29] ""                                                                      
## [30] ""                                                                      
## [31] ""                                                                      
## [32] ""                                                                      
## [33] ""                                                                      
## [34] "THE WORKS OF JANE AUSTEN"                                              
## [35] ""                                                                      
## [36] ""                                                                      
## [37] ""                                                                      
## [38] "Edited by David Widger"                                                
## [39] ""                                                                      
## [40] "Project Gutenberg Editions"                                            
## [41] ""                                                                      
## [42] ""                                                                      
## [43] ""                                                                      
## [44] "             DEDICATION"                                               
## [45] ""                                                                      
## [46] "     This Jane Austen collection"                                      
## [47] "         is dedicated to"                                              
## [48] "     Alice Goodson [Hart] Woodby"                                      
## [49] ""                                                                      
## [50] ""
austenjane[(80310:80383)]
##  [1] "1.F.1.  Project Gutenberg volunteers and employees expend considerable"  
##  [2] "effort to identify, do copyright research on, transcribe and proofread"  
##  [3] "public domain works in creating the Project Gutenberg-tm"                
##  [4] "collection.  Despite these efforts, Project Gutenberg-tm electronic"     
##  [5] "works, and the medium on which they may be stored, may contain"          
##  [6] "\"Defects,\" such as, but not limited to, incomplete, inaccurate or"     
##  [7] "corrupt data, transcription errors, a copyright or other intellectual"   
##  [8] "property infringement, a defective or damaged disk or other medium, a"   
##  [9] "computer virus, or computer codes that damage or cannot be read by"      
## [10] "your equipment."                                                         
## [11] ""                                                                        
## [12] "1.F.2.  LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the \"Right"
## [13] "of Replacement or Refund\" described in paragraph 1.F.3, the Project"    
## [14] "Gutenberg Literary Archive Foundation, the owner of the Project"         
## [15] "Gutenberg-tm trademark, and any other party distributing a Project"      
## [16] "Gutenberg-tm electronic work under this agreement, disclaim all"         
## [17] "liability to you for damages, costs and expenses, including legal"       
## [18] "fees.  YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT"       
## [19] "LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE"        
## [20] "PROVIDED IN PARAGRAPH F3.  YOU AGREE THAT THE FOUNDATION, THE"           
## [21] "TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE"   
## [22] "LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR"  
## [23] "INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH"   
## [24] "DAMAGE."                                                                 
## [25] ""                                                                        
## [26] "1.F.3.  LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a"      
## [27] "defect in this electronic work within 90 days of receiving it, you can"  
## [28] "receive a refund of the money (if any) you paid for it by sending a"     
## [29] "written explanation to the person you received the work from.  If you"   
## [30] "received the work on a physical medium, you must return the medium with" 
## [31] "your written explanation.  The person or entity that provided you with"  
## [32] "the defective work may elect to provide a replacement copy in lieu of a" 
## [33] "refund.  If you received the work electronically, the person or entity"  
## [34] "providing it to you may choose to give you a second opportunity to"      
## [35] "receive the work electronically in lieu of a refund.  If the second copy"
## [36] "is also defective, you may demand a refund in writing without further"   
## [37] "opportunities to fix the problem."                                       
## [38] ""                                                                        
## [39] "1.F.4.  Except for the limited right of replacement or refund set forth" 
## [40] "in paragraph 1.F.3, this work is provided to you 'AS-IS' WITH NO OTHER"  
## [41] "WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO"
## [42] "WARRANTIES OF MERCHANTIBILITY OR FITNESS FOR ANY PURPOSE."               
## [43] ""                                                                        
## [44] "1.F.5.  Some states do not allow disclaimers of certain implied"         
## [45] "warranties or the exclusion or limitation of certain types of damages."  
## [46] "If any disclaimer or limitation set forth in this agreement violates the"
## [47] "law of the state applicable to this agreement, the agreement shall be"   
## [48] "interpreted to make the maximum disclaimer or limitation permitted by"   
## [49] "the applicable state law.  The invalidity or unenforceability of any"    
## [50] "provision of this agreement shall not void the remaining provisions."    
## [51] ""                                                                        
## [52] "1.F.6.  INDEMNITY - You agree to indemnify and hold the Foundation, the" 
## [53] "trademark owner, any agent or employee of the Foundation, anyone"        
## [54] "providing copies of Project Gutenberg-tm electronic works in accordance" 
## [55] "with this agreement, and any volunteers associated with the production," 
## [56] "promotion and distribution of Project Gutenberg-tm electronic works,"    
## [57] "harmless from all liability, costs and expenses, including legal fees,"  
## [58] "that arise directly or indirectly from any of the following which you do"
## [59] "or cause to occur: (a) distribution of this or any Project Gutenberg-tm" 
## [60] "work, (b) alteration, modification, or additions or deletions to any"    
## [61] "Project Gutenberg-tm work, and (c) any Defect you cause."                
## [62] ""                                                                        
## [63] ""                                                                        
## [64] "Section  2.  Information about the Mission of Project Gutenberg-tm"      
## [65] ""                                                                        
## [66] "Project Gutenberg-tm is synonymous with the free distribution of"        
## [67] "electronic works in formats readable by the widest variety of computers" 
## [68] "including obsolete, old, middle-aged and new computers.  It exists"      
## [69] "because of the efforts of hundreds of volunteers and donations from"     
## [70] "people in all walks of life."                                            
## [71] ""                                                                        
## [72] "Volunteers and financial support to provide volunteers with the"         
## [73] "assistance they need, are critical to reaching Project Gutenberg-tm's"   
## [74] "goals and ensuring that the Project Gutenberg-tm collection will"
austenjane = austenjane[-(1:93)]
austenjane = austenjane[-(80012:80383)]

Now, we concatenate the lines into a single string with the help of the paste function, leaving a single space between them.

austenjane = paste(austenjane, collapse = " ")
nchar(austenjane)
## [1] 4353592

The data contains 4353592 characters. Now, we will convert this text data into a corpus. This process uses the “tm” package.

jane_vec <- VectorSource(austenjane)
jane_corpus <- Corpus(jane_vec)
summary(jane_corpus)
##   Length Class             Mode
## 1 2      PlainTextDocument list

Before proceeding further, we convert all the text data into lowercase. Then, we remove punctuation marks, numbers and common English stopwords.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_corpus <- tm_map(jane_corpus, content_transformer(tolower))
jane_corpus <- tm_map(jane_corpus, removePunctuation)
jane_corpus <- tm_map(jane_corpus, removeNumbers)
jane_corpus <- tm_map(jane_corpus, removeWords, stopwords("english"))

Also, we stem the text data to remove affixes so that words are reduced to their root form. This process uses the “SnowballC” package. Stemming and stopword removal leave a lot of extra whitespace between words, which we shall strip.

jane_corpus <- tm_map(jane_corpus, stemDocument)
jane_corpus <- tm_map(jane_corpus, stripWhitespace)

We create a wordcloud of the corpus. This process uses the “wordcloud” package.

wordcloud(jane_corpus,scale=c(3,0.1),max.words=60,random.order=FALSE,rot.per=0.40, use.r.layout=FALSE)

[Figure: word cloud of the corpus]

We create a term-document matrix, or TDM. A TDM is a matrix that records the frequency of each term occurring in a collection of documents. But before that, we should make sure all of the data is in PlainTextDocument form, or else it will give the error “Error: inherits(doc, "TextDocument") is not TRUE”.

jane_corpus <- tm_map(jane_corpus, PlainTextDocument)
jane_tdm <- TermDocumentMatrix(jane_corpus)
jane_tdm
## <<TermDocumentMatrix (terms: 13313, documents: 1)>>
## Non-/sparse entries: 13313/0
## Sparsity           : 0%
## Maximal term length: 32
## Weighting          : term frequency (tf)

Sparsity – high sparsity indicates that many terms occur in only one or a few documents. In our case the text data contains a single document, so every term occurs in that document, resulting in 0% sparsity.

Next, we try to find frequently occurring terms. Then, we try to find associations of some of those terms with other terms, using 0.01 as the lower correlation limit.

findFreqTerms(jane_tdm, 1000)
##  [1] "can"     "come"    "day"     "even"    "everi"   "feel"    "first"  
##  [8] "friend"  "good"    "great"   "know"    "ladi"    "like"    "littl"  
## [15] "look"    "make"    "may"     "might"   "miss"    "mrs"     "much"   
## [22] "must"    "never"   "noth"    "now"     "one"     "quit"    "said"   
## [29] "say"     "see"     "sister"  "soon"    "thing"   "think"   "though" 
## [36] "thought" "time"    "well"    "will"    "wish"    "without"
findFreqTerms(jane_tdm, 1500)
##  [1] "everi" "know"  "miss"  "mrs"   "much"  "must"  "now"   "one"  
##  [9] "said"  "think" "time"  "will"
findFreqTerms(jane_tdm, 2000)
## [1] "mrs"  "much" "must" "one"  "said" "will"
findAssocs(jane_tdm, "first", 0.01)
## $first
## numeric(0)
findAssocs(jane_tdm, "miss", 0.01)
## $miss
## numeric(0)
findAssocs(jane_tdm, "lady", 0.01)
## $lady
## numeric(0)
findAssocs(jane_tdm, "little", 0.01)
## $little
## numeric(0)
findAssocs(jane_tdm, "never", 0.01)
## $never
## numeric(0)
findAssocs(jane_tdm, "time", 0.01)
## $time
## numeric(0)
findAssocs(jane_tdm, "mrs", 0.01)
## $mrs
## numeric(0)

As mrs, miss and lady are among the most commonly used words in Jane Austen’s works, we can infer that her stories centred on female characters. We could not find any associations of the most common terms with other terms: findAssocs correlates term frequencies across documents, and with only one document in the corpus there is no variation to correlate, so it returns numeric(0).

Now we will convert the data into a matrix; we will see that this conversion also helps us save space. This process uses the “slam” package.

dim(jane_tdm)
## [1] 13313     1
jane_tdm_matrix <- as.matrix(jane_tdm)
head(jane_tdm_matrix)
##            Docs
## Terms       character(0)
##   abandon              5
##   abash                2
##   abat                 7
##   abbey               70
##   abbeyfor             1
##   abbeyland            1
object.size(jane_tdm)
## 739416 bytes
object.size(jane_tdm_matrix)
## 632208 bytes

We convert this matrix into a tidy-looking form using the melt function. This process uses the “reshape2” package.

jane_tdm_matrix= melt(jane_tdm_matrix, value.name = "count")
head(jane_tdm_matrix)
##       Terms         Docs count
## 1   abandon character(0)     5
## 2     abash character(0)     2
## 3      abat character(0)     7
## 4     abbey character(0)    70
## 5  abbeyfor character(0)     1
## 6 abbeyland character(0)     1

Finally, we plot the frequencies of terms using the ggplot function. This process uses the “ggplot2” package.

jane_plot <- ggplot(jane_tdm_matrix, aes(x= Docs, y= Terms, group = 1))
jane_plot <- jane_plot + geom_line()
jane_plot

[Figure: term frequency plot]

Everything in Command Line/Terminal – Navigation

Here, we learn about navigating through directories using the command line.


 

  • pwd = print working directory
pwd
/home/all/wild
  • ls = list: lists all folders/files in the working directory
ls
animals  birds  flowers.txt
  • cd <directory> = change directory:

Moving to an internal directory
To move to a directory within the current directory, provide the name of the folder; for example, 'cd animals' changes the directory to /home/all/wild/animals.

cd animals
pwd
/home/all/wild/animals

Moving to an external directory (outside the current working directory): provide the full path

cd /home/all/domestic/birds
pwd
/home/all/domestic/birds

Moving one directory up (one step back)

cd ..
pwd
/home/all/domestic

  • mkdir <directory_name> = make a new directory:

Make a new directory within the working directory

mkdir cattle
pwd
/home/all/domestic
ls
cattle fish friendly.txt

Make a new directory outside the working directory
Here, we make the directory 'cuckoo' under birds in wild.

mkdir /home/all/wild/birds/cuckoo
cd /home/all/wild/birds
ls
cuckoo parrot.txt

  • touch <filename> = create a new file:

Create a new file in current working directory

pwd
/home/all/wild/birds
touch owl.txt
ls
cuckoo owl.txt parrot.txt

Create a new file outside the working directory

touch /home/all/wild/animals/lion.txt
cd /home/all/wild/animals
ls
lion.txt

 

GroupBy summary statistics in R

First we install and load a few packages – “data.table”, “Hmisc” and “doBy”. Here, we require data.table for its fast data input and manipulation, Hmisc for its data analysis functions, and doBy for its grouped summaries. After loading the packages, we input the diamonds data using fread() and find the number of observations in the dataset using nrow().

library(data.table)
library(Hmisc)

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
##     format.pval, round.POSIXt, trunc.POSIXt, units

library(doBy)
diamonds = fread("diamonds.csv")
nrow(diamonds)

## [1] 53940

str() gives the basic structure of objects.

str(diamonds)

## Classes 'data.table' and 'data.frame':   53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr  "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr  "E" "E" "E" "I" ...
##  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  - attr(*, ".internal.selfref")=<externalptr>

head() and tail() show the top six and bottom six observations, respectively.

head(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

tail(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 2:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 3:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 4:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 5:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 6:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

summary() provides the minimum, maximum, mean, median, and 1st and 3rd quartiles for numeric variables, and the length (count) and type for character variables.

summary(diamonds)

##      carat            cut               color             clarity
##  Min.   :0.2000   Length:53940       Length:53940       Length:53940
##  1st Qu.:0.4000   Class :character   Class :character   Class :character
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character
##  Mean   :0.7979
##  3rd Qu.:1.0400
##  Max.   :5.0100
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800

describe() from the Hmisc package provides the number of observations, number of missing values, number of unique values, mean, percentiles in multiples of 5, and the lowest and highest values (5 each) for each numeric variable.

For each character variable, describe() displays the frequency count and frequency ratio for each of its categories.

describe(diamonds)

## diamonds
##
##  10  Variables      53940  Observations
## ---------------------------------------------------------------------------
## carat
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     273       1  0.7979    0.30    0.31    0.40    0.70
##     .75     .90     .95
##    1.04    1.51    1.70
##
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## ---------------------------------------------------------------------------
## cut
##       n missing  unique
##   53940       0       5
##
##           Fair Good Ideal Premium Very Good
## Frequency 1610 4906 21551   13791     12082
## %            3    9    40      26        22
## ---------------------------------------------------------------------------
## color
##       n missing  unique
##   53940       0       7
##
##              D    E    F     G    H    I    J
## Frequency 6775 9797 9542 11292 8304 5422 2808
## %           13   18   18    21   15   10    5
## ---------------------------------------------------------------------------
## clarity
##       n missing  unique
##   53940       0       8
##
##            I1   IF   SI1  SI2  VS1   VS2 VVS1 VVS2
## Frequency 741 1790 13065 9194 8171 12258 3655 5066
## %           1    3    24   17   15    23    7    9
## ---------------------------------------------------------------------------
## depth
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     184       1   61.75    59.3    60.0    61.0    61.8
##     .75     .90     .95
##    62.5    63.3    63.8
##
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## ---------------------------------------------------------------------------
## table
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     127    0.98   57.46      54      55      56      57
##     .75     .90     .95
##      59      60      61
##
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## ---------------------------------------------------------------------------
## price
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0   11602       1    3933     544     646     950    2401
##     .75     .90     .95
##    5324    9821   13107
##
## lowest :   326   327   334   335   336
## highest: 18803 18804 18806 18818 18823
## ---------------------------------------------------------------------------
## x
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     554       1   5.731    4.29    4.36    4.71    5.70
##     .75     .90     .95
##    6.54    7.31    7.66
##
## lowest :  0.00  3.73  3.74  3.76  3.77
## highest: 10.01 10.02 10.14 10.23 10.74
## ---------------------------------------------------------------------------
## y
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     552       1   5.735    4.30    4.36    4.72    5.71
##     .75     .90     .95
##    6.54    7.30    7.65
##
## lowest :  0.00  3.68  3.71  3.72  3.73
## highest: 10.10 10.16 10.54 31.80 58.90
## ---------------------------------------------------------------------------
## z
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     375       1   3.539    2.65    2.69    2.91    3.53
##     .75     .90     .95
##    4.04    4.52    4.73
##
## lowest :  0.00  1.07  1.41  1.53  2.06
## highest:  6.43  6.72  6.98  8.06 31.80
## ---------------------------------------------------------------------------

summaryBy() comes from the doBy package. It provides summary statistics for numeric variables by group/category. Here, we find the mean of price for each category of cut.

summaryBy(price ~ cut, data = diamonds, FUN = mean)

##          cut price.mean
## 1:      Fair   4358.758
## 2:      Good   3928.864
## 3:     Ideal   3457.542
## 4:   Premium   4584.258
## 5: Very Good   3981.760

Next, we find the mean of price and carat for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = mean)

##           cut color price.mean carat.mean
##  1:      Fair     D   4291.061  0.9201227
##  2:      Fair     E   3682.312  0.8566071
##  3:      Fair     F   3827.003  0.9047115
##  4:      Fair     G   4239.255  1.0238217
##  5:      Fair     H   5135.683  1.2191749
##  6:      Fair     I   4685.446  1.1980571
##  7:      Fair     J   4975.655  1.3411765
##  8:      Good     D   3405.382  0.7445166
##  9:      Good     E   3423.644  0.7451340
## 10:      Good     F   3495.750  0.7759296
## 11:      Good     G   4123.482  0.8508955
## 12:      Good     H   4276.255  0.9147293
## 13:      Good     I   5078.533  1.0572222
## 14:      Good     J   4574.173  1.0995440
## 15:     Ideal     D   2629.095  0.5657657
## 16:     Ideal     E   2597.550  0.5784012
## 17:     Ideal     F   3374.939  0.6558285
## 18:     Ideal     G   3720.706  0.7007146
## 19:     Ideal     H   3889.335  0.7995249
## 20:     Ideal     I   4451.970  0.9130291
## 21:     Ideal     J   4918.186  1.0635937
## 22:   Premium     D   3631.293  0.7215471
## 23:   Premium     E   3538.914  0.7177450
## 24:   Premium     F   4324.890  0.8270356
## 25:   Premium     G   4500.742  0.8414877
## 26:   Premium     H   5216.707  1.0164492
## 27:   Premium     I   5946.181  1.1449370
## 28:   Premium     J   6294.592  1.2930941
## 29: Very Good     D   3470.467  0.6964243
## 30: Very Good     E   3214.652  0.6763167
## 31: Very Good     F   3778.820  0.7409612
## 32: Very Good     G   3872.754  0.7667986
## 33: Very Good     H   4535.390  0.9159485
## 34: Very Good     I   5255.880  1.0469518
## 35: Very Good     J   5103.513  1.1332153
##           cut color price.mean carat.mean

Finally, we find the frequency/count, mean, median and standard deviation of price and carat for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = function(x)c(count = length(x), mean = mean(x), median = median(x), sd = sd(x)))

##           cut color price.count price.mean price.median price.sd
##  1:      Fair     D         163   4291.061       3730.0 3286.114
##  2:      Fair     E         224   3682.312       2956.0 2976.652
##  3:      Fair     F         312   3827.003       3035.0 3223.303
##  4:      Fair     G         314   4239.255       3057.0 3609.644
##  5:      Fair     H         303   5135.683       3816.0 3886.482
##  6:      Fair     I         175   4685.446       3246.0 3730.271
##  7:      Fair     J         119   4975.655       3302.0 4050.459
##  8:      Good     D         662   3405.382       2728.5 3175.149
##  9:      Good     E         933   3423.644       2420.0 3330.702
## 10:      Good     F         909   3495.750       2647.0 3202.411
## 11:      Good     G         871   4123.482       3340.0 3702.505
## 12:      Good     H         702   4276.255       3468.5 4020.660
## 13:      Good     I         522   5078.533       3639.5 4631.702
## 14:      Good     J         307   4574.173       3733.0 3707.791
## 15:     Ideal     D        2834   2629.095       1576.0 3001.070
## 16:     Ideal     E        3903   2597.550       1437.0 2956.007
## 17:     Ideal     F        3826   3374.939       1775.0 3766.635
## 18:     Ideal     G        4884   3720.706       1857.5 4006.262
## 19:     Ideal     H        3115   3889.335       2278.0 4013.375
## 20:     Ideal     I        2093   4451.970       2659.0 4505.150
## 21:     Ideal     J         896   4918.186       4096.0 4476.207
## 22:   Premium     D        1603   3631.293       2009.0 3711.634
## 23:   Premium     E        2337   3538.914       1928.0 3794.987
## 24:   Premium     F        2331   4324.890       2841.0 4012.023
## 25:   Premium     G        2924   4500.742       2745.0 4356.571
## 26:   Premium     H        2360   5216.707       4511.0 4466.190
## 27:   Premium     I        1428   5946.181       4640.0 5053.746
## 28:   Premium     J         808   6294.592       5063.0 4788.937
## 29: Very Good     D        1513   3470.467       2310.0 3523.753
## 30: Very Good     E        2400   3214.652       1989.5 3408.024
## 31: Very Good     F        2164   3778.820       2471.0 3786.124
## 32: Very Good     G        2299   3872.754       2437.0 3861.375
## 33: Very Good     H        1824   4535.390       3734.0 4185.798
## 34: Very Good     I        1204   5255.880       3888.0 4687.105
## 35: Very Good     J         678   5103.513       4113.0 4135.653
##           cut color price.count price.mean price.median price.sd
##     carat.count carat.mean carat.median  carat.sd
##  1:         163  0.9201227        0.900 0.4054185
##  2:         224  0.8566071        0.900 0.3645848
##  3:         312  0.9047115        0.900 0.4188899
##  4:         314  1.0238217        0.980 0.4927241
##  5:         303  1.2191749        1.010 0.5482389
##  6:         175  1.1980571        1.010 0.5219776
##  7:         119  1.3411765        1.030 0.7339713
##  8:         662  0.7445166        0.700 0.3631169
##  9:         933  0.7451340        0.700 0.3808900
## 10:         909  0.7759296        0.710 0.3700142
## 11:         871  0.8508955        0.900 0.4327176
## 12:         702  0.9147293        0.900 0.4977162
## 13:         522  1.0572222        1.000 0.5756366
## 14:         307  1.0995440        1.020 0.5371248
## 15:        2834  0.5657657        0.500 0.2993503
## 16:        3903  0.5784012        0.500 0.3125406
## 17:        3826  0.6558285        0.530 0.3745245
## 18:        4884  0.7007146        0.540 0.4106182
## 19:        3115  0.7995249        0.700 0.4868741
## 20:        2093  0.9130291        0.740 0.5537277
## 21:         896  1.0635937        1.030 0.5821001
## 22:        1603  0.7215471        0.580 0.3974635
## 23:        2337  0.7177450        0.580 0.4097847
## 24:        2331  0.8270356        0.760 0.4201959
## 25:        2924  0.8414877        0.755 0.4795344
## 26:        2360  1.0164492        1.010 0.5440777
## 27:        1428  1.1449370        1.140 0.6136041
## 28:         808  1.2930941        1.250 0.6137086
## 29:        1513  0.6964243        0.610 0.3692291
## 30:        2400  0.6763167        0.570 0.3779140
## 31:        2164  0.7409612        0.700 0.3888827
## 32:        2299  0.7667986        0.700 0.4180156
## 33:        1824  0.9159485        0.900 0.5029465
## 34:        1204  1.0469518        1.005 0.5519840
## 35:         678  1.1332153        1.060 0.5559197
##     carat.count carat.mean carat.median  carat.sd

Review Tuples!

my_list = [2,3]
my_tuple = (4,5)
other_tuple = 6,7
my_list[1] = 10
my_list
[2, 10]
my_tuple[1] = 11 # note that a tuple cannot be modified
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-ce8c7e54784a> in <module>()
----> 1 my_tuple[1] = 11

TypeError: 'tuple' object does not support item assignment
def sum_product(x, y):
    return (x + y),(x * y)
sp = sum_product(11,12)
sp
(23, 132)
s, p = sum_product(11,12)
s
23
p
132
x, y = 1, 2 # multiple assignment is possible with both lists and tuples
x, y = 3, 4
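Tuple unpacking also gives Python its idiomatic variable swap – a small extra illustration beyond the book's lines above:

x, y = 1, 2
x, y = y, x  # swap the two values without a temporary variable
x
2
y
1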

Studying Data Science from Scratch by Joel Grus.

Practice 1.2 – Python Pandas Cookbook by Alfred Essa

import pandas as pd
import datetime as dt
#creating list containing dates from 9-01 to 9-10
start  = dt.datetime(2013,9,1)
end = dt.datetime(2013,9,11)
step = dt.timedelta(days = 1)
dates = []
#populate the list
while start < end:
    dates.append(start.strftime('%m-%d'))
    start += step
dates
['09-01',
 '09-02',
 '09-03',
 '09-04',
 '09-05',
 '09-06',
 '09-07',
 '09-08',
 '09-09',
 '09-10']
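As an aside (not in the cookbook), pandas can build the same list of labels itself with date_range:

# equivalent sketch using pandas' own date generator
dates = [d.strftime('%m-%d') for d in pd.date_range('2013-09-01', periods=10)]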
d = {'Date' : dates, 'Tokyo':[3,4,5,4,6,3,32,2,3,13], 'Paris':[45,2,4,5,46,4,7,85,12,9], 'Mumbai':[23,32,12,45,3,6,7,8,1,9]} 
d
{'Date': ['09-01',
  '09-02',
  '09-03',
  '09-04',
  '09-05',
  '09-06',
  '09-07',
  '09-08',
  '09-09',
  '09-10'],
 'Mumbai': [23, 32, 12, 45, 3, 6, 7, 8, 1, 9],
 'Paris': [45, 2, 4, 5, 46, 4, 7, 85, 12, 9],
 'Tokyo': [3, 4, 5, 4, 6, 3, 32, 2, 3, 13]}
Creating a dataframe from a dictionary of equal-length lists
temp = pd.DataFrame(d)
temp
    Date  Mumbai  Paris  Tokyo
0  09-01      23     45      3
1  09-02      32      2      4
2  09-03      12      4      5
3  09-04      45      5      4
4  09-05       3     46      6
5  09-06       6      4      3
6  09-07       7      7     32
7  09-08       8     85      2
8  09-09       1     12      3
9  09-10       9      9     13
temp['Tokyo']
0     3
1     4
2     5
3     4
4     6
5     3
6    32
7     2
8     3
9    13
Name: Tokyo, dtype: int64
temp = temp.set_index('Date')
temp
       Mumbai  Paris  Tokyo
Date
09-01      23     45      3
09-02      32      2      4
09-03      12      4      5
09-04      45      5      4
09-05       3     46      6
09-06       6      4      3
09-07       7      7     32
09-08       8     85      2
09-09       1     12      3
09-10       9      9     13
import os as os
os.getcwd()
'C:\\Anaconda'
tb = pd.read_csv('C:/Anaconda/TB_outcomes.csv')
tb.head()
country iso2 iso3 iso_numeric g_whoregion year rep_meth new_sp_coh new_sp_cur new_sp_cmplt mdr_coh mdr_succ mdr_fail mdr_died mdr_lost xdr_coh xdr_succ xdr_fail xdr_died xdr_lost
0 Afghanistan AF AFG 4 EMR 1994 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan AF AFG 4 EMR 1995 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AF AFG 4 EMR 1996 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Afghanistan AF AFG 4 EMR 1997 100 2001 786 108 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Afghanistan AF AFG 4 EMR 1998 100 2913 772 199 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 72 columns

tb.tail()
country iso2 iso3 iso_numeric g_whoregion year rep_meth new_sp_coh new_sp_cur new_sp_cmplt mdr_coh mdr_succ mdr_fail mdr_died mdr_lost xdr_coh xdr_succ xdr_fail xdr_died xdr_lost
4052 Zimbabwe ZW ZWE 716 AFR 2008 100 10370 6973 734 0 NaN NaN NaN NaN 0 NaN NaN NaN NaN
4053 Zimbabwe ZW ZWE 716 AFR 2009 100 10195 7131 868 1 1 0 0 0 0 0 0 0 0
4054 Zimbabwe ZW ZWE 716 AFR 2010 100 11654 8377 1116 6 4 0 2 0 0 0 0 0 0
4055 Zimbabwe ZW ZWE 716 AFR 2011 NaN 12596 9208 995 70 57 0 9 2 0 0 0 0 0
4056 Zimbabwe ZW ZWE 716 AFR 2012 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 72 columns

To get unique values

tb['country'].unique()
array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia',
       'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)',
       'Bonaire, Saint Eustatius and Saba', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'British Virgin Islands', 'Brunei Darussalam',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia',
       'Cameroon', 'Canada', 'Cayman Islands', 'Central African Republic',
       'Chad', 'Chile', 'China', 'China, Hong Kong SAR',
       'China, Macao SAR', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
       'Costa Rica', "C\xc3\xb4te d'Ivoire", 'Croatia', 'Cuba',
       'Cura\xc3\xa7ao', 'Cyprus', 'Czech Republic',
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji',
       'Finland', 'France', 'French Polynesia', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Greece', 'Greenland', 'Grenada',
       'Guam', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti',
       'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia',
       'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
       'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic",
       'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania',
       'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives',
       'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius',
       'Mexico', 'Micronesia (Federated States of)', 'Monaco', 'Mongolia',
       'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar',
       'Namibia', 'Nauru', 'Nepal', 'Netherlands Antilles', 'Netherlands',
       'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
       'Niue', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan',
       'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
       'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar',
       'Republic of Korea', 'Republic of Moldova', 'Romania',
       'Russian Federation', 'Rwanda', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines', 'Samoa',
       'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal',
       'Serbia & Montenegro', 'Serbia', 'Seychelles', 'Sierra Leone',
       'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia',
       'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan',
       'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden',
       'Switzerland', 'Syrian Arab Republic', 'Tajikistan', 'Thailand',
       'The Former Yugoslav Republic of Macedonia', 'Timor-Leste', 'Togo',
       'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
       'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda',
       'Ukraine', 'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'US Virgin Islands', 'Uzbekistan', 'Vanuatu',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam',
       'Wallis and Futuna Islands', 'West Bank and Gaza Strip', 'Yemen',
       'Zambia', 'Zimbabwe'], dtype=object)

Counting the number of occurrences of each value

tb.country.value_counts() 
Botswana                            19
Bolivia (Plurinational State of)    19
Greenland                           19
Armenia                             19
China                               19
Togo                                19
Mongolia                            19
Saint Kitts and Nevis               19
Cuba                                19
Benin                               19
Cook Islands                        19
Malawi                              19
Norway                              19
Nauru                               19
Solomon Islands                     19
...
US Virgin Islands                    19
China, Hong Kong SAR                 19
Denmark                              19
Philippines                          19
Canada                               19
China, Macao SAR                     19
Netherlands Antilles                 15
Timor-Leste                          11
Serbia & Montenegro                  10
Montenegro                            8
Serbia                                8
Bonaire, Saint Eustatius and Saba     4
Sint Maarten (Dutch part)             4
Curaçao                               4
South Sudan                           3
Length: 219, dtype: int64
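The Length: 219 above is the number of distinct countries; nunique(), a standard pandas Series method shown here as an extra, returns it directly:

tb['country'].nunique()
219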
tb.describe()
iso_numeric year rep_meth new_sp_coh new_sp_cur new_sp_cmplt new_sp_died new_sp_fail new_sp_def c_new_sp_tsr mdr_coh mdr_succ mdr_fail mdr_died mdr_lost xdr_coh xdr_succ xdr_fail xdr_died xdr_lost
count 4057.000000 4057.000000 3037.000000 3053.000000 2944.000000 2943.00000 2993.000000 2876.000000 2955.000000 3004.000000 1050.000000 1017.000000 959.000000 1000.000000 987.000000 562.000000 525.000000 524.000000 525.000000 524.000000
mean 433.592310 2003.042149 100.271320 10867.512611 7897.903533 963.62827 430.973939 184.123088 613.043655 75.767643 139.985714 71.208456 14.385819 22.544000 22.217832 6.181495 1.390476 0.837786 2.230476 0.776718
std 254.908076 5.485677 0.647391 45621.976594 37520.862855 3325.39556 1615.996031 812.662201 2386.874910 16.305073 726.653931 342.387797 106.821966 138.383012 113.607426 48.815990 9.570645 5.019886 20.085652 5.790293
min 4.000000 1994.000000 100.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 212.000000 1998.000000 100.000000 124.000000 66.750000 13.00000 7.000000 0.000000 4.000000 69.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 430.000000 2003.000000 100.000000 1229.000000 721.500000 124.00000 60.000000 15.000000 90.000000 79.000000 6.000000 3.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 646.000000 2008.000000 100.000000 5366.000000 3401.500000 580.50000 257.000000 99.000000 393.000000 87.000000 43.000000 24.000000 1.000000 6.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 894.000000 2012.000000 102.000000 642321.000000 544731.000000 64938.00000 27005.000000 12505.000000 35469.000000 100.000000 15896.000000 5895.000000 2916.000000 3037.000000 2344.000000 751.000000 116.000000 64.000000 305.000000 94.000000

8 rows × 68 columns

 

Practice 1.1 – Python Pandas Cookbook by Alfred Essa

Following Alfred Essa’s Python Pandas Cookbook on YouTube
Different Ways to Construct Series
import pandas as pd
import numpy as np
Using the Series constructor
s1 = pd.Series([463,3,-728,236,32,-773])
s1
0    463
1      3
2   -728
3    236
4     32
5   -773
dtype: int64
type(s1)
pandas.core.series.Series
s1.values
array([ 463,    3, -728,  236,   32, -773], dtype=int64)
type(s1.values)
numpy.ndarray
s1.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
s1[3]
236

Defining data and index

data1 = [3.5,5,343,9.3,23]
index1 = ['Mon','Tue','Wed','Thur','Fri']

Creating Series

s2 = pd.Series(data1, index = index1)
s2
Mon       3.5
Tue       5.0
Wed     343.0
Thur      9.3
Fri      23.0
dtype: float64
s2[4]
23.0
s2.index
Index([u'Mon', u'Tue', u'Wed', u'Thur', u'Fri'], dtype='object')
s2.name = 'Daily numbers'
s2.index.name = 'Working days'
s2
Working days
Mon               3.5
Tue               5.0
Wed             343.0
Thur              9.3
Fri              23.0
Name: Daily numbers, dtype: float64
Creating a Series from a dictionary
dict1 = {'Jan': -7,'Feb': 2,'March': 12,'April': -9,'May': 3,'June': 4}
s3 = pd.Series(dict1)
s3
April    -9
Feb       2
Jan      -7
June      4
March    12
May       3
dtype: int64
Vectorized Operations
s3 * 2
April   -18
Feb       4
Jan     -14
June      8
March    24
May       6
dtype: int64
np.log(s3)
April         NaN
Feb      0.693147
Jan           NaN
June     1.386294
March    2.484907
May      1.098612
dtype: float64
Slicing – note that label-based slices like s3['Feb':'May'] include both endpoints, while positional slices like s3[3:5] exclude the stop position
s3['Feb':'May']
Feb       2
Jan      -7
June      4
March    12
May       3
dtype: int64
s3[3:5]
June      4
March    12
dtype: int64
Assigning a value by offset (position)
s3[3] = 54
s3
April    -9
Feb       2
Jan      -7
June     54
March    12
May       3
dtype: int64
s3.median()
2.5
s3.min()
-9
s3.max()
54
s3.cumsum()
April    -9
Feb      -7
Jan     -14
June     40
March    52
May      55
dtype: int64

Making looping clearer – enumerate() yields (index, value) pairs

for i, v in enumerate(s3):
    print i,v
0 -9
1 2
2 -7
3 54
4 12
5 3
new_s3 = [x**2 for x in s3]
new_s3
[81, 4, 49, 2916, 144, 9]
A Series behaves like a dictionary
s3['Feb']
2
'Feb' in s3
True
Assignment using key
s3['May'] = 45.8
s3
April    -9
Feb       2
Jan      -7
June     54
March    12
May      45
dtype: int64

Note that the Series keeps its int64 dtype here, so the assigned float 45.8 is silently truncated to 45.

Looping over dictionary keys and values

for k,v in s3.iteritems():
    print k,v
April -9
Feb 2
Jan -7
June 54
March 12
May 45

Pycurl and Pandas – Get csv file and explore it

Importing libraries

import pandas as pd
import os as os
import pycurl
import csv

To get the location of the current working directory

os.getcwd()

'C:\\Anaconda'

To change the working directory

os.chdir('C:\\Anaconda\\abalone')
os.getcwd()

'C:\\Anaconda\\abalone'

Use pycurl to get a data file over HTTPS and write it to a csv file

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
c = pycurl.Curl()
c.setopt(c.URL, url)                    # target URL
with open('abalone.csv', 'w+') as s:
    c.setopt(c.WRITEFUNCTION, s.write)  # stream the response body into the file
    c.perform()
c.close()                               # release the curl handle

To read the csv file into the abalone object

abalone = pd.read_csv('abalone.csv')
abalone
M 0.455 0.365 0.095 0.514 0.2245 0.101 0.15 15
0 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
1 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
2 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
3 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
4 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
5 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
6 F 0.545 0.425 0.125 0.7680 0.2940 0.1495 0.2600 16
7 M 0.475 0.370 0.125 0.5095 0.2165 0.1125 0.1650 9
8 F 0.550 0.440 0.150 0.8945 0.3145 0.1510 0.3200 19
9 F 0.525 0.380 0.140 0.6065 0.1940 0.1475 0.2100 14
10 M 0.430 0.350 0.110 0.4060 0.1675 0.0810 0.1350 10
11 M 0.490 0.380 0.135 0.5415 0.2175 0.0950 0.1900 11
12 F 0.535 0.405 0.145 0.6845 0.2725 0.1710 0.2050 10
13 F 0.470 0.355 0.100 0.4755 0.1675 0.0805 0.1850 10
14 M 0.500 0.400 0.130 0.6645 0.2580 0.1330 0.2400 12
15 I 0.355 0.280 0.085 0.2905 0.0950 0.0395 0.1150 7
16 F 0.440 0.340 0.100 0.4510 0.1880 0.0870 0.1300 10
17 M 0.365 0.295 0.080 0.2555 0.0970 0.0430 0.1000 7
18 M 0.450 0.320 0.100 0.3810 0.1705 0.0750 0.1150 9
19 M 0.355 0.280 0.095 0.2455 0.0955 0.0620 0.0750 11
20 I 0.380 0.275 0.100 0.2255 0.0800 0.0490 0.0850 10
21 F 0.565 0.440 0.155 0.9395 0.4275 0.2140 0.2700 12
22 F 0.550 0.415 0.135 0.7635 0.3180 0.2100 0.2000 9
23 F 0.615 0.480 0.165 1.1615 0.5130 0.3010 0.3050 10
24 F 0.560 0.440 0.140 0.9285 0.3825 0.1880 0.3000 11
25 F 0.580 0.450 0.185 0.9955 0.3945 0.2720 0.2850 11
26 M 0.590 0.445 0.140 0.9310 0.3560 0.2340 0.2800 12
27 M 0.605 0.475 0.180 0.9365 0.3940 0.2190 0.2950 15
28 M 0.575 0.425 0.140 0.8635 0.3930 0.2270 0.2000 11
29 M 0.580 0.470 0.165 0.9975 0.3935 0.2420 0.3300 10
4146 M 0.695 0.550 0.195 1.6645 0.7270 0.3600 0.4450 11
4147 M 0.770 0.605 0.175 2.0505 0.8005 0.5260 0.3550 11
4148 I 0.280 0.215 0.070 0.1240 0.0630 0.0215 0.0300 6
4149 I 0.330 0.230 0.080 0.1400 0.0565 0.0365 0.0460 7
4150 I 0.350 0.250 0.075 0.1695 0.0835 0.0355 0.0410 6
4151 I 0.370 0.280 0.090 0.2180 0.0995 0.0545 0.0615 7
4152 I 0.430 0.315 0.115 0.3840 0.1885 0.0715 0.1100 8
4153 I 0.435 0.330 0.095 0.3930 0.2190 0.0750 0.0885 6
4154 I 0.440 0.350 0.110 0.3805 0.1575 0.0895 0.1150 6
4155 M 0.475 0.370 0.110 0.4895 0.2185 0.1070 0.1460 8
4156 M 0.475 0.360 0.140 0.5135 0.2410 0.1045 0.1550 8
4157 I 0.480 0.355 0.110 0.4495 0.2010 0.0890 0.1400 8
4158 F 0.560 0.440 0.135 0.8025 0.3500 0.1615 0.2590 9
4159 F 0.585 0.475 0.165 1.0530 0.4580 0.2170 0.3000 11
4160 F 0.585 0.455 0.170 0.9945 0.4255 0.2630 0.2845 11
4161 M 0.385 0.255 0.100 0.3175 0.1370 0.0680 0.0920 8
4162 I 0.390 0.310 0.085 0.3440 0.1810 0.0695 0.0790 7
4163 I 0.390 0.290 0.100 0.2845 0.1255 0.0635 0.0810 7
4164 I 0.405 0.300 0.085 0.3035 0.1500 0.0505 0.0880 7
4165 I 0.475 0.365 0.115 0.4990 0.2320 0.0885 0.1560 10
4166 M 0.500 0.380 0.125 0.5770 0.2690 0.1265 0.1535 9
4167 F 0.515 0.400 0.125 0.6150 0.2865 0.1230 0.1765 8
4168 M 0.520 0.385 0.165 0.7910 0.3750 0.1800 0.1815 10
4169 M 0.550 0.430 0.130 0.8395 0.3155 0.1955 0.2405 10
4170 M 0.560 0.430 0.155 0.8675 0.4000 0.1720 0.2290 8
4171 F 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11
4172 M 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10
4173 M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9
4174 F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10
4175 M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12
To add column names
abalone.columns = ['Sex', 'Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
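One caveat worth flagging: read_csv above treated the first record of the raw file as the header row, so one observation was silently dropped (the UCI abalone file has 4177 rows, while describe() further down reports 4176). A safer sketch, assuming the originally downloaded file, passes the names up front:

cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
        'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
abalone = pd.read_csv('abalone.csv', header=None, names=cols)  # keeps all 4177 rows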
To write the data to a csv file (note that to_csv also writes the row index unless index=False is passed)
abalone.to_csv('abalone.csv')
To get the 4 top-most observations
abalone.head(4)
  Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0   M    0.35     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
1   F    0.53     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
2   M    0.44     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
3   I    0.33     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

To get 4 bottom-most observations

abalone.tail(4)
     Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
4172   M   0.590     0.440   0.135        0.9660          0.4390          0.2145        0.2605     10
4173   M   0.600     0.475   0.205        1.1760          0.5255          0.2875        0.3080      9
4174   F   0.625     0.485   0.150        1.0945          0.5310          0.2610        0.2960     10
4175   M   0.710     0.555   0.195        1.9485          0.9455          0.3765        0.4950     12
To get basic statistics for all numeric variables
abalone.describe()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
count 4176.000000 4176.000000 4176.000000 4176.000000 4176.00000 4176.000000 4176.000000 4176.000000
mean 0.524009 0.407892 0.139527 0.828818 0.35940 0.180613 0.238852 9.932471
std 0.120103 0.099250 0.041826 0.490424 0.22198 0.109620 0.139213 3.223601
min 0.075000 0.055000 0.000000 0.002000 0.00100 0.000500 0.001500 1.000000
25% 0.450000 0.350000 0.115000 0.441500 0.18600 0.093375 0.130000 8.000000
50% 0.545000 0.425000 0.140000 0.799750 0.33600 0.171000 0.234000 9.000000
75% 0.615000 0.480000 0.165000 1.153250 0.50200 0.253000 0.329000 11.000000
max 0.815000 0.650000 1.130000 2.825500 1.48800 0.760000 1.005000 29.000000
To get the covariance matrix
abalone.cov()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
Length 0.014425 0.011763 0.004157 0.054499 0.023938 0.011889 0.015009 0.215697
Diameter 0.011763 0.009850 0.003461 0.045046 0.019678 0.009789 0.012509 0.183968
Height 0.004157 0.003461 0.001749 0.016804 0.007195 0.003660 0.004759 0.075251
Whole weight 0.054499 0.045046 0.016804 0.240515 0.105533 0.051953 0.065225 0.854995
Shucked weight 0.023938 0.019678 0.007195 0.105533 0.049275 0.022678 0.027275 0.301440
Viscera weight 0.011889 0.009789 0.003660 0.051953 0.022678 0.012017 0.013851 0.178196
Shell weight 0.015009 0.012509 0.004759 0.065225 0.027275 0.013851 0.019380 0.281839
Rings 0.215697 0.183968 0.075251 0.854995 0.301440 0.178196 0.281839 10.391606
To get pairwise correlation coefficients for all numeric variables
abalone.corr()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
Length 1.000000 0.986813 0.827552 0.925255 0.897905 0.903010 0.897697 0.557123
Diameter 0.986813 1.000000 0.833705 0.925452 0.893159 0.899726 0.905328 0.575005
Height 0.827552 0.833705 1.000000 0.819209 0.774957 0.798293 0.817326 0.558109
Whole weight 0.925255 0.925452 0.819209 1.000000 0.969403 0.966372 0.955351 0.540818
Shucked weight 0.897905 0.893159 0.774957 0.969403 1.000000 0.931956 0.882606 0.421256
Viscera weight 0.903010 0.899726 0.798293 0.966372 0.931956 1.000000 0.907647 0.504274
Shell weight 0.897697 0.905328 0.817326 0.955351 0.882606 0.907647 1.000000 0.628031
Rings 0.557123 0.575005 0.558109 0.540818 0.421256 0.504274 0.628031 1.000000
To get unique values of the 'Rings' column
abalone['Rings'].unique()
array([ 7, 9, 10, 8, 20, 16, 19, 14, 11, 12, 15, 18, 13, 5, 4, 6, 21, 17, 22, 1, 3, 26, 23, 29, 2, 27, 25, 24], dtype=int64)
To subset – keep only 'Length', 'Diameter' and 'Height' in the data set abalone1
abalone1 = abalone[['Length','Diameter','Height']]
Inspect abalone1 by checking head and tail
abalone1.head(3)
   Length  Diameter  Height
0    0.35     0.265   0.090
1    0.53     0.420   0.135
2    0.44     0.365   0.125
abalone1.tail(3)
      Length  Diameter  Height
4173   0.600     0.475   0.205
4174   0.625     0.485   0.150
4175   0.710     0.555   0.195

For the code: http://nbviewer.ipython.org/gist/sunakshi132/4791b6838e7bf3fde38b

Creating a Dictionary in Python

A dictionary is similar to a list, except that it contains key:value pairs rather than positionally indexed items. Here, we will learn about creating and modifying dictionaries.
Creating Dictionary
dictionary_name = {key: value, key: value, ...}

We create a dictionary with colors of jams as keys and fruit names as values.

jam = {'red':'strawberry', 'yellow': 'mango', 'orange':'orange'}
print jam['orange']
print jam['red']

orange
strawberry

Insertion – Here, we add a new key/value pair to the existing dictionary.

jam['blue'] = 'blueberry'

{'blue': 'blueberry', 'red': 'strawberry', 'yellow': 'mango', 'orange': 'orange'}

Deletion – Here, we delete a key/value pair

del jam['orange']
jam

{'blue': 'blueberry', 'red': 'strawberry', 'yellow': 'mango'}

Replacing values – We replace the value for key 'red', changing strawberry to cherry.

jam ['red'] = 'cherry'
jam

{'blue': 'blueberry', 'red': 'cherry', 'yellow': 'mango'}

Nesting – We nest a new dictionary within an existing dictionary.

jam['red'] = {'light': 'cherry', 'dark': 'strawberry'}

{'blue': 'blueberry', 'red': {'dark': 'strawberry', 'light': 'cherry'}, 'yellow': 'mango'}
We set a list as a value for key ‘blue’.

jam['blue'] = ['blueberry','plum']
jam

{'blue': ['blueberry', 'plum'], 'red': {'dark': 'strawberry', 'light': 'cherry'}, 'yellow': 'mango'}

Append to a list

jam['blue'].append('jamun')  
print jam['blue']

['blueberry', 'plum', 'jamun']

Indexing the nested dictionary

print jam['red']['dark']

strawberry

jam

{'blue': ['blueberry', 'plum', 'jamun'], 'red': {'dark': 'strawberry', 'light': 'cherry'}, 'yellow': 'mango'}
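Finally, a small extra sketch beyond the original post: looping over the keys and values of the finished dictionary with items().

for colour, value in jam.items():
    print colour, ':', value  # ordering is arbitrary in Python 2 dicts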

Built-in Function in Python: map()

map(function, sequence)

map() has two parameters: a function and a sequence. It calls the function on each element of the sequence and returns the list of results.

To see how map() works, let us first define a function called square.

def square(i): return i*i
square(4)

16

Now, apply map() to the square function over the range 0 to 5.

map(square, range(6))

[0, 1, 4, 9, 16, 25]
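The call above is equivalent to a list comprehension; which form to use is mostly a matter of taste:

[square(i) for i in range(6)]

[0, 1, 4, 9, 16, 25]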

Now, apply map() to the square function over the range 13 to 23.

map(square,range(13,24))

[169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529]

Next, we define a remainder function.

def remainder(j): return j % 3
remainder(14)

2

Now, apply map() to the remainder function over the range 1 to 10.

map(remainder,range(1,11))

[1, 2, 0, 1, 2, 0, 1, 2, 0, 1]

Now, apply map() to the remainder function over the range 91 to 100.

map(remainder,range(91,101))

[1, 2, 0, 1, 2, 0, 1, 2, 0, 1]
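For throwaway functions like remainder, map() is often paired with an anonymous lambda instead – an extra illustration, not from the original post:

map(lambda j: j % 3, range(1, 11))

[1, 2, 0, 1, 2, 0, 1, 2, 0, 1]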