Who Could Be Myopic? Use Logistic Regression to Find Out

Here I use logistic regression to classify students as myopic or non-myopic.

Data source: myopia dataset

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn releases
from sklearn import metrics
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer scikit-learn releases

Loading Dataset

In [2]:
myopia = pd.read_excel("myopia.xls")
In [3]:
myopia['MYOPIC'].value_counts()
Out[3]:
0    537
1     81
dtype: int64
This is an imbalanced dataset: only about 13% of the records (81 of 618) represent myopic students.
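A quick way to confirm the imbalance is to look at the class proportions directly. This is a small sketch on the same myopia dataframe; value_counts(normalize=True) is available in recent pandas versions, and the mean of a 0/1 column gives the same figure:

# share of each class in the target column
print(myopia['MYOPIC'].value_counts(normalize=True))
# equivalent check: the mean of a 0/1 column is the positive-class share
print("Myopic share: %.3f" % myopia['MYOPIC'].mean())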
In [4]:
#creating dataframe with intercept column using dmatrices from patsy
y,x = dmatrices('MYOPIC ~ AGE + GENDER + READHR + COMPHR + STUDYHR + TVHR + SPORTHR + DIOPTERHR + MOMMY + DADMY', myopia, return_type="dataframe")
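To see what dmatrices produced, a quick check of the design matrix can help (a small sketch; patsy adds the Intercept column automatically and the remaining columns follow the formula above):

print(x.columns.tolist())  # ['Intercept', 'AGE', 'GENDER', 'READHR', ...]
print(x.shape)             # one row per student, one column per term in the formula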
In [5]:
# creating a logistic regression model, and fitting with x and y
logit = LogisticRegression()
logit = logit.fit(x, y)

# checking accuracy on the same data used for fitting (training accuracy)
logit.score(x, y)
Out[5]:
0.8689320388349514

Model Training

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

#creating logistic regression, fitting training data set : x_train, y_train
logit2 = LogisticRegression()
logit2.fit(x_train, y_train)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [7]:
# predicting using test data set
predicted = logit2.predict(x_test)

# getting probabilities
probs = logit2.predict_proba(x_test)
predicted
Out[7]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])

Checking Accuracy

In [8]:
# generate evaluation metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.confusion_matrix(y_test, predicted)
print metrics.precision_score(y_test, predicted) # what proportion of students identified as myopic were actually myopic
print metrics.recall_score(y_test, predicted)    # what proportion of students who were myopic were identified as myopic
 Out[8]:
0.854838709677
[[159   1]
 [ 26   0]]
0.0
0.0

The accuracy score is about 85%. According to the confusion matrix, the model predicts 159 of the 160 non-myopic students correctly, which is what drives the high accuracy. However, it does not predict a single myopic student correctly. This is clearly a case of class imbalance, as the proportion of non-myopic cases is much higher than that of myopic cases. Class imbalance leads to the 'accuracy paradox': a model with lower accuracy can have better predictive power than one with higher accuracy. The ratio of correct predictions to the total number of cases may seem like an important metric for a predictive model, but under class imbalance it can be almost useless.

Class imbalance is also the reason both precision and recall are zero here: the one myopic prediction the model makes is wrong, and none of the 26 actual myopic cases are identified. Hence, while measuring the performance of a model, we must look at several types of performance measures, not accuracy alone.
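One way to see the accuracy paradox concretely is to compare the fitted model with a baseline that always predicts the majority class, and then to re-weight the classes so that errors on the rare myopic class cost more. The sketch below reuses the x_train/x_test split from above; DummyClassifier is part of scikit-learn, and class_weight='balanced' is the option name in scikit-learn 0.17+ ('auto' in older versions). Exact numbers will vary.

from sklearn.dummy import DummyClassifier

# baseline: always predict the majority class (non-myopic)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, np.ravel(y_train))
print(metrics.accuracy_score(y_test, baseline.predict(x_test)))  # roughly the same accuracy as logit2

# re-weight classes so the rare myopic class is not ignored
logit_balanced = LogisticRegression(class_weight='balanced')
logit_balanced.fit(x_train, np.ravel(y_train))
balanced_pred = logit_balanced.predict(x_test)
print(metrics.recall_score(y_test, balanced_pred))       # recall should improve, at the cost of some accuracy
print(metrics.confusion_matrix(y_test, balanced_pred))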

Cross Validation

In [9]:
#converting y into one dimensional array
y = np.ravel(y)
In [10]:
# evaluating using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
 Out[10]:
[ 0.85714286  0.87096774  0.87096774  0.87096774  0.87096774  0.87096774
  0.87096774  0.86885246  0.86885246  0.86885246]
0.86895066858

The mean 10-fold cross-validation accuracy is about 86.9%. Cross-validation aggregates the accuracy over different combinations of training and test data, so the estimate is more stable, but with this class imbalance it still mostly reflects how often the majority (non-myopic) class appears.
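Given the imbalance, it is also worth cross-validating on metrics other than plain accuracy. A minimal sketch using the same x and the raveled y; 'recall' and 'roc_auc' are standard scoring names in scikit-learn:

recall_scores = cross_val_score(LogisticRegression(), x, y, scoring='recall', cv=10)
auc_scores = cross_val_score(LogisticRegression(), x, y, scoring='roc_auc', cv=10)
print(recall_scores.mean())  # expected to be near zero for this model
print(auc_scores.mean())     # ranking quality, independent of the 0.5 decision threshold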

In further blog posts we will use other modelling techniques to tackle this classification problem.

Practice 1.1 – Python Pandas Cookbook by Alfred Essa

Following Alfred Essa’s Python Pandas Cookbook on YouTube
Different Ways to Construct Series
import pandas as pd
import numpy as np
Using Series Constructor
s1 = pd.Series([463,3,-728,236,32,-773])
s1
0    463
1      3
2   -728
3    236
4     32
5   -773
dtype: int64
type(s1)
pandas.core.series.Series
s1.values
array([ 463,    3, -728,  236,   32, -773], dtype=int64)
type(s1.values)
numpy.ndarray
s1.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
s1[3]
236

Defining data and index

data1 = [3.5,5,343,9.3,23]
index1 = ['Mon','Tue','Wed','Thur','Fri']

Creating Series

s2 = pd.Series(data1, index = index1)
s2
Mon       3.5
Tue       5.0
Wed     343.0
Thur      9.3
Fri      23.0
dtype: float64
s2[4]
23.0
s2.index
Index([u'Mon', u'Tue', u'Wed', u'Thur', u'Fri'], dtype='object')
s2.name = 'Daily numbers'
s2.index.name = 'Working days'
s2
Working days
Mon               3.5
Tue               5.0
Wed             343.0
Thur              9.3
Fri              23.0
Name: Daily numbers, dtype: float64
Creating a Series from a Dictionary
dict1 = {'Jan': -7,'Feb': 2,'March': 12,'April': -9,'May': 3,'June': 4}
s3 = pd.Series(dict1)
s3
April    -9
Feb       2
Jan      -7
June      4
March    12
May       3
dtype: int64
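Note that the months come out in alphabetical key order, not calendar order, because the Series was built from a dict. Passing an explicit index keeps the order you want (a small sketch; any label missing from the dict would become NaN):

month_order = ['Jan', 'Feb', 'March', 'April', 'May', 'June']
pd.Series(dict1, index=month_order)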
Vectorized Operations
s3 * 2
April   -18
Feb       4
Jan     -14
June      8
March    24
May       6
dtype: int64
np.log(s3)
April         NaN
Feb      0.693147
Jan           NaN
June     1.386294
March    2.484907
May      1.098612
dtype: float64
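The NaN entries appear because the logarithm is undefined for the negative values (April and Jan); NumPy returns NaN there instead of raising an error. Restricting to positive values avoids them:

np.log(s3[s3 > 0])   # boolean filtering keeps only the positive entries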
Slicing
s3['Feb':'May']
Feb       2
Jan      -7
June      4
March    12
May       3
dtype: int64
s3[3:5]
June      4
March    12
dtype: int64
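Note the difference between the two slices above: label-based slicing includes the end label ('May' appears in the result), while positional slicing excludes the end position. The .loc and .iloc accessors make the same distinction explicit:

s3.loc['Feb':'May']   # label-based, end label included
s3.iloc[3:5]          # position-based, end position excluded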
Assignment using an offset (position)
s3[3] = 54
s3
April    -9
Feb       2
Jan      -7
June     54
March    12
May       3
dtype: int64
s3.median()
2.5
s3.min()
-9
s3.max()
54
s3.cumsum()
April    -9
Feb      -7
Jan     -14
June     40
March    52
May      55
dtype: int64

Making looping clearer – enumerate() returns an iterator of (index, value) pairs

for i, v in enumerate(s3):
    print i,v
0 -9
1 2
2 -7
3 54
4 12
5 3
new_s3 = [x**2 for x in s3]
new_s3
[81, 4, 49, 2916, 144, 9]
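The list comprehension returns a plain Python list and drops the index. A vectorized alternative keeps the Series structure:

s3 ** 2   # squares every value and keeps the month index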
Using a Series like a dictionary
s3['Feb']
2
'Feb' in s3
True
Assignment using a key (note that 45.8 is truncated to 45 because the Series dtype is int64)
s3['May'] = 45.8
s3
April    -9
Feb       2
Jan      -7
June     54
March    12
May      45
dtype: int64

Looping over dictionary keys and values

for k, v in s3.iteritems():   # iteritems() is items() in newer pandas versions
    print k, v
April -9
Feb 2
Jan -7
June 54
March 12
May 45

Introduction to Programming: Python

I started my programming journey with Python. To begin with, there are two well-known sources one could choose from: i. http://www.codecademy.com and ii. http://www.coursera.org. I chose Codecademy as its interactive platform makes learning quite simple. As a beginner to programming, I could complete the Codecademy Python track in two weeks. Here are some tips from my experience:

  1. It's important to sharpen one's memory and one's mathematical and logical skills. Lumosity is a good way to improve these.
  2. Chill, relax and get good sleep at night. Yes, I know coding and gaming can be a bit addictive, but do sleep well.
  3. Be consistent. I liked a quote from lumosity.com, shown when you play for longer hours in a single day: “Training is a marathon, not a sprint. Come back tomorrow to challenge yourself again.”

Coming up next – posts on Introduction to Python, the interactive course by http://www.codecademy.com.

For Beginners in Coding

Today, let's begin lessons on coding for beginners. When I started my journey in data science, these were the first techniques I learnt:

  1. Programming in Python – learning source(s): i. http://www.codecademy.com/tracks/python and ii. https://www.coursera.org/course/pythonlearn
  2. R techniques and packages – learning source(s): i. https://www.datacamp.com and ii. http://www.r-bloggers.com

The following posts will contain my solutions to Codecademy's Python exercises.

Any questions or suggestions are welcome from all readers 🙂

What is Data Science? Why study Data Science?

“Data Science is extraction of knowledge from data”.

“By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge”.


Data science gives insights across almost all industries, from retail to government to biotech. Let's have a look at the industry-wise use of data science:

  1. Government –
    • Prevent waste, fraud and abuse
    • Combat cyber-attacks and safeguard sensitive information
    • Use business intelligence to make better financial decisions
    • Improve defense systems and protect soldiers on the ground
  2. Internet –
    • Personalized recommendations based on weather, seasonal trends, traffic reports, past purchase history, your dog’s favorite chew toys…
    • Smarter sentiment analysis
    • Product insights gleaned from RFID and sensor data
    • Detailed market basket and video analysis
    • Geo-targeted marketing
    • Real-time pricing and inventory management
  3. Healthcare –
    • Clinical trials
    • Direct observations of other physicians
    • Electronic medical records
    • Online patient networks
  4. Finance –
    • Running live simulations of market events.
    • Build algorithms around market sentiment data that can short the market when disasters occur
    • Track trends, monitor the launch of new products, respond to issues and improve overall brand perception
    • Detect financial fraud
    • Improve credit ratings.
    • Provide more accurate pricing
  5. Biotechnology –
    • Bioinformatics tools for studying biological data
    • Genomics research
    • DNA sequencing
  6. Retail –
    • Sentiment analysis of social media streams
    • Predictive analytics
    • Real-time pricing using “second by second” metrics
    • Tailored offers through online behavioral analysis and web analytics
    • Improved, real-time inventory tracking and management
    • Route optimization and more efficient transportation
  7. Telecom –
    • Combine variables to predict the likelihood of change
    • Sentiment analysis of social media to detect changes in opinion
    • Personalized promotions based on historical behavior
    • React to retain customers as soon as change is noted
  8. Manufacturing –
    • Predictive models for equipment failure rates
    • Streamline inventory management
    • Target energy-inefficient components
    • Optimize factory floor space
  9. Insurance –
    • Personalized Risk Pricing
    • Property Insurance – moisture sensors that detect flooding or leaks; utility and appliance usage records
    • Life and Health Insurance – transactional data (e.g., where and what customers buy, junk food included); body sensors, i.e. devices that monitor consumption or alert the wearer to early signs of illness; exterior monitors, e.g. data from workout machines; and social media, e.g. tweets about one’s personal health or state of mind
  10. Travel and Transportation –
    • Behavioral targeting – e.g., website visitor behavior
    • Social media – e.g., posts on travel, reviews from friends
    • Location tracking records
    • Previous purchase patterns
    • Mobile device usage data
  11. Pharmaceutical –
    • Drug Discovery
    • Avoiding Negative Outcomes: Predictive modeling can also be used to short-circuit potential disasters such as deaths from risk factors.
    • Data-Based Patient Selection: identifying which populations would work best in trials.
    • Real-Time Monitoring: Companies now monitor real-time data from trials to identify safety or operational risks and nip problems in the bud.
    • Drug Safety Assurance
    • Sophisticated Sales
    • Patterns in drug-drug interactions
  12. Utilities –
    • Real-time sensor data
    • Operating history
    • Technician notes
    • Predictive models (e.g., expected earthquake effects)
    • Public datasets (e.g., weather reports)