Who Could Be Myopic? Using Logistic Regression to Find Out

Here I use logistic regression to classify students as myopic or non-myopic.

Data source: myopia dataset

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score  # the old cross_validation module was removed from scikit-learn
from sklearn import metrics

Loading Dataset

In [2]:
myopia = pd.read_excel("myopia.xls")
In [3]:
myopia['MYOPIC'].value_counts()
Out[3]:
0    537
1     81
dtype: int64
This is an imbalanced dataset: only about 13% of the records (81 of 618) represent myopic students.
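The imbalance can be verified directly (a minimal sketch using the same myopia dataframe, not a cell from the original notebook):

print(myopia['MYOPIC'].value_counts(normalize=True))  # ~0.87 non-myopic vs ~0.13 myopic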
In [4]:
# creating dataframes with an intercept column using dmatrices from patsy
y,x = dmatrices('MYOPIC ~ AGE + GENDER + READHR + COMPHR + STUDYHR + TVHR + SPORTHR + DIOPTERHR + MOMMY + DADMY', myopia, return_type="dataframe")
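Note that dmatrices adds an intercept column automatically. As a quick sanity check (an illustrative line, not from the original notebook):

print(x.columns.tolist())  # expect 'Intercept' followed by the ten predictors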
In [5]:
# creating a logistic regression model, and fitting with x and y
logit = LogisticRegression()
logit = logit.fit(x, y)

# checking accuracy
logit.score(x, y)
Out[5]:
0.8689320388349514
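This score deserves a closer look: 537/618 ≈ 0.869, so it is exactly what a model that always predicts "non-myopic" would score. This "null accuracy" baseline can be computed directly (a sketch based on the class counts above):

null_accuracy = max(myopia['MYOPIC'].mean(), 1 - myopia['MYOPIC'].mean())
print(null_accuracy)  # ~0.869, the accuracy of always predicting the majority class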

Model Training

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# creating a logistic regression model and fitting the training set: x_train, y_train
logit2 = LogisticRegression()
logit2.fit(x_train, y_train)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [7]:
# predicting using test data set
predicted = logit2.predict(x_test)

# getting probabilities
probs = logit2.predict_proba(x_test)
predicted
Out[7]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])
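predict uses a fixed 0.5 probability threshold, which is one reason almost every student is classified as non-myopic. One way to trade precision for recall on an imbalanced problem is to classify from probs with a lower cut-off (a sketch; the 0.2 threshold is an arbitrary illustration, not a tuned value):

predicted_low = (probs[:, 1] > 0.2).astype(int)  # flag as myopic when P(myopic) > 0.2
print(predicted_low.sum())  # number of students now flagged as myopic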

Checking Accuracy

In [8]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.precision_score(y_test, predicted))  # what proportion of students identified as myopic were actually myopic
print(metrics.recall_score(y_test, predicted))     # what proportion of myopic students were identified as myopic
0.854838709677
[[159   1]
 [ 26   0]]
0.0
0.0

The accuracy score is about 85%. According to the confusion matrix, the model predicts 159 of 160 non-myopic cases correctly, which accounts for the high accuracy. However, it fails to identify a single myopic student. This is clearly a case of class imbalance, since the proportion of non-myopic cases is much higher than that of myopic ones. Class imbalance leads to the 'accuracy paradox': a model with lower accuracy can have better predictive power than a model with higher accuracy. The ratio of correct predictions to total cases may look like a natural metric for predictive models, but under class imbalance it can be useless.

Class imbalance is also the reason for the zero precision and recall scores. Hence, when measuring a model's performance, we must look at several different metrics rather than accuracy alone.
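The accuracy paradox can be made concrete with a majority-class baseline: a classifier that never predicts "myopic" scores nearly the same accuracy as the fitted model. A minimal sketch (not part of the original analysis) using scikit-learn's DummyClassifier:

from sklearn.dummy import DummyClassifier

# baseline that always predicts the majority class (non-myopic)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(x_train, y_train)
print(dummy.score(x_test, y_test))  # ~0.86, despite never identifying a myopic student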

Cross Validation

In [9]:
#converting y into one dimensional array
y = np.ravel(y)
In [10]:
# evaluating using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
[ 0.85714286  0.87096774  0.87096774  0.87096774  0.87096774  0.87096774
  0.87096774  0.86885246  0.86885246  0.86885246]
0.86895066858

Cross-validation gives a mean accuracy of 86.9%, aggregated over ten different train/test splits. Note that this is again no better than the majority-class baseline, so the high score says little about the model's ability to find myopic students.
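One common remedy for class imbalance is to reweight the classes during training. As a sketch (not part of the original analysis), scikit-learn's class_weight='balanced' option penalises mistakes on the rare class more heavily, and scoring by recall shows whether myopic students are actually being caught:

recall_scores = cross_val_score(LogisticRegression(class_weight='balanced'), x, y, scoring='recall', cv=10)
print(recall_scores.mean())  # mean recall on the myopic class, rather than raw accuracy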

In future blog posts, we will apply other modelling techniques to this classification problem.


Introduction to Bitwise Operators

I Binary Representation

i. Just a Little BIT

print 5 >> 4  # Right Shift: 0
print 5 << 1  # Left Shift: 10
print 8 & 5   # Bitwise AND: 0
print 9 | 4   # Bitwise OR: 13
print 12 ^ 42 # Bitwise XOR: 38
print ~88     # Bitwise NOT: -89

ii. Lesson 10: The Base 2 Number System

print 0b1,    #1
print 0b10,   #2
print 0b11,   #3
print 0b100,  #4
print 0b101,  #5
print 0b110,  #6
print 0b111   #7
print "******"
print 0b1 + 0b11   # 1 + 3 = 4
print 0b11 * 0b11  # 3 * 3 = 9

iii. I Can Count to 1100!

one = 0b1
two = 0b10
three = 0b11
four = 0b100
five = 0b101
six = 0b110
seven = 0b111
eight = 0b1000
nine = 0b1001
ten = 0b1010
eleven = 0b1011
twelve = 0b1100

iv. The bin() Function

print bin(1)         # 0b1
for each in range(2,6):
    print bin(each)  # 0b10, 0b11, 0b100, 0b101

v. int()’s Second Parameter

print int("1",2)
print int("10",2)
print int("111",2)
print int("0b100",2)
print int(bin(5),2)
# Print out the decimal equivalent of the binary 11001001.
print int("11001001",2)

II The Bitwise Operators
i. Slide to the Left! Slide to the Right!

shift_right = 0b1100
shift_left = 0b1

# shift each value by two bits
shift_right = shift_right >> 2  # 0b1100 -> 0b11
shift_left = shift_left << 2    # 0b1 -> 0b100

ii. A BIT of This AND That

print bin(0b1110 & 0b101)  # only bits set in both survive: 0b100

iii. A BIT of This OR That

print bin(0b1110 | 0b101)  # bits set in either: 0b1111

iv. This XOR That?

print bin(0b1110 ^ 0b101)  # bits set in exactly one: 0b1011

v. See? This is NOT That Hard!
III A Bit More Complicated
i. The Man Behind the Bit Mask

def check_bit4(input):
    # mask with only the fourth bit (counting from the right) set
    mask = 0b1000
    if input & mask > 0:
        return "on"
    else:
        return "off"

ii. Turn It On

a = 0b10111011
mask = 0b100
print bin(a | mask)  # turns on the third bit from the right: 0b10111111

iii. Just Flip Out

a = 0b11101110
mask = 0b11111111
print bin(a ^ mask)  # flips every bit: 0b10001
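XORing against an all-ones mask flips every bit, which for a fixed 8-bit width is equivalent to a masked bitwise NOT (an illustration, not part of the course):

print bin(~a & 0b11111111)  # 0b10001, same result as a ^ 0b11111111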

iv. Slip and Slide

def flip_bit(number, n):
    # build a mask with only the n-th bit (counting from 1 on the right) set
    mask = 0b1 << (n - 1)
    return bin(number ^ mask)
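For example (expected values shown as comments):

print flip_bit(0b100, 1)  # flips the rightmost bit: 0b101
print flip_bit(0b101, 3)  # flips the third bit: 0b1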