Who Could Be Myopic? Use Logistic Regression to Find Out

Here I use logistic regression to classify students as myopic or non-myopic.

Data source: myopia dataset

Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

Loading Dataset

In [2]:
myopia = pd.read_excel("myopia.xls")
In [3]:
myopia['MYOPIC'].value_counts()
Out[3]:
0    537
1     81
dtype: int64
This is an imbalanced dataset: only about 13% of the rows (81 of 618) represent myopic students.
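The proportions can be read off directly; a small sketch using the counts above (rebuilding the series rather than re-reading the file):

```python
import pandas as pd

# class counts taken from the value_counts() output above
counts = pd.Series({0: 537, 1: 81}, name='MYOPIC')

# normalising the counts gives the class proportions
proportions = counts / counts.sum()
print(proportions.round(3))  # 0 -> 0.869, 1 -> 0.131
```

Equivalently, `myopia['MYOPIC'].value_counts(normalize=True)` on the original dataframe returns the same proportions.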
In [4]:
#creating dataframe with intercept column using dmatrices from patsy
y,x = dmatrices('MYOPIC ~ AGE + GENDER + READHR + COMPHR + STUDYHR + TVHR + SPORTHR + DIOPTERHR + MOMMY + DADMY', myopia, return_type="dataframe")
In [5]:
# creating a logistic regression model, and fitting with x and y
# (np.ravel flattens the (n, 1) dataframe into the 1-d array sklearn expects)
logit = LogisticRegression()
logit = logit.fit(x, np.ravel(y))

# checking accuracy on the training data
logit.score(x, np.ravel(y))
Out[5]:
0.8689320388349514
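That training accuracy deserves a second look: it is exactly the share of non-myopic students, i.e. the score a model that always predicts the majority class would get. A quick check:

```python
# majority-class baseline: always predicting "non-myopic"
non_myopic, myopic = 537, 81
baseline = non_myopic / (non_myopic + myopic)
print(round(baseline, 6))  # 0.868932, matching the model's score above
```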

Model Training

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

#creating logistic regression, fitting training data set : x_train, y_train
logit2 = LogisticRegression()
logit2.fit(x_train, np.ravel(y_train))
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [7]:
# predicting using test data set
predicted = logit2.predict(x_test)

# getting probabilities
probs = logit2.predict_proba(x_test)
predicted
Out[7]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.])

Checking Accuracy

In [8]:
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.precision_score(y_test, predicted))  # what proportion of students identified as myopic were actually myopic
print(metrics.recall_score(y_test, predicted))     # what proportion of myopic students were identified as myopic
Out[8]:
0.854838709677
[[159   1]
 [ 26   0]]
0.0
0.0

The accuracy score is about 86%. According to the confusion matrix, the model predicts 159 of 160 non-myopic cases correctly, which is the reason for the high accuracy. However, it fails to predict a single myopic case correctly. This is a clear case of class imbalance: the proportion of non-myopic cases is far higher than that of myopic cases. Class imbalance leads to the 'accuracy paradox': a model with a lower accuracy score can have better predictive power than one with a higher score. The ratio of correct predictions to total cases may look like an important metric for a predictive model, but under class imbalance it can be useless.

Class imbalance is also the reason for the precision and recall scores of zero. Hence, when measuring the performance of a model, we must look at several types of performance measures, not accuracy alone.
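One standard mitigation, shown here only as a sketch on synthetic data (not the myopia dataset), is scikit-learn's `class_weight='balanced'` option, which reweights samples inversely to class frequency so that errors on the minority class cost more during fitting:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# synthetic imbalanced data standing in for the myopia split (~85/15)
rng = np.random.RandomState(0)
X = rng.randn(618, 4)
y = (X[:, 0] + 0.5 * rng.randn(618) > 1.1).astype(int)

plain = LogisticRegression().fit(X, y)
# reweight classes inversely to their frequency
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

print(recall_score(y, plain.predict(X)),
      recall_score(y, balanced.predict(X)))
```

The balanced model typically trades some overall accuracy for much better recall on the minority class, which is exactly the metric that collapsed to zero above.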

Cross Validation

In [9]:
#converting y into one dimensional array
y = np.ravel(y)
In [10]:
# evaluating using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
Out[10]:
[ 0.85714286  0.87096774  0.87096774  0.87096774  0.87096774  0.87096774
  0.87096774  0.86885246  0.86885246  0.86885246]
0.86895066858

The cross-validation score shows about 86.9% mean accuracy. This estimate is more reliable than a single train/test split because it averages the accuracy over ten different train/test partitions, but note that it still sits at the majority-class proportion, so the imbalance problem remains.
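To make explicit what `cross_val_score` aggregates, here is a sketch (on synthetic data, since we are not reloading the myopia file) that reproduces it with a manual fold loop; for a classifier and an integer `cv`, scikit-learn uses stratified folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the myopia features, with a similar imbalance
X, y = make_classification(n_samples=618, weights=[0.87], random_state=0)

# manual 10-fold loop: fit on each training fold, score the held-out fold
cv = StratifiedKFold(n_splits=10, shuffle=False)
manual = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    manual.append(model.score(X[test_idx], y[test_idx]))

# cross_val_score performs the same procedure and returns the per-fold scores
auto = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(np.allclose(manual, auto))
```

The mean of either array is the single number reported above; looking at the per-fold spread is also a quick sanity check on the stability of the model.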

In future blog posts we will apply other modelling techniques to this classification problem.

 

 
