Here I use logistic regression to classify students as myopic or non-myopic.

Data source: myopia dataset

### Importing Libraries

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
```

### Loading Dataset

`myopia = pd.read_excel("myopia.xls")`

```
# creating dataframes with an intercept column, using dmatrices from patsy
y,x = dmatrices('MYOPIC ~ AGE + GENDER + READHR + COMPHR + STUDYHR + TVHR + SPORTHR + DIOPTERHR + MOMMY + DADMY', myopia, return_type="dataframe")
```
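For readers new to patsy, here is a minimal sketch (on a small synthetic frame, not the actual myopia data) of what `dmatrices` returns; note the `Intercept` column of ones it prepends to the design matrix:

```
import pandas as pd
from patsy import dmatrices

# tiny synthetic frame standing in for the myopia data
df = pd.DataFrame({
    "MYOPIC": [0, 1, 0, 1],
    "AGE":    [6, 7, 6, 8],
    "READHR": [2, 5, 1, 4],
})

y, x = dmatrices("MYOPIC ~ AGE + READHR", df, return_type="dataframe")
print(list(x.columns))   # ['Intercept', 'AGE', 'READHR']
print(list(y.columns))   # ['MYOPIC']
```

This is why no intercept needs to be added manually before fitting.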

```
# creating a logistic regression model and fitting it with x and y
logit = LogisticRegression()
logit = logit.fit(x, np.ravel(y))   # sklearn expects a 1-D target array
# checking accuracy on the training data
logit.score(x, np.ravel(y))
```

### Model Training

```
x_train, x_test, y_train, y_test = train_test_split(x, np.ravel(y), test_size=0.3, random_state=0)
# creating a logistic regression model and fitting it on the training set: x_train, y_train
logit2 = LogisticRegression()
logit2.fit(x_train, y_train)
```

```
# predicting using test data set
predicted = logit2.predict(x_test)
# getting probabilities
probs = logit2.predict_proba(x_test)
predicted
```
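Note that `predict` is just `predict_proba` with an implicit 0.5 cutoff. A lower cutoff trades precision for recall, which matters under imbalance. A minimal sketch on synthetic data (the 0.3 threshold is a hypothetical choice, not part of the original analysis):

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic, imbalanced stand-in for the myopia data
rng = np.random.RandomState(0)
x = rng.randn(200, 3)
y = (x[:, 0] + 0.5 * rng.randn(200) > 1.2).astype(int)

clf = LogisticRegression().fit(x, y)
probs = clf.predict_proba(x)[:, 1]          # probability of the positive class

default_pred = clf.predict(x)               # implicit 0.5 cutoff
lowered_pred = (probs > 0.3).astype(int)    # hypothetical 0.3 cutoff
print(default_pred.sum(), lowered_pred.sum())
```

The lowered cutoff flags at least as many students as positive, catching more true positives at the cost of more false alarms.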

### Checking Accuracy

```
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
# precision: what proportion of students identified as myopic were actually myopic
print(metrics.precision_score(y_test, predicted))
# recall: what proportion of myopic students were identified as myopic
print(metrics.recall_score(y_test, predicted))
```

The accuracy score is about 86%. According to the confusion matrix, the model predicts 159 of 160 non-myopic cases correctly, which accounts for the high accuracy. However, it fails to predict a single myopic case correctly. This is clearly a case of **class imbalance**: the proportion of non-myopic students is much higher than that of myopic students. Class imbalance leads to the '**accuracy paradox**', i.e. a model with a lower accuracy may have greater predictive power than a model with a higher accuracy. The ratio of correct predictions to total number of cases may seem like an important metric for predictive models, but in cases of class imbalance it can be useless.
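The accuracy paradox is easy to reproduce with a baseline that always predicts the majority class. A sketch on synthetic imbalanced labels (a 90/10 split, similar in spirit to the myopia data, not the actual dataset):

```
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 90 non-myopic, 10 myopic
y = np.array([0] * 90 + [1] * 10)
x = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(x, y)
pred = baseline.predict(x)

print(accuracy_score(y, pred))  # 0.9 -- high accuracy...
print(recall_score(y, pred))    # 0.0 -- ...but no myopic case is caught
```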

Class imbalance is also the reason for the low precision and recall scores. Hence, when measuring the performance of a model we must look at several different performance measures.
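Two such measures that are far less forgiving of imbalance are F1 and balanced accuracy. A sketch using the same all-negative predictor on a 90/10 split (synthetic labels, not the myopia data):

```
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# a predictor that labels every student non-myopic on a 90/10 split
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- chance level
print(f1_score(y_true, y_pred))                 # 0.0
```

Plain accuracy looks strong, while balanced accuracy and F1 immediately expose the degenerate model.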

### Cross Validation

```
#converting y into one dimensional array
y = np.ravel(y)
```

```
# evaluating using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
```

The cross-validation score shows about 86.9% accuracy. This is the same accuracy metric, now averaged over ten different combinations of training and test data, so it is more robust to a lucky split; but it is still plain accuracy, and therefore subject to the same class-imbalance caveat as above.
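Cross-validation can also be run with a metric other than accuracy, which makes the imbalance visible across all folds. A sketch on synthetic imbalanced data (the `scoring='recall'` choice is an illustration, not part of the original analysis):

```
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic, imbalanced stand-in for the myopia data
rng = np.random.RandomState(0)
x = rng.randn(300, 4)
y = (x[:, 0] + rng.randn(300) > 1.5).astype(int)

acc = cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=10)
rec = cross_val_score(LogisticRegression(), x, y, scoring='recall', cv=10)
print(acc.mean(), rec.mean())  # accuracy looks strong; recall tells another story
```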

In further blog posts we will apply other modelling techniques to this classification problem.