BLOG - SAVAN VISALPARA

[ Musings on Machine Learning ]
Posted on 23 May, 2017

Practical Machine Learning With Python [Part - 1]

In the theory part, we discussed what machine learning is, the types of machine learning, linear regression, logistic regression, cross-validation and overfitting. In this lab session, I will demonstrate these concepts in Python code. Python is a widely used programming language in the field of scientific computing, largely thanks to excellent libraries such as numpy, scikit-learn and matplotlib. We will use these libraries throughout the lab sessions. Check out the GitHub repository of this series here.

Linear Regression

We will start with a very simple algorithm called linear regression. In the blog post, I explained in depth what linear regression is and how it works. In this session, we will focus on implementation rather than theory. We will follow the standard procedure for training machine learning models:

  • Load the dataset
  • Preprocess/Augment the dataset
  • Train a model
  • Test a model
  • Deploy a model

In practice, most of our time is spent getting the dataset ready for a model, that is, preprocessing it. Here, I will use an already preprocessed dataset.

We will use a Python library called scikit-learn, the most widely used machine learning library. For the installation process please visit the scikit-learn website. You can install it with pip: pip install -U scikit-learn
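To confirm that the installation worked, you can print the installed version (a tiny sanity check; any reasonably recent version will work for this post):

import sklearn
print(sklearn.__version__)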

1 - Load the dataset

In [1]:
#sklearn comes with a few small datasets. We will use one of them called "boston housing", which is the same
#example we saw in the theory part. This dataset has 506 samples with 13 features (columns). Here the target variable
#is the price of the house.

#import the libs
from sklearn.datasets import load_boston
#load the dataset
data = load_boston()  #returns dictionary-like object, attributes are - data, target, DESCR
#first of all, let's see the shape of the training data
print(data.data.shape)
(506, 13)
In [2]:
#shape of a target/labels
print(data.target.shape)
(506,)
In [3]:
#important info about the dataset
print(data.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [4]:
#what the target values look like
data.target[:40]
Out[4]:
array([ 24. ,  21.6,  34.7,  33.4,  36.2,  28.7,  22.9,  27.1,  16.5,
        18.9,  15. ,  18.9,  21.7,  20.4,  18.2,  19.9,  23.1,  17.5,
        20.2,  18.2,  13.6,  19.6,  15.2,  14.5,  15.6,  13.9,  16.6,
        14.8,  18.4,  21. ,  12.7,  14.5,  13.2,  13.1,  13.5,  18.9,
        20. ,  21. ,  24.7,  30.8])

2 - Preprocess the dataset

Since this dataset is already preprocessed, we don't have to do anything in this phase.
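Just to illustrate what this phase often involves, here is a minimal, purely illustrative sketch of feature scaling with scikit-learn's StandardScaler. We do not actually apply it to the Boston data in this post; it only shows what a typical preprocessing step could look like.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
#rescale every feature to zero mean and unit variance
scaled_features = scaler.fit_transform(data.data)
print(scaled_features.mean(axis=0)[:3], scaled_features.std(axis=0)[:3])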

3 - Train a model

In [5]:
from sklearn.linear_model import LinearRegression
#create a linear regression object
lin_reg = LinearRegression()
#train a model
lin_reg.fit(data.data, data.target)
Out[5]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [6]:
#learned weights
lin_reg.coef_
Out[6]:
array([ -1.07170557e-01,   4.63952195e-02,   2.08602395e-02,
         2.68856140e+00,  -1.77957587e+01,   3.80475246e+00,
         7.51061703e-04,  -1.47575880e+00,   3.05655038e-01,
        -1.23293463e-02,  -9.53463555e-01,   9.39251272e-03,
        -5.25466633e-01])
In [7]:
#learned intercept
lin_reg.intercept_
Out[7]:
36.491103280363134
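To connect these learned parameters back to the model, recall that a linear regression prediction is simply the dot product of the input features with the learned weights plus the intercept. A quick sanity check (a small addition, not in the original notebook):

import numpy as np
#manual prediction for the first sample: x . w + b
manual = np.dot(data.data[0], lin_reg.coef_) + lin_reg.intercept_
print(manual)
print(lin_reg.predict(data.data[0].reshape(1, -1))[0])  #the two printed values should match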

4 - Test a model

In [37]:
# we can use the model to predict as follows
lin_reg.predict(data.data[4].reshape(1,-1))  #fifth sample (index 4)
Out[37]:
array([ 27.94288232])
In [10]:
#let's see what the true value was
data.target[4]  #not good :(
Out[10]:
36.200000000000003
In [11]:
#find mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(data.target, lin_reg.predict(data.data))
Out[11]:
21.897779217687496
In [12]:
#let us calculate the mse from scratch to make sure it's correct
import numpy as np
np.mean((lin_reg.predict(data.data) - data.target) ** 2)
Out[12]:
21.897779217687496

5 - Deploy a model

We can use the predict method to predict the price of a house. In a real deployment, we would usually also save the trained model so it can be loaded later; a small sketch follows.
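Here is a minimal sketch of persisting the model with joblib; this is my own addition and not part of the original notebook.

from sklearn.externals import joblib  #in newer scikit-learn versions, use: import joblib

joblib.dump(lin_reg, "lin_reg.pkl")          #save the trained model to disk
loaded_model = joblib.load("lin_reg.pkl")    #load it back later, e.g. inside an application
print(loaded_model.predict(data.data[4].reshape(1, -1)))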

As you can see, the main benefit of these libraries is that we do not have to worry about the internal workings of the algorithms; the library does that work for us. Later in this session, I will create some visualizations to make the concepts clearer.

Logistic Regression

Logistic regression is a classification algorithm. In the theory session, I explained how it works. The sigmoid function, sigmoid(x) = 1 / (1 + e^(-x)), is the core of this algorithm. We can implement it in numpy as follows:

In [13]:
def sigmoid(x):
    return 1/(1+np.exp(-x))
In [14]:
numbers = np.linspace(-20,20,50) #generate a list of numbers
numbers
Out[14]:
array([-20.        , -19.18367347, -18.36734694, -17.55102041,
       -16.73469388, -15.91836735, -15.10204082, -14.28571429,
       -13.46938776, -12.65306122, -11.83673469, -11.02040816,
       -10.20408163,  -9.3877551 ,  -8.57142857,  -7.75510204,
        -6.93877551,  -6.12244898,  -5.30612245,  -4.48979592,
        -3.67346939,  -2.85714286,  -2.04081633,  -1.2244898 ,
        -0.40816327,   0.40816327,   1.2244898 ,   2.04081633,
         2.85714286,   3.67346939,   4.48979592,   5.30612245,
         6.12244898,   6.93877551,   7.75510204,   8.57142857,
         9.3877551 ,  10.20408163,  11.02040816,  11.83673469,
        12.65306122,  13.46938776,  14.28571429,  15.10204082,
        15.91836735,  16.73469388,  17.55102041,  18.36734694,
        19.18367347,  20.        ])
In [15]:
#we will pass each number through the sigmoid function
results = sigmoid(numbers)
results[-10:]  #print the last few values; large positive inputs map to values close to 1
Out[15]:
array([ 0.9999968 ,  0.99999859,  0.99999938,  0.99999972,  0.99999988,
        0.99999995,  0.99999998,  0.99999999,  1.        ,  1.        ])

As you can see, every value is squashed into the range (0, 1): large positive inputs approach 1 and large negative inputs approach 0.
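To visualize this squashing behaviour, we could also plot the curve; a small illustrative addition, assuming matplotlib is installed:

import matplotlib.pyplot as plt

plt.plot(numbers, results)
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.show()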

Now, we will implement logistic regression using sklearn.

1 - Load the dataset

In [18]:
#this time we will use the digits dataset.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline

digits = load_digits()
X = digits.data  #input
y = digits.target #output
print(digits.data.shape)  #1797 samples * 64 (8*8)pixels
#input is an image and we would like to train a model which can predict the digit that image contains
#each image is of 8 * 8 pixels

#plot a few digits (don't worry if you don't understand the plotting code)
fig = plt.figure()
plt.gray()
ax1 = fig.add_subplot(231)
ax1.imshow(digits.images[0])

ax2 = fig.add_subplot(232)
ax2.imshow(digits.images[1])

ax3 = fig.add_subplot(233)
ax3.imshow(digits.images[2])

plt.tight_layout()
plt.show()
(1797, 64)

3 - Train a model

Since we don't need to preprocess this dataset, we will move directly to the third step.

In [19]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
#train a model
log_reg.fit(X, y)
Out[19]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

4 - Test a model

In [20]:
#sklearn provides several ways to test a classifier
from sklearn.metrics import accuracy_score
accuracy_score(y, log_reg.predict(X))
Out[20]:
0.99332220367278801
In [21]:
#another way
log_reg.score( X, y)
Out[21]:
0.99332220367278801

Please recall that it is not a good idea to test a model on the training dataset. As you can see, we are getting almost 100% accuracy, and the reason is that we are testing the model on the very dataset on which we trained it. It is like getting the same questions in your exam paper that you practiced during the lecture: of course you will get full marks.

In [22]:
#a confusion matrix is a table that can be used to evaluate the performance of a classifier
#each row corresponds to the actual class and each column to the predicted class
from sklearn.metrics import confusion_matrix
confusion_matrix(y, log_reg.predict(X))
Out[22]:
array([[178,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0, 179,   0,   1,   0,   0,   0,   0,   2,   0],
       [  0,   0, 177,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0, 183,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0, 181,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0, 182,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0, 181,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0, 179,   0,   0],
       [  0,   5,   0,   1,   0,   0,   0,   0, 168,   0],
       [  0,   0,   0,   1,   0,   0,   0,   0,   2, 177]])

A confusion matrix is a table that can be used to evaluate the performance of a classifier. Each row corresponds to the actual class and each column to the predicted class. For example, the digit 9 appears 180 times in the dataset, and our model predicted it correctly 177 times (the remaining 3 samples were misclassified as 3 and 8).
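If you want per-class precision, recall and F1 scores instead of raw counts, scikit-learn also provides classification_report; a quick sketch, not part of the original notebook:

from sklearn.metrics import classification_report
print(classification_report(y, log_reg.predict(X)))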

Deploy a model

In [36]:
#we can use predict method to predict the class
print("Predicted : " , log_reg.predict(digits.data[1].reshape(1,-1)))
print("Actual : ", digits.target[1])
Predicted :  [1]
Actual :  1
In [35]:
#we can also predict the probability of each class
proba = log_reg.predict_proba(digits.data[1].reshape(1,-1)) # second column has the highest probability
print(proba)
np.argmax(proba) #please note index starts with 0
[[  4.75045461e-18   9.99447460e-01   7.00809699e-10   3.72475330e-09
    2.15616661e-06   1.35167550e-09   5.71303497e-10   1.95595337e-13
    5.50377100e-04   5.64607392e-10]]
Out[35]:
1

We cannot evaluate our model using the training examples, since we trained on them and it is highly likely that the model will produce the correct output for them. In practice, we divide our dataset into two parts: a training set and a test set. We use the test set to evaluate our model. Let's see how to implement this in scikit-learn.

In [25]:
from sklearn.datasets import load_iris #https://en.wikipedia.org/wiki/Iris_flower_data_set
iris = load_iris()
iris.data.shape 
Out[25]:
(150, 4)
In [27]:
#split the dataset into two parts
from sklearn.model_selection import train_test_split
#split dataset into 70-30
X_train, X_test, y_train , y_test = train_test_split(iris.data, iris.target, test_size= 0.3, random_state=42)
#random_state makes sure we get the same split each time we run this code
In [28]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(105, 4)
(105,)
(45, 4)
(45,)
In [29]:
#train on training data
model = LogisticRegression()
model.fit(X_train, y_train)
Out[29]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [34]:
#test on test data
model.score(X_test, y_test)
Out[34]:
0.97777777777777775

Overfitting

In the theory part, I explained what overfitting is and how badly it can affect our model. Next, we will use various methods, such as regularization and cross-validation, to prevent overfitting.

Regularization

By default, scikit-learn's LogisticRegression uses l2 regularization with C=1. Please note that C is the inverse of the regularization strength, so smaller values of C mean stronger regularization. We can tweak these parameters and play around.

In [40]:
model = LogisticRegression(penalty="l2", C=1) #default configuration
model.fit(X_train, y_train)
model.score(X_test, y_test) #note, we get the same accuracy as before
Out[40]:
0.97777777777777775
In [42]:
#let us use l1 regularization
model = LogisticRegression(penalty="l1", C=1)
model.fit(X_train, y_train)
model.score(X_test, y_test) #whoa! we got 100% accuracy
Out[42]:
1.0
In [52]:
model = LogisticRegression(penalty="l2", C=0.23)
model.fit(X_train, y_train)
model.score(X_test, y_test) 

#you have to try various values of this kind of parameter (a hyperparameter) to find the best one
#we can do this with GridSearchCV or RandomizedSearchCV - a full treatment is beyond the scope of this lab session, but see the sketch below
Out[52]:
0.91111111111111109
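A full hyperparameter search is beyond the scope of this session, but just to show its shape, here is a minimal sketch with GridSearchCV; the grid values are arbitrary choices of mine:

from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)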

Cross-validation

In [84]:
#we discussed k-fold cross-validation, in which we divide the whole dataset into k parts and each time hold out
#one part for testing and train on the remaining k-1 parts

#we'll use boston housing dataset
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5) #k=5

costs = []
for train_index,test_index in kfold.split(data.data):
    X_train, y_train = data.data[train_index], data.target[train_index]
    X_test, y_test = data.data[test_index], data.target[test_index]
    model = LinearRegression()
    model.fit(X_train, y_train)
    costs.append(mean_squared_error(y_test, model.predict(X_test)))
In [85]:
np.mean(costs)
Out[85]:
37.222843637138403

Notice that this cross-validated MSE (about 37.2) is higher than the MSE of about 21.9 we computed earlier on the training data; the training error is usually an optimistic estimate of how the model will perform on unseen data.
In [75]:
#10 fold cross-validation
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
digits = load_digits()


model = LogisticRegression()
scores = cross_val_score(model,digits.data, digits.target, cv=10, scoring='accuracy' )
scores.mean()
Out[75]:
0.93102983468390121

For classification tasks, it is recommended to use a variant of KFold called StratifiedKFold, which preserves the percentage of samples of each class in every fold.

In [91]:
from sklearn.model_selection import StratifiedKFold
digits = load_digits()
skfold = StratifiedKFold(n_splits= 10)
costs = []
for train_index,test_index in skfold.split(digits.data, digits.target):
    X_train, y_train = digits.data[train_index], digits.target[train_index]
    X_test, y_test = digits.data[test_index], digits.target[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    costs.append(model.score(X_test, y_test))
In [92]:
np.mean(costs)
Out[92]:
0.93102983468390121

This is exactly the same number as cross_val_score gave above, because for classifiers cross_val_score with an integer cv uses stratified folds by default.

If you have any queries or feedback, kindly write to me at vsavan7@gmail.com. You can follow me on Twitter (@savanvisalpara7) to get updates on new parts of this series.
