In this part, we discussed what machine learning is, the types of machine learning, linear regression, logistic regression, cross-validation, and overfitting. In this lab session, I will demonstrate these concepts in Python code. Python is a widely used programming language in the field of scientific computing, thanks to awesome libraries such as numpy, scikit-learn, and matplotlib. We are going to use these libraries in the lab sessions. Check out the GitHub repository of this series here.
We will start with a very simple algorithm called linear regression. In the blog post, I explained in depth what linear regression is and how it works. In this session, we will focus on implementation rather than theory. We will follow the standard procedure for training machine learning models.
In practice, most of our time is spent getting the dataset ready for a model, that is, preprocessing it. Here, I will use a preprocessed dataset.
We will use a Python library called scikit-learn, which is the most widely used machine learning library. For the installation process, please visit the scikit-learn website. You can install it with pip: pip install -U scikit-learn
#sklearn comes with a few small datasets. We will use one of them, called "Boston housing",
#which is identical to the example we saw in the theory part. This dataset has 506 samples
#with 13 features (columns). Here, the target variable is the price of the house.
#import the libs
from sklearn.datasets import load_boston
#load the dataset
data = load_boston() #returns dictionary-like object, attributes are - data, target, DESCR
#first of all, let's see the shape of the training data
print(data.data.shape)
#shape of a target/labels
print(data.target.shape)
#important info about the dataset
print(data.DESCR)
#how target values look like
data.target[:40]
Since this dataset is already preprocessed, we don't have to do anything in this phase.
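If it were not, a common first step would be feature scaling. Just for illustration (we will not use this here; the variable names are my own), a minimal sketch with scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#scale each feature to zero mean and unit variance
X_scaled = scaler.fit_transform(data.data)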
from sklearn.linear_model import LinearRegression
#create a linear regression object
lin_reg = LinearRegression()
#train a model
lin_reg.fit(data.data, data.target)
#learned weights
lin_reg.coef_
#learned intercept
lin_reg.intercept_
#we can use the model to predict as follows
lin_reg.predict(data.data[4].reshape(1,-1)) #fifth sample (index 4)
#let's see what the true value was
data.target[4] #not good :(
#find mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(data.target, lin_reg.predict(data.data))
#let us calculate MSE from scratch to make sure it's correct
import numpy as np
np.mean((lin_reg.predict(data.data) - data.target) ** 2)
We can use the predict method to predict the price of a house.
As you can see, the main benefit of these libraries is that we do not have to worry about the internal algorithms; the library does that work for us. Later in this session, I will create some visualizations to make the concepts clearer.
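For the curious, here is a rough sketch of one way the weights can be computed by hand: the least-squares solution w = (XᵀX)⁻¹ Xᵀ y, with a column of ones appended for the intercept. The variable names below are my own, and scikit-learn's internal solver differs in details, but the result should closely match lin_reg.coef_ and lin_reg.intercept_.
#append a bias column of ones to the input
X_b = np.c_[np.ones((data.data.shape[0], 1)), data.data]
#solve the normal equation, using a pseudo-inverse for numerical stability
w = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ data.target
print(w[0])   #compare with lin_reg.intercept_
print(w[1:])  #compare with lin_reg.coef_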
Logistic regression is a classification algorithm. In the theory session, I explained how it works. The sigmoid function is the core of this algorithm. We can implement this function in NumPy as follows:
def sigmoid(x):
    return 1 / (1 + np.exp(-x)) #note the minus sign in the exponent
numbers = np.linspace(-20,20,50) #generate 50 evenly spaced numbers between -20 and 20
numbers
#we will pass each number through sigmoid function
results = sigmoid(numbers)
results[:10] #print few numbers
As you can see, all the numbers are squashed into the range [0, 1].
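If you want to see the characteristic S-shaped curve, here is a quick plot of the values we just computed (we import matplotlib here; it is imported again for the digits example below):
import matplotlib.pyplot as plt
plt.plot(numbers, results)
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.show()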
Now, we will implement logistic regression using sklearn.
#this time we will use the digits dataset.
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline
digits = load_digits()
X = digits.data #input
y = digits.target #output
print(digits.data.shape) #1797 samples * 64 (8*8) pixels
#input is an image and we would like to train a model which can predict the digit that image contains
#each image is 8 * 8 pixels
#plot a few digits (don't worry if you don't understand this code)
fig = plt.figure()
plt.gray()
ax1 = fig.add_subplot(231)
ax1.imshow(digits.images[0])
ax2 = fig.add_subplot(232)
ax2.imshow(digits.images[1])
ax3 = fig.add_subplot(233)
ax3.imshow(digits.images[2])
plt.tight_layout()
plt.show()
Since we don't need to preprocess our dataset, we will move directly to the third step.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
#train a model
log_reg.fit(X, y)
#sklearn provides several ways to test a classifier
from sklearn.metrics import accuracy_score
accuracy_score(y, log_reg.predict(X))
#another way
log_reg.score(X, y)
Please recall that it's not a good idea to test a model on the dataset it was trained on. As you can see, we are getting almost 100% accuracy, and the reason is that we are testing the model on the very dataset on which we trained it. It's like getting the same questions in your exam that you practiced during the lecture: you will certainly get full marks.
#confusion matrix is a table that can be used to evaluate the performance of a classifier
#each row shows actual values and each column shows predicted values
from sklearn.metrics import confusion_matrix
confusion_matrix(y, log_reg.predict(X))
A confusion matrix is a table that can be used to evaluate the performance of a classifier. Each row shows the actual values and each column shows the predicted values. For example, the digit 9 appears 180 times in the dataset, but our model predicted it correctly only 177 times.
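As a quick illustration of how to read it, we can compute the per-class recall (the diagonal divided by the row sums) from the matrix; the variable names below are my own:
cm = confusion_matrix(y, log_reg.predict(X))
#fraction of each actual class that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)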
#we can use the predict method to predict the class
print("Predicted : " , log_reg.predict(digits.data[1].reshape(1,-1)))
print("Actual : ", digits.target[1])
#we can also predict the probability of each class
proba = log_reg.predict_proba(digits.data[1].reshape(1,-1)) # second column has the highest probability
print(proba)
np.argmax(proba) #please note index starts with 0
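As a quick sanity check, the predicted probabilities over all ten classes should sum to 1:
proba.sum() #approximately 1.0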
We cannot evaluate our model using training examples, since we trained it on them and it is highly likely that the model will produce the correct output. In practice, we divide our dataset into two parts: a training part and a testing part. We use the test dataset to evaluate our model. Let's see how to implement this in scikit-learn.
from sklearn.datasets import load_iris #https://en.wikipedia.org/wiki/Iris_flower_data_set
iris = load_iris()
iris.data.shape
#split the dataset into two parts
from sklearn.model_selection import train_test_split
#split the dataset 70-30
X_train, X_test, y_train , y_test = train_test_split(iris.data, iris.target, test_size= 0.3, random_state=42)
#random_state - ensures we get the same split each time we run this code
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
#train on training data
model = LogisticRegression()
model.fit(X_train, y_train)
#test on test data
model.score(X_test, y_test)
In the theory part, I explained what overfitting is and how badly it can affect our model. Next, we will use various methods, such as regularization and cross-validation, to prevent overfitting.
By default, scikit-learn uses l2 regularization with C=1. Please note that C is the inverse of the regularization strength. We can tweak these parameters and play around.
model = LogisticRegression(penalty="l2", C=1) #default configuration
model.fit(X_train, y_train)
model.score(X_test, y_test) #note, we got same accuracy
#let us use l1 regularization
model = LogisticRegression(penalty="l1", C=1)
model.fit(X_train, y_train)
model.score(X_test, y_test) #whoa! we got 100% accuracy
model = LogisticRegression(penalty="l2", C=0.23)
model.fit(X_train, y_train)
model.score(X_test, y_test)
#you have to try various values for these kinds of parameters (hyperparameters) to find the best ones
#we can do this with GridSearchCV or RandomizedSearchCV - this is beyond the scope of this lab session
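Still, for the curious, here is a minimal sketch of what a grid search over C could look like (the candidate values below are arbitrary examples, not recommendations):
from sklearn.model_selection import GridSearchCV
#try each candidate value of C with 5-fold cross-validation and keep the best
grid = GridSearchCV(LogisticRegression(penalty="l2"), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)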
#we discussed k-fold cross-validation, in which we divide the whole dataset into k parts and
#each time hold out one part and train on the remaining k-1 parts
#we'll use boston housing dataset
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5) #k=5
costs = []
for train_index, test_index in kfold.split(data.data):
    X_train, y_train = data.data[train_index], data.target[train_index]
    X_test, y_test = data.data[test_index], data.target[test_index]
    model = LinearRegression()
    model.fit(X_train, y_train)
    costs.append(mean_squared_error(y_test, model.predict(X_test)))
np.mean(costs)
#10-fold cross-validation
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
digits = load_digits()
model = LogisticRegression()
scores = cross_val_score(model, digits.data, digits.target, cv=10, scoring='accuracy')
scores.mean()
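Besides the mean, it is worth looking at the individual fold scores and their spread:
print(scores)       #accuracy for each of the 10 folds
print(scores.std()) #variation across folds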
For classification tasks, it is recommended to use a variant of KFold called StratifiedKFold, which preserves the percentage of samples for each class.
from sklearn.model_selection import StratifiedKFold
digits = load_digits()
skfold = StratifiedKFold(n_splits=10)
costs = []
for train_index, test_index in skfold.split(digits.data, digits.target):
    X_train, y_train = digits.data[train_index], digits.target[train_index]
    X_test, y_test = digits.data[test_index], digits.target[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    costs.append(model.score(X_test, y_test))
np.mean(costs)
If you have any queries or feedback, kindly write to me at vsavan7@gmail.com. You can follow me on Twitter (@savanvisalpara7) to get updates on new parts of this series.