BLOG - SAVAN VISALPARA

[ Musings on Machine Learning ]
Posted on 27 May, 2017

Practical Machine Learning With Python - Part 2

In the previous post, I explained what machine learning is, the types of machine learning, linear regression, logistic regression, issues we need to watch out for such as overfitting, and finally what "learning" really means in machine learning. In the lab session, I showed how to implement the algorithms and concepts from the theory session in Python. I have also uploaded the Jupyter notebooks on GitHub. Check out this GitHub repository.

In this session, I will explain a few simple yet powerful machine learning algorithms: naive Bayes, support vector machines and decision trees. From now on, I will not split each part into separate theory and lab sessions. Instead, I will integrate the theory with code in a Jupyter notebook. If you are unfamiliar with Jupyter notebooks, please go through the Jupyter Notebook Basics Guide.

Index

- Naive Bayes
- Support Vector Machines
- Decision Tree
- Ensemble Methods
- Exercise

Naive Bayes

Naive Bayes is a supervised learning algorithm based on Bayes' theorem, and it is a widely used classification algorithm. The word "naive" comes from the assumption of independence among features: given a feature vector (x1, x2,...,xn), the xi's are assumed to be conditionally independent given y. We can write Bayes' theorem as follows:

\begin{align} P( y | x ) = \frac{P(y)P(x | y)}{P(x)} \end{align} where,

P(x) is the marginal probability of the features (the evidence).
P(x | y) is the probability of the features given the target, also known as the likelihood.
P(y) is the prior probability of the target (the class, in classification).
P(y | x) is the posterior probability of the target given the features.

When we have more than one feature, we can rewrite this equation as:

\begin{align} P( y | x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n | y)}{P(x_1,...,x_n)} \end{align}

Consider spam classification as an example: the input or feature vector is a set of words and the output is spam or ham (1 or 0). In naive Bayes, we calculate the probability of each class (spam or ham) given the feature vector, and the class with the maximum probability becomes our output. Our task is to solve the above equation for each class. Now, let us dig deeper into this equation and see how we can use it to find the probability of each class.
Using the naive Bayes assumption we can write:

\begin{align} P(x_i | y, x_1,..,x_{i-1},x_{i+1},..,x_n) = P(x_i | y) \end{align} We can rewrite Bayes' theorem as follows:

\begin{align} P( y | x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i| y)}{P(x_1,...,x_n)} \end{align}

But P(x1, x2, ..., xn) is constant given the input, so we can say that

\begin{align} P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y)\end{align}
For the classification rule (where we want the class with maximum probability), we can write the equation as:

\begin{align} \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y) \end{align}
Now, we can use Maximum a Posteriori (MAP) estimation to estimate both P(y) and P(xi|y). Here, P(y) = (number of samples with class y) / (total number of samples), in other words, the relative frequency of class y in the training data.
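To make the classification rule concrete, here is a minimal NumPy sketch (purely illustrative, not a full implementation) that estimates P(y) from class frequencies, models each P(xi|y) as a Gaussian, and picks the class with the highest log posterior:

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# estimate P(y), and a Gaussian P(x_i | y) for every feature/class pair
priors = {c: np.mean(y == c) for c in classes}
means = {c: X[y == c].mean(axis=0) for c in classes}
stds = {c: X[y == c].std(axis=0) for c in classes}

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(x):
    # argmax over y of [ log P(y) + sum_i log P(x_i | y) ]
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], stds[c]).sum() for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([5.1, 3.5, 1.4, 0.2])))  # should print 0 (setosa)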

We can build several variants of naive Bayes by using different distributions for P(xi|y). Widely used variants are Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes. Now, let us implement naive Bayes in scikit-learn.

In [1]:
#we will use iris dataset
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
import numpy as np

#load the dataset
data = load_iris()

model = GaussianNB()
model.fit(data.data, data.target)

#evaluate on the training data
print(model.score(data.data, data.target))

#predict (note the 2-D input: one row = one sample)
print(model.predict([[4.2, 3, 0.9, 2.1]])) #0 = setosa, 1 = versicolor, and 2 = virginica
0.96
[1]
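The other variants follow the same API. Here is a quick sketch (illustrative only, results not shown) of swapping in the multinomial and Bernoulli variants:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

data = load_iris()

# MultinomialNB expects non-negative counts/frequencies (commonly word counts);
# the iris measurements happen to be non-negative, so this runs, but the model
# is a more natural fit for text data.
print(MultinomialNB().fit(data.data, data.target).score(data.data, data.target))

# BernoulliNB binarizes each feature around a threshold (the default is 0.0)
print(BernoulliNB().fit(data.data, data.target).score(data.data, data.target))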

Support Vector Machines

Support Vector Machines are supervised learning models which can be used for both classification and regression, and they are among the best off-the-shelf supervised learning algorithms. They are effective in high-dimensional spaces and memory efficient as well. Consider a binary classification problem, where the task is to assign one of two labels to a given input. We plot each data item as a point in n-dimensional space as follows:

[Figure: data points of two classes with several candidate separating hyperplanes]

We can perform classification by finding a hyperplane that separates the two classes well. As you can see in the above image, we can draw many such hyperplanes. How do we find the best one? We find the optimal hyperplane by maximizing the margin.

[Figure: the maximum-margin hyperplane with its margin and support vectors highlighted]

We define the margin as twice the distance between the hyperplane and the sample points nearest to it. These points are known as support vectors; they are called that because they "hold up" the optimal hyperplane. In the above figure, support vectors are drawn with filled color. Consider the first hyperplane in figure 1, which touches two of the (red) sample points. Although it classifies all the examples correctly, it lies close to many sample points, so other red examples might easily fall on the wrong side of it. This problem is solved by choosing the hyperplane that is farthest away from the sample points, and it turns out that this kind of model generalizes very well. This optimal hyperplane is also known as the maximum margin separator.

We know that we want the hyperplane with maximum margin, and we discussed why. Now, let us see how to find this optimal hyperplane. Before that, note that in SVMs we represent the class labels as +1 and -1 instead of 0 and 1. We write the optimal hyperplane and the negative and positive margin hyperplanes (the dashed lines) as the linear equations wTx + b = 0, wTx + b = -1 and wTx + b = +1 respectively; the left-most dashed line is the negative hyperplane. We denote the red points by x- and the blue points by x+. To derive an expression for the margin, let us subtract the equations of the negative and positive hyperplanes from each other:

\begin{align} w^T(x_+ - x_-) = 2 \end{align}

Dividing both sides by the length ||w|| of the vector w to normalize,
\begin{align} \frac{w^T(x_+ - x_-)}{||w||} = \frac{2}{||w||} \end{align}

where 2/||w|| is the margin. The objective of the SVM now becomes maximizing the margin under the constraint that the samples are classified correctly: \begin{align} w^T x^{(i)} + b \geq +1 \hspace{1cm} if \hspace{1cm} y^{(i)} = +1 \newline w^T x^{(i)} + b \leq -1 \hspace{1cm} if \hspace{1cm} y^{(i)} = -1 \end{align}

This can be written more compactly as \begin{align} y^{(i)} ( w^T x^{(i)} + b) \geq 1 \hspace{0.5cm} \forall i \end{align}

In practice, it is easier to minimize the term \begin{align} \frac{1}{2} ||w||^2 \end{align} (minimizing this is equivalent to maximizing the margin 2/||w||). This is a quadratic programming problem with linear constraints.
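As a quick illustration (a sketch on a toy dataset, not part of the original notebook), once a linear SVM is fit we can read w and b off the model and compute the margin width 2/||w||:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated blobs, so a near-hard margin exists
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.6)

clf = SVC(kernel="linear", C=1000)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]       # weight vector of the separating hyperplane
b = clf.intercept_[0]  # bias term
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)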

In the case of inherently noisy data, we may not want a linear hyperplane in a high-dimensional space. Rather, we would like a decision surface in a low-dimensional space that does not cleanly separate the classes, but reflects the reality of the noisy data. That is possible with the soft margin classifier, which allows examples to fall on the wrong side of the decision boundary, but assigns them a penalty proportional to the distance required to move them back to the correct side. In the soft margin classifier, we add slack variables to the linear constraints.

\begin{align} y^{(i)} (w^T x^{(i)} + b ) \geq 1 - \xi^{(i)} \hspace{1cm} for \hspace{1cm} i = 1,..,N \end{align} Now, our objective to minimize becomes

\begin{align} \frac{1}{2} ||w||^2 + C \sum_{i} \xi^{(i)} \end{align} C is the regularization parameter. A small C allows the constraints to be easily ignored and results in a large margin, whereas a large C makes the constraints hard to ignore and results in a narrow margin. This is still a quadratic optimization problem with a unique minimum; we will see the effect of C on the number of support vectors in a small sketch further below. Now let us implement an SVM classifier in Python using sklearn, on the iris dataset.

In [2]:
#import the dependencies
from sklearn.datasets import load_iris
from sklearn.svm import SVC
#load dataset
dataset = load_iris()
data = dataset.data
target = dataset.target
In machine learning, we usually need some preprocessing to make our dataset suitable for the learning algorithm. I will introduce a few preprocessing techniques as we go through the various algorithms. Here, we look at feature scaling, which is often needed for good performance: it standardizes the range of the features. (Note that the cells below continue with the unscaled data, matching the outputs shown.)
In [3]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit_transform(data) #check out preprocessing module of sklearn to learn more about preprocessing in ML
Out[3]:
array([[ -9.00681170e-01,   1.03205722e+00,  -1.34127240e+00,  -1.31297673e+00],
       [ -1.14301691e+00,  -1.24957601e-01,  -1.34127240e+00,  -1.31297673e+00],
       [ -1.38535265e+00,   3.37848329e-01,  -1.39813811e+00,  -1.31297673e+00],
       ...,
       [  7.95669016e-01,  -1.24957601e-01,   8.19624347e-01,   1.05353673e+00],
       [  4.32165405e-01,   8.00654259e-01,   9.33355755e-01,   1.44795564e+00],
       [  6.86617933e-02,  -1.24957601e-01,   7.62758643e-01,   7.90590793e-01]])
(150 rows of 4 standardized features; full output truncated for readability)
In [4]:
#now let us divide data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42, test_size=0.3)
In [5]:
#train a model
model = SVC()
model.fit(X_train,y_train)
Out[5]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [6]:
from sklearn.metrics import accuracy_score
In [7]:
accuracy_score(y_test, model.predict(X_test))
Out[7]:
1.0
In [8]:
model.support_vectors_.shape
Out[8]:
(39, 4)
In [9]:
model.support_vectors_
Out[9]:
array([[ 5.1,  3.8,  1.9,  0.4],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5.5,  4.2,  1.4,  0.2],
       [ 4.5,  2.3,  1.3,  0.3],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 5.9,  3.2,  4.8,  1.8],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 7.7,  2.8,  6.7,  2. ],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 5.7,  2.5,  5. ,  2. ],
       [ 6. ,  3. ,  4.8,  1.8],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 6.1,  2.6,  5.6,  1.4],
       [ 6. ,  2.2,  5. ,  1.5],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 6.2,  2.8,  4.8,  1.8],
       [ 6.9,  3.1,  5.4,  2.1],
       [ 6.5,  3. ,  5.2,  2. ],
       [ 7.2,  3. ,  5.8,  1.6],
       [ 5.6,  2.8,  4.9,  2. ],
       [ 5.9,  3. ,  5.1,  1.8],
       [ 4.9,  2.5,  4.5,  1.7]])
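Coming back to the regularization parameter C discussed above, here is a small sketch (illustrative, reusing X_train/X_test from the split above; outputs not shown) of how C changes the number of support vectors. A smaller C gives a wider, softer margin and typically more support vectors:

for C in [0.01, 1, 100]:
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    print(C, "support vectors:", clf.support_vectors_.shape[0],
          "test accuracy:", clf.score(X_test, y_test))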

So far we have seen problems where the input data can be separated by a linear hyperplane. But what if the data points are not linearly separable, as shown below?

[Figure: data points of two classes that are not linearly separable]

To solve this type of problem, where the data cannot be separated linearly, we add a new feature. For example, let us add the feature z = x^2 + y^2. Now, if we plot the data points on the x and z axes we get:

[Figure: the same data plotted against x and z = x^2 + y^2, now linearly separable]

As you can see, now there is a linear hyperplane that separates the data points very well. Do we need to add this additional feature manually? The answer is no: we use a technique called the kernel trick. The kernel trick uses functions that take the low-dimensional input space and implicitly work in a high-dimensional space where the data points are linearly separable, without ever computing the transformation explicitly. These functions are called kernels. Widely used kernels are the Radial Basis Function (RBF) kernel, the polynomial kernel, the sigmoid kernel, etc.
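To see the idea behind the trick, here is a small sketch (illustrative only) that adds the feature z = x^2 + y^2 by hand on a ring-shaped dataset and fits a linear SVM on the augmented features:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: not linearly separable in the original (x, y) space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# add the hand-crafted feature z = x^2 + y^2
z = (X ** 2).sum(axis=1, keepdims=True)
X_aug = np.hstack([X, z])

print("linear SVM on (x, y):   ", SVC(kernel="linear").fit(X, y).score(X, y))
print("linear SVM on (x, y, z):", SVC(kernel="linear").fit(X_aug, y).score(X_aug, y))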

Rather than adding features by hand, let us use a kernel in sklearn.

In [10]:
#we have already imported libs and dataset
model2= SVC(kernel="rbf", gamma=0.2)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)
Out[10]:
1.0

We get different decision boundaries for different kernels and gamma values. Here is a screenshot from the scikit-learn website.

[Figure: scikit-learn comparison of SVM decision boundaries for different kernels and parameters]
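A quick way to get a feel for this on our own split (a sketch reusing X_train/X_test from above; the exact scores will vary with the kernel and its parameters):

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="auto")
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))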

Decision Tree

A Decision Tree is a supervised learning algorithm which can be used for classification as well as regression problems. Decision trees are very popular because of their interpretability. In this method, we split the population into a set of homogeneous subsets by asking a series of questions. Consider a problem where we want to decide what to do on a particular day. We can design a decision tree as follows (source: Python Machine Learning by Sebastian Raschka):

[Figure: example decision tree for deciding what to do on a particular day (from Python Machine Learning by Sebastian Raschka)]

There are two types of nodes in the above figure: nodes with a gray background and nodes with a white background. The gray nodes are terminal (leaf) nodes. Each edge represents a decision, e.g., is the weather sunny or not? In the figure there are four possible classes and each terminal node represents one of them. The features in this example are work (yes/no), weather (e.g. sunny), and friends busy? (yes/no). Once we have a decision tree (we haven't seen how to construct one yet) we can predict the outcome. For example, if our input is work = no, weather = rainy, friends busy? = no, then we opt for "go to movies".

But how do we construct a decision tree? We start at the root node and split the dataset on the feature that results in the largest information gain. We repeat this splitting at each child node until the nodes are pure, meaning they contain samples of only one class. We need a way to compute the impurity at each node; before going into that, let us first define information gain. Our objective is to maximize the information gain at each split, and we define it as follows:

\begin{align} IG(D_p,f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j) \end{align} Here, f is the feature used for the split, Dp and Dj are the datasets of the parent and the jth child node, I is the impurity measure (e.g. entropy or the Gini index), Np is the total number of samples at the parent node and Nj is the number of samples in the jth child node. Information gain is simply the difference between the impurity of the parent node and the weighted sum of the child node impurities. However, for simplicity and to reduce the combinatorial search space, most machine learning libraries implement binary decision trees, where each parent node has two child nodes, Dleft and Dright. The information gain then becomes:

\begin{align} IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right}) \end{align}
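As a small illustration (a sketch using the entropy impurity defined formally in the next paragraph), here is the information gain of a binary split computed directly from the class labels of the parent and the two children:

import numpy as np

def entropy_of(labels):
    # impurity of a node from the class proportions of the samples it holds
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    n = len(parent)
    return (entropy_of(parent)
            - len(left) / n * entropy_of(left)
            - len(right) / n * entropy_of(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])                    # 4 samples of each class
perfect = information_gain(parent, parent[:4], parent[4:])     # children are pure
useless = information_gain(parent, parent[::2], parent[1::2])  # children are 50/50
print(perfect, useless)   # expect 1.0 and 0.0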

Commonly used impurity measures (splitting criteria) are Gini impurity, entropy and classification error. Here, we will only discuss entropy. For all non-empty classes (p(i | t) not equal to zero), entropy is defined as


\begin{align} I_H(t) = -\sum_{i=1}^{C} p(i | t) log_2 p(i | t) \end{align}

where p(i | t) is the proportion of samples that belong to class i at a particular node t, and C is the number of classes. The entropy is 0 if all samples at a node belong to the same class, and it is maximal if the classes are distributed uniformly. For example, if we have 100 samples and 2 classes, the entropy is maximal when 50 samples belong to one class and the remaining 50 belong to the other.

In [11]:
#entropy in numpy
import numpy as np
def entropy(p):
    return -p * np.log2(p) - (1-p) * np.log2((1-p)) #for binary class
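A quick plot of this function (a sketch, assuming matplotlib is available) reproduces the curve shown below:

import matplotlib.pyplot as plt
p = np.linspace(0.01, 0.99, 100)   # avoid log2(0) at the endpoints
plt.plot(p, entropy(p))
plt.xlabel("p(i=1 | t)")
plt.ylabel("entropy")
plt.show()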

[Figure: entropy of a two-class node as a function of p(i=1 | t), maximal at 0.5]

As you can see, the entropy is maximal when p(i=1 | t) = p(i=0 | t) = 0.5, and it is minimal when all the samples belong to the same class. We define Gini impurity as:

\begin{align} I_g(t) = \sum_{i=1}^{C} p(i | t) (1 - p(i | t)) \end{align}
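For comparison, a minimal sketch of the Gini impurity in the binary case, analogous to the entropy function above:

def gini(p):
    # binary case of sum_i p_i * (1 - p_i)
    return p * (1 - p) + (1 - p) * (1 - (1 - p))

for p in [0.0, 0.25, 0.5, 1.0]:
    print(p, gini(p))   # maximal at p = 0.5 (value 0.5), zero for pure nodes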

Now let us implement Decision Tree in sklearn.

In [12]:
#import libs and dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

dataset = load_iris()
X = dataset.data
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)

model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
#decision trees are prone to overfitting; that's why we remove some sub-nodes of the tree, which is called "pruning"
#here, we control the depth of the tree using the max_depth attribute
#the other option for criterion is "gini"
#random_state - just to make sure we get the same results each time we run this code

model.fit(X_train, y_train)
Out[12]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')
In [13]:
#test the model
model.score(X_test, y_test)
Out[13]:
0.97777777777777775

We can also export and save the decision tree our model has built. You need Graphviz and pydotplus installed on your system.

In [14]:
from sklearn.externals.six import StringIO  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO() 
export_graphviz(model, out_file=dot_data) 
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("iris.pdf") 
Out[14]:
True
In [15]:
from IPython.display import Image
Image(filename="./images/tree.png")
Out[15]:

Some advantages of decision trees: they are easy to understand because of their interpretability, they require less data cleaning, and they can handle both numerical and categorical data. One problem with decision trees is overfitting, which can be mitigated by pruning the tree.

Ensemble Methods

Until now, we discussed methods in which there was only one hypothesis; in other words, we trained a single model for each problem. In ensemble learning, we train a set of models to achieve better accuracy and stability. At prediction time we combine their predictions: for classification we take a vote over the models, and for regression we take the mean of their predictions. Running multiple models on the data and using the aggregate prediction is usually better than any single model.
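For instance, scikit-learn's VotingClassifier combines heterogeneous models by majority vote. A small sketch, illustrative only:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

voter = VotingClassifier(estimators=[
    ("lr", LogisticRegression()),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
], voting="hard")   # "hard" = majority vote over the predicted labels

voter.fit(X_train, y_train)
print(voter.score(X_test, y_test))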

Bagging (bootstrap aggregating) is a technique used to reduce the variance of a prediction by combining the results of multiple classifiers. In bagging, we first create multiple datasets by sampling from the original dataset with replacement (and, in some variants, by sampling random subsets of the features as well). We then build a classifier on each dataset, and for prediction we combine the predictions of all the models.
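A minimal sketch with scikit-learn's BaggingClassifier, which bags decision trees and can also subsample features (again illustrative, reusing the train/test split from the sketch above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=50,      # number of bootstrapped trees
    max_samples=1.0,      # each bootstrap sample is as large as the original set
    max_features=0.75,    # each tree sees a random 75% of the features
    random_state=42,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))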

Random Forest

Random forests are the most common type of ensemble learning; basically, a random forest is a collection of decision trees. We repeatedly sample the dataset with replacement (which is known as bootstrapping) and build a decision tree on each new sample. One thing to note is that each node of these decision trees only considers splits on a random subset of the features; if the features were not chosen randomly, the decision trees in our forest could become highly correlated. Random forests have many advantages: they are fast to train and require little input preparation. One disadvantage is that the resulting model may become quite large.


Each bootstrapped dataset leaves out roughly one third of the samples, which are therefore not used in building that particular tree. These are called out-of-bag samples, and the error estimated on them is known as the out-of-bag error.

In [16]:
from sklearn.ensemble import RandomForestClassifier
min_estimator = 30  #min number of trees to be built
max_estimator = 60  #max number of trees to be built
rf = RandomForestClassifier(criterion="entropy", warm_start=True, oob_score=True,random_state=42)
for i in range(min_estimator, max_estimator+1):
    rf.set_params(n_estimators=i)
    rf.fit(X, y)   #no separate test set needed; the out-of-bag samples give the error estimate
    oob_error = 1 - rf.oob_score_
    print(i, oob_error)
30 0.0333333333333
31 0.0333333333333
32 0.0333333333333
33 0.0333333333333
34 0.0333333333333
35 0.0333333333333
36 0.0333333333333
37 0.04
38 0.04
39 0.04
40 0.04
41 0.0333333333333
42 0.0333333333333
43 0.0333333333333
44 0.04
45 0.0466666666667
46 0.04
47 0.0333333333333
48 0.04
49 0.04
50 0.04
51 0.04
52 0.0333333333333
53 0.0333333333333
54 0.0466666666667
55 0.0466666666667
56 0.04
57 0.04
58 0.04
59 0.04
60 0.04
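Random forests also give a handy measure of which features mattered. A quick sketch (illustrative, using the fitted forest and the iris dataset loaded in the decision tree cells above):

for name, importance in zip(dataset.feature_names, rf.feature_importances_):
    print(name, round(importance, 3))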

Boosting

Boosting is a technique used to reduce bias as well as variance in supervised learning. Bias measures how far, on average, the predicted values are from the actual values, and variance measures how much the model's predictions at the same point would change if it were trained on different samples from the same population. Boosting refers to a family of machine learning algorithms which convert weak learners into a strong learner, inspired by the question: can a set of weak learners create a single strong learner? Gradient Boosting, AdaBoost and XGBoost are examples of boosting algorithms, and algorithms such as XGBoost are frequently used to win machine learning competitions. I will not go deeper into boosting algorithms here, but they are important. See here, here for AdaBoost, see here, here for Gradient Boosting, and see here for XGBoost.


AdaBoost in sklearn

In [30]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

#load the dataset
dataset = load_iris()

#split the dataset
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, random_state=42, test_size=0.33)
#optional - standardize inputs using StandardScaler
#instantiate a model
model = AdaBoostClassifier(n_estimators=150, random_state=42)
#train a model
model.fit(X_train, y_train)  

print("accuracy: ", model.score(X_test, y_test))
accuracy:  0.9

Gradient Boosting in sklearn

In [31]:
#so far we have used these algorithms for classification; for this one, let's do regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

#load the dataset
dataset = load_boston()

#split the dataset
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, random_state=42, test_size=0.33)
#optional - standardize inputs using StandardScaler
#instantiate a model
model = GradientBoostingRegressor(n_estimators=500,learning_rate=0.01, random_state=42)
#train a model
model.fit(X_train, y_train)  
from sklearn.metrics import mean_squared_error
print("Mean squared error: ", mean_squared_error(y_test, model.predict(X_test)))
Mean squared error:  8.85292814297
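One nice property of gradient boosting is that we can monitor the test error as trees are added. A short sketch (illustrative, reusing the model and split from the cell above) with staged_predict:

import numpy as np
# test error after each of the 500 boosting stages
test_errors = [mean_squared_error(y_test, y_pred)
               for y_pred in model.staged_predict(X_test)]
best_stage = int(np.argmin(test_errors))
print("best number of trees:", best_stage + 1, "MSE:", test_errors[best_stage])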

Exercise

In this exercise, we will implement a sentiment analysis model which can detect the sentiment from a text. We will also go through some feature extraction techniques and learn how to use textual data in machine learning models.
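As a starting point (a hypothetical toy example, not the exercise solution), text can be turned into count features with CountVectorizer and fed to a multinomial naive Bayes classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I loved this movie", "great acting and story",
         "terrible plot", "I hated it", "what a great film", "boring and bad"]
labels = [1, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative (made-up toy data)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # bag-of-words count matrix

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["great movie", "bad acting"])))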

If you have any query or feedback kindly write me at vsavan7@gmail.com. You can follow me on twitter (@savanvisalpara7) to get updates on new parts of this series.
