BLOG - SAVAN VISALPARA

[ Musings on Machine Learning ]
Posted on 10 June, 2017

Predicting survival on the Titanic using Logistic Regression

We will be using the Titanic dataset from Kaggle. To download the dataset or learn more about it, click here. If you are just a beginner in this field, check out my tutorials on machine learning: Part - 1, Part - 2 and Part - 3. You can find the GitHub repository here.

In [1]:
#import dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [2]:
#read data set
df = pd.read_csv("Datasets/Titanic/train.csv")
In [3]:
df.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
#prepare data set
X = pd.DataFrame()
X['Sex'] = df['Sex']
X['Age'] = df['Age']
X['Survived'] = df['Survived']
X = X.dropna(axis=0) #drop rows with missing values (Age has NaNs)
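Dropping rows this way discards every passenger with a missing Age (about 177 of the 891 rows). As an alternative sketch, assuming you want to keep all rows, you could impute the median age instead (not used in the rest of this post):

In [ ]:
#alternative sketch (not used below): keep all rows by imputing the median Age
#instead of dropping rows with missing values
X_alt = df[['Sex', 'Age', 'Survived']].copy()
X_alt['Age'] = X_alt['Age'].fillna(X_alt['Age'].median())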
In [5]:
X.head()
Out[5]:
Sex Age Survived
0 male 22.0 0
1 female 38.0 1
2 female 26.0 1
3 female 35.0 1
4 male 35.0 0
In [6]:
#separate the data and the target variable
y = X['Survived'] #save the target (dependent) variable first; once we drop it from X we cannot get it back
X = X.drop(['Survived'],axis=1)
In [7]:
#let's make sure
X.head()
Out[7]:
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
In [8]:
X['Sex'] = pd.get_dummies(X.Sex)['male'] #1 for male, 0 for female
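pd.get_dummies one-hot encodes the Sex column; keeping only the 'male' column gives a single 0/1 feature, which is all a binary category needs. A quick sketch to see what the encoder actually produces:

In [ ]:
#sketch: get_dummies creates one 0/1 column per category ('female' and 'male');
#keeping only the 'male' column avoids two redundant, perfectly correlated features
pd.get_dummies(df['Sex']).head()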
In [9]:
scaler = StandardScaler()
X = scaler.fit_transform(X)  #why is this needed? see http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
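Standardization gives each feature zero mean and unit variance, so the regularized logistic regression does not weight the larger-scale Age column unfairly against the 0/1 Sex column (see the scikit-learn link above). A minimal sketch to verify what the scaler did:

In [ ]:
#sketch: after standardization each column should have mean ~0 and std ~1
print(X.mean(axis=0), X.std(axis=0))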
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
In [20]:
from sklearn.linear_model import LogisticRegression
In [21]:
model = LogisticRegression()
model.fit(X_train, y_train)
Out[21]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [22]:
#checking accuracy on training dataset
model.score(X_train, y_train)
Out[22]:
0.79158316633266534
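Accuracy on the training set can be optimistic, so it is worth comparing it with the held-out split; a one-line sketch:

In [ ]:
#sketch: accuracy on the held-out test split, for comparison with the training score
print(model.score(X_test, y_test))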
In [25]:
pred = model.predict(X_test)
In [26]:
#a better metric for binary classification is the area under the ROC curve
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, pred)
print(auc)
0.740190832887
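Note that the score above is computed from hard 0/1 predictions. ROC AUC is normally computed from predicted probabilities, so it measures ranking quality across all possible decision thresholds; a sketch:

In [ ]:
#sketch: ROC AUC from predicted probabilities instead of hard class labels
probs = model.predict_proba(X_test)[:, 1]  #predicted probability of survival (class 1)
print(roc_auc_score(y_test, probs))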
In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
#The f1-score is the harmonic mean of precision and recall.
#The per-class scores tell you how well the classifier identifies data points
#of that particular class compared to all other classes.
#The support is the number of samples of the true response that lie in each class.
             precision    recall  f1-score   support

          0       0.77      0.82      0.80       126
          1       0.72      0.66      0.69        89

avg / total       0.75      0.75      0.75       215
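The precision and recall in the report come directly from the confusion matrix; a short sketch that makes the connection explicit for class 1:

In [ ]:
#sketch: derive precision and recall for class 1 (survived) from the confusion matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("precision:", tp / (tp + fp))
print("recall:", tp / (tp + fn))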

Exercise

Try another model, such as a Random Forest. Change the penalty or the regularization strength parameter (C) of the model. You can also perform other preprocessing on the dataset.
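As a starting point for the exercise, here is a minimal sketch that swaps in a random forest on the same train/test split; the hyperparameters are illustrative, not tuned:

In [ ]:
#sketch: same split with a random forest instead of logistic regression
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))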
