BLOG - SAVAN VISALPARA

[ Musings on Machine Learning ]
Posted on 10 June, 2017

Predicting survival on the Titanic using Logistic Regression

We will be using the Titanic dataset from Kaggle. To download the dataset or learn more about it, click here. If you are just a beginner in this field, check out my tutorials on machine learning: Part - 1, Part - 2 and Part - 3. You can find the GitHub repository here.

In [1]:
#import dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [2]:
#read data set
df = pd.read_csv("Datasets/Titanic/train.csv")
In [3]:
df.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
#prepare data set
X = pd.DataFrame()
X['Sex'] = df['Sex']
X['Age'] = df['Age']
X['Survived'] = df['Survived']
X = X.dropna(axis=0) #drop rows with missing values (Age has NaNs)
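Dropping rows this way discards every passenger with a missing Age (about 177 of the 891 rows). As an alternative sketch, assuming you want to keep all rows, you could impute the median age instead (not used in the rest of this post):

In [ ]:
#alternative sketch (not used below): keep all rows by imputing the median Age
#instead of dropping rows with missing values
X_alt = df[['Sex', 'Age', 'Survived']].copy()
X_alt['Age'] = X_alt['Age'].fillna(X_alt['Age'].median())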
In [5]:
X.head()
Out[5]:
Sex Age Survived
0 male 22.0 0
1 female 38.0 1
2 female 26.0 1
3 female 35.0 1
4 male 35.0 0
In [6]:
#separate the data and the target variable
y = X['Survived'] #save the target (dependent) variable first; once we drop it from X we cannot get it back
X = X.drop(['Survived'],axis=1)
In [7]:
#let's make sure
X.head()
Out[7]:
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
In [8]:
X['Sex'] = pd.get_dummies(X.Sex)['male'] #1 for male, 0 for female
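pd.get_dummies one-hot encodes the Sex column; keeping only the 'male' column gives a single 0/1 feature, which is all a binary category needs. A quick sketch to see what the encoder actually produces:

In [ ]:
#sketch: get_dummies creates one 0/1 column per category ('female' and 'male');
#keeping only the 'male' column avoids two redundant, perfectly correlated features
pd.get_dummies(df['Sex']).head()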
In [9]:
scaler = StandardScaler()
X = scaler.fit_transform(X)  #why is this needed? see http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
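Standardization gives each feature zero mean and unit variance, so the regularized logistic regression does not weight the larger-scale Age column unfairly against the 0/1 Sex column (see the scikit-learn link above). A minimal sketch to verify what the scaler did:

In [ ]:
#sketch: after standardization each column should have mean ~0 and std ~1
print(X.mean(axis=0), X.std(axis=0))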
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
In [20]:
from sklearn.linear_model import LogisticRegression
In [21]:
model = LogisticRegression()
model.fit(X_train, y_train)
Out[21]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [22]:
#checking accuracy on training dataset
model.score(X_train, y_train)
Out[22]:
0.79158316633266534
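Accuracy on the training set can be optimistic, so it is worth comparing it with the held-out split; a one-line sketch:

In [ ]:
#sketch: accuracy on the held-out test split, for comparison with the training score
print(model.score(X_test, y_test))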
In [25]:
pred = model.predict(X_test)
In [26]:
#a better metric for binary classification is the area under the ROC curve
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, pred)
print(auc)
0.740190832887
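Note that the score above is computed from hard 0/1 predictions. ROC AUC is normally computed from predicted probabilities, so it measures ranking quality across all possible decision thresholds; a sketch:

In [ ]:
#sketch: ROC AUC from predicted probabilities instead of hard class labels
probs = model.predict_proba(X_test)[:, 1]  #predicted probability of survival (class 1)
print(roc_auc_score(y_test, probs))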
In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
#The f1-score is the harmonic mean of precision and recall.
#The per-class scores tell you how well the classifier identifies data points
#of that particular class compared to all other classes.
#The support is the number of samples of the true response that lie in each class.
             precision    recall  f1-score   support

          0       0.77      0.82      0.80       126
          1       0.72      0.66      0.69        89

avg / total       0.75      0.75      0.75       215
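The precision and recall in the report come directly from the confusion matrix; a short sketch that makes the connection explicit for class 1:

In [ ]:
#sketch: derive precision and recall for class 1 (survived) from the confusion matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("precision:", tp / (tp + fp))
print("recall:", tp / (tp + fn))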

Exercise

Try another model, such as a Random Forest. Change the penalty or the regularization strength parameter (C) of the model. You can also perform other preprocessing on the dataset.
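As a starting point for the exercise, here is a minimal sketch that swaps in a random forest on the same train/test split; the hyperparameters are illustrative, not tuned:

In [ ]:
#sketch: same split with a random forest instead of logistic regression
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))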
