首页 > 代码库 > [kaggle入门] Titanic Machine Learning from Disaster

[kaggle入门] Titanic Machine Learning from Disaster

<style></style>
 

Titanic Data Science Solutions¶

https://www.kaggle.com/startupsci/titanic-data-science-solutions

数据挖掘竞赛七个步骤:¶

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.

数据挖掘竞赛的七种目标:¶

  1. Classifying: classify or categorize our samples and may also want to understand the implications or correlation of different classes with our solution goal.
  2. Correlating: Correlating certain features may help in creating, completing, or correcting features.
  3. Converting: For instance converting text categorical values to numeric values.
  4. Completing: Estimate any missing values within a feature.
  5. Correcting: Detect any outliers among our samples or features and may discard a feature if it is not contribting to the analysis or may significantly skew the results.
  6. Creating: Create new features based on an existing feature or a set of features.(correlation, conversion, completeness..)
  7. Charting: Select the right visualization plots and charts
 

Question or problem definition¶

https://www.kaggle.com/c/titanic

  1. The question or problem definition for Titanic Survival competition Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.
  2. Some early understanding about the domain of our problem. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In [1]:
# data analysis and wrangling 数据分析和清洗工具import pandas as pdimport numpy as npimport random as rnd# visualization 数据可视化工具import seaborn as snsimport matplotlib.pyplot as plt%matplotlib inline# machine learning 机器学习模型from sklearn.linear_model import LogisticRegression # 逻辑回归from sklearn.svm import SVC, LinearSVC # 支持向量机from sklearn.ensemble import RandomForestClassifier # 随机森林from sklearn.neighbors import KNeighborsClassifier # K近邻from sklearn.naive_bayes import GaussianNB # 贝叶斯算法from sklearn.linear_model import Perceptron # 感知机from sklearn.linear_model import SGDClassifier # 随机梯度下降分类器from sklearn.tree import DecisionTreeClassifier # 决策树
 

Acquire training and testing data¶

In [2]:
train_df = pd.read_csv(‘data/train.csv‘) # 用pandas的read_csv方法读出DataFrame数据test_df = pd.read_csv(‘data/test.csv‘)combine = [train_df, test_df] # combine为一个数据集,方便对训练集和测试集做相同的数据清洗操作
 

Analyze by describing data¶

https://www.kaggle.com/c/titanic/data

In [3]:
print(train_df.columns.values) # 导出列名:features的名字
 
[‘PassengerId‘ ‘Survived‘ ‘Pclass‘ ‘Name‘ ‘Sex‘ ‘Age‘ ‘SibSp‘ ‘Parch‘ ‘Ticket‘ ‘Fare‘ ‘Cabin‘ ‘Embarked‘]
In [4]:
# preview the datatrain_df.head()  # 默认前5行
Out[4]:
 
<style></style>
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
In [5]:
train_df.tail() # 默认后5行
Out[5]:
 
<style></style>
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.00021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.00B42S
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
In [6]:
train_df.info()print(‘_‘*40)test_df.info()
 
<class ‘pandas.core.frame.DataFrame‘>RangeIndex: 891 entries, 0 to 890Data columns (total 12 columns):PassengerId    891 non-null int64Survived       891 non-null int64Pclass         891 non-null int64Name           891 non-null objectSex            891 non-null objectAge            714 non-null float64SibSp          891 non-null int64Parch          891 non-null int64Ticket         891 non-null objectFare           891 non-null float64Cabin          204 non-null objectEmbarked       889 non-null objectdtypes: float64(2), int64(5), object(5)memory usage: 83.6+ KB________________________________________<class ‘pandas.core.frame.DataFrame‘>RangeIndex: 418 entries, 0 to 417Data columns (total 11 columns):PassengerId    418 non-null int64Pclass         418 non-null int64Name           418 non-null objectSex            418 non-null objectAge            332 non-null float64SibSp          418 non-null int64Parch          418 non-null int64Ticket         418 non-null objectFare           417 non-null float64Cabin          91 non-null objectEmbarked       418 non-null objectdtypes: float64(2), int64(4), object(5)memory usage: 36.0+ KB
 
  1. Which features are categorical?
    Categorical: Survived, Sex, and Embarked.
    Ordinal: Pclass.
  2. Which features are numerical?
    Continous: Age, Fare.
    Discrete: SibSp, Parch.
  3. Which features are mixed data types?
    Ticket is a mix of numeric and alphanumeric data types.
    Cabin is alphanumeric.
  4. Which features may contain errors or typos?
    Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
  5. Which features contain blank, null or empty values?
    Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
    Cabin > Age are incomplete in case of test dataset.
  6. What are the data types for various features?
    Seven features are integer or floats. Six in case of test dataset.
    Five features are strings (object).
In [7]:
train_df.describe()  # 数据的描述(总数、均值、标准差、最大、最小、25%、50%、75%)# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.# Review Parch distribution using `percentiles=[.75, .8]`# SibSp distribution `[.68, .69]`# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
Out[7]:
 
<style></style>
 PassengerIdSurvivedPclassAgeSibSpParchFare
count891.000000891.000000891.000000714.000000891.000000891.000000891.000000
mean446.0000000.3838382.30864229.6991180.5230080.38159432.204208
std257.3538420.4865920.83607114.5264971.1027430.80605749.693429
min1.0000000.0000001.0000000.4200000.0000000.0000000.000000
25%223.5000000.0000002.00000020.1250000.0000000.0000007.910400
50%446.0000000.0000003.00000028.0000000.0000000.00000014.454200
75%668.5000001.0000003.00000038.0000001.0000000.00000031.000000
max891.0000001.0000003.00000080.0000008.0000006.000000512.329200
In [8]:
train_df.describe(include=[‘O‘])  # 找出特征中几个出现的不同值和频率最高
Out[8]:
 
<style></style>
 NameSexTicketCabinEmbarked
count891891891204889
unique89126811473
topCaldwell, Mrs. Albert Francis (Sylvia Mae Harb...male347082B96 B98S
freq157774644
 
  1. What is the distribution of numerical feature values across the samples?
    Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
    Survived is a categorical feature with 0 or 1 values.
    Around 38% samples survived representative of the actual survival rate at 32%.
    Most passengers (> 75%) did not travel with parents or children.
    Nearly 30% of the passengers had siblings and/or spouse aboard.
    Fares varied significantly with few passengers (\<1%) paying as high as 512.
    Few elderly passengers (\<1%) within age range 65-80.
  2. What is the distribution of categorical features?
    Names are unique across the dataset (count=unique=891).
    Sex variable as two possible values with 65% male (top=male, freq=577/count=891).
    Cabin values have several dupicates across samples. Alternatively several passengers shared a cabin.
    Embarked takes three possible values. S port used by most passengers (top=S).
    Ticket feature has high ratio (22%) of duplicate values (unique=681).
 

Assumtions based on data analysis¶

Correlating
Completing
Correcting
Creating
Classifying

In [9]:
# 通过groupby找出该特征与目标之间的关联train_df[[‘Pclass‘, ‘Survived‘]].groupby([‘Pclass‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[9]:
 
<style></style>
 PclassSurvived
010.629630
120.472826
230.242363
In [10]:
train_df[["Sex", "Survived"]].groupby([‘Sex‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[10]:
 
<style></style>
 SexSurvived
0female0.742038
1male0.188908
In [11]:
train_df[["SibSp", "Survived"]].groupby([‘SibSp‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[11]:
 
<style></style>
 SibSpSurvived
110.535885
220.464286
000.345395
330.250000
440.166667
550.000000
680.000000
In [12]:
train_df[["Parch", "Survived"]].groupby([‘Parch‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[12]:
 
<style></style>
 ParchSurvived
330.600000
110.550847
220.500000
000.343658
550.200000
440.000000
660.000000
 

Analyze by visualizing data¶

In [13]:
g = sns.FacetGrid(train_df, col=‘Survived‘)g.map(plt.hist, ‘Age‘, bins=20)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x2a742a46828>
 
技术分享
In [14]:
# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Survived‘)grid = sns.FacetGrid(train_df, col=‘Survived‘, row=‘Pclass‘, size=2.2, aspect=1.6)grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)grid.add_legend();
 
技术分享
In [15]:
# grid = sns.FacetGrid(train_df, col=‘Embarked‘)grid = sns.FacetGrid(train_df, row=‘Embarked‘, size=2.2, aspect=1.6)grid.map(sns.pointplot, ‘Pclass‘, ‘Survived‘, ‘Sex‘, palette=‘deep‘)grid.add_legend()
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x2a7435e7198>
 
技术分享
In [16]:
# grid = sns.FacetGrid(train_df, col=‘Embarked‘, hue=‘Survived‘, palette={0: ‘k‘, 1: ‘w‘})grid = sns.FacetGrid(train_df, row=‘Embarked‘, col=‘Survived‘, size=2.2, aspect=1.6)grid.map(sns.barplot, ‘Sex‘, ‘Fare‘, alpha=.5, ci=None)grid.add_legend()
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x2a7435e7978>
 
技术分享
 

Wrangle, prepare, cleanse the data¶

Correcting by dropping features
drop the Cabin (correcting #2) and Ticket (correcting #1) features

In [17]:
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)train_df = train_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)test_df = test_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)combine = [train_df, test_df]print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
 
Before (891, 12) (418, 11) (891, 12) (418, 11)After (891, 10) (418, 9) (891, 10) (418, 9)
 

Creating new feature extracting from existing

In [18]:
for dataset in combine:    dataset[‘Title‘] = dataset.Name.str.extract(‘ ([A-Za-z]+)\.‘, expand=False)pd.crosstab(train_df[‘Title‘], train_df[‘Sex‘])
Out[18]:
 
<style></style>
Sexfemalemale
Title  
Capt01
Col02
Countess10
Don01
Dr16
Jonkheer01
Lady10
Major02
Master040
Miss1820
Mlle20
Mme10
Mr0517
Mrs1250
Ms10
Rev06
Sir01
In [19]:
for dataset in combine:    dataset[‘Title‘] = dataset[‘Title‘].replace([‘Lady‘, ‘Countess‘,‘Capt‘, ‘Col‘, 	‘Don‘, ‘Dr‘, ‘Major‘, ‘Rev‘, ‘Sir‘, ‘Jonkheer‘, ‘Dona‘], ‘Rare‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mlle‘, ‘Miss‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Ms‘, ‘Miss‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mme‘, ‘Mrs‘)    train_df[[‘Title‘, ‘Survived‘]].groupby([‘Title‘], as_index=False).mean()
Out[19]:
 
<style></style>
 TitleSurvived
0Master0.575000
1Miss0.702703
2Mr0.156673
3Mrs0.793651
4Rare0.347826
In [20]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}for dataset in combine:    dataset[‘Title‘] = dataset[‘Title‘].map(title_mapping)    dataset[‘Title‘] = dataset[‘Title‘].fillna(0)train_df.head()
Out[20]:
 
<style></style>
 PassengerIdSurvivedPclassNameSexAgeSibSpParchFareEmbarkedTitle
0103Braund, Mr. Owen Harrismale22.0107.2500S1
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.01071.2833C3
2313Heikkinen, Miss. Lainafemale26.0007.9250S2
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01053.1000S3
4503Allen, Mr. William Henrymale35.0008.0500S1
In [21]:
train_df = train_df.drop([‘Name‘, ‘PassengerId‘], axis=1)test_df = test_df.drop([‘Name‘], axis=1)combine = [train_df, test_df]train_df.shape, test_df.shape
Out[21]:
((891, 9), (418, 9))
In [22]:
for dataset in combine:    dataset[‘Sex‘] = dataset[‘Sex‘].map( {‘female‘: 1, ‘male‘: 0} ).astype(int)train_df.head()
Out[22]:
 
<style></style>
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
003022.0107.2500S1
111138.01071.2833C3
213126.0007.9250S2
311135.01053.1000S3
403035.0008.0500S1
In [23]:
# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Gender‘)grid = sns.FacetGrid(train_df, row=‘Pclass‘, col=‘Sex‘, size=2.2, aspect=1.6)grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)grid.add_legend()
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x2a74330acf8>
 
技术分享
In [24]:
guess_ages = np.zeros((2,3))guess_ages
Out[24]:
array([[ 0.,  0.,  0.],       [ 0.,  0.,  0.]])
In [25]:
for dataset in combine:    for i in range(0, 2):        for j in range(0, 3):            guess_df = dataset[(dataset[‘Sex‘] == i) &                                   (dataset[‘Pclass‘] == j+1)][‘Age‘].dropna()            # age_mean = guess_df.mean()            # age_std = guess_df.std()            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)            age_guess = guess_df.median()            # Convert random age float to nearest .5 age            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5                for i in range(0, 2):        for j in range(0, 3):            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),                    ‘Age‘] = guess_ages[i,j]    dataset[‘Age‘] = dataset[‘Age‘].astype(int)train_df.head()
Out[25]:
 
<style></style>
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
003022107.2500S1
1111381071.2833C3
213126007.9250S2
3111351053.1000S3
403035008.0500S1
In [26]:
train_df[‘AgeBand‘] = pd.cut(train_df[‘Age‘], 5)train_df[[‘AgeBand‘, ‘Survived‘]].groupby([‘AgeBand‘], as_index=False).mean().sort_values(by=‘AgeBand‘, ascending=True)
Out[26]:
 
<style></style>
 AgeBandSurvived
0(-0.08, 16.0]0.550000
1(16.0, 32.0]0.337374
2(32.0, 48.0]0.412037
3(48.0, 64.0]0.434783
4(64.0, 80.0]0.090909
In [27]:
for dataset in combine:        dataset.loc[ dataset[‘Age‘] <= 16, ‘Age‘] = 0    dataset.loc[(dataset[‘Age‘] > 16) & (dataset[‘Age‘] <= 32), ‘Age‘] = 1    dataset.loc[(dataset[‘Age‘] > 32) & (dataset[‘Age‘] <= 48), ‘Age‘] = 2    dataset.loc[(dataset[‘Age‘] > 48) & (dataset[‘Age‘] <= 64), ‘Age‘] = 3    dataset.loc[ dataset[‘Age‘] > 64, ‘Age‘]train_df.head()
Out[27]:
 
<style></style>
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitleAgeBand
00301107.2500S1(16.0, 32.0]
111121071.2833C3(32.0, 48.0]
21311007.9250S2(16.0, 32.0]
311121053.1000S3(32.0, 48.0]
40302008.0500S1(32.0, 48.0]
In [28]:
train_df = train_df.drop([‘AgeBand‘], axis=1)combine = [train_df, test_df]train_df.head()
Out[28]:
 
<style></style>
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
00301107.2500S1
111121071.2833C3
21311007.9250S2
311121053.1000S3
40302008.0500S1
In [29]:
for dataset in combine:    dataset[‘FamilySize‘] = dataset[‘SibSp‘] + dataset[‘Parch‘] + 1train_df[[‘FamilySize‘, ‘Survived‘]].groupby([‘FamilySize‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[29]:
 
<style></style>
 FamilySizeSurvived
340.724138
230.578431
120.552795
670.333333
010.303538
450.200000
560.136364
780.000000
8110.000000
In [30]:
for dataset in combine:    dataset[‘IsAlone‘] = 0    dataset.loc[dataset[‘FamilySize‘] == 1, ‘IsAlone‘] = 1train_df[[‘IsAlone‘, ‘Survived‘]].groupby([‘IsAlone‘], as_index=False).mean()
Out[30]:
 
<style></style>
 IsAloneSurvived
000.505650
110.303538
In [31]:
train_df = train_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)test_df = test_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)combine = [train_df, test_df]train_df.head()
Out[31]:
 
<style></style>
 SurvivedPclassSexAgeFareEmbarkedTitleIsAlone
003017.2500S10
1111271.2833C30
213117.9250S21
3111253.1000S30
403028.0500S11
In [32]:
for dataset in combine:    dataset[‘Age*Class‘] = dataset.Age * dataset.Pclasstrain_df.loc[:, [‘Age*Class‘, ‘Age‘, ‘Pclass‘]].head(10)
Out[32]:
 
<style></style>
 Age*ClassAgePclass
0313
1221
2313
3221
4623
5313
6331
7003
8313
9002
In [33]:
freq_port = train_df.Embarked.dropna().mode()[0]freq_port
Out[33]:
‘S‘
In [34]:
for dataset in combine:    dataset[‘Embarked‘] = dataset[‘Embarked‘].fillna(freq_port)    train_df[[‘Embarked‘, ‘Survived‘]].groupby([‘Embarked‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[34]:
 
<style></style>
 EmbarkedSurvived
0C0.553571
1Q0.389610
2S0.339009
In [35]:
for dataset in combine:    dataset[‘Embarked‘] = dataset[‘Embarked‘].map( {‘S‘: 0, ‘C‘: 1, ‘Q‘: 2} ).astype(int)train_df.head()
Out[35]:
 
<style></style>
 SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
003017.25000103
1111271.28331302
213117.92500213
3111253.10000302
403028.05000116
In [36]:
test_df[‘Fare‘].fillna(test_df[‘Fare‘].dropna().median(), inplace=True)test_df.head()
Out[36]:
 
<style></style>
 PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
08923027.82922116
18933127.00000306
28942039.68752116
38953018.66250113
489631112.28750303
In [37]:
train_df[‘FareBand‘] = pd.qcut(train_df[‘Fare‘], 4)train_df[[‘FareBand‘, ‘Survived‘]].groupby([‘FareBand‘], as_index=False).mean().sort_values(by=‘FareBand‘, ascending=True)
Out[37]:
 
<style></style>
 FareBandSurvived
0(-0.001, 7.91]0.197309
1(7.91, 14.454]0.303571
2(14.454, 31.0]0.454955
3(31.0, 512.329]0.581081
In [38]:
for dataset in combine:    dataset.loc[ dataset[‘Fare‘] <= 7.91, ‘Fare‘] = 0    dataset.loc[(dataset[‘Fare‘] > 7.91) & (dataset[‘Fare‘] <= 14.454), ‘Fare‘] = 1    dataset.loc[(dataset[‘Fare‘] > 14.454) & (dataset[‘Fare‘] <= 31), ‘Fare‘]   = 2    dataset.loc[ dataset[‘Fare‘] > 31, ‘Fare‘] = 3    dataset[‘Fare‘] = dataset[‘Fare‘].astype(int)train_df = train_df.drop([‘FareBand‘], axis=1)combine = [train_df, test_df]    train_df.head(10)
Out[38]:
 
<style></style>
 SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0030100103
1111231302
2131110213
3111230302
4030210116
5030112113
6010330113
7030020400
8131110303
9121021300
In [39]:
test_df.head(10)
Out[39]:
 
<style></style>
 PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
089230202116
189331200306
289420312116
389530110113
489631110303
589730010110
689831102213
789920120102
890031101313
990130120103
In [40]:
X_train = train_df.drop("Survived", axis=1)Y_train = train_df["Survived"]X_test  = test_df.drop("PassengerId", axis=1).copy()X_train.shape, Y_train.shape, X_test.shape
Out[40]:
((891, 8), (891,), (418, 8))
In [41]:
# Logistic Regressionlogreg = LogisticRegression()logreg.fit(X_train, Y_train)Y_pred = logreg.predict(X_test)acc_log = round(logreg.score(X_train, Y_train) * 100, 2)acc_log
Out[41]:
80.359999999999999
In [42]:
coeff_df = pd.DataFrame(train_df.columns.delete(0))coeff_df.columns = [‘Feature‘]coeff_df["Correlation"] = pd.Series(logreg.coef_[0])coeff_df.sort_values(by=‘Correlation‘, ascending=False)
Out[42]:
 
<style></style>
 FeatureCorrelation
1Sex2.201527
5Title0.398234
2Age0.287164
4Embarked0.261762
6IsAlone0.129140
3Fare-0.085150
7Age*Class-0.311199
0Pclass-0.749006
In [43]:
# Support Vector Machinessvc = SVC()svc.fit(X_train, Y_train)Y_pred = svc.predict(X_test)acc_svc = round(svc.score(X_train, Y_train) * 100, 2)acc_svc
Out[43]:
83.840000000000003
In [44]:
knn = KNeighborsClassifier(n_neighbors = 3)knn.fit(X_train, Y_train)Y_pred = knn.predict(X_test)acc_knn = round(knn.score(X_train, Y_train) * 100, 2)acc_knn
Out[44]:
84.739999999999995
In [45]:
# Gaussian Naive Bayesgaussian = GaussianNB()gaussian.fit(X_train, Y_train)Y_pred = gaussian.predict(X_test)acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)acc_gaussian
Out[45]:
72.280000000000001
In [46]:
# Perceptronperceptron = Perceptron()perceptron.fit(X_train, Y_train)Y_pred = perceptron.predict(X_test)acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)acc_perceptron
Out[46]:
78.0
In [47]:
# Linear SVClinear_svc = LinearSVC()linear_svc.fit(X_train, Y_train)Y_pred = linear_svc.predict(X_test)acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)acc_linear_svc
Out[47]:
79.120000000000005
In [48]:
# Stochastic Gradient Descentsgd = SGDClassifier()sgd.fit(X_train, Y_train)Y_pred = sgd.predict(X_test)acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)acc_sgd
Out[48]:
76.879999999999995
In [49]:
# Decision Treedecision_tree = DecisionTreeClassifier()decision_tree.fit(X_train, Y_train)Y_pred = decision_tree.predict(X_test)acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)acc_decision_tree
Out[49]:
86.760000000000005
In [50]:
# Random Forestrandom_forest = RandomForestClassifier(n_estimators=100)random_forest.fit(X_train, Y_train)Y_pred = random_forest.predict(X_test)random_forest.score(X_train, Y_train)acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)acc_random_forest
Out[50]:
86.760000000000005
In [51]:
models = pd.DataFrame({    ‘Model‘: [‘Support Vector Machines‘, ‘KNN‘, ‘Logistic Regression‘,               ‘Random Forest‘, ‘Naive Bayes‘, ‘Perceptron‘,               ‘Stochastic Gradient Decent‘, ‘Linear SVC‘,               ‘Decision Tree‘],    ‘Score‘: [acc_svc, acc_knn, acc_log,               acc_random_forest, acc_gaussian, acc_perceptron,               acc_sgd, acc_linear_svc, acc_decision_tree]})models.sort_values(by=‘Score‘, ascending=False)
Out[51]:
 
<style></style>
 ModelScore
3Random Forest86.76
8Decision Tree86.76
1KNN84.74
0Support Vector Machines83.84
2Logistic Regression80.36
7Linear SVC79.12
5Perceptron78.00
6Stochastic Gradient Decent76.88
4Naive Bayes72.28
In [52]:
submission = pd.DataFrame({        "PassengerId": test_df["PassengerId"],        "Survived": Y_pred    })# submission.to_csv(‘../output/submission.csv‘, index=False)
In [ ]:
 

[kaggle入门] Titanic Machine Learning from Disaster