首页 > 代码库 > [kaggle入门] Titanic Machine Learning from Disaster

[kaggle入门] Titanic Machine Learning from Disaster

2024-10-05 05:53:02 216人阅读

Titanic Data Science Solutions¶

https://www.kaggle.com/startupsci/titanic-data-science-solutions

数据挖掘竞赛七个步骤：¶

Question or problem definition.
Acquire training and testing data.
Wrangle, prepare, cleanse the data.
Analyze, identify patterns, and explore the data.
Model, predict and solve the problem.
Visualize, report, and present the problem solving steps and final solution.
Supply or submit the results.

数据挖掘竞赛的七种目标：¶

Classifying: classify or categorize our samples and may also want to understand the implications or correlation of different classes with our solution goal.
Correlating: Correlating certain features may help in creating, completing, or correcting features.
Converting: For instance converting text categorical values to numeric values.
Completing: Estimate any missing values within a feature.
Correcting: Detect any outliers among our samples or features and may discard a feature if it is not contribting to the analysis or may significantly skew the results.
Creating: Create new features based on an existing feature or a set of features.(correlation, conversion, completeness..)
Charting: Select the right visualization plots and charts

Question or problem definition¶

https://www.kaggle.com/c/titanic

The question or problem definition for Titanic Survival competition Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.
Some early understanding about the domain of our problem. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In [1]:

# data analysis and wrangling 数据分析和清洗工具import pandas as pdimport numpy as npimport random as rnd# visualization 数据可视化工具import seaborn as snsimport matplotlib.pyplot as plt%matplotlib inline# machine learning 机器学习模型from sklearn.linear_model import LogisticRegression # 逻辑回归from sklearn.svm import SVC, LinearSVC # 支持向量机from sklearn.ensemble import RandomForestClassifier # 随机森林from sklearn.neighbors import KNeighborsClassifier # K近邻from sklearn.naive_bayes import GaussianNB # 贝叶斯算法from sklearn.linear_model import Perceptron # 感知机from sklearn.linear_model import SGDClassifier # 随机梯度下降分类器from sklearn.tree import DecisionTreeClassifier # 决策树

Acquire training and testing data¶

In [2]:

train_df = pd.read_csv(‘data/train.csv‘) # 用pandas的read_csv方法读出DataFrame数据test_df = pd.read_csv(‘data/test.csv‘)combine = [train_df, test_df] # combine为一个数据集，方便对训练集和测试集做相同的数据清洗操作

Analyze by describing data¶

https://www.kaggle.com/c/titanic/data

In [3]:

print(train_df.columns.values) # 导出列名：features的名字

[‘PassengerId‘ ‘Survived‘ ‘Pclass‘ ‘Name‘ ‘Sex‘ ‘Age‘ ‘SibSp‘ ‘Parch‘ ‘Ticket‘ ‘Fare‘ ‘Cabin‘ ‘Embarked‘]

In [4]:

# preview the datatrain_df.head()  # 默认前5行

Out[4]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

In [5]:

train_df.tail() # 默认后5行

Out[5]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

In [6]:

train_df.info()print(‘_‘*40)test_df.info()

<class ‘pandas.core.frame.DataFrame‘>RangeIndex: 891 entries, 0 to 890Data columns (total 12 columns):PassengerId    891 non-null int64Survived       891 non-null int64Pclass         891 non-null int64Name           891 non-null objectSex            891 non-null objectAge            714 non-null float64SibSp          891 non-null int64Parch          891 non-null int64Ticket         891 non-null objectFare           891 non-null float64Cabin          204 non-null objectEmbarked       889 non-null objectdtypes: float64(2), int64(5), object(5)memory usage: 83.6+ KB________________________________________<class ‘pandas.core.frame.DataFrame‘>RangeIndex: 418 entries, 0 to 417Data columns (total 11 columns):PassengerId    418 non-null int64Pclass         418 non-null int64Name           418 non-null objectSex            418 non-null objectAge            332 non-null float64SibSp          418 non-null int64Parch          418 non-null int64Ticket         418 non-null objectFare           417 non-null float64Cabin          91 non-null objectEmbarked       418 non-null objectdtypes: float64(2), int64(4), object(5)memory usage: 36.0+ KB

Which features are categorical?
Categorical: Survived, Sex, and Embarked.
Ordinal: Pclass.
Which features are numerical?
Continous: Age, Fare.
Discrete: SibSp, Parch.
Which features are mixed data types?
Ticket is a mix of numeric and alphanumeric data types.
Cabin is alphanumeric.
Which features may contain errors or typos?
Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
Which features contain blank, null or empty values?
Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
Cabin > Age are incomplete in case of test dataset.
What are the data types for various features?
Seven features are integer or floats. Six in case of test dataset.
Five features are strings (object).

In [7]:

train_df.describe()  # 数据的描述（总数、均值、标准差、最大、最小、25%、50%、75%）# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.# Review Parch distribution using `percentiles=[.75, .8]`# SibSp distribution `[.68, .69]`# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`

Out[7]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

In [8]:

train_df.describe(include=[‘O‘])  # 找出特征中几个出现的不同值和频率最高

Out[8]:

	Name	Sex	Ticket	Cabin	Embarked
count	891	891	891	204	889
unique	891	2	681	147	3
top	Caldwell, Mrs. Albert Francis (Sylvia Mae Harb...	male	347082	B96 B98	S
freq	1	577	7	4	644

What is the distribution of numerical feature values across the samples?
Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
Survived is a categorical feature with 0 or 1 values.
Around 38% samples survived representative of the actual survival rate at 32%.
Most passengers (> 75%) did not travel with parents or children.
Nearly 30% of the passengers had siblings and/or spouse aboard.
Fares varied significantly with few passengers (\<1%) paying as high as 512.
Few elderly passengers (\<1%) within age range 65-80.
What is the distribution of categorical features?
Names are unique across the dataset (count=unique=891).
Sex variable as two possible values with 65% male (top=male, freq=577/count=891).
Cabin values have several dupicates across samples. Alternatively several passengers shared a cabin.
Embarked takes three possible values. S port used by most passengers (top=S).
Ticket feature has high ratio (22%) of duplicate values (unique=681).

Assumtions based on data analysis¶

Correlating
Completing
Correcting
Creating
Classifying

In [9]:

# 通过groupby找出该特征与目标之间的关联train_df[[‘Pclass‘, ‘Survived‘]].groupby([‘Pclass‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[9]:

	Pclass	Survived
0	1	0.629630
1	2	0.472826
2	3	0.242363

In [10]:

train_df[["Sex", "Survived"]].groupby([‘Sex‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[10]:

	Sex	Survived
0	female	0.742038
1	male	0.188908

In [11]:

train_df[["SibSp", "Survived"]].groupby([‘SibSp‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[11]:

	SibSp	Survived
1	1	0.535885
2	2	0.464286
0	0	0.345395
3	3	0.250000
4	4	0.166667
5	5	0.000000
6	8	0.000000

In [12]:

train_df[["Parch", "Survived"]].groupby([‘Parch‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[12]:

	Parch	Survived
3	3	0.600000
1	1	0.550847
2	2	0.500000
0	0	0.343658
5	5	0.200000
4	4	0.000000
6	6	0.000000

Analyze by visualizing data¶

In [13]:

g = sns.FacetGrid(train_df, col=‘Survived‘)g.map(plt.hist, ‘Age‘, bins=20)

Out[13]:

<seaborn.axisgrid.FacetGrid at 0x2a742a46828>

In [14]:

# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Survived‘)grid = sns.FacetGrid(train_df, col=‘Survived‘, row=‘Pclass‘, size=2.2, aspect=1.6)grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)grid.add_legend();

In [15]:

# grid = sns.FacetGrid(train_df, col=‘Embarked‘)grid = sns.FacetGrid(train_df, row=‘Embarked‘, size=2.2, aspect=1.6)grid.map(sns.pointplot, ‘Pclass‘, ‘Survived‘, ‘Sex‘, palette=‘deep‘)grid.add_legend()

Out[15]:

<seaborn.axisgrid.FacetGrid at 0x2a7435e7198>

In [16]:

# grid = sns.FacetGrid(train_df, col=‘Embarked‘, hue=‘Survived‘, palette={0: ‘k‘, 1: ‘w‘})grid = sns.FacetGrid(train_df, row=‘Embarked‘, col=‘Survived‘, size=2.2, aspect=1.6)grid.map(sns.barplot, ‘Sex‘, ‘Fare‘, alpha=.5, ci=None)grid.add_legend()

Out[16]:

<seaborn.axisgrid.FacetGrid at 0x2a7435e7978>

Wrangle, prepare, cleanse the data¶

Correcting by dropping features
drop the Cabin (correcting #2) and Ticket (correcting #1) features

In [17]:

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)train_df = train_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)test_df = test_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)combine = [train_df, test_df]print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

Before (891, 12) (418, 11) (891, 12) (418, 11)After (891, 10) (418, 9) (891, 10) (418, 9)

Creating new feature extracting from existing

In [18]:

for dataset in combine:    dataset[‘Title‘] = dataset.Name.str.extract(‘ ([A-Za-z]+)\.‘, expand=False)pd.crosstab(train_df[‘Title‘], train_df[‘Sex‘])

Out[18]:

Sex	female	male
Title
Capt	0	1
Col	0	2
Countess	1	0
Don	0	1
Dr	1	6
Jonkheer	0	1
Lady	1	0
Major	0	2
Master	0	40
Miss	182	0
Mlle	2	0
Mme	1	0
Mr	0	517
Mrs	125	0
Ms	1	0
Rev	0	6
Sir	0	1

In [19]:

for dataset in combine:    dataset[‘Title‘] = dataset[‘Title‘].replace([‘Lady‘, ‘Countess‘,‘Capt‘, ‘Col‘, 	‘Don‘, ‘Dr‘, ‘Major‘, ‘Rev‘, ‘Sir‘, ‘Jonkheer‘, ‘Dona‘], ‘Rare‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mlle‘, ‘Miss‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Ms‘, ‘Miss‘)    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mme‘, ‘Mrs‘)    train_df[[‘Title‘, ‘Survived‘]].groupby([‘Title‘], as_index=False).mean()

Out[19]:

	Title	Survived
0	Master	0.575000
1	Miss	0.702703
2	Mr	0.156673
3	Mrs	0.793651
4	Rare	0.347826

In [20]:

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}for dataset in combine:    dataset[‘Title‘] = dataset[‘Title‘].map(title_mapping)    dataset[‘Title‘] = dataset[‘Title‘].fillna(0)train_df.head()

Out[20]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Embarked	Title
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C	3
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	S	2
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	S	3
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	S	1

In [21]:

train_df = train_df.drop([‘Name‘, ‘PassengerId‘], axis=1)test_df = test_df.drop([‘Name‘], axis=1)combine = [train_df, test_df]train_df.shape, test_df.shape

Out[21]:

((891, 9), (418, 9))

In [22]:

for dataset in combine:    dataset[‘Sex‘] = dataset[‘Sex‘].map( {‘female‘: 1, ‘male‘: 0} ).astype(int)train_df.head()

Out[22]:

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title
0	0	3	0	22.0	1	7.2500	S	1
1	1	1	1	38.0	1	71.2833	C	3
2	1	3	1	26.0	0	7.9250	S	2
3	1	1	1	35.0	1	53.1000	S	3
4	0	3	0	35.0	0	8.0500	S	1

In [23]:

# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Gender‘)grid = sns.FacetGrid(train_df, row=‘Pclass‘, col=‘Sex‘, size=2.2, aspect=1.6)grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)grid.add_legend()

Out[23]:

<seaborn.axisgrid.FacetGrid at 0x2a74330acf8>

In [24]:

guess_ages = np.zeros((2,3))guess_ages

Out[24]:

array([[ 0.,  0.,  0.],       [ 0.,  0.,  0.]])

In [25]:

for dataset in combine:    for i in range(0, 2):        for j in range(0, 3):            guess_df = dataset[(dataset[‘Sex‘] == i) &                                   (dataset[‘Pclass‘] == j+1)][‘Age‘].dropna()            # age_mean = guess_df.mean()            # age_std = guess_df.std()            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)            age_guess = guess_df.median()            # Convert random age float to nearest .5 age            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5                for i in range(0, 2):        for j in range(0, 3):            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),                    ‘Age‘] = guess_ages[i,j]    dataset[‘Age‘] = dataset[‘Age‘].astype(int)train_df.head()

Out[25]:

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title
0	0	3	0	22	1	7.2500	S	1
1	1	1	1	38	1	71.2833	C	3
2	1	3	1	26	0	7.9250	S	2
3	1	1	1	35	1	53.1000	S	3
4	0	3	0	35	0	8.0500	S	1

In [26]:

train_df[‘AgeBand‘] = pd.cut(train_df[‘Age‘], 5)train_df[[‘AgeBand‘, ‘Survived‘]].groupby([‘AgeBand‘], as_index=False).mean().sort_values(by=‘AgeBand‘, ascending=True)

Out[26]:

	AgeBand	Survived
0	(-0.08, 16.0]	0.550000
1	(16.0, 32.0]	0.337374
2	(32.0, 48.0]	0.412037
3	(48.0, 64.0]	0.434783
4	(64.0, 80.0]	0.090909

In [27]:

for dataset in combine:        dataset.loc[ dataset[‘Age‘] <= 16, ‘Age‘] = 0    dataset.loc[(dataset[‘Age‘] > 16) & (dataset[‘Age‘] <= 32), ‘Age‘] = 1    dataset.loc[(dataset[‘Age‘] > 32) & (dataset[‘Age‘] <= 48), ‘Age‘] = 2    dataset.loc[(dataset[‘Age‘] > 48) & (dataset[‘Age‘] <= 64), ‘Age‘] = 3    dataset.loc[ dataset[‘Age‘] > 64, ‘Age‘]train_df.head()

Out[27]:

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title	AgeBand
0	0	3	0	1	1	7.2500	S	1	(16.0, 32.0]
1	1	1	1	2	1	71.2833	C	3	(32.0, 48.0]
2	1	3	1	1	0	7.9250	S	2	(16.0, 32.0]
3	1	1	1	2	1	53.1000	S	3	(32.0, 48.0]
4	0	3	0	2	0	8.0500	S	1	(32.0, 48.0]

In [28]:

train_df = train_df.drop([‘AgeBand‘], axis=1)combine = [train_df, test_df]train_df.head()

Out[28]:

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Title
0	0	3	0	1	1	7.2500	S	1
1	1	1	1	2	1	71.2833	C	3
2	1	3	1	1	0	7.9250	S	2
3	1	1	1	2	1	53.1000	S	3
4	0	3	0	2	0	8.0500	S	1

In [29]:

for dataset in combine:    dataset[‘FamilySize‘] = dataset[‘SibSp‘] + dataset[‘Parch‘] + 1train_df[[‘FamilySize‘, ‘Survived‘]].groupby([‘FamilySize‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[29]:

	FamilySize	Survived
3	4	0.724138
2	3	0.578431
1	2	0.552795
6	7	0.333333
0	1	0.303538
4	5	0.200000
5	6	0.136364
7	8	0.000000
8	11	0.000000

In [30]:

for dataset in combine:    dataset[‘IsAlone‘] = 0    dataset.loc[dataset[‘FamilySize‘] == 1, ‘IsAlone‘] = 1train_df[[‘IsAlone‘, ‘Survived‘]].groupby([‘IsAlone‘], as_index=False).mean()

Out[30]:

	IsAlone	Survived
0	0	0.505650
1	1	0.303538

In [31]:

train_df = train_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)test_df = test_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)combine = [train_df, test_df]train_df.head()

Out[31]:

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone
0	0	3	0	1	7.2500	S	1	0
1	1	1	1	2	71.2833	C	3	0
2	1	3	1	1	7.9250	S	2	1
3	1	1	1	2	53.1000	S	3	0
4	0	3	0	2	8.0500	S	1	1

In [32]:

for dataset in combine:    dataset[‘Age*Class‘] = dataset.Age * dataset.Pclasstrain_df.loc[:, [‘Age*Class‘, ‘Age‘, ‘Pclass‘]].head(10)

Out[32]:

	Age*Class	Age	Pclass
0	3	1	3
1	2	2	1
2	3	1	3
3	2	2	1
4	6	2	3
5	3	1	3
6	3	3	1
7	0	0	3
8	3	1	3
9	0	0	2

In [33]:

freq_port = train_df.Embarked.dropna().mode()[0]freq_port

Out[33]:

‘S‘

In [34]:

for dataset in combine:    dataset[‘Embarked‘] = dataset[‘Embarked‘].fillna(freq_port)    train_df[[‘Embarked‘, ‘Survived‘]].groupby([‘Embarked‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)

Out[34]:

	Embarked	Survived
0	C	0.553571
1	Q	0.389610
2	S	0.339009

In [35]:

for dataset in combine:    dataset[‘Embarked‘] = dataset[‘Embarked‘].map( {‘S‘: 0, ‘C‘: 1, ‘Q‘: 2} ).astype(int)train_df.head()

Out[35]:

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	0	3	0	1	7.2500	0	1	0	3
1	1	1	1	2	71.2833	1	3	0	2
2	1	3	1	1	7.9250	0	2	1	3
3	1	1	1	2	53.1000	0	3	0	2
4	0	3	0	2	8.0500	0	1	1	6

In [36]:

test_df[‘Fare‘].fillna(test_df[‘Fare‘].dropna().median(), inplace=True)test_df.head()

Out[36]:

	PassengerId	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	892	3	0	2	7.8292	2	1	1	6
1	893	3	1	2	7.0000	0	3	0	6
2	894	2	0	3	9.6875	2	1	1	6
3	895	3	0	1	8.6625	0	1	1	3
4	896	3	1	1	12.2875	0	3	0	3

In [37]:

train_df[‘FareBand‘] = pd.qcut(train_df[‘Fare‘], 4)train_df[[‘FareBand‘, ‘Survived‘]].groupby([‘FareBand‘], as_index=False).mean().sort_values(by=‘FareBand‘, ascending=True)

Out[37]:

	FareBand	Survived
0	(-0.001, 7.91]	0.197309
1	(7.91, 14.454]	0.303571
2	(14.454, 31.0]	0.454955
3	(31.0, 512.329]	0.581081

In [38]:

for dataset in combine:    dataset.loc[ dataset[‘Fare‘] <= 7.91, ‘Fare‘] = 0    dataset.loc[(dataset[‘Fare‘] > 7.91) & (dataset[‘Fare‘] <= 14.454), ‘Fare‘] = 1    dataset.loc[(dataset[‘Fare‘] > 14.454) & (dataset[‘Fare‘] <= 31), ‘Fare‘]   = 2    dataset.loc[ dataset[‘Fare‘] > 31, ‘Fare‘] = 3    dataset[‘Fare‘] = dataset[‘Fare‘].astype(int)train_df = train_df.drop([‘FareBand‘], axis=1)combine = [train_df, test_df]    train_df.head(10)

Out[38]:

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	0	3	0	1	0	0	1	0	3
1	1	1	1	2	3	1	3	0	2
2	1	3	1	1	1	0	2	1	3
3	1	1	1	2	3	0	3	0	2
4	0	3	0	2	1	0	1	1	6
5	0	3	0	1	1	2	1	1	3
6	0	1	0	3	3	0	1	1	3
7	0	3	0	0	2	0	4	0	0
8	1	3	1	1	1	0	3	0	3
9	1	2	1	0	2	1	3	0	0

In [39]:

test_df.head(10)

Out[39]:

	PassengerId	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	892	3	0	2	0	2	1	1	6
1	893	3	1	2	0	0	3	0	6
2	894	2	0	3	1	2	1	1	6
3	895	3	0	1	1	0	1	1	3
4	896	3	1	1	1	0	3	0	3
5	897	3	0	0	1	0	1	1	0
6	898	3	1	1	0	2	2	1	3
7	899	2	0	1	2	0	1	0	2
8	900	3	1	1	0	1	3	1	3
9	901	3	0	1	2	0	1	0	3

In [40]:

X_train = train_df.drop("Survived", axis=1)Y_train = train_df["Survived"]X_test  = test_df.drop("PassengerId", axis=1).copy()X_train.shape, Y_train.shape, X_test.shape

Out[40]:

((891, 8), (891,), (418, 8))

In [41]:

# Logistic Regressionlogreg = LogisticRegression()logreg.fit(X_train, Y_train)Y_pred = logreg.predict(X_test)acc_log = round(logreg.score(X_train, Y_train) * 100, 2)acc_log

Out[41]:

80.359999999999999

In [42]:

coeff_df = pd.DataFrame(train_df.columns.delete(0))coeff_df.columns = [‘Feature‘]coeff_df["Correlation"] = pd.Series(logreg.coef_[0])coeff_df.sort_values(by=‘Correlation‘, ascending=False)

Out[42]:

	Feature	Correlation
1	Sex	2.201527
5	Title	0.398234
2	Age	0.287164
4	Embarked	0.261762
6	IsAlone	0.129140
3	Fare	-0.085150
7	Age*Class	-0.311199
0	Pclass	-0.749006

In [43]:

# Support Vector Machinessvc = SVC()svc.fit(X_train, Y_train)Y_pred = svc.predict(X_test)acc_svc = round(svc.score(X_train, Y_train) * 100, 2)acc_svc

Out[43]:

83.840000000000003

In [44]:

knn = KNeighborsClassifier(n_neighbors = 3)knn.fit(X_train, Y_train)Y_pred = knn.predict(X_test)acc_knn = round(knn.score(X_train, Y_train) * 100, 2)acc_knn

Out[44]:

84.739999999999995

In [45]:

# Gaussian Naive Bayesgaussian = GaussianNB()gaussian.fit(X_train, Y_train)Y_pred = gaussian.predict(X_test)acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)acc_gaussian

Out[45]:

72.280000000000001

In [46]:

# Perceptronperceptron = Perceptron()perceptron.fit(X_train, Y_train)Y_pred = perceptron.predict(X_test)acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)acc_perceptron

Out[46]:

78.0

In [47]:

# Linear SVClinear_svc = LinearSVC()linear_svc.fit(X_train, Y_train)Y_pred = linear_svc.predict(X_test)acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)acc_linear_svc

Out[47]:

79.120000000000005

In [48]:

# Stochastic Gradient Descentsgd = SGDClassifier()sgd.fit(X_train, Y_train)Y_pred = sgd.predict(X_test)acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)acc_sgd

Out[48]:

76.879999999999995

In [49]:

# Decision Treedecision_tree = DecisionTreeClassifier()decision_tree.fit(X_train, Y_train)Y_pred = decision_tree.predict(X_test)acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)acc_decision_tree

Out[49]:

86.760000000000005

In [50]:

# Random Forestrandom_forest = RandomForestClassifier(n_estimators=100)random_forest.fit(X_train, Y_train)Y_pred = random_forest.predict(X_test)random_forest.score(X_train, Y_train)acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)acc_random_forest

Out[50]:

86.760000000000005

In [51]:

models = pd.DataFrame({    ‘Model‘: [‘Support Vector Machines‘, ‘KNN‘, ‘Logistic Regression‘,               ‘Random Forest‘, ‘Naive Bayes‘, ‘Perceptron‘,               ‘Stochastic Gradient Decent‘, ‘Linear SVC‘,               ‘Decision Tree‘],    ‘Score‘: [acc_svc, acc_knn, acc_log,               acc_random_forest, acc_gaussian, acc_perceptron,               acc_sgd, acc_linear_svc, acc_decision_tree]})models.sort_values(by=‘Score‘, ascending=False)

Out[51]:

	Model	Score
3	Random Forest	86.76
8	Decision Tree	86.76
1	KNN	84.74
0	Support Vector Machines	83.84
2	Logistic Regression	80.36
7	Linear SVC	79.12
5	Perceptron	78.00
6	Stochastic Gradient Decent	76.88
4	Naive Bayes	72.28

In [52]:

submission = pd.DataFrame({        "PassengerId": test_df["PassengerId"],        "Survived": Y_pred    })# submission.to_csv(‘../output/submission.csv‘, index=False)

In [ ]:

[kaggle入门] Titanic Machine Learning from Disaster

声明：以上内容来自用户投稿及互联网公开渠道收集整理发布，本网站不拥有所有权，未作人工编辑处理，也不承担相关法律责任，若内容有误或涉及侵权可进行投诉：投诉/举报工作人员会在5个工作日内联系你，一经查实，本站将立刻删除涉嫌侵权内容。

联系
我们

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	0	3	0	1	0	0	1	0	3
1	1	1	1	2	3	1	3	0	2
2	1	3	1	1	1	0	2	1	3
3	1	1	1	2	3	0	3	0	2
4	0	3	0	2	1	0	1	1	6
5	0	3	0	1	1	2	1	1	3
6	0	1	0	3	3	0	1	1	3
7	0	3	0	0	2	0	4	0	0
8	1	3	1	1	1	0	3	0	3
9	1	2	1	0	2	1	3	0	0

	PassengerId	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	892	3	0	2	0	2	1	1	6
1	893	3	1	2	0	0	3	0	6
2	894	2	0	3	1	2	1	1	6
3	895	3	0	1	1	0	1	1	3
4	896	3	1	1	1	0	3	0	3
5	897	3	0	0	1	0	1	1	0
6	898	3	1	1	0	2	2	1	3
7	899	2	0	1	2	0	1	0	2
8	900	3	1	1	0	1	3	1	3
9	901	3	0	1	2	0	1	0	3

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	0	3	0	1	0	0	1	0	3
1	1	1	1	2	3	1	3	0	2
2	1	3	1	1	1	0	2	1	3
3	1	1	1	2	3	0	3	0	2
4	0	3	0	2	1	0	1	1	6
5	0	3	0	1	1	2	1	1	3
6	0	1	0	3	3	0	1	1	3
7	0	3	0	0	2	0	4	0	0
8	1	3	1	1	1	0	3	0	3
9	1	2	1	0	2	1	3	0	0

	PassengerId	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	892	3	0	2	0	2	1	1	6
1	893	3	1	2	0	0	3	0	6
2	894	2	0	3	1	2	1	1	6
3	895	3	0	1	1	0	1	1	3
4	896	3	1	1	1	0	3	0	3
5	897	3	0	0	1	0	1	1	0
6	898	3	1	1	0	2	2	1	3
7	899	2	0	1	2	0	1	0	2
8	900	3	1	1	0	1	3	1	3
9	901	3	0	1	2	0	1	0	3

首页 > 代码库 > [kaggle入门] Titanic Machine Learning from Disaster

[kaggle入门] Titanic Machine Learning from Disaster

Titanic Data Science Solutions¶

数据挖掘竞赛七个步骤：¶

数据挖掘竞赛的七种目标：¶

Question or problem definition¶

Acquire training and testing data¶

Analyze by describing data¶

Assumtions based on data analysis¶

Analyze by visualizing data¶

Wrangle, prepare, cleanse the data¶

看完仍有疑问？有类似问题直接问程序猿

	Survived	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	0	3	0	1	0	0	1	0	3
1	1	1	1	2	3	1	3	0	2
2	1	3	1	1	1	0	2	1	3
3	1	1	1	2	3	0	3	0	2
4	0	3	0	2	1	0	1	1	6
5	0	3	0	1	1	2	1	1	3
6	0	1	0	3	3	0	1	1	3
7	0	3	0	0	2	0	4	0	0
8	1	3	1	1	1	0	3	0	3
9	1	2	1	0	2	1	3	0	0

	PassengerId	Pclass	Sex	Age	Fare	Embarked	Title	IsAlone	Age*Class
0	892	3	0	2	0	2	1	1	6
1	893	3	1	2	0	0	3	0	6
2	894	2	0	3	1	2	1	1	6
3	895	3	0	1	1	0	1	1	3
4	896	3	1	1	1	0	3	0	3
5	897	3	0	0	1	0	1	1	0
6	898	3	1	1	0	2	2	1	3
7	899	2	0	1	2	0	1	0	2
8	900	3	1	1	0	1	3	1	3
9	901	3	0	1	2	0	1	0	3