Exploratory Data Analysis of Titanic Dataset

Exploratory data analysis (EDA) is an important pillar of data science and a step required in every project, regardless of the type of data you are working with. Exploratory analysis gives us a sense of what additional work should be performed to quantify and extract insights from our data.

In this post I perform exploratory data analysis on the Titanic dataset.

Let’s start by importing the required libraries.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now I will read the Titanic dataset using the pandas read_csv method and look at the first 5 rows of the data set.

titanic_df = pd.read_csv('titanic-data.csv')
titanic_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Data Description

(from https://www.kaggle.com/c/titanic)

survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Now let’s look at a statistical summary of the imported dataset using the DataFrame.describe() method.

titanic_df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

For numeric columns, the index of the output DataFrame includes count, mean, std, min, max, and the lower (25%), median (50%) and upper (75%) percentiles.

From the table above I see that the mean of the Survived column is 0.38, i.e. about 38% of the 891 passengers in this data survived. Since this is not the complete passenger list, we cannot draw conclusions about everyone aboard from that figure.

The count for the ‘Age’ column is 714 out of 891 rows, which means the dataset has some missing values. I will have to clean up the data before I start exploring.
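To make the link between count and missing values concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are hypothetical and not taken from the Titanic file. The count that describe() reports only tallies non-null entries, so a count below the number of rows signals missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Age value is missing on purpose.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})

n_rows = len(toy)             # total rows, whether or not values are missing
non_null = toy.count()        # non-null entries per column (the "count" row of describe())
missing = toy.isnull().sum()  # missing entries per column

print(n_rows)
print(non_null)
print(missing)
```

Here Age has a count of 2 against 3 rows, in the same way that Age has a count of 714 against 891 rows in the real data.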

Data cleanup

Now let’s get some info on the datatypes in the dataset using the DataFrame.info() method. It gives us a concise summary of a DataFrame.

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

I see that there are some missing values in the ‘Age’, ‘Cabin’ and ‘Embarked’ columns. ‘Cabin’ has the most missing values (only 204 of 891 are present), so I will not use it. Some other columns are not required for my analysis, so I will drop them as well. For the missing ‘Age’ and ‘Embarked’ values, I will omit those rows when I use the data.
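Omitting rows only when a column is actually used could look like the sketch below. It assumes a small hypothetical `toy` frame in place of the real data, and uses `dropna(subset=...)` so that each analysis drops only the rows missing in the column it needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real passengers.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# For an age analysis: drop only the rows where Age is missing.
with_age = toy.dropna(subset=["Age"])

# For a port analysis: drop only the rows where Embarked is missing.
with_port = toy.dropna(subset=["Embarked"])

print(len(toy), len(with_age), len(with_port))
```

Dropping per analysis keeps more rows than removing every row that has any missing value up front, which matters here because Age is missing far more often than Embarked.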

titanic_cleaned = titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titanic_cleaned.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	0	3	male	22.0	1	0	7.2500	S
1	1	1	female	38.0	1	0	71.2833	C
2	1	3	female	26.0	0	0	7.9250	S
3	1	1	female	35.0	1	0	53.1000	S
4	0	3	male	35.0	0	0	8.0500	S

titanic_cleaned.describe()

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

titanic_cleaned.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

Data exploration

As part of this project I want to explore answers to the following questions.

How is Survival correlated with other attributes of the dataset? Find Pearson’s r.
Did Sex play a role in Survival?
Did class play a role in survival?
How is fare related to Age, Class and Port of Embarkation?
How did Embarkation vary across different ports?

Let’s start with Q1.

Q1. How is Survival correlated with other attributes of the dataset? Find Pearson’s r.

I will compute the pairwise correlation of columns (excluding NA/null values) using the pandas.DataFrame.corr method, with the standard ‘pearson’ correlation coefficient.

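# numeric_only=True skips the string columns (Sex, Embarked); needs pandas 1.5 or newer
titanic_cleaned.corr(method='pearson', numeric_only=True)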

	Survived	Pclass	Age	SibSp	Parch	Fare
Survived	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307
Pclass	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500
Age	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
SibSp	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651
Parch	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225
Fare	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000

From the correlation table above we can see that Survival is inversely correlated with the Pclass value (r = -0.34). Since 1st class has the lowest numerical value, this means it had a better survival rate than the other classes.

We also see that Age and Survival are only slightly (negatively) correlated (r = -0.08), while Fare and Survival are positively correlated (r = 0.26).
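As a cross-check on what Pearson’s r measures, the sketch below computes it by hand in plain Python. The helper `pearson_r` is a name introduced here for illustration. The passenger counts are the per-class totals obtained by summing both sexes in the grouped counts shown under Q3, so the result should land close to the Pclass/Survived entry in the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: co-variation of xs and ys divided by the product of their spreads.

    The 1/n factors of covariance and variance cancel, so plain sums are used.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# (Pclass, Survived, number of passengers), summed over both sexes
# from the grouped counts shown under Q3.
counts = [(1, 1, 136), (1, 0, 80), (2, 1, 87), (2, 0, 97), (3, 1, 119), (3, 0, 372)]

# Expand the counts back into one value per passenger.
pclass, survived = [], []
for cls, outcome, n in counts:
    pclass += [cls] * n
    survived += [outcome] * n

r = pearson_r(pclass, survived)
print(round(r, 3))
```

A negative value means that a higher class number (3rd class) goes together with a lower value of Survived.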

We will try to visualize these correlations below.

This brings us to Q2.

Q2. Did Sex play a role in Survival?

Let’s plot the counts of the ‘Survived’ column.

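# Count plot of the Survived column (factorplot was renamed to catplot in seaborn 0.9)
sns.catplot(x='Survived', data=titanic_df, kind='count')
# sns.plt is not available in current seaborn, so matplotlib's plt is used directly
plt.title('Count of Passengers who survived')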

Let’s see the age distribution of the passengers aboard the Titanic.

# Histogram of passenger ages in the given data set (sample), missing ages dropped
# (histplot replaces the deprecated distplot(..., kde=False); needs seaborn 0.11 or newer)
sns.histplot(titanic_cleaned['Age'].dropna(), bins=15)
plt.ylabel('Count')
plt.title('Age distribution of the passengers aboard the Titanic')

Many passengers are between 15 and 40 years old. But again, this is not the complete dataset.

Now I would like to see the age distribution of passengers for both genders. I will do this by plotting the rows where ‘Sex’ is male and female respectively.

# Age distribution of male and female passengers (missing ages dropped)
plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'female')].dropna(), bins=7, label='Female', histtype='stepfilled')
plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'male')].dropna(), bins=7, label='Male', alpha=.7, histtype='stepfilled')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age wise Distribution of Male and Female passengers')
plt.legend()

There were many more male passengers aboard than female passengers.

I will do an age distribution plot for passengers who survived, across both genders, by keeping only the rows where ‘Survived’ = 1.

# Age distribution of male and female survivors (missing ages dropped)
plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'female') & (titanic_cleaned['Survived'] == 1)].dropna(), bins=7, label='Female', histtype='stepfilled')
plt.hist(titanic_cleaned['Age'][(titanic_cleaned['Sex'] == 'male') & (titanic_cleaned['Survived'] == 1)].dropna(), bins=7, label='Male', alpha=.7, histtype='stepfilled')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age wise Distribution of Male and Female survivors')
plt.legend()

From the visualization above, it appears that women had a better chance of survival: there are more female than male survivors even though more men were aboard. One can do a hypothesis test to verify this.
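One possible form of such a test is a chi-square test of independence between sex and survival. This is a minimal sketch rather than part of the original analysis. The four counts are the per-sex totals obtained by summing the three classes in the grouped counts shown under Q3, and the statistic is computed by hand for a 2x2 table without a continuity correction.

```python
from math import erfc, sqrt

# Survivors and non-survivors per sex, summed over the three classes
# from the grouped counts shown under Q3.
female_survived, female_died = 233, 81
male_survived, male_died = 109, 468

a, b, c, d = female_survived, female_died, male_survived, male_died
n = a + b + c + d

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))

female_rate = a / (a + b)
male_rate = c / (c + d)
print(f"female: {female_rate:.1%}, male: {male_rate:.1%}")
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```

A very small p-value would indicate that a gap this large between the female and male survival rates is unlikely to arise by chance if sex and survival were independent in this sample.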

Let’s take a look at the youngest and oldest passengers to survive.

# min() and max() skip missing ages (NaN) by default
youngest_survive = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 1)].min()
youngest_die = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 0)].min()
oldest_survive = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 1)].max()
oldest_die = titanic_cleaned['Age'][(titanic_cleaned['Survived'] == 0)].max()

# print() as a function, so this runs on Python 3
print("Youngest to survive: {} \nYoungest to die: {} \nOldest to survive: {} \nOldest to die: {}".format(youngest_survive, youngest_die, oldest_survive, oldest_die))

Youngest to survive: 0.42 
Youngest to die: 1.0 
Oldest to survive: 80.0 
Oldest to die: 74.0
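The same four numbers can also be obtained in one step with groupby and agg. The sketch below assumes a hypothetical `toy` frame. Its rows are made up, with ages chosen to echo the extremes printed above, and the real frame would be titanic_cleaned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real frame would be titanic_cleaned.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0],
    "Age": [0.42, 80.0, 1.0, 74.0, np.nan],
})

# min and max skip NaN by default, so no explicit dropna() is needed.
age_range = toy.groupby("Survived")["Age"].agg(["min", "max"])
print(age_range)
```

The result has one row per value of Survived and one column per statistic, which avoids repeating the boolean filter four times.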

Q3. Did class play a role in survival?

Next, let’s look at survival based on passenger’s class for both genders.

We can do this by grouping the dataframe with respect to Pclass, Survived and Sex.

# Number of passengers for each combination of class, survival and sex
grouped_by_pclass = titanic_cleaned.groupby(['Pclass', 'Survived', 'Sex'])
grouped_by_pclass.size()

Pclass  Survived  Sex   
1       0         female      3
                  male       77
        1         female     91
                  male       45
2       0         female      6
                  male       91
        1         female     70
                  male       17
3       0         female     72
                  male      300
        1         female     72
                  male       47
dtype: int64
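A three-level result like this can be easier to compare when one level is moved into columns. The sketch below shows the idea on a hypothetical `toy` frame (made-up rows, not the Titanic data) using unstack.

```python
import pandas as pd

# Hypothetical toy rows; the real frame would be titanic_cleaned.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
    "Sex":      ["female", "male", "male", "male", "female", "female"],
})

sizes = toy.groupby(["Pclass", "Survived", "Sex"]).size()

# Move the innermost index level (Sex) into columns; absent combinations become 0.
table = sizes.unstack(fill_value=0)
print(table)
```

Each row is then one (Pclass, Survived) pair, with the female and male counts side by side.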

titanic_cleaned.groupby(['Pclass', 'Sex']).describe()

			Age	Fare	Parch	SibSp	Survived
Pclass	Sex
1	female	count	85.000000	94.000000	94.000000	94.000000	94.000000
		mean	34.611765	106.125798	0.457447	0.553191	0.968085
		std	13.612052	74.259988	0.728305	0.665865	0.176716
		min	2.000000	25.929200	0.000000	0.000000	0.000000
		25%	23.000000	57.244800	0.000000	0.000000	1.000000
		50%	35.000000	82.664550	0.000000	0.000000	1.000000
		75%	44.000000	134.500000	1.000000	1.000000	1.000000
		max	63.000000	512.329200	2.000000	3.000000	1.000000
	male	count	101.000000	122.000000	122.000000	122.000000	122.000000
		mean	41.281386	67.226127	0.278689	0.311475	0.368852
		std	15.139570	77.548021	0.658853	0.546695	0.484484
		min	0.920000	0.000000	0.000000	0.000000	0.000000
		25%	30.000000	27.728100	0.000000	0.000000	0.000000
		50%	40.000000	41.262500	0.000000	0.000000	0.000000
		75%	51.000000	78.459375	0.000000	1.000000	1.000000
		max	80.000000	512.329200	4.000000	3.000000	1.000000
2	female	count	74.000000	76.000000	76.000000	76.000000	76.000000
		mean	28.722973	21.970121	0.605263	0.486842	0.921053
		std	12.872702	10.891796	0.833930	0.642774	0.271448
		min	2.000000	10.500000	0.000000	0.000000	0.000000
		25%	22.250000	13.000000	0.000000	0.000000	1.000000
		50%	28.000000	22.000000	0.000000	0.000000	1.000000
		75%	36.000000	26.062500	1.000000	1.000000	1.000000
		max	57.000000	65.000000	3.000000	3.000000	1.000000
	male	count	99.000000	108.000000	108.000000	108.000000	108.000000
		mean	30.740707	19.741782	0.222222	0.342593	0.157407
		std	14.793894	14.922235	0.517603	0.566380	0.365882
		min	0.670000	0.000000	0.000000	0.000000	0.000000
		25%	23.000000	12.331250	0.000000	0.000000	0.000000
		50%	30.000000	13.000000	0.000000	0.000000	0.000000
		75%	36.750000	26.000000	0.000000	1.000000	0.000000
		max	70.000000	73.500000	2.000000	2.000000	1.000000
3	female	count	102.000000	144.000000	144.000000	144.000000	144.000000
		mean	21.750000	16.118810	0.798611	0.895833	0.500000
		std	12.729964	11.690314	1.237976	1.531573	0.501745
		min	0.750000	6.750000	0.000000	0.000000	0.000000
		25%	14.125000	7.854200	0.000000	0.000000	0.000000
		50%	21.500000	12.475000	0.000000	0.000000	0.500000
		75%	29.750000	20.221875	1.000000	1.000000	1.000000
		max	63.000000	69.550000	6.000000	8.000000	1.000000
	male	count	253.000000	347.000000	347.000000	347.000000	347.000000
		mean	26.507589	12.661633	0.224784	0.498559	0.135447
		std	12.159514	11.681696	0.623404	1.288846	0.342694
		min	0.420000	0.000000	0.000000	0.000000	0.000000
		25%	20.000000	7.750000	0.000000	0.000000	0.000000
		50%	25.000000	7.925000	0.000000	0.000000	0.000000
		75%	33.000000	10.008300	0.000000	0.000000	0.000000
		max	74.000000	69.550000	5.000000	8.000000	1.000000

I would also like to see the survival rate across all the classes. I can do this by taking the number of survivors in each class, dividing it by the total number of passengers in that class and multiplying by 100. I will use the pandas groupby function to segregate passengers according to their class.

titanic_cleaned.groupby(['Pclass'])['Survived'].sum()/titanic_cleaned.groupby(['Pclass'])['Survived'].count()*100

Pclass
1    62.962963
2    47.282609
3    24.236253
Name: Survived, dtype: float64
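Because Survived only takes the values 0 and 1, its mean within a group is already the survival rate, so the sum/count division can be written more compactly. The sketch below rebuilds one row per passenger from the grouped counts above (summed over both sexes) and applies mean(). The names `counts` and `frame` are introduced here for illustration.

```python
import pandas as pd

# (Pclass, Survived, passengers), summed over both sexes from the grouped counts above.
counts = [(1, 1, 136), (1, 0, 80), (2, 1, 87), (2, 0, 97), (3, 1, 119), (3, 0, 372)]

# Expand the counts back into one row per passenger.
rows = [(cls, outcome) for cls, outcome, n in counts for _ in range(n)]
frame = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate = frame.groupby("Pclass")["Survived"].mean() * 100
print(rate.round(2))
```

On the real data the equivalent call would be `titanic_cleaned.groupby('Pclass')['Survived'].mean() * 100`.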

I see that class did play a role in the survival of the passengers. Now let’s visualize the same.

# One count plot per class (catplot replaces factorplot, height replaces size)
sns.catplot(x='Survived', col='Pclass', data=titanic_cleaned, kind='count', height=7, aspect=.8)
plt.subplots_adjust(top=0.9)
plt.suptitle('Class wise segregation of passengers', fontsize=16)

The above visualization compares passengers who survived the tragedy with those who did not, across the three classes. We can also drill down further to visualize the survival of passengers of both genders across the 3 classes.

# Same plot, split by sex within each class
sns.catplot(x='Survived', col='Pclass', hue='Sex', data=titanic_cleaned, kind='count', height=7, aspect=.8)
plt.subplots_adjust(top=0.9)
plt.suptitle('Class and gender wise segregation of passengers', fontsize=16)

From the above visualization we can see that class played an important role in the survival of both male and female passengers. This brings us to my next questions.

Q4 and Q5. How is fare related to Age, Class and Port of Embarkation, and how did Embarkation vary across different ports?

Let’s see how Fare varies with respect to Age and Port of Embarkation. I will do a scatterplot of passengers from the 3 classes, with Age and Fare on the X and Y axes.

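# Scatter only (fit_reg=False), coloured by class; x and y must be keyword arguments in current seaborn
sns.lmplot(x='Age', y='Fare', data=titanic_cleaned, fit_reg=False, hue="Pclass", scatter_kws={"marker": ".", "s": 20})
plt.title('Scatterplot of passengers w.r.t Fare and Age')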

I can segregate the passengers according to their Port of Embarkation and then compare Fare vs. Age across the 3 classes.

# One panel per port of embarkation
sns.lmplot(x='Age', y='Fare', data=titanic_cleaned, fit_reg=False, hue="Pclass", col="Embarked", scatter_kws={"marker": ".", "s": 20})
plt.subplots_adjust(top=0.9)
plt.suptitle('Scatterplot of passengers w.r.t Fare and Age for diff. ports', fontsize=16)

From the above visualization we can see that Fare is quite uniform for Class 2 and 3 across all ages. Fare varies for Class 1 across all ages, but we cannot conclude why it varies; we need more attributes for our data points to drill down to the reason for the variation. We can also observe that a lot of passengers embarked from the port of Southampton.
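Both observations can also be backed by numbers rather than read off the plots. The sketch below assumes a hypothetical `toy` frame (made-up rows, not the Titanic data). It counts passengers per port with value_counts and summarizes the fare spread per class with groupby.

```python
import pandas as pd

# Hypothetical toy rows; the real frame would be titanic_cleaned.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", None],
    "Pclass":   [3, 1, 1, 3, 2, 1],
    "Fare":     [7.25, 53.10, 71.28, 7.75, 13.00, 80.00],
})

# Passengers per port; value_counts() leaves out missing ports by default.
per_port = toy["Embarked"].value_counts()

# Spread of fares within each class: a large gap between min and max
# is what a scatterplot shows as vertical scatter.
fare_spread = toy.groupby("Pclass")["Fare"].agg(["min", "median", "max"])

print(per_port)
print(fare_spread)
```

Run against the real frame, the first result would show how dominant Southampton is, and the second how much wider the Class 1 fare range is than the others.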

Conclusions

From my exploratory analysis of the Titanic dataset I conclude that women had a higher chance of survival. A hypothesis test could be used to check whether this difference is statistically significant. I also see that the class (socio-economic status) of the passengers played a role in their survival. I also compared fares across the different classes and found that they varied a lot for Class 1 passengers, although I could not conclude why they varied differently for Class 1 due to insufficient data.

This dataset has some limitations, such as missing values for some attributes of the passengers. This is not in any form an exhaustive study. More can be done on this data set.

Reference:

Books: Python Data Science Handbook, by Jake VanderPlas