Titanic - Machine Learning from Disaster

Gangavelli Ruthwik
Oct 28, 2023

In this post, we use machine learning to build a model that predicts which passengers survived the Titanic shipwreck.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we are asked to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using the passenger data that we have (i.e. name, age, gender, socio-economic class, etc.).

Here we have a total of three data sets: gender_submission.csv, test.csv and train.csv.

The gender_submission.csv file contains the passenger ID and the survival status (1 for survived, 0 for deceased).

The test.csv file contains details such as Passenger Id, Passenger Class, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.

The train.csv file contains the same details (Passenger Id, Passenger Class, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked), along with a Survived column that records whether each passenger survived.

Now, with this data, we can find the number of women and men who survived, because train.csv records both the sex and the survival status of each passenger.

For females:

For males:

From the above two images, we can see that nearly 75% of the women and 19% of the men in the data survived. Gender is clearly a strong indicator of survival, so gender_submission.csv, which relies on this single column, is a reasonable first guess. To get a more accurate prediction, however, we need to use multiple columns.
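The survival rates by gender can be reproduced with a few lines of pandas. This is a minimal sketch, assuming the training data has the `Sex` and `Survived` columns described above; the helper name `survival_rate_by_sex` is made up for illustration.

```python
import pandas as pd


def survival_rate_by_sex(train: pd.DataFrame) -> pd.Series:
    """Return the fraction of survivors for each value of the Sex column.

    Hypothetical helper: assumes `train` has a 'Sex' column and a
    'Survived' column coded as 1 (survived) or 0 (deceased).
    """
    # Because Survived is 0/1, its mean within a group is the survival rate.
    return train.groupby("Sex")["Survived"].mean()
```

Calling `survival_rate_by_sex(pd.read_csv("train.csv"))` should then give one rate for `female` and one for `male`, assuming train.csv sits in the working directory.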

So here we will build a machine learning model known as a random forest.

This model is constructed of several "trees" that will individually consider each passenger's data and vote on whether the individual survived. Then, the random forest model makes a democratic decision: the outcome with the most votes wins!
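The voting idea can be illustrated in a few lines. This is only a toy sketch of the "most votes wins" rule described above, with a hypothetical helper name; scikit-learn's own random forest combines its trees by averaging their predicted class probabilities rather than counting hard votes, although the intuition is the same.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the outcome that received the most votes.

    Hypothetical helper: `tree_votes` is a list with one prediction per
    tree, e.g. 1 for "survived" and 0 for "deceased".
    """
    # most_common(1) returns a list with the single (outcome, count) pair
    # that has the highest count.
    return Counter(tree_votes).most_common(1)[0][0]
```

For example, if three out of five trees vote 1 for a passenger, `majority_vote([1, 0, 1, 1, 0])` picks 1 as the forest's prediction.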

Now, using the random forest model, we write some code and run it to get our model's predictions.

The code in the above image looks for patterns in four different columns ("Pclass", "Sex", "SibSp", and "Parch") of the data. It constructs the trees in the random forest model based on patterns in the train.csv file, before generating predictions for the passengers in test.csv. The code also saves these new predictions in a CSV file, submission.csv.
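A minimal sketch of code along these lines, using scikit-learn, is given here. It is not necessarily the exact code used above: the helper name `fit_and_predict` and the hyperparameters (`n_estimators=100`, `max_depth=5`, `random_state=1`) are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The four columns the model looks at.
FEATURES = ["Pclass", "Sex", "SibSp", "Parch"]


def fit_and_predict(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Train a random forest on `train` and predict survival for `test`."""
    y = train["Survived"]

    # One-hot encode the text column (Sex) so the model gets numeric inputs.
    X = pd.get_dummies(train[FEATURES])
    X_test = pd.get_dummies(test[FEATURES])
    # Make sure the test columns match the training columns exactly.
    X_test = X_test.reindex(columns=X.columns, fill_value=0)

    # Hyperparameters are illustrative assumptions, not tuned values.
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, y)

    return pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": model.predict(X_test),
    })
```

With the two Kaggle files loaded through `pd.read_csv`, the result can be written out with `fit_and_predict(train, test).to_csv("submission.csv", index=False)`.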

After submitting, we get a new prediction score, this time calculated from multiple columns.

MY CONTRIBUTION:

To determine whether a passenger survived, a pipeline is used to preprocess the data, train the random forest classifier, and analyze the results. The preprocessing is split into two parts: one for categorical features and one for numerical features. The numeric transformer fills in all missing values with the median of the column and then scales the data to a mean of 0 and a standard deviation of 1. The categorical transformer fills in missing values with the most frequent value in the column and then encodes the categorical variables.

The preprocessor variable uses a ColumnTransformer object to combine the two transformers. The categorical transformer is applied to the Pclass, Sex, and Embarked columns, while the numeric transformer is applied to the Age, SibSp, Parch, and Fare columns.
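The preprocessing and model steps described above could be wired together roughly as follows. This is a sketch under stated assumptions rather than the exact code used here: the function name `build_pipeline`, the step names, the choice of one-hot encoding for the categorical columns, and the classifier settings are all illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_FEATURES = ["Age", "SibSp", "Parch", "Fare"]
CATEGORICAL_FEATURES = ["Pclass", "Sex", "Embarked"]


def build_pipeline() -> Pipeline:
    """Build a preprocessing + random forest pipeline (illustrative sketch)."""
    # Numeric columns: fill missing values with the median, then
    # standardize to mean 0 and standard deviation 1.
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])

    # Categorical columns: fill missing values with the most frequent value,
    # then encode. One-hot encoding is an assumption made for this sketch.
    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])

    # Route each group of columns to its own transformer.
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric_transformer, NUMERIC_FEATURES),
        ("cat", categorical_transformer, CATEGORICAL_FEATURES),
    ])

    # Classifier settings are illustrative assumptions, not tuned values.
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=1)),
    ])
```

A typical use would be `model = build_pipeline()`, then `model.fit(train[NUMERIC_FEATURES + CATEGORICAL_FEATURES], train["Survived"])`, then `model.predict(test[NUMERIC_FEATURES + CATEGORICAL_FEATURES])`. Keeping the imputation and scaling inside the pipeline means the same preprocessing, learned from the training data, is applied to the test data.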

After making these modifications, running the code again and submitting the predictions, I slightly improved the prediction score.

IMPROVEMENT:

After modifying the code, running it and submitting the model's predictions, I was able to improve the prediction score.