Would you survive the Titanic?
Introduction
This report analyzes what kind of passengers were likely to survive the Titanic disaster, using machine learning tools to predict which passengers survived. The data consist of several features and two datasets, one for training and one for testing. I investigate the problem with logistic regression, whose algorithm and Python code were fully described in the machine learning classes.
Training Model
At this point we use the loss function to look for the best learning rate and number of iterations for the training model.
The model was trained on the training dataset for 100,000 iterations. Learning rates on the order of 1e-3 gave better results than those on the order of 1e-4, so the search was run again over the range 1e-3 to 4e-3. With a learning rate of 0.004, the loss converged after approximately 50,000 iterations (second figure).
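The learning-rate search described above can be sketched as follows. This is a minimal illustration and not the code from the classes: the function names (sigmoid, log_loss, train), the toy data and the short iteration count are all assumptions chosen so the example runs quickly.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(X, y, w, b):
    """Mean binary cross-entropy of the model (w, b) on (X, y)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, lr, n_iter):
    """Batch gradient descent from zero weights; returns (w, b, final loss)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, log_loss(X, y, w, b)


# Hypothetical toy data standing in for the training set: [class, sex] per row.
random.seed(0)
X = [[random.choice([1, 2, 3]), random.choice([0, 1])] for _ in range(200)]
y = [1 if 2.5 * sex - 1.0 * pclass + 1.0 + random.gauss(0, 1) > 0 else 0
     for pclass, sex in X]

# Compare candidate learning rates by the loss each one reaches.
results = {lr: train(X, y, lr, n_iter=500)[2] for lr in (1e-4, 1e-3, 4e-3)}
best_lr = min(results, key=results.get)
print(results, best_lr)
```

On the real training set the same loop would run for far more iterations, and plotting the loss per iteration for each rate would give curves like the ones in the figures.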
Analyze our model
After training the model, we obtain a weight for each feature and a bias:
Class           = -1.1468791
Sex             =  2.80207082
Age             = -0.03980783
Number Sibling  = -0.32565276
Number Children = -0.09992111
Fare            =  0.00564653
Bias            =  2.242518787043841
From the weights we can draw several conclusions. The weight for class is negative, so the higher the class number, the lower the chance of survival: first-class passengers had a better chance of surviving than second- or third-class passengers. The weight for sex is positive, so females had a better chance than males (male = 0 and female = 1 in the dataset). The weight for age is negative, so younger people had a better chance of surviving. The weights for the number of siblings and the number of children are negative, so passengers with fewer siblings and children had, statistically, a better chance of surviving. The weight for fare is positive, which shows that passengers who bought more expensive tickets had a better chance of surviving.
The two weights with the largest magnitude belong to class and sex, so we can guess that most of the survivors were female and travelled in first class. The weight for sex is much larger than the weight for age, but we must be careful about the ranges of these features: sex takes only the values 0 and 1, while age spans a much wider range. The scatter plot shows the distribution of the two outcomes (survived and did not survive) in the plane defined by the two most influential features.
Yellow marks the passengers who survived. We can tell that most of the women who survived were from class 1 and most of the men who died were from class 3. In addition, the chart below shows that almost no passengers older than 65 survived.
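The caution about feature ranges can be made concrete with a small calculation. The sketch below converts the reported weights into odds ratios (the factor by which the odds of survival change per unit of a feature) and multiplies each weight by the span of its feature. The spans for class and sex follow from the text; the 80-year span for age is an assumption for illustration and is not taken from the dataset.

```python
import math

# Weights as reported in this report.
weights = {
    "Class": -1.1468791,
    "Sex": 2.80207082,
    "Age": -0.03980783,
    "Number Sibling": -0.32565276,
    "Number Children": -0.09992111,
    "Fare": 0.00564653,
}

# Odds ratio per one-unit increase of each feature.
odds_ratio = {name: math.exp(w) for name, w in weights.items()}

# Feature spans (max - min). Class runs 1..3 and sex 0..1 per the text;
# the age span of 80 years is an ASSUMED value for illustration only.
assumed_span = {"Class": 2, "Sex": 1, "Age": 80}

# Largest change in log-odds each feature can cause across its span.
max_shift = {name: weights[name] * span for name, span in assumed_span.items()}

for name in assumed_span:
    print(f"{name}: odds ratio per unit = {odds_ratio[name]:.3f}, "
          f"log-odds shift over span = {max_shift[name]:+.2f}")
```

Under that assumed span, age can shift the log-odds by more than sex does over their full ranges, even though its per-unit weight is much smaller. This is why raw weights alone should not be read as feature importance.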
Finally, we check the accuracy of the model on the training dataset, and the result is almost 80%.
Having interpreted the model, we can use it to predict the test dataset.
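As an illustration of how the fitted model turns a passenger into a prediction, the sketch below applies the reported weights and bias to two hypothetical passengers. It assumes the features enter the model unscaled, which the report does not state; the passenger values and the function names are invented for the example.

```python
import math

# Weights and bias as reported in this report.
WEIGHTS = {
    "Class": -1.1468791,
    "Sex": 2.80207082,
    "Age": -0.03980783,
    "Number Sibling": -0.32565276,
    "Number Children": -0.09992111,
    "Fare": 0.00564653,
}
BIAS = 2.242518787043841


def survival_probability(passenger):
    """Sigmoid of the weighted feature sum plus the bias."""
    z = BIAS + sum(WEIGHTS[name] * passenger[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def predict(passenger, threshold=0.5):
    """Return 1 (survived) if the probability reaches the threshold, else 0."""
    return int(survival_probability(passenger) >= threshold)


# Hypothetical passengers (ASSUMED raw, unscaled feature values).
first_class_woman = {"Class": 1, "Sex": 1, "Age": 30,
                     "Number Sibling": 0, "Number Children": 0, "Fare": 80}
third_class_man = {"Class": 3, "Sex": 0, "Age": 30,
                   "Number Sibling": 0, "Number Children": 0, "Fare": 8}

for label, person in [("first-class woman", first_class_woman),
                      ("third-class man", third_class_man)]:
    print(label, round(survival_probability(person), 3), predict(person))
```

With these inputs the first passenger is classified as a survivor and the second is not, in line with the interpretation of the class and sex weights.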
Evaluate the model
Loading the test dataset made it possible to evaluate the model. With the threshold kept at 0.5, the test accuracy reached 78.5%. This is slightly lower than the training accuracy, so the model does not seem to suffer from overfitting or underfitting. To improve the model, I decided to remove the features with the smallest weights and fit the model again. After removing these features from the training dataset, recomputing the weights and bias, and applying them to the test dataset, the accuracy increased by 0.5 percentage points to 79%.
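The two steps used in this evaluation, scoring predictions at a threshold and choosing which features to keep, can be sketched as below. The report does not say which or how many features were removed, so the helper keep_strongest and the choice n_keep=3 are hypothetical; ranking by absolute weight is one simple reading of "features that weigh less" and is sensitive to feature scale.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def keep_strongest(weights, n_keep):
    """Names of the n_keep features with the largest absolute weight."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:n_keep]


# Weights as reported in this report.
weights = {
    "Class": -1.1468791,
    "Sex": 2.80207082,
    "Age": -0.03980783,
    "Number Sibling": -0.32565276,
    "Number Children": -0.09992111,
    "Fare": 0.00564653,
}

# Hypothetical choice: keep the three strongest features before refitting.
kept = keep_strongest(weights, n_keep=3)
print(kept)

# Toy labels and probabilities (invented) to show the accuracy computation.
toy_accuracy = accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
print(toy_accuracy)
```

After selecting the features, the model would be refitted on the reduced training set and scored on the test set with the same accuracy function.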