Titanic challenge on Kaggle with decision trees (party) and SVMs (kernlab)

titanic-iconThe Titanic challenge on Kaggle is about inferring from a number of personal details whether a passenger survived the disaster or did not. I gave two algorithms a try, which are decision trees using R package party and SVMs using R package kernlab. I chose to use party for the decision trees over the more prominent rpart because the authors of party make a very good point why their approach is likely to outperform it and other approaches in terms of generalization.

Recursive binary partitioning

All “recursive binary partitioning” algorithms producing a decision tree will iteratively repeat two steps for every generated node realizing a split:

  1. Determine a covariate to split at
  2. Determine a split within the set of values of that covariate

Because step two will split a covariate’s set of values into two parts the method is called “binary partitioning” and because those two steps are traversing a tree from the root node to the terminal nodes (leafs) in the usual recursive manner it is called “recursive binary partitioning”.

What makes the party approach special?

Both steps and the stopping criteria for the algorithm are based on permutation tests instead of optimizing an information measure. So essentially a hypothesis test is performed for all covariates separately against the null hypothesis “The covariate X is independent of the repsonse variable.”. If this hypothesis cannot be rejected at a configured significance level then the algorithm stops because no split is assumed to improve the classification. Otherwise the covariate is chosen for the next split. Optimizing an information level instead (as it is done in rpart) will be performed until a specified level of information purity is reached – but because this approach disregards statistical meaningfulness it is prone to overfitting. For that reason the reasearcher is expected to perform a pruning of the tree – cutting off branches which are assumed to impair generalization.

Further advantages stem from having statistical tests guide the unfolding of the tree. The authors of party assess for example that the bias towards selecting a covariate with many possible splits is highly reduced compared to other common approaches. For the gory details and benchmarks I highly recommend the quite well written vignette provided for the party package.

Representing the data

The initial data set contains for every passenger the following information:

  • class in {1,2,3}
  • sex
  • name (contains titles – {Mr., Mrs., Miss., Master., Dr., …}
  • age (partially missing)
  • number of people on board being sibling or spouse
  • number of people on board being parent or child
  • ticket number
  • fare
  • cabin (mostly missing)
  • embarked in {S,Q,C}

I decided to use the following representations:

  • class (as is)
  • sex (as is)
  • title in {Mister, Doctor, Soldier, Reverend, Missis, Miss, Master}
  • age category in {0-4, 5-10, 11-15, 16-59, >59}
  • number of people on board being sibling or spouse
  • number of people on board being parent or child
  • ticket (only the letters in the beginning or ‘123’ for just digits)
  • cabin (the first letter or “unknown”)
  • embarked

In the end the formula to be modelled with SVMs and decision trees was in both cases as follows:

 Finding a good tree

To find a good tree I performed a 100-fold cross validated grid search on five settings of party::ctree():

  • testtype in {Bonferroni, Univariate, Teststatistic} (no “MC”)
  • teststat in {quad} (no “max”)
  • mincriterion in {90%, 95%, 99%}
  • minsplit in {60, 40, 20, 10}
  • minbucket in {30, 20, 10, 5}
  • maxdepth in {0, 4, 5, 6}

A quick initial check led me to disregard “max” because it didn’t seem to improve the performance and “MonteCarlo” because it took easily 10 times and more longer to come up with a tree while not showing interesting performance.

At the top of the results are multiple trees calculated using Univariate for the test type and a significance level of either 10% or 5%. The top trees of the grid search also showed the best performance for the test set evaluated on Kaggle of getting 79.4% of the cases right.

  The winning tree

Univariate-95-10-5-0Well I do think that looks pretty cool! Nonetheless the quality of the insights is lower than it could be given that only the training data set has been used for learning. Of course a tree based on the full data set would be pretty awesome.

Most of what the tree tells us is no news – women and children show a higher survival rate and the lower the class the worse the prospects. Beyond that the tree seems to indicate that among women and children travelling third class the survival rate seems to be lower for those being companied by more than two people being their spouse or sibling. I would rather assume that the more people you are close to on board the higher your chance of survival.

Let’s give SVMs a try

To choose a promising SVM set up I realized a 100-fold cross validated grid search on parameters for three kernels using kernlab::ksvm() and the same data set as above:

  • rbfdot for C x sigma = {10^-2,…,10^5} x {10^-5,…,10^1}
  • tanhdot for C x sigma x r ={10^-2,…,10^5} x {10^-5,…,10^1} x {-2.4,-1.8,…,2.4}
  • polydot for C x degree x offset = {10^-2,…,10^2} x {1,…,5} x {0,1}

And here you find the results of the grid search. One SVM got my out-of-sample accuracy as high as 81.3% which is almost 2p.p. higher than the best performance I got with trees. But – and that is a big but – in a real world scenario one does not have the luxury of so easily assessing an estimation of the out-of-sample error (which I get by submitting my predictions for the test set to Kaggle). This means that practically one would have chosen the kernel configuration which performed best in the grid search – and the big surprise is that the top SVMs of the grid search do show – surprise, surprise – exactly the same out-of-sample accuracy as the best performing tree – 79.4%.

 So in this case the match SVM vs. tree ends with a solid tie!

Of course I expected the SVM to leave the tree model behind and I guess it makes sense to consider the tie an indication of that 79.4% is the best to take from my initially chosen data representation. But that does not seem to be necessarily the right conclusion at all given that despite the identical measured out-of-sample performance the top SVM and the top tree – they only agree in 91% of the cases! Most of those cases apparently belong to terminal node 11 which is pretty much a 50/50 classification for the tree as it can be seen on the tree plot. Given that about 44% – less than 50% – of the passengers falling into this node did no survive, the tree exclusively predicts a non-survival. The SVM on the other hand predicts – for whatever mysterious reason, instead a survival for almost all of those passengers!

(original article published on www.joyofdata.de)