Decision trees

What are decision trees?

Decision trees are classification models shaped like a tree: at each level, one feature is selected and the examples are split into branches according to its value.

[Figure: example decision tree]

The first choice to make is which feature to split on.

We use information gain (info gain) or the Gini index to make that decision.

Info gain:
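
One common way to write it is shown below. The notation is assumed here rather than taken from the text: S is the set of examples at the parent node, A is the candidate feature, S_v is the subset of S where A takes the value v, H is entropy, and p_c is the fraction of examples in S that belong to class c.

```latex
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
H(S) = -\sum_{c} p_c \log_2 p_c
```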


But what does the info gain represent?

The info gain of a particular feature (work experience in the example above) represents the gain in predictability as we go down one level using that feature.

So how is this predictability gain calculated?

It is measured as the difference between the predictability at the parent node and the expected predictability of the child nodes, i.e. the average over the child nodes, each weighted by the fraction of the parent's examples that it receives.

What is predictability at a node?

If the numbers of positive and negative examples at a particular node are almost equal, then a new example at that node is about equally likely to be positive or negative.

If the number of positive examples is far greater than the number of negative examples at a particular node, then we can say with much more confidence that a new example at this node is likely to be positive.

The second case is more predictable and contains less surprise, whereas the first case is less predictable and contains more surprise.

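Now we use entropy to quantify the surprise at any node. How? If p_i is the fraction of examples at the node that belong to class i, the entropy is H = -sum_i p_i * log2(p_i). It is 0 for a pure node and reaches 1 bit for an even split between two classes.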
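
As a rough sketch of how this could be computed, the snippet below scores a node by its entropy and then derives the info gain of a split. The function names and the example counts are made up for illustration, and only two classes (positive and negative) are assumed.

```python
import math


def entropy(positives, negatives):
    """Shannon entropy in bits: -sum(p * log2(p)) over the two classes."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count > 0:  # a class with no examples contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children.

    parent is a (positives, negatives) pair; children is a list of such pairs.
    """
    parent_total = sum(parent)
    expected_child_entropy = sum(
        (sum(child) / parent_total) * entropy(*child) for child in children
    )
    return entropy(*parent) - expected_child_entropy


balanced = entropy(5, 5)   # equal counts: maximum surprise
skewed = entropy(9, 1)     # mostly positive: little surprise
pure = entropy(10, 0)      # only one class: no surprise

# Hypothetical split of a 5/5 parent into two children of 4/1 and 1/4.
gain = information_gain((5, 5), [(4, 1), (1, 4)])

print(f"balanced={balanced:.3f} skewed={skewed:.3f} pure={pure:.3f}")
# prints: balanced=1.000 skewed=0.469 pure=0.000
print(f"gain={gain:.3f}")
# prints: gain=0.278
```

The balanced node has the highest entropy and the pure node the lowest, which matches the intuition about surprise described above.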


  1. How to train?
    1. How to select the next best feature for branching?
      1. Calculate the info gain for every candidate feature and select the one with the maximum gain.
  2. How to predict?
    1. Pass the test example down the tree; the leaf node it reaches gives the classification.
  3. How to write a decision tree in Python using scikit-learn?
    1. Writing the classifier
    2. Visualizing the decision tree
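
For item 3, a minimal sketch with scikit-learn might look like the following. The Iris dataset, the train/test split, `criterion="entropy"` and `max_depth=3` are choices made here for illustration and do not come from the text. `export_text` is used as a simple plain-text way to visualize the tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative dataset (an assumption): 150 flowers, 4 numeric features, 3 classes.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# Writing the classifier: split on info gain (entropy) and keep the tree shallow.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Predicting: each test example is passed down the tree to a leaf.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")

# Visualizing the decision tree as indented text rules.
tree_rules = export_text(clf, feature_names=list(iris.feature_names))
print(tree_rules)
```

Passing `criterion="gini"` instead would switch the split measure to the Gini index. scikit-learn expects numeric inputs, so categorical features would need to be encoded first.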

When to use decision trees?

  1. Well suited to categorical data.

How much training data do you need?

Nature of decision boundary

Decision trees are interpretable, unlike neural networks: you can see exactly why the classifier makes a given decision.

Individual trees are often used as weak learners, which makes them good building blocks for ensemble methods.

They can handle unbalanced data sets, for example 1% positive examples and 99% negative examples.

Each split is univariate: it tests a single feature and does not combine features.