## Decision trees

What are decision trees?

Decision trees are classification models built as a tree in which, at each level, a particular feature is selected for branching.

They let you create non-linear decision boundaries with a series of linear questions.

The primary choice to make is which feature to split on first.

The objective in choosing the order of splits is to narrow down the possibilities as fast as possible. For example, in a guessing game: Animal? Person? Famous? Living? Singer? Female? Michael Jackson?

If you had started with "Michael Jackson?" and the answer were yes, that would be great, but if the answer were no, you would have narrowed down almost nothing.

We use information gain or Gini impurity to make that decision.

Info gain:

But what does information gain represent?

The information gain of a particular feature represents the gain in predictability when we go down one level in the tree by splitting on that feature.

So, how is this predictability gain calculated?

It is measured as the difference between the predictability at the parent node and the weighted sum (expected value) of the predictabilities at the child nodes.

What is predictability at a node?

If the numbers of positive and negative examples at a particular node are almost equal, then a new example at that node is equally likely to be either positive or negative.

If the number of positive examples far exceeds the number of negative examples at a particular node, then we can say with much more confidence that a new example at this node is likely to be positive.

The second case is more predictable and contains less surprise, whereas the first case is less predictable and has more surprise.

We use entropy to quantify the surprise at a node. How?

If all the samples at a node belong to the same class, the entropy is 0. If the samples are split evenly between two classes, the entropy is 1 bit; more generally, for k evenly distributed classes it is log2(k).
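The entropy and information-gain definitions above can be sketched in a few lines of plain Python (the function names are my own, not from any library):

```python
import math

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    # Sum of -p * log2(p) over each class proportion p
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def info_gain(parent, children):
    """Information gain = parent entropy minus the
    weighted average entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

print(entropy([1, 1, 1, 1]))   # 0.0 -- pure node, no surprise
print(entropy([1, 1, 0, 0]))   # 1.0 -- evenly split binary node, maximum surprise
# A split that separates the classes perfectly gains the full parent entropy:
print(info_gain([1, 1, 0, 0], [[1, 1], [0, 0]]))   # 1.0
```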

1. How to train?
    1. How do we select the next best feature for branching? Calculate the information gain for all the features and select the one with the maximum gain.
2. How to predict?
    1. Just pass the test example down the tree; your classification is at the leaf node.
3. How to write a decision tree in Python using scikit-learn?
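A minimal sketch of the train/predict steps above with scikit-learn's `DecisionTreeClassifier`, using the built-in iris dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# criterion="entropy" selects splits by information gain
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)         # train: greedily pick the best split at each node

preds = clf.predict(X_test)       # predict: walk each example down to a leaf
print(clf.score(X_test, y_test))  # mean accuracy on the held-out set
```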

When to use decision trees?

1. Well-suited for categorical data.

How much training data do you need?

Nature of decision boundary

Decision trees are interpretable, unlike neural networks: you can understand exactly why the classifier made a particular decision.

They are weak learners and hence good base models for ensemble methods.

They can handle unbalanced data sets, e.g., 1% positive examples and 99% negative examples.

Standard decision trees are univariate: each split tests a single feature and does not combine features.

Expressiveness of decision trees over n attributes: with n boolean attributes there are 2^(2^n) possible boolean functions, and a decision tree can represent any of them.

How to deal with continuous attributes (age, weight, distance)?

It does not make sense to repeat a discrete attribute along a path of the tree, since its value is already known from the earlier split. If attributes are continuous, though, you can ask different questions about the same attribute along a path (e.g., age > 30, and later age > 50).

How to avoid overfitting in a decision tree?

1. As usual, use cross-validation.
2. Pre-pruning: check the cross-validation score every time you expand the tree; if the score falls below a particular threshold, stop the expansion.
3. Post-pruning: first expand the whole tree, then prune back the leaves, checking the cross-validation score at each step.
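The post-pruning idea can be sketched with scikit-learn's minimal cost-complexity pruning: `cost_complexity_pruning_path` enumerates candidate pruning strengths (`ccp_alpha`) for the fully grown tree, and cross-validation picks among them. The dataset and the "best score wins" loop here are just illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate pruning strengths computed from the fully grown tree;
# larger ccp_alpha prunes the tree back more aggressively.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(clf, X, y, cv=5).mean()  # CV score at this pruning level
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```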

Algorithms:

ID3: top-down, greedy construction using information gain.

Tuning parameters in sklearn:

1. min_samples_split (default=2): a node with fewer than this many samples will not be split any further.
2. criterion: "gini" (default) or "entropy".
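A quick sketch of the effect of min_samples_split: raising it stops small nodes from being split, so the tree stays shallower (the values 2 and 50 are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for n in (2, 50):
    clf = DecisionTreeClassifier(min_samples_split=n, random_state=0).fit(X, y)
    # Higher min_samples_split -> fewer splits -> shallower, smaller tree
    print(n, clf.get_depth(), clf.tree_.node_count)
```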

Decision trees are generally prone to overfitting.

You can build bigger classifiers, such as ensemble methods (e.g., random forests), using decision trees.

Glossary:

References:

https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#one

https://zyxo.wordpress.com/2010/09/17/why-decision-trees-is-the-best-data-mining-algorithm/