## Or how, in a hypothetical world tormented by zombies, decision trees could make the difference between making it out of the woods or not

*Outside the garage, the growls and snarls didn’t stop. He couldn’t believe that the Zombie Apocalypse he had watched again and again in series and movies was finally on his front porch. He could hide in the garage for a while but would have to come out eventually. Should he take an axe with him, or would the rifle be enough? He could try to find some food but, should he go alone? He tried to recall all of the zombie movies he had seen but couldn’t settle on a single strategy. If only he had a way of remembering every scene where a character is killed by zombies, would that be enough to increase his chances of survival? If he just had a decision guide, everything would be simpler…*

Have you ever watched one of those zombie apocalypse movies where there’s one character that always seems to know where the zombies are hidden, or whether it is better to fight them or run away? Does this person really know what’s going to happen? Did someone tell them beforehand? Maybe there’s nothing magical about it. Maybe this person has read a lot of comics about zombies and is really good at knowing what to do in each case and learning from others’ mistakes. How important it is to find a good way of using the events of the past as a guide for our decisions! Such a guide, also known as a decision tree, is a widely used supervised learning algorithm. This article is an introductory discussion of decision trees, how to build them, and why many of them together create a random forest.

You are in the middle of the zombie mayhem and you would like to know how to increase your chances of survival. At this point, you only have information from 15 of your friends. For each one of them you know whether they were alone, whether they had a vehicle or a weapon, and whether they were trained to fight. Most importantly, you know whether they were able to survive or not. How can you use this information to your advantage?

Table 1 summarizes the outcomes and characteristics of your 15 friends. You would like to be like the 3 of them that survived in the end. What do these 3 friends have in common? A simple inspection of the table tells us that the three survivors had all of these things in common: they weren’t alone, they were trained to fight, and they had a vehicle and a weapon. So, will you be able to survive if you have all these 4 things? Past experiences are telling us that you might! If you had to decide what to take with you and whether to be on your own or not, at least you now have some historical data to support your decision.

Zombie apocalypses are never as simple as they seem. Let’s say that instead of the 15 friends from the previous example, you have the following friends:

This time, reaching a conclusion by visual inspection alone is not that easy. The one thing we know for sure is that if you want to survive, you had better have someone by your side: the 5 people who survived weren’t alone (Figure 1). Beyond this, it’s difficult to see whether there’s a particular combination of factors that will lead you to survival. Some people were not able to survive even though they were not alone. How did they die? If you know you will not be alone, what else can you do to increase your chances of surviving? Is there anything like a decision roadmap?

We can find some answers to the previous questions in a decision tree. A decision tree is a model of the expected outcome according to the decisions we make, built from previous experiences. In our example, we can build a decision tree using the characteristics of the 15 friends and their outcomes. A decision tree consists of multiple decision nodes or branches. In each one of these nodes, we make a decision that takes us to the next node, until we reach an outcome.

## Growing a decision tree

If someone asks you to draw a family tree, you’d probably start with your grandparents or great-grandparents. From there, the tree grows through your parents, uncles and cousins, until it reaches you. In a similar way, to grow a decision tree you always start from the node that best separates your data. From that point, the tree keeps growing according to the feature that best divides the data. There are many algorithms you can use to grow a decision tree. This article explains how to use information gain and the Shannon entropy.

Let’s focus on Table 2. We can see that there are 5 people who survived and 10 who died. This means that the probability of surviving is 5/15 = ⅓ and the probability of dying is ⅔. With this information, we can calculate the entropy of this distribution. In this case, entropy refers to the average level of surprise or uncertainty in the distribution. To calculate the entropy we use the following equation:

*H = −p(surv)·log₂ p(surv) − p(die)·log₂ p(die)*

Note that this equation can be expressed in terms of only one of the probabilities, since *p(surv)*+*p(die)*=1. If we plot this function, you can see how the entropy reaches its highest value of 1 when both *p(surv)* and *p(die)* are equal to 0.5. On the contrary, if the whole distribution corresponds to people who all survived or all died, the entropy is zero. So, the higher the entropy, the higher the uncertainty. The lower the entropy, the more homogeneous the distribution is and the less “surprised” we will be about the outcome.

In our case, the number of survivors is less than half of the whole population. It would be reasonable to think that most people don’t survive the zombie apocalypse. The entropy in this case is 0.92, which is what you get from the blue curve of Figure 2 when you look up x=⅓ or ⅔, or when you apply the following equation:

*H = −⅓·log₂(⅓) − ⅔·log₂(⅔) ≈ 0.92*
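The entropy calculation above can be sketched in a few lines of Python (the function name is mine, not from the article’s Notebook):

```python
from math import log2

def entropy(p_surv):
    """Shannon entropy (in bits) of a two-outcome distribution,
    given the probability of one of the outcomes."""
    terms = []
    for p in (p_surv, 1 - p_surv):
        if p > 0:                  # 0*log2(0) is taken as 0 (its limit)
            terms.append(-p * log2(p))
    return sum(terms)

print(round(entropy(1/3), 2))      # 0.92, the entropy of 5 survivors out of 15
print(entropy(0.5))                # 1.0, maximum uncertainty
print(entropy(1.0))                # 0.0, a homogeneous distribution
```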

Now that we know the entropy, or the degree of uncertainty, of the whole distribution, what should we do? The next step is to find how to divide the data so that we reduce that level of uncertainty.

The premise of information gain consists of choosing the decision node that most reduces the entropy of the previous node. At this point, we are trying to find the best first separation of the data. Is it the fact that we are alone, that we know how to fight, or that we have a vehicle or a weapon? To know the answer we can calculate the information gain of each one of these decisions and then determine which of them has the largest gain. Remember that we are trying to reduce the entropy, which measures the heterogeneity or level of surprise in the outcome distribution.

## Are you trained to fight?

Is this the first question you should ask yourself in this case? Will this question produce the largest reduction in the entropy of the outcome distribution? To find out, let’s calculate the entropy of each of the two cases: we know how to fight, and we do not know how to fight. Figure 3 shows that of the 9 people who knew how to fight, only 5 survived. On the contrary, none of the 6 people who weren’t trained to fight survived.

To calculate the entropy of the previous cases we can apply the same equation we used before. Figure 3 shows how the entropy for the case in which people were trained to fight is 0.99, whereas in the other case the entropy is zero. Remember that an entropy of zero means no surprise, a homogeneous distribution, which is exactly what happens here since none of the people who weren’t trained to fight survived. At this point, it is important to note that the calculation of the entropy in this second scenario contains an undefined term, since we end up with a logarithm of zero. In these cases, you can always apply L’Hôpital’s rule, as explained in this article.

We now have to calculate the information gain from this decision. This is the same as asking how much the uncertainty in the outcome would change if we decided to split all of the outcomes according to this question. The information gain is calculated by subtracting the entropy of each decision from the entropy of the main node. An important thing to note is that this operation is weighted according to the number of people on each side of the decision, so a large entropy will have a small effect on the information gain if the number of people who took that decision is small. For this example, the information gain from the ability to fight is 0.32, as shown in Figure 3.
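As a sketch, the weighted information-gain calculation can be written like this (the function names are mine; the split counts come from Figures 3 and 4):

```python
from math import log2

def entropy_counts(counts):
    """Shannon entropy of a distribution given raw outcome counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of each child node."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy_counts(child) for child in children)
    return entropy_counts(parent) - weighted

# Parent node: 5 survivors, 10 deaths.
# Trained to fight: 9 trained (5 survived, 4 died), 6 untrained (all died).
print(round(information_gain([5, 10], [[5, 4], [0, 6]]), 2))  # 0.32

# Alone: 8 alone (all died), 7 not alone (5 survived, 2 died).
print(round(information_gain([5, 10], [[0, 8], [5, 2]]), 2))  # 0.52
```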

## Can you survive the zombies all by yourself?

We can do a similar analysis for the possibility of surviving the zombie apocalypse alone or with someone else. Figure 4 shows the calculation. In this case, the information gain is 0.52. Note how being alone never led to survival, whereas the people who were not alone survived in 5 of the 7 cases.

## What about having a vehicle or a weapon?

For these two cases, we can calculate the information gain as we did before (Figure 5). You can see that these information gains are smaller than the ones calculated previously. This means that, at this point, it is better to divide our data according to the two previous features than according to these ones. Keep in mind that the largest information gain corresponds to the feature for which the resulting weighted entropy is the smallest. Once we have calculated the information gain for every feature, we can determine what the first node of the decision tree will be.

## The primary node

Table 3 shows the information gains for each feature. The largest information gain corresponds to the fact of being alone or having a companion. This node takes us to the first decision in our tree: you won’t survive alone. The 8 people who were alone didn’t make it, regardless of whether they had a weapon, a car, or were trained to fight. This is the first thing we can infer from our analysis, and it supports what we had concluded by just inspecting the data.

At this point, the decision tree looks like Figure 5. We know that there is no chance of surviving if we are alone (taking into account the data we have). If we are not alone, then we might survive, but not in all cases. Since we can calculate the entropy at the right node in Figure 5, which is 0.86 (this calculation is shown in Figure 4), we can also calculate the information gain from the other three features and decide which the next decision node will be.

## The second node

Figure 5 shows that the largest information gain at this point comes from the weapon feature, so that is the next decision node, as shown in Figure 6. Note how all of the people who weren’t alone and had a weapon survived, and that is why the left side of the weapon node finishes in a survival decision.

## The tree is complete

There are still 3 people who weren’t alone and didn’t have a weapon that we need to categorize. If we follow the same process explained previously, we will find that the next feature with the largest information gain is the vehicle. So we can add an additional node to our tree in which we ask whether a particular person had a vehicle. This divides the remaining 3 people into a group of 2 people who did have a vehicle but didn’t survive, and one single person with no vehicle who survived. The final decision tree is presented in Figure 7.
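The finished tree can be read as a handful of nested rules. A minimal sketch in Python (the function name is mine):

```python
def predict_survival(alone, weapon, vehicle):
    """The final decision tree of Figure 7 written as plain if/else rules."""
    if alone:
        return "die"        # none of the 8 people who were alone survived
    if weapon:
        return "survive"    # everyone with company and a weapon survived
    if vehicle:
        return "die"        # 2 people with company and a vehicle but no weapon died
    return "survive"        # the one person with company and nothing else survived

print(predict_survival(alone=False, weapon=True, vehicle=False))  # survive
print(predict_survival(alone=True, weapon=True, vehicle=True))    # die
```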

## The problem with decision trees

As you can see, the decision tree is a model built from previous experiences. Depending on the number of features in your data, you will encounter multiple questions that will guide you to the final answer. It is interesting to note that in this case one of the features is not represented in the decision tree. The ability to fight was never chosen as a decision node, since the other features always had a bigger information gain. This means that, according to the input data, being trained to fight is not important to survive a zombie apocalypse. However, this could also mean that we didn’t have enough samples to determine whether the ability to fight was important or not. The key here is to remember that a decision tree is only as good as the input data we use to build it. In this case, a sample of 15 people might not be enough to get a good estimation of the importance of being trained to fight. This is one of the problems of decision trees.

As with other supervised learning approaches, decision trees are not perfect. On the one hand, they rely heavily on the input data: a small change in the input data can lead to important changes in the final tree, so decision trees are not really good at generalizing. On the other hand, they tend to have overfitting problems. In other words, we can end up with a complex decision tree that works perfectly with the input data but fails dramatically when we use a test set. This can also affect the results if we are using the decision tree with continuous variables instead of categorical ones like those in the example presented.

One way of making decision trees more efficient is to prune them. This means stopping the algorithm before reaching pure nodes like the ones we reached in our example. This can lead to the removal of a branch that is not providing any improvement in the accuracy of the decision tree. Pruning gives the decision tree more generalization power. However, if we decide to prune our decision tree, then we might start asking additional questions such as: when is the right moment to stop the algorithm? Should we stop when we reach a minimum number of samples? Or after a predefined number of nodes? How do we determine these numbers? Pruning can definitely help us to avoid overfitting, but it can also lead to additional questions that are not so easy to answer.
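In scikit-learn, those stopping questions become hyperparameters. A minimal sketch, on synthetic stand-in data (not the article’s dataset), of how explicit limits keep a tree from growing to purity:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic yes/no data with some label noise, so an unpruned tree overfits.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(300, 4))
y = ((X[:, 0] == 1) ^ (rng.random(300) < 0.2)).astype(int)

# An unconstrained tree keeps splitting until its nodes are as pure as possible.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# "Pre-pruning": stop the algorithm early with explicit limits.
pruned = DecisionTreeClassifier(
    max_depth=3,            # predefined maximum number of levels
    min_samples_leaf=10,    # minimum number of samples per leaf
    random_state=0).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # the pruned tree is shallower
```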

What if, instead of a single decision tree, we have multiple decision trees? They will differ according to the portion of the input data they take, the features they read and their pruning characteristics. We will end up with many decision trees and many different answers, but we can always go with the majority in the case of a classification task, or with an average if we are working on regressions. This can help us to generalize the distribution of our data better. We might think that one decision tree is misclassifying, but if we find 10 or 20 trees that reach the same conclusion, then that tells us that there might be no misclassification after all. Basically, we are letting the majority decide instead of guiding ourselves by one single decision tree. This technique is called Random Forest.

The concept of Random Forests is usually related to the concept of Bagging, which is a process where a random sample of data in a training set is chosen with replacement. This means that individual data points can be chosen more than once. In the Random Forest methodology, we select a random number of points, build a decision tree, and then do this again until we have multiple trees. Then, the final decision comes from combining all of the answers obtained from the trees.
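Sampling with replacement is easy to sketch (the function name is mine):

```python
import random

def bootstrap_sample(rows, rng):
    """Draw a sample of the same size as `rows`, with replacement, as in
    Bagging: some rows appear more than once, others not at all."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
rows = list(range(15))              # stand-ins for the 15 entries of Table 2
sample = bootstrap_sample(rows, rng)

print(len(sample))                  # 15: same size as the original set
print(sorted(set(sample)))          # the distinct rows that were drawn
```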

Random Forests is a well-known ensemble method used for classification and regression problems. This method has been applied in many industries such as finance, healthcare and e-commerce [1]. Although the original idea of Random Forests was slowly developed by many researchers, Leo Breiman is usually known as the creator of this technique [2]. His personal webpage contains a detailed description of Random Forests and a thorough explanation of how and why it works. It’s a long but worthy read.

An additional and important thing to understand about random forests is the way in which they work with the features of the dataset. At each node, the random forest randomly selects a predefined number of features, instead of all of them, to decide how to split that node. Remember that in the previous example, we analyzed the information gain from every feature at each level of the decision tree. On the contrary, a random forest only analyzes the information gain from a subset of the features at each node. So, the random forest mixes Bagging with a random feature selection at each node.

Let’s go back to the zombies! The previous example was really easy: we had data from 15 people and we only knew 4 things about each of them. Let’s make this harder! Let’s say that now we have a dataset with more than a thousand entries, and for each one of them we have 10 features. This dataset was randomly generated in Excel and doesn’t belong to any commercial or private repository; you can access it from this GitHub page.

As is common with these types of methodologies, it is a good idea to split the whole dataset into a training and a testing set. We will use the training set to build the decision tree and random forest models, and then we will evaluate them with the test set. For this purpose, we will use the scikit-learn libraries. This Jupyter Notebook contains a detailed explanation of the dataset, how to load it and how to build the models using the library.

The full dataset contains 1024 entries, of which 212 (21%) correspond to survivals and 812 (79%) to deaths. We divided this dataset into a training set that corresponds to 80% of the data (819 entries) and a testing set which contains 205 entries. Figure 8 shows how the ratio between survivals and deaths is maintained in all sets.
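This split can be reproduced with scikit-learn’s train_test_split; the data below is a synthetic stand-in with the same size and class balance as the dataset in the text, not the actual GitHub file:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(1024, 10))     # 10 yes/no features per entry
y = np.array([1] * 212 + [0] * 812)          # 212 survivals, 812 deaths

# stratify=y keeps the survival/death ratio the same in both sets (Figure 8).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(X_train), len(X_test))             # 819 205
```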

Regarding the features, this time we’ve got 6 additional characteristics for every individual:

- Do you have a radio?
- Do you have food?
- Have you taken a course in outdoor survival?
- Have you taken a first aid course?
- Have you had a zombie encounter before?
- Do you have a GPS?

These 6 features, combined with the 4 features we already had, represent 10 different characteristics for each individual or entry. With this information, we can build a decision tree following the previously explained steps. The Jupyter Notebook uses the function DecisionTreeClassifier to generate a Decision Tree. Note that this function is not meant to work with categorical variables. In this case, we have converted all of the answers for each category to -1 or +1. This means that every time we see a -1 in the results it means “No”, whereas a +1 means “Yes”. This is better explained in the Jupyter Notebook.
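A minimal sketch of this step (the data here is a random stand-in with a planted rule, not the article’s GitHub dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_train = rng.choice([-1, 1], size=(819, 10))   # -1 means "No", +1 means "Yes"
y_train = (X_train[:, 0] == 1).astype(int)      # planted rule standing in for real labels

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# A fully equipped, accompanied individual (all answers "Yes").
print(tree.predict([[1] * 10]))                 # [1]
```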

The Notebook explains how to load the data, call the decision tree function and plot the results. Figure 9 shows the decision tree that was built with the 819 entries that correspond to the training set (click here for a bigger picture). The dark blue boxes correspond to final decision nodes in which the answer was survival, whereas the dark orange boxes represent final decision nodes where the answer was not survival. You can see how the first decision node corresponds to the vehicle and, from there, the tree grows according to the different features.

We can evaluate how good this tree is by using the test set inputs to predict the final categories and then comparing these results with the original outcomes. Table 4 shows a confusion matrix with the number of times the decision tree misclassified an entry. We can see that the test set had 40 cases that represented survival and the decision tree only classified 25 of them correctly. On the other hand, of the 165 cases that didn’t survive, the decision tree misclassified 11. The ratio between the correct classifications and the whole test set of 205 points is 0.87, which is known as the prediction accuracy score.
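As a quick check, the accuracy score follows directly from the counts in Table 4:

```python
# Counts from Table 4: 40 true survivals (25 classified correctly),
# 165 true deaths (11 classified incorrectly).
correct = 25 + (165 - 11)
total = 40 + 165
accuracy = correct / total
print(round(accuracy, 2))  # 0.87
```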

An accuracy of 87% doesn’t look bad but, can we improve it using a random forest? The next section of the Jupyter Notebook contains an implementation of a random forest using the sklearn function RandomForestClassifier. This random forest will contain 10 decision trees that only consider 3 features at each split. Each of the decision trees in the random forest is built from 682 entries, which represent 84% of the total training set. So, just to be clear, the random forest process will:

- Take a random subset of 682 entries from the training set
- Build a decision tree that considers 3 randomly chosen features at each node
- Repeat the previous steps 9 more times
- Make predictions by majority vote over the 10 decision trees
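Under the same stand-in data assumption as before, the four steps above map roughly onto RandomForestClassifier’s parameters like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_train = rng.choice([-1, 1], size=(819, 10))
y_train = (X_train[:, 0] == 1).astype(int)   # planted rule standing in for real labels

forest = RandomForestClassifier(
    n_estimators=10,     # 10 decision trees
    max_features=3,      # 3 randomly chosen features per split
    bootstrap=True,      # each tree gets a sample drawn with replacement...
    max_samples=682,     # ...of 682 entries from the training set
    random_state=0)
forest.fit(X_train, y_train)

print(len(forest.estimators_))               # 10 individual trees
print(forest.predict([[1] * 10]))            # majority vote over the 10 trees
```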

Table 5 shows the confusion matrix for the results coming from the random forest. We can see that these results are better than what we were getting before with a single decision tree. This random forest misclassifies 11 entries and has a prediction accuracy score of 0.95, which is higher than that of the decision tree.

It is important to remember that the random forest methodology is not only as good as the input data we have but also as good as the selection of parameters that we use. The number of decision trees we build and the number of features we analyze at each split will have an important effect on the outcome. So, as is the case with many other supervised learning algorithms, it is necessary to spend some time tuning the parameters until we find the best possible result.
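One common way to do this tuning (a sketch, again on stand-in data rather than the article’s dataset) is a small grid search with cross-validation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.choice([-1, 1], size=(200, 10))
y = (X[:, 0] == 1).astype(int)

# Try a few values for the number of trees and the features per split,
# scoring each combination with 3-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10, 20], "max_features": [2, 3, 4]},
    cv=3)
grid.fit(X, y)

print(grid.best_params_)   # the combination with the best cross-validated score
```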

Reading this article is a bit like being that guy in the movies who managed to escape from the zombie chasing him because a tree branch fell on the zombie’s head at just the right time! This is not the only zombie he will encounter, and he is definitely not out of the woods yet! There are many things about Random Forests and Decision Trees that weren’t even mentioned in this article. However, it is enough to understand the usage and applicability of this technique. Nowadays, there are multiple libraries and programs that build these models in seconds, so you most likely don’t have to go through the entropy and information gain calculations again. Still, it is important to understand what is happening behind the scenes and how to correctly interpret the results. In a world where topics such as “Machine Learning”, “Ensemble Methods” and “Data Analytics” are more common every day, it is important to have a clear idea of what these methods are and how to apply them to everyday problems. Unlike in zombie apocalypse survival movies, being ready doesn’t happen by chance.

1. IBM. What is random forest?
2. Louppe, Gilles (2014). Understanding Random Forests. PhD dissertation. University of Liège.