While decision trees can often be effective interpretable models (they are quite comprehensible), they rely on a greedy approach to construction that can result in sub-optimal trees. In this article, we show how to generate classification decision trees of the same (small) size as would be generated by a standard algorithm, but with significantly better performance.
This article continues a series of articles on interpretable AI that also includes discussions of ikNN, AdditiveDecisionTrees, and PRISM Rules.
It is often useful in machine learning to use interpretable models for prediction problems. Interpretable models provide at least two major benefits over black-box models. First, with interpretable models, we understand why specific predictions are made the way they are. And second, we can determine if the model is safe to use on future (unseen) data. Interpretable models are often preferred over black-box models, for example, in high-stakes or highly-regulated environments where there is too much risk in using black-box models.
Decision trees, at least when constrained to reasonable sizes, are quite comprehensible and are excellent interpretable models when they are sufficiently accurate. However, it is not always the case that they achieve sufficient accuracy, and decision trees can often be fairly weak, particularly compared to stronger models for tabular data such as CatBoost, XGBoost, and LGBM (which are themselves boosted ensembles of decision trees).
As well, where decision trees are sufficiently accurate, this accuracy is often achieved by allowing the trees to grow to large sizes, thereby eliminating any interpretability. Where a decision tree has a depth of, say, 6, it has 2⁶ (64) leaf nodes, so effectively 64 rules (though the rules overlap, so the cognitive load to understand these is not necessarily as large as with 64 completely distinct rules), and each rule has 6 conditions (many of which are often irrelevant or misleading). Consequently, a tree this size could probably not be considered interpretable, though it may be borderline depending on the audience. Certainly anything much larger would not be interpretable by any audience.
However, any reasonably small tree, such as one with a depth of 3 or 4, could be considered quite manageable for most purposes. In fact, shallow decision trees are likely as interpretable as any other model.
Given how effective decision trees can be as interpretable models (even if high accuracy and interpretability are not always realized in practice), and the small number of other options for interpretable ML, it is natural that much of the research into interpretable ML (including this article) relates to making decision trees more effective as interpretable models. This comes down to making them more accurate at smaller sizes.
As well as creating interpretable models, it is also often useful in machine learning to use interpretable models as what are called proxy models.
For example, we can create, for some prediction problem, perhaps a CatBoost or neural network model that appears to perform well. But the model will be (if CatBoost, a neural network, or most other modern model types) inscrutable: we will not understand its predictions. That is, by testing the model, we can determine if it is sufficiently accurate, but will not be able to determine why it is making the predictions it is.
Given this, it may or may not be workable to put the model into production. What we can do, though, is create a tool to estimate (and explain in a clear way) why the model is making the predictions it is. One technique for this is to create what is called a proxy model.
We can create a simpler, interpretable model, such as a Decision Tree, rule list, GAM, or ikNN, to predict the behavior of the black-box model. That is, the proxy model predicts what the black-box model will predict. Decision Trees can be very useful for this purpose.
If the proxy model can be made sufficiently accurate (it estimates well what the black-box model will predict) but also interpretable, it provides some insight into the behavior of the black-box model, albeit only approximately: it can help explain why the black-box model makes the predictions it does, though it may not be fully accurate and may not be able to predict the black-box's behavior on unusual future data. Nevertheless, where only an approximate explanation is necessary, proxy models can be quite useful for understanding black-box models.
For the remainder of this article, we assume we are creating an interpretable model to be used as the actual model, though creating a proxy model to approximate another model would work in the same way, and is also an important application of creating more accurate small decision trees.
Normally when constructing a decision tree, we start at the root node and identify the best initial split, creating two child nodes, which are then themselves split in two, and so on until some stopping condition is met.
Each node in a decision tree, during training, represents some portion of the training data. The root node covers the full training set. This will have two child nodes that each represent some subset of the training data (such that the two subsets do not overlap, and together cover the full set of training data from their parent node).
The set of rows covered by each internal node is split into two subsets (typically not of the same size) based on some condition relating to one of the features. In the figure below, the root node splits on feature A > 10.4, so the left node will represent all rows in the training data where feature A ≤ 10.4 and the right node will represent all rows where feature A > 10.4.
To find the split condition at each internal node, this process selects one feature and one split point that will maximize what is known as the information gain, which relates to the consistency of the target values. That is: assuming a classification problem, we start (for the root node) with the full dataset. The target column will include some proportion of each target class. We attempt to split the dataset into two subsets such that the two subsets are each as consistent with respect to the target classes as possible.
For example, the full dataset may have 1000 rows, with 300 rows for class A, 300 for class B, and 400 for class C. We might split these into two subsets such that the two subsets have:
- left subset: 160 class A, 130 class B, 210 class C
- right subset: 140 class A, 170 class B, 190 class C
Here the proportions of the three classes are almost the same in the two child nodes as in the full dataset, so there is no (or almost no) information gain. This would be a poor choice of split.
On the other hand, if we split the data such that we have:
- left subset: 300 class A, 5 class B, 3 class C
- right subset: 0 class A, 295 class B, 397 class C
In this case, we have far more consistency in terms of the target in the two child nodes than in the full dataset. The left child has almost only class A, and the right child has only classes B and C. So, this is a very good split, with high information gain.
The best split would then be chosen: possibly the second example here, or, if possible, a split resulting in even higher information gain (with even more consistency in the target classes in the two child nodes).
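To make this concrete, the short snippet below (purely illustrative, not part of GeneticDecisionTree) computes the entropy-based information gain for the two candidate splits above:
import numpy as np

def entropy(counts):
    # Shannon entropy of a vector of class counts
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    n = sum(parent)
    child_entropy = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

parent = [300, 300, 400]
print(information_gain(parent, [160, 130, 210], [140, 170, 190]))  # close to 0: a poor split
print(information_gain(parent, [300, 5, 3], [0, 295, 397]))        # about 0.83: a good split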
The process is then repeated in each of these child nodes. In the figure above, we see the left child node is then split on feature B > 20.1 and then its left child node is split on feature F > 93.3.
This is generally a reasonable approach to constructing trees, but in no way guarantees finding the best tree that is possible. Each decision is made in isolation, considering only the data covered by that node and not the tree as a whole.
Further, with standard decision trees, the selection of the feature and threshold at each node is a one-time decision (that is, it is a greedy algorithm): decision trees are limited to the choices made for split points once these splits are chosen. While the trees can (at lower levels) compensate for poor modeling choices higher in the tree, this will usually result in extra nodes, or in splits that are harder to understand, so will reduce interpretability, and may not fully mitigate the effects of the choices of split points above.
Though the greedy approach used by Decision Trees is often quite sub-optimal, it does allow trees to be constructed very quickly. Historically, this was more important given lower-powered computer systems (evaluating every possible split point in every feature at each node is actually a substantial amount of work, even if very fast on modern hardware). And, in a modern context, the speed allowed by a greedy algorithm can also be very useful, as it allows quickly constructing many trees in models based on large ensembles of decision trees.
However, to create a single decision tree that is both accurate and interpretable (of a reasonably small size), using a greedy algorithm is very limiting. It is often possible to construct a decision tree of a limited size that achieves both a good level of accuracy and a substantially higher level of accuracy than would be found with a greedy approach.
Before looking at decision trees specifically, we will quickly go over genetic algorithms in general. They are used broadly in computer science and are often very effective at developing solutions to problems. They work by generating many potential solutions to a given problem and finding the best through trial and error, though in a guided, efficient way, simulating real-world genetic processes.
Genetic algorithms typically proceed by starting with a number of candidate solutions to a problem (usually created randomly), then iterating repeatedly, with each round selecting the strongest candidates, removing the others, and creating a new set of candidate solutions based on the best existing solutions found so far. This may be done either by mutating (randomly modifying) an existing solution or by combining two or more into a new solution, simulating reproduction as seen in real-world evolutionary processes.
In this way, over time, a set of progressively stronger candidates tends to emerge. Not every new solution created is stronger than the previously-created solutions, but at each step, some fraction likely will be, even if only slightly.
During this process, it is also possible to regularly generate completely new random solutions. Although these will not have had the benefit of being mutations or combinations of strong solutions (which were also initially created randomly), they may nevertheless, by chance, be as strong as some more evolved solutions. Though this is increasingly less likely, as the candidates developed through the genetic process (and selected as among the best solutions so far) become increasingly evolved and well-fit to the problem.
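As a rough sketch, the general structure of a genetic algorithm can be expressed in a few lines of Python. The functions random_candidate, mutate, combine, and score here are placeholders standing in for whatever is appropriate to the problem at hand:
import random

def genetic_search(random_candidate, mutate, combine, score,
                   population_size=50, n_keep=10, n_iterations=5):
    # Start with a randomly-created population of candidate solutions
    population = [random_candidate() for _ in range(population_size)]
    for _ in range(n_iterations):
        # Keep only the strongest candidates found so far
        population.sort(key=score, reverse=True)
        best = population[:n_keep]
        # Create new candidates by mutating and combining the best ones,
        # along with a few completely new random candidates
        children = [mutate(random.choice(best)) for _ in range(population_size)]
        children += [combine(*random.sample(best, 2)) for _ in range(population_size)]
        children += [random_candidate() for _ in range(n_keep)]
        population = best + children
    return max(population, key=score)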
Applied to the construction of decision trees, genetic algorithms create a set of candidate decision trees, select the best of these, and mutate and combine these (with some new trees possibly doing both: deriving new offspring from multiple existing trees and mutating those offspring at the same time). These steps may be repeated any number of times.
Each time a new tree is generated from one or more existing trees, the new tree will be quite similar to the previous tree(s), but slightly different. Usually most internal nodes will be the same, but one (or a small number of) internal nodes will be modified: changing either the feature and threshold, or just the threshold. Modifications may also include adding, removing, or rearranging the existing internal nodes. The predictions in the leaf nodes must also be re-calculated whenever internal nodes are modified.
This process can be slow, requiring many iterations before substantial improvements in accuracy are seen, but in the case covered in this article (creating interpretable decision trees), we can assume all decision trees are reasonably small (by necessity for interpretability), likely with a maximum depth of about 2 to 5. This allows progress to be made substantially faster than when we attempt to evolve large decision trees.
There have been, over time, a number of proposals for genetic algorithms for decision trees. The solution covered in this article has the advantage of providing python code on github, but is far from the first, and many other solutions may work better for your projects. There are several other projects on github that apply genetic algorithms to constructing decision trees, which may be worth investigating as well. But the solution presented here is simple and effective, and worth looking at where interpretable ML is useful.
Besides genetic algorithms, other work seeking to make Decision Trees more accurate and interpretable (accurate at a constrained size) includes Optimal Sparse Decision Trees, oblique decision trees, oblivious decision trees, and AdditiveDecisionTrees. The last of these I have covered in another Medium article, and I will hopefully cover the others in subsequent articles.
As well, there is a body of work related to creating interpretable rules, including imodels and PRISM-Rules. While rules are not quite equivalent to decision trees, they may often be used in a similar way and offer similar levels of accuracy and interpretability. And trees can always be trivially converted to rules.
Some tools such as autofeat, ArithmeticFeatures, FormulaFeatures, and RotationFeatures may also be combined with standard or GeneticDecisionTrees to create models that are more accurate still. These take the approach of creating more powerful features so that fewer nodes within a tree are necessary to achieve a high level of accuracy: there is some loss in interpretability as the features are more complex, but the trees are often substantially smaller, resulting in an overall gain (sometimes a very large gain) in interpretability.
Decision Trees can be fairly sensitive to the data used for training. They are notoriously unstable, often resulting in quite different internal representations with even small changes in the training data. This may not affect their accuracy significantly, but it can make it questionable how well they capture the true function between the features and the target.
The tendency to high variance (variability based on small changes in the training data) also often leads to overfitting. But with the GeneticDecisionTree, we take advantage of this instability to generate random candidate models.
Under the hood, GeneticDecisionTree generates a set of scikit-learn decision trees, which are then converted into another data structure used internally by GeneticDecisionTrees (which makes the subsequent mutation and combination operations simpler). To create these scikit-learn decision trees, we simply fit them using different bootstrap samples of the original training data (along with varying the random seeds used).
We also vary the sizes of the samples, allowing for further diversity. The sample sizes are based on a logarithmic distribution, so we are effectively selecting a random order of magnitude for the sample size. Given this, smaller sizes are more common than larger, but occasionally larger sizes are used as well. This is limited to a minimum of 128 rows and a maximum of two times the full training set size. For example, if the dataset has 100,000 rows, we allow sample sizes between 128 and 200,000, uniformly sampling a random value between log(128) and log(200,000), then taking the exponential of this random value as the sample size.
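As a concrete sketch (the details of the actual implementation may differ), a single tree of this kind could be generated as follows: draw a sample size on a log scale between 128 and twice the training set size, take a bootstrap sample of that size, and fit a shallow scikit-learn tree to it:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_bootstrap_tree(X_train, y_train, max_depth=3, random_state=None):
    # Illustrative only: generates one shallow tree from a bootstrap sample
    # whose size is drawn log-uniformly between 128 and 2x the training size.
    rng = np.random.default_rng(random_state)
    X_arr, y_arr = np.asarray(X_train), np.asarray(y_train)
    n_rows = len(X_arr)
    low, high = np.log(128), np.log(2 * n_rows)
    sample_size = int(np.exp(rng.uniform(low, high)))
    idx = rng.integers(0, n_rows, size=sample_size)  # sample rows with replacement
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    dt.fit(X_arr[idx], y_arr[idx])
    return dt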
The algorithm starts by creating a small set of decision trees generated in this way. It then iterates a specified number of times (five by default). Each iteration:
- It randomly mutates the top-scored trees created so far (those best fit to the training data). In the first iteration, this uses the full set of trees created prior to iterating. From each top-performing tree, a number of mutations are created.
- It combines pairs of the top-scored trees created so far. This is done in an exhaustive manner over all pairs of the top-performing trees that can be combined (details below).
- It generates additional random trees using scikit-learn and random bootstrap samples (fewer of these are generated each iteration, as it becomes harder to compete with the models that have undergone mutating and/or combining).
- It selects the top-performing trees before looping back for the next iteration. The others are discarded.
Each iteration, a significant number of new trees are generated. Each is evaluated on the training data to determine the strongest of these, so that the next iteration starts with only a small number of well-performing trees and each iteration tends to improve on the previous.
In the end, after executing the specified number of iterations, the single top-performing tree is selected and used for prediction.
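At a high level, the fitting loop just described can be sketched as follows. The helper functions here (generate_random_trees, mutate_tree, combine_trees, score_tree) are hypothetical stand-ins for GeneticDecisionTree's internal operations, not its actual API:
def fit_genetic_tree(X, y, generate_random_trees, mutate_tree, combine_trees,
                     score_tree, n_iterations=5, n_top=20):
    # Start with a set of trees fit on random bootstrap samples
    trees = generate_random_trees(X, y, n_trees=50)
    for iteration in range(n_iterations):
        # Keep only the best-fitting trees found so far
        top_trees = sorted(trees, key=lambda t: score_tree(t, X, y), reverse=True)[:n_top]
        candidates = list(top_trees)
        # Mutate each top tree several times
        for tree in top_trees:
            candidates += [mutate_tree(tree) for _ in range(10)]
        # Attempt to combine every pair of top trees
        for i, tree_1 in enumerate(top_trees):
            for tree_2 in top_trees[i + 1:]:
                candidates += combine_trees(tree_1, tree_2)
        # Add some fresh random trees (fewer in later iterations)
        candidates += generate_random_trees(X, y, n_trees=max(5, 50 - 10 * iteration))
        trees = candidates
    # The single best-fitting tree is used for prediction
    return max(trees, key=lambda t: score_tree(t, X, y))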
As indicated, standard decision trees are constructed in a purely greedy manner, considering only the information gain for each possible split at each internal node. With Genetic Decision Trees, on the other hand, the construction of each new tree may be partially or entirely random (the construction done by scikit-learn is largely non-random, but is based on random samples; the mutations are purely random; the combinations are purely deterministic). But the important decisions made during fitting (selecting the best models generated so far) relate to the fit of the tree as a whole to the available training data. This tends to generate trees that fit the training data better than a greedy approach allows.
Despite the utility of the genetic process, an interesting finding is that even without performing mutations or combinations (with each iteration simply generating random decision trees), GeneticDecisionTrees tend to be more accurate than standard decision trees of the same (small) size.
The mutate and combine operations are configurable and may be set to False to allow faster execution times; in this case, we simply generate a set of random decision trees and select the one that best fits the training data.
This is as we would expect: simply by trying many sets of choices for the internal nodes in a decision tree, some will perform better than the single tree constructed in the normal greedy fashion.
This is, though, a very interesting finding, and also very practical. It means: even without the genetic processes, simply by trying many potential small decision trees to fit a training set, we can almost always find one that fits the data better than a small decision tree of the same size grown in a greedy manner. Often substantially better. This can, in fact, be a more practical approach to constructing near-optimal decision trees than specifically seeking to create the optimal tree, at least for the small sizes of trees appropriate for interpretable models.
Where mutations and combinations are enabled, though, usually after one or two iterations, the majority of the top-scored candidate decision trees (the trees that fit the training data the best) will be based on mutating and/or combining other strong models. That is, enabling mutating and combining does tend to generate stronger models.
Assuming we create a decision tree of a limited size, there is a limit to how strong the model can be: there is some tree (though in practice it may never actually be found) that best fits the training data. For example, with seven internal nodes (a root, two child nodes, and four grandchild nodes), there are only seven decisions to be made in fitting the tree: the feature and threshold used in each of these seven internal nodes.
Although a standard decision tree is not likely to find this ideal set of seven internal nodes, a random process (especially if accompanied by random mutations and combinations) can approach this ideal fairly quickly. Though still unlikely to reach the ideal set of internal nodes, it can come close.
An alternative method to create a near-optimal decision tree is to create and test trees using every possible set of features and thresholds: an exhaustive search of the possible small trees.
With even a very small tree (for example, seven internal nodes), however, this is intractable. With, for example, ten features, there are 10⁷ choices just for the features in each node (assuming features can appear any number of times in the tree). There are, then, an effectively unlimited number of choices for the thresholds for each node.
It would be possible to select the thresholds using information gain (at each node, holding the feature constant and picking the threshold that maximizes information gain). With just ten features, this may be feasible, but the number of combinations for selecting the feature for each node still explodes quickly given more features. At 20 features, 20⁷ choices is over a billion.
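The arithmetic behind this is simple to verify:
n_internal_nodes = 7  # a root, two children, and four grandchildren
for n_features in (10, 20, 50):
    print(f"{n_features} features: {n_features ** n_internal_nodes:,} ways to assign features")
# 10 features: 10,000,000 ways to assign features
# 20 features: 1,280,000,000 ways to assign features
# 50 features: 781,250,000,000 ways to assign features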
Using some randomness and a genetic process to some degree can improve this, but a fully exhaustive search is, in almost all cases, infeasible.
The algorithm presented here is far from exhaustive, but does result in an accurate decision tree even at a small size.
The gain in accuracy, though, does come at the cost of time, and this implementation has had only moderate performance optimizations (it does allow internally executing operations in parallel, for example) and is far slower than standard scikit-learn decision trees, particularly when executing over many iterations.
However, it is reasonably efficient, and testing has found that using just 3 to 5 iterations is usually sufficient to realize substantial improvements for classification compared to scikit-learn decision trees. For most practical applications, the performance is quite reasonable.
For most datasets, fitting is still only about 1 to 5 minutes, depending on the size of the data (both the number of rows and number of columns are relevant) and the parameters specified. This is quite slow compared to training standard decision trees, which often takes under a second. Nevertheless, a few minutes can often be well-warranted to generate an interpretable model, particularly as creating an accurate, interpretable model can often be quite difficult.
Where desired, limiting the number of iterations to just 1 or 2 can reduce the training time and can often still achieve strong results. As would likely be expected, there are diminishing returns in using additional iterations, and some increase in the chance of overfitting. Using the verbose setting, it is possible to see the progress of the fitting process and determine when the gains appear to have plateaued.
Disabling mutations and/or combinations, though, is the most significant means of reducing execution time. Mutations and combinations allow the tool to generate variations on existing strong trees, and are often quite useful (they produce trees different from those likely to be produced by scikit-learn), but are slower processes than simply generating random trees based on bootstrap samples of the training data: a large fraction of mutations are of low accuracy (even though a small fraction can be of higher accuracy than can be found otherwise), while trees produced from random samples will all be at least viable trees.
That is, with mutations, we may have to produce and evaluate a large number before very strong ones emerge. This is less true of combinations, though, which are quite often stronger than either original tree.
As suggested, it may be reasonable in some cases to disable mutations and combinations and instead generate only a series of random trees based on random bootstrap samples. This approach could not be considered a genetic algorithm; it simply produces a large number of small decision trees and selects the best-performing of these. Where sufficient accuracy can be achieved in this way, though, this may be all that is necessary, and it allows faster training times.
It is also possible to start with this as a baseline and then test whether additional improvements can be found by enabling mutations and/or combinations. Where these are used, the model should be set to execute at least a few iterations, to give it a chance to progressively improve over the randomly-produced trees.
We should also highlight here the similarity of this approach (creating many similar but random trees, not using any genetic process) to creating a RandomForest. RandomForests are also based on a set of decision trees, each trained on a random bootstrap sample. However, RandomForests will use all decision trees created and will combine their predictions, while GeneticDecisionTree retains only the single strongest of these decision trees.
We will now describe in more detail how the mutating and combining processes work with GeneticDecisionTree.
The mutating process currently supported by GeneticDecisionTree is quite simple. It allows only modifying the thresholds used by internal nodes, keeping the features used in all nodes the same.
During mutation, a well-performing tree is selected and a new copy of that tree is created, which will be the same apart from the threshold used in one internal node. The internal node to be modified is selected randomly. The higher in the tree it is, and the more different the new threshold is from the previous threshold, the more effectively different from the original tree the new tree will be.
This is surprisingly effective and can often substantially change the training data that ends up in the two child nodes below the modified node (and consequently the two sub-trees below it).
Prior to mutation, the trees each start with the thresholds assigned by scikit-learn, selected based purely on information gain (not considering the tree as a whole). Even keeping the remainder of the tree the same, modifying these thresholds can effectively induce quite different trees, which often perform better. Though the majority of mutated trees do not improve over the original tree, an improvement can usually be identified by trying a moderate number of variations on each tree.
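A sketch of this mutation step is shown below. It assumes a simple nested-dict tree representation and draws the new threshold from the observed values of the feature, which is only one plausible choice; GeneticDecisionTree's internal structures and its method of selecting new thresholds may differ:
import copy
import random

def mutate_threshold(tree, X_train):
    # Copy a tree and change the threshold of one randomly-chosen internal node.
    # X_train is assumed to be a pandas DataFrame keyed by feature name.
    new_tree = copy.deepcopy(tree)
    internal_nodes, stack = [], [new_tree]
    while stack:
        node = stack.pop()
        if "feature" in node:  # leaf nodes have no "feature" key
            internal_nodes.append(node)
            stack.extend([node["left"], node["right"]])
    node = random.choice(internal_nodes)
    # Draw a new threshold from the observed values of that feature
    node["threshold"] = float(random.choice(X_train[node["feature"]].values))
    return new_tree  # leaf predictions would then be re-calculated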
Future versions may also allow rotating nodes within the tree, but testing to date has found this not as effective as simply modifying the threshold of a single internal node. However, more research will be done on other mutations that may prove effective and efficient.
The other form of modification currently supported is combining two well-performing decision trees. To do this, we take the top twenty trees found during the previous iteration and attempt to combine each pair of these. A combination is possible if the two trees use the same feature in their root nodes.
For example, assume Tree 1 and Tree 2 (the two trees in the top row of the figure below) are among the top-performing trees found so far.
The figure shows four trees in all: Tree 1, Tree 2, and the two trees created from them. The internal nodes are shown as circles and the leaf nodes as squares.
Tree 1 has a split in its root node on Feature A > 10.4 and Tree 2 has a split in its root node on Feature A > 10.8. We can, then, combine the two trees, as both use Feature A in their root nodes.
We then create two new trees. In both new trees, the split in the root node is taken as the average of that in the two original trees, so in this example, both new trees (shown in the bottom row of the figure) will have Feature A > 10.6 in their root nodes.
The first new tree will have Tree 1's left sub-tree (the left sub-tree under Tree 1's root node, drawn in blue) and Tree 2's right sub-tree (drawn in pink). The other new tree will have Tree 2's left sub-tree (purple) and Tree 1's right sub-tree (green).
In this example, Tree 1 and Tree 2 each have only 3 levels of internal nodes. In other examples, the subtrees may be somewhat larger, but in that case, likely only one or two additional layers deep. The idea is the same regardless of the sizes or shapes of the subtrees.
Combining in this way effectively takes, apart from the root, half of one tree and half of another, with the idea that:
- If both trees are strong, then the common choice of feature in the root node is likely (though not necessarily) strong as well. Further, a split point between those chosen by both may be preferable. In the above example we used 10.6, which is halfway between the 10.4 and 10.8 used by the parent trees.
- While both trees are strong, neither may be optimal. The difference, if there is one, is in the two subtrees. It could be that Tree 1 has both the stronger left sub-tree and the stronger right sub-tree, in which case it is not possible to beat Tree 1 by combining it with Tree 2. Similarly if Tree 2 has both the stronger left and right sub-trees. But if Tree 1 has the stronger left sub-tree and Tree 2 the stronger right sub-tree, then creating a new tree to take advantage of this will produce a tree stronger than either Tree 1 or Tree 2. Similarly for the converse.
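The combination step can be sketched with the same hypothetical nested-dict representation used in the mutation sketch above:
import copy

def combine_trees(tree_1, tree_2):
    # Combine two trees that split on the same feature in their root nodes.
    if tree_1["feature"] != tree_2["feature"]:
        return []  # the trees cannot be combined
    # The new root threshold is the average of the two original thresholds
    new_threshold = (tree_1["threshold"] + tree_2["threshold"]) / 2
    # New tree A: Tree 1's left subtree with Tree 2's right subtree
    new_tree_a = {"feature": tree_1["feature"], "threshold": new_threshold,
                  "left": copy.deepcopy(tree_1["left"]),
                  "right": copy.deepcopy(tree_2["right"])}
    # New tree B: Tree 2's left subtree with Tree 1's right subtree
    new_tree_b = {"feature": tree_1["feature"], "threshold": new_threshold,
                  "left": copy.deepcopy(tree_2["left"]),
                  "right": copy.deepcopy(tree_1["right"])}
    return [new_tree_a, new_tree_b]  # leaf predictions would then be re-calculated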
There are other ways we could conceivably combine two trees, and other tools for generating decision trees through genetic algorithms use other methods to combine trees. But simply taking a subtree from one tree and another subtree from another tree is a very straightforward and appealing approach.
Future versions will allow combining using nodes other than the root, though the effects are smaller in these cases: we are then keeping the bulk of one tree and replacing a smaller portion from the other tree, producing a new tree less distinct from the original. This is, though, still a useful form of combination and will be supported in the future.
Decision Trees commonly overfit, and GeneticDecisionTrees may as well. Like most models, GeneticDecisionTree attempts to fit the training data as well as possible, which may cause it to generalize poorly compared to other decision trees of the same size.
However, overfitting is limited, as the tree sizes are generally quite small and the trees cannot grow beyond the specified maximum depth. Each candidate decision tree produced will have equal complexity (or nearly equal: some paths may not extend to the full maximum depth allowed, so some trees may be slightly smaller than others), so all are roughly equally likely to overfit.
As with any model, though, it is recommended to tune GeneticDecisionTrees to find the model that appears to work best with your data.
GeneticDecisionTrees support both classification and regression, but are much more appropriate for classification. In general, regression functions are very difficult to model with shallow decision trees, as it is necessary to predict a continuous numeric value and each leaf node predicts only a single value.
For example, a tree with eight leaf nodes can predict only eight unique values. This is often quite sufficient for classification problems (assuming the number of distinct target classes is under eight) but can produce only very approximate predictions with regression. With regression problems, even with simple functions, very deep trees are generally necessary to produce accurate results. Going deeper into the trees, they are able to fine-tune the predictions more and more precisely.
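This limitation is easy to demonstrate with a standard scikit-learn regressor (this example uses DecisionTreeRegressor, not GeneticDecisionTree):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 10, 1000).reshape(-1, 1)
y = np.sin(X).ravel()  # a simple, smooth target function

dt = DecisionTreeRegressor(max_depth=3)
dt.fit(X, y)
# A depth-3 tree has at most 8 leaves, so at most 8 distinct predicted values
print(len(np.unique(dt.predict(X))))  # prints 8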
Using a small tree for regression is viable only where the data has only a small number of distinct values in the target column, or where the values fall into a small number of clusters, with the range of each being fairly small.
GeneticDecisionTrees can work with the maximum depth set to a very high level, allowing accurate models, often substantially more accurate than standard decision trees, but the trees will not, then, be interpretable. And the accuracy, while often strong, will still likely not be competitive with strong models such as XGBoost, LGBM, or CatBoost. Given this, using GeneticDecisionTrees for regression (or any attempt to create accurate shallow decision trees for regression) is often infeasible.
The github page for GeneticDecisionTrees is: https://github.com/Brett-Kennedy/GeneticDecisionTree
To install, you can simply download the single genetic_decision_tree.py file and import it into your projects.
The github page also includes some example notebooks, but it should be sufficient to go through the Simple Examples notebook to see how to use the tool and some examples of the APIs. The github page also documents the APIs, but these are relatively simple, providing a similar, though smaller, signature than scikit-learn's DecisionTreeClassifier.
The following example is taken from the Simple_Examples notebook provided on the github page. This loads a dataset, does a train-test split, fits a GeneticDecisionTree, creates predictions, and outputs the accuracy, here using the F1 macro score.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.datasets import load_wine
from genetic_decision_tree import GeneticDecisionTree

data = load_wine()
df = pd.DataFrame(data.data)
df.columns = data.feature_names
y_true = data.target
X_train, X_test, y_train, y_test = train_test_split(df, y_true, test_size=0.3, random_state=42)
gdt = GeneticDecisionTree(max_depth=2, max_iterations=5, allow_mutate=True, allow_combine=True, verbose=True)
gdt.fit(X_train, y_train)
y_pred = gdt.predict(X_test)
print("Genetic DT:", f1_score(y_test, y_pred, average='macro'))
GeneticDecisionTree is a single class used for both classification and regression. It infers the data type from the target data and handles the distinction between regression and classification internally. As indicated, it is much better suited to classification, but is straightforward to use for regression where desired as well.
Similar to scikit-learn's decision tree, GeneticDecisionTree provides an export_tree() API. Used with the wine dataset, with a depth of 2, GeneticDecisionTree was able to achieve an F1 macro score of 0.97 on a hold-out test set, compared to 0.88 for the scikit-learn decision tree. The tree produced by GeneticDecisionTree is:
IF flavanoids < 1.4000
| IF color_intensity < 3.7250
| | 1
| ELSE color_intensity > 3.7250
| | 2
ELSE flavanoids > 1.4000
| IF proline < 724.5000
| | 1
| ELSE proline > 724.5000
| | 0
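This output was produced with the export_tree() API. Assuming the fitted gdt from the example above, and assuming export_tree() returns the tree as a string (the github page documents the exact API), it can be displayed with:
print(gdt.export_tree())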
The github page provides an extensive test of GeneticDecisionTrees. This tests with a number of test sets from OpenML and for each creates a standard (scikit-learn) Decision Tree and four GeneticDecisionTrees: each combination of allowing mutations and allowing combinations (neither, mutations only, combinations only, and both). In all cases, a max depth of 4 was used.
In almost all cases, at least one, and usually all four, variations of the GeneticDecisionTree strongly out-perform the standard decision tree. These tests used F1 macro scores to compare the models. A subset of the results is shown here:
In general, enabling either mutations or combinations, or both, improves over simply producing random decision trees.
Given the large number of cases tested, running this notebook is quite slow. It is also not a definitive evaluation: it uses only a limited set of test files, uses only default parameters other than max_depth, and tests only the F1 macro scores. It does, however, demonstrate that GeneticDecisionTrees can be effective and interpretable models in many cases.
There are a number of cases where it is preferable to use an interpretable model (or a black-box model along with an interpretable proxy model for explainability), and in these cases a shallow decision tree can often be among the best choices. However, standard decision trees can be generated in a sub-optimal way, which can result in lower accuracy, particularly for trees where we limit the size.
The simple process demonstrated here, of generating many decision trees based on random samples of the training data and identifying the tree that fits the training data best, can provide a significant advantage over this.
In fact, the largest finding was that generating a set of decision trees based on different random samples can be almost as effective as the genetic methods included here. This finding, though, may not continue to hold as strongly as further mutations and combinations are added to the codebase in future versions, or where large numbers of iterations are executed.
Beyond generating many trees, allowing a genetic process, where the training executes over several iterations, each time mutating and combining the best-performing trees discovered so far, can often further improve the results.
The techniques demonstrated here are easy to replicate and enhance to suit your needs. It is also possible to simply use the GeneticDecisionTree class provided on github.
Where it makes sense to use decision trees for a classification project, it likely also makes sense to try GeneticDecisionTrees. They will almost always work as well, and often substantially better, albeit with some increase in fitting time.
All images by the author