How to Set the Number of Trees in Random Forest


Scientific publication

T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95.

Follow this LINK to the original publication.

Random Forest: A Powerful Tool for Anyone Working With Data

What’s Random Forest?

Have you ever wished you could make better decisions using data, like predicting the risk of diseases, estimating crop yields, or spotting patterns in customer behaviour? That's where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.

So why is random forest so popular? For one, it's incredibly flexible. It works well with many types of data, whether numbers, categories, or both. It's also widely used in many fields: from predicting patient outcomes in healthcare to detecting fraud in finance, from improving online shopping experiences to optimising agricultural practices.

Despite the name, random forest has nothing to do with trees in a forest, but it does use something called decision trees to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the "forest"), each slightly different, and then combines their results to make one final decision. It's a bit like asking a group of experts for their opinion and then going with the majority vote.

But until recently, one question remained unanswered: How many decision trees do I really need? If each decision tree can lead to different results, averaging many trees should give better and more reliable results. But how many are enough? Luckily, the optRF package answers this question!

So let's have a look at how to optimise random forest for predictions and variable selection!

Making Predictions with Random Forests

To optimise and use random forest for making predictions, we can use the open-source statistics programme R. Once we open R, we have to install the two R packages "ranger", which allows us to use random forests in R, and "optRF", which optimises random forests. Both packages are open-source and available via the official R repository CRAN. In order to install and load these packages, the following lines of R code can be run:

> install.packages("ranger")
> install.packages("optRF")
> library(ranger)
> library(optRF)

Now that the packages are installed and loaded into the library, we can use the functions that these packages contain. Moreover, we can also use the data set included in the optRF package, which is free to use under the GPL license (just like the optRF package itself). This data set, called SNPdata, contains in the first column the yield of 250 wheat plants as well as 5,000 genomic markers (so-called single nucleotide polymorphisms or SNPs) that can contain either the value 0 or 2.

> SNPdata[1:5,1:5]
            Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
  ID_001 670.7588        0        0        0        0
  ID_002 542.5611        0        2        0        0
  ID_003 591.6631        2        2        0        2
  ID_004 476.3727        0        0        0        0
  ID_005 635.9814        2        2        0        2
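
Since the data set ships with the package, a quick sanity check of its dimensions (a minimal check, not part of the original workflow) confirms the description above:

> dim(SNPdata)   # should report 250 rows and 5001 columns (yield + 5000 SNPs)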

This data set is an example of genomic data and can be used for genomic prediction, which is an important tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of crops using genomic markers. And exactly for this purpose, random forest can be used! That means that a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants for which we only have genomic markers.

Therefore, let's imagine that we have 200 wheat plants for which we know both the yield and the genomic markers. This is the so-called training data set. Let's further assume that we have 50 wheat plants for which we know the genomic markers but not their yield. This is the so-called test data set. Thus, we separate the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:

> Training = SNPdata[1:200,]
> Test = SNPdata[201:250,-1]

With these data sets, we can now have a look at how to make predictions using random forests!

First, we have to calculate the optimal number of trees for random forest. Since we intend to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility, even though this is not necessary (we will see later why reproducibility is an issue here):

> set.seed(123)
> optRF_result = opt_prediction(y = Training[,1], 
+                               X = Training[,-1], 
+                               X_Test = Test)
  Recommended number of trees: 19000

All the results from the opt_prediction function are now saved in the object optRF_result; however, the most important information was already printed in the console: For this data set, we should use 19,000 trees.
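
If you want to see what else the result object contains (the exact element names depend on the optRF version, so this is just a generic inspection), str() gives a quick overview:

> str(optRF_result, max.level = 1)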

With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Also here, we have to insert the response in the y argument and the predictors in the x argument. Moreover, we can set the write.forest argument to TRUE and we can insert the optimal number of trees in the num.trees argument:

> RF_model = ranger(y = Training[,1], x = Training[,-1], 
+                   write.forest = TRUE, num.trees = 19000)

And that's it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set for which we have the genomic markers but don't know the yield:

> predictions = predict(RF_model, data=Test)$predictions
> predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)

The data frame predicted_Test now contains the IDs of the wheat plants along with their predicted yield:

> head(predicted_Test)
      ID predicted_yield
  ID_201        593.6063
  ID_202        596.8615
  ID_203        591.3695
  ID_204        589.3909
  ID_205        599.5155
  ID_206        608.1031
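
Since the full SNPdata set actually contains the observed yield for these 50 plants (we only removed it when creating the test data), we can optionally check how well the predictions match the observations, for example via the correlation between predicted and observed yield (a quick ad-hoc check, not part of the original workflow):

> observed_yield = SNPdata[201:250, 1]
> cor(observed_yield, predicted_Test$predicted_yield)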

Variable Selection with Random Forests

A different way of analysing such a data set would be to find out which variables are most important for predicting the response. In this case, the question would be which genomic markers are most important for predicting the yield. This, too, can be done with random forests!

If we tackle such a task, we don't need a training and a test data set. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are interested in calculating the variable importance, we use the function opt_importance:

> set.seed(123)
> optRF_result = opt_importance(y=SNPdata[,1], 
+                               X=SNPdata[,-1])
  Recommended number of trees: 40000

One can see that the optimal number of trees is now higher than it was for predictions. This is actually often the case. Nevertheless, with this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before, but we change the number of trees in the num.trees argument to 40,000 and we set the importance argument to "permutation" (other options are "impurity" and "impurity_corrected").

> set.seed(123) 
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
+                   write.forest = TRUE, num.trees = 40000,
+                   importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1], 
+                   importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]

The data frame D_VI now contains all the variables, thus all the genomic markers, and next to them their importance. Also, we have directly ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. This means that we can have a look at the most important variables using the head function:

> head(D_VI)
  variable importance
  SNP_0020   45.75302
  SNP_0004   38.65594
  SNP_0019   36.81254
  SNP_0050   34.56292
  SNP_0033   30.47347
  SNP_0043   28.54312
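
For a quick visual impression of these scores, a simple barplot of the top markers can help (purely optional, base R graphics):

> top10 = head(D_VI, 10)
> barplot(top10$importance, names.arg = top10$variable,
+         las = 2, ylab = "Permutation importance")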

And that's it! We have used random forest to make predictions and to estimate the most important variables in a data set. Moreover, we have optimised random forest using the optRF package!

Why Do We Need Optimisation?

Now that we've seen how easy it is to use random forest and how quickly it can be optimised, it's time to take a closer look at what's happening behind the scenes. Specifically, we'll explore how random forest works and why the results might change from one run to another.

To do that, we'll use random forest to calculate the importance of each genomic marker, but instead of optimising the number of trees beforehand, we'll stick with the default settings in the ranger function. By default, ranger uses 500 decision trees. Let's try it out:

> set.seed(123) 
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
+                   write.forest = TRUE, importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1], 
+                   importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
> head(D_VI)
  variable importance
  SNP_0020   80.22909
  SNP_0019   60.37387
  SNP_0043   50.52367
  SNP_0005   43.47999
  SNP_0034   38.52494
  SNP_0015   34.88654

As expected, everything runs smoothly, and quickly! In fact, this run was significantly faster than when we previously used 40,000 trees. But what happens if we run the exact same code again, this time with a different seed?

> set.seed(321) 
> RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1], 
+                    write.forest = TRUE, importance="permutation")
> D_VI2 = data.frame(variable = names(SNPdata)[-1], 
+                    importance = RF_model2$variable.importance)
> D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]
> head(D_VI2)
  variable importance
  SNP_0050   60.64051
  SNP_0043   58.59175
  SNP_0033   52.15701
  SNP_0020   51.10561
  SNP_0015   34.86162
  SNP_0019   34.21317

Once more, everything appears to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That's a big shift! So what changed?

The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that results can vary slightly every time you run the algorithm, even with the exact same data set. That's where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards. By setting the same seed, you ensure that the random decisions made by the algorithm follow the same sequence each time you run the code. But when you change the seed, you're effectively changing the random path the algorithm follows. That's why, in our example, the most important genomic markers came out differently in each run. This behaviour, where the same process can yield different results due to internal randomness, is a classic example of non-determinism in machine learning.
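
You can see this seed mechanism in isolation, completely independent of random forest, with a few lines of base R (a minimal illustration, not part of the original analysis):

> set.seed(42); runif(3)   # three random numbers
> set.seed(42); runif(3)   # the same seed reproduces exactly the same numbers
> set.seed(7);  runif(3)   # a different seed gives different numbers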

As we just saw, random forest models can produce slightly different results each time you run them, even when using the same data, due to the algorithm's built-in randomness. So, how can we reduce this randomness and make our results more stable?

One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can "average out" the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000: you're more likely to get a reliable answer from the larger group.
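
The same averaging effect can be shown with a tiny simulation (a loose analogy only, treating each tree's contribution as a noisy random draw): the mean of many draws fluctuates far less between runs than the mean of a few draws.

> sd(replicate(100, mean(rnorm(10))))     # means of 10 draws vary a lot between runs
> sd(replicate(100, mean(rnorm(10000))))  # means of 10,000 draws are much more stable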

With more trees, the model's predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there's a catch. More trees also mean more computation time. Training a random forest with 500 trees might take just a few seconds, but training one with 40,000 trees could take several minutes or more, depending on the size of your data set and your computer's performance.
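
If you want to see this cost on your own machine, you can time both model fits directly (runtimes will of course vary with your hardware):

> system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = 500))
> system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1], num.trees = 40000))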

However, the relationship between the stability and the computation time of random forest is not linear. While going from 500 to 1,000 trees can significantly improve stability, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees gives diminishing returns: you pay more in computation time but gain very little in stability. That's why it's important to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.

And this is exactly what the optRF package does: it analyses the relationship between the stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees, the point that leads to stable results and beyond which adding more trees would unnecessarily increase the computation time.

Above, we have already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees, but it also contains information about the relationship between the stability and the number of trees. Using the plot_stability function, we can visualise this relationship. Therefore, we have to insert the name of the optRF object, the measure we are interested in (here, the "importance"), the interval we want to visualise on the X axis, and whether the recommended number of trees should be added:

> plot_stability(optRF_result, measure="importance", 
+                from=0, to=50000, add_recommendation=FALSE)
Figure: The output of the plot_stability function visualises the stability of random forest depending on the number of decision trees.

This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is close to 1 (which indicates perfect stability). Adding more than 40,000 trees would move the stability even closer to 1, but this increase would be only very small while the computation time would further increase. That is why 40,000 trees indicate the optimal number of trees for this data set.
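
If you would like the recommendation to be marked directly in the plot, you can redraw it with add_recommendation = TRUE, the argument we switched off above:

> plot_stability(optRF_result, measure="importance", 
+                from=0, to=50000, add_recommendation=TRUE)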

The Takeaway: Optimise Random Forest to Get the Most Out of It

Random forest is a powerful ally for anyone working with data, whether you're a researcher, analyst, student, or data scientist. It's easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what's happening under the hood. In this post, we've uncovered one of its hidden quirks: the randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you're working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.
