Feature selection remains one of the most critical yet computationally expensive steps in the machine learning pipeline. When working with high-dimensional datasets, identifying which features truly contribute to predictive power can mean the difference between an interpretable, efficient model and an overfit, sluggish one.
In this article, I present the Greedy Boruta algorithm, a modification of the Boruta algorithm [1] that, in our tests, reduces computation time by 5-40× while provably maintaining or improving recall. Through theoretical analysis and simulation experiments, I show how a simple relaxation of the confirmation criterion provides guaranteed convergence in K = ⌈log₂(1/α)⌉ iterations, where α is the significance level of the binomial tests, in contrast with the vanilla algorithm's unbounded runtime.
The Boruta algorithm has long been a favourite among data scientists for its "all-relevant" approach to feature selection and its statistical framework. Unlike minimal-optimal methods, such as Minimum Redundancy Maximum Relevance (mRMR) and recursive feature elimination (RFE), which seek the smallest feature set sufficient for prediction, Boruta aims to discover all features that carry useful information. This philosophical difference matters tremendously when the goal is understanding a phenomenon rather than merely making predictions.
However, Boruta's thoroughness comes at a high computational cost. In real-world applications with hundreds or thousands of features, the algorithm can take prohibitively long to converge. That is where the Greedy Boruta algorithm comes in.
Understanding the vanilla Boruta algorithm
Before examining the modification, let’s recap how the vanilla Boruta algorithm works.
Boruta's genius lies in its elegant approach to determining feature importance. Rather than relying on arbitrary thresholds or p-values taken directly from a model, it creates a competitive benchmark using shadow features.
Here’s the method:
- Shadow feature creation: For every feature in the dataset, Boruta creates a "shadow" copy by randomly shuffling its values. This destroys any relationship the original feature had with the response (or target) variable while preserving its distribution.
- Importance computation: A Random Forest is trained on the combined dataset and feature importances are calculated for all features. Although Boruta was originally proposed for Random Forest estimators, the algorithm works with any other tree-based ensemble that provides feature importance scores (e.g., Extra Trees [2], XGBoost [3], LightGBM [4]).
- Hit registration: For every non-shadow feature, Boruta checks whether its importance is greater than the maximum importance among the shadows. Non-shadow features that are more important than the maximum shadow are assigned a "hit"; those that are less important are assigned a "no-hit".
- Statistical testing: Based on the lists of hits and no-hits for each non-shadow feature, Boruta performs a binomial test to determine whether its importance is significantly greater than the maximum importance among shadow features across multiple iterations.
- Decision making: Features that consistently outperform the best shadow feature are marked as "confirmed." Features that consistently underperform are "rejected." Features in the middle (those that are not statistically significantly different from the best shadow) remain "tentative".
- Iteration: Steps 2–5 repeat until all features are classified as confirmed or rejected. In this article, I say that the Boruta algorithm "has converged" when all features are either confirmed or rejected, or when a maximum number of iterations has been reached.
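The loop above can be sketched in a few lines of Python. This is a minimal illustration of one iteration (shadow creation, importance computation and hit registration) using Extra Trees; the function name and structure are my own, not the BorutaPy implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

def boruta_iteration(X, y, rng, n_estimators=100):
    """One Boruta round: create shadows, fit a forest, register hits."""
    n_features = X.shape[1]
    # Shadow features: column-wise permutations that keep each feature's
    # distribution but destroy its relationship with y.
    X_shadow = np.apply_along_axis(rng.permutation, 0, X)
    X_combined = np.hstack([X, X_shadow])

    forest = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(X_combined, y)

    importances = forest.feature_importances_
    real_importances = importances[:n_features]
    best_shadow = importances[n_features:].max()
    # A "hit" means a real feature beat the best shadow this round.
    return real_importances > best_shadow

# Toy usage on a small synthetic problem.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
hits = boruta_iteration(X, y, np.random.default_rng(0))
```

In the full algorithm, the boolean hit vector is accumulated across rounds and fed into the statistical tests described below.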
The binomial test: Boruta’s decision criterion
The vanilla Boruta uses a rigorous statistical framework. After multiple iterations, the algorithm performs a binomial test on the hits for each non-shadow feature:
- Null hypothesis: The feature is no better than the best shadow feature (50% chance of beating the shadows by random chance).
- Alternative hypothesis: The feature is better than the best shadow feature.
- Confirmation criterion: If the binomial test p-value is below α (typically between 0.05 and 0.01), the feature is confirmed.
The same process is also used to reject features:
- Null hypothesis: The feature is better than the best shadow (50% chance of not beating the shadows by random chance).
- Alternative hypothesis: The feature is no better than the best shadow feature.
- Rejection criterion: If the binomial test p-value is below α, the feature is rejected.
This approach is statistically sound and conservative; however, it requires many iterations to accumulate sufficient evidence, especially for features that are relevant but only marginally better than noise.
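Both one-sided tests can be reproduced with scipy's `binomtest`; the hit counts below are hypothetical numbers chosen for illustration:

```python
from scipy.stats import binomtest

n_iterations = 30
hits = 22  # hypothetical: rounds in which the feature beat the best shadow

# Confirmation test: under H0 the feature beats the best shadow with p = 0.5.
p_confirm = binomtest(hits, n=n_iterations, p=0.5, alternative="greater").pvalue

# Rejection test mirrors it on the other tail.
p_reject = binomtest(hits, n=n_iterations, p=0.5, alternative="less").pvalue

alpha = 0.05
if p_confirm < alpha:
    decision = "confirmed"
elif p_reject < alpha:
    decision = "rejected"
else:
    decision = "tentative"
# With 22 hits out of 30, the feature is confirmed at alpha = 0.05.
```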
The convergence problem
The vanilla Boruta algorithm faces two major convergence issues:
Long runtime: Because the binomial test requires many iterations to reach statistical significance, the algorithm might require hundreds of iterations to classify all features, especially when using small α values for high confidence. Moreover, there are no guarantees or estimates for convergence; that is, there is no way to determine how many iterations will be required for all features to be categorized as "confirmed" or "rejected".
Tentative features: Even after reaching the maximum number of iterations, some features may remain in the "tentative" category, leaving the analyst with incomplete information.
These challenges motivated the development of the Greedy Boruta algorithm.
The Greedy Boruta algorithm
The Greedy Boruta Algorithm introduces a fundamental change to the confirmation criterion that dramatically improves convergence speed while maintaining high recall.
The name comes from the algorithm's eager approach to confirmation. Like greedy algorithms that make locally optimal choices, Greedy Boruta immediately accepts any feature that shows promise, without waiting to accumulate statistical evidence. This trade-off favors speed and sensitivity over specificity.
Relaxed confirmation
Instead of requiring statistical significance through a binomial test, Greedy Boruta confirms any feature that has beaten the maximum shadow importance at least once across all iterations, while keeping the same rejection criterion.
The rationale behind this relaxation is that in "all-relevant" feature selection, as the name suggests, we typically prioritize retaining all relevant features over eliminating all irrelevant features. Further removal of the non-relevant features can be done with "minimal-optimal" feature selection algorithms downstream in the machine learning pipeline. Therefore, this relaxation is practically sound and produces the results expected from an "all-relevant" feature selection algorithm.
This seemingly simple change has several important implications:
- Maintained recall: Because we are relaxing the confirmation criterion (making it easier to confirm features), we can never have lower recall than the vanilla Boruta. Any feature that is confirmed by the vanilla method will also be confirmed by the greedy version. This is easily proven, since it is impossible for a feature to be deemed more important than the best shadow by the binomial test without a single hit.
- Guaranteed convergence in K iterations: As shown below, this change makes it possible to compute how many iterations are required until all features are either confirmed or rejected.
- Faster convergence: As a direct consequence of the point above, the Greedy Boruta algorithm needs far fewer iterations than the vanilla Boruta to sort all features. More specifically, the minimum number of iterations the vanilla algorithm needs to sort its "first batch" of features is the same number at which the greedy version finishes running.
- Hyperparameter simplification: Another consequence of the guaranteed convergence is that some of the parameters used in the vanilla Boruta algorithm, such as max_iter (maximum number of iterations), early_stopping (boolean determining whether the algorithm should stop early if no change is seen across a number of iterations) and n_iter_no_change (minimum number of iterations with no change before early stopping is triggered), can be removed entirely without loss of flexibility. This simplification improves the algorithm's usability and makes the feature selection process easier to manage.
The modified algorithm
The Greedy Boruta algorithm follows this process:
- Shadow feature creation: Exactly the same as in the vanilla Boruta. Shadow features are created from each of the features of the dataset.
- Importance computation: Exactly the same as in the vanilla Boruta. Feature importance scores are computed with any tree-based ensemble machine learning algorithm.
- Hit registration: Exactly the same as in the vanilla Boruta. Hits are assigned to non-shadow features that are more important than the most important shadow feature.
- Statistical testing: Based on the list of no-hits for each non-shadow feature, Greedy Boruta performs a binomial test to determine whether its importance is not significantly greater than the maximum importance among shadow features across multiple iterations.
- Decision making [Modified]: Features with at least one hit are confirmed. Features that consistently underperform relative to the best shadow are "rejected." Features with zero hits remain "tentative".
- Iteration: Steps 2–5 repeat until all features are classified as confirmed or rejected.
This greedy version is based on the original BorutaPy [5] implementation with a few tweaks, so most details are kept the same as in that implementation, apart from the changes mentioned above.
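As a sketch (not the package's actual code), the modified decision step reduces to a simple rule; `max_iter` here stands for the rejection bound K derived in the next section:

```python
import numpy as np

def greedy_decisions(hit_counts, iteration, max_iter):
    """Greedy Boruta decision step for one iteration.

    hit_counts: cumulative hits per non-shadow feature.
    max_iter:   the iteration bound K at which zero-hit features
                can be rejected without running a binomial test.
    """
    counts = np.asarray(hit_counts)
    # Any feature with at least one hit is confirmed immediately.
    decisions = np.where(counts >= 1, "confirmed", "tentative")
    # At iteration K, a zero-hit feature's p-value (1/2)**K is below alpha,
    # so every remaining tentative feature is rejected.
    if iteration >= max_iter:
        decisions[counts == 0] = "rejected"
    return decisions

# Mid-run: features with hits are confirmed, the rest stay tentative.
mid = greedy_decisions([3, 0, 1], iteration=5, max_iter=17)
# Final iteration: the zero-hit feature is rejected.
final = greedy_decisions([3, 0, 1], iteration=17, max_iter=17)
```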
Statistical insight on convergence guarantee
One of the most elegant properties of the Greedy Boruta algorithm is its guaranteed convergence within a specified number of iterations that depends on the chosen α value.
Thanks to the relaxed confirmation criterion, we know that any feature with one or more hits is confirmed, so we never need to run binomial tests for confirmation. Conversely, we know that every tentative feature has zero hits. This fact greatly simplifies the equation representing the binomial test required to reject features.
More specifically, the binomial test simplifies as follows. Consider the one-sided binomial test described above for rejection in the vanilla Boruta algorithm, with success probability $p = 0.5$ and $x$ observed hits out of $n$ iterations. The p-value is calculated as:

$$p\text{-value} = P(X \le x) = \sum_{k=0}^{x} \binom{n}{k} p^k (1-p)^{n-k}$$

This formula sums the probabilities of observing $k$ successes for all values from $k = 0$ up to the observed $x$. Now, given the known values in this scenario ($p = 0.5$ and $x = 0$), the formula simplifies to:

$$p\text{-value} = \left(\frac{1}{2}\right)^n$$

To reject at significance level $\alpha$, we need:

$$p\text{-value} \le \alpha$$

Substituting our simplified p-value:

$$\left(\frac{1}{2}\right)^n \le \alpha$$

Taking the reciprocal (and reversing the inequality):

$$2^n \ge \frac{1}{\alpha}$$

Taking logarithms base 2 of both sides:

$$n \ge \log_2\left(\frac{1}{\alpha}\right)$$

Therefore, the sample size required is:

$$K = \left\lceil \log_2\left(\frac{1}{\alpha}\right) \right\rceil$$
This means that at most K = ⌈log₂(1/α)⌉ iterations of the Greedy Boruta algorithm are run until all features are sorted into either "confirmed" or "rejected" and convergence has been achieved. In other words, the Greedy Boruta algorithm has O(log(1/α)) complexity.
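A quick sanity check of this bound, using the uncorrected formula (the correction discussed below increases these values slightly):

```python
import math

def required_iterations(alpha):
    """Smallest K satisfying (1/2)**K <= alpha, i.e. K = ceil(log2(1/alpha))."""
    return math.ceil(math.log2(1 / alpha))

for a in [0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001]:
    print(f"alpha={a}: K={required_iterations(a)}")
# alpha=0.05 needs K=5; alpha=0.00001 needs K=17
```

The logarithmic growth is what makes even very conservative α values cheap for the greedy version.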
Another consequence of all tentative features having 0 hits is that we can further optimize the algorithm by not running any statistical tests across iterations.
More specifically, given α, it is possible to determine the maximum number of iterations K required to reject a variable. Therefore, for every iteration i < K, if a variable has a hit, it is confirmed, and if it does not, it is tentative (since the p-value at every iteration before K will be greater than α). Then, at exactly iteration K, all variables that have 0 hits can be moved into the rejected category with no binomial tests being run, since we know that their p-values will all be smaller than α at this point.
This also means that, for a given α, the total number of iterations run by the Greedy Boruta algorithm is equal to the minimum number of iterations it takes for the vanilla implementation to either confirm or reject any feature!
Finally, it is important to note that the implementation uses False Discovery Rate (FDR) correction to account for the increased likelihood of false positives when performing multiple hypothesis tests. In practice, the required value of K is not exactly as shown in the equation above, but the complexity with respect to α is still logarithmic.
The table below contains the number of required iterations for various α values with the correction applied:

[Table: required iterations per α value, with the FDR correction applied]
Simulation experiments
To empirically evaluate the Greedy Boruta algorithm, I conducted experiments using synthetic datasets where the ground truth is known. This approach allows precise measurement of the algorithm's performance.
Methodology
Synthetic data generation: I created datasets with a known set of important and unimportant features using sklearn's make_classification function, allowing direct computation of selection performance metrics. Moreover, these datasets include "redundant features": linear combinations of informative features that carry predictive information but are not strictly necessary for prediction. In the "all-relevant" paradigm, these features should ideally be identified as important, since they do contain signal, even if that signal is redundant. The evaluation therefore considers informative features and redundant features together as the "ground-truth relevant set" when computing recall.
Metrics: Both algorithms are evaluated on:
- Recall (sensitivity): What proportion of truly important features were correctly identified?
- Specificity: What proportion of truly unimportant features were correctly rejected?
- F1 score: The harmonic mean of precision and recall, balancing the trade-off between correctly identifying important features and avoiding false positives.
- Computational time: Wall-clock time to completion.
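For reference, these metrics can be computed from index sets as follows (a small helper written for this description, not taken from the experiment code):

```python
def selection_metrics(selected, relevant, n_features):
    """Recall, specificity and F1 of a feature-selection run.

    selected, relevant: sets of feature indices; relevant is the
    ground-truth relevant set (informative + redundant features).
    """
    irrelevant = set(range(n_features)) - relevant
    tp = len(selected & relevant)      # relevant features kept
    fp = len(selected - relevant)      # irrelevant features kept
    fn = len(relevant - selected)      # relevant features missed
    tn = len(irrelevant - selected)    # irrelevant features rejected

    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return recall, specificity, f1

# Toy example: 4 relevant features out of 10; the selector picks 3 of them
# plus one false positive.
recall, specificity, f1 = selection_metrics({0, 1, 2, 4}, {0, 1, 2, 3}, 10)
```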
Experiment 1 – Varying α
Dataset characteristics
```python
from sklearn.datasets import make_classification

X_orig, y_orig = make_classification(
    n_samples=1000,
    n_features=500,
    n_informative=5,
    n_redundant=50,        # LOTS of redundant features correlated with informative ones
    n_repeated=0,
    n_clusters_per_class=1,
    flip_y=0.3,            # some label noise
    class_sep=0.0001,
    random_state=42,
)
```
This constitutes a "hard" feature selection problem because of the high dimensionality (500 variables), the small sample size (1000 samples), the small number of relevant features (a sparse problem, with around 10% of the features being relevant in any way) and the fairly high label noise. It is important to create such a "hard" problem to effectively compare the performance of the methods; otherwise, both methods achieve near-perfect results after only a few iterations.
Hyperparameters used
In this experiment, we assess how the algorithms perform with different α values, so we evaluated both methods using α values from the list [0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2].
Regarding the hyperparameters of the Boruta and Greedy Boruta algorithms, both use an sklearn ExtraTreesClassifier as the estimator with the following parameters:
```python
from sklearn.ensemble import ExtraTreesClassifier

estimator = ExtraTreesClassifier(
    n_estimators=500,
    max_depth=5,
    n_jobs=-1,
    max_features="log2",
)
```
The Extra Trees classifier was chosen as the estimator because of its fast fitting time and the fact that it is more stable in feature importance estimation tasks [2].
Finally, early stopping applies only to the vanilla Boruta (this parameter makes no sense in the context of Greedy Boruta).
Number of trials
The vanilla Boruta algorithm is configured to run at most 512 iterations but with an early stopping condition. This means that if no changes are seen in X iterations (n_iter_no_change), the run stops. For each α, the value of n_iter_no_change is defined as follows:

[Table: n_iter_no_change per α value]

Early stopping is enabled because a careful user of the vanilla Boruta algorithm would set it if the wall-clock time of the run were high enough, and it is a more sensible use of the algorithm overall.
These early stopping thresholds were chosen to balance computational cost with convergence likelihood: smaller thresholds for larger α values (where convergence is faster) and larger thresholds for smaller α values (where statistical significance takes more iterations to accumulate). This reflects how a practical user would configure the algorithm to avoid unnecessarily long runtimes.
Results: performance comparison
[Figure 1: performance comparison of vanilla and Greedy Boruta across α values]
Key finding: As presented in the left-most panel of Figure 1, Greedy Boruta achieves recall greater than or equal to that of the vanilla Boruta across all experimental conditions. For the two smallest α values the recall is equal, and for the others the Greedy Boruta implementation has slightly greater recall, confirming that the relaxed confirmation criterion does not miss features that the vanilla method would catch.
Observed trade-off: Greedy Boruta shows modestly lower specificity in some settings, confirming that the relaxed criterion does lead to more false positives. However, the magnitude of this effect is smaller than expected, resulting in only 2-6 additional features being selected in this dataset with 500 variables. This increased false-positive rate is acceptable in most downstream pipelines for two reasons: (1) the absolute number of additional features is small (2-6 features in this 500-feature dataset), and (2) subsequent modeling steps (e.g., regularization, cross-validation, or minimal-optimal feature selection) can filter these features out if they do not contribute to predictive performance.
Speedup: Greedy Boruta consistently requires 5-15× less time than the vanilla implementation, with the speedup increasing for more conservative α values. For α = 0.00001, the improvement approaches 15×. It is also expected that even smaller α values would lead to increasingly larger speedups. It is important to note that for most scenarios with α < 0.001, the vanilla Boruta implementation "does not converge" (not all features are confirmed or rejected) and, without early stopping, would run for far longer than this.
Convergence: We can also evaluate how fast each method "converges" by analysing the status of the variables at each iteration, as shown in the plot below:
[Figure 2: variable status (confirmed/rejected/tentative) per iteration for both methods]
In this scenario, using α = 0.00001, we can observe the behavior mentioned above: the first confirmation/rejection by the vanilla algorithm occurs at the last iteration of the greedy algorithm (hence the complete overlap of the lines in the rejection plot).
Thanks to the logarithmic growth of the maximum number of iterations of the Greedy Boruta with respect to α, we can also explore extreme α values when using the greedy version:
[Figure 3: Greedy Boruta results for extreme α values]
Experiment 2 – Exploring the maximum number of iterations
Parameters
In this experiment, the same dataset and hyperparameters as described in the last experiment were used, except that α was held fixed and the maximum number of iterations (for the vanilla algorithm) changed across runs. The maximum numbers of iterations analyzed are [16, 32, 64, 128, 256, 512]. Also, early stopping was disabled for this experiment in order to showcase one of the weaknesses of the vanilla Boruta algorithm.
It is important to note that for this experiment there is only one data point for the Greedy Boruta method, since the maximum number of iterations is not a parameter of its own in the modified version; it is uniquely defined by the α used.
Results: performance comparison

[Figure 4: performance comparison across maximum numbers of iterations]
Once again, we observe that the Greedy Boruta achieves higher recall than the vanilla Boruta algorithm while having slightly decreased specificity, across all the numbers of iterations considered. In this scenario, we also observe that the Greedy Boruta achieves recall levels similar to those of the vanilla algorithm in ~4× less time.
Moreover, because the vanilla algorithm has no "guarantee of convergence" within a given number of iterations, the user must define a maximum number of iterations for which the algorithm will run. In practice, it is difficult to determine this number without knowing the ground truth for important features and the accompanying number of iterations needed to trigger early stopping. Given this difficulty, an overly conservative user may run the algorithm for far too many iterations without a significant improvement in feature selection quality.
In this specific case, using a maximum of 512 iterations without early stopping achieves a recall very similar to that achieved with 64, 128 and 256 iterations. Comparing the greedy version to the 512 iterations of the vanilla algorithm, we see a 40× speedup together with slightly greater recall.
When to use Greedy Boruta?
The Greedy Boruta algorithm is particularly valuable in specific scenarios:
- High-dimensional data with limited time: When working with datasets that contain hundreds or thousands of features, the computational cost of the vanilla Boruta can be prohibitive. If quick results are required for exploratory analysis or rapid prototyping, Greedy Boruta offers a compelling speed-accuracy trade-off.
- All-relevant feature selection goals: If your objective aligns with Boruta's original "all-relevant" philosophy of finding every feature that contributes some information, rather than the minimal optimal set, then Greedy Boruta's high recall is precisely what you want. The algorithm favors inclusion, which is appropriate when feature removal is costly (e.g., in scientific discovery or causal inference tasks).
- Iterative analysis workflows: In practice, feature selection is rarely a one-shot decision. Data scientists often iterate, experimenting with different feature sets and models. Greedy Boruta enables rapid iteration by providing fast initial results that can be refined in subsequent analyses. Moreover, other feature selection methods can be used to further reduce the dimensionality of the feature set.
- A few extra features are OK: The vanilla Boruta's strict statistical testing is valuable when false positives are particularly costly. However, in many applications, including a few extra features is preferable to missing important ones. Greedy Boruta is ideal when the downstream pipeline can handle slightly larger feature sets but benefits from faster processing.
Conclusion
The Greedy Boruta algorithm is an extension of a well-established feature selection method, with significantly different properties. By relaxing the confirmation criterion from statistical significance to a single hit, we achieve:
- 5-40× faster run times compared to standard Boruta, for the scenarios explored.
- Equal or greater recall, ensuring no relevant features are missed.
- Guaranteed convergence with all features classified.
- Maintained interpretability and theoretical grounding.
The trade-off, a modest increase in the false positive rate, is acceptable in many real-world applications, particularly when working with high-dimensional data under time constraints.
For practitioners, the Greedy Boruta algorithm offers a valuable tool for rapid, high-recall feature selection in exploratory analysis, with the option to follow up with more conservative methods if needed. For researchers, it demonstrates how thoughtful modifications to established algorithms can yield significant practical advantages by carefully considering the actual requirements of real-world applications.
The algorithm is most appropriate when your philosophy aligns with finding "all relevant" features rather than a minimal set, when speed matters, and when false positives can be tolerated or filtered in downstream analysis. In these common scenarios, Greedy Boruta provides a compelling alternative to the vanilla algorithm.
References
[1] Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1–13.
[2] Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
[3] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
[4] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30, 3146–3154.
[5] BorutaPy implementation: https://github.com/scikit-learn-contrib/boruta_py
Code availability
The complete implementation of Greedy Boruta is available at GreedyBorutaPy.
Greedy Boruta is also available as a PyPI package at greedyboruta.
