Empowering Efficient BO Transfer with Neural Acquisition Process (NAP)

General Objectives & Results:

Our primary objective is to strengthen the effectiveness of Bayesian Optimisation (BO) by leveraging meta-learning to transfer knowledge across problem domains, thereby significantly improving sample efficiency.

In pursuit of this goal, we introduce the Neural Acquisition Process (NAP), a novel end-to-end architecture based on Transformer models and designed explicitly for BO. NAP learns acquisition functions directly and provides a comprehensive framework for optimising various tasks within the BO paradigm.

Through extensive experiments, we demonstrate the strong performance of NAP, which achieves state-of-the-art results across diverse domains. In particular, our framework succeeds in antibody design, EDA logic-synthesis sequence optimisation, and hyperparameter optimisation, surpassing existing approaches in both effectiveness and efficiency.

From Bayesian Optimisation to Meta-Bayesian Optimisation:

Bayesian optimisation, a widely recognised machine-learning paradigm for the efficient optimisation of black-box functions, has gained significant traction in diverse domains. The approach relies on two fundamental components.

The first component is a surrogate model, for which a Gaussian process (GP) is usually employed because of its probabilistic nature. GPs produce predictions with calibrated uncertainties, making them well suited to capturing and representing complex patterns in the data. Moreover, GPs are sample-efficient, making effective use of limited data.

Once the surrogate model has been fitted to the observed data, the second component determines the most promising regions of the search space to explore. This is done with an acquisition function, which takes into account the uncertainty estimated by the model. By balancing the exploration-exploitation trade-off, the acquisition function guides the optimisation towards the regions of the search space most likely to yield the greatest performance improvement.
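For readers new to BO, here is a minimal sketch of one such step: a GP surrogate fitted to a toy 1-D function, with Expected Improvement (EI) as the acquisition function. The kernel, lengthscale, and test function are illustrative choices, not the setup used in our experiments.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-6):
    # Standard zero-mean GP regression equations.
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_query, X_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    v = np.linalg.solve(K, Ks.T)
    var = 1.0 - np.einsum('ij,ji->i', Ks, v)  # prior variance of the RBF kernel is 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    # EI for minimisation: expected improvement over the incumbent best_y.
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# One BO step on the toy function f(x) = (x - 0.3)^2.
f = lambda x: (x - 0.3) ** 2
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = f(X_obs).ravel()
X_query = np.linspace(0.0, 1.0, 101)[:, None]
mu, sigma = gp_posterior(X_obs, y_obs, X_query)
ei = expected_improvement(mu, sigma, y_obs.min())
x_next = X_query[np.argmax(ei)]  # the point the acquisition proposes next
```

The acquisition deliberately trades off a low predicted mean (exploitation) against high uncertainty (exploration); the `argmax` of EI selects the next point to evaluate.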

Despite its success, the standard setup of Bayesian optimisation has certain limitations. Firstly, the Gaussian process (GP) model suffers from well-known computational challenges and often struggles to represent high-dimensional spaces adequately. Addressing these limitations typically requires problem-specific techniques and expert knowledge.

Secondly, the standard approach treats the surrogate model and the acquisition function as separate entities that operate independently. This disjoint treatment leaves the potential synergies between the two components only partially exploited.

Lastly, Bayesian optimisation traditionally operates in a "tabula rasa" manner: each new problem starts from scratch, without leveraging prior knowledge or experience.

Meta-Bayesian optimisation can be used to overcome some of these limitations. The goal of employing meta-learning is to acquire transferable knowledge from similar tasks, something classical GP models and acquisition strategies often struggle with. By learning from related tasks, the meta-learning approach aims to enhance the adaptability and efficiency of BO, enabling it to leverage existing knowledge and experience when confronted with new problems.

In meta-Bayesian optimisation, we assume that we have observed data from previous optimisation tasks (source tasks) and are now confronted with a new function to optimise (test task). The literature offers several approaches to this problem:

  1. Learn a Meta-Model: FSBO learns a neural feature extractor shared across all source tasks. These features are then fed into a GP model, and a conventional acquisition function is used. This model, known as a Deep Kernel GP, is well established in the literature. The notable advantage lies in learning the deep kernel across the source tasks and applying it to a test task.
  2. Learn a Meta-Acquisition Function: In contrast, MetaBO keeps a standard GP model but replaces the acquisition function with a neural network, which is then trained with reinforcement learning (RL) techniques.
  3. Learn a Sequential Model: OptFormer takes a different route, training a sequential model to predict the next points, dimension by dimension, and to directly predict the upcoming y values. This is done with a notably large Transformer model trained purely with supervised learning. The sequential model allows OptFormer to capture the data's underlying patterns and dynamics and to make precise predictions at each step.
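The deep-kernel construction in approach 1 can be sketched as an RBF kernel applied to neural-network features. The two-layer network below uses random weights purely for illustration; in FSBO, these weights are what gets meta-trained on the source tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer feature extractor standing in for the meta-learned network.
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)

def phi(X):
    # Neural feature map: inputs (n, 4) -> features (n, 8).
    return np.tanh(X @ W1 + b1) @ W2 + b2

def deep_kernel(A, B, lengthscale=5.0):
    # RBF kernel evaluated in feature space rather than input space.
    FA, FB = phi(A), phi(B)
    d2 = ((FA[:, None, :] - FB[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

X = rng.uniform(size=(5, 4))
K = deep_kernel(X, X)  # a valid covariance matrix for a GP over X
```

Once meta-trained, the deep kernel is plugged into a standard GP and used with a conventional acquisition function on the test task.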

These approaches excel at transferring information to new tasks and at improving the sample efficiency of Bayesian optimisation. Nevertheless, they still face challenges stemming from the use of a GP model and the disjoint treatment of the two components.

Neural Acquisition Processes (NAP):

We now present our method, NAP, an architecture with the following advantages:

  1. Eliminates the need for a Gaussian process (GP).
  2. Uses a single Transformer architecture that encompasses both the model and the acquisition components.
  3. Is end-to-end differentiable, enabling seamless optimisation and learning throughout the entire framework.

Training: Since we do not have access to acquisition-function labels in the source-task datasets, we use reinforcement learning (RL) to train our architecture. In this approach, the reward for a trajectory is determined by the achieved regret. However, the rewards are sparse, which hampers training. To address this, we introduce an inductive bias in the form of an auxiliary loss that incorporates supervised information. Notably, in the algorithm presented below, our architecture has two losses through which gradients flow back, enabling comprehensive end-to-end training.
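Schematically, the training objective combines a policy-gradient term on the regret-based reward with a supervised auxiliary term. The reward shaping, the MSE auxiliary, and the weighting `alpha` below are illustrative stand-ins for the quantities defined in the paper.

```python
import numpy as np

def regret_reward(ys, y_star):
    # Reward from simple regret: how close the incumbent best is to the
    # optimum y_star (illustrative shaping; 0 once the optimum is reached).
    best_so_far = np.minimum.accumulate(ys)
    return -(best_so_far - y_star)

def nap_loss(log_probs, ys, y_star, aux_pred, aux_target, alpha=1.0):
    # Total loss = RL (REINFORCE-style) term + supervised auxiliary term;
    # gradients would flow back through both in the actual architecture.
    rewards = regret_reward(ys, y_star)
    rl_loss = -(log_probs * rewards).sum()
    aux_loss = ((aux_pred - aux_target) ** 2).mean()
    return rl_loss + alpha * aux_loss

# Toy trajectory of three queried points.
log_probs = np.log(np.array([0.5, 0.6, 0.7]))  # policy log-probs of chosen points
ys = np.array([1.0, 0.5, 0.2])                 # observed function values
loss = nap_loss(log_probs, ys, y_star=0.0,
                aux_pred=np.zeros(4), aux_target=np.zeros(4))
```

The auxiliary term densifies the learning signal precisely where the regret-based reward is sparse, while the RL term keeps the objective aligned with optimisation performance.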

Cool Properties:

NAP enjoys desirable properties that derive from its Transformer-based architecture.

Property 1: Invariance to History Order: Unlike a classical Transformer, NAP does not employ positional encoding. Consequently, we can treat the history of observed points as an unordered set, in which the sequence in which the points were observed is inconsequential. This is critical in Bayesian optimisation, since the prediction of the next point to query should not be influenced by the order in which the previous points were observed.

Property 2: Query Invariance: Looking at the attention mask, note that each newly explored point, referred to as a query point, has visibility only of itself and of the observed history. The other query points are hidden from its view. Consequently, the generated predictions are conditionally independent given the history and are unaffected by the order in which the queries are made. This matches the expected behaviour in Bayesian optimisation, as predictions for new points must remain unaffected by the concurrent exploration of other points.
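Both properties can be checked on a toy single-head attention layer with this kind of mask. The projections here are identity matrices and the sizes arbitrary: a sketch of the masking logic, not the actual NAP network.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(X, mask):
    # Single-head self-attention with identity Q/K/V projections and,
    # crucially, no positional encoding; mask[i, j] = True lets i see j.
    scores = (X @ X.T) / np.sqrt(X.shape[1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ X

def nap_mask(n_hist, n_query):
    # History tokens see the history; each query sees the history and itself.
    n = n_hist + n_query
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_hist] = True
    for q in range(n_hist, n):
        mask[q, q] = True
    return mask

rng = np.random.default_rng(1)
H = rng.normal(size=(3, 4))  # 3 observed (history) points
Q = rng.normal(size=(2, 4))  # 2 candidate query points
out = attend(np.vstack([H, Q]), nap_mask(3, 2))

# Property 1: shuffling the history leaves the query outputs unchanged.
out_perm = attend(np.vstack([H[[2, 0, 1]], Q]), nap_mask(3, 2))
assert np.allclose(out[3:], out_perm[3:])

# Property 2: removing the other query leaves each query's output unchanged.
out_single = attend(np.vstack([H, Q[:1]]), nap_mask(3, 1))
assert np.allclose(out[3], out_single[3])
```

Because attention without positional encoding is a permutation-invariant set operation, and the mask blocks query-to-query visibility, both invariances fall out of the architecture by construction.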

Comparison to OptFormer: OptFormer is limited by its purely supervised training, which makes it depend on both the order of the variables and the order of the dimensions within each variable. In contrast, our model has several advantages: it is smaller, with only 10% of the original model's capacity, has a 40% smaller memory footprint, and requires a mere 2% of the compute time. Despite these efficiency gains, our model achieves regret results similar to OptFormer's.

Results: Our method performs remarkably well, achieving state-of-the-art results in hyperparameter tuning, EDA, MIP, and antibody design:

Check out the full paper!

Follow for more!
