Meet Gorilla: UC Berkeley and Microsoft’s API-Augmented LLM Outperforms GPT-4, ChatGPT and Claude.
The Dataset
The Architecture
Gorilla in Motion

The model is augmented with APIs from Torch Hub, TensorFlow Hub and HuggingFace.

Image Credit: UC Berkeley

I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and ideas. Please give it a try by subscribing below:

Recent advancements in large language models (LLMs) have revolutionized the field, equipping them with new capabilities like natural dialogue, mathematical reasoning, and program synthesis. Nonetheless, LLMs still face inherent limitations. Their ability to store information is constrained by fixed weights, and their computation capabilities are limited to a static graph and narrow context. Moreover, as the world evolves, LLMs need retraining to update their knowledge and reasoning abilities. To overcome these limitations, researchers have begun empowering LLMs with tools. By granting access to extensive and dynamic knowledge bases and enabling complex computational tasks, LLMs can leverage search technologies, databases, and computational tools. Leading LLM providers have begun integrating plugins that allow LLMs to invoke external tools through APIs. This transition from a limited set of hand-coded tools to access to a virtually unlimited array of cloud APIs has the potential to turn LLMs into the primary interface for computing infrastructure and the web. Tasks such as booking vacations or hosting conferences could be as simple as conversing with an LLM that has access to flight, car rental, hotel, catering, and entertainment web APIs.

Recently, researchers from UC Berkeley and Microsoft unveiled Gorilla, a LLaMA-7B model designed specifically for API calls. Gorilla relies on self-instruct fine-tuning and retrieval techniques to enable LLMs to accurately select from a large and evolving set of tools expressed through their APIs and documentation. The authors construct a large corpus of APIs, called APIBench, by scraping machine learning APIs from major model hubs such as TorchHub, TensorHub, and HuggingFace. Using self-instruct, they generate pairs of instructions and corresponding APIs. The fine-tuning process involves converting the data into a user-agent chat-style conversation format and performing standard instruction fine-tuning on the base LLaMA-7B model.
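To make that last step concrete, here is a minimal sketch of how a self-instruct (instruction, API) pair might be rendered into a user-agent chat-style training example. The template wording and the example call are illustrative assumptions, not the authors’ exact schema.

```python
# Hypothetical sketch: rendering one (instruction, API) pair as a chat-style
# training example for instruction fine-tuning. The template wording is an
# assumption, not Gorilla's exact format.

def to_chat_example(instruction: str, api_call: str) -> str:
    """Render a single training example as a user-agent conversation string."""
    return (
        f"### User: {instruction}\n"
        f"### Assistant: {api_call}"
    )

print(to_chat_example(
    "I need a pretrained model to classify photos of cats and dogs.",
    'model = torch.hub.load("pytorch/vision", "densenet121", pretrained=True)',
))
```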

Image Credit: UC Berkeley

API calls often include constraints, adding complexity to the LLM’s comprehension and categorization of the calls. For instance, a prompt may require invoking an image classification model with specific parameter size and accuracy constraints. These challenges highlight the need for LLMs to understand not only the functional description of an API call but also to reason about the constraints embedded in it.
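As a toy illustration of such a constraint-bearing prompt and a call that could satisfy it (the specific model and thresholds below are made up, not taken from the paper):

```python
# Illustrative only: a prompt embedding explicit constraints (parameter budget
# and minimum accuracy) and one API call that would satisfy them. The model
# choice and the numbers are hypothetical.
prompt = (
    "Find an image classification model with fewer than 10M parameters "
    "and at least 70% top-1 accuracy on ImageNet."
)
satisfying_call = 'model = torch.hub.load("pytorch/vision", "mobilenet_v2", pretrained=True)'
```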

The dataset at hand encompasses three distinct domains: Torch Hub, Tensor Hub, and HuggingFace. Each domain contributes a wealth of data, highlighting the diverse nature of the dataset. Torch Hub, for instance, offers 95 APIs, providing a solid foundation. In comparison, Tensor Hub takes it a step further with an extensive collection of 696 APIs. Lastly, HuggingFace leads the pack with a whopping 925 APIs, making it the most comprehensive domain.

To amplify the value and utility of the dataset, an additional effort has been undertaken. Each API in the dataset is accompanied by a set of 10 carefully crafted and uniquely tailored instructions. These instructions serve as indispensable guides for both training and evaluation purposes. This initiative ensures that every API goes beyond mere representation, enabling more robust use and evaluation.
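Put together, a single APIBench-style record might look roughly like the sketch below. The field names are assumptions; the substance is the API call, its documentation, and the ten paired instructions.

```python
# Hypothetical shape of one APIBench-style record. Field names are assumed for
# illustration; the content described in the paper is the API call, its
# documentation, and the ten instructions paired with it.
api_record = {
    "domain": "torchhub",
    "api_call": 'torch.hub.load("pytorch/vision", "densenet121", pretrained=True)',
    "api_doc": "DenseNet-121 convolutional network pre-trained on ImageNet for image classification.",
    "instructions": [
        "Classify this photo of my dog into one of the ImageNet categories.",
        "I need a pretrained CNN to label product photos.",
        # ...eight more instructions in the full record
    ],
}
```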

Gorilla introduces the notion of retriever-aware training, where the instruction-tuned dataset includes an additional field with retrieved API documentation for reference. This approach aims to teach the LLM to parse and answer questions based on the provided documentation. The authors demonstrate that this method allows the LLM to adapt to changes in API documentation, improves performance, and reduces hallucination errors.
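A minimal sketch of how such a retriever-aware training example could be assembled, assuming the retrieved documentation is simply appended to the instruction with a short reference note (the note’s wording is an assumption):

```python
# Sketch of building a retriever-aware training prompt: the instruction is
# paired with retrieved API documentation so the model learns to ground its
# answer in that documentation. The reference-note wording is an assumption.
def build_retriever_aware_prompt(instruction: str, retrieved_doc: str) -> str:
    return (
        f"{instruction}\n"
        f"Use this API documentation for reference: {retrieved_doc}"
    )
```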

During inference, users provide prompts in natural language. Gorilla can operate in two modes: zero-shot and retrieval. In zero-shot mode, the prompt is fed directly to the Gorilla LLM model, which returns the recommended API call to perform the task or goal. In retrieval mode, the retriever (either BM25 or GPT-Index) retrieves the latest API documentation from the API database. This documentation is concatenated with the user prompt, along with a message indicating the reference to the API documentation. The concatenated input is then passed to Gorilla, which outputs the API to be invoked. No prompt tuning is performed beyond this concatenation step.
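The two modes can be sketched roughly as follows; `gorilla_generate` and `retrieve_docs` are hypothetical placeholders standing in for the model call and the BM25/GPT-Index retriever, not real library functions.

```python
# Sketch of Gorilla's two inference modes. `gorilla_generate` (the model call)
# and `retrieve_docs` (a BM25 or GPT-Index retriever over the API database)
# are hypothetical placeholders.

def gorilla_generate(prompt: str) -> str:
    """Placeholder for the Gorilla model call."""
    raise NotImplementedError

def retrieve_docs(prompt: str) -> str:
    """Placeholder for retrieving the most relevant API documentation."""
    raise NotImplementedError

def infer(prompt: str, mode: str = "zero-shot") -> str:
    if mode == "zero-shot":
        # The prompt goes straight to the model.
        return gorilla_generate(prompt)
    if mode == "retrieval":
        # Retrieved documentation is concatenated with the prompt, along with
        # a note pointing the model at the reference documentation.
        doc = retrieve_docs(prompt)
        return gorilla_generate(
            f"{prompt}\nUse this API documentation for reference: {doc}"
        )
    raise ValueError(f"Unknown mode: {mode}")
```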

Image Credit: UC Berkeley

Inductive program synthesis has achieved success in various domains by synthesizing programs that meet specific test cases. However, when it comes to evaluating API calls, relying solely on test cases falls short because it becomes difficult to verify the semantic correctness of the code. Consider the example of image classification, where there are more than 40 different models available for the task. Even if we narrow it down to a specific family, such as DenseNet, there are four possible configurations. Consequently, multiple correct answers exist, making it difficult to determine through unit tests whether the API being used is functionally equivalent to the reference API. To evaluate the model’s performance, a comparison of functional equivalence is made using the collected dataset. To identify which API in the dataset the LLM called, an AST (Abstract Syntax Tree) sub-tree matching strategy is employed. By checking whether the reference API call’s AST appears as a sub-tree of the candidate code, it becomes possible to trace which API is being used.
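A simplified sketch of the idea using Python’s built-in `ast` module: the check passes if the reference API call’s AST appears as a sub-tree of the candidate code’s AST. This is a toy exact-match version for illustration, not the paper’s implementation.

```python
import ast

def api_call_matches(candidate_code: str, reference_call: str) -> bool:
    """Toy AST sub-tree check: does the reference API call appear as a
    sub-tree of the candidate code? (Exact match on the call node.)"""
    ref_node = ast.parse(reference_call, mode="eval").body
    ref_dump = ast.dump(ref_node)
    cand_tree = ast.parse(candidate_code)
    # Walk every node of the candidate and compare against the reference call.
    return any(ast.dump(node) == ref_dump for node in ast.walk(cand_tree))

# The candidate wraps the call in an assignment, but the call itself still
# appears as a sub-tree, so the match succeeds.
print(api_call_matches(
    'model = torch.hub.load("pytorch/vision", "densenet121", pretrained=True)',
    'torch.hub.load("pytorch/vision", "densenet121", pretrained=True)',
))  # True
```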

Identifying and defining hallucinations poses a significant challenge. The AST matching process is leveraged to identify hallucinations directly. In this context, a hallucination refers to an API call that is not a sub-tree of any API in the database, essentially invoking an entirely imagined tool. It is important to note that this definition of hallucination differs from invoking an API incorrectly, which is defined as an error.

AST sub-tree matching plays an important role in identifying the specific API being called within the dataset. Since API calls can have multiple arguments, each of these arguments must be matched. Moreover, because Python allows default arguments, it is essential to define which arguments to match for each API in the database.
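A hedged sketch of that argument-aware matching: only the function name and the arguments listed for the API in the database are compared, so default or extra arguments the model adds are ignored. The helper below is illustrative, not the paper’s matcher.

```python
import ast

def call_matches_api(candidate_call: str, api_name: str, required_kwargs: dict) -> bool:
    """Illustrative matcher: the candidate must invoke `api_name` with the
    keyword arguments listed for that API in the database; any other
    (default or extra) arguments are ignored."""
    node = ast.parse(candidate_call, mode="eval").body
    if not isinstance(node, ast.Call):
        return False
    # Recover the dotted function name, e.g. "torch.hub.load".
    parts, func = [], node.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
    if ".".join(reversed(parts)) != api_name:
        return False
    # Compare only the database-specified keyword arguments.
    given = {kw.arg: getattr(kw.value, "value", None) for kw in node.keywords}
    return all(given.get(key) == value for key, value in required_kwargs.items())

print(call_matches_api(
    'torch.hub.load("pytorch/vision", "densenet121", pretrained=True)',
    "torch.hub.load",
    {"pretrained": True},
))  # True
```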

Image Credit: UC Berkeley

Along with the paper, the researchers open sourced a version of Gorilla. The release includes a notebook with many examples. Moreover, the following video clearly shows some of the magic of Gorilla.

Gorilla is one of the most interesting approaches in the tool-augmented LLM space. Hopefully, we will see the model distributed in some of the main ML hubs in the space.
