The huggingface_hub library is a lightweight interface that provides a programmatic approach to exploring the hosting endpoints Hugging Face provides: models, datasets, and Spaces.
Up until now, searching on the Hub through this interface was tricky to pull off, and there were many aspects of it a user had to “just know” and get accustomed to.
In this article, we will take a look at a few exciting new features added to huggingface_hub to help lower that bar and provide users with a friendly API to search for the models and datasets they want to use, without leaving their Jupyter or Python interfaces.
Before we begin, if you don’t have the latest version of the huggingface_hub library on your system, please run the following cell:
!pip install huggingface_hub -U
Situating the Problem:
First, let’s imagine the scenario you are in. You’d like to find all models hosted on the Hugging Face Hub for Text Classification that were trained on the GLUE dataset and are compatible with PyTorch.
You could simply open https://huggingface.co/models and use the widgets there. But this requires leaving your IDE and scanning the results, and it takes a number of button clicks to get you the information you need.
What if there were a solution to this without having to leave your IDE? With a programmatic interface, it would also be easy to integrate this into workflows for exploring the Hub.
This is where the huggingface_hub comes in.
For those familiar with the library, you may already know that we can search for these kinds of models. However, getting the query right is a painful process of trial and error.
Could we simplify that? Let’s find out!
Finding what we need
First we’ll import the HfApi, which is a class that helps us interact with Hugging Face’s backend hosting. We can interact with the models, datasets, and more through it. Along with this, we’ll import a few helper classes: the ModelFilter and ModelSearchArguments
from huggingface_hub import HfApi, ModelFilter, ModelSearchArguments
api = HfApi()
These two classes can help us frame a solution to the problem above. The ModelSearchArguments class is a namespace-like one that contains every single valid parameter we can search for!
Let’s take a peek:
>>> model_args = ModelSearchArguments()
>>> model_args
Available Attributes or Keys:
* author
* dataset
* language
* library
* license
* model_name
* pipeline_tag
We can see a variety of attributes available to us (more on how this magic is done later). If we were to categorize what we wanted, we could likely separate them out as:
- pipeline_tag (or task): Text Classification
- dataset: GLUE
- library: PyTorch
Given this separation, it makes sense that we would find them within the model_args we declared:
>>> model_args.pipeline_tag.TextClassification
'text-classification'
>>> model_args.dataset.glue
'dataset:glue'
>>> model_args.library.PyTorch
'pytorch'
What we begin to notice, though, is some of the convenience wrapping performed here. ModelSearchArguments (and the complementary DatasetSearchArguments) have a human-readable interface with the formatted outputs the API wants, such as how the GLUE dataset should be searched with dataset:glue.
This is important because without this “cheat sheet” of knowing how certain parameters should be written, you can very easily sit in frustration as you try to search for models with the API!
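Under the hood, those values are just the raw strings the API expects, so nothing stops you from writing them by hand. Here’s a minimal sketch of the string-only equivalent (note how one mistyped value, say "glue" instead of "dataset:glue", would quietly match nothing):
>>> # the same filter spelled out as raw strings; easy to get subtly wrong
>>> api.list_models(filter = ("text-classification", "dataset:glue", "pytorch"))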
Now that we know what the right parameters are, we can search the API with ease:
>>> models = api.list_models(filter = (
>>>     model_args.pipeline_tag.TextClassification,
>>>     model_args.dataset.glue,
>>>     model_args.library.PyTorch)
>>> )
>>> print(len(models))
140
We find that there were 140 models matching our criteria (at the time of writing)! And if we take a closer look at one, we can see that it does indeed look right:
>>> models[0]
ModelInfo: {
modelId: Jiva/xlm-roberta-large-it-mnli
sha: c6e64469ec4aa17fedbd1b2522256f90a90b5b86
lastModified: 2021-12-10T14:56:38.000Z
tags: ['pytorch', 'xlm-roberta', 'text-classification', 'it', 'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116', 'transformers', 'tensorflow', 'license:mit', 'zero-shot-classification']
pipeline_tag: zero-shot-classification
siblings: [ModelFile(rfilename=".gitattributes"), ModelFile(rfilename="README.md"), ModelFile(rfilename="config.json"), ModelFile(rfilename="pytorch_model.bin"), ModelFile(rfilename="sentencepiece.bpe.model"), ModelFile(rfilename="special_tokens_map.json"), ModelFile(rfilename="tokenizer.json"), ModelFile(rfilename="tokenizer_config.json")]
config: None
private: False
downloads: 680
library_name: transformers
likes: 1
}
It’s a bit more readable, and there’s no guessing involved with “Did I get this parameter right?”
Did you know you can also get the information for this model programmatically with its model ID? Here’s how you would do it:
api.model_info('Jiva/xlm-roberta-large-it-mnli')
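And since list_models returns a list of these ModelInfo objects, we can post-process the results in plain Python. As a small sketch (relying only on the modelId and downloads attributes shown in the output above), here is how we might surface the five most-downloaded matches:
>>> # sort by download count (fall back to 0 where the field is missing)
>>> top = sorted(models, key=lambda m: getattr(m, "downloads", 0) or 0, reverse=True)[:5]
>>> for model in top:
>>>     print(model.modelId)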
Taking it up a Notch
We saw how we could use the ModelSearchArguments and DatasetSearchArguments to remove the guesswork from searching the Hub, but what if we have a very complex, messy query?
Such as:
I want to search for all models trained for both text-classification and zero-shot classification, that were trained on both the Multi NLI and GLUE datasets, and are compatible with both PyTorch and TensorFlow (a more exact query to retrieve the above model).
To set up this query, we’ll make use of the ModelFilter class. It’s designed to handle these kinds of situations, so we don’t need to scratch our heads:
>>> filt = ModelFilter(
>>>     task = ["text-classification", "zero-shot-classification"],
>>>     trained_dataset = [model_args.dataset.multi_nli, model_args.dataset.glue],
>>>     library = ['pytorch', 'tensorflow']
>>> )
>>> api.list_models(filt)
[ModelInfo: {
modelId: Jiva/xlm-roberta-large-it-mnli
sha: c6e64469ec4aa17fedbd1b2522256f90a90b5b86
lastModified: 2021-12-10T14:56:38.000Z
tags: ['pytorch', 'xlm-roberta', 'text-classification', 'it', 'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116', 'transformers', 'tensorflow', 'license:mit', 'zero-shot-classification']
pipeline_tag: zero-shot-classification
siblings: [ModelFile(rfilename=".gitattributes"), ModelFile(rfilename="README.md"), ModelFile(rfilename="config.json"), ModelFile(rfilename="pytorch_model.bin"), ModelFile(rfilename="sentencepiece.bpe.model"), ModelFile(rfilename="special_tokens_map.json"), ModelFile(rfilename="tokenizer.json"), ModelFile(rfilename="tokenizer_config.json")]
config: None
private: False
downloads: 680
library_name: transformers
likes: 1
}]
Very quickly, we see that this is a much more coordinated approach to searching through the API, with no added headache!
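Everything above carries over to datasets as well. As a hedged sketch (the exact filter fields, such as task_categories here, depend on your version of huggingface_hub, so print dataset_args to see what is available to you), a dataset search might look like:
>>> from huggingface_hub import DatasetFilter, DatasetSearchArguments
>>> dataset_args = DatasetSearchArguments()
>>> # find datasets tagged with the text-classification task category
>>> datasets = api.list_datasets(filter = DatasetFilter(task_categories = "text-classification"))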
What’s the magic?
Very briefly, we’ll talk about the underlying magic at play that gives us this enum-dictionary-like datatype, the AttributeDictionary.
Heavily inspired by the AttrDict class from the fastcore library, the general idea is that we take a normal dictionary and supercharge it for exploratory programming by providing tab-completion for every key in the dictionary.
As we saw earlier, this gets even stronger when we have nested dictionaries we can explore through, such as model_args.dataset.glue!
For those familiar with JavaScript, we mimic how the object class works.
This simple utility class can provide a much more user-focused experience when exploring nested datatypes and trying to understand what is there, such as the return of an API request!
As mentioned before, we expand on the AttrDict in a few key ways:
- You can delete keys with del model_args[key] or with del model_args.key
- That clean __repr__ we saw earlier
One very important concept to note, though, is that if a key contains a number or special character, it must be indexed as a dictionary and not as an object.
>>> from huggingface_hub.utils.endpoint_helpers import AttributeDictionary
A very brief example of this is if we have an AttributeDictionary with a key of 3_c:
>>> d = {"a":2, "b":3, "3_c":4}
>>> ad = AttributeDictionary(d)
>>>
>>> ad.3_c
File "", line 2
ad.3_c
^
SyntaxError: invalid token
>>>
>>> ad["3_c"]
4
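To round out the list of additions from earlier, here is a quick sketch of the deletion support, reusing the same ad object from above:
>>> del ad.b          # delete a key attribute-style
>>> del ad["a"]       # or dictionary-style
>>> list(ad.keys())
['3_c']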
Concluding thoughts
Hopefully by now you have a brief understanding of how this new searching API can directly impact your workflow and exploration of the Hub! Along with this, perhaps you know of a place in your code where the AttributeDictionary might be useful for you.
From here, make sure to check out the official documentation on Searching the Hub Efficiently, and don’t forget to give us a star!
