The huggingface_hub library is a lightweight interface that provides a programmatic approach to exploring the hosting endpoints Hugging Face provides: models, datasets, and Spaces.
Up until now, searching on the Hub through this interface was tricky to pull off, and there were many aspects of it a user had to “just know” and get accustomed to.
In this article, we will take a look at a few exciting new features added to huggingface_hub to help lower that bar and provide users with a friendly API to search for the models and datasets they want to use, without leaving their Jupyter or Python interfaces.
Before we begin, if you don’t have the latest version of the huggingface_hub library on your system, please run the following cell:
!pip install huggingface_hub -U
Situating the Problem:
First, let’s imagine the scenario you are in. You’d like to find all models hosted on the Hugging Face Hub for Text Classification that were trained on the GLUE dataset and are compatible with PyTorch.
You could simply open https://huggingface.co/models and use the widgets there. But this requires leaving your IDE and scanning the results, and it takes a number of button clicks to get you the information you need.
What if there were a solution to this without having to leave your IDE? With a programmatic interface, it would also be easy to integrate this into workflows for exploring the Hub.
This is where the huggingface_hub comes in.
For those familiar with the library, you may already know that we can search for these kinds of models. However, getting the query right is a painful process of trial and error.
Could we simplify that? Let’s find out!
Finding what we need
First we’ll import the HfApi, which is a class that helps us interact with Hugging Face’s backend hosting. We can interact with the models, datasets, and more through it. Along with this, we’ll import a few helper classes: the ModelFilter and ModelSearchArguments
from huggingface_hub import HfApi, ModelFilter, ModelSearchArguments
api = HfApi()
These two classes can help us frame a solution to the problem above. The ModelSearchArguments class is a namespace-like one that contains every single valid parameter we can search for!
Let’s take a peek:
>>> model_args = ModelSearchArguments()
>>> model_args
Available Attributes or Keys:
* author
* dataset
* language
* library
* license
* model_name
* pipeline_tag
We can see a variety of attributes available to us (more on how this magic is done later). If we were to categorize what we wanted, we could likely separate them out as:
- pipeline_tag (or task): Text Classification
- dataset: GLUE
- library: PyTorch
Given this separation, it makes sense that we would find them within the model_args we declared:
>>> model_args.pipeline_tag.TextClassification
'text-classification'
>>> model_args.dataset.glue
'dataset:glue'
>>> model_args.library.PyTorch
'pytorch'
What we begin to notice, though, is some of the convenience wrapping performed here. ModelSearchArguments (and the complementary DatasetSearchArguments) have a human-readable interface with the formatted outputs the API wants, such as how the GLUE dataset should be searched with dataset:glue.
This is important because without this “cheat sheet” of knowing how certain parameters should be written, you can very easily sit in frustration as you try to search for models with the API!
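Under the hood, those values are just the raw strings the API expects, so nothing stops you from writing them by hand. Here’s a minimal sketch of the string-only equivalent (note how one mistyped value, say "glue" instead of "dataset:glue", would quietly match nothing):
>>> # the same filter spelled out as raw strings; easy to get subtly wrong
>>> api.list_models(filter = ("text-classification", "dataset:glue", "pytorch"))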
Now that we know what the right parameters are, we can search the API with ease:
>>> models = api.list_models(filter = (
>>>     model_args.pipeline_tag.TextClassification,
>>>     model_args.dataset.glue,
>>>     model_args.library.PyTorch)
>>> )
>>> print(len(models))
140
We find that there were 140 models matching our criteria (at the time of writing)! And if we take a closer look at one, we can see that it does indeed look right:
>>> models[0]
ModelInfo: {
modelId: Jiva/xlm-roberta-large-it-mnli
sha: c6e64469ec4aa17fedbd1b2522256f90a90b5b86
lastModified: 2021-12-10T14:56:38.000Z
tags: ['pytorch', 'xlm-roberta', 'text-classification', 'it', 'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116', 'transformers', 'tensorflow', 'license:mit', 'zero-shot-classification']
pipeline_tag: zero-shot-classification
siblings: [ModelFile(rfilename=".gitattributes"), ModelFile(rfilename="README.md"), ModelFile(rfilename="config.json"), ModelFile(rfilename="pytorch_model.bin"), ModelFile(rfilename="sentencepiece.bpe.model"), ModelFile(rfilename="special_tokens_map.json"), ModelFile(rfilename="tokenizer.json"), ModelFile(rfilename="tokenizer_config.json")]
config: None
private: False
downloads: 680
library_name: transformers
likes: 1
}
It’s a bit more readable, and there’s no guessing involved with “Did I get this parameter right?”
Did you know you can also get the information for this model programmatically with its model ID? Here’s how you would do it:
api.model_info('Jiva/xlm-roberta-large-it-mnli')
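And since list_models returns a list of these ModelInfo objects, we can post-process the results in plain Python. As a small sketch (relying only on the modelId and downloads attributes shown in the output above), here is how we might surface the five most-downloaded matches:
>>> # sort by download count (fall back to 0 where the field is missing)
>>> top = sorted(models, key=lambda m: getattr(m, "downloads", 0) or 0, reverse=True)[:5]
>>> for model in top:
>>>     print(model.modelId)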
Taking it up a Notch
We saw how we could use the ModelSearchArguments and DatasetSearchArguments to remove the guesswork from searching the Hub, but what if we have a very complex, messy query?
Such as:
I want to search for all models trained for both text-classification and zero-shot classification, that were trained on both the Multi NLI and GLUE datasets, and are compatible with both PyTorch and TensorFlow (a more exact query to retrieve the above model).
To set up this query, we’ll make use of the ModelFilter class. It’s designed to handle these kinds of situations, so we don’t need to scratch our heads:
>>> filt = ModelFilter(
>>>     task = ["text-classification", "zero-shot-classification"],
>>>     trained_dataset = [model_args.dataset.multi_nli, model_args.dataset.glue],
>>>     library = ['pytorch', 'tensorflow']
>>> )
>>> api.list_models(filt)
[ModelInfo: {
modelId: Jiva/xlm-roberta-large-it-mnli
sha: c6e64469ec4aa17fedbd1b2522256f90a90b5b86
lastModified: 2021-12-10T14:56:38.000Z
tags: ['pytorch', 'xlm-roberta', 'text-classification', 'it', 'dataset:multi_nli', 'dataset:glue', 'arxiv:1911.02116', 'transformers', 'tensorflow', 'license:mit', 'zero-shot-classification']
pipeline_tag: zero-shot-classification
siblings: [ModelFile(rfilename=".gitattributes"), ModelFile(rfilename="README.md"), ModelFile(rfilename="config.json"), ModelFile(rfilename="pytorch_model.bin"), ModelFile(rfilename="sentencepiece.bpe.model"), ModelFile(rfilename="special_tokens_map.json"), ModelFile(rfilename="tokenizer.json"), ModelFile(rfilename="tokenizer_config.json")]
config: None
private: False
downloads: 680
library_name: transformers
likes: 1
}]
Very quickly, we see that this is a much more coordinated approach to searching through the API, with no added headache!
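Everything above carries over to datasets as well. As a hedged sketch (the exact filter fields, such as task_categories here, depend on your version of huggingface_hub, so print dataset_args to see what is available to you), a dataset search might look like:
>>> from huggingface_hub import DatasetFilter, DatasetSearchArguments
>>> dataset_args = DatasetSearchArguments()
>>> # find datasets tagged with the text-classification task category
>>> datasets = api.list_datasets(filter = DatasetFilter(task_categories = "text-classification"))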
What’s the magic?
Very briefly, we’ll talk about the underlying magic at play that gives us this enum-dictionary-like datatype, the AttributeDictionary.
Heavily inspired by the AttrDict class from the fastcore library, the general idea is that we take a normal dictionary and supercharge it for exploratory programming by providing tab-completion for every key in the dictionary.
As we saw earlier, this gets even stronger when we have nested dictionaries we can explore through, such as model_args.dataset.glue!
For those familiar with JavaScript, we mimic how the object class works.
This simple utility class can provide a much more user-focused experience when exploring nested datatypes and trying to understand what is there, such as the return of an API request!
As mentioned before, we expand on the AttrDict in a few key ways:
- You can delete keys with del model_args[key] or with del model_args.key
- That clean __repr__ we saw earlier
One very important concept to note, though, is that if a key contains a number or special character, it must be indexed as a dictionary and not as an object.
>>> from huggingface_hub.utils.endpoint_helpers import AttributeDictionary
A very brief example of this is if we have an AttributeDictionary with a key of 3_c:
>>> d = {"a":2, "b":3, "3_c":4}
>>> ad = AttributeDictionary(d)
>>>
>>> ad.3_c
File "", line 2
ad.3_c
^
SyntaxError: invalid token
>>>
>>> ad["3_c"]
4
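To round out the list of additions from earlier, here is a quick sketch of the deletion support, reusing the same ad object from above:
>>> del ad.b          # delete a key attribute-style
>>> del ad["a"]       # or dictionary-style
>>> list(ad.keys())
['3_c']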
Concluding thoughts
Hopefully by now you have a brief understanding of how this new searching API can directly impact your workflow and exploration of the Hub! Along with this, perhaps you know of a place in your code where the AttributeDictionary might be useful for you.
From here, make sure to check out the official documentation on Searching the Hub Efficiently, and don’t forget to give us a star!
