The 5 Most Under-Rated Tools on Hugging Face

By Derek Thomas

tl;dr: The Hugging Face Hub has a number of tools and integrations that are often overlooked but can make it easier to build many kinds of AI solutions.

The Hugging Face Hub boasts over 850K public models, with ~50k new ones added every month, and that number just seems to keep climbing. We also offer an Enterprise Hub subscription that unlocks compliance, security, and governance features, along with additional compute capacity on Inference Endpoints for production-level inference and more hardware options for running demos on Spaces.

The Hugging Face Hub enables broad usage since you can choose from diverse hardware, and you can run almost anything you want in Docker Spaces. I've noticed we have a number of features that go unsung (listed below). In the process of creating a semantic search application on the Hugging Face Hub, I took advantage of all of these features to implement various parts of the solution. While I think the final application (detailed in the org reddit-tools-HF) is compelling, I'd like to use this example to show how you can apply these features to your own projects.

  • ZeroGPU – How can I use a free GPU?
  • Multi-process Docker – How can I solve 2 (n) problems in 1 Space?
  • Gradio API – How can I make multiple Spaces work together?
  • Webhooks – How can I trigger events in a Space based on Hub changes?
  • Nomic Atlas – A feature-rich semantic search (visual and text based)



Use-Case

An automatically updated, visually enabled semantic search for a dynamic data source, for free

It's easy to imagine multiple scenarios where this is useful:

  • E-commerce platforms that are looking to manage their many products based on descriptions or reported issues
  • Law firms and compliance departments that need to comb through legal documents or regulations
  • Researchers who want to keep up with new advances and find relevant papers or articles for their needs

I will be demonstrating this by using a subreddit as my data source and using the Hub to facilitate the rest. There are a
number of ways to implement this. I could put everything in 1 Space, but that would be quite messy. Alternatively, having too
many components in a solution has its own challenges. Ultimately, I chose a design that allows me to highlight some of
the unsung heroes on the Hub and demonstrate how you can use them effectively. The architecture is shown in Figure 1
and is fully hosted on Hugging Face in the form of Spaces, datasets, and webhooks. Every feature I'm
using is free, for maximum accessibility. If you need to scale your service, you might consider upgrading to the Enterprise Hub.

You can see that I'm using r/bestofredditorupdates as my data source; it gets 10-15 new posts a day. I pull from it daily using Reddit's API via a Reddit application with PRAW, and store the results in the Raw Dataset (reddit-tools-HF/dataset-creator-reddit-bestofredditorupdates). Storing new data triggers a webhook, which in turn triggers the Data Processing Space to take action. The Data Processing Space takes the Raw Dataset and adds columns to it, namely feature embeddings generated by the Embedding Model Space and retrieved with a Gradio client. The Data Processing Space then stores the processed data in the Processed Dataset. It also builds the Data Explorer tool. Do note that the data is considered not-for-all-audiences due to the data source; more on this in Ethical Considerations.
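To make the ingestion step concrete, here is a minimal sketch of what the daily pull could look like with PRAW and the datasets library. The secret names and the exact fields are illustrative assumptions; the real implementation lives in the Dataset Creator Space.

import os

import praw
from datasets import Dataset

# Hypothetical credentials for a Reddit application, stored as Space secrets
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="dataset-creator-reddit by u/your-bot-account",
)

# Pull the newest posts from the subreddit
rows = [
    {
        "id": post.id,
        "title": post.title,
        "content": post.selftext,
        "date_utc": post.created_utc,
    }
    for post in reddit.subreddit("bestofredditorupdates").new(limit=25)
]

# Store the results in the Raw Dataset on the Hub (deduplication omitted here)
Dataset.from_list(rows).push_to_hub("reddit-tools-HF/dataset-creator-reddit-bestofredditorupdates")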

I used nomic-ai/nomic-embed-text-v1.5 to generate the embeddings for a few reasons (see the small sketch after this list):

  • Handles long contexts well (8192 tokens)
  • Efficient at 137M parameters
  • High on the MTEB leaderboard
  • Works with Nomic Atlas for semantic search
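One practical detail worth knowing: this model is trained with task prefixes, which is why you will see 'search_document: ' prepended to content later in this post. A minimal sketch of the convention (the example strings are invented):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Documents and queries get different task prefixes
doc_embedding = model.encode("search_document: OP finally posted the long-awaited update...")
query_embedding = model.encode("search_query: update to an old story")

print(util.cos_sim(query_embedding, doc_embedding))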



ZeroGPU

One of the challenges with modern models is that they typically require GPUs or other heavy hardware to run. This hardware can be
bulky, come with year-long commitments, and be very expensive. Spaces makes it easy to use the hardware you want at a low cost, but
it's not automatically spun up and down (though you could do that programmatically!).
ZeroGPU is a new kind of hardware for Spaces. There is a quota for
free users and a bigger one for PRO users.

It has two goals:

  • Provide free GPU access for Spaces
  • Allow Spaces to run on multiple GPUs
Figure 2: ZeroGPU behind the scenes

This is achieved by making Spaces efficiently hold and release GPUs as needed (as opposed to a classical GPU Space, which has a GPU attached at all times). ZeroGPU uses Nvidia A100 GPUs under the hood (40GB of vRAM is available for each workload).



Application

I used ZeroGPU to host the amazing Nomic embedding model in my Embedding Model Space. It's super convenient because
I don't really need a dedicated GPU, as I only need to run inference occasionally and incrementally.

It's extremely easy to use. The only change is that you need to put all of your GPU code inside a function, and
decorate that function with @spaces.GPU.


import spaces
from sentence_transformers import SentenceTransformer

# Load the model once at startup; ZeroGPU attaches a GPU only while needed
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, device='cuda')

@spaces.GPU
def embed(document: str):
    # All GPU work happens inside this decorated function
    return model.encode(document)
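For context, here is a minimal, self-contained sketch of how such a function can be wired into a Gradio app; exposing it through an Interface is also what creates the /embed API endpoint used later via the Gradio client. The exact layout is an assumption, not the actual Space code.

import gradio as gr
import spaces
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, device='cuda')

@spaces.GPU
def embed(document: str):
    return model.encode(document).tolist()

# The Interface serves a simple UI and exposes the function at api_name "/embed"
demo = gr.Interface(fn=embed, inputs="text", outputs="json", api_name="embed")
demo.launch()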



Multi-process Docker

Figure 3: Data Processing Space

One of the most common requests we see from enterprises is "I want feature X" or "I want tool Y integrated". One of the best
parts of the Hugging Face Hub is that we have
an unreasonably robust API that can integrate with
basically anything. The second way of solving these requests is usually Spaces. Here I'll use a
blank Docker Space that can run an arbitrary Docker container with
the hardware of your choice (a free CPU in my case).

My main pain point is that I want to be able to run 2 very different things in a single Space. Most Spaces have a single
identity, like showing off a diffusers model,
or generating music. Consider the Dataset Creator Space; I want
to:

  • Run some code to pull data from Reddit and store it in the Raw Dataset
    • This is a mostly invisible process
    • This is run by main.py
  • Visualize the logs from the above code so I have a good understanding of what's happening (shown in Figure 3)

Note that both of these need to run in separate processes. I've come across
many use-cases where visualizing the logs
is actually really useful and necessary. It's a great debugging tool, and it's also much more aesthetically pleasing in
scenarios where there is no natural UI.



Application

I built a multi-process Docker solution by leveraging the supervisord library, which is
touted as a process control system. It's a clean way of controlling multiple separate processes, and it lets me do
multiple things in a single container, which is useful in a Docker Space. Note that Spaces only lets you expose a
single port, so that may influence which solutions you consider.

Installing Supervisor is quite easy since it's a Python package:

pip install supervisor

You need to write a supervisord.conf file to specify your configuration. You can see my full example
here: supervisord.conf.
It's pretty self-explanatory. Note that I don't want the logs from program:app, because app.py is just there to visualize
logs, not create them, so I route its stdout to /dev/null.

[supervisord]
nodaemon=true

[program:main]
command=python main.py
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
autostart=true

[program:app]
command=python app.py
stdout_logfile=/dev/null
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
autostart=true
autorestart=true

Lastly, we need to launch supervisord with our supervisord.conf to actually run our 2 processes. In my Dockerfile I simply run:

CMD ["supervisord", "-c", "supervisord.conf"]
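For reference, here is a minimal sketch of what the surrounding Dockerfile could look like; the base image, paths, and requirements file are illustrative assumptions rather than the exact contents of my Space:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies, including supervisor
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Launch both main.py and app.py via supervisord
CMD ["supervisord", "-c", "supervisord.conf"]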



Gradio API

In the Data Processing Space I need embeddings for the posts. This presents a challenge if I abstract the embedding
model into another Space: how do I call it?

When you build a Gradio app, by default you can treat any interaction as an API call. This
means all those cool Spaces on the Hub have an API associated with them (Spaces also lets
you make API calls to Streamlit or Docker Spaces if the creator enables it)! Even cooler, we have
an easy-to-use client for this API.



Application

I used the client in my Data Processing Space to get embeddings from the nomic model deployed in the Embedding
Model Space. It was used in
this utilities.py
file; I've extracted the relevant parts below:

import numpy as np
from gradio_client import Client

# Connect to the Embedding Model Space
client = Client("reddit-tools-HF/nomic-embeddings")


def update_embeddings(content, client):
    # nomic-embed-text-v1.5 expects the 'search_document: ' task prefix for documents
    embedding = client.predict('search_document: ' + content, api_name="/embed")
    return np.array(embedding)


final_embedding = update_embeddings(content=row['content'], client=client)
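In practice you can map this over a whole dataframe; a one-line sketch, assuming a pandas dataframe with a content column:

# Embed every post in the dataframe (sketch; no batching or retries)
df['embedding'] = df['content'].apply(lambda content: update_embeddings(content, client=client))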

There is even a really cool API recorder now that lets you use the GUI while recording each step as an API interaction.



Webhooks

Figure 4: Project Webhooks

Webhooks are a foundation for MLOps-related features. They allow you to
listen for new changes on specific repos or to all repos belonging to a particular set of users/organizations (not just
your repos, but any repo).

You can use them to auto-convert models, build community bots, build CI/CD for your models, datasets, and Spaces, and
much more!



Application

In my use-case I wanted to rebuild the Processed Dataset whenever the Raw Dataset is updated. You can see
the full code here.
To do this I need to add a webhook that triggers on Raw Dataset updates and sends its payload to the Data
Processing Space. There are multiple kinds of updates that can occur; some may be on other branches or in the discussions tab. My criterion is to trigger when both the README.md file and another file are updated
on the main branch of the repo, because that is what changes when a new commit is pushed to the dataset (here's an example):

# Commit cleaned up for readability
T 1807	M	README.md
T 52836295	M	data/train-00000-of-00001.parquet

You will want to carefully determine what your criteria are as you adapt this to your use-case.

First you'll want to create your webhook in your settings. You should
follow this guide on how to create a webhook; make
sure to use consistent endpoint names (/dataset_repo in my case).

Also note that the webhook URL is the Direct URL with /webhooks appended. The Direct URL can be found by clicking the three dots above the Space and selecting Embed this Space. I also set a webhook secret in the Data Processing Space to keep it secure.

Here’s what my webhook creation input looks like. Just don’t tell anyone my secret 😉.

Target Repositories: datasets/reddit-tools-HF/dataset-creator-reddit-bestofredditorupdates

Webhook URL: https://reddit-tools-hf-processing-bestofredditorupdates.hf.space/webhooks/dataset_repo

Secret (optional): Float-like-a-butterfly

Next you'll need to consume your webhook in your Space. To do this I'll cover:

  1. How to set up the webhook server
  2. How to selectively trigger only the updates we care about
    1. It must be a repo change
    2. It must be on the main branch: refs/heads/main
    3. It must be an update where more than just README.md changed



How to set up the webhook server

First we need to consume the payload. There is a convenient way
to consume a webhook payload
built into the huggingface_hub library. You can see that I
use @app.add_webhook to define an endpoint that matches what I specified during webhook creation. Then I define my function.

Note that you need to respond to the payload request within 30s or you will get a 500 error. This is why I have an async
function that responds immediately and then kicks off the actual process, instead of doing the processing in
the handle_repository_changes function itself. You can check
out the background task documentation for more information.

from fastapi import BackgroundTasks, Response, status
from huggingface_hub import WebhookPayload, WebhooksServer

# ui, WEBHOOK_SECRET, and logger are defined elsewhere in the Space
app = WebhooksServer(ui=ui.queue(), webhook_secret=WEBHOOK_SECRET)


@app.add_webhook("/dataset_repo")
async def handle_repository_changes(payload: WebhookPayload, task_queue: BackgroundTasks):
    # Respond quickly and do the real work in a background task
    logger.info(f"Webhook received from {payload.repo.name} indicating a repo {payload.event.action}")
    task_queue.add_task(_process_webhook, payload=payload)
    return Response("Task scheduled.", status_code=status.HTTP_202_ACCEPTED)


def _process_webhook(payload: WebhookPayload):
    # The filtering logic and dataset processing go here (see below)
    pass
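To actually start serving, huggingface_hub's WebhooksServer is launched with a single call (this goes at the end of the file):

# Start the Gradio UI and the webhook endpoint on one server
app.launch()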



Selectively Trigger

Since I'm interested in any change at the repo level, I can use payload.event.scope.startswith("repo") to determine
whether I care about an incoming payload.


if not payload.event.scope.startswith("repo"):
    return Response("No task scheduled", status_code=status.HTTP_200_OK)

I can access the branch information via payload.updatedRefs[0]:


try:
    if payload.updatedRefs[0].ref != 'refs/heads/main':
        response_content = "No task scheduled: Change not on main branch"
        logger.info(response_content)
        return Response(response_content, status_code=status.HTTP_200_OK)
except Exception:
    response_content = "No task scheduled"
    logger.info(response_content)
    return Response(response_content, status_code=status.HTTP_200_OK)

Checking which files were modified is a bit more complicated. We can see some git information at commit_files_url, but
then we need to parse it; it's kind of like a .tsv.
The steps are:

  • Get the commit information
  • Parse it into changed_files
  • Take action based on my conditions

from huggingface_hub.utils import build_hf_headers, get_session


try:
    # Compare the old and new commits to see which files changed
    commit_files_url = f"""{payload.repo.url.api}/compare/{payload.updatedRefs[0].oldSha}..{payload.updatedRefs[0].newSha}?raw=true"""
    response_text = get_session().get(commit_files_url, headers=build_hf_headers()).text
    logger.info(f"Git Compare URL: {commit_files_url}")

    # Split the response into lines
    file_lines = response_text.split('\n')

    # The changed file name is the last tab-separated field on each line
    changed_files = [line.split('\t')[-1] for line in file_lines if line.strip()]
    logger.info(f"Changed files: {changed_files}")

    # Skip processing if only README.md changed
    if all('README.md' in file for file in changed_files):
        response_content = "No task scheduled: it's a README-only update."
        logger.info(response_content)
        return Response(response_content, status_code=status.HTTP_200_OK)
except Exception as e:
    logger.error(f"{str(e)}")
    response_content = "Unexpected issue :'("
    logger.info(response_content)
    return Response(response_content, status_code=status.HTTP_501_NOT_IMPLEMENTED)



Nomic Atlas

One of the common pain points we see with customers and partners is that data understanding and collaboration are
difficult. Data understanding is often the first step to solving any AI use-case. My favorite way to do that is
through visualization, and I often don't feel I have great tools for that when it comes to semantic data. I was
absolutely delighted to discover Nomic Atlas, which gives me a number of key
features for data exploration, covered in the Features section below.



Application

I build the Nomic Atlas in the Data Processing Space. At this point in the flow I have already built the Processed Dataset, and
the only thing left is to visualize it. You can see how I build with Nomic
in build_nomic.py.
As before, I'll extract the relevant parts for this blog:


import os

import nomic
import numpy as np
from nomic import atlas
from nomic.dataset import AtlasClass
from nomic.data_inference import NomicTopicOptions

# Authenticate with Nomic using a key stored as a Space secret
NOMIC_KEY = os.getenv('NOMIC_KEY')
nomic.login(NOMIC_KEY)

# Build a topic model and describe communities using the subreddit field
topic_options = NomicTopicOptions(build_topic_model=True, community_description_target_field='subreddit')

identifier = 'BORU Subreddit Neural Search'
project = atlas.map_data(embeddings=np.stack(df['embedding'].values),
                         data=df,
                         id_field='id',
                         identifier=identifier,
                         topic_model=topic_options)
print(f"Succeeded in creating new version of Nomic Atlas: {project.slug}")

Given how Nomic works, it will create a new Atlas Dataset under your account each time you run atlas.map_data. I want
to keep the same dataset updated, and currently the best way to do that is to delete the old dataset first:

# Look up and delete the previous version of the Atlas Dataset
ac = AtlasClass()
atlas_id = ac._get_dataset_by_slug_identifier("derek2/boru-subreddit-neural-search")['id']
ac._delete_project_by_id(atlas_id)
logger.info(f"Succeeded in deleting old version of Nomic Atlas.")

# Wait for the deletion to complete on the server side
sleep_time = 300
logger.info(f"Sleeping for {sleep_time}s to wait for old version deletion on the server-side")
time.sleep(sleep_time)



Features

Figure 5: Nomic Screenshot

Using Nomic Atlas should be pretty self-explanatory, and you can find
some further documentation here. But I'll give a quick intro so I can highlight some of
the lesser-known features.

The main area with the dots shows each embedded document. The closer two documents are, the more related they are. This can
vary based on a few things (how well the embedder works on your data, compression from high dimensionality to a 2D
representation, etc.), so take it with a grain of salt. We have the ability to search and examine documents on the left.

In the red box in Figure 5 we can see 5 boxes that let us search in different ways. Each one is applied
iteratively, which makes this a great way
to "chip away at the elephant". We could search by date
or another field, and then use a text search, for example. The coolest feature is the one on the far left: a neural
search that you can use in 3 ways:

  1. Query Search – You give a short description that should match an embedded (long) document
  2. Document Search – You give a longer document that should match an embedded document
  3. Embedding Search – You use an embedding vector directly to search

I typically use Query Search when I'm exploring my uploaded documents.

In the blue box in Figure 5 we can see each row of the dataset I uploaded visualized nicely. One feature I really
liked is that it renders HTML, so you have control over how each row looks. Since Reddit posts are written in markdown, it's easy to
convert them to HTML for visualization.
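The conversion is a one-liner with, for example, the markdown package (a sketch; the new column name here is a hypothetical choice):

import markdown

# Render each Reddit post's markdown body as HTML for display in Atlas
df['content_html'] = df['content'].apply(markdown.markdown)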



Ethical Considerations

The data source for all of this contains content labeled Not Safe For Work (NSFW), which is similar to our
label of Not For All Audiences (NFAA). We don't prohibit this content on the Hub, but we do want to handle it
accordingly. Additionally, recent work has shown that content obtained indiscriminately from the web has a
risk of containing Child Sexual Abuse Material (CSAM), especially content with a high prevalence of uncurated sexual
material.

To assess those risks in the context of this dataset curation effort, we can start by examining the process through
which the source data is collated. The original stories (before being aggregated) go through a moderator, then the
update is usually posted in a subreddit where there is a moderator. Occasionally the update gets uploaded to the original poster's
profile. The final version gets posted to r/bestofredditorupdates,
which has strict moderation; they are strict since they face a higher risk of brigading. All that to say, there are at least
2 moderation steps, often 3, with one being well known as strict.

At the time of writing there were 69 stories labeled NSFW. I manually checked them, and none contain CSAM. I have also gated the datasets containing NFAA material. To make the Nomic visualization more accessible,
I create a filtered dataset upon Atlas creation by removing posts that contain "NSFW" in the dataframe.
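As a sketch, that filtering step can be as simple as the following (assuming the dataframe layout from the Processed Dataset):

# Drop rows whose content mentions NSFW before building the public Atlas
df_filtered = df[~df['content'].str.contains('NSFW', case=False, na=False)]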



Conclusion

By shining a light-weight on these lesser-known tools and features throughout the Hugging Face Hub, I hope to encourage you to think
outside the box when constructing your AI solutions. Whether you replicate the use-case I’ve outlined or give you
something entirely your individual, these tools can aid you construct more efficient, powerful, and modern applications. Get
began today and unlock the complete potential of the Hugging Face Hub!


