Interactively explore your Huggingface dataset with one line of code

The Hugging Face datasets library not only provides access to more than 70,000 publicly available datasets, but also offers very convenient data preparation pipelines for custom datasets.

Renumics Spotlight allows you to create interactive visualizations to identify critical clusters in your data. Because Spotlight understands the data semantics within Hugging Face datasets, you can get started with just one line of code:

import datasets
from renumics import spotlight

ds = datasets.load_dataset('speech_commands', 'v0.01', split='validation')

spotlight.show(ds)

Spotlight allows you to leverage model results such as predictions and embeddings to gain a deeper understanding of data segments and model failure modes:

ds_results = datasets.load_dataset('renumics/speech_commands-ast-finetuned-results', 'v0.01', split='validation')
ds = datasets.concatenate_datasets([ds, ds_results], axis=1)

layout = spotlight.layouts.debug_classification(embedding='embedding', inspect={'audio': spotlight.dtypes.audio_dtype})
spotlight.show(ds, dtype={'embedding': spotlight.Embedding}, layout=layout)

Data inspection is an important task in almost all ML development stages, but it can also be very time-consuming.

“Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning.” — Greg Brockman

Spotlight helps you make data inspection more scalable along two dimensions: setting up and maintaining custom data inspection workflows, and finding relevant data samples and clusters to inspect. In the following sections, we show some examples based on Hugging Face datasets.



Spotlight 🤝 Hugging Face datasets

The datasets library has several features that make it an ideal tool for working with ML datasets: it stores tabular data (e.g. metadata, labels) together with unstructured data (e.g. images, audio) in a common Arrow table. Datasets also describes important data semantics through features (e.g. images, audio) and additional task-specific metadata.
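You can check which semantics datasets has inferred via the features attribute; a quick sketch (the printed output below is abridged and illustrative):

print(datasets.load_dataset('cifar100', split='test').features)
# Abridged, illustrative output:
# {'img': Image(...), 'fine_label': ClassLabel(names=['apple', ...]), 'coarse_label': ClassLabel(...)}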

Spotlight works directly on top of the datasets library. This means that there is no need to copy or pre-process the dataset for data visualization and inspection. Spotlight loads the tabular data into memory to allow for efficient, client-side data analytics. Memory-intensive unstructured data samples (e.g. audio, images, video) are loaded lazily on demand. In most cases, data types and label mappings are inferred directly from the dataset. Here, we visualize the CIFAR-100 dataset with one line of code:

ds = datasets.load_dataset('cifar100', split='test')
spotlight.show(ds)

In cases where the data types are ambiguous or not specified, the Spotlight API allows you to assign them manually:

label_names = ds.features['fine_label'].names
label_mapping = dict(zip(label_names, range(len(label_names))))
spotlight.show(ds, dtype={'img': spotlight.Image, 'fine_label': spotlight.dtypes.CategoryDType(categories=label_mapping)})



Leveraging model results for data inspection

Exploring raw unstructured datasets often yields little insight. Leveraging model results such as predictions or embeddings can help to uncover critical data samples and clusters. Spotlight has several visualization options (e.g. similarity map, confusion matrix) that specifically make use of model results.

We recommend storing your prediction results directly in a Hugging Face dataset. This not only allows you to take advantage of the batch processing capabilities of the datasets library, but also keeps the label mappings.

We can use the transformers library to compute embeddings and predictions for the CIFAR-100 image classification problem. We install the libraries via pip:

pip install renumics-spotlight datasets transformers[torch]

Now we can compute the enrichment:

import torch
import transformers

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_name = "Ahmed9275/Vit-Cifar100"
processor = transformers.ViTImageProcessor.from_pretrained(model_name)
# Classification head for predictions and the bare ViT model for embeddings.
cls_model = transformers.ViTForImageClassification.from_pretrained(model_name).to(device)
fe_model = transformers.ViTModel.from_pretrained(model_name).to(device)

def infer(batch):
    images = [image.convert("RGB") for image in batch]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = cls_model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu().numpy()
        # Use the [CLS] token of the last hidden state as the image embedding.
        embeddings = fe_model(**inputs).last_hidden_state[:, 0].cpu().numpy()
    preds = probs.argmax(axis=-1)
    return {"prediction": preds, "embedding": embeddings}

# Reuse the label mapping of 'fine_label' for the predictions and declare
# the 768-dimensional ViT embedding column.
features = datasets.Features(
    {
        **ds.features,
        "prediction": ds.features["fine_label"],
        "embedding": datasets.Sequence(feature=datasets.Value("float32"), length=768),
    }
)
ds_enriched = ds.map(infer, input_columns="img", batched=True, batch_size=2, features=features)
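The enrichment only needs to be computed once; if you want to reuse it across sessions, you can persist the enriched dataset with the standard datasets API (the path and repo name below are illustrative):

# Save a local Arrow copy of the enriched dataset (path is illustrative).
ds_enriched.save_to_disk('./cifar100-enriched')
# Later: reload it without re-running inference.
ds_enriched = datasets.load_from_disk('./cifar100-enriched')
# Or share it on the Hugging Face Hub (repo name is illustrative):
# ds_enriched.push_to_hub('your-user/cifar100-enrichment')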

If you don't want to perform the full inference run, you can alternatively download pre-computed model results for CIFAR-100 to follow this tutorial:

ds_results = datasets.load_dataset('renumics/spotlight-cifar100-enrichment', split='test')
ds_enriched = datasets.concatenate_datasets([ds, ds_results], axis=1)

We can now use the results to interactively explore relevant data samples and clusters in Spotlight:

layout = spotlight.layouts.debug_classification(label='fine_label', embedding='embedding', inspect={'img': spotlight.dtypes.image_dtype})
spotlight.show(ds_enriched, dtype={'embedding': spotlight.Embedding}, layout=layout)
CIFAR-100 model debugging layout example.



Customizing data inspection workflows

Visualization layouts can be interactively modified, saved and loaded in the GUI: you can select different widget types and configurations. The Inspector widget allows you to represent multimodal data samples including text, image, audio, video and time series data.

You can also define layouts through the Python API. This is particularly useful for building custom data inspection and curation workflows including EDA, model debugging and model monitoring tasks.
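As a minimal sketch, assuming the layout building blocks exposed by renumics.spotlight.layout (the exact helper names may differ between Spotlight versions), a custom layout with a table and a similarity map next to an inspector could look like this:

from renumics.spotlight import layout

# Left split: a table tab and a similarity map tab; right split: the inspector.
my_layout = layout.layout(
    [[layout.table()], [layout.similaritymap()]],
    [[layout.inspector()]],
)
spotlight.show(ds_enriched, layout=my_layout)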

In combination with the data issues widget, the Python API offers a great way to integrate the results of existing scripts (e.g. data quality checks or model monitoring) into a scalable data inspection workflow.
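As an illustration, here is a hypothetical sketch that flags samples where the prediction disagrees with the label, assuming Spotlight's DataIssue class (check the Spotlight documentation for the exact fields and severity values):

from renumics.spotlight import DataIssue  # assumption: class name and fields as in recent Spotlight versions

# Hypothetical check: collect row indices where prediction and label disagree.
mismatches = [i for i, (pred, label) in enumerate(zip(ds_enriched['prediction'], ds_enriched['fine_label'])) if pred != label]
issue = DataIssue(severity='medium', title='Prediction/label mismatch', rows=mismatches)
spotlight.show(ds_enriched, dtype={'embedding': spotlight.Embedding}, issues=[issue])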



Using Spotlight on the Hugging Face Hub

You can use Spotlight directly on your local NLP, audio, CV or multimodal dataset. If you would like to showcase your dataset or model results on the Hugging Face Hub, you can use Hugging Face Spaces to launch a Spotlight visualization for it.

We have already prepared example Spaces for many popular NLP, audio and CV datasets on the Hub. You can simply duplicate one of these Spaces and specify your dataset in the HF_DATASET variable.
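For example, to point a duplicated Space at the enriched CIFAR-100 dataset from above, you would set the variable in the Space settings (the value is just an example):

HF_DATASET=renumics/spotlight-cifar100-enrichment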

You can optionally choose a dataset that contains model results, and set other configuration options such as splits, subsets or dataset revisions.

Creating a new dataset visualization with Spotlight by duplicating a Hugging Face space.



What’s next?

With Spotlight you can create interactive visualizations and leverage data enrichments to identify critical clusters in your Hugging Face datasets. In this blog, we have seen both an audio ML and a computer vision example.

You can use Spotlight directly to explore and curate your NLP, audio, CV or multimodal dataset.


