Introducing the SQL Console on Datasets

By Caleb Fahlgren

Dataset use has been exploding, and Hugging Face has become the default home for many datasets. Every month, as the number of datasets uploaded to the Hub increases, so does the need to query, filter, and explore them.

Datasets created on the Hugging Face Hub every month

We are very excited to announce that you can now run SQL queries on your datasets directly in the Hugging Face Hub!



Introducing the SQL Console for Datasets

On every dataset you should see a new SQL Console badge. With just one click you can open a SQL Console to query that dataset.

Querying the Magpie-Ultra dataset for excellent, high-quality reasoning instructions.

All the work is done in the browser, and the console comes with a few neat features:

  • 100% Local: The SQL Console is powered by DuckDB WASM, so you can query your dataset without any dependencies.
  • Full DuckDB Syntax: DuckDB has full SQL syntax support, including many built-in functions for regex, lists, JSON, embeddings, and more. You will find DuckDB syntax to be very similar to PostgreSQL (see the sketch after this list).
  • Export Results: You can export the results of your query to Parquet.
  • Shareable: You can share the query results of public datasets with a link.
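To give a feel for the syntax, here is a minimal sketch using a few of those built-in functions; the body column and train view are hypothetical stand-ins for your own dataset's columns and splits.

SQL

-- A hedged sketch of a few built-in functions; 'body' is a hypothetical column
SELECT
  regexp_extract(body, '[0-9]+')  AS first_number,  -- regex: first number in the text
  string_split(body, ' ')         AS tokens,        -- produces a LIST value
  len(string_split(body, ' '))    AS token_count    -- length of that list
FROM train
LIMIT 5;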



How it works



Parquet Conversion

Most datasets on Hugging Face are stored in Parquet, a columnar data format that is optimized for performance and storage efficiency. The Dataset Viewer on Hugging Face and the SQL Console load the data directly from the dataset's Parquet files. If the dataset is in another format, the first 5GB is automatically converted to Parquet. You can find more information about the Parquet conversion process in the Dataset Viewer Parquet API documentation.

Using the Parquet files, the SQL Console creates views for you to query based on your dataset splits and configs.
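For example, a dataset with a train split can be queried through a view of the same name. This is a minimal sketch; your view names will follow your own splits and configs.

SQL

-- Each split is exposed as a view, so a 'train' split is queryable by name
SELECT COUNT(*) AS row_count FROM train;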



DuckDB WASM 🦆

DuckDB WASM is the engine that powers the SQL Console. It is an in-process database engine that runs on WebAssembly in the browser. No server or backend needed.

By running entirely in the browser, it gives users the utmost flexibility to query data as they please without any dependencies. It also makes it very easy to share reproducible results with a simple link.

You may be wondering, "Will it work for big datasets?" and the answer is, "Yes!"

Here is a query of the OpenCo7/UpVoteWeb dataset, which has 12.6M rows in the Parquet conversion.

Filtering the UpVoteWeb dataset for Reddit movie suggestions.
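The exact query is not reproduced here, but a simple filter of this shape is representative; the subreddit column name is an assumption about the dataset's schema, not its documented layout.

SQL

-- Hypothetical filter over the 12.6M-row Parquet conversion; column name assumed
SELECT *
FROM train
WHERE subreddit = 'MovieSuggestions'
LIMIT 100;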

As you can see, we received results for a simple filter query in under 3 seconds.

While queries will take longer depending on the size of the dataset and the complexity of the query, you will be surprised by how much you can do with the SQL Console.

As with any technology, there are limitations.

  • The SQL Console will work for plenty of queries. However, the memory limit is ~3GB, so it is possible to run out of memory and be unable to process the query (Tip: try to use filters to reduce the amount of data you are querying, along with LIMIT; see the sketch after this list).
  • While DuckDB WASM is very powerful, it does not have full feature parity with DuckDB. For example, DuckDB WASM does not yet support the hf:// protocol to query datasets.
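As a rough illustration of that tip (the column names here are hypothetical), projecting only the columns you need and filtering early keeps the working set well under the memory limit:

SQL

-- Project and filter early, then cap the result size to stay within ~3GB of memory
SELECT instruction, output         -- only the columns you need
FROM train
WHERE length(instruction) < 500    -- filter to reduce the data scanned
LIMIT 1000;                        -- bound the result set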



Example: Converting a dataset from Alpaca to conversations

Now that we have introduced the SQL Console, let's explore a practical example. When fine-tuning a Large Language Model (LLM), you often need to work with different data formats. One particularly popular format is the conversational format, where each row represents a multi-turn dialogue between a user and the model. The SQL Console can help us transform data into this format efficiently. Let's look at how we can convert an Alpaca dataset to a conversational format using SQL.

Typically, developers would tackle this task with a Python pre-processing step, but we can show how to use the SQL Console to achieve the same result in less than 30 seconds.

In the dataset above, click on the SQL Console badge to open the SQL Console. You should see the query below automatically populated.

When you are ready, click the Run Query button to execute the query.



SQL


WITH 
source_view AS (
  SELECT * FROM train  -- Change 'train' to the split or view you want to query
)
SELECT 
  [
    struct_pack(
      "from" := 'user',
      "value" := CASE 
                   WHEN input IS NOT NULL AND input != '' 
                   THEN instruction || '\n\n' || input  -- append the optional input to the instruction
                   ELSE instruction
                 END
    ),
    struct_pack(
      "from" := 'assistant',
      "value" := output
    )
  ] AS conversation
FROM source_view
WHERE instruction IS NOT NULL 
AND output IS NOT NULL;

In the query, we use the struct_pack function to create a new STRUCT row for each conversation.

DuckDB has great documentation on the STRUCT Data Type and Functions. You will find that many datasets contain columns with JSON data. DuckDB provides functions to easily parse and query these columns.
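As a quick, hypothetical illustration (the metadata column is an assumption, not taken from a specific dataset), you can pull fields straight out of a JSON-encoded string column:

SQL

-- Hypothetical: extract fields from a JSON-encoded 'metadata' column
SELECT
  json_extract_string(metadata, '$.model') AS model,     -- extracted as VARCHAR
  json_extract(metadata, '$.scores[0]')    AS top_score  -- extracted as JSON
FROM train
LIMIT 10;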

The Alpaca dataset converted to a conversational format.

Once we have the results, we can download them as a Parquet file. You can see what the final output looks like below.

Try it out!

As another example, you can try a SQL Console query for SkunkworksAI/reasoning-0.01 to see instructions with more than 10 reasoning steps.
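One way to express that query (the reasoning_chains column name and its list type are assumptions about that dataset's schema) uses DuckDB's list functions:

SQL

-- Hypothetical: filter rows whose list-typed reasoning column has more than 10 steps
SELECT instruction, len(reasoning_chains) AS num_steps
FROM train
WHERE len(reasoning_chains) > 10
ORDER BY num_steps DESC;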



SQL Snippets

DuckDB has a ton of use cases that we are still exploring. We created a SQL Snippets space to showcase what you can do with the SQL Console.

The space collects some of the most interesting use cases we have found.

Remember, it is one click to download your SQL results as a Parquet file and use them in your dataset!

We would love to hear what you think of the SQL Console, and if you have any feedback, please comment on this post!



