Up to 40% faster training and tuning of Large Language Models

By Ali Ghodsi and Maddie Dawson


Generative AI has been taking the world by storm. As the data and AI company, we have been on this journey with the release of the open source large language model Dolly, as well as the internally crowdsourced dataset, licensed for research and commercial use, that we used to fine-tune it: databricks-dolly-15k. Both the model and the dataset are available on Hugging Face. We’ve learned a lot throughout this process, and today we’re excited to announce our first of many official commits to the Hugging Face codebase that allows users to easily create a Hugging Face Dataset from an Apache Spark™ dataframe.



“It has been great to see Databricks release models and datasets to the community, and now we see them extending that work with direct open source commitment to Hugging Face. Spark is one of the most efficient engines for working with data at scale, and it’s great to see that users can now benefit from that technology to more effectively fine-tune models from Hugging Face.”

— Clem Delangue, Hugging Face CEO



Hugging Face gets first-class Spark support

Over the past few weeks, we’ve gotten many requests from users asking for an easier way to load their Spark dataframe into a Hugging Face dataset that can be used for model training or tuning. Prior to today’s release, to get data from a Spark dataframe into a Hugging Face dataset, users had to write the data out to Parquet files and then point the Hugging Face dataset at those files to reload them. For example:

from datasets import load_dataset

# Write the Spark dataframe out as Parquet files on DBFS (train shown; test is written the same way)
train.write.parquet(train_dbfs_path, mode="overwrite")

# Point a Hugging Face dataset at the Parquet files to reload the data
train_test = load_dataset("parquet", data_files={"train": f"/dbfs{train_dbfs_path}/*.parquet", "test": f"/dbfs{test_dbfs_path}/*.parquet"})

# 16GB == 22min

Not only was this cumbersome, but it also meant that data had to be written to disk and then read back in. On top of that, the data would get rematerialized once loaded back into the dataset, which eats up more resources and, therefore, more time and cost. Using this method, we saw that a relatively small (16GB) dataset took about 22 minutes to go from Spark dataframe to Parquet, and then back into a Hugging Face dataset.

With the latest Hugging Face release, we make it much simpler for users to accomplish the same task by simply calling the new “from_spark” function in Datasets:

from datasets import Dataset

# df is a Spark dataframe (or a Delta table loaded into a dataframe)
dataset = Dataset.from_spark(df)

# 16GB == 12min

This allows users to use Spark to efficiently load and transform data for training or fine-tuning a model, then easily map their Spark dataframe into a Hugging Face dataset for seamless integration into their training pipelines. This combines cost savings and speed from Spark with optimizations like memory-mapping and smart caching from Hugging Face datasets. These improvements cut the processing time for our example 16GB dataset by more than 40%, going from 22 minutes down to just 12 minutes.
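To make that flow concrete, here is a minimal end-to-end sketch. The table name, column names, and tokenizer are purely illustrative stand-ins, not part of the release itself, and it assumes a Databricks-style environment where a spark session is already available.

from datasets import Dataset
from transformers import AutoTokenizer

# Load and transform the data with Spark (table and filter are hypothetical examples)
df = spark.read.table("main.default.reviews").filter("length(text) > 0")

# Map the Spark dataframe directly into a Hugging Face dataset
dataset = Dataset.from_spark(df)

# From here the dataset plugs into a standard Hugging Face pipeline,
# e.g. tokenizing the "text" column before training or fine-tuning
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)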



Why does this matter?

As we transition to this new AI paradigm, organizations will need to use their extremely valuable data to augment their AI models if they want to get the best performance within their specific domain. This will almost certainly require work in the form of data transformations, and doing this efficiently over large datasets is something Spark was designed to do. Integrating Spark with Hugging Face gives you the cost-effectiveness and performance of Spark while retaining the pipeline integration that Hugging Face provides; a sketch of such a preparation step follows below.
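As an illustration of the kind of data preparation Spark handles well, this sketch cleans and deduplicates a hypothetical instruction-tuning table before mapping it into a Hugging Face dataset; the table and column names are placeholders for your own domain data.

from datasets import Dataset
from pyspark.sql import functions as F

# Hypothetical raw table of instruction/response pairs
raw = spark.read.table("main.default.support_tickets")

clean = (
    raw
    .filter(F.col("response").isNotNull())             # drop incomplete rows
    .withColumn("instruction", F.trim("instruction"))  # normalize whitespace
    .dropDuplicates(["instruction"])                    # remove duplicate prompts
    .select("instruction", "response")
)

# Hand the transformed dataframe straight to Hugging Face
dataset = Dataset.from_spark(clean)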



Continued Open-Source Support

We see this release as a new avenue to further contribute to the open source community, something that we believe Hugging Face does extremely well, as it has become the de facto repository for open source models and datasets. This is just the first of many contributions. We already have plans to add streaming support through Spark to make dataset loading even faster.

In order to become the best platform for users to jump into the world of AI, we’re working hard to provide the best tools to successfully train, tune, and deploy models. Not only will we continue contributing to Hugging Face, but we’ve also started releasing improvements to our other open source projects. A recent MLflow release added support for the transformers library, OpenAI integration, and Langchain support. We also announced AI Functions within Databricks SQL that let users easily integrate OpenAI (or their own deployed models in the future) into their queries. To top it all off, we also released a PyTorch distributor for Spark to simplify distributed PyTorch training on Databricks.
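For reference, here is a rough sketch of how that Spark PyTorch distributor (TorchDistributor, available in Spark 3.4+ and recent Databricks ML runtimes) is typically invoked; the training function and its arguments are placeholders rather than a prescribed recipe.

from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # A normal single-node PyTorch (or Hugging Face Trainer) training loop goes here;
    # the distributor launches it across the cluster's processes/GPUs.
    ...

# Run the training function on 2 GPU-backed processes across the cluster
distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=True)
distributor.run(train_fn, 1e-4)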

This article was originally published on April 26, 2023 on the Databricks blog.


