The Xet team at Hugging Face is working on improving the efficiency of the Hub’s
storage architecture to make it easier and quicker for users to
store and update data and models. As Hugging Face hosts nearly 11PB of datasets
with Parquet files alone accounting for over 2.2PB of that storage,
optimizing Parquet storage is of pretty high priority.
Most Parquet files are bulk exports from various data analysis pipelines
or databases, often appearing as full snapshots rather than incremental
updates. Data deduplication becomes critical for efficiency when users want to
update their datasets frequently. Only by deduplicating can we store
all versions as compactly as possible, without requiring everything to be uploaded
again on every update. In an ideal case, we should be able to store every version
of a growing dataset with only a little more space than the size of its largest version.
Our default storage algorithm uses byte-level Content-Defined Chunking (CDC),
which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges.
Here we run some experiments to see how some simple modifications behave on
Parquet files, using a 2GB Parquet file with 1,092,000 rows from the
FineWeb
dataset and generating visualizations using our dedupe
estimator.
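For illustration, here is a minimal sketch of byte-level content-defined chunking using a gear-style rolling hash. The gear table, mask, and size bounds below are made up for the example and are not the parameters of our production chunker.

```python
import random

# Illustrative parameters: roughly 64 KiB average chunks, bounded to [8 KiB, 128 KiB].
MASK = 0xFFFF << 48          # boundary probability of ~2^-16 per byte
MIN_CHUNK = 8 * 1024
MAX_CHUNK = 128 * 1024

# A fixed 256-entry "gear" table of random 64-bit values (seeded for reproducibility).
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

def cdc_chunks(data: bytes):
    """Yield (start, end) chunk boundaries chosen by a rolling gear hash."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        # Cut a chunk when the masked hash bits hit zero, or when the chunk hits the cap.
        if (h & MASK) == 0 or length >= MAX_CHUNK:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)
```

Because boundaries depend only on nearby byte content, an insertion or deletion only disturbs the chunks around the edit; the rest of the file keeps producing identical chunks that dedupe against earlier versions.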
Background
Parquet tables work by splitting the table into row groups, each with a fixed
number of rows (for instance, 1000 rows). Each column within the row group is
then compressed and stored:
Intuitively, this means that operations which don’t mess with the row
grouping, like modifications or appends, should dedupe pretty well. So let’s
test this out!
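As a concrete, purely illustrative setup, the sketch below writes a small table with fixed 1,000-row row groups using pyarrow and prints the resulting layout. The file name and columns are stand-ins, not the FineWeb data used in the experiments.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy table standing in for a real dataset snapshot.
table = pa.table({
    "id": list(range(10_000)),
    "text": [f"row {i}" for i in range(10_000)],
})

# Fixed-size row groups (1,000 rows each) with the default per-column compression.
pq.write_table(table, "snapshot.parquet", row_group_size=1_000)

meta = pq.ParquetFile("snapshot.parquet").metadata
print(meta.num_row_groups)         # 10 row groups
print(meta.row_group(0).num_rows)  # 1,000 rows in each
```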
Append
Here we append 10,000 new rows to the file and compare the results with the
original version. Green represents all deduped blocks, red represents all
new blocks, and shades in between show different levels of deduplication.
We can see that indeed we are able to dedupe nearly the entire file,
but only with changes seen at the end of the file. The new file is 99.1%
deduped, requiring only 20MB of additional storage. This matches our
intuition pretty well.
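The append experiment boils down to something like the following sketch (again using the toy snapshot.parquet from above rather than the actual FineWeb file): read the current snapshot, concatenate the new rows, and write a fresh snapshot.

```python
import pyarrow as pa
import pyarrow.parquet as pq

original = pq.read_table("snapshot.parquet")

# 10,000 new rows cast to the same schema as the original table.
new_rows = pa.table({
    "id": list(range(100_000, 110_000)),
    "text": [f"appended row {i}" for i in range(10_000)],
}).cast(original.schema)

updated = pa.concat_tables([original, new_rows])
pq.write_table(updated, "snapshot_v2.parquet", row_group_size=1_000)
```

As long as the existing row groups are re-encoded identically, only the appended row groups and the footer at the end of the file produce new bytes, which is what the visualization shows.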
Modification
Given the layout, we would expect row modifications to be fairly
isolated, but this is apparently not the case. Here we make a small
modification to row 10000, and we see that while most of the file does dedupe,
there are many small, regularly spaced sections of new data!
A quick scan of the Parquet file
format suggests
that absolute file offsets are part of the Parquet column headers (see the
structures ColumnChunk and ColumnMetaData)! This means that any
modification is likely to rewrite all of the Column headers. So while the
data does dedupe well (it is mostly green), we get new bytes in every
column header.
In this case, the new file is only 89% deduped, requiring 230MB of additional
storage.
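You can see these absolute offsets directly through pyarrow's metadata API; the sketch below prints a few of them for the toy file from earlier. Any edit that changes the size of an earlier page shifts every offset after it, which is exactly the regularly spaced new data in the visualization.

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("snapshot.parquet").metadata

# Each column chunk records absolute byte positions within the file.
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)
    print(rg, col.dictionary_page_offset, col.data_page_offset,
          col.total_compressed_size)
```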
Deletion
Here we delete a row from the middle of the file (note: insertion should have
similar behavior). As this reorganizes the entire row group layout (each
row group is 1000 rows), we see that while we dedupe the first half of
the file, the remaining file has completely new blocks.
This is mostly because the Parquet format compresses each column
aggressively. If we turn off compression, we are able to dedupe more
aggressively:
However, the file sizes are nearly 2x larger if we store the data
uncompressed.
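Turning compression off is a one-argument change when writing; a sketch with the toy file is shown below. The uncompressed bytes are far more stable across versions, which is why dedupe improves, but the file itself is nearly twice as large.

```python
import pyarrow.parquet as pq

table = pq.read_table("snapshot.parquet")

# Same data and row group layout, but with per-column compression disabled.
pq.write_table(table, "snapshot_uncompressed.parquet",
               row_group_size=1_000, compression="NONE")
```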
Is it possible to have the benefit of dedupe and compression at the same
time?
Content Defined Row Groups
One potential solution is to use not only byte-level CDC, but also apply it at the row level:
we split row groups not based on absolute count (1000 rows), but on a hash of a provided
“Key” column. In other words, I split off a row group whenever the hash of
the key column % [target row count] = 0, with some allowances for a minimum
and a maximum row group size.
I hacked up a quick, inefficient experimental demonstration
here.
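A minimal sketch of the splitting rule (not the linked demonstration itself) might look like the following; the key column name, the hash function, and the size bounds are all assumptions for illustration.

```python
import zlib

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative parameters: ~1,000 rows per group on average, bounded to
# [100, 10,000] rows; "id" stands in for the provided "Key" column.
TARGET, MIN_ROWS, MAX_ROWS = 1_000, 100, 10_000

def write_content_defined(table: pa.Table, path: str, key: str = "id") -> None:
    """Cut a row group whenever hash(key) % TARGET == 0, within the bounds."""
    writer = pq.ParquetWriter(path, table.schema)
    keys = table.column(key).to_pylist()
    start = 0
    for i, value in enumerate(keys):
        length = i - start + 1
        boundary = zlib.crc32(str(value).encode()) % TARGET == 0
        if (length >= MIN_ROWS and boundary) or length >= MAX_ROWS:
            # Each slice is written as its own row group (slices stay well
            # below the writer's default maximum row group size).
            writer.write_table(table.slice(start, length))
            start = i + 1
    if start < table.num_rows:
        writer.write_table(table.slice(start))
    writer.close()
```

crc32 is just a stand-in here; any stable hash of the key column works, since all that matters is that the group boundaries are determined by the rows themselves rather than by their absolute positions.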
With this, we are able to efficiently dedupe across compressed Parquet
files even as I delete a row. Here we clearly see a big red block
representing the rewritten row group, followed by a small change for each
column header.
Based on these experiments, we could consider improving Parquet file
dedupe-ability in a couple of ways:
- Use relative offsets instead of absolute offsets for file structure
data. This would make the Parquet structures position independent and
easy to “memcpy” around, although it is an involved file format change that
may be difficult to do.
- Support content defined chunking on row groups. The format actually
supports this today as it does not require row groups to be uniformly sized,
so this can be done with minimal blast radius. Only Parquet format writers
would need to be updated.
While we will continue exploring ways to improve Parquet storage performance
(e.g. perhaps we could optionally rewrite Parquet files before uploading?
Strip absolute file offsets on upload and restore on download?), we would
love to work with the Apache Arrow project to see if there is interest in
implementing some of these ideas in the Parquet / Arrow code base.
In the meantime, we are also exploring the behavior of our data dedupe process
on other common filetypes. Please do try our dedupe
estimator and let us know about
your findings!
