In this short blog post we'll walk you through some simple benchmarks to show off the random access performance of the Lance format.
Lance delivers comparable scan performance to parquet but supports fast random access, making it perfect for:
- search engines,
- real-time feature retrieval, and
- speeding up shuffling performance for deep learning training
What makes Lance interesting is that with the existing tooling ecosystem you either have to deal with the complexity of stitching together multiple systems OR the expense of all-in-memory stores. Moreover, Lance doesn't require extra servers or complicated setup: pip install pylance is all you need.
Here we're going to compare the random access performance of Lance vs parquet. We'll create 100 million records, where each value is a 1000-character randomly generated string. We then run a benchmark of 1000 queries, each fetching a random set of 20–50 rows across the dataset. All tests are run on the same Ubuntu 22.04 system:
sudo lshw -short
Class Description
=============================================================
system 20M9CTO1WW (LENOVO_MT_20M9_BU_Think_FM_ThinkPad P52)
memory 128GiB System Memory
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 384KiB L1 cache
memory 1536KiB L2 cache
memory 12MiB L3 cache
processor Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz
storage Samsung SSD 980 PRO 2TB
To run this benchmark, we first generate 100 million entries, each of which is a 1000-character-long string.
Converting from Lance to parquet is just one line:
pa.dataset.write_dataset(lance_dataset.scanner().to_reader(),
                         "take.parquet",
                         format="parquet")
For each dataset we run 1000 queries. For each query, we generate 20–50 row ids at random, then retrieve those rows and record the run time. We then compute the average time per key.
The API we use is Dataset.take:
The parquet snippet is quite similar, so it's omitted.
Here’s the output (in seconds):
Lance: mean time per key is 0.0006225714343229975
Parquet: mean time per key is 1.246656603929473
I also benchmarked a similar setup using LMDB and plotted everything on the same chart for comparison:
You may have noticed that we've only benchmarked Dataset::Take on row ids. On the roadmap is making this more generic so you can look up arbitrary keys in any column.
Part of the limitation is in Lance itself. Currently, looking up a particular key is done using a pyarrow compute expression, like dataset.to_table(columns=["value"], filter=pa.Field("key") == …. This requires scanning through the key column to find the right row ids, which adds more than 10ms to the query time. To solve this problem, we plan to 1) compute batch statistics so we can 2) implement batch pruning. And for heavily queried key columns, 3) adding a secondary index would make arbitrary key lookups much faster.
In Python, Lance is already queryable by DuckDB via the Arrow integration. However, one major shortcoming of DuckDB's Arrow integration is its extremely limited filter pushdown. For example, a single pa.Field("key") == … predicate is pushed down across the pyarrow interface, but multiple key lookups are not. This can be the difference between <10ms and >500ms response time. In Lance OSS, we're working on a native DuckDB extension so that we don't have to be subject to those limitations.
We've been claiming 100x faster random access performance than parquet, but as this benchmark shows, it's really more like 2000x. Lance brings the fast random access performance needed by important ML workflows to the OSS data ecosystem. This is critical for search, feature hydration, and shuffling for training deep learning models. While Lance's performance is already very useful for these use cases, we'll be working to implement generalized key lookups, better DuckDB integration, and hooks to distribute large Lance datasets across Spark/Ray nodes.
If any of these use cases apply to you, please give Lance a shot. We'd love to hear your feedback. If you like us, please give us a ⭐ on GitHub!