In this short blog post we'll walk you through some simple benchmarks to show off the random access performance of the Lance format.
Lance delivers comparable scan performance to parquet but supports fast random access, making it perfect for:
- search engines,
- real-time feature retrieval, and
- speeding up shuffling performance for deep learning training
What makes Lance interesting is that with the existing tooling ecosystem you either have to deal with the complexity of stitching together multiple systems OR the expense of all-in-memory stores. Moreover, Lance doesn't require extra servers or complicated setup: pip install pylance is all you need.
Here we're going to compare the random access performance of Lance vs parquet. We'll create 100 million records, where each value is a 1000-character randomly generated string. We then run a benchmark of 1000 queries, each fetching a random set of 20–50 rows across the dataset. All tests are run on the same Ubuntu 22.04 system:
sudo lshw -short
Class Description
=============================================================
system 20M9CTO1WW (LENOVO_MT_20M9_BU_Think_FM_ThinkPad P52)
memory 128GiB System Memory
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 32GiB SODIMM DDR4 Synchronous 2667 MHz (0.4 ns)
memory 384KiB L1 cache
memory 1536KiB L2 cache
memory 12MiB L3 cache
processor Intel(R) Xeon(R) E-2176M CPU @ 2.70GHz
storage Samsung SSD 980 PRO 2TB
To run this benchmark, we first generate 100 million entries, each of which is a 1000-character-long string.
Converting from Lance to parquet is just one line:
pa.dataset.write_dataset(lance_dataset.scanner().to_reader(),
                         "take.parquet",
                         format="parquet")
For each dataset we run 1000 queries. For each query, we generate 20–50 row ids at random, then retrieve those rows and record the run time. We then compute the average time per key.
The API we use is Dataset.take:
The parquet snippet is quite similar, so it's omitted.
Here’s the output (in seconds):
Lance: mean time per key is 0.0006225714343229975
Parquet: mean time per key is 1.246656603929473
I also benchmarked a similar setup using LMDB and plotted everything on the same chart for comparison:
You may have noticed that we've only benchmarked Dataset::Take on row ids. On the roadmap is making this more generic so you can look up arbitrary keys in any column.
Part of the limitation is in Lance itself. Currently, looking up a particular key is done using a pyarrow compute expression, like dataset.to_table(columns=["value"], filter=pa.Field("key") == …. This requires scanning through the key column to find the right row ids, which adds more than 10ms to the query time. To solve this problem, we plan to 1) compute batch statistics so we can 2) implement batch pruning. And for heavily queried key columns, 3) adding a secondary index would make arbitrary key lookups much faster.
In Python, Lance is already queryable by DuckDB via the Arrow integration. However, one major shortcoming of DuckDB's Arrow integration is its extremely limited filter pushdown. For example, a single pa.Field("key") == … predicate is pushed down across the pyarrow interface, but multiple key lookups are not. This can be the difference between <10ms and >500ms response time. In Lance OSS, we're working on a native DuckDB extension so that we don't have to be subject to those limitations.
We've been claiming 100x faster random access performance than parquet, but as this benchmark shows, it's really more like 2000x. Lance brings the fast random access performance needed by important ML workflows to the OSS data ecosystem. This is critical for search, feature hydration, and shuffling for training deep learning models. While Lance's performance is already very useful for these use cases, we'll be working to implement generalized key lookups, better DuckDB integration, and hooks to distribute large Lance datasets across Spark/Ray nodes.
If any of these use cases apply to you, please give Lance a shot. We'd love to hear your feedback. If you like us, please give us a ⭐ on GitHub!