1. Introduction
We’re all used to working with CSVs, JSON files… With the standard libraries, and for big datasets, these can be extremely slow to read, write, and operate on, resulting in performance bottlenecks (been there). It’s precisely with large amounts of data that handling it efficiently becomes crucial for our data science/analytics workflow, and this is exactly where Apache Arrow comes into play.
Why? The main reason lies in how the data is stored in memory. While JSON and CSV, for instance, are text-based formats, Arrow is a columnar in-memory data format (which enables fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression.
Furthermore, Apache Arrow is open-source and optimized for analytics. It’s designed to speed up big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.
Sounds great, right? What’s best is that this is all the introduction to Arrow I’ll provide. Enough theory; we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most of it.
2. Arrow in Python
To start, you need to install the necessary libraries: pandas and pyarrow.
pip install pyarrow pandas
Then, as always, import them in your Python script:
import pyarrow as pa
import pandas as pd
Nothing new yet, just the steps needed for what follows. Let’s start by performing some simple operations.
2.1. Creating and Storing a Table
The simplest thing we can do is hardcode our table’s data. Let’s create a two-column table with football data:
teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())
team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])
The format is pyarrow.Table, but we can easily convert it to pandas if we want:
df = team_goals_table.to_pandas()
And convert it back to Arrow using:
team_goals_table = pa.Table.from_pandas(df)
And we’ll finally store the table in a file. We could use different formats, like Feather or Parquet… I’ll use the latter, since it’s fast and memory-optimized:
import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')
Reading a parquet file would just consist of using pq.read_table('data.parquet').
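For instance, a quick round-trip sketch; since parquet is columnar, reading back just a subset of columns is cheap:
# Read the full table back
table = pq.read_table('data.parquet')
# Read only the 'Team' column
teams_only = pq.read_table('data.parquet', columns=['Team'])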
2.2. Compute Functions
Arrow has its own compute module for standard operations. Let’s start by comparing two arrays element-wise:
import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a, b)
[
  false,
  true,
  false,
  true,
  false,
  true
]
That was easy! We could sum all the elements in an array with:
>>> pc.sum(a)
<pyarrow.Int64Scalar: 21>
And from this we can easily guess how to compute a count, a floor, an exp, a mean, a max, a product… No need to go over them all one by one.
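Still, just to illustrate a couple of them (a quick sketch; outputs lightly abbreviated):
>>> pc.count(a)
<pyarrow.Int64Scalar: 6>
>>> pc.mean(a)
<pyarrow.DoubleScalar: 3.5>
>>> pc.multiply(a, b)
[
  2,
  4,
  12,
  16,
  30,
  36
]
Now, let’s move on to tabular operations.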
We’ll start by showing how to sort a table:
>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])
[
  2,
  1,
  0
]
Just like in pandas, we can group values and aggregate the data. Let’s, for instance, group by “i” and compute the sum of “x” and the mean of “y”:
>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]
Or we can join two tables:
>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]
By default, it’s a left outer join, but we can change that with the join_type parameter.
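For instance, an inner join would look like this (a minimal sketch; with these particular tables every key matches, so the result coincides with the left outer join):
>>> t1.join(t2, keys="i", join_type="inner")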
There are many more useful operations, but let’s see just one more, to avoid making this too long: appending a new column to a table.
>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]
Before ending this section, let’s see how to filter a table or array:
>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]
Easy, right? Especially if you’ve been using pandas and numpy for years!
3. Working with files
We’ve already seen how to read and write Parquet files. But let’s check some other popular file types so that we have several options available.
3.1. Apache ORC
Speaking very informally, Apache ORC can be understood as the equivalent of Arrow in the realm of file formats (even though its origins have nothing to do with Arrow). Being more precise, it’s an open-source, columnar storage format. Reading and writing it goes as follows:
from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')
As a side note, we can choose to compress the file while writing by using the “compression” parameter.
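For example, a quick sketch (zstd is one of the codecs ORC supports, assuming your pyarrow build includes it):
orc.write_table(t1, 't1.orc', compression='zstd')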
3.2. CSV
No secret here, pyarrow has the CSV module:
from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")
# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, options)
# Read compressed CSV and add a custom header
# (no rows to skip, since the file was written without a header)
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(
    column_names=["i", "x"]
))
3.3. JSON
Pyarrow allows JSON reading but not writing. It’s pretty straightforward; let’s see an example, supposing we have our JSON data in “data.json”:
from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)
# We can now convert it to pandas if we want to
df = table.to_pandas()
3.4. Feather
Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally. So, contrary to Apache ORC, this one was indeed created early in the Arrow project.
from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")
# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")
4. Advanced Features
We’ve just touched upon the most basic features, which is what the majority of people will need while working with Arrow. However, its amazingness doesn’t end here: that’s right where it starts.
As these might be quite domain-specific and not useful for everyone (nor considered introductory), I’ll just mention some of these features, keeping the code to a couple of brief sketches after the list:
- We can handle memory management through the Buffer type (built on top of the C++ Buffer object). Creating a buffer with our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object. Keeping with this memory management, an instance of MemoryPool tracks all the allocations and deallocations (like malloc and free in C). This allows us to track the amount of memory being allocated (see the first sketch after this list).
- Similarly, there are different ways to work with input/output streams in batches.
- PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. So, for instance, we can write and read parquet files from an S3 bucket using the S3FileSystem (see the second sketch below). Google Cloud Storage and the Hadoop Distributed File System (HDFS) are also supported.
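A minimal sketch of the buffer and memory-pool idea from the first point:
import pyarrow as pa
# Zero-copy view: wrapping existing bytes allocates no new memory
data = b"some raw bytes"
buf = pa.py_buffer(data)
print(buf.size)  # 14
# The default memory pool tracks Arrow's allocations and deallocations
pool = pa.default_memory_pool()
print(pool.bytes_allocated())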
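And a sketch of the filesystem interface (bucket name, path, and region are placeholders; credentials are picked up from the environment by default):
from pyarrow import fs
import pyarrow.parquet as pq
# Connect to S3 and read/write parquet directly against the bucket
s3 = fs.S3FileSystem(region="us-east-1")
table = pq.read_table("my-bucket/path/data.parquet", filesystem=s3)
pq.write_table(table, "my-bucket/path/copy.parquet", filesystem=s3)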
5. Conclusion and Key Takeaways
Apache Arrow is a powerful tool for efficient data handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.