Anatomy of a Parquet File


Recently, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented layout offers several benefits:

  • Faster query execution when only a subset of columns is being processed
  • Quick calculation of statistics across all data
  • Reduced storage volume because of efficient compression

When combined with storage frameworks like Delta Lake or Apache Iceberg, it seamlessly integrates with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to such performance.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. This makes PyArrow ideal for Parquet manipulation (one could also simply use Pandas).

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
    [
        pa.field("name", pa.string(), nullable=False),
        pa.field("address", pa.string(), nullable=False),
        pa.field("date_of_birth", pa.date32(), nullable=False),
        pa.field("city", pa.string(), nullable=False),
        pa.field("birth_year", pa.int32(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [name_array, address_array, birth_date_array, city_array, birth_year_array],
    schema=schema,
)

print(table)
pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a conventional "row-wise" table.
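For comparison, converting the same Arrow table back to Pandas restores the familiar row-wise display. A minimal sketch, continuing generator.py and assuming Pandas is installed:

# generator.py (optional check)

# Pandas shows the data in the usual row-oriented layout
df = table.to_pandas()
print(df.head())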

How is a Parquet file stored?

Parquet files are generally stored in low-cost object storage databases like S3 (AWS) or GCS (GCP) to be easily accessible by data processing pipelines. These files are often organized with a partitioning strategy by leveraging directory structures:

# generator.py

num_records = 100

# ...

# Writing the parquet files to disk
pq.write_to_dataset(
    table,
    root_path='dataset',
    partition_cols=['birth_year', 'city']
)

Since the birth_year and city columns are defined as partitioning keys, PyArrow creates the following tree structure in the dataset directory:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│ ├─ city=Aaronbury/
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ …
│ ├─ city=Alicialand/
│ ├─ …
├─ birth_year=1951/
├─ …

This strategy enables partition pruning: when a query filters on these columns, the engine can use the folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has long been the case with traditional relational databases).
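The same pruning can also be requested directly through PyArrow's read API. A minimal sketch, assuming the partitioned dataset directory created above (the script name is arbitrary):

# prune.py (illustrative)
import pyarrow.parquet as pq

# Only partitions matching birth_year == 1949 should be read
table_1949 = pq.read_table("dataset", filters=[("birth_year", "=", 1949)])
print(table_1949.num_rows)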

The pruning effect can be easily verified by counting the files opened by a Python script that filters on the birth year:

# query.py
import duckdb

duckdb.sql(
    """
    SELECT * 
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    where birth_year = 1949
    """
).show()
> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*.parquet"

[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Latest%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Latest%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

Only 23 files are read out of 100.
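These counts can be quickly cross-checked with a couple of glob patterns. A sketch, assuming the same dataset directory (the exact numbers depend on the generated data):

# count_files.py (illustrative)
from glob import glob

all_files = glob("dataset/*/*/*.parquet")
files_1949 = glob("dataset/birth_year=1949/*/*.parquet")
print(f"{len(files_1949)} candidate files out of {len(all_files)}")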

Reading a raw Parquet file

Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    use_dictionary=False,
    compression="NONE",
    write_statistics=True,
    column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”. The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As indicated in the documentation, the file is divided into two parts: the “row groups” containing the actual data, and the footer containing the metadata.

The footer

The size of the footer is indicated in the 4 bytes preceding the end marker, as an unsigned integer written in little-endian format (noted "<I" in the unpack function).

# reader.py

import struct

# ...

footer_length = struct.unpack("
Footer size in bytes: 1088

The footer information is encoded in a cross-language serialization format called Apache Thrift. Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

struct Customer {
	1: required string name,
	2: optional i16 birthYear,
	3: optional list<string> interests
}

Based on this declaration, Thrift can generate Python code to decode byte strings with such a data structure (it also generates code to perform the encoding part). The thrift file containing all the data structures implemented in a Parquet file can be downloaded here. After installing the thrift binary, let's run:

thrift -r --gen py parquet.thrift

The generated Python code is placed in the “gen-py” folder. The footer’s data structure is represented by the FileMetaData class – a Python class automatically generated from the Thrift schema. Using Thrift’s Python utilities, the binary data is parsed and populated into an instance of this FileMetaData class.

# reader.py

import sys

# ...

# Add the generated classes to the python path
sys.path.append("gen-py")
from parquet.ttypes import FileMetaData, PageHeader
from thrift.transport import TTransport
from thrift.protocol import TCompactProtocol

def read_thrift(data, thrift_instance):
    """
    Read a Thrift object from a binary buffer.
    Returns the Thrift object and the number of bytes read.
    """
    transport = TTransport.TMemoryBuffer(data)
    protocol = TCompactProtocol.TCompactProtocol(transport)
    thrift_instance.read(protocol)
    return thrift_instance, transport._buffer.tell()

# The number of bytes read is not used for now
file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())

print(f"Variety of rows in the entire file: {file_metadata_thrift.num_rows}")
print(f"Variety of row groups: {len(file_metadata_thrift.row_groups)}")

Variety of rows in the entire file: 100
Variety of row groups: 1

The footer contains extensive information about the file’s structure and content. For instance, it accurately tracks the number of rows in the generated dataframe. These rows are all contained within a single “row group.”
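The footer also embeds the full schema. A minimal sketch, continuing reader.py, that lists the column names and physical type codes from the generated SchemaElement objects (fields as declared in parquet.thrift):

# reader.py
# ...

# The first schema element is the root; the following ones describe the columns.
# element.type is the physical type as a Thrift enum value.
for element in file_metadata_thrift.schema[1:]:
    print(element.name, element.type)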

Row groups

Unlike purely column-oriented formats, Parquet employs a hybrid approach. Before writing column blocks, the dataframe is first partitioned into row groups (the Parquet file we generated is too small to be split into multiple row groups).
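As an optional cross-check, the same row-group count can be read with PyArrow's high-level metadata API. A minimal sketch, assuming the dataset.parquet file written above:

# check_row_groups.py (illustrative)
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("dataset.parquet")
# With only 100 rows, everything fits into a single row group
print(parquet_file.metadata.num_row_groups)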

This hybrid structure offers several benefits:

  • Parquet calculates statistics (such as min/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that don’t match the filtering criteria. For example, if a query filters for birth_year > 1955 and a row group’s maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimization is called “predicate pushdown”. Parquet also stores other useful statistics like distinct value counts and null counts.

# reader.py
# ...

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

min_year = struct.unpack("
The birth 12 months range is between 1949 and 1958
  • Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the available computing resources (using the row_group_size property of the write_table function in PyArrow).
# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    row_group_size=100,
)

# /!\ Keep the default value of "row_group_size" for the next parts
  • Even if this is not the primary objective of a columnar format, Parquet’s hybrid structure maintains reasonable performance when reconstructing complete rows (see the sketch below). Without row groups, rebuilding an entire row might require scanning the entirety of each column, which would be extremely inefficient for large files.
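As an illustration of this last point, a complete set of rows can be rebuilt by touching a single row group. A minimal sketch using PyArrow's high-level API (illustrative, not part of reader.py):

# read_rows.py (illustrative)
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("dataset.parquet")
# Reconstruct complete rows from the first (and only) row group
first_group = parquet_file.read_row_group(0)
print(first_group.to_pylist()[:2])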

Data Pages

The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off (a write-time tuning example follows the list below):

  • Larger pages mean less metadata to store and read, which is ideal for queries with minimal filtering.
  • Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.
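In PyArrow, the page size can be tuned at write time through the data_page_size parameter of write_table. A minimal sketch (the output file name is arbitrary):

# generator.py (optional tuning, illustrative)

pq.write_table(
    table,
    "dataset_small_pages.parquet",
    data_page_size=1024,  # target page size in bytes (the default is about 1 MB)
)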

Now let’s decode the contents of the first page of the column dedicated to addresses, whose location can be found in the footer (given by the data_page_offset attribute of the corresponding ColumnMetaData). Each page is preceded by a Thrift PageHeader object containing some metadata: the offset actually points to the Thrift binary representation of the page metadata, which precedes the page itself. The Thrift class is called PageHeader and can also be found in the gen-py directory.


# reader.py
# ...

address_column = first_row_group.columns[1]
column_start = address_column.meta_data.data_page_offset
column_end = column_start + address_column.meta_data.total_compressed_size
column_content = parquet_data[column_start:column_end]

page_thrift, page_header_size = read_thrift(column_content, PageHeader())
page_content = column_content[
    page_header_size : (page_header_size + page_thrift.compressed_page_size)
]
print(column_content[:100])
b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to optimize the columnar format, it is recommended to use one of the following encoding algorithms: dictionary encoding, run-length encoding (RLE), or delta encoding (the latter being reserved for int32 and int64 types), followed by compression using gzip or snappy (the available codecs are listed here). Since encoded pages contain similar values (all addresses, all decimal numbers, etc.), compression ratios can be particularly advantageous.
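As an illustration, re-writing the same table with dictionary encoding and Snappy compression enabled makes the size difference easy to measure. A sketch, continuing generator.py (the output file name is arbitrary):

# generator.py (optional comparison, illustrative)
import os

pq.write_table(
    table,
    "dataset_snappy.parquet",
    use_dictionary=True,
    compression="snappy",
)

print(os.path.getsize("dataset.parquet"), "bytes without encoding or compression")
print(os.path.getsize("dataset_snappy.parquet"), "bytes with dictionary encoding + snappy")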

As documented in the specification, when character strings (BYTE_ARRAY) are not encoded, each value is preceded by its size, represented as a 4-byte integer. This can be observed in the previous output:

To read all the values (for example, the first 10), the loop is rather simple:

idx = 0
for _ in range(10):
    str_size = struct.unpack("<I", page_content[idx : idx + 4])[0]
    idx += 4
    print(page_content[idx : idx + str_size].decode("utf-8"))
    idx += str_size

481 Mata Squares Suite 260, Lake Rachelville, KY 87464
671 Barker Crossing Suite 390, Mooretown, MI 21488
62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
948 Victor Square Apt. 753, Braybury, RI 67113
365 Edward Place Apt. 162, Calebborough, AL 13037
894 Reed Lock, New Davidmouth, NV 84612
24082 Allison Squares Suite 345, North Sharonberg, WY 97642
00266 Johnson Drives, South Lori, MI 98513
15255 Kelly Plains, Richardmouth, GA 33438
260 Thomas Glens, Port Gabriela, OH 96758

And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I’m convinced that knowing how Parquet works under the hood helps make better decisions about storage strategies, compression choices, and performance optimization.

All the code used in this article is available on my GitHub repository at https://github.com/kili-mandjaro/anatomy-parquet, where you can explore more examples and experiment with different Parquet file configurations.

Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet’s internal structures has provided valuable insights for your Data Engineering journey.
