doesn’t have to be complicated. In this article, I’ll show you how to develop a basic, “starter” one that uses an Iceberg table on AWS S3 storage. Once the table is registered with AWS Glue, you’ll be able to query and mutate it from Amazon Athena, including:
- Merging, updating and deleting data
- Optimising and vacuuming your tables.
I’ll also show you how to inspect the table from DuckDB, and we’ll see how to use Glue/Spark to insert more data into it.
Our example may be basic, but it’ll showcase the setup, the different tools, and the processes you can put in place to build up a more extensive data store. All modern cloud providers have equivalents of the AWS services I’m discussing in this article, so it should be fairly straightforward to replicate what I cover here on Azure, Google Cloud, and others.
To make sure we’re all on the same page, here’s a brief explanation of some of the key technologies we’ll be using.
AWS Glue/Spark
AWS Glue is a fully managed, serverless ETL service from Amazon that streamlines data preparation and integration for analytics and machine learning. It automatically detects and catalogues metadata from various sources, such as S3, into a centralised data catalogue. Moreover, it can generate customisable Python-based Spark ETL scripts and run them on a scalable, serverless Apache Spark platform. This makes it great for building data lakes on Amazon S3, loading data into data warehouses like Amazon Redshift, and performing data cleansing and transformation, all without managing infrastructure.
AWS Athena
AWS Athena is an interactive query service that simplifies data analysis directly in Amazon S3 using standard SQL. As a serverless platform, there’s no need to manage or provision servers; just point Athena at your S3 data, define your schema (often with AWS Glue), and start running SQL queries. It’s frequently used for ad hoc analysis, reporting, and exploration of large datasets in formats such as CSV, JSON, ORC, or Parquet.
Iceberg tables
Iceberg is an open table format that provides database-like capabilities for data stored in data lakes, such as Amazon S3 object storage. Traditionally, on S3, you can create, read, and delete objects (files), but updating them in place is not possible. The Iceberg format addresses that limitation while also offering other advantages, including ACID transactions, schema evolution, hidden partitioning, and time-travel features.
DuckDB
DuckDB is an in-process analytical database written in C++ and designed for analytical SQL workloads. Since its release a few years ago, it has grown in popularity and is now one of the premier data processing tools used by data engineers and scientists, thanks to its grounding in SQL, its performance, and its flexibility.
Scenario overview
Let’s say you have been tasked with building a small “warehouse-lite” analytics table for order events, but you don’t want to adopt a heavyweight platform just yet. You need:
- Secure writes (no broken readers, no partial commits)
- Row-level changes (UPDATE/DELETE/MERGE, not only append)
- Point-in-time reads (for audits and debugging)
- Local analytics against production-accurate data for quick checks
What we’ll build
- Create an Iceberg table in Glue & S3 via Athena
- Load and mutate rows (INSERT/UPDATE/DELETE/MERGE)
- Time travel to prior snapshots (by timestamp and by snapshot ID)
- Keep it fast with OPTIMIZE and VACUUM
- Read locally from DuckDB (S3 access via DuckDB Secrets)
- See how to add new records to our table using Glue Spark code
So, in a nutshell, we’ll be using:
- S3 for data storage
- Glue Catalogue for table metadata/discovery
- Athena for serverless SQL reads and writes
- DuckDB for cheap, local analytics against the same Iceberg table
- Spark for processing grunt
The key takeaway from our perspective is that by using the above technologies, we will be able to perform database-like queries on object storage.
Setting up our development environment
I prefer to isolate local tooling in a separate environment. Use any tool you like for this; I’ll use conda since that’s what I usually do. For demo purposes, I’ll be running all of the code inside a Jupyter Notebook environment.
# create and activate a local env
conda create -n iceberg-demo python=3.11 -y
conda activate iceberg-demo
# install duckdb CLI + Python package and awscli for quick tests
pip install duckdb awscli jupyter
Prerequisites
As we’ll be using AWS services, you’ll need an AWS account. You’ll also need:
- An S3 bucket for the data lake (e.g., s3://my-demo-lake/warehouse/)
- A Glue database (we’ll create one)
- Athena Engine Version 3 in your workgroup
- An IAM role or user for Athena with S3 + Glue permissions (a quick sanity-check sketch follows this list)
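If you want to verify up front that your local credentials can reach both the bucket and the Glue catalogue, here’s a minimal sketch using boto3 (not installed by the pip command above, so add it yourself). The bucket name is the placeholder from the list and should be swapped for your own.
# Hedged sanity check: assumes boto3 is installed (pip install boto3)
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Raises an error if the bucket doesn't exist or the credentials lack access
s3.head_bucket(Bucket="my-demo-lake")  # placeholder bucket name
print("S3 bucket reachable")

# Lists the Glue databases visible to these credentials
databases = glue.get_databases()["DatabaseList"]
print("Glue databases:", [d["Name"] for d in databases])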
1) Athena setup
Once you’ve signed into AWS, open Athena in the console and set your workgroup, engine version, and S3 output location (for query results). To do this, look for a hamburger-style menu icon at the top left of the Athena home screen. Click on it to bring up a new menu block on the left. In there, you should see an Administration -> Workgroups link. You’ll automatically be assigned to the primary workgroup. You can stick with this or create a new one if you like. Whichever option you choose, edit it and make sure that the following options are selected (if you’d rather script this step, there’s a sketch after the list below).
- Analytics Engine — Athena SQL. Manually set the engine version to 3.
- Select customer-managed query result configuration and enter the required bucket and account information.
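For completeness, here’s a hedged boto3 sketch of the same workgroup setup done programmatically. The workgroup name and result bucket are placeholders of my own choosing, and I haven’t handled the case where the workgroup already exists.
import boto3

athena = boto3.client("athena")

# Create a workgroup pinned to engine version 3 with a customer-managed
# query-result location (names/paths below are illustrative placeholders)
athena.create_work_group(
    Name="iceberg-demo",
    Description="Workgroup for the Iceberg demo",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-demo-lake/athena-results/"
        },
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
    },
)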
2) Create an Iceberg table in Athena
We’ll store order events and let Iceberg manage partitioning transparently. I’ll use a “hidden” partition on the day of the timestamp to spread writes/reads. Return to the Athena home page and launch the Trino SQL query editor.
Type in and run the following SQL. Change the bucket/table names to suit.
-- This automatically creates a Glue database
-- if you don't have one already
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE analytics.sales_iceberg (
order_id bigint,
customer_id bigint,
ts timestamp,
status string,
amount_usd double
)
PARTITIONED BY (day(ts))
LOCATION 's3://your_bucket/warehouse/sales_iceberg/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
)
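If you’d rather drive Athena from the notebook than from the console, any of the SQL in this article can be submitted with boto3’s start_query_execution. A minimal sketch, assuming the placeholder “iceberg-demo” workgroup from the setup step:
import boto3

athena = boto3.client("athena")

# Submit a statement asynchronously; Athena returns an execution id.
# A fuller version would poll get_query_execution until the state is SUCCEEDED.
response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM analytics.sales_iceberg",
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="iceberg-demo",  # placeholder workgroup name
)
print("Execution id:", response["QueryExecutionId"])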
3) Load and mutate data (INSERT / UPDATE / DELETE / MERGE)
Athena supports real Iceberg DML, allowing you to insert rows, update and delete records, and upsert using the MERGE statement. Under the hood, Iceberg uses snapshot-based ACID with delete files; readers stay consistent while writers work in parallel.
Seed a few rows.
INSERT INTO analytics.sales_iceberg VALUES
(101, 1, timestamp '2025-08-01 10:00:00', 'created', 120.00),
(102, 2, timestamp '2025-08-01 10:05:00', 'created', 75.50),
(103, 2, timestamp '2025-08-02 09:12:00', 'created', 49.99),
(104, 3, timestamp '2025-08-02 11:47:00', 'created', 250.00);
A quick sanity check.
SELECT * FROM analytics.sales_iceberg ORDER BY order_id;
order_id | customer_id | ts | status | amount_usd
----------+-------------+-----------------------+----------+-----------
101 | 1 | 2025-08-01 10:00:00 | created | 120.00
102 | 2 | 2025-08-01 10:05:00 | created | 75.50
103 | 2 | 2025-08-02 09:12:00 | created | 49.99
104 | 3 | 2025-08-02 11:47:00 | created | 250.00
Update and delete.
UPDATE analytics.sales_iceberg
SET status = 'paid'
WHERE order_id IN (101, 102);
-- removes order 103
DELETE FROM analytics.sales_iceberg
WHERE status = 'created' AND amount_usd < 60;
Idempotent upserts with MERGE
Let’s treat order 104 as refunded and create a new order 105.
MERGE INTO analytics.sales_iceberg AS t
USING (
VALUES
(104, 3, timestamp '2025-08-02 11:47:00', 'refunded', 250.00),
(105, 4, timestamp '2025-08-03 08:30:00', 'created', 35.00)
) AS s(order_id, customer_id, ts, status, amount_usd)
ON s.order_id = t.order_id
WHEN MATCHED THEN
UPDATE SET
customer_id = s.customer_id,
ts = s.ts,
status = s.status,
amount_usd = s.amount_usd
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, ts, status, amount_usd)
VALUES (s.order_id, s.customer_id, s.ts, s.status, s.amount_usd);
You can now re-query to see: 101/102 → paid, 103 deleted, 104 → refunded, and 105 → created. (If you’re running this in a “real” account, you’ll notice the S3 object count ticking up — more on maintenance shortly.)
SELECT * FROM analytics.sales_iceberg ORDER BY order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 105 4 2025-08-03 08:30:00.000000 created 35.0
3 102 2 2025-08-01 10:05:00.000000 paid 75.5
4 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4) Time travel (and version travel)
This is where the real value of Iceberg shines. You can query the table as it looked at a moment in time, or at a specific snapshot ID. In Athena, use this syntax:
-- Time travel to noon on Aug 2 (UTC)
SELECT order_id, status, amount_usd
FROM analytics.sales_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2025-08-02 12:00:00 UTC'
ORDER BY order_id;
-- Or version travel (replace the id with an actual snapshot id from your table)
SELECT *
FROM analytics.sales_iceberg
FOR VERSION AS OF 949530903748831860;
To get the various version (snapshot) IDs associated with a particular table, use this query.
SELECT * FROM "analytics"."sales_iceberg$snapshots"
ORDER BY committed_at DESC;
5) Keeping your data healthy: OPTIMIZE and VACUUM
Row-level writes (UPDATE/DELETE/MERGE) create many delete files and can fragment data. Two statements keep things fast and storage-friendly:
- OPTIMIZE … REWRITE DATA USING BIN_PACK — compacts small/fragmented files and folds deletes into data
- VACUUM — expires old snapshots + cleans orphan files
-- compact "hot" data (yesterday) and merge deletes
OPTIMIZE analytics.sales_iceberg
REWRITE DATA USING BIN_PACK
WHERE ts >= date_trunc('day', current_timestamp - interval '1' day);
-- expire old snapshots and remove orphan files
VACUUM analytics.sales_iceberg;
6) Local analytics with DuckDB (read-only)
It’s great to be able to sanity-check production tables from a laptop without having to run a cluster. DuckDB’s httpfs + iceberg extensions make this straightforward.
6.1 Install & load extensions
Open your Jupyter notebook and type in the following.
# httpfs gives S3 support; iceberg adds Iceberg readers.
import duckdb as db
db.sql("install httpfs; load httpfs;")
db.sql("install iceberg; load iceberg;")
6.2 Provide S3 credentials to DuckDB the “right” way (Secrets)
DuckDB has a small but powerful secrets manager. The most robust setup on AWS is the credential chain provider, which reuses whatever the AWS SDK can find (environment variables, IAM role, etc.). You will therefore need to make sure that, for example, your AWS CLI credentials are configured.
db.sql("""CREATE SECRET ( TYPE s3, PROVIDER credential_chain )""")
After that, any s3://… reads in this DuckDB session will use those credentials.
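If you’re working somewhere the credential chain can’t help (for example, a box with only a bare access key pair), DuckDB secrets also accept explicit values. A sketch with obviously fake placeholders; swap in your own key, secret, and region.
# Alternative: an explicitly configured secret (values below are placeholders)
db.sql("""
CREATE SECRET s3_manual (
    TYPE s3,
    KEY_ID 'AKIAXXXXXXXXXXXXXXXX',
    SECRET 'your_secret_access_key',
    REGION 'eu-west-1'
)
""")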
6.3 Point DuckDB at the Iceberg table’s metadata
The most explicit way is to reference a concrete metadata file (e.g., the latest one in your table’s metadata/ folder).
To get a list of these, use this query:
result = db.sql("""
SELECT *
FROM glob('s3://your_bucket/warehouse/**')
ORDER BY file
""")
print(result)
...
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00000-942a25ce-24e5-45f8-ae86-b70d8239e3bb.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00001-fa2d9997-590e-4231-93ab-642c0da83f19.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00002-0da3a4af-64af-4e46-bea2-0ac450bf1786.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00003-eae21a3d-1bf3-4ed1-b64e-1562faa445d0.metadata.json
s3://your_bucket_name/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json
...
Look for the metadata.json file with the highest-numbered start to its file name, 00004 in my case. Then, you can use that in a query like this to retrieve the latest state of your underlying table.
# Use the highest-numbered metadata file (00004 is the latest in my case)
result = db.sql("""
SELECT *
FROM iceberg_scan('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
LIMIT 10
""")
print(result)
┌──────────┬─────────────┬─────────────────────┬──────────┬────────────┐
│ order_id │ customer_id │ ts │ status │ amount_usd │
│ int64 │ int64 │ timestamp │ varchar │ double │
├──────────┼─────────────┼─────────────────────┼──────────┼────────────┤
│ 105 │ 4 │ 2025-08-03 08:30:00 │ created │ 35.0 │
│ 104 │ 3 │ 2025-08-02 11:47:00 │ refunded │ 250.0 │
│ 101 │ 1 │ 2025-08-01 10:00:00 │ paid │ 120.0 │
│ 102 │ 2 │ 2025-08-01 10:05:00 │ paid │ 75.5 │
└──────────┴─────────────┴─────────────────────┴──────────┴────────────┘
Want a specific snapshot? Use this to get a list.
result = db.sql("""
SELECT *
FROM iceberg_snapshots('s3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json')
""")
print("Available Snapshots:")
print(result)
Available Snapshots:
┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │ snapshot_id │ timestamp_ms │ manifest_list │
│ uint64 │ uint64 │ timestamp │ varchar │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ 5665457382547658217 │ 2025-09-09 10:58:44.225 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-5665457382547658217-1-bb7d0497-0f97-4483-98e2-8bd26ddcf879.avro │
│ 3 │ 8808557756756599285 │ 2025-09-09 11:19:24.422 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-8808557756756599285-1-f83d407d-ec31-49d6-900e-25bc8d19049c.avro │
│ 2 │ 31637314992569797 │ 2025-09-09 11:08:08.805 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-31637314992569797-1-000a2e8f-b016-4d91-9942-72fe9ddadccc.avro │
│ 4 │ 4009826928128589775 │ 2025-09-09 11:43:18.117 │ s3://your_bucket/warehouse/sales_iceberg/metadata/snap-4009826928128589775-1-cd184303-38ab-4736-90da-52e0cf102abf.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
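To read the table as it stood at one of those snapshots, iceberg_scan takes snapshot-selection arguments. A sketch, assuming your version of the DuckDB iceberg extension supports the snapshot_from_id named parameter (older releases may not); the snapshot id used here is the first one from the listing above.
# Hedged: read the table at a specific snapshot id taken from the list above
result = db.sql("""
SELECT *
FROM iceberg_scan(
    's3://your_bucket/warehouse/sales_iceberg/metadata/00004-4a2cff23-2bf6-4c69-8edc-6d74c02f4c0e.metadata.json',
    snapshot_from_id => 5665457382547658217
)
ORDER BY order_id
""")
print(result)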
7) Optional extra: Writing from Spark/Glue
If you prefer Spark for larger batch writes, Glue can read/write Iceberg tables registered in the Glue Catalogue. You’ll probably still want to use Athena for ad hoc SQL, time travel, and maintenance, but large CTAS/ETL can come via Glue jobs. (Just bear in mind that version compatibility and AWS Lake Formation permissions can bite, as Glue and Athena may lag slightly on Iceberg versions.)
Here’s an example of some Glue Spark code that inserts a few new data rows, starting at order_id = 110, into our existing table. Before running this, you should add the following Glue job parameter (under Glue Job Details -> Advanced Parameters -> Job parameters).
Key: --conf
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
import sys
import random
from datetime import datetime
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row
# --------------------------------------------------------
# Init Glue job
# --------------------------------------------------------
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# --------------------------------------------------------
# Force Iceberg + Glue catalog configs (dynamic only)
# --------------------------------------------------------
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your_bucket/warehouse/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
spark.conf.set("spark.sql.defaultCatalog", "glue_catalog")
# --------------------------------------------------------
# Debug: list catalogs to confirm glue_catalog is registered
# --------------------------------------------------------
print("Current catalogs available:")
spark.sql("SHOW CATALOGS").show(truncate=False)
# --------------------------------------------------------
# Read existing Iceberg table (optional)
# --------------------------------------------------------
existing_table_df = glueContext.create_data_frame.from_catalog(
database="analytics",
table_name="sales_iceberg"
)
print("Existing table schema:")
existing_table_df.printSchema()
# --------------------------------------------------------
# Create 5 new records
# --------------------------------------------------------
new_records_data = []
for i in range(5):
    order_id = 110 + i
    record = {
        "order_id": order_id,
        "customer_id": 1000 + (i % 10),
        "price": round(random.uniform(10.0, 500.0), 2),
        "created_at": datetime.now(),
        "status": "completed",
    }
    new_records_data.append(record)
new_records_df = spark.createDataFrame([Row(**r) for r in new_records_data])
print(f"Creating {new_records_df.count()} latest records:")
new_records_df.show()
# Register temp view for SQL insert
new_records_df.createOrReplaceTempView("new_records_temp")
# --------------------------------------------------------
# Insert into Iceberg table (alias columns as needed)
# --------------------------------------------------------
spark.sql("""
INSERT INTO analytics.sales_iceberg (order_id, customer_id, ts, status, amount_usd)
SELECT order_id,
customer_id,
created_at AS ts,
status,
price AS amount_usd
FROM new_records_temp
""")
print(" Sccessfully added 5 latest records to analytics.sales_iceberg")
# --------------------------------------------------------
# Commit Glue job
# --------------------------------------------------------
job.commit()
Double-check with Athena.
select * from analytics.sales_iceberg
order by order_id
# order_id customer_id ts status amount_usd
1 101 1 2025-08-01 10:00:00.000000 paid 120.0
2 102 2 2025-08-01 10:05:00.000000 paid 75.5
3 104 3 2025-08-02 11:47:00.000000 refunded 250.0
4 105 4 2025-08-03 08:30:00.000000 created 35.0
5 110 1000 2025-09-10 16:06:45.505935 completed 248.64
6 111 1001 2025-09-10 16:06:45.505947 completed 453.76
7 112 1002 2025-09-10 16:06:45.505955 completed 467.79
8 113 1003 2025-09-10 16:06:45.505963 completed 359.9
9 114 1004 2025-09-10 16:06:45.506059 completed 398.52
Future Steps
From here, you could:
- Create more tables with data.
- Experiment with partition evolution (e.g., change the table partitioning from day → hour as volumes grow).
- Add scheduled maintenance. For example, EventBridge, Step Functions, and Lambda could be used to run OPTIMIZE/VACUUM on a scheduled cadence (see the sketch after this list).
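To give a flavour of that last point, here’s a rough sketch of a Lambda handler that an EventBridge schedule could trigger nightly. The workgroup name is the placeholder used earlier, and a production version would add polling and error handling.
import boto3

athena = boto3.client("athena")

# Maintenance statements to run on a cadence (e.g., nightly via EventBridge)
MAINTENANCE_SQL = [
    "OPTIMIZE analytics.sales_iceberg REWRITE DATA USING BIN_PACK",
    "VACUUM analytics.sales_iceberg",
]

def handler(event, context):
    execution_ids = []
    for statement in MAINTENANCE_SQL:
        response = athena.start_query_execution(
            QueryString=statement,
            QueryExecutionContext={"Database": "analytics"},
            WorkGroup="iceberg-demo",  # placeholder workgroup name
        )
        execution_ids.append(response["QueryExecutionId"])
    return {"executions": execution_ids}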
Summary
In this article, I’ve tried to provide a clear path for building an Iceberg data lakehouse on AWS. It should serve as a guide for data engineers who want to connect simple object storage with complex enterprise data warehouses.
Hopefully, I’ve shown that building a data lakehouse—a system that combines the low cost of data lakes with the transactional integrity of warehouses—doesn’t necessarily require extensive infrastructure deployment. And while building a full lakehouse is something that evolves over a long time, I hope I’ve convinced you that you really can make the bones of one in a day.
By leveraging Apache Iceberg on a cloud storage system like Amazon S3, I demonstrated how to transform static files into dynamic, managed tables capable of ACID transactions, row-level mutations (MERGE, UPDATE, DELETE), and time travel, all without provisioning a single server.
I also showed that by using newer analytic tools such as DuckDB, it’s possible to read small to medium data lakes locally. And when your data volumes grow too big for local processing, I showed how easy it is to step up to an enterprise-class data processing platform like Spark.
