demonstrates that it’s perfectly possible to insert 2M records per second into Postgres. Instead of chasing micro-benchmarks, in this article we’ll step back and ask a more important question: which abstraction actually fits our workload?
We’ll look at five ways to insert data into Postgres using Python. The goal is not just to compare insert speeds and crown a winner, but to understand the trade-offs between abstraction, safety, convenience and performance.
By the end you’ll understand:
- the strengths and weaknesses of ORM, Core and driver-level inserts
- when performance actually matters
- how to choose the right tool without over-engineering
Why fast inserts matter
High-volume insert workloads show up everywhere:
- loading tens of millions of records
- syncing data from external APIs
- backfilling analytics tables
- ingesting events or logs into warehouses
Small inefficiencies compound quickly. Turning a 3-minute insert job into a 10-second one can reduce system load, free up workers and improve overall throughput.
That said, faster doesn’t automatically mean better. When workloads are small, sacrificing clarity and safety for marginal gains rarely pays off.
Understanding when performance matters is the real goal.
Which tools do we insert with?
To talk to our Postgres database we need a database driver. In our case that is psycopg3, with SQLAlchemy layered on top. Here’s a quick distinction:
Psycopg3 (the driver)
psycopg3 is a low-level PostgreSQL driver for Python. It is a very thin abstraction with minimal overhead that talks to Postgres directly.
The trade-off is responsibility: you write SQL yourself, manage batching and handle correctness explicitly.
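To make that concrete, here is a minimal sketch of a driver-level insert with psycopg3; the table, columns and connection string are assumptions for illustration, not the exact benchmark code:

```python
import psycopg  # psycopg3

rows = [("alice", "alice@example.com"), ("bob", "bob@example.com")]

# The connection context manager commits on success and rolls back on error.
with psycopg.connect("dbname=demo user=postgres") as conn:
    with conn.cursor() as cur:
        # You write the SQL and the parameter order yourself; the driver only
        # binds the parameters and sends the statements to Postgres.
        cur.executemany(
            "INSERT INTO users (name, email) VALUES (%s, %s)",
            rows,
        )
```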
SQLAlchemy
SQLAlchemy sits on top of database drivers like psycopg3 and provides two layers:
1) SQLAlchemy Core
This is the SQL abstraction and execution layer. It’s database-agnostic, which means you write Python expressions and Core translates them into SQL in the right database dialect (PostgreSQL / SQL Server / SQLite) and safely binds the parameters.
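As a small illustration, a Core insert from a list of dictionaries could look like this; the users table definition and connection URL are assumptions, and the table is presumed to already exist:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, insert

metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("email", String),
)

engine = create_engine("postgresql+psycopg://user:pass@localhost/demo")

rows = [
    {"name": "alice", "email": "alice@example.com"},
    {"name": "bob", "email": "bob@example.com"},
]

# engine.begin() opens a transaction that commits on success; passing a list
# of dicts to execute() lets Core emit one batched, parameterized INSERT.
with engine.begin() as conn:
    conn.execute(insert(users), rows)
```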
2) SQLAlchemy ORM
The ORM is built on top of Core and abstracts even more. It maps Python classes to tables, tracks object state and handles relationships. The ORM is very productive and safe, but all that bookkeeping introduces overhead, especially for bulk operations.
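The same insert through the ORM, in SQLAlchemy 2.0 style, might look like this (the model and connection URL are again illustrative):

```python
from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String)
    email: Mapped[str] = mapped_column(String)

engine = create_engine("postgresql+psycopg://user:pass@localhost/demo")

with Session(engine) as session:
    session.add_all([
        User(name="alice", email="alice@example.com"),
        User(name="bob", email="bob@example.com"),
    ])
    # The unit of work tracks each object and flushes them as INSERTs here.
    session.commit()
```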
Briefly:
All three exist on a spectrum. On one side there’s the ORM, which takes a lot of work out of your hands and provides a lot of safety at the cost of some overhead. On the other side there’s the Driver, which is very bare-bones but provides maximum throughput. Core sits right in the middle and gives you a nice balance of safety, performance and control.
Simply put:
- the ORM helps you use Core more easily
- Core helps you use the Driver more safely and in a database-agnostic way
The benchmark
To keep the benchmark fair:
- each method receives data in the shape it’s designed for (ORM objects for the ORM, dictionaries for Core, tuples for the Driver)
- only the time spent moving data from Python into Postgres is measured
- no method is penalized for conversion work
- the database runs in the same environment as our Python script; this prevents our benchmark from being bottlenecked by, for example, network upload speed
The goal is not to “find the fastest insert” but to understand what each method does well.
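As a rough sketch of how that isolation can be done, the rows are converted into each method’s native shape before the clock starts, so only the Python-to-Postgres transfer is timed (the names below are hypothetical, not the exact benchmark harness):

```python
import time

def measure(insert_fn, prepared_rows):
    """Time only the insert itself; prepared_rows is already in the
    shape the method expects (ORM objects, dicts or tuples)."""
    start = time.perf_counter()
    insert_fn(prepared_rows)  # e.g. ORM add_all, Core execute, driver COPY
    return time.perf_counter() - start
```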
1) Is faster always better?
Which is better: a Ferrari or a Jeep?
That depends on the terrain.
If you’re traversing a forest, go with the Jeep. If you want to be first across the finish line, the Ferrari is the better choice.
The same applies to inserting. Shaving 300 milliseconds off a 10-second insert may not justify extra complexity and risk. In other cases, that gain is absolutely worth it.
In some cases, the fastest method on paper is not the best choice once you account for:
- maintenance cost
- correctness guarantees
- cognitive load
2) What’s your starting point?
The right insertion strategy depends less on row count and more on what your data already looks like.
The ORM, Core and the Driver are not competing tools. They’re optimized for different purposes:
| Method | Purpose |
| --- | --- |
| ORM (add_all) | Business logic, correctness, small batches |
| ORM (bulk_save_objects) | ORM objects at scale |
| Core (execute) | Structured data, light abstraction |
| Driver (executemany) | Raw rows, high throughput |
| Driver (COPY) | Bulk ingestion, ETL, firehose workloads |
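One row in this table we haven’t shown yet is the ORM’s bulk path. A brief sketch, reusing the hypothetical User model and engine from the ORM example above:

```python
from sqlalchemy.orm import Session

objects = [User(name=f"user{i}", email=f"user{i}@example.com") for i in range(10_000)]

with Session(engine) as session:
    # bulk_save_objects skips most unit-of-work bookkeeping (no relationship
    # handling, no refreshed attributes) in exchange for faster inserts.
    session.bulk_save_objects(objects)
    session.commit()
```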
The ORM excels in CRUD-heavy applications where clarity and safety are most important. Think of websites and APIs. Performance is usually “good enough” and clarity matters more.
Core shines in situations where you want control without writing raw SQL. Think data ingestion, batch jobs, analytics pipelines and performance-sensitive services like ETL jobs.
You know exactly what SQL you want, but you don’t want to manage connections or dialect differences yourself.
The Driver is optimized for maximum throughput: extremely large writes such as tens of millions of rows for ML training sets, bulk loads, database maintenance or migrations, or low-latency ingestion services.
The Driver minimizes abstraction and Python overhead and gives you the highest throughput. The downside is that you have to write the SQL manually, which makes it easier to make mistakes.
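For the firehose-style workloads in the table, psycopg3 exposes Postgres’s COPY protocol directly; a minimal sketch (table and connection string again assumed):

```python
import psycopg  # psycopg3

rows = [("alice", "alice@example.com"), ("bob", "bob@example.com")]

with psycopg.connect("dbname=demo user=postgres") as conn:
    with conn.cursor() as cur:
        # COPY streams rows over a single protocol stream instead of issuing
        # batched INSERT statements, which is why it suits bulk loads so well.
        with cur.copy("COPY users (name, email) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)
```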
3) Don’t mismatch abstractions
The ORM isn’t slow. COPY isn’t magic.
Performance problems appear when we force data through an abstraction it’s not designed for:
- Using Core with SQLAlchemy ORM objects → slow because of conversion overhead
- Using the ORM with tuples → awkward and brittle
- ORM bulk inserts in an ETL process → wasted overhead
Sometimes dropping to a lower level can actually improve performance.
When to choose which?
Rule of thumb:
| Layer | Use it when… |
| --- | --- |
| ORM | You’re building an application (correctness and productivity) |
| Core | You’re moving or transforming data (balance between safety and speed) |
| Driver | You’re pushing performance limits (raw power and full responsibility) |
Conclusion
In data and AI systems, performance isn’t limited by the database. It is limited by how well our code aligns with the shape of the data and the abstractions we choose.
ORM, Core and Driver-level APIs form a spectrum from high-level safety to low-level power. All are excellent tools when used in the context they’re designed for.
The real challenge isn’t knowing which is fastest; it’s choosing the right tool for your situation.
I hope this article was as clear as I intended it to be, but if not, please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.
Happy coding!
— Mike
P.S.: Like what I’m doing? Follow me!
