Why Every Analytics Engineer Must Understand Data Architecture


Small nuances in designing your data architecture can have costly implications. Hence, in this article, I want to give you a crash course on the architectures that shape your everyday decisions — from relational databases to event-driven systems.

Before we start, I need you to remember this: your architecture determines whether your organization is like a well-planned city with efficient highways and clear zoning, or like a metropolis that grew without any planning and now has traffic jams everywhere.

I’ve seen this firsthand. An organization that had grown rapidly through acquisitions had inherited data systems from each company they bought, and no one had ever taken the time to think about how it all fit together. Customer data lived in five different CRM systems, financial data was split between three different ERPs, and every system had its own definitions of basic business concepts. Their “weekly” business review took two weeks to prepare. Six months later, after implementing a well-thought-out data architecture, they could generate the same review in under two hours.

The difference wasn’t the latest technology or huge budgets. It was simply having a thoughtful approach to how data should be organized and managed.

In this article, I’ll walk you through the core data architecture types, their strengths, weaknesses, and where each one truly shines. Fasten your seatbelts!

The data architecture evolution – image by author

1. Relational Database — The Nice Old Wine

Relational databases date all the way back to the 1970s, when Edgar F. Codd proposed the relational model. At its core, a relational database is a highly organized digital filing cabinet. Each table is a drawer dedicated to one thing: customers, orders, products. Each row is a single record, each column a particular attribute.

The relational part is where the power comes from. The database understands how tables are connected. It knows that Customer X in the customers table is the same Customer X who placed an order in the orders table. This structure is what allows us to ask complex questions using SQL.

When working with relational databases, you follow a strict rule called schema-on-write. Think of building a house: you need a detailed blueprint before you can start laying the foundation. You define every room, every window, and every doorway upfront. The data must fit this blueprint perfectly when you save it. This upfront work ensures everything is consistent and the data is trustworthy.
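To make this concrete, here is a minimal sketch of schema-on-write using Python’s built-in sqlite3 module; the table names and columns are illustrative, not taken from any real system.

```python
import sqlite3

# The schema (the "blueprint") must be declared before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce relationships between tables
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")

# Data has to fit the blueprint at write time.
conn.execute("INSERT INTO customers VALUES (1, 'Customer X')")
conn.execute("INSERT INTO orders VALUES (100, 1, 49.90)")

# An order referencing a customer that does not exist is rejected immediately.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.0)")
except sqlite3.IntegrityError as err:
    print("Rejected at write time:", err)

# Because the database knows how the tables relate, joins are straightforward.
for row in conn.execute("""
    SELECT c.name, COUNT(o.order_id) AS order_count, SUM(o.amount) AS total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""):
    print(row)
```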

The other approach, called schema-on-read, is like dumping all of your building materials into a big pile. There’s no blueprint to start with. You simply decide how to structure it when you need to build something. Flexible? Absolutely. But it puts the burden of making sense of the chaos on whoever analyzes the data later.
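As a contrast, here is a hedged sketch of schema-on-read: raw JSON events are dumped to a file with no agreed structure, and structure is only imposed when someone reads them back. The file name and fields are made up for illustration.

```python
import json

# Writing: no schema is declared; whatever shape each record has, it gets dumped.
raw_events = [
    {"customer": "X", "amount": 49.90},
    {"cust_id": 1, "total": "49,90", "currency": "EUR"},  # different names and formats
    {"customer": "Y"},                                     # missing the amount entirely
]
with open("events.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# Reading: the analyst has to decide, record by record, what the structure means.
def read_amount(event):
    if "amount" in event:
        return float(event["amount"])
    if "total" in event:
        return float(str(event["total"]).replace(",", "."))
    return None  # the burden of handling gaps falls on the reader

with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)
        print(event, "->", read_amount(event))
```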

Image by author

2. Relational Data Warehouse — The Analyst’s Playground

Relational databases were (and still are) fantastic at running the day-to-day operations of a business — processing orders, managing inventory, updating customer records. We call these operational (OLTP) systems, and they need to be lightning-fast.

But this created an enormous challenge — what I like to call the problem of analytics colliding with operations. Business leaders needed to analyze data. But running complex analytical queries on the same live database that is processing thousands of transactions a minute would grind everything to a halt. It’s like trying to do a deep inventory count in a busy supermarket during peak hours.

The solution? Create a separate playground for analysts. The relational data warehouse was born: a dedicated database built specifically for analysis, where you centralize copies of data from various operational systems.

Image by author

Two Schools of Thought: Inmon vs. Kimball

There are two fundamental approaches to building a data warehouse. The top-down approach, introduced by Bill Inmon — “the father of the data warehouse” — starts with designing the overall, normalized data warehouse first, then creating department-specific data marts from it. It gives you consistent data representation and reduced duplication, but comes with high upfront costs and complexity.

The bottom-up approach, championed by Ralph Kimball, flips this around. You start by building individual data marts for specific departments using denormalized fact and dimension tables. Over time, these connect via conformed dimensions to form a unified view. It’s faster to start, more flexible, and cheaper, but risks inconsistencies and data silos if not managed carefully. A sketch of what querying such a star schema looks like follows below.

Image by author
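To make the Kimball-style fact and dimension modeling concrete, here is a minimal sketch of a denormalized star schema queried with Python’s sqlite3; the table and column names (fact_sales, dim_date, dim_product) are illustrative assumptions, not a prescribed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables hold descriptive attributes; the fact table holds measures
# plus foreign keys pointing at the dimensions.
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT, name TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );

    INSERT INTO dim_date    VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'Books', 'SQL Basics'), (2, 'Books', 'Data Modeling');
    INSERT INTO fact_sales  VALUES (20240101, 1, 3, 89.70), (20240201, 2, 1, 39.90);
""")

# A typical analytical question: revenue per category, year, and month.
query = """
    SELECT p.category, d.year, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.year, d.month
    ORDER BY d.year, d.month
"""
for row in conn.execute(query):
    print(row)
```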

3. Data Lake — The Promise and the Swamp

Around 2010, a new concept emerged promising to solve all our problems (sounds familiar?): the data lake. The sales pitch was alluring — unlike a structured data warehouse, a data lake was essentially an enormous, low-cost storage space. You don’t need a plan, just dump everything in there: structured data, log files, PDFs, social media feeds, even images and videos.

This is the schema-on-read approach in practice. And unlike relational data warehouses, which provide both storage and compute, a data lake is storage only — no bundled compute engine. It relies on object storage, which doesn’t require data to be structured in tabular form.

For a while, the hype was real. Then reality hit. Storing data in a lake was easy — getting it back out in a useful way was incredibly difficult. Business users were told to help themselves to the raw files, but most of them didn’t have advanced coding skills. The crystal-clear lake quickly became a murky, unusable data swamp.

Image by author

But the data lake didn’t disappear. The industry realized the initial vision was flawed, but the core technology remained incredibly useful. Today, the data lake has found its true calling — not as a warehouse substitute, but as a staging and preparation area: the perfect place to land raw data before deciding what to clean, transform, and promote for reliable analysis.

4. Data Lakehouse — The Best of Both Worlds

When you combine a data warehouse and a data lake, what do you get? A data lakehouse. Databricks pioneered this term around 2020, and the concept has been gaining serious traction ever since.

I can almost hear you asking: so what actually changed?

Fair question. There was a single change to the classic data lake approach, but it was large enough to shift the entire paradigm: adding a transactional storage layer on top of the existing data lake storage. This layer, exemplified by Delta Lake, Apache Iceberg, and Apache Hudi, enables the data lake to work more like a traditional relational database management system, with ACID transactions, schema enforcement, and time travel.

The one change that shifted the entire paradigm – image by author

The lakehouse promotes a compelling idea: remove the need for a separate relational data warehouse and leverage only a data lake for your entire architecture. All data formats (structured, semi-structured, and unstructured) are stored in the lake, and all analysis happens directly from it. The transactional layer is the missing ingredient that makes this feasible.
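As a hedged sketch of what that transactional layer enables in practice, here is what writing, schema enforcement, and time travel could look like with the open-source deltalake Python package (delta-rs); the file path and columns are made up, and Iceberg or Hudi would express the same ideas with their own APIs.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/orders_delta"  # plain file/object storage underneath

# Version 0: the initial write creates the table and records its schema.
write_deltalake(path, pd.DataFrame({"order_id": [1, 2], "amount": [49.9, 19.9]}))

# Version 1: an ACID append; readers never see a half-written state.
write_deltalake(path, pd.DataFrame({"order_id": [3], "amount": [99.0]}), mode="append")

# Schema enforcement: appending data with a mismatched schema is rejected.
try:
    write_deltalake(path, pd.DataFrame({"order_id": [4], "comment": ["oops"]}), mode="append")
except Exception as err:
    print("Append rejected:", err)

# Time travel: read the table as it looked at an earlier version.
print(DeltaTable(path).to_pandas())             # latest state
print(DeltaTable(path, version=0).to_pandas())  # as of the first write
```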

5. Data Mesh — Decentralizing Data Ownership

So data lakehouses solved the storage and analysis problem. Case closed, right? Not exactly. As companies grew, even a great centralized data platform created a new bottleneck.

Think of your central data team as the kitchen of a highly popular restaurant. Marketing, Sales, Finance, and Logistics all place complex “orders” (data requests). The kitchen staff — your data engineers — are skilled but swamped. They don’t have a deep, nuanced understanding of every “dish.” The marketing team asks for a customer segmentation, and the kitchen first has to ask clarifying questions about what the request even means. The result? A long line of frustrated “customers” and a burned-out kitchen staff.

Data mesh asks a radical question: what if, instead of one central kitchen, we gave each department its own specialized kitchen station? And what if we made the domain experts — the people who truly know their own data — responsible for preparing high-quality data products for everyone else?

The four pillars of data mesh architecture – image by author

Data mesh rests on four key principles: domain-oriented ownership (the people closest to the data own it), data as a product (treated with the same care as any customer-facing product), a self-serve data platform (the central team provides the infrastructure, domains build the products), and federated computational governance (global standards enforced through a council with domain representatives).
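One way teams make the “data as a product” principle tangible is a machine-readable product descriptor that travels with the dataset. The sketch below is purely hypothetical: the field names, the SLA, and the checks are illustrative, not part of any data mesh standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A hypothetical descriptor a domain team might publish with its dataset."""
    name: str
    owner_domain: str          # domain-oriented ownership: a team, not a central group
    description: str
    schema: dict               # column name -> type, the published contract
    freshness_sla_hours: int   # how stale the data is allowed to get
    quality_checks: list = field(default_factory=list)

customer_segments = DataProduct(
    name="customer_segments",
    owner_domain="marketing",
    description="Active customers with their current segment label.",
    schema={"customer_id": "string", "segment": "string", "updated_at": "timestamp"},
    freshness_sla_hours=24,
    quality_checks=["customer_id is unique", "segment is never null"],
)

print(customer_segments)
```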

6. Event-Driven Architecture — The Gossipy Neighbor

Now let’s switch gears. Think of event-driven architecture as the gossipy neighbor approach to data — systems that react immediately to things happening, rather than constantly checking for updates. Instead of System B asking System A every five minutes “Hey, did anything happen yet?” (like checking your fridge hoping food has magically appeared), an event-driven system taps you on the shoulder the moment something important occurs.

A customer places an order? That’s an event. The system that creates it is the producer. The systems that listen and react are consumers. And the intermediary where events get posted is the event broker — think Apache Kafka, Azure Event Hubs, or Eventstream in Microsoft Fabric.

Image by author
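Here is a minimal sketch of the producer/consumer/broker triangle using the kafka-python client; it assumes a Kafka broker is reachable on localhost:9092 and that an "orders" topic exists, both of which are assumptions made for illustration.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # assumed local broker
TOPIC = "orders"            # assumed topic name

# Producer: the system where the event originates publishes it and moves on.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"event": "order_placed", "order_id": 1, "amount": 49.9})
producer.flush()

# Consumer: any interested system subscribes independently; the producer never
# knows (or cares) who is listening.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print("Reacting to:", message.value)
    break  # just one event for the demo
```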

The beauty lies in the decoupling. The Marketing team can spin up a new service that listens to “Customer Signed Up” events without requiring the Sales team to change a single line of code. If the welcome email service crashes, new customers still get signed up — the events just pile up in the broker, waiting for the service to recover.

But this power comes with trade-offs. You now have a new piece of infrastructure to manage. Debugging gets harder, because when something goes wrong, tracing a single event across multiple decoupled systems can be a serious challenge. And the broker doesn’t always guarantee the order of delivery: you might get an “Order Shipped” event before the “Order Paid” event.

When to use it: real-time analytics (IoT, clickstream, fraud detection), microservices integration, and asynchronous workflows.

When NOT to use it: simple CRUD apps, tightly coupled workflows requiring immediate guaranteed responses, and strictly transactional systems where multi-step processes must succeed or fail atomically.

The Cheat Sheet

There’s no magic bullet — each architecture has its place. Here’s a quick comparison to help you decide:

Relational database: day-to-day operational (OLTP) workloads that need fast, consistent transactions.
Relational data warehouse: centralized, structured analytics on copies of operational data.
Data lake: cheap storage for raw, diverse data; ideal as a staging and preparation area.
Data lakehouse: a single platform for storage and analysis across all data formats, thanks to the transactional layer.
Data mesh: large organizations where a central data team has become the bottleneck.
Event-driven architecture: real-time reactions and loosely coupled integrations between systems.

The Key Takeaway

Understanding when to use what is the crucial skill for any analytics engineer. Every day, you make decisions about how to structure data, where to store it, how to transform it, and how to make it accessible. These decisions might seem minor in the moment, but they add up to create the foundation your entire analytics ecosystem sits on.

The data architecture landscape has evolved from normalized relational databases, through the “don’t touch the live system!” era of data warehouses, past the spectacular rise and fall (and redemption) of data lakes, into the lakehouse paradigm that gives us the best of both worlds. Modern approaches like data mesh push ownership to the people closest to the data, and event-driven architectures let systems react immediately rather than constantly polling for updates.

Thanks for reading!
