From Data Warehouses and Lakes to Data Mesh: A Guide to Enterprise Data Architecture
1. Data is the Lifeblood of Digital
2. Operational (and Transactional) Data
3. Analytical Data
4. Data Warehouses & Data Lakes
5. Data Mesh & Data Products
6. Data Governance
7. Final Words
My Popular AI & Data Science articles
Unlimited Medium Access

Understand how data works at large firms

Image: Headway (Unsplash)

There’s a disconnect between data science courses and the reality of working with data in the real world.

When I landed my first analytics job at one of Australia’s ‘Big 4’ banks half a decade ago, I was confronted by a complex data landscape characterised by…

  • Challenges in accessing and working with data;
  • Competing business priorities pulling people in different directions;
  • Legacy systems that are difficult to maintain & upgrade;
  • A legacy culture resistant to data-driven insights;
  • Teams that didn’t talk to one another.

For a while, I plodded on and resigned myself to the idea that perhaps this was just the way things were in the world of enterprise data. I held faith that while our tech stack evolved at a rapid pace, the user experience would eventually catch up…

I had trained myself in data science, but actually getting to do data science wasn’t straightforward at all. Online courses don’t prepare you for this.

But here’s the kicker.

After some digging, I realised that my organisation wasn’t alone in facing these data challenges — they were industry-wide.

We’re in a melting pot of technological innovation where things are moving at breakneck speed. Data is exploding, computing power is on the rise, AI is breaking through, and consumer expectations are ever-changing.

Everyone in the analytics industry is just trying to find their footing. We’re all stumbling forward together. Fail fast and fail forward.

That’s why I penned this article.

I want to share my insights to help professionals — graduates, new business analysts and self-taught data scientists — quickly understand the data landscape at the enterprise level and set their expectations.


1. Data is the Lifeblood of Digital

Let’s first align on the crucial role data plays in today’s competitive, fast-paced business environment.

Corporations in every industry are moving towards data-driven decision-making.

At the same time, consumers increasingly expect products and services that leverage powerful analytics like AI and machine learning, trained on all the quality data the company can muster.

How the worlds of AI & machine learning intersect with enterprise analytics. Image by author

It’s what allows you to watch personalised TV shows on demand (entertainment), order food and have it delivered within an hour (groceries & shopping), and get a pre-approved mortgage in minutes (housing).

This means a forward-thinking data stack is crucial to survive and thrive, because data has become a strategic asset.

Or as British mathematician Clive Humby put it in 2006:

“Data is the new oil.”

IT departments and data platforms are no longer basement-dwellers — they’re now a core part of the enterprise strategy.

So without further ado, let’s now dive into how data is organised, processed and stored at large firms.

In brief, the landscape is split into operational data and analytical data.

A 30,000-foot view of the enterprise data landscape. Source: Z. Dehghani at MartinFowler.com with amendments by author

2. Operational (and Transactional) Data

Operational data often comes in the form of individual records that represent specific events, such as a sale, a purchase or a customer interaction, and is the information a business relies on to run its day-to-day operations.

Operational data is stored in databases and accessed by microservices, which are small software programs that help manage the data. The data is constantly being updated and represents the current state of the business.

Transactional data is an important type of operational data. Examples of transactions in my area of banking include:

  • money moving between bank accounts;
  • payments for goods and services;
  • a customer interaction with one of our channels, e.g. branch or online.

Transactional data that’s hot off the application is called source data, or System-of-Record (SOR) data. Source data is free of transformations and is the…

  • preferred data format for data scientists;
  • format of data ingested into data lakes;
  • starting point of any data lineage.

More on these ideas later.

Transactional data processing systems, called OLTP (Online Transaction Processing) systems, must handle many transactions very quickly. They rely on databases that can rapidly store and retrieve data, and keep the data accurate by enforcing rules known as ACID semantics (see the short sketch after this list):

  • Atomicity: each transaction is treated as a single unit that either fully succeeds or fully fails;
  • Consistency: a transaction can only take the data from one valid state to another;
  • Isolation: multiple transactions can occur at the same time without interfering with one another;
  • Durability: once committed, data changes are saved even if the system shuts down.
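
To make ACID concrete, here’s a minimal sketch, using Python’s built-in sqlite3 module and a hypothetical accounts table, of how an OLTP-style transfer is wrapped in a transaction so that it either fully succeeds or fully rolls back:

```python
import sqlite3

def transfer(conn: sqlite3.Connection, from_acct: int, to_acct: int, amount: float) -> None:
    """Move money between two accounts atomically: both updates commit, or neither does."""
    with conn:  # transaction scope: commits on success, rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_acct)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_acct)
        )

# Demo with an in-memory database and two hypothetical accounts
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
transfer(conn, from_acct=1, to_acct=2, amount=25.0)
print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]
```

Real OLTP databases provide the same guarantees while juggling thousands of concurrent transactions per second.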

OLTP systems are used for critical business applications that must work accurately, quickly and at scale.

In banking, these include systems that process deposits, withdrawals, transfers and balance enquiries. Specific examples include online banking systems such as web portals and mobile apps, credit and debit card authorisation systems, cheque processors and wire transfer systems that facilitate money transfers between banks.

The bread and butter of how banks interface with customers.

3. Analytical Data

Analytical data is a temporal (time-based) and aggregated (consolidated) view of an organisation’s operational or transactional data. It provides a summarised view of the facts of the organisation over time, designed to help the business:

  • gain insights into business performance (descriptive and diagnostic analytics);
  • make data-driven decisions for the future (predictive and prescriptive analytics).
From descriptive analysis to predictive modelling. Image by author

Analytical data is frequently used to create dashboards and reports (often built by data analysts) and to train machine learning models (built by data scientists) to predict things like house prices or customer churn.

In brief, analytical processing differs from transactional processing: the former focuses on analysing data, while the latter focuses on recording specific events.

Analytical processing systems typically leverage read-only data stores that hold vast volumes of historical data or business metrics. Analytics can be performed on a snapshot of the data at a given point in time.

Now, let’s connect the dots between operational and analytical data.

Operational data is transformed into analytical data through ETL pipelines, typically built by data engineers.

These ‘pipelines’ typically follow an Extract-Transform-Load (ETL) pattern — which entails extracting the data from operational systems, transforming it for one’s business needs, and loading it into a data warehouse or data lake, ready for analysis.

ETL pipelines connect operational and analytical data stores. Image by author
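
As a rough illustration, here’s a minimal ETL sketch in Python using pandas (the table name, columns and file paths are hypothetical): extract raw transactions from an operational database, transform them into a daily summary, and load the result into a columnar file, as it might land in a warehouse or lake. Writing Parquet assumes pyarrow or fastparquet is installed.

```python
import sqlite3
import pandas as pd

def extract(conn: sqlite3.Connection) -> pd.DataFrame:
    # Extract: pull raw event-level records from the operational (OLTP) store
    return pd.read_sql("SELECT account_id, amount, txn_date FROM transactions", conn)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: aggregate individual transactions into an analytical, time-based view
    raw["txn_date"] = pd.to_datetime(raw["txn_date"])
    return (
        raw.groupby(["account_id", pd.Grouper(key="txn_date", freq="D")])["amount"]
        .agg(total_spend="sum", txn_count="count")
        .reset_index()
    )

def load(summary: pd.DataFrame, path: str) -> None:
    # Load: write the analytical table to columnar storage (e.g. Parquet in a lake)
    summary.to_parquet(path, index=False)

# Hypothetical usage against an assumed operational database file
conn = sqlite3.connect("operational.db")
load(transform(extract(conn)), "daily_spend_summary.parquet")
```

In practice, enterprise pipelines run on orchestration and processing tools (e.g. Airflow, Spark or cloud-native services) rather than a single script, but the extract-transform-load shape is the same.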

4. Data Warehouses & Data Lakes

The entire analytical data plane — where the enterprise stores its analytical data — has diverged into two main architectures and technology stacks:

  • Data Warehouses;
  • Data Lakes.

Different users might perform data work at different stages throughout the enterprise architecture.

  • Data analysts often query tables and aggregate data in the data warehouse to produce effective dashboards, reports and visualisations, which business stakeholders and executives consume downstream.
  • Data scientists often work in a data lake to explore data in its raw form. This means prototyping their data wrangling and modelling in a developer (i.e. non-production) environment on live (i.e. production) data that’s been meticulously prepared by data engineers. Once the business signs off on the value of the models, machine learning engineers operationalise them into production so the models can serve both internal and external customers at scale under the watch of a 24/7 operations team (MLOps). A brief sketch of this prototyping follows below.
Data warehouses vs data lakes. Source: Z. Dehghani at MartinFowler.com with amendments by author
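
To ground that workflow, here’s a minimal sketch of the kind of prototyping a data scientist might do in a non-production environment (the file path, feature columns and churn label are all hypothetical), using pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical curated extract from the data lake, prepared upstream by data engineers
df = pd.read_parquet("lake/curated/customer_features.parquet")

features = ["tenure_months", "num_products", "avg_monthly_spend"]  # assumed columns
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

# Prototype a simple churn model; once the business signs off, ML engineers
# would productionise it with proper testing, monitoring and retraining (MLOps)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```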

For those new to enterprise IT, there are two main kinds of environments you need to grasp:

  • Non-production environments, where you build and test stuff. Change is cheap and breaking things doesn’t bring down the business. Also known as a developer (or ‘dev’) environment. Projects here are funded by the organisation’s CapEx (capital expenditure) budget.
  • Production environments, where you deploy and serve your finalised and signed-off apps, data, pipelines and systems to real customers. Your work is now live. Make sure it’s good, because change is expensive. Prod — as it’s colloquially called — comprises highly-secure, locked-down environments looked after by an operations or run team funded by the org’s OpEx (operating expenditure) budget. I wrote more on CapEx vs OpEx here.

In brief: build stuff in non-prod, deploy it into prod. Gotcha!

Alright, let’s now dive into the details of both data architectures.

4.1 Data Warehouses

Data warehouses are an established way to store structured data in a relational schema that’s optimised for read operations — primarily SQL queries to support business intelligence (BI), reporting and data visualisation.

Some key features of data warehouses:

  • Established technology: Data warehouses have been the mainstay of descriptive analytics for decades, offering the ability to query and join large volumes of historical data quickly.
  • Schema-on-Write: Data warehouses traditionally employ a Schema-on-Write approach, where the structure, or schema, of your tables is defined upfront.
A typical star schema. Image by author
  • Data modelling: While data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create data models that pre-aggregate the data to make it easier to produce reports, dashboards and interactive visualisations. A common data model — called the star schema — relies on fact tables that contain the numeric values you want to analyse (for instance, a sales amount), which are related to — hence the name relational database — dimension tables representing the entities (e.g. customers or products) you want to measure.
  • OLAP: Data in warehouses may also be aggregated and loaded into an Online Analytical Processing (OLAP) model, also known as the cube. Numeric values (measures) from fact tables are pre-aggregated across one or more dimensions — for instance, total revenue from the fact table rolled up by the dimensions time, customer and product. Visually, this looks like the intersection of the three dimensions in a 3D cube. Benefit-wise, the OLAP/cube model captures relationships that support “drill-up/drill-down” analysis, and queries are fast because the data is pre-aggregated. A small sketch of this idea follows after the list.
The “cube”. Measures (e.g. sales) are aggregated by the dimensions time, customer & product. Image by author
  • Structured data: Structured data files include readable formats like CSV and XLSX (Excel), and optimised formats like Avro, ORC & Parquet. Relational databases can also store semi-structured data like JSON files.
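
To make the fact/dimension idea concrete, here’s a minimal sketch in pandas with tiny hypothetical tables: join a fact table to its dimension tables (the ‘star’), then roll a measure up by dimension attributes, which is essentially what a star-schema SQL query or an OLAP cube pre-computes:

```python
import pandas as pd

# Hypothetical dimension tables: the entities you want to measure by
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["Savings", "Credit Card"]})
dim_customer = pd.DataFrame({"customer_id": [10, 11], "segment": ["Retail", "Business"]})

# Hypothetical fact table: numeric measures keyed to the dimensions
fact_sales = pd.DataFrame({
    "product_id":  [1, 1, 2, 2],
    "customer_id": [10, 11, 10, 11],
    "revenue":     [100.0, 250.0, 80.0, 120.0],
})

# Join facts to dimensions, then aggregate the measure by dimension attributes
star = fact_sales.merge(dim_product, on="product_id").merge(dim_customer, on="customer_id")
cube = star.pivot_table(values="revenue", index="segment", columns="product", aggfunc="sum")
print(cube)  # revenue rolled up by customer segment x product: one slice of an OLAP cube
```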

Read my Explainer 101 on data warehouses and data modelling.

4.2 Data Lakes

Data lakes are the de facto industry approach to storing large volumes of file-based data to support data science and large-scale analytical data processing scenarios.

  • Distributed compute & storage: Data lakes use distributed compute and distributed storage to process and store huge volumes of potentially unstructured data. This means the data is held and processed across potentially thousands of machines, known as a cluster. This technology took off in the 2010s, enabled by Apache Hadoop, a set of open-source big data software that empowered organisations to distribute huge amounts of data across many machines (HDFS distributed storage) and run SQL-like queries on tables stored across them (Hive & Spark distributed compute). Companies like Cloudera and Hortonworks later commercialised Apache software into packages that enabled easier onboarding and maintenance by organisations around the world.
  • Schema-on-Read: Data lakes use a Schema-on-Read paradigm where a schema is only created when the data is read. This means data can be dumped into the lake en masse without the costly need to define schemas immediately, while allowing schemas to be created for specific use cases down the track — precisely the kind of flexibility that data scientists require for modelling. See the sketch after this list.
  • Unstructured data: Data lakes are the home of unstructured data — this includes text files like TXT & DOC, audio files like MP3 & WAV, images like JPEG & PNG, videos like MP4, and even entire PDFs, social media posts, emails, webpages and sensor data. Data lakes (and NoSQL databases) also allow you to store semi-structured data, like JSON and XML files, as-is.
  • Cloud hosting: Data lakes are increasingly hosted with public cloud providers like Amazon Web Services, Microsoft Azure and Google Cloud. This elastic and scalable infrastructure enables the organisation to automatically and quickly adjust to changing demands for compute and storage while maintaining performance and paying only for exactly what it uses. There are three common kinds of cloud computing, with different divisions of shared responsibility between the cloud provider and the client. The most flexible, Infrastructure-as-a-Service (IaaS), lets you essentially rent empty space in the data centre; the cloud provider maintains the physical infrastructure and access to the internet. In contrast, the Software-as-a-Service (SaaS) model has the client renting a fully-developed software solution run over the internet (think Microsoft Office 365). For enterprise data, the most popular cloud model is the middle ground, Platform-as-a-Service (PaaS), where the provider manages the infrastructure and operating system while the client builds its data architecture and enterprise applications on top.
Cloud computing types & the shared responsibility model. Image by author
  • Streaming: Technologies like Apache Kafka have enabled data to be processed in near real-time as a perpetual stream, enabling systems that reveal instant insights and trends, or take immediate responsive action to events as they occur. For instance, the ability to send an instant mobile notification to customers who might be transferring money to scammers leverages this technology.
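
As a rough sketch of Schema-on-Read and distributed processing (assuming a PySpark environment and a purely hypothetical lake path), the schema is declared only when the raw JSON files are read for a particular use case, not when they were dumped into the lake:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The schema is defined at read time, for this use case only; the raw files stay untouched
txn_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical lake location holding raw JSON dumped by upstream systems
raw_txns = spark.read.schema(txn_schema).json("s3://my-data-lake/raw/transactions/")

# Distributed compute: the aggregation runs across however many machines the cluster provides
totals_per_account = raw_txns.groupBy("account_id").sum("amount")
totals_per_account.show()
```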

Read my Explainer 101 on the cloud computing industry.

5. Data Mesh & Data Products

Architect Zhamak Dehghani condensed the evolution — challenges, progress and failings — of the enterprise data landscape into three generations:

First generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt, and tables and reports that only a small group of specialised people understand, resulting in an under-realised positive impact on the business.

Second generation: a big data ecosystem with a data lake as a silver bullet; complex big data ecosystems and long-running batch jobs operated by a central team of hyper-specialised data engineers have created data lake monsters that at best have enabled pockets of R&D analytics; over-promised and under-realised.

Third generation: more or less similar to the previous generation, with a modern twist towards streaming for real-time data availability with architectures such as Kappa, unifying batch and stream processing for data transformation, as well as fully embracing cloud-based managed services for storage, data pipeline execution engines and machine learning platforms.

The current data lake architecture can be summarised as:

  • Centralised. All analytical data is stored in one place, managed by a central data engineering team that doesn’t have domain knowledge of the data, making it difficult to unlock its full potential or fix data quality issues stemming from the source. The opposite of a decentralised architecture that federates data ingestion to groups across the business.
  • Domain-agnostic. An architecture that strives to serve everyone without specifically catering for anyone; a jack-of-all-trades platform. The opposite of a domain-driven architecture whereby data is owned by the different business domains.
  • Monolithic. The data platform is built as one big piece that’s hard to change and upgrade. The opposite of a modular architecture that allows individual parts or microservices to be tweaked and modified.
A central data team manages a monolithic domain-agnostic data lake (or is it data monster?). Source: Data Mesh Architecture (with permission)

The problems are clear, and so are some of the solutions.

Enter data mesh.

Data mesh is a next-generation data architecture that moves away from a single centralised data team towards a decentralised design where data is owned and managed by the teams across the organisation that understand it the most, known as domains.

Importantly, each business unit or domain aims to infuse product thinking to create quality, reusable data products — a self-contained and accessible data set treated as a product by the data’s producers — which can then be published and shared across the mesh to consumers in other domains and business units, called nodes on the mesh.

Data Mesh: Individual business units share finely crafted data built to a ‘product standard’. Source: Data Mesh Architecture (with permission)

Data mesh enables teams to work independently with greater autonomy and agility, while still ensuring that data is consistent, reliable and well-governed.

Here’s an example from my job.

Right now, data for our customers, along with their transactions, products, income and liabilities, is sitting in our centralised data lake. (And across our data warehouses too.)

In the future, as we federate capabilities and ownership across the bank, the credit domain’s own data engineers will be able to independently create and manage their data pipelines, without relying on a centralised ingestion team far removed from the business and lacking in credit expertise.

This credit team will take pride in building and refining high-quality, strategic and reusable data products that can be shared with different nodes (business domains) across the mesh — providing, say, the home lending team with reliable credit information to make better decisions about approving home loans.

These same data products can also be utilised by the cards domain to develop machine learning models to better understand the behaviours of our credit card customers, so that we can offer them better services and identify those at risk.

These are examples of leveraging the strategic value of data products across the mesh.
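
There’s no single standard implementation of a data product, but as a hedged sketch you can picture one as data plus a contract: metadata naming the owning domain, an agreed schema, and quality checks the producing team runs before publishing to the mesh. The names below are purely illustrative:

```python
from dataclasses import dataclass, field
import pandas as pd

@dataclass
class DataProduct:
    """Illustrative 'data as a product' contract: owned, documented and quality-checked."""
    name: str
    owner_domain: str        # the business domain accountable for this data
    schema: dict             # agreed column contract for downstream consumers
    data: pd.DataFrame = field(repr=False)

    def validate(self) -> bool:
        # Minimal quality gate the producing domain runs before publishing to the mesh
        if not set(self.schema) <= set(self.data.columns):
            return False  # breaks the agreed column contract
        return bool(self.data[list(self.schema)].notna().all().all())  # no missing key values

# Hypothetical credit data product shared with other domains (e.g. home lending)
credit_scores = DataProduct(
    name="customer_credit_scores",
    owner_domain="credit",
    schema={"customer_id": "int64", "credit_score": "int64"},
    data=pd.DataFrame({"customer_id": [1, 2], "credit_score": [720, 650]}),
)
print(credit_scores.validate())  # True -> safe to publish to the mesh
```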

4 principles of mesh: domain-ownership, data as a product, self-serve platform, federated governance. Source: Data Mesh Architecture (with permission)

Data mesh fosters a culture of data ownership and collaboration where data is treated as a product that’s deliberately curated and seamlessly shared across teams and departments, rather than languishing in an entangled web of often-duplicated ETL pipelines crafted by siloed teams for specific ad hoc tasks.

Data mesh pushes organisations away from a costly and inefficient centralised monolith towards a scalable and forward-thinking decentralised architecture.

6. Data Governance

Data governance is like a giant game of Who’s the Boss, but for data. Just like the show, there are many complicated relationships to navigate.

It’s about determining who’s in charge of what data, who gets to access it, who needs to protect it, and what controls and monitoring are in place to ensure things don’t go wrong.

With my workplace boasting 40,000 employees, hundreds of processes and competing priorities, it can feel like a real challenge to maintain order and ensure everyone is on the same page.

To data analysts, data scientists and developers, data governance can feel like that annoying friend who always wants to know what you’re up to. But it’s absolutely crucial for organisations, especially heavily-regulated ones. Otherwise, it would be like a circus without a ringmaster — chaotic, impossible to manage, and an accident waiting to happen.

Data governance components. Image by author

Some core considerations of data governance are:

Data privacy. It’s like trying to keep your embarrassing childhood photos hidden from the world. But for businesses, it’s a lot more serious than just dodgy haircuts. Say a bank accidentally reveals all of its customers’ financial info. That’s gonna cost them a ton of money and, even more importantly, trust.

Data security. You want to make sure that your customers’ data is protected from both external threats (like hackers) and internal threats (like rogue employees). That means robust authentication systems, fault-tolerant firewalls, ironclad encryption technologies and vigilant 24/7 cybersecurity. Nobody wants their data ending up on the dark web, auctioned off to criminals.

Data quality. Think of making a sandwich — put in rotten ingredients and you’ll end up with a gross meal. If your data quality stinks, you’ll end up with unreliable insights that nobody wants to bite into. And if you’re in a regulated industry, you’d better make sure your sandwich is made with fresh ingredients, or your data may not meet your compliance obligations.
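
As a small illustration of what checking the ingredients can look like in practice (hypothetical column names and a made-up business rule, using plain pandas rather than a dedicated data-quality tool):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str = "customer_id") -> dict:
    """Basic checks: missing values, duplicate keys and obviously invalid rows."""
    return {
        "rows": len(df),
        "null_rate_per_column": df.isna().mean().round(3).to_dict(),
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
        "negative_balances": int((df["balance"] < 0).sum()),  # assumed business rule
    }

# Toy data with one missing balance, one duplicate key and one impossible value
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "balance": [150.0, None, 99.0, -20.0],
})
print(quality_report(customers))
# {'rows': 4, 'null_rate_per_column': {'customer_id': 0.0, 'balance': 0.25},
#  'duplicate_keys': 1, 'negative_balances': 1}
```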

Maintaining reliable information on how data flows through the enterprise — a.k.a. data lineage — is crucial for ensuring data quality and for troubleshooting when things go wrong.

Having weak data privacy, security and/or quality exposes the organisation to more risk.

This is where data ownership comes in. Who gets to call the shots and make decisions about the data? Who owns the risk when things go wrong?

In practice, it’s a bit like a game of hot potato where nobody really wants to hold onto the potato for too long. But someone has to take responsibility for it, so we can avoid data mishaps and keep our potato hot, fresh and secure.

The move towards data mesh aims to:

  • enhance data quality across the board (via reusable data products);
  • optimise data ownership (have appropriate domains own their data);
  • simplify data lineage (bye-bye distant ETLs into a centralised data lake).

7. Final Words

The realm of enterprise-level data can often be a perplexing one, marked by the accumulation of technical debt resulting from a cycle of experimentation followed by overcorrection, not unlike the fluctuations of the stock market.

While the stories of large companies are each unique, they share a few common threads. One such thread is the organic expansion towards an unwieldy and daunting enterprise data warehouse, subsequently succeeded by the eager embrace of a centralised data lake aimed at saving costs, concentrating expertise and magnifying the value of data.

This approach brought forth a whole new set of issues. So back we all go — this time with a dramatic swing towards decentralising data stacks and federating data management to the teams that best understand their own data.

Phew! Like a colony of penguins shuffling along a constantly-shifting expanse of ice.

To offer some personal background, the bank I work for has navigated through all the data architecture eras described in this article.

We spent decades on data warehouses. We then embarked on a now-seven-year journey to stand up a strategic data lake intended to become the cornerstone of our data infrastructure.

Long story short, our data warehouses and data lake are still around today, living together in a bit of an awkward marriage of sorts. (It’s a work in progress…)

We’ve begun our own journey to decentralise this data lake towards a data mesh. We’re busting the spaghetti-like complexity of our data landscape by leveraging the power of reusable data products.

Large companies are presently focused on busting decades of tech debt in their data landscapes. Image by author

And I’m proud to say that among the Big 4 banks of Australia, we’re apparently leading the way. This is absolutely delightful, because large blue-chip organisations aren’t typically at the forefront of technology innovation.

Like many companies, our challenges are big, as all this technical debt is the by-product of hundreds of projects, steered by thousands of colleagues who’ve come and gone over the years.

My online data science courses — kindly sponsored by my company — taught me how to wrangle data and train logistic regression models and gradient-boosted trees, but left me ill-prepared for the realities of working with data at large organisations.

On my first day, I thought I’d be handed some nice juicy data on a platter and dive straight into training models.

Hopefully I’m far along the Dunning-Kruger curve?! Source: Wikipedia

Like Forrest Gump discovered, life ain’t that straightforward.

Through trial and failure, I learnt first-hand that there are so many skills that determine your impact as a data scientist beyond what courses offer — from business engagement to data storytelling to navigating politics and all the nuances of a flawed yet constantly-evolving enterprise data landscape.

By writing this article, I hope I can spare you some of my own stumbles.

Let me know if you relate to these experiences in your own journey!

Find me on LinkedIn, Twitter & YouTube.

My Popular AI & Data Science articles

  • AI Revolution: Fast-paced Intro to Machine Learning — here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War — here
  • Generative AI Art: Midjourney & Stable Diffusion Explained — here
  • Power of Data Storytelling — Sell Stories, Not Data — here
  • Data Warehouses, Data Lakes & Data Mesh Explained — here
  • Power BI — From Data Modelling to Stunning Reports — here
  • Data Warehouses & Data Modelling — a Quick Crash Course — here
  • Machine Learning versus Mechanistic Modelling — here
  • Popular Machine Learning Performance Metrics Explained — here
  • Future of Work: Is Your Career Safe in the AI Age — here
  • Beyond ChatGPT: Search for a Truly Intelligent Machine — here
  • Regression: Predict House Prices using Python — here
  • Classification: Predict Employee Churn using Python — here
  • Python Jupyter Notebooks versus Dataiku DSS — here
  • How to Leverage Cloud Computing for Your Business — here

Unlimited Medium Access

Join Medium here and enjoy unlimited access to the best articles on the internet.

You will be directly supporting myself and other top writers. Cheers!
