From Data Warehouses and Lakes to Data Mesh: A Guide to Enterprise Data Architecture
1. Data is the Lifeblood of Digital
2. Operational (and Transactional) Data
3. Analytical Data
4. Data Warehouses & Data Lakes
5. Data Mesh & Data Products
6. Data Governance
7. Final Words
My Popular AI & Data Science articles
Unlimited Medium Access

Understand how data works at large firms

Image: Headway (Unsplash)

There’s a disconnect between data science courses and the reality of working with data in the real world.

When I landed my first analytics job at one of Australia’s ‘Big 4’ banks half a decade ago, I was confronted by a complex data landscape characterised by…

  • Challenges in accessing and working with data;
  • Competing business priorities pulling people in different directions;
  • Legacy systems that are difficult to maintain & upgrade;
  • A legacy culture resistant to data-driven insights;
  • Teams that didn’t talk to one another.

For a while, I plodded on and resigned myself to the idea that perhaps this was just the way things were in the world of enterprise data. I held faith that while our tech stack evolved at a rapid pace, the user experience would eventually catch up…

I had trained myself in data science, but actually getting to do data science wasn’t straightforward at all. Online courses don’t prepare you for this.

But here’s the kicker.

After some digging, I realised that my organisation wasn’t alone in facing these data challenges — they were industry-wide.

We’re in a melting pot of technological innovation where things are moving at breakneck speed. Data is exploding, computing power is on the rise, AI is breaking through, and consumer expectations are ever-changing.

Everyone in the analytics industry is just trying to find their footing. We’re all stumbling forward together. Fail fast and fail forward.

That’s why I penned this article.

I want to share my insights to help professionals — graduates, new business analysts and self-taught data scientists — quickly understand the data landscape at the enterprise level and set their expectations.


1. Data is the Lifeblood of Digital

Let’s first align on the crucial role data plays in today’s competitive, fast-paced business environment.

Corporations in every industry are moving towards data-driven decision-making.

At the same time, consumers increasingly expect products and services that leverage powerful analytics like AI and machine learning, trained on all the quality data the company can muster.

How the worlds of AI & machine learning intersect with enterprise analytics. Image by author

It’s what allows you to watch personalised TV shows on demand (entertainment), order food and have it delivered within an hour (groceries & shopping), and get a pre-approved mortgage in minutes (housing).

This means a forward-thinking data stack is crucial to survive and thrive, because data has become a strategic asset.

Or as British mathematician Clive Humby put it in 2006:

“Data is the new oil.”

IT departments and data platforms are no longer basement-dwellers — they’re now a core part of the enterprise strategy.

So without further ado, let’s now dive into how data is organised, processed and stored at large firms.

In brief, the landscape is split into operational data and analytical data.

A 30,000-foot view of the enterprise data landscape. Source: Z. Dehghani at MartinFowler.com with amendments by author

2. Operational (and Transactional) Data

Operational data often comes in the form of individual records that represent specific events, such as a sale, a purchase or a customer interaction, and is the information a business relies on to run its day-to-day operations.

Operational data is stored in databases and accessed by microservices, which are small software programs that help manage the data. The data is constantly being updated and represents the current state of the business.

Transactional data is an important type of operational data. Examples of transactions in my area of banking include:

  • money moving between bank accounts;
  • payments for goods and services;
  • a customer interaction with one of our channels, e.g. branch or online.

Transactional data that’s hot off the application is called source data, or System-of-Record (SOR) data. Source data is free of transformations and is the…

  • preferred data format for data scientists;
  • format of data ingested into data lakes;
  • starting point of any data lineage.

More on these ideas later.

Transactional data processing systems, called OLTP (Online Transaction Processing) systems, must handle many transactions very quickly. They rely on databases that can rapidly store and retrieve data, and keep the data accurate by enforcing rules known as ACID semantics (see the short sketch after this list):

  • Atomicity: each transaction is treated as a single unit that either fully succeeds or fully fails;
  • Consistency: a transaction can only take the data from one valid state to another;
  • Isolation: multiple transactions can occur at the same time without interfering with one another;
  • Durability: once committed, data changes are saved even if the system shuts down.
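
To make ACID concrete, here’s a minimal sketch, using Python’s built-in sqlite3 module and a hypothetical accounts table, of how an OLTP-style transfer is wrapped in a transaction so that it either fully succeeds or fully rolls back:

```python
import sqlite3

def transfer(conn: sqlite3.Connection, from_acct: int, to_acct: int, amount: float) -> None:
    """Move money between two accounts atomically: both updates commit, or neither does."""
    with conn:  # transaction scope: commits on success, rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_acct)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_acct)
        )

# Demo with an in-memory database and two hypothetical accounts
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
transfer(conn, from_acct=1, to_acct=2, amount=25.0)
print(conn.execute("SELECT id, balance FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]
```

Real OLTP databases provide the same guarantees while juggling thousands of concurrent transactions per second.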

OLTP systems are used for critical business applications that must work accurately, quickly and at scale.

In banking, these include systems that process deposits, withdrawals, transfers and balance enquiries. Specific examples include online banking systems such as web portals and mobile apps, credit and debit card authorisation systems, cheque processors and wire transfer systems that facilitate money transfers between banks.

The bread and butter of how banks interface with customers.

3. Analytical Data

Analytical data is a temporal (time-based) and aggregated (consolidated) view of an organisation’s operational or transactional data. It provides a summarised view of the facts of the organisation over time, designed to help the business:

  • gain insights into business performance (descriptive and diagnostic analytics);
  • make data-driven decisions for the future (predictive and prescriptive analytics).
From descriptive analysis to predictive modelling. Image by author

Analytical data is frequently used to create dashboards and reports (often built by data analysts) and to train machine learning models (built by data scientists) to predict things like house prices or customer churn.

In brief, analytical processing differs from transactional processing: the former focuses on analysing data, while the latter focuses on recording specific events.

Analytical processing systems typically leverage read-only data stores that hold vast volumes of historical data or business metrics. Analytics can be performed on a snapshot of the data at a given point in time.

Now, let’s connect the dots between operational and analytical data.

Operational data is transformed into analytical data through ETL pipelines, typically built by data engineers.

These ‘pipelines’ typically follow an Extract-Transform-Load (ETL) pattern — which entails extracting the data from operational systems, transforming it for one’s business needs, and loading it into a data warehouse or data lake, ready for analysis.

ETL pipelines connect operational and analytical data stores. Image by author
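
As a rough illustration, here’s a minimal ETL sketch in Python using pandas (the table name, columns and file paths are hypothetical): extract raw transactions from an operational database, transform them into a daily summary, and load the result into a columnar file, as it might land in a warehouse or lake. Writing Parquet assumes pyarrow or fastparquet is installed.

```python
import sqlite3
import pandas as pd

def extract(conn: sqlite3.Connection) -> pd.DataFrame:
    # Extract: pull raw event-level records from the operational (OLTP) store
    return pd.read_sql("SELECT account_id, amount, txn_date FROM transactions", conn)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: aggregate individual transactions into an analytical, time-based view
    raw["txn_date"] = pd.to_datetime(raw["txn_date"])
    return (
        raw.groupby(["account_id", pd.Grouper(key="txn_date", freq="D")])["amount"]
        .agg(total_spend="sum", txn_count="count")
        .reset_index()
    )

def load(summary: pd.DataFrame, path: str) -> None:
    # Load: write the analytical table to columnar storage (e.g. Parquet in a lake)
    summary.to_parquet(path, index=False)

# Hypothetical usage against an assumed operational database file
conn = sqlite3.connect("operational.db")
load(transform(extract(conn)), "daily_spend_summary.parquet")
```

In practice, enterprise pipelines run on orchestration and processing tools (e.g. Airflow, Spark or cloud-native services) rather than a single script, but the extract-transform-load shape is the same.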

4. Data Warehouses & Data Lakes

The entire analytical data plane — where the enterprise stores its analytical data — has diverged into two main architectures and technology stacks:

  • Data Warehouses;
  • Data Lakes.

Different users might perform data work at different stages throughout the enterprise architecture.

  • Data analysts often query tables and aggregate data in the data warehouse to produce effective dashboards, reports and visualisations, which business stakeholders and executives consume downstream.
  • Data scientists often work in a data lake to explore data in its raw form. This means prototyping their data wrangling and modelling in a developer (i.e. non-production) environment on live (i.e. production) data that’s been meticulously prepared by data engineers. Once the business signs off on the value of the models, machine learning engineers operationalise them into production so the models can serve both internal and external customers at scale under the watch of a 24/7 operations team (MLOps). A brief sketch of this prototyping follows below.
Data warehouses vs data lakes. Source: Z. Dehghani at MartinFowler.com with amendments by author
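
To ground that workflow, here’s a minimal sketch of the kind of prototyping a data scientist might do in a non-production environment (the file path, feature columns and churn label are all hypothetical), using pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical curated extract from the data lake, prepared upstream by data engineers
df = pd.read_parquet("lake/curated/customer_features.parquet")

features = ["tenure_months", "num_products", "avg_monthly_spend"]  # assumed columns
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

# Prototype a simple churn model; once the business signs off, ML engineers
# would productionise it with proper testing, monitoring and retraining (MLOps)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```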

For those new to enterprise IT, there are two main kinds of environments you need to grasp:

  • Non-production environments, where you build and test stuff. Change is cheap and breaking things doesn’t bring down the business. Also known as a developer (or ‘dev’) environment. Projects here are funded by the organisation’s CapEx (capital expenditure) budget.
  • Production environments, where you deploy and serve your finalised and signed-off apps, data, pipelines and systems to real customers. Your work is now live. Make sure it’s good, because change is expensive. Prod — as it’s colloquially called — comprises highly-secure, locked-down environments looked after by an operations or run team funded by the org’s OpEx (operating expenditure) budget. I wrote more on CapEx vs OpEx here.

In brief: build stuff in non-prod, deploy it into prod. Gotcha!

Alright, let’s now dive into the details of both data architectures.

4.1 Data Warehouses

Data warehouses are an established way to store structured data in a relational schema that’s optimised for read operations — primarily SQL queries to support business intelligence (BI), reporting and data visualisation.

Some key features of data warehouses:

  • Established technology: Data warehouses have been the mainstay of descriptive analytics for decades, offering the ability to query and join large volumes of historical data quickly.
  • Schema-on-Write: Data warehouses traditionally employ a Schema-on-Write approach, where the structure, or schema, of your tables is defined upfront.
A typical star schema. Image by author
  • Data modelling: While data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create data models that pre-aggregate the data to make it easier to produce reports, dashboards and interactive visualisations. A common data model — called the star schema — relies on fact tables that contain the numeric values you want to analyse (for instance, a sales amount), which are related to — hence the name relational database — dimension tables representing the entities (e.g. customers or products) you want to measure.
  • OLAP: Data in warehouses may also be aggregated and loaded into an Online Analytical Processing (OLAP) model, also known as the cube. Numeric values (measures) from fact tables are pre-aggregated across one or more dimensions — for instance, total revenue from the fact table rolled up by the dimensions time, customer and product. Visually, this looks like the intersection of the three dimensions in a 3D cube. Benefit-wise, the OLAP/cube model captures relationships that support “drill-up/drill-down” analysis, and queries are fast because the data is pre-aggregated. A small sketch of this idea follows after the list.
The “cube”. Measures (e.g. sales) are aggregated by the dimensions time, customer & product. Image by author
  • Structured data: Structured data files include readable formats like CSV and XLSX (Excel), and optimised formats like Avro, ORC & Parquet. Relational databases can also store semi-structured data like JSON files.
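
To make the fact/dimension idea concrete, here’s a minimal sketch in pandas with tiny hypothetical tables: join a fact table to its dimension tables (the ‘star’), then roll a measure up by dimension attributes, which is essentially what a star-schema SQL query or an OLAP cube pre-computes:

```python
import pandas as pd

# Hypothetical dimension tables: the entities you want to measure by
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["Savings", "Credit Card"]})
dim_customer = pd.DataFrame({"customer_id": [10, 11], "segment": ["Retail", "Business"]})

# Hypothetical fact table: numeric measures keyed to the dimensions
fact_sales = pd.DataFrame({
    "product_id":  [1, 1, 2, 2],
    "customer_id": [10, 11, 10, 11],
    "revenue":     [100.0, 250.0, 80.0, 120.0],
})

# Join facts to dimensions, then aggregate the measure by dimension attributes
star = fact_sales.merge(dim_product, on="product_id").merge(dim_customer, on="customer_id")
cube = star.pivot_table(values="revenue", index="segment", columns="product", aggfunc="sum")
print(cube)  # revenue rolled up by customer segment x product: one slice of an OLAP cube
```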

Read my Explainer 101 on data warehouses and data modelling.

4.2 Data Lakes

Data lakes are the de facto industry approach to storing large volumes of file-based data to support data science and large-scale analytical data processing scenarios.

  • Distributed compute & storage: Data lakes use distributed compute and distributed storage to process and store huge volumes of potentially unstructured data. This means the data is held and processed across potentially thousands of machines, known as a cluster. This technology took off in the 2010s, enabled by Apache Hadoop, a set of open-source big data software that empowered organisations to distribute huge amounts of data across many machines (HDFS distributed storage) and run SQL-like queries on tables stored across them (Hive & Spark distributed compute). Companies like Cloudera and Hortonworks later commercialised Apache software into packages that enabled easier onboarding and maintenance by organisations around the world.
  • Schema-on-Read: Data lakes use a Schema-on-Read paradigm where a schema is only created when the data is read. This means data can be dumped into the lake en masse without the costly need to define schemas immediately, while allowing schemas to be created for specific use cases down the track — precisely the kind of flexibility that data scientists require for modelling. See the sketch after this list.
  • Unstructured data: Data lakes are the home of unstructured data — this includes text files like TXT & DOC, audio files like MP3 & WAV, images like JPEG & PNG, videos like MP4, and even entire PDFs, social media posts, emails, webpages and sensor data. Data lakes (and NoSQL databases) also allow you to store semi-structured data, like JSON and XML files, as-is.
  • Cloud hosting: Data lakes are increasingly hosted with public cloud providers like Amazon Web Services, Microsoft Azure and Google Cloud. This elastic and scalable infrastructure enables the organisation to automatically and quickly adjust to changing demands for compute and storage while maintaining performance and paying only for exactly what it uses. There are three common kinds of cloud computing, with different divisions of shared responsibility between the cloud provider and the client. The most flexible, Infrastructure-as-a-Service (IaaS), lets you essentially rent empty space in the data centre; the cloud provider maintains the physical infrastructure and access to the internet. In contrast, the Software-as-a-Service (SaaS) model has the client renting a fully-developed software solution run over the internet (think Microsoft Office 365). For enterprise data, the most popular cloud model is the middle ground, Platform-as-a-Service (PaaS), where the provider manages the infrastructure and operating system while the client builds its data architecture and enterprise applications on top.
Cloud computing types & the shared responsibility model. Image by author
  • Streaming: Technologies like Apache Kafka have enabled data to be processed in near real-time as a perpetual stream, enabling systems that reveal instant insights and trends, or take immediate responsive action to events as they occur. For instance, the ability to send an instant mobile notification to customers who might be transferring money to scammers leverages this technology.
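
As a rough sketch of Schema-on-Read and distributed processing (assuming a PySpark environment and a purely hypothetical lake path), the schema is declared only when the raw JSON files are read for a particular use case, not when they were dumped into the lake:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The schema is defined at read time, for this use case only; the raw files stay untouched
txn_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical lake location holding raw JSON dumped by upstream systems
raw_txns = spark.read.schema(txn_schema).json("s3://my-data-lake/raw/transactions/")

# Distributed compute: the aggregation runs across however many machines the cluster provides
totals_per_account = raw_txns.groupBy("account_id").sum("amount")
totals_per_account.show()
```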

Read my Explainer 101 on the cloud computing industry.

5. Data Mesh & Data Products

Architect Zhamak Dehghani condensed the evolution — challenges, progress and failings — of the enterprise data landscape into three generations:

First generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt, and tables and reports that only a small group of specialised people understand, resulting in an under-realised positive impact on the business.

Second generation: a big data ecosystem with a data lake as a silver bullet; complex big data ecosystems and long-running batch jobs operated by a central team of hyper-specialised data engineers have created data lake monsters that at best have enabled pockets of R&D analytics; over-promised and under-realised.

Third generation: more or less similar to the previous generation, with a modern twist towards streaming for real-time data availability with architectures such as Kappa, unifying batch and stream processing for data transformation, as well as fully embracing cloud-based managed services for storage, data pipeline execution engines and machine learning platforms.

The current data lake architecture can be summarised as:

  • Centralised. All analytical data is stored in one place, managed by a central data engineering team that doesn’t have domain knowledge of the data, making it difficult to unlock its full potential or fix data quality issues stemming from the source. The opposite of a decentralised architecture that federates data ingestion to groups across the business.
  • Domain-agnostic. An architecture that strives to serve everyone without specifically catering for anyone; a jack-of-all-trades platform. The opposite of a domain-driven architecture whereby data is owned by the different business domains.
  • Monolithic. The data platform is built as one big piece that’s hard to change and upgrade. The opposite of a modular architecture that allows individual parts or microservices to be tweaked and modified.
A central data team manages a monolithic domain-agnostic data lake (or is it data monster?). Source: Data Mesh Architecture (with permission)

The problems are clear, and so are some of the solutions.

Enter data mesh.

Data mesh is a next-generation data architecture that moves away from a single centralised data team towards a decentralised design where data is owned and managed by the teams across the organisation that understand it the most, known as domains.

Importantly, each business unit or domain aims to infuse product thinking to create quality, reusable data products — a self-contained and accessible data set treated as a product by the data’s producers — which can then be published and shared across the mesh to consumers in other domains and business units, called nodes on the mesh.

Data Mesh: Individual business units share finely crafted data built to a ‘product standard’. Source: Data Mesh Architecture (with permission)

Data mesh enables teams to work independently with greater autonomy and agility, while still ensuring that data is consistent, reliable and well-governed.

Here’s an example from my job.

Right now, data for our customers, along with their transactions, products, income and liabilities, is sitting in our centralised data lake. (And across our data warehouses too.)

In the future, as we federate capabilities and ownership across the bank, the credit domain’s own data engineers will be able to independently create and manage their data pipelines, without relying on a centralised ingestion team far removed from the business and lacking in credit expertise.

This credit team will take pride in building and refining high-quality, strategic and reusable data products that can be shared with different nodes (business domains) across the mesh — providing, say, the home lending team with reliable credit information to make better decisions about approving home loans.

These same data products can also be utilised by the cards domain to develop machine learning models to better understand the behaviours of our credit card customers, so that we can offer them better services and identify those at risk.

These are examples of leveraging the strategic value of data products across the mesh.
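
There’s no single standard implementation of a data product, but as a hedged sketch you can picture one as data plus a contract: metadata naming the owning domain, an agreed schema, and quality checks the producing team runs before publishing to the mesh. The names below are purely illustrative:

```python
from dataclasses import dataclass, field
import pandas as pd

@dataclass
class DataProduct:
    """Illustrative 'data as a product' contract: owned, documented and quality-checked."""
    name: str
    owner_domain: str        # the business domain accountable for this data
    schema: dict             # agreed column contract for downstream consumers
    data: pd.DataFrame = field(repr=False)

    def validate(self) -> bool:
        # Minimal quality gate the producing domain runs before publishing to the mesh
        if not set(self.schema) <= set(self.data.columns):
            return False  # breaks the agreed column contract
        return bool(self.data[list(self.schema)].notna().all().all())  # no missing key values

# Hypothetical credit data product shared with other domains (e.g. home lending)
credit_scores = DataProduct(
    name="customer_credit_scores",
    owner_domain="credit",
    schema={"customer_id": "int64", "credit_score": "int64"},
    data=pd.DataFrame({"customer_id": [1, 2], "credit_score": [720, 650]}),
)
print(credit_scores.validate())  # True -> safe to publish to the mesh
```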

4 principles of mesh: domain-ownership, data as a product, self-serve platform, federated governance. Source: Data Mesh Architecture (with permission)

Data mesh fosters a culture of data ownership and collaboration where data is treated as a product that’s deliberately curated and seamlessly shared across teams and departments, rather than languishing in an entangled web of often-duplicated ETL pipelines crafted by siloed teams for specific ad hoc tasks.

Data mesh pushes organisations away from a costly and inefficient centralised monolith towards a scalable and forward-thinking decentralised architecture.

6. Data Governance

Data governance is like a giant game of Who’s the Boss, but for data. Just like the show, there are many complicated relationships to navigate.

It’s about determining who’s in charge of what data, who gets to access it, who needs to protect it, and what controls and monitoring are in place to ensure things don’t go wrong.

With my workplace boasting 40,000 employees, hundreds of processes and competing priorities, it can feel like a real challenge to maintain order and ensure everyone is on the same page.

To data analysts, data scientists and developers, data governance can feel like that annoying friend who always wants to know what you’re up to. But it’s absolutely crucial for organisations, especially heavily-regulated ones. Otherwise, it would be like a circus without a ringmaster — chaotic, impossible to manage, and an accident waiting to happen.

Data governance components. Image by author

Some core considerations of data governance are:

Data privacy. It’s like trying to keep your embarrassing childhood photos hidden from the world. But for businesses, it’s a lot more serious than just dodgy haircuts. Say a bank accidentally reveals all of its customers’ financial info. That’s gonna cost them a ton of money and, even more importantly, trust.

Data security. You want to make sure that your customers’ data is protected from both external threats (like hackers) and internal threats (like rogue employees). That means robust authentication systems, fault-tolerant firewalls, ironclad encryption technologies and vigilant 24/7 cybersecurity. Nobody wants their data ending up on the dark web, auctioned off to criminals.

Data quality. Think of making a sandwich — put in rotten ingredients and you’ll end up with a gross meal. If your data quality stinks, you’ll end up with unreliable insights that nobody wants to bite into. And if you’re in a regulated industry, you’d better make sure your sandwich is made with fresh ingredients, or your data may not meet your compliance obligations.
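
As a small illustration of what checking the ingredients can look like in practice (hypothetical column names and a made-up business rule, using plain pandas rather than a dedicated data-quality tool):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str = "customer_id") -> dict:
    """Basic checks: missing values, duplicate keys and obviously invalid rows."""
    return {
        "rows": len(df),
        "null_rate_per_column": df.isna().mean().round(3).to_dict(),
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
        "negative_balances": int((df["balance"] < 0).sum()),  # assumed business rule
    }

# Toy data with one missing balance, one duplicate key and one impossible value
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "balance": [150.0, None, 99.0, -20.0],
})
print(quality_report(customers))
# {'rows': 4, 'null_rate_per_column': {'customer_id': 0.0, 'balance': 0.25},
#  'duplicate_keys': 1, 'negative_balances': 1}
```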

Maintaining reliable information on how data flows through the enterprise — a.k.a. data lineage — is crucial for ensuring data quality and for troubleshooting when things go wrong.

Having weak data privacy, security and/or quality exposes the organisation to more risk.

This is where data ownership comes in. Who gets to call the shots and make decisions about the data? Who owns the risk when things go wrong?

In practice, it’s a bit like a game of hot potato where nobody really wants to hold onto the potato for too long. But someone has to take responsibility for it, so we can avoid data mishaps and keep our potato hot, fresh and secure.

The move towards data mesh aims to:

  • enhance data quality across the board (via reusable data products);
  • optimise data ownership (have appropriate domains own their data);
  • simplify data lineage (bye-bye distant ETLs into a centralised data lake).

7. Final Words

The realm of enterprise-level data can often be a perplexing one, marked by the accumulation of technical debt resulting from a cycle of experimentation followed by overcorrection, not unlike the fluctuations of the stock market.

While the stories of large companies are each unique, they share a few common threads. One such thread is the organic expansion towards an unwieldy and daunting enterprise data warehouse, subsequently succeeded by the eager embrace of a centralised data lake aimed at saving costs, concentrating expertise and magnifying the value of data.

This approach brought forth a whole new set of issues. So back we all go — this time with a dramatic swing towards decentralising data stacks and federating data management to the teams that best understand their own data.

Phew! Like a colony of penguins shuffling along a constantly-shifting expanse of ice.

To offer some personal background, the bank I work for has navigated through all the data architecture eras described in this article.

We spent decades on data warehouses. We then embarked on a now-seven-year journey to stand up a strategic data lake intended to become the cornerstone of our data infrastructure.

Long story short, our data warehouses and data lake are still around today, living together in a bit of an awkward marriage of sorts. (It’s a work in progress…)

We’ve begun our own journey to decentralise this data lake towards a data mesh. We’re busting the spaghetti-like complexity of our data landscape by leveraging the power of reusable data products.

Large companies are presently focused on busting decades of tech debt in their data landscapes. Image by author

And I’m proud to say that among the Big 4 banks of Australia, we’re apparently leading the way. This is absolutely delightful, because large blue-chip organisations aren’t typically at the forefront of technology innovation.

Like many companies, our challenges are big, as all this technical debt is the by-product of hundreds of projects, steered by thousands of colleagues who’ve come and gone over the years.

My online data science courses — kindly sponsored by my company — taught me how to wrangle data and train logistic regression models and gradient-boosted trees, but left me ill-prepared for the realities of working with data at large organisations.

On my first day, I thought I’d be handed some nice juicy data on a platter and dive straight into training models.

Hopefully I’m far along the Dunning-Kruger curve?! Source: Wikipedia

Like Forrest Gump discovered, life ain’t that straightforward.

Through trial and failure, I learnt first-hand that there are so many skills that determine your impact as a data scientist beyond what courses offer — from business engagement to data storytelling to navigating politics and all the nuances of a flawed yet constantly-evolving enterprise data landscape.

By writing this article, I hope I can spare you some of my own stumbles.

Let me know if you relate to these experiences in your own journey!

Find me on LinkedIn, Twitter & YouTube.

My Popular AI & Data Science articles

  • AI Revolution: Fast-paced Intro to Machine Learning — here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War — here
  • Generative AI Art: Midjourney & Stable Diffusion Explained — here
  • Power of Data Storytelling — Sell Stories, Not Data — here
  • Data Warehouses, Data Lakes & Data Mesh Explained — here
  • Power BI — From Data Modelling to Stunning Reports — here
  • Data Warehouses & Data Modelling — a Quick Crash Course — here
  • Machine Learning versus Mechanistic Modelling — here
  • Popular Machine Learning Performance Metrics Explained — here
  • Future of Work: Is Your Career Safe in the AI Age — here
  • Beyond ChatGPT: Search for a Truly Intelligent Machine — here
  • Regression: Predict House Prices using Python — here
  • Classification: Predict Employee Churn using Python — here
  • Python Jupyter Notebooks versus Dataiku DSS — here
  • How to Leverage Cloud Computing for Your Business — here

Unlimited Medium Access

Join Medium here and enjoy unlimited access to the best articles on the internet.

You will be directly supporting myself and other top writers. Cheers!
