Data Platform Architecture Types
Data platform architecture types
Data warehouse
Data lake (Databricks, Dataproc, EMR)
Lakehouse
Data mesh
Relational and Non-relational Database Management systems
Business intelligence stack
Conclusion

Photo by Brooke Lark on Unsplash

It is easy to get lost in the abundance of data tools available out there right now. The web is full of opinionated (and often speculative) stories about which data tools to use and how to make our data stack modern this year. Which data tools are the best? Who is the leader? How do we choose the right ones? This story is for those who work in the data space and are building the best data platform they can.

So what is a “Modern Data Stack”, and how modern is it?

To put it simply, it is a set of tools used to work with data. Depending on what we are going to do with the data, these tools might include the following:

– a managed ETL/ELT data pipeline service

– a cloud-based managed data warehouse / data lake as a destination for the data

– a data transformation tool

– a business intelligence or data visualization platform

– machine learning and data science capabilities

Sometimes it doesn’t matter how modern it is.

Indeed, if our BI tool is super modern, with bespoke OLAP cubes for data modelling and git integration, but it can’t render a report into an email, none of that matters.

Often these little things are crucial. Business needs and data pipeline requirements come first.

In the diagram below we can see the data journey and a collection of relevant tools to use during each step of the data pipeline.

Data flow and tools. Image by author

Redshift, Postgres, Google BigQuery, Snowflake, Databricks, Hadoop, Dataproc, Spark, or Elastic Map Reduce?

Which product should you choose for your data platform?

It depends on the daily workloads you are planning to run with your data and on which tool suits those tasks the most.

I remember a few years ago the web was boiling with “Hadoop is dead” type stories. There was a noticeable shift towards data warehouse architecture. In 2023 everyone seems to be obsessed with real-time data streaming and scalability, suggesting Spark and Kafka will soon become the public benchmark leaders.

So which one is the best? Who is the leader, and which data tools should we choose? How do we pick?

What I understood is that those benchmark judgements were very subjective and should be taken with a pinch of salt. What really matters is how well those tools align with our business requirements when we want to build a data platform.

Data warehouse

A serverless, distributed SQL engine (e.g., BigQuery, Snowflake, or Redshift). It is a relational store where your data is organized in tables, and you are free to enjoy all the benefits of working with large datasets. Indeed, we can do that because most of the modern data warehouses are distributed and scale well, which means you don’t have to worry about table size. It suits ad-hoc analytics with SQL well.

Most of the modern data warehouse solutions can process structured and semi-structured data and are indeed very convenient if the majority of your users are analysts with good SQL skills. Modern data warehouses integrate easily with business intelligence solutions, which also rely on SQL a lot. A warehouse is not designed to store images, videos, or documents. Nevertheless, with SQL you can do almost everything, and in some vendor solutions you can even train machine learning models.
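As a minimal sketch of that interactive-SQL workflow, assuming Google BigQuery and a hypothetical `analytics.events` table (the project, dataset, and column names are invented for illustration), a query from Python could look like this:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical project, dataset, and table names used for illustration only.
client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-analytics-project.analytics.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
"""

# The warehouse takes care of distribution and scaling; we only submit SQL.
for row in client.query(sql).result():
    print(row.user_id, row.events)
```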

Data lake (Databricks, Dataproc, EMR)

A type of architecture where your data is stored in cloud storage, i.e., AWS S3, Google Cloud Storage, or Azure Blob Storage. It is, of course, natural to use it for images, videos, or documents, as well as any other file types (JSON, CSV, PARQUET, AVRO, etc.), but to analyze that data your users will have to write some code.

The most common language for this task would be Python, with a good variety of libraries available; Spark (PySpark) would be another popular choice.

Among the amazing benefits is that everything can be done with code: it gives the highest level of flexibility in data processing. Our users just have to know how to do it.
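For example, a minimal PySpark sketch reading raw Parquet files straight from object storage (the bucket, path, and column names below are assumptions, and the cluster is presumed to have S3 access configured) might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

# Read raw Parquet files directly from cloud storage (hypothetical bucket/path).
events = spark.read.parquet("s3://my-data-lake/events/")

# A simple transformation: daily event counts.
daily_counts = (
    events
    .groupBy(F.to_date("event_ts").alias("event_date"))
    .count()
    .orderBy("event_date")
)

daily_counts.show()
```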

Lakehouse

A combination of data warehouse and data lake architecture. It takes the best of both worlds and serves both programmers and regular business users such as data analysts. It enables your business to run interactive SQL queries while remaining very flexible in terms of customization. Many of the modern data warehouse solutions can run interactive queries on data that is stored in the data lake, i.e. through external tables over cloud storage. One data pipeline can look like this, for instance:

Lakehouse pipeline example. Image by author
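As one hedged illustration of the idea, assuming BigQuery and a hypothetical Parquet folder in Cloud Storage, an external table lets SQL users query files that never leave the lake:

```python
from google.cloud import bigquery

# Bucket, dataset, and table names are illustrative assumptions.
client = bigquery.Client(project="my-analytics-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-data-lake/events/*.parquet"]

table = bigquery.Table("my-analytics-project.analytics.events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Analysts can now query the lake files with plain SQL, with no data copy.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-analytics-project.analytics.events_external`"
).result()
print(next(iter(rows)).n)
```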

Data mesh

A data mesh architecture is a decentralized approach that enables each part of your organization to manage its own data, run cross-team / cross-domain data analysis on its own, and share the data.

Each business unit might have a mix of different programming skills as well as a variety of data workload requirements (flexible data processing vs. interactive SQL queries). Having said that, each business unit is free to choose its own data warehouse / data lake solution and will still be able to share the data with other units with no data movement.
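A small sketch of what “sharing with no data movement” can look like in practice, assuming two hypothetical BigQuery projects where the marketing domain has been granted read access to a dataset owned by the sales domain:

```python
from google.cloud import bigquery

# Project and dataset names are hypothetical; access is granted via IAM,
# so the query reads the sales domain's tables in place; nothing is copied.
client = bigquery.Client(project="marketing-domain")

sql = """
    SELECT region, SUM(amount) AS revenue
    FROM `sales-domain.sales_mart.orders`
    GROUP BY region
    ORDER BY revenue DESC
"""
for row in client.query(sql).result():
    print(row.region, row.revenue)
```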

Relational and non-relational database management systems

A relational database management system (RDBMS) stores data in row-based tables with columns that connect related data elements. It is designed to record transactions and is optimized to fetch current data quickly. Popular relational databases include Postgres and MySQL. NoSQL databases support only simple transactions, whereas a relational database also supports complex transactions with joins. A NoSQL database is used to handle data coming in at high velocity. Popular NoSQL databases are listed below, followed by a minimal transactional sketch:

  • Document databases: MongoDB and CouchDB
  • Key-value databases: Redis and DynamoDB
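To make the OLTP side concrete, here is a small self-contained sketch using Python’s built-in sqlite3 (the tables and values are invented for illustration): one transaction records an order, and a join fetches the current state for a single customer.

```python
import sqlite3

# In-memory database with invented tables, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
""")

# A simple transaction: both inserts commit together or not at all.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 42.0)")

# A join fetches the current data needed to power an application, fast.
row = conn.execute("""
    SELECT c.name, o.amount
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    WHERE c.id = 1
""").fetchone()
print(row)  # ('Alice', 42.0)
```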

A data warehouse has a similar tabular structure; like an RDBMS, it is relational, and data is organized into tables, rows, and columns too. However, it differs in how the data is physically laid out: database data is organized and stored by row, while data warehouse data is stored by column to facilitate online analytical processing (OLAP), whereas a database relies on online transactional processing (OLTP). Some platforms support both the data warehouse and the data lake approach, enabling them to access and analyze large amounts of data.

A data warehouse is designed for data analysis, including large amounts of historical data. Using a data warehouse requires users to create a pre-defined, fixed schema upfront, which helps a lot with data analytics. Tables should be simple (denormalized) to compute large amounts of data.

Database tables, by contrast, are normalized, and joins become complicated because of it. So the primary difference between a database and a data warehouse is that while the conventional database is designed and optimized to record data, the data warehouse is designed and optimized to answer analytical questions. You would want to use a database when you run an app and need to fetch some current data fast; the RDBMS stores the current data required to power an application.
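To contrast with the OLTP sketch above, here is a small analytical example using DuckDB, an in-process columnar engine (the table and columns are synthesized just for the demo; against a real lake you would scan Parquet files instead): a single scan aggregates a million-row fact table, the kind of query a columnar layout is built for.

```python
import duckdb  # pip install duckdb

con = duckdb.connect()

# Synthesize a small "orders" fact table; with a real data lake you would
# instead scan files, e.g. read_parquet('s3://my-data-lake/orders/*.parquet').
con.sql("""
    CREATE TABLE orders AS
    SELECT (i % 5) AS region_id, random() * 100 AS amount
    FROM range(1000000) AS t(i)
""")

# An OLAP-style aggregate: the columnar engine scans only the columns it needs.
con.sql("""
    SELECT region_id, COUNT(*) AS orders, ROUND(SUM(amount), 2) AS revenue
    FROM orders
    GROUP BY region_id
    ORDER BY region_id
""").show()
```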

You will have to decide which one is right for you.

Business intelligence stack

A Modern Data Stack should include BI tools that help with data modelling and visualization. Some high-level overviews can be found below.

  • Free version of Looker Studio, formerly called Google Data Studio. This is a great free tool for BI with community-based support.
  • Great collection of widgets and charts
  • Great collection of community-based data connectors
  • Free email scheduling and delivery. Perfectly renders reports into an email.
  • Free data governance features
  • Because it is a free community tool, its API is a bit underdeveloped

The paid version offers:

  • Robust data modelling features and self-service capabilities. Great for medium and large-sized firms.
  • API features

  • Outstanding visuals
  • Reasonable pricing
  • Patented VizQL engine driving its intuitive analytics experience
  • Connections to many data sources, such as Hadoop, SAP, and DB technologies, improving data analytics quality.
  • Integrations with Slack, Salesforce, and many others.

  • Custom-branded email reports
  • Serverless and easy to manage
  • Robust API
  • Serverless auto-scaling
  • Pay-per-use pricing

  • Excel integration
  • Powerful data ingestion and connection capabilities
  • Shared dashboards from Excel data made with ease
  • A variety of visuals and graphics is readily available

Sisense (formerly Periscope)

Sisense is an end-to-end data analytics platform that makes data discovery and analytics accessible to customers and employees alike via an embeddable, scalable architecture.

  • Offers data connectors for nearly every major service and data source
  • Delivers a code-free experience for non-technical users, though the platform also supports Python, R, and SQL
  • Git integration and custom datasets
  • Can be a bit expensive since it is based on a pay-per-license, per-user model
  • Some features are still under construction, e.g. report email delivery and report rendering

  • Natural language for queries

  • CSS design for dashboards
  • Collaboration features to permit rapid prototyping before committing to a premium plan
  • Notebook support
  • Git support

  • Great for beginners and really flexible
  • Has a Docker image so we can run it right away
  • Self-service Analytics

  • API
  • Write queries in their natural syntax and explore schemas
  • Use query results as data sources to join different databases

Some of these tools have free versions. For instance, Looker Studio is free, with basic dashboarding features like email delivery, a drag-and-drop widget builder, and a wide variety of charts. Others have paid features, e.g., data modelling, alerts, notebooks, and git integration.

They are all great tools with their pros and cons. Some of them are more user-friendly; some offer more robust APIs, CI/CD features, and git integration. For some of the tools, these features are available only in the paid version.

Conclusion

Modern data-driven apps require a database to store the current application data. So if you have an application to run, consider an OLTP architecture and an RDBMS.

Data lakes, warehouses, lakehouses, and databases each have their benefits and serve their own purpose.

Firms that want to perform big data analytics by running complex SQL queries on historical data may choose to augment their databases with a data warehouse (or a lakehouse). It makes the data stack flexible and modern.

Typically, the answer would always be the same:

Go for the cheapest one or the one that works best with your dev stack.

Try it and you will see that a relational database can be easily integrated into the data platform, whether it is a data lake or a data warehouse. A wide variety of data connectors will enable easy and seamless data extraction.
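As a hedged sketch of that kind of extraction (the connection string, `orders` table, and destination path are assumptions, and writing to S3 would require s3fs/pyarrow installed), a small batch export from a relational database into the platform could look like this:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical application database; credentials are placeholders.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/appdb")

# Pull a batch of current application data from the relational database...
orders = pd.read_sql("SELECT * FROM orders WHERE updated_at >= CURRENT_DATE", engine)

# ...and land it in the lake / warehouse staging area as Parquet.
orders.to_parquet("s3://my-data-lake/staging/orders/latest.parquet", index=False)
```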

Nevertheless, there are a few things to consider.

The important thing thing here is to try data tools to see how well they will be aligned with our business requirements.

For instance, some BI tools offer only pay-per-user pricing, which might not be a good fit in case we need to share a dashboard with external users.

If there are any cost-saving benefits, it might be better to keep data tools with the same cloud vendor where your development stack is.

We would want to check if there is an overlap in functionality between tools, i.e. do we really need a BI solution that performs data modelling in its own OLAP cube when we already do it in the data warehouse?

Data modelling is important

Indeed, it defines how often we process the data, which will inevitably be reflected in processing costs.

The choice between a data lake and a data warehouse depends mostly on the skill set of your users. A data warehouse solution enables more interactivity and narrows our selection down to SQL-first products (Snowflake, BigQuery, etc.).

Data lakes are for users with programming skills, and we might want to go for Python-first products like Databricks, Galaxy, Dataproc, or EMR.

Recommended reading

  1. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/how-to-build-a-data-architecture-to-drive-innovation-today-and-tomorrow
  2. https://aws.amazon.com/emr/
  3. https://cloud.google.com/learn/what-is-a-data-lake
  4. https://medium.com/towards-data-science/data-pipeline-design-patterns-100afa4b93e3
  5. https://www.snowflake.com/trending/data-architecture-principles
