
Building a Data Platform in 2024


How to build a modern, scalable data platform to power your analytics and data science projects (updated)


What’s changed?

Since 2021, perhaps a better question is what HASN’T changed?

Stepping out of the shadow of COVID, our society has grappled with a myriad of challenges: political and social turbulence, fluctuating financial landscapes, the surge in AI advancements, and Taylor Swift emerging as the biggest star in the … *checks notes* … National Football League!?!

Over the past three years, my life has changed as well. I’ve navigated the data challenges of various industries, lending my expertise through work and consulting at both large corporations and nimble startups.

Concurrently, I’ve dedicated substantial effort to shaping my identity as a Data Educator, collaborating with some of the most renowned companies and prestigious universities globally.

As a result, here’s a brief list of what inspired me to write an amendment to my original 2021 article:

Companies, big and small, are starting to reach levels of data scale previously reserved for Netflix, Uber, Spotify and other giants creating unique services with data. Simply cobbling together data pipelines and cron jobs across various applications no longer works, so there are new considerations when discussing data platforms at scale.

Although I briefly mentioned streaming in my 2021 article, you’ll see a renewed focus in the 2024 version. I’m a strong believer that data has to move at the speed of business, and the only way to truly accomplish this in modern times is through data streaming.

I discussed modularity as a core concept of building a modern data platform in my 2021 article, but I failed to emphasize the importance of data orchestration. This time around, I have an entire section dedicated to orchestration and why it has emerged as a natural complement to a modern data stack.

The Platform

To my surprise, there is still no single vendor solution that has domain over the entire data landscape, although Snowflake has been trying its best through acquisition and development efforts (Snowpipe, Snowpark, Snowplow). Databricks has also made notable improvements to its platform, specifically in the ML/AI space.

All of the components from the 2021 article made the cut in 2024, but even the familiar entries look a bit different three years later:

  • Source
  • Integration
  • Data Store
  • Transformation
  • Orchestration
  • Presentation
  • Transportation
  • Observability

Integration

The integration category gets the biggest upgrade in 2024, splitting into three logical subcategories:

Batch

The ability to process incoming data signals from various sources at a daily/hourly interval is the bread and butter of any data platform.

Fivetran still looks like the undeniable leader in the managed ETL category, but it has some stiff competition from up-and-comers like Airbyte and from large cloud providers that have been strengthening their platform offerings.

Over the past three years, Fivetran has improved its core offering significantly, extended its connector library and even begun to branch out into light orchestration with features like its dbt integration.

It’s also worth mentioning that many vendors, such as Fivetran, have merged the best of OSS and venture capital funding into something called Product Led Growth, offering free tiers of their products that lower the barrier to entry to enterprise-grade platforms.

Even when the problems you’re solving require many custom source integrations, it makes sense to use a managed ETL provider for the bulk and custom Python code for the rest, all held together by orchestration.
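To make that concrete, here’s a minimal sketch of what one of those custom Python extractions might look like. The API endpoint, credentials, and warehouse connection string are all hypothetical placeholders; in practice this function would run as a task in your orchestrator, right alongside the managed ETL connectors.

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

def extract_and_load(run_date: str) -> int:
    """Pull one day of records from a (hypothetical) source API and land them in the warehouse."""
    resp = requests.get(
        "https://api.example.com/v1/orders",           # placeholder endpoint
        params={"date": run_date},
        headers={"Authorization": "Bearer <token>"},    # placeholder credential
        timeout=30,
    )
    resp.raise_for_status()
    df = pd.DataFrame(resp.json()["results"])

    # Land the raw records as-is; transformation happens downstream (e.g. in dbt).
    engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")  # placeholder DSN
    df.to_sql("raw_orders", engine, schema="raw", if_exists="append", index=False)
    return len(df)
```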

Streaming

Kafka/Confluent is king when it comes to data streaming, but working with streaming data introduces numerous new considerations beyond topics, producers, consumers, and brokers, such as serialization, schema registries, stream processing/transformation and streaming analytics.
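For reference, here’s a minimal producer sketch using the confluent-kafka Python client. The broker address, topic, and event payload are placeholders, and a production setup would typically layer a schema registry and Avro or Protobuf serialization on top rather than sending raw JSON.

```python
import json
from confluent_kafka import Producer

# Placeholder broker address.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    """Log whether the broker acknowledged the event."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

event = {"order_id": 123, "status": "shipped"}  # hypothetical event
producer.produce(
    topic="orders",
    key=str(event["order_id"]),
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()  # block until all queued messages are delivered
```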

Confluent is doing a great job of aggregating all of the components required for successful data streaming under one roof, but I’ll be calling out streaming considerations throughout the other layers of the data platform.

The introduction of data streaming doesn’t inherently demand a complete overhaul of the data platform’s structure. In fact, the synergy between batch and streaming pipelines is crucial for tackling the various challenges posed to your data platform at scale. The key to seamlessly addressing these challenges lies, unsurprisingly, in data orchestration.

Eventing

In many cases, the data platform itself should be responsible for, or at the very least inform, the generation of first-party data. Many could argue that this is a job for software engineers and app developers, but I see a synergistic opportunity in allowing the people who build your data platform to also be responsible for your eventing strategy.

I break down eventing into two categories:

  • Change Data Capture — CDC

The basic gist of CDC is using your database’s CRUD commands as a stream of data itself. The first CDC platform I came across was an OSS project called Debezium, and there are many players, big and small, vying for space in this emerging category (a minimal connector sketch follows at the end of this section).

  • Click Streams — Segment/Snowplow

Building telemetry to capture customer activity on websites or applications is what I’m referring to as click streams. Segment rode the click stream wave to a billion-dollar acquisition, Amplitude built click streams into an entire analytical platform, and Snowplow has been surging more recently with its OSS approach, demonstrating that this space is ripe for continued innovation and eventual standardization.
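To make the click stream idea concrete, here’s a minimal server-side tracking call sketched with Segment’s analytics-python library. The write key, user ID, and event properties are placeholders, and the exact import path differs between library versions.

```python
import analytics  # Segment's analytics-python package

analytics.write_key = "<YOUR_SEGMENT_WRITE_KEY>"  # placeholder

# One behavioral event: who did what, with what context.
analytics.track(
    user_id="user_123",
    event="Checkout Completed",
    properties={"cart_value": 74.99, "item_count": 3, "currency": "USD"},
)

analytics.flush()  # events are batched; flush before the process exits
```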

AWS has been a leader in data streaming, offering templates to establish the outbox pattern and building data streaming products such as MSK, SQS, SNS, Lambda, DynamoDB and more.
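Circling back to CDC, as promised: a Debezium connector typically runs on Kafka Connect and is registered through its REST API. The sketch below illustrates that registration call; the hostnames, credentials, and configuration keys are placeholders and vary by connector version and source database.

```python
import requests

# Hypothetical Kafka Connect endpoint and Postgres connection details.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.example.com",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<secret>",
        "database.dbname": "orders",
        "topic.prefix": "orders_app",          # key names differ across Debezium versions
        "table.include.list": "public.orders",
    },
}

# Kafka Connect exposes a REST API for creating connectors.
resp = requests.post(
    "http://kafka-connect.example.com:8083/connectors",
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
# From here, every INSERT/UPDATE/DELETE on public.orders lands on a Kafka topic as a change event.
```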

Data Store

Another significant change from 2021 to 2024 lies in the shift from “Data Warehouse” to “Data Store,” acknowledging the expanding database horizon, including the rise of Data Lakes.

Viewing Data Lakes as a technique rather than a product emphasizes their role as a staging area for structured and unstructured data, potentially interacting with Data Warehouses. Choosing the right data store solution for each aspect of the Data Lake is crucial, but the overarching technology decision involves tying together and exploring these stores to transform raw data into downstream insights.

Distributed SQL engines like Presto, Trino and their numerous managed counterparts (Pandio, Starburst) have emerged to traverse Data Lakes, enabling users to use SQL to join diverse data across various physical locations.
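For illustration, Trino’s Python client lets you issue one SQL statement that joins tables living in different catalogs. The host, catalogs, and table names below are placeholders and assume the corresponding connectors are already configured on the cluster.

```python
from trino.dbapi import connect  # pip install trino

# Hypothetical Trino coordinator with 'hive' (object storage) and 'postgresql' catalogs configured.
conn = connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="sales",
)
cur = conn.cursor()

# One query spanning a data lake table and an operational database table.
cur.execute("""
    SELECT o.order_id, o.amount, c.segment
    FROM hive.sales.orders o
    JOIN postgresql.public.customers c
      ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2024-01-01'
""")

for row in cur.fetchall():
    print(row)
```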

Amid the push to keep up with generative AI and Large Language Model trends, specialized data stores like vector databases become essential. These include open-source options like Weaviate, managed solutions like Pinecone and many more.
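Stripped of product specifics, the core job of a vector database is nearest-neighbour search over embeddings. The brute-force sketch below shows the idea with random placeholder vectors; Weaviate, Pinecone and friends add persistence, approximate indexing (HNSW and similar), and metadata filtering on top.

```python
import numpy as np

# Toy "index": each row is an embedding for a document
# (in practice these come from an embedding model, not np.random).
doc_embeddings = np.random.rand(1000, 384).astype("float32")
query = np.random.rand(384).astype("float32")

# Cosine similarity between the query and every stored vector.
norms = np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query)
scores = doc_embeddings @ query / norms

# Top 5 nearest neighbours, i.e. the documents most semantically similar to the query.
top_k = np.argsort(scores)[::-1][:5]
print(top_k, scores[top_k])
```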

Transformation

Few tools have revolutionized data engineering like dbt. Its impact has been so profound that it has given rise to a new data role: the analytics engineer.

dbt has become the go-to choice for organizations of all sizes looking to automate transformations across their data platform. The introduction of dbt Core, the free tier of the dbt product, has played a pivotal role in familiarizing data engineers and analysts with dbt, hastening its adoption, and fueling the swift development of new features.
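Most dbt models are SQL files sprinkled with Jinja, but to keep this article’s examples in one language, here’s a sketch of the Python model variant dbt supports on adapters like Snowflake and Databricks. The model and column names are hypothetical, and the DataFrame type returned by dbt.ref() depends on the adapter.

```python
# models/orders_enriched.py -- a dbt Python model (dbt also, and more commonly, uses SQL models)
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() plays the same role as {{ ref() }} in SQL models:
    # it wires this model into the DAG behind the upstream model.
    orders = dbt.ref("stg_orders")

    # On Snowflake this is a Snowpark DataFrame; convert to pandas for a simple illustration.
    df = orders.to_pandas()
    df["IS_LARGE_ORDER"] = df["AMOUNT"] > 100  # hypothetical column

    # Whatever is returned gets materialized as the model's table.
    return df
```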

Among these features, dbt Mesh stands out as particularly impressive. This innovation enables the tethering and referencing of multiple dbt projects, empowering organizations to modularize their data transformation pipelines and specifically meet the challenges of data transformations at scale.

Stream transformations represent a less mature area by comparison. Although there are established and reliable open-source projects like Flink, which has been around since 2011, their impact hasn’t resonated as strongly as tools dealing with “at rest” data, such as dbt. Nevertheless, with the increasing accessibility of streaming data and the continued evolution of computing resources, there’s a growing imperative to advance the stream transformation space.

In my opinion, the future of widespread adoption in this domain depends on technologies like Flink SQL and emerging managed services from providers like Confluent, Decodable, Ververica, and Aiven. These solutions empower analysts to leverage a familiar language, such as SQL, and apply those concepts to real-time, streaming data.
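Here’s a rough sense of what that looks like with PyFlink’s Table API: define a Kafka-backed table, then aggregate it with ordinary-looking SQL that Flink evaluates continuously. The broker, topic, and schema are placeholders, and the Kafka SQL connector jar is assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; connector properties below are placeholders.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount DOUBLE,
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# A familiar SQL aggregation, continuously evaluated over 1-minute tumbling windows of the stream.
result = t_env.execute_sql("""
    SELECT window_start, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end
""")
result.print()
```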

Orchestration

Reviewing the Integration, Data Store, and Transformation components of building a data platform in 2024 highlights the daunting challenge of choosing between a multitude of tools, technologies, and solutions.

From my experience, the key to finding the right iteration for your scenario is experimentation, allowing you to swap out different components until you achieve the desired outcome.

Data orchestration has become crucial in facilitating this experimentation during the initial phases of building a data platform. It not only streamlines the process but also offers scalable options to align with the trajectory of any business.

Orchestration is typically executed through Directed Acyclic Graphs (DAGs), or code that structures hierarchies, dependencies, and pipelines of tasks across multiple systems. Concurrently, it manages and scales the resources used to run those tasks.

Airflow remains the go-to solution for data orchestration, available in various managed flavors such as MWAA and Astronomer, and inspiring spin-off branches like Prefect and Dagster.
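A minimal sketch of what that looks like with Airflow’s TaskFlow API (Airflow 2.x) is below. The task bodies are placeholders standing in for the integration, transformation, and transportation steps discussed elsewhere in this article; in a real DAG each would call the relevant tool or API.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def analytics_pipeline():

    @task
    def extract_orders() -> int:
        # e.g. the custom Python extraction sketched in the Integration section
        return 1000  # placeholder row count

    @task
    def run_dbt(row_count: int) -> None:
        # e.g. trigger `dbt build` here (dbt Cloud API, dbtRunner, or a Bash command)
        print(f"Transforming {row_count} newly landed rows")

    @task
    def sync_to_crm() -> None:
        # e.g. kick off a reverse ETL sync once the models are fresh
        print("Syncing audience tables to the CRM")

    # Dependencies: extract -> transform -> activate.
    run_dbt(extract_orders()) >> sync_to_crm()

analytics_pipeline()
```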

Without an orchestration engine, the ability to modularize your data platform and unlock its full potential is limited. Moreover, orchestration serves as a prerequisite for initiating a data observability and governance strategy, playing a pivotal role in the success of the entire data platform.

Presentation

Surprisingly, traditional data visualization platforms like Tableau, PowerBI, Looker, and Qlik continue to dominate the field. While data visualization witnessed rapid growth initially, the space has experienced relative stagnation over the past decade. An exception to this trend is Microsoft, with commendable efforts toward relevance and innovation, exemplified by products like the PowerBI Service.

Emerging data visualization platforms like Sigma and Superset feel like the natural bridge to the future. They permit on-the-fly, resource-efficient transformations alongside world-class data visualization capabilities. Nevertheless, a potent newcomer, Streamlit, has the potential to redefine everything.

Streamlit, a powerful Python library for building front-end interfaces to Python code, has carved out a valuable niche in the presentation layer. While the technical learning curve is steeper compared to drag-and-drop tools like PowerBI and Tableau, Streamlit offers near-infinite possibilities, including interactive design elements, dynamic slicing, content display, and custom navigation and branding.
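Here’s a minimal sketch of a Streamlit app with one interactive slicer and a chart; the data is generated in place as a placeholder, where a real app would read from the warehouse.

```python
import pandas as pd
import streamlit as st

st.title("Revenue Dashboard")  # placeholder app

# Placeholder data; a real app would query the warehouse (Snowflake, Trino, etc.).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "region": ["NA", "EMEA", "APAC"] * 30,
    "revenue": range(90),
})

# Interactive slicing: changing the widget re-runs the script and filters the data.
region = st.selectbox("Region", sorted(df["region"].unique()))
filtered = df[df["region"] == region]

st.metric("Total revenue", f"${filtered['revenue'].sum():,.0f}")
st.line_chart(filtered, x="date", y="revenue")
```

Saved as app.py, this runs with `streamlit run app.py`; every widget interaction re-executes the script, which is what makes the slicing feel dynamic.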

Streamlit has been so impressive that Snowflake acquired the company for nearly $1B in 2022. How Snowflake integrates Streamlit into its suite of offerings will likely shape the future of both Snowflake and data visualization as a whole.

Transportation

Transportation, Reverse ETL, or data activation (the final leg of the data platform) represents the crucial stage where the platform’s transformations and insights loop back into source systems and applications, truly impacting business operations.

Currently, Hightouch stands out as a leader in this domain. Its robust core offering seamlessly integrates data warehouses with data-hungry applications. Notably, its strategic partnerships with Snowflake and dbt emphasize a commitment to being recognized as a versatile data tool, distinguishing it from mere marketing and sales widgets.

The future of the transportation layer seems destined to intersect with APIs, creating a scenario where API endpoints generated from SQL queries become as common as exporting .csv files to share query results. While this transformation is anticipated, few vendors are exploring the commoditization of this space.
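As a sketch of where that could land, here’s a small FastAPI service that puts a warehouse SQL query behind a JSON endpoint. The connection string, schema, and table names are placeholders; the point is simply that “share a query” becomes “hit an endpoint” instead of “export a .csv”.

```python
from fastapi import FastAPI
from sqlalchemy import create_engine, text

app = FastAPI()

# Placeholder warehouse connection; in practice this would point at Snowflake, Postgres, Trino, etc.
engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")

@app.get("/metrics/daily-revenue")
def daily_revenue(start_date: str, end_date: str):
    """Expose the result of a warehouse query as a JSON API endpoint."""
    query = text("""
        SELECT order_date, SUM(amount) AS revenue
        FROM analytics.fct_orders
        WHERE order_date BETWEEN :start AND :end
        GROUP BY order_date
        ORDER BY order_date
    """)
    with engine.connect() as conn:
        rows = conn.execute(query, {"start": start_date, "end": end_date}).mappings().all()
    return [dict(r) for r in rows]
```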

Observability

Similar to data orchestration, data observability has emerged as a necessity to capture and track all of the metadata produced by the different components of a data platform. This metadata is then used to manage, monitor, and foster the growth of the platform.

Many organizations address data observability by building internal dashboards or relying on a single point of failure, such as the data orchestration pipeline, for observation. While this approach may suffice for basic monitoring, it falls short of solving more intricate logical observability challenges, like lineage tracking.
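For context, the kind of “basic monitoring” I mean often looks like the sketch below: volume and freshness checks wired into the orchestrator. The connection string, table, and thresholds are placeholders; note what it can’t answer, which is exactly the lineage gap the tools below address.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder warehouse connection; table and column names are hypothetical.
engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")

checks = pd.read_sql(
    """
    SELECT COUNT(*) AS row_count,
           EXTRACT(EPOCH FROM (NOW() - MAX(loaded_at))) / 3600 AS hours_since_load
    FROM raw.orders
    """,
    engine,
)

row_count = int(checks.loc[0, "row_count"])
hours_since_load = float(checks.loc[0, "hours_since_load"])

# Volume and freshness checks; in an orchestrated pipeline a failure here would page someone
# or block downstream tasks. What this cannot tell you is where the data came from or which
# dashboards break when it is wrong -- that is the lineage problem dedicated tools solve.
assert row_count > 0, "raw.orders is empty"
assert hours_since_load < 6, f"raw.orders is stale ({hours_since_load:.1f}h since last load)"
```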

Enter DataHub, a popular open-source project gaining significant traction. Its managed service counterpart, Acryl, has further amplified its impact. DataHub excels at consolidating metadata exhaust from the various applications involved in data movement across an organization. It seamlessly ties this information together, allowing users to trace KPIs on a dashboard back to the originating data pipeline and every step in between.

Monte Carlo and Great Expectations serve a similar observability role in the data platform but with a more opinionated approach. The growing popularity of terms like “end-to-end data lineage” and “data contracts” suggests an imminent surge in this category. We can expect significant growth from both established leaders and innovative newcomers, poised to reshape the outlook of data observability.

Closing

The 2021 version of this article is 1,278 words.

The 2024 version of this article is well over 2K words before this closing.

I guess that means I should keep it short.

Building a platform that’s fast enough to meet the needs of today and flexible enough to grow to the demands of tomorrow starts with modularity and is enabled by orchestration. In order to adopt the most modern solution for your specific problem, your platform must make room for data solutions of all shapes and sizes, whether it’s an OSS project, a new managed service or a suite of products from AWS.

There are many ideas in this article, but ultimately the choice is yours. I’m eager to hear how this inspires people to explore new possibilities and create new ways of solving problems with data.

Note: I’m not currently affiliated with or employed by any of the companies mentioned in this post, and this post isn’t sponsored by any of these tools.
