Embracing Simplicity and Composability in Data Engineering


Lessons from 30+ years in data engineering: the overlooked value of keeping it simple


We have a simple and fundamental principle in computer programming: the separation of concerns between logic and data. Yet when I look at the current data engineering landscape, it’s clear that we have strayed from this principle, complicating our efforts significantly. I have previously written about this issue.

There are other elegantly simple principles that we regularly overlook and fail to follow. The developers of the Unix operating system, for instance, introduced well-thought-out and simple abstractions for building software products. These principles have stood the test of time, evident in the tens of millions of applications built upon them. Nevertheless, for some reason we frequently take convoluted detours via complex and sometimes closed ecosystems, losing sight of the KISS principle and the Unix philosophy of simplicity and composability.

Why does this occur?

Let’s explore some examples and delve into a bit of history to better understand this phenomenon. This exploration may help explain why we repeatedly fail to keep things simple.

Unix-like systems offer a fundamental abstraction of data as files. In these systems nearly everything related to data is a file, including:

  • Regular Files: Typically text, pictures, programs, etc.
  • Directories: A special kind of file containing lists of other files, organizing them hierarchically.
  • Devices: Files representing hardware devices, including block-oriented (disks) and character-oriented devices (terminals).
  • Pipes: Files enabling communication between processes.
  • Sockets: Files facilitating network communication between computer nodes.

Every application can use common operations that work the same way on these different file types, like open(), read(), write(), close(), and lseek() (change the position inside a file). The content of a file is just a stream of bytes, and the system makes no assumptions about the structure of a file’s content. For each file the system maintains basic metadata about the owner, access rights, timestamps, size, and location of the data blocks on disk.
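To make this concrete, here is a minimal Python sketch (the file path is just an illustration) that uses the os module, a thin wrapper around these system calls, to exercise exactly this small set of operations:

```python
import os

# Open (creating if missing), write, reposition, read, close: the same small
# set of calls applies to regular files, and most of them to pipes, devices,
# and sockets as well.
fd = os.open("/tmp/example.txt", os.O_RDWR | os.O_CREAT, 0o644)

os.write(fd, b"hello, unix\n")   # write a stream of bytes; no structure assumed
os.lseek(fd, 0, os.SEEK_SET)     # change the position inside the file
data = os.read(fd, 1024)         # read the raw bytes back
print(data.decode())

os.close(fd)

# The system also keeps the basic metadata mentioned above:
info = os.stat("/tmp/example.txt")
print(info.st_size, info.st_mtime, oct(info.st_mode))
```

Most of these calls work unchanged on a pipe or a terminal device (lseek being the exception for non-seekable files), which is precisely the point of the abstraction.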

This compact and at the same time versatile abstraction supports the development of very flexible data systems. It has, for instance, also been used to create the well-known relational database systems, which introduced the new abstraction called relation (or table) for us.

Unfortunately, these systems evolved in ways that moved away from treating relations as files. Accessing the data in these relations now requires calling the database application and using the structured query language (SQL), which was defined as the new interface to the data. This allowed databases to better control access and to offer higher-level abstractions than the file system.
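Here is a minimal sketch of what that looks like in practice, using Python’s built-in sqlite3 module (the database file and table are made up for illustration): every interaction goes through the SQL interface, and the layout of the bytes on disk is an internal detail of the engine.

```python
import sqlite3

# All access is mediated by the database engine and expressed in SQL;
# we never read or seek in the underlying file ourselves.
conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
conn.commit()

for row in conn.execute("SELECT id, name FROM users"):
    print(row)

conn.close()
```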

Was this an improvement in general? For a few decades we obviously believed it was, and relational database systems became all the rage. Interfaces such as ODBC and JDBC standardized access to various database systems, making relational databases the default for many developers. Vendors promoted their systems as comprehensive solutions, incorporating not only data management but also business logic, encouraging developers to work entirely within the database environment.

A brave man named Carlos Strozzi tried to counteract this development and adhere to the Unix philosophy. He aimed to keep things simple and treat the database as just a thin extension to the Unix file abstraction. Because he didn’t want to force applications to use only SQL for accessing the data, he called it NoSQL RDBMS. The term NoSQL was later taken over by the movement towards alternative data storage models, driven by the need to handle increasing data volumes at web scale. Relational databases were dismissed by the NoSQL community as outdated and incapable of handling the needs of modern data systems. A confusing multitude of new APIs emerged.

Ironically, the NoSQL community eventually recognized the value of a common interface, leading to the reinterpretation of NoSQL as “Not Only SQL” and the reintroduction of SQL interfaces to NoSQL databases. Concurrently, the open-source movement and new open data formats like Parquet and Avro emerged, saving data in plain files compatible with the good old Unix file abstractions. Systems like Apache Spark and DuckDB now use these formats, enabling direct data access via libraries relying solely on file abstractions, with SQL as just one of many access methods.
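As a rough illustration, and assuming the duckdb and pyarrow packages are installed (the file name is invented), one and the same Parquet file, just a plain file on disk, can be queried with SQL or read directly as a table:

```python
import duckdb
import pyarrow.parquet as pq

# 1. SQL, via DuckDB, directly against the file: no database server involved.
print(duckdb.sql("SELECT count(*) FROM 'events.parquet'").fetchall())

# 2. Plain library access against the very same file, no SQL at all.
table = pq.read_table("events.parquet")
print(table.num_rows, table.schema)
```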

Ultimately, databases actually didn’t offer the better abstraction for implementing all the multifaceted requirements in the enterprise. SQL is a valuable tool, but not the only or always the best choice. We had to take the detours via RDBMS and NoSQL databases to end up back at files. Maybe we now recognize that simple Unix-like abstractions actually provide a solid foundation for the versatile requirements of data management.

Don’t get me wrong, databases remain crucial, offering features like ACID transactions, granular access control, indexing, and many others. However, I think that a single monolithic system with a constrained and opinionated way of representing data is not the right approach to deal with all the varied requirements at the enterprise level. Databases add value but should be open and usable as components within larger systems and architectures.

Databases are just one example of the trend to create new ecosystems that aim to be the better abstraction for applications to handle data and even logic. A similar phenomenon occurred with the big data movement. In order to process the huge amounts of data that traditional databases apparently could no longer handle, a whole new ecosystem emerged around the distributed data system Hadoop.

Hadoop implemented the distributed file system HDFS, tightly coupled with the processing framework MapReduce. Both components are completely Java-based and run in the JVM. Consequently, the abstractions offered by Hadoop were not seamless extensions to the operating system. Instead, applications had to adopt a completely new abstraction layer and API to leverage the advancements of the big data movement.
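To get a feeling for what that meant in practice, compare ordinary file access with a sketch of HDFS access through a client library. I use pyarrow’s HadoopFileSystem here; the host, port, and path are placeholders, and a reachable namenode plus a working libhdfs installation are assumed.

```python
from pyarrow import fs

# Local file: the ordinary Unix abstraction, built into the language runtime.
with open("/data/events.csv", "rb") as f:
    head = f.read(100)

# HDFS: the same logical act of reading a few bytes goes through a separate
# client stack that talks to the Hadoop namenode and datanodes.
hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020)  # placeholder host
with hdfs.open_input_stream("/data/events.csv") as f:
    head = f.read(100)
```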

This ecosystem spawned a multitude of tools and libraries, ultimately giving rise to the new role of the data engineer. A new role that seemed inevitable because the ecosystem had grown so complex that regular software engineers could no longer keep up. Clearly, we didn’t keep things simple.

With the insight that big data can’t be handled by single systems, we witnessed the emergence of new distributed operating system equivalents. This somewhat unwieldy term refers to systems that allocate resources to software components running across a cluster of compute nodes.

For Hadoop, this role was filled by YARN (Yet Another Resource Negotiator), which managed resource allocation among the running MapReduce jobs in Hadoop clusters, much like an operating system allocates resources among the processes running on a single system.

An alternative approach would have been to scale the Unix-like operating systems across multiple nodes while retaining familiar single-system abstractions. Indeed, such systems, referred to as Single System Image (SSI), were developed independently of the big data movement. This approach abstracted away the fact that the Unix-like system ran on many distributed nodes, promising horizontal scaling while evolving proven abstractions. However, the development of these systems apparently proved complex, and it stagnated around 2015.

A key factor in this stagnation was likely the parallel development by influential cloud providers, who advanced the YARN functionality into a distributed orchestration layer for traditional Linux systems. Google, for instance, pioneered this with its internal system Borg, which apparently required less effort than rewriting the operating system itself. But once again, we sacrificed simplicity.

Today, we lack a system that transparently scales single-system processes across a cluster of nodes. Instead, we were blessed (or cursed?) with Kubernetes, which evolved from Google’s Borg to become the de-facto standard for a distributed resource and orchestration layer running containers in clusters of Linux nodes. Known for its complexity, Kubernetes requires learning about Persistent Volumes, Persistent Volume Claims, Storage Classes, Pods, Deployments, StatefulSets, ReplicaSets, and more. A completely new abstraction layer that bears little resemblance to the simple, familiar abstractions of Unix-like systems.

It is not only computer systems that suffer from supposed advances that disregard the KISS principle. The same applies to the systems that organize the development process.

Since 2001, we have had a lean and well-thought-out manifesto of principles for agile software development. Following these straightforward principles helps teams collaborate, innovate, and ultimately produce better software systems.

However, in our effort to ensure their successful application, we tried to prescribe these general principles more precisely, detailing them so much that teams now require agile training courses to fully grasp the complex processes. We ended up with overly complex frameworks like SAFe that most agile practitioners wouldn’t even consider agile anymore.

You don’t have to believe in agile principles (some argue that agile working has failed) to see the point I’m making. We tend to complicate things excessively when commercial interests gain the upper hand or when we rigidly prescribe rules that we believe must be followed. There is a great talk on this by Dave Thomas (one of the authors of the manifesto) in which he explains what happens when we ignore simplicity.

The KISS principle and the Unix philosophy are easy to understand, but in the daily madness of data architecture in IT projects, they can be hard to follow. We have too many tools, too many vendors selling too many products that all promise to solve our challenges.

The only way out is to truly understand and adhere to sound principles. I think we should always think twice before replacing tried and tested simple abstractions with something new and fashionable.

I have written about my personal strategy for staying up to speed and understanding the big picture in order to cope with the enormous complexity we face.

Commercialism must not determine decisions

It is hard to follow the simple principles of the Unix philosophy when your organization is clamoring for a new giant AI platform (or any other platform, for that matter).

Enterprise Resource Planning (ERP) providers, for instance, made us believe at the time that they could deliver systems covering all relevant business requirements in a company. How dare you contradict these experts?

Unified Real-Time (Data) Platform (URP) providers now claim their systems will solve all our data concerns. How dare you not use such a comprehensive system?

But products are always just a small brick in the overall system architecture, no matter how extensive the advertised range of functionality is.

Data engineering should be grounded in the same software architecture principles used in software engineering. And software architecture is about balancing trade-offs and maintaining flexibility, focusing on long-term business value. Simplicity and composability can help you maintain this focus.

Pressure from closed thinking models

It is not only commercialism that keeps us from adhering to simplicity. Even open-source communities can be dogmatic. While we seek golden rules for perfect systems development, they simply don’t exist in reality.

The Python community may tell you that non-pythonic code is bad. The functional programming community might claim that applying OOP principles will send you to hell. And the protagonists of agile programming may want to convince you that any development following the waterfall approach will doom your project to failure. Of course, they are all wrong in their absolutism, but we often dismiss ideas outside our own thinking space as inappropriate.

We like clear rules that we just have to follow to be successful. At one of my clients, for instance, the software development team had intensely studied software design patterns. Such patterns can be very helpful for finding a tried and tested solution to common problems. But what I actually observed in the team was that they viewed these patterns as rules they had to adhere to rigidly. Not following the rules felt like being a bad software engineer. Yet this often led to overly complex designs for very simple problems. Critical thinking based on sound principles cannot be replaced by rigid adherence to rules.

In the end, it takes courage and a thorough understanding of principles to embrace simplicity and composability. This approach is essential for designing reliable data systems that scale, can be maintained, and evolve with the enterprise.
