Recently, I have had the fortune of talking to quite a few data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were:
- Not knowing why something broke
- Getting burnt by high cloud compute costs
- Taking too long to build data solutions/complete data projects
- Needing expertise in many tools and technologies
These problems aren’t new. I’ve experienced them, and you’ve probably experienced them too. Yet we can’t seem to find a solution that solves all of these issues in the long term. You might think to yourself, ‘well, point one can be solved with {insert data observability tool}’, or ‘point two just needs a stricter data governance plan in place’. The issue with these kinds of solutions is that they add additional layers of complexity, which causes the final two pain points to grow in severity. The aggregate sum of pain stays the same, just with a different distribution across the four points.
This article aims to present a contrary style of problem solving: radical simplicity.
TL;DR
- Software engineers have found massive success in embracing simplicity.
- Over-engineering and pursuing perfection can lead to bloated, slow-to-develop data systems, with sky-high costs to the business.
- Data teams should consider sacrificing some functionality for the sake of simplicity and speed.
A Lesson From Those Software Guys
In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems paradoxically called ‘Worse Is Better’. I won’t go into the details (you can read the essay here if you like), but the underlying message was that software quality doesn’t necessarily improve as functionality increases. In other words, on occasion, you can sacrifice completeness for simplicity and end up with an inherently ‘better’ product because of it.
This was an odd idea to the pioneers of computing during the 1950s and 60s. The philosophy of the day was: a computer system must be pure, and it can only be pure if it accounts for all possible scenarios. This was likely due to the fact that most leading computer scientists at the time were academics, who very much wanted to treat computer science as a hard science.
Academics at MIT, the leading institution in computing at the time, began working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the MIT guys released their new system. It was unquestionably the most advanced operating system of the time, but it was a pain to install due to its computing requirements, and feature updates were slow due to the size of the code base. As a result, it never caught on beyond a few select universities and industries.
While Multics was being built, a small group supporting its development became frustrated with the system’s ever-growing requirements. They eventually decided to break away from the project. Armed with this experience, they set their sights on creating their own operating system, one with a fundamental shift in philosophy:
The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
— Richard P. Gabriel
Five years after Multics’s release, the breakaway group released their operating system, Unix. Slowly but steadily it gained traction, and by the 1990s Unix had become the go-to choice for computers, with over 90% of the world’s top 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the system underlying macOS.
There were obviously other factors beyond its simplicity that led to Unix’s success. But its lightweight design was, and still is, a highly valuable asset of the system. That could only come about because the designers were willing to sacrifice functionality. The data industry shouldn’t be afraid to think the same way.
Back to Data in the 21st Century
Thinking back on my own experiences, the philosophy of most big data engineering projects I’ve worked on was similar to that of Multics. For example, there was a project where we wanted to automate standardising the raw data coming in from all our clients. The decision was made to do this in the data warehouse via dbt, since we could then have a full view of data lineage from the very raw files all the way through to the standardised single-table version and beyond. The issue was that the first stage of transformation was very manual: it required loading each individual raw client file into the warehouse, then having dbt create a model to clean each client’s file. This led to hundreds of dbt models needing to be generated, all using essentially the same logic. dbt became so bloated that it took minutes for the data lineage chart to load in the dbt docs website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for every pull request.
This could have been resolved fairly simply if leadership had allowed us to perform the first layer of transformations outside of the data warehouse, using AWS Lambda and Python. But no, that would have meant the data lineage produced by dbt wouldn’t be 100% complete. That was it. That was the entire reason not to massively simplify the project. Like the group who broke away from the Multics project, I left this project mid-build; it was just too frustrating to work on something that so clearly could have been much simpler. As I write this, I hear they’re still working on the project.
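For illustration, here’s a minimal sketch of what that simpler first layer might have looked like: a single Python Lambda holding the shared standardisation logic, triggered whenever a raw client file lands in S3. The bucket names, column mappings, and event wiring here are hypothetical placeholders, not the project’s actual code.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

# One shared column mapping for every client, instead of one dbt model
# per client (hypothetical names).
COLUMN_MAP = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "amt": "amount",
    "Amount": "amount",
}


def handler(event, context):
    # Triggered by an S3 put event: fetch the raw client file that landed.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Apply the same standardisation logic to every client's file.
    reader = csv.DictReader(io.StringIO(raw))
    rows = [
        {COLUMN_MAP.get(col, col): val for col, val in row.items()}
        for row in reader
    ]
    if not rows:
        return  # empty file, nothing to standardise

    # Write the standardised file to the bucket the warehouse loads from,
    # so dbt only ever sees one clean, uniform feed.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket="standardised-client-data", Key=key, Body=out.getvalue())
```

One function, one place to change the logic, and the warehouse only ingests already-standardised files. The trade-off, as leadership pointed out, is that this step sits outside dbt’s lineage graph.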
So, What the Heck is Radical Simplicity?
Radical simplicity in data engineering isn’t a framework or data-stack toolkit; it is simply a state of mind, a philosophy that prioritises simple, straightforward solutions over complex, all-encompassing systems.
Key principles of this philosophy include:
- Minimalism: Focusing on the core functionalities that deliver the most value, rather than attempting to accommodate every possible scenario or requirement.
- Accepting trade-offs: Willingly sacrificing a degree of completeness or perfection in favour of simplicity, speed, and ease of maintenance.
- Pragmatism over idealism: Prioritising practical, workable solutions that solve real business problems efficiently, rather than pursuing theoretically perfect but overly complex systems.
- Reduced cognitive load: Designing systems and processes that are easier to understand, implement, and maintain, thus reducing the expertise required across multiple tools and technologies.
- Cost-effectiveness: Embracing simpler solutions that often require fewer computational resources and less human capital, leading to lower overall costs.
- Agility and adaptability: Creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
- Focus on outcomes: Emphasising the end results and business value rather than getting caught up in the intricacies of the data processes themselves.
This mindset will often be in direct contradiction to the modern data engineering approach of adding more tools, processes, and layers. As a result, expect to have to fight your corner. Before suggesting an alternative, simpler solution, come prepared with a deep understanding of the problem at hand. I’m reminded of the quote:
It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. […] It’s not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. […] You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.
— Steve Jobs
Side note: Bear in mind that adopting radical simplicity doesn’t mean ignoring new tools and advanced technologies. In fact, one of my favourite solutions for a data warehouse at the moment is a relatively new open-source database called DuckDB. Check it out; it’s pretty cool. A small taste of it follows below.
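To give a flavour of what that looks like (the file paths and columns below are made up for illustration), DuckDB runs in-process, keeps the whole database in a single file, and can query raw CSV or Parquet files directly:

```python
import duckdb  # pip install duckdb

# The whole "warehouse" is one local file; no cluster, no services to manage.
con = duckdb.connect("analytics.duckdb")

# Query a folder of standardised CSVs in place with plain SQL.
top_customers = con.execute("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM read_csv_auto('standardised/*.csv')
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
""").fetchall()
print(top_customers)
```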
Conclusion
The lessons from software engineering history offer valuable insights for today’s data landscape. By embracing radical simplicity, data teams can address many of the pain points plaguing modern data solutions.
Don’t be afraid to champion radical simplicity in your data team. Be the catalyst for change if you see opportunities to streamline and simplify. The path to simplicity isn’t easy, but the potential rewards can be substantial.