TL;DR: with data-intensive architectures, there often comes a pivotal point where building in-house data platforms makes more sense than buying off-the-shelf solutions.
The Mystical Pivot Point
Buying an off-the-shelf data platform is a popular choice for startups to speed up their business, especially in the early stages. However, is it true that firms which have already bought never need to pivot to building, just as service providers promised? There are arguments on both sides:
- Need to Pivot: The cost of buying will eventually exceed the cost of building, because the cost grows faster when you buy.
- No Need to Pivot: The platform's requirements will continue to evolve and increase the cost of building, so buying will always be cheaper.
It's quite a puzzle, yet few articles have discussed it. In this post, we'll delve into the topic, analyzing three dynamics that strengthen the case for building and two strategies to consider when deciding to pivot.
| Dynamics | Pivot Strategies |
| --- | --- |
| – Growth of Technical Credit<br>– Shift of Customer Persona<br>– Misaligned Priority | – Cost-Based Pivoting<br>– Value-Based Pivoting |
Growth of Technical Credit
It all starts outside the scope of the data platform. Like it or not, to improve the efficiency of your operation, your organization accumulates technical credit at three different levels. Realise it or not, these credits will start making building easier for you.
What's technical credit? Check out this article published in ACM.
The three levels of technical credit are:
| Technical Credits | Key Purposes |
| --- | --- |
| Cluster Orchestration | Enhance efficiency in managing multi-flavor Kubernetes clusters. |
| Container Orchestration | Enhance efficiency in managing microservices and open-source stacks. |
| Function Orchestration | Enhance efficiency by setting up an internal FaaS (Function as a Service) that abstracts all infrastructure details away. |
For cluster orchestration, there are typically three different flavors of Kubernetes clusters.
- Clusters for microservices
- Clusters for streaming services
- Clusters for batch processing
Each of them requires different provisioning strategies, especially in network design and auto-scaling. Check out this post for an overview of the network design differences.

For container orchestration efficiency, one possible way to speed things up is by extending the Kubernetes cluster with a custom resource definition (CRD). In this post, I shared how kubebuilder works and a few examples built with it, e.g., an in-house DS platform built on CRDs.
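To make the idea concrete, here is a minimal sketch of registering a custom resource with the official Kubernetes Python client. Kubebuilder itself scaffolds Go types and controllers, so treat this only as an illustration; the `SparkApp` resource name and its fields are hypothetical stand-ins for whatever your in-house DS platform would expose.

```python
# Minimal sketch: register a hypothetical "SparkApp" CRD with the
# official Kubernetes Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

crd = client.V1CustomResourceDefinition(
    api_version="apiextensions.k8s.io/v1",
    kind="CustomResourceDefinition",
    metadata=client.V1ObjectMeta(name="sparkapps.dsplatform.example.com"),
    spec=client.V1CustomResourceDefinitionSpec(
        group="dsplatform.example.com",
        scope="Namespaced",
        names=client.V1CustomResourceDefinitionNames(
            plural="sparkapps", singular="sparkapp", kind="SparkApp",
        ),
        versions=[
            client.V1CustomResourceDefinitionVersion(
                name="v1",
                served=True,
                storage=True,
                schema=client.V1CustomResourceValidation(
                    open_apiv3_schema=client.V1JSONSchemaProps(
                        type="object",
                        properties={
                            "spec": client.V1JSONSchemaProps(
                                type="object",
                                properties={
                                    "image": client.V1JSONSchemaProps(type="string"),
                                    "executors": client.V1JSONSchemaProps(type="integer"),
                                },
                            )
                        },
                    )
                ),
            )
        ],
    ),
)

client.ApiextensionsV1Api().create_custom_resource_definition(crd)
```

A controller (the part kubebuilder scaffolds for you) would then watch `SparkApp` objects and reconcile them into the underlying pods, services, and storage.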

For function orchestration efficiency, it requires a combination of the SDK and the infrastructure. Many organisations use scaffolding tools to generate code skeletons for microservices. With this inversion of control, the user's only duty is filling in the REST API's handler body.
In this post on Towards Data Science, most services in the MLOps journey are built using FaaS. Especially for model-serving services, machine learning engineers only need to fill in a few essential functions that handle feature loading, transformation, and request routing.
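As an illustration of that inversion of control, a minimal sketch of such a skeleton might look like the following. Everything here (the `ServingHandler` base class and the hook names) is hypothetical; a real internal FaaS SDK would wire these hooks into routing, scaling, and monitoring for you.

```python
# Hypothetical FaaS skeleton for a model-serving service.
# The platform owns the server loop; the ML engineer fills in three hooks.
from abc import ABC, abstractmethod
from typing import Any, Dict


class ServingHandler(ABC):
    """Base class provided by the (hypothetical) internal FaaS SDK."""

    @abstractmethod
    def load_features(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Fetch the features needed for this request."""

    @abstractmethod
    def transform(self, features: Dict[str, Any]) -> Dict[str, Any]:
        """Convert raw features into model inputs."""

    @abstractmethod
    def predict(self, model_input: Dict[str, Any]) -> Dict[str, Any]:
        """Run the model and shape the response."""

    def handle(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # The platform calls this; the engineer never touches it.
        return self.predict(self.transform(self.load_features(request)))


class ChurnModelHandler(ServingHandler):
    """What a user actually writes: just the three hooks."""

    def load_features(self, request):
        return {"user_id": request["user_id"], "logins_7d": 3}  # stub lookup

    def transform(self, features):
        return {"x": [features["logins_7d"]]}

    def predict(self, model_input):
        score = 1.0 / (1.0 + sum(model_input["x"]))  # stand-in for a real model
        return {"churn_score": score}


if __name__ == "__main__":
    print(ChurnModelHandler().handle({"user_id": "u-42"}))
```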

The following table summarises the key user journey and the area of control for each level of technical credit.

| Technical Credits | Key User Journey | Area of Control |
| --- | --- | --- |
| Cluster Orchestration | Self-serve creation of multi-flavour K8s clusters. | – Policy for Region, Zone, and IP CIDR Planning<br>– Network Peering<br>– Policy for Instance Provisioning<br>– Security & OS Hardening<br>– Terraform Modules and CI/CD Pipelines |
| Container Orchestration | Self-serve service deployment, open-source stack deployment, and CRD building. | – GitOps for Cluster Resource Releases<br>– Policy for Ingress Creation<br>– Policy for Custom Resource Definitions<br>– Policy for Cluster Auto-Scaling<br>– Policy for Metric Collection and Monitoring<br>– Cost Tracking |
| Function Orchestration | Focus solely on implementing business logic by filling pre-defined function skeletons. | – Identity and Permission Control<br>– Configuration Management<br>– Internal State Checkpointing<br>– Scheduling & Migration<br>– Service Discovery<br>– Health Monitoring |
With the growth of technical credit, the cost of building decreases.

However, the transferability differs across the levels of technical credit. From bottom to top, it becomes less and less transferable. You can implement consistent infrastructure management and reuse microservices, but it is difficult to reuse the technical credit for building FaaS across different topics. Moreover, a declining cost of building doesn't mean you should rebuild everything yourself. For a complete build-vs-buy trade-off evaluation, two more aspects play a part:
- Shift of Customer Persona
- Misaligned Priority
Shift of Customer Persona
As your organization grows, you'll soon realize that the persona distribution of your data platform's users is shifting.

When you are small, the vast majority of your users are Data Scientists and Data Analysts. They explore data, validate ideas, and generate metrics. However, as more data-centric product features are released, engineers begin to write Spark jobs to back their online services and ML models. Those data pipelines are similar to microservices. This persona shift makes a fully GitOps-based data pipeline development journey acceptable, and even welcomed.
Misaligned Priority
There will be misalignments between SaaS providers and you, simply because each party must act in the best interest of its own company. The misalignment initially appears minor but may progressively worsen over time. The potential misalignments are:
| Priority | SaaS Provider | You |
| --- | --- | --- |
| Feature Prioritisation | Benefit of the majority of customers | Benefit of your organisation |
| Cost | Secondary impact (potential customer churn) | Direct impact (you pay more) |
| System Integration | Standard interface | Customisable integration |
| Resource Pooling | Shared between their tenants | Shared across your internal systems |
For resource pooling, data systems are ideal for co-locating with online systems, as their workloads typically peak at different times. Most of the time, online systems experience peak usage during the day, whereas data platforms peak at night. With higher commitments to your cloud provider, the benefits of resource pooling become more significant. Especially when you purchase yearly reserved-instance quotas, combining online and offline workloads gives you stronger bargaining power. SaaS providers, however, will prioritise pivoting to serverless architectures to enable resource pooling among their own customers, thereby improving their profit margin.
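As a rough back-of-the-envelope illustration (all numbers below are made up), pooling day-peaking online workloads with night-peaking data workloads lets you commit to the combined peak rather than the sum of the individual peaks:

```python
# Illustrative only: hypothetical hourly node demand over one day.
online = [100 if 9 <= h < 21 else 30 for h in range(24)]   # online services peak during the day
offline = [20 if 9 <= h < 21 else 90 for h in range(24)]   # batch/data jobs peak at night

separate_commitment = max(online) + max(offline)                 # reserve for each peak: 190 nodes
pooled_commitment = max(o + f for o, f in zip(online, offline))  # combined peak: 120 nodes

print(separate_commitment, pooled_commitment)
```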
Pivot! Pivot! Pivot?
Even with the cost of building declining and misalignments growing, building will never be an easy option. It requires domain expertise and long-term investment. However, the good news is that you don't have to perform a complete switch. There are compelling reasons to adopt a hybrid approach or a step-by-step pivot, maximizing the return on investment from both buying and building. There are two ways to move forward:
- Cost-Based Pivoting
- Value-Based Pivoting
Disclaimer: this is my own perspective. It presents some general principles, and you're encouraged to do your own research for validation.
Approach One: Cost-Based Pivoting
The 80/20 rule applies well to Spark jobs, too. 80% of Spark jobs run in production, while the remaining 20% are submitted by users from the dev/sandbox environment. Among the 80% of jobs in production, 80% are small and simple, while the remaining 20% are large and complicated.
Want to understand why Databricks Photon performs well on complex Spark jobs? Check out this post by Huong.
Moreover, sandbox and development environments require stronger data governance controls and data discoverability capabilities, both of which require quite complex systems. In contrast, the production environment is more focused on GitOps control, which is simpler to build with existing offerings from the cloud and the open-source community.

If you can build a cost-based dynamic routing system, such as a multi-armed bandit, to route less complex Spark jobs to a cheaper in-house platform, you can potentially save a significant amount of cost (a sketch follows at the end of this subsection). However, there are two prerequisites:
- Platform-Agnostic Artifacts: A platform like Databricks may have its own SDK or notebook notation that is specific to the Databricks ecosystem. To achieve dynamic routing, you need to implement standards for creating platform-agnostic artifacts that can run on different platforms. This practice is crucial to prevent vendor lock-in in the long run.
- An In-House Catalog Service (e.g., Hive Metastore): It is an anti-pattern to have two duplicated systems side by side, but it may be necessary when you pivot to build. For instance, open-source Spark can't leverage Databricks' Unity Catalog to its full capability. Therefore, you may need to develop a catalog service, such as a Hive metastore, for your in-house platform.
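Putting the two prerequisites together, a platform-agnostic PySpark entrypoint could look like the sketch below: it sticks to plain `pyspark` APIs and resolves tables through an external Hive metastore set by configuration, so the same artifact can be submitted to Databricks or to an in-house cluster. The metastore URI, environment variable, and table names are placeholders.

```python
# Sketch of a platform-agnostic Spark job: plain pyspark only, no
# vendor-specific SDK, catalog resolved through an external Hive metastore.
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def build_session() -> SparkSession:
    return (
        SparkSession.builder.appName("daily_orders_agg")
        # Placeholder URI; injected by whichever platform runs the job.
        .config("hive.metastore.uris",
                os.environ.get("METASTORE_URI", "thrift://metastore:9083"))
        .enableHiveSupport()
        .getOrCreate()
    )


def run(spark: SparkSession) -> None:
    orders = spark.table("analytics.orders")  # placeholder table
    daily = (
        orders.groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )
    daily.write.mode("overwrite").saveAsTable("analytics.daily_order_totals")


if __name__ == "__main__":
    run(build_session())
```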
Please also note that a small proportion of complex jobs may account for a large portion of your bill. Therefore, conduct thorough research for your own case.
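Here is the routing sketch mentioned above, written as a minimal epsilon-greedy bandit. The cost model, the platform names, and the `submit_to` function are all hypothetical; in practice you would estimate job complexity from historical run metadata and plug in your own submission clients.

```python
# Minimal epsilon-greedy sketch for routing Spark jobs between a SaaS
# platform and a cheaper in-house platform. All names are hypothetical.
import random
from collections import defaultdict

PLATFORMS = ["saas", "in_house"]
EPSILON = 0.1

# Running totals of observed cost (e.g., dollars per job) per platform.
totals = defaultdict(float)
counts = defaultdict(int)


def choose_platform(job_complexity: float) -> str:
    # Guardrail: keep genuinely complex jobs on the SaaS platform.
    if job_complexity > 0.8:
        return "saas"
    if random.random() < EPSILON or not counts:
        return random.choice(PLATFORMS)  # explore
    # Exploit: pick the platform with the lowest average observed cost.
    return min(PLATFORMS,
               key=lambda p: totals[p] / counts[p] if counts[p] else float("inf"))


def record_cost(platform: str, observed_cost: float) -> None:
    totals[platform] += observed_cost
    counts[platform] += 1


def submit_to(platform: str, job: dict) -> float:
    """Hypothetical submission client; returns the observed cost of the run."""
    # Placeholder: pretend the in-house platform is cheaper for simple jobs.
    base = 5.0 if platform == "saas" else 2.0
    return base * (1 + job["complexity"])


if __name__ == "__main__":
    for _ in range(100):
        job = {"complexity": random.random()}
        platform = choose_platform(job["complexity"])
        record_cost(platform, submit_to(platform, job))
    print({p: round(totals[p] / counts[p], 2) for p in PLATFORMS if counts[p]})
```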
Approach Two: Value-Based Pivoting
The second pivot approach is based on how a data pipeline generates value for your company:
- Operational: data as a product is the value
- Analytical: insight is the value
This breakdown framework is inspired by the article MLOps: Continuous delivery and automation pipelines in machine learning. It brings up a crucial concept called experimental-operational symmetry.

We classify our data pipelines in two dimensions:
- Based on the complexity of the artifact, they’re classified into low-code, scripting, and high-code pipelines.
- Based on the value they generate, they're classified into operational and analytical pipelines.
High-code and operational pipelines call for rigorous code review and validation. Scripting and analytical pipelines call for fast development velocity. When an analytical pipeline carries a crucial analytical insight and needs to be democratized, it should be transitioned to an operational pipeline with code reviews, because the health of this pipeline will become critical to many others.
Full experimental-operational symmetry, however, is not recommended for scripting and high-code artifacts.
Let’s examine the operational principles and key requirements of those different pipelines.
| Pipeline Type | Operational Principle | Key Requirements of the Platform |
| --- | --- | --- |
| Data as Product (Operational) | Strict GitOps, Rollback on Failure | Stability & Close Internal Integration |
| Insight as Value (Analytical) | Fast Iteration, Rollover on Failure | User Experience & Developer Velocity |
Because of these different ways of yielding value and different operating principles, you can:
- Pivot Operational Pipelines: Since internal integration is more critical for the operational pipeline, it makes more sense to pivot those to in-house platforms first.
- Pivot Low-Code Pipelines: Low-code pipelines can also be converted easily thanks to their low-code nature.
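To tie the two dimensions together, here is a tiny, purely illustrative sketch of how the classification could drive pivot order; the enum values and the ranking simply restate the two bullets above, and the pipeline names are made up.

```python
# Illustrative restatement of the pivot ordering above: operational and
# low-code pipelines are the first candidates to move in-house.
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    LOW_CODE = "low-code"
    SCRIPTING = "scripting"
    HIGH_CODE = "high-code"


class Value(Enum):
    OPERATIONAL = "operational"   # data as a product
    ANALYTICAL = "analytical"     # insight as the value


@dataclass
class Pipeline:
    name: str
    complexity: Complexity
    value: Value

    def pivot_priority(self) -> int:
        """Lower number = pivot to the in-house platform sooner."""
        if self.value is Value.OPERATIONAL:
            return 0  # internal integration matters most here
        if self.complexity is Complexity.LOW_CODE:
            return 1  # easy to convert thanks to its low-code nature
        return 2      # analytical scripting/high-code: keep on the SaaS for now


pipelines = [
    Pipeline("feature-store-sync", Complexity.HIGH_CODE, Value.OPERATIONAL),
    Pipeline("weekly-kpi-report", Complexity.LOW_CODE, Value.ANALYTICAL),
    Pipeline("ad-hoc-churn-study", Complexity.SCRIPTING, Value.ANALYTICAL),
]

for p in sorted(pipelines, key=Pipeline.pivot_priority):
    print(p.pivot_priority(), p.name)
```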
At Last
To pivot or not to pivot is not an easy call. In summary, these are the practices you should adopt regardless of the decision you make:
- Pay attention to the growth of your internal technical credit, and refresh your evaluation of the total cost of ownership.
- Promote Platform-Agnostic Artifacts to avoid vendor lock-in.
Of course, when you do need to pivot, have a thorough strategy. How does AI change our evaluation here?
- AI makes prompt-to-high-code possible. It dramatically accelerates the development of both operational and analytical pipelines. To keep up with the trend, you might want to consider buying, or building if you are confident.
- AI demands higher-quality data. Ensuring data quality will be even more critical for both in-house platforms and SaaS providers.
Those are my thoughts on this under-discussed topic. Let me know what you think. Cheers!