Stop Building AI Platforms


While small and medium-sized companies have succeeded in building data and ML platforms, building AI platforms is profoundly difficult. This post discusses three key reasons why you should be cautious about building AI platforms and proposes my thoughts on promising directions instead.

Disclaimer: This post is based on personal views and does not apply to cloud providers and data/ML SaaS companies. They should instead double down on research into AI platforms.

Where I’m Coming From

In my previous article in Towards Data Science, I shared how a data platform evolves into an ML platform. This journey applies to most small and medium-sized companies. However, there is no clear path yet for these companies to continue developing their platforms into AI platforms. Leveling up to AI platforms, the path forks into two directions:

  • AI Infrastructure: The "New Electricity" (AI inference) is more efficient when centrally generated. It is a game for big tech and large model providers.
  • AI Application Platform: You cannot build the "beach house" (AI platform) on constantly shifting ground. Evolving AI capabilities and emerging development paradigms make lasting standardization difficult to find.

However, there are still directions that are likely to remain important even as AI models continue to evolve. They are covered at the end of this post.

High Barrier of AI Infrastructure

While Databricks might be only several times better than your own Spark jobs, DeepSeek can be 100x more efficient than you at LLM inference. Training and serving an LLM requires significantly more investment in infrastructure and, just as importantly, control over the model's structure.

Image Generated by OpenAI ChatGPT 4o

In this series, I briefly shared the infrastructure for LLM training, which includes parallel training strategies, topology designs, and training accelerations. On the hardware side, besides high-performance GPUs and TPUs, a significant slice of the cost goes to networking setup and high-performance storage services. Clusters require an additional RDMA network to enable non-blocking, point-to-point connections for data exchange between instances. The orchestration services must support complex job scheduling, failover strategies, hardware issue detection, and GPU resource abstraction and pooling. The training SDK must facilitate asynchronous checkpointing, data processing, and model quantization.
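To make the asynchronous checkpointing requirement concrete, here is a minimal sketch, not from the original post, of a training loop offloading checkpoint writes to a background thread so GPU steps are not blocked. The model, optimizer, and checkpoint path are hypothetical stand-ins; real training SDKs add sharded state and distributed coordination on top of this idea.

```python
import copy
import threading
import torch

def async_checkpoint(model, optimizer, step, path):
    """Snapshot state on the main thread, then write it out in the background."""
    # Copy tensors to CPU first so the next GPU training step can start immediately.
    snapshot = {
        "step": step,
        "model": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        "optimizer": copy.deepcopy(optimizer.state_dict()),
    }

    def _write():
        torch.save(snapshot, f"{path}/ckpt_{step}.pt")  # hypothetical checkpoint layout

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # caller may join() before exiting

# Usage inside a training loop (sketch):
# for step, batch in enumerate(loader):
#     loss = compute_loss(model, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if step % 1000 == 0:
#         async_checkpoint(model, optimizer, step, "/checkpoints/run-1")
```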

Regarding model serving, model providers often design for inference efficiency during model development. They likely have better model quantization strategies, which can produce the same model quality at a significantly smaller model size. They are also more likely to develop a better model-parallel strategy due to the control they have over the model structure, which allows larger batch sizes during LLM inference and effectively increases GPU utilization. Moreover, large LLM players have logistical advantages that give them access to cheaper routers, mainframes, and GPU chips. More importantly, stronger control over the model structure and better model parallelism mean model providers can leverage cheaper GPU devices. For model consumers relying on open-source models, GPU deprecation can be an even bigger concern.
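As a rough illustration of why larger batches raise GPU utilization, here is a minimal sketch of request batching for inference. It is my own illustration, not the author's code; generate_batch is a placeholder for whatever inference engine you actually use.

```python
import queue
import time

def generate_batch(prompts):
    """Placeholder for a batched inference call (your engine's generate)."""
    return [f"completion for: {p}" for p in prompts]

def batching_loop(request_queue, max_batch_size=32, max_wait_ms=10):
    """Collect requests briefly, then serve them in one batched forward pass."""
    while True:
        batch = [request_queue.get()]                     # block until the first request
        deadline = time.time() + max_wait_ms / 1000
        while len(batch) < max_batch_size and time.time() < deadline:
            try:
                batch.append(request_queue.get(timeout=max(deadline - time.time(), 0)))
            except queue.Empty:
                break
        prompts = [req["prompt"] for req in batch]
        for req, output in zip(batch, generate_batch(prompts)):
            req["callback"](output)                       # return the completion to the caller
```

Each loop iteration launches one forward pass over many sequences instead of one per request, which is the core of the utilization argument above.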

Take DeepSeek R1 as an example. Say you are using a p5e.48xlarge AWS instance, which provides 8 H200 chips connected with NVLink, at about $35 per hour. Assuming you do as well as Nvidia and achieve 151 tokens/second, generating 1 million output tokens will cost you about $64 (1,000,000 / (151 * 3600) * $35). How much does DeepSeek charge per million output tokens? Only $2! DeepSeek can achieve roughly 60 times the efficiency of your cloud deployment (assuming a 50% margin for DeepSeek).
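The back-of-the-envelope math above can be reproduced in a few lines; the instance price and throughput are the figures quoted in the paragraph, not measured values.

```python
# Back-of-the-envelope cost of self-hosting DeepSeek R1 on a p5e.48xlarge (8x H200).
instance_cost_per_hour = 35.0      # USD per hour, figure assumed above
throughput_tokens_per_sec = 151    # assumes you match Nvidia's published performance
tokens = 1_000_000

hours_needed = tokens / (throughput_tokens_per_sec * 3600)
self_host_cost = hours_needed * instance_cost_per_hour    # ~ $64 per 1M output tokens

deepseek_price = 2.0                                       # USD per 1M output tokens
deepseek_cost_estimate = deepseek_price * 0.5              # assume a 50% margin

print(f"self-hosted: ${self_host_cost:.1f} per 1M tokens")
print(f"efficiency gap: ~{self_host_cost / deepseek_cost_estimate:.0f}x")  # ~64x, in line with the rough 60x above
```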

So LLM inference power is indeed like electricity. It reflects the variety of applications that LLMs can power, and it also means it is most efficient when centrally generated. Nevertheless, you should still self-host LLM services for privacy-sensitive use cases, just as hospitals keep their own generators for emergencies.

Constantly Shifting Ground

Investing in AI infrastructure is a daring game, and building lightweight platforms for AI applications comes with its own hidden pitfalls. With the rapid evolution of AI model capabilities, there is no aligned paradigm for AI applications; consequently, there is no solid foundation on which to build an AI platform.

Image Generated by OpenAI ChatGPT 4o

The simple explanation for that is:

If we take a holistic view of data and ML platforms, development paradigms emerge only when the capabilities of algorithms converge.
| Domain | Algorithms Emerge | Solutions Emerge | Big Platforms Emerge |
| --- | --- | --- | --- |
| Data Platform | 2004: MapReduce (Google) | 2010–2015: Spark, Flink, Presto, Kafka | 2020–Now: Databricks, Snowflake |
| ML Platform | 2012: ImageNet (AlexNet, CNN breakthrough) | 2015–2017: TensorFlow, PyTorch, Scikit-learn | 2018–Now: SageMaker, MLflow, Kubeflow, Databricks ML |
| AI Platform | 2017: Transformers (Attention Is All You Need) | 2020–2022: ChatGPT, Claude, Gemini, DeepSeek | 2023–Now: ?? |

After several years of fierce competition, only a few large model players remain standing in the arena. However, the evolution of AI capabilities has not yet converged. As AI models' capabilities advance, the current development paradigm will quickly become obsolete. Big players have just started to take their stab at agent development platforms, and new solutions are popping up like popcorn in an oven. Winners will eventually emerge, I believe. For now, building agent standardization themselves is a difficult call for small and medium-sized companies.

Path Dependency of Old Success

Another challenge of building an AI platform is somewhat subtle. It concerns the mindset of platform builders: whether they carry path dependency from their previous success in building data and ML platforms.

Image Generated by OpenAI ChatGPT 4o

As previously shared, since 2017 the data and ML development paradigms have been well aligned, and the most critical task for the ML platform is standardization and abstraction. However, the development paradigm for AI applications is not yet established. If a team follows the previous success story of building a data and ML platform, it may end up prioritizing standardization at the wrong time. Possible directions are:

  • Build an AI Model Gateway: Provide centralized audit and logging of requests to LLM models (see the sketch after this list).
  • Build an AI Agent Framework: Develop an in-house SDK for creating AI agents with enhanced connectivity to the internal ecosystem.
  • Standardize RAG Practices: Build a standard data indexing flow to lower the bar for engineers building knowledge services.
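To illustrate the first bullet, here is a minimal sketch of an AI model gateway: a thin wrapper that forwards chat requests to a provider and appends an audit record. The call_provider function and the log format are hypothetical placeholders, not a recommendation of any particular stack.

```python
import json
import time
import uuid
from typing import Callable

def call_provider(model: str, messages: list) -> str:
    """Placeholder for the real provider call (OpenAI, Anthropic, self-hosted, ...)."""
    raise NotImplementedError

def gateway(model: str, messages: list, user: str,
            provider: Callable = call_provider, log_path: str = "audit.jsonl") -> str:
    """Forward a chat request to the provider and append an audit record."""
    request_id = str(uuid.uuid4())
    started = time.time()
    response = provider(model, messages)
    record = {
        "request_id": request_id,
        "user": user,
        "model": model,
        "prompt_chars": sum(len(m["content"]) for m in messages),
        "latency_s": round(time.time() - started, 3),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```

Note that the record stores request metadata rather than raw prompt text, one possible design choice that keeps the audit log useful without duplicating sensitive content.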

Those initiatives can indeed be valuable, but the ROI really depends on the size of your organization. Regardless, you will face the following challenges:

  • Keeping up with the latest AI developments.
  • Customer adoption, since it is easy for customers to bypass your abstraction.

If builders of data and ML platforms are like "Closet Organizers", AI platform builders now need to act like "Fashion Designers". It requires embracing new ideas, conducting rapid experiments, and even accepting a degree of imperfection.

My Thoughts on Promising Directions

Though many challenges lie ahead, it is still gratifying to work on an AI platform right now, as you have substantial leverage that wasn't there before:

  • The transformative capability of AI is more substantial than that of data and machine learning.
  • The motivation to adopt AI is far stronger than ever.

If you pick the right direction and strategy, the transformation you can bring to your organization is significant. Here are some of my thoughts on directions that will likely experience less disruption as AI models scale further. I believe they are just as important as AI platformization:

  • High-quality, semantically rich data products: Data products with high accuracy and accountability, rich descriptions, and trustworthy metrics will "radiate" more impact as AI models grow.
  • Multi-modal Data Serving: A scalable knowledge service behind an MCP server may require multiple kinds of databases, such as OLTP, OLAP, NoSQL, and Elasticsearch, to support high-performance data serving. It is difficult to maintain a single source of truth and performance with constant reverse ETL jobs.
  • AI DevOps: AI-centric software development, maintenance, and analytics. Code-gen accuracy has greatly increased over the past 12 months.
  • Experimentation and Monitoring: Given the increased uncertainty of AI applications, evaluating and monitoring those applications is even more critical.

These are my thoughts on building AI platforms. Please let me know your thoughts as well. Cheers!
