From Data Platform to ML Platform


There's nothing wrong with those systems as long as they fulfill the business requirements. All systems that fulfill our business needs are good systems. If they are simple, that's even better.

At this stage, there are multiple ways of doing data analysis:

  1. Simply submit queries to the OLTP database's replica node. (Not recommended.)
  2. Enable CDC (Change Data Capture) on the OLTP database and ingest the change logs into an OLAP database. When it comes to choosing an ingestion service for the CDC logs, you can select based on the OLAP database you have chosen. For instance, Flink streaming with CDC connectors is one way to handle this (see the sketch after this list). Many enterprise services come with their own recommended solution, e.g. Snowpipe for Snowflake. It is also recommended to read the change logs from a replica node, to preserve the CPU/IO bandwidth of the master node for online traffic.
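
To make option 2 concrete, here is a minimal PyFlink sketch of such a CDC pipeline, assuming the flink-cdc-connectors (mysql-cdc) and flink-connector-jdbc JARs are on the classpath; all hostnames, credentials, and table names below are hypothetical:

```python
# A minimal CDC-ingestion sketch; hostnames, credentials, and table
# names are hypothetical, and the connector JARs must be on the classpath.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: row-level changes streamed from the OLTP *replica* via CDC,
# keeping the master's CPU/IO free for online traffic.
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id   BIGINT,
        amount     DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector'     = 'mysql-cdc',
        'hostname'      = 'replica.mysql.internal',
        'port'          = '3306',
        'username'      = 'cdc_user',
        'password'      = 'secret',
        'database-name' = 'shop',
        'table-name'    = 'orders'
    )
""")

# Sink: upsert the change stream into the OLAP database over JDBC.
t_env.execute_sql("""
    CREATE TABLE orders_olap (
        order_id   BIGINT,
        amount     DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'jdbc',
        'url'        = 'jdbc:postgresql://olap.internal:5432/analytics',
        'table-name' = 'orders',
        'username'   = 'etl_user',
        'password'   = 'secret'
    )
""")

t_env.execute_sql("INSERT INTO orders_olap SELECT * FROM orders_cdc")
```

The same pipeline could be written in plain Flink SQL; the key point is that the source reads from the replica, not the master.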

At this stage, ML workloads may still run in your local environment. You can set up a Jupyter notebook locally, load structured data from the OLAP database, and train your ML model on your machine.
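
For example, a local training loop at this stage might look like the following sketch, using pandas and scikit-learn against a Postgres-compatible OLAP endpoint; the connection string, table, and column names are made up for illustration:

```python
# A minimal local-training sketch; the connection string, table,
# and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

engine = create_engine("postgresql://ml_user:secret@olap.internal:5432/analytics")

# Pull a structured training set straight out of the OLAP database.
df = pd.read_sql("SELECT amount, item_count, churned FROM training_features", engine)

X_train, X_test, y_train, y_test = train_test_split(
    df[["amount", "item_count"]], df["churned"], test_size=0.2, random_state=42
)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```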

The potential challenges of this architecture include, but are not limited to:

  • It is difficult to manage unstructured or semi-structured data with an OLAP database.
  • OLAP databases may suffer performance regressions on massive data processing (more than a terabyte of data for a single ETL task).
  • Lack of support for other compute engines, e.g. Spark or Presto. Most compute engines can connect to an OLAP database through a JDBC endpoint, but parallel processing will be badly limited by the IO bottleneck of the JDBC endpoint itself (see the PySpark sketch after this list).
  • The cost of storing massive data in an OLAP database is high.
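
To make the JDBC bottleneck concrete, here is a hedged PySpark sketch: Spark's JDBC source can partition a read over a numeric column, but every partition is still served by the same OLAP endpoint, so that endpoint's IO capacity caps the effective parallelism. The URL, table, credentials, and bounds below are hypothetical:

```python
# A partitioned JDBC read in PySpark; URL, table, credentials, and
# partition bounds are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("olap-jdbc-read").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://olap.internal:5432/analytics")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "secret")
    # Split the scan into 8 parallel queries over a numeric column. All 8
    # partitions are still served by the same OLAP endpoint, so its IO
    # capacity bounds throughput no matter how large the Spark cluster is.
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "8")
    .load()
)
```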

You might already know the way to solve this: build a data lake! Bringing in a data lake doesn't necessarily mean you have to completely sunset the OLAP database. It is still common to see companies run the two systems side by side for different use cases.

A data lake lets you persist unstructured and semi-structured data, and performs schema-on-read. It lets you reduce cost by storing large data volumes in specialized storage and spinning up compute clusters on demand, and it lets you handle TB/PB-scale datasets by scaling out those compute clusters.
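
For instance, schema-on-read against a data lake might look like the following PySpark sketch, assuming raw JSON events land in S3 (with the s3a/hadoop-aws libraries configured); the bucket and column names are hypothetical:

```python
# A schema-on-read sketch; bucket paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-etl").getOrCreate()

# Schema-on-read: the schema is inferred when the files are read,
# not enforced when they were written into the lake.
events = spark.read.json("s3a://my-datalake/raw/events/")
events.printSchema()

# Curate a derived dataset back into the lake as columnar Parquet.
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://my-datalake/curated/daily_event_counts/")
```

The compute cluster running this job can be spun up for the task and torn down afterwards, which is where the cost savings over an always-on OLAP database come from.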

Here's how your infrastructure might look:
