How to build a Route to Live (RTL) for data products like Machine Learning models
Attempting to make the software RTL work for data
But what’s a better solution?
How can a Software RTL and Data RTL co-exist?

Most of the time, our enterprise platforms are designed for traditional software application development. They typically consist of four environments — Dev, Test, Pre-Prod & Prod — which become increasingly locked down as you progress through them.

As such, “Dev” or Development is the most liberal of zones, where developers can typically do as they please, while at the opposite extreme “Prod” or Production is a no-touch zone and typically the only place where live data can reside.

However, these environments tend not to be suitable for the build and release of data products. By “data products” we mean applications where data and code are tightly coupled and dependent on one another. Machine learning models are a prime example: data scientists begin their lifecycle by studying live data, and model parameters depend on the data they have been trained on.

Raw (non-anonymised) data is required at scale in these scenarios so that real-world trends and multi-variable correlations can be identified, so that data can be joined across multiple source systems, and so that ethics testing such as bias detection can take place.

Synthetic or anonymised data falls short here, especially in large organisations with multiple business areas, where different data models, storage, infrastructure and legacy systems make the landscape complex. Maintaining up-to-date anonymised data that preserves referential integrity and statistical relationships across many millions of records, thousands of fields and hundreds of disparate source systems is simply not feasible.

Once organisations accept that they will have to work with live data, they try to impose the existing software RTL, forcing a square peg into a round hole, and end up pursuing one of two options:

  1. Pushing the data down into lower environments where development takes place; or
  2. Pushing the development up into higher environments where live data resides

The first option means introducing new risks, because your data has left Production’s ring of steel and is now in a less controlled environment, which is especially dangerous in public cloud. In well-governed enterprises this will often also involve data waivers on a use-case-by-use-case basis, which makes it hard to scale.

The second option means introducing human users and new tooling into what was previously a tightly controlled environment where applications all ran under service accounts. You may also have legitimate concerns about development work or a rogue query now impacting a business-critical workload.

Of the two options, the second is probably the lesser evil, as long as you can introduce some controls to isolate build activity from production workloads, for example by using resource queues (sketched below).
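As an illustration, if the platform runs analytics on Spark over YARN, build activity can be routed to a dedicated, capped queue so that it cannot starve business-critical workloads. The queue name, session name and executor cap below are assumptions for the sketch, not part of the article’s design:

```python
# A minimal sketch, assuming Spark on YARN: exploratory build work is submitted
# to a dedicated, capped queue ("data_build" is a hypothetical queue name that
# the platform team would define).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("eda-session")                                # hypothetical session name
    .config("spark.yarn.queue", "data_build")              # route work to the capped queue
    .config("spark.dynamicAllocation.maxExecutors", "20")  # hard cap on resource usage
    .getOrCreate()
)
```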

Inside Production’s ring of steel, defined through network segregation, create three entirely new environments forming a new Data RTL:

  1. Data Build
  2. Data Test
  3. Data Live

Data Build will permit human access: interactive sessions where live data can be interrogated at scale through a range of tooling and where data products can be built. This is your EDA (Exploratory Data Analytics) environment.

Data Test will see newly-built data products subjected to a range of testing, including performance and ethics testing. You’ll need far less developer tooling here, as this is simply where final checks are conducted before deployment. If checks fail, you drop back down to Data Build to perform remediations before returning here.
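As a sketch of what one ethics check in Data Test could look like, the snippet below computes a simple demographic parity gap over a scored dataset. The metric choice, column names and data are illustrative assumptions, not something the article prescribes:

```python
import pandas as pd

def demographic_parity_difference(scored: pd.DataFrame,
                                  prediction_col: str,
                                  group_col: str) -> float:
    """Largest gap in positive-prediction rate between any two groups."""
    rates = scored.groupby(group_col)[prediction_col].mean()
    return float(rates.max() - rates.min())

# Illustrative data only: predictions from a candidate model, split by region.
scored = pd.DataFrame({
    "prediction": [1, 0, 1, 1, 0, 1, 0, 0],
    "region": ["north", "north", "north", "north",
               "south", "south", "south", "south"],
})
gap = demographic_parity_difference(scored, "prediction", "region")
print(f"Demographic parity gap: {gap:.2f}")  # 0.50 here; compare against an agreed threshold
```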

Data Live will see data products running as applications under service accounts, and the environment will be free from tooling (and indeed human access), apart from monitoring capabilities. These monitoring capabilities will extend to data-related activities such as detecting data drift.
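One way such a monitor could flag data drift is with a two-sample statistical test comparing live feature values against the distribution the model was trained on. The library, test and threshold below are assumptions chosen for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(training_sample: np.ndarray,
                   live_sample: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """True if the live distribution differs significantly from the training one."""
    _, p_value = ks_2samp(training_sample, live_sample)
    return p_value < p_threshold

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)  # feature values seen at training time
recent = rng.normal(loc=0.4, scale=1.0, size=10_000)    # live values with a shifted mean
print(drift_detected(baseline, recent))                  # True: the monitor would raise an alert
```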

Across these three environments, the enterprise source data is the same asset. There is no duplication or copying of data, unless you are operating in a very non-elastic environment. With this approach, your environment pipelines only need to promote code up through the environments.

There is perhaps a need for one final environment: a playpen or experimentation zone. This is where developers have complete freedom, including web access.

  4. Data Playpen

Data Playpen is disconnected from the rest of the Data RTL via network segregation and, indeed, probably disconnected from the rest of your enterprise. It is a proving ground for new tooling or techniques, either independently of data or perhaps with synthetic data. Work in this environment informs thinking rather than being a first step in the building of a data product.

Most modern enterprises will need both RTLs, especially if they want to apply advanced analytics, and the good news is that they can co-exist relatively neatly within a single platform.

There is a decision point around whether “Data Live” and “Prod” should be the same environment or two separate environments, because at this point in the lifecycle a data product can be regarded as a standalone application running under a service account. This will need examining, but where you are appending a new Data RTL to an existing platform with an existing Software RTL, it will probably be simpler to avoid collapsing them into one environment.

Finally, from time to time there will be scenarios where data developers build software engineering components that support the build or maintenance of their data products. This could be a custom package or an audit store which logs data model inputs and outputs.

Here you can see the merging of the two RTLs, starting with the software RTL of Dev, Test and Pre-Prod, but then releasing into Data Build, Data Test and Data Live concurrently so that the artefact is available for new data build work.
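A minimal sketch of the audit-store idea mentioned above: a decorator that appends every model input and output to an append-only log. The file path, model name and scoring logic are placeholder assumptions, not the article’s implementation:

```python
import functools
import json
import time

AUDIT_LOG_PATH = "model_audit.jsonl"  # hypothetical location for the audit store

def audited(model_name: str):
    """Wrap a prediction function so every call is appended to the audit store."""
    def decorator(predict_fn):
        @functools.wraps(predict_fn)
        def wrapper(features: dict):
            prediction = predict_fn(features)
            record = {
                "model": model_name,
                "timestamp": time.time(),
                "inputs": features,
                "output": prediction,
            }
            with open(AUDIT_LOG_PATH, "a") as log:
                log.write(json.dumps(record) + "\n")
            return prediction
        return wrapper
    return decorator

@audited("churn-model-v1")     # hypothetical model name
def predict(features: dict) -> float:
    return 0.5                 # placeholder for the real scoring logic

predict({"tenure_months": 12, "monthly_spend": 49.99})
```

Because the component is plain code, it can be promoted through Dev, Test and Pre-Prod like any other software artefact and then released into the Data RTL environments where it is needed.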
