When I talk with [large] organisations that have not yet properly begun with Data Science (DS) and Machine Learning (ML), they often tell me that they should run a data integration project first, because "…all the data is scattered across the organisation, hidden in silos and packed away in odd formats on obscure servers run by different departments."
While it may well be true that the data is hard to get at, running a big data integration project before embarking on the ML part is a bad idea. Because you integrate the data without knowing its use, the chances that it will be fit for purpose in some future ML use case are slim at best.
In this article, I discuss some of the most important drivers and pitfalls of this kind of integration project, and instead suggest an approach that focuses on optimising value for money in the integration efforts. The short answer to the challenge is [spoiler alert…] to integrate data on a use-case-by-use-case basis, working backwards from the use case to discover exactly the data you need.
A desire for clean and tidy data
It is easy to understand the urge to do data integration before starting on the data science and machine learning challenges. Below, I list four drivers that I often meet. The list is not exhaustive, but it covers the most important motivations as I see them. We will then go through each driver, discussing its merits, pitfalls and alternatives.
- Identifying AI/ML use cases is hard, and even more so if you don't know what data is available, and of what quality.
- Hunting down hidden-away data and integrating it into a platform seems like a more concrete and manageable problem to solve.
- Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this.
- From experience, we know that many ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges before the ML project can help remove these barriers.
There are of course other drivers for data integration projects, such as "single source of truth", "Customer 360", FOMO, and the basic urge to "do something now!". While these are important drivers for data integration initiatives, I don't see them as key for ML projects, and will therefore not discuss them further in this post.
1. Identifying AI/ML use cases is hard,
… and even more so if you don't know what data is available, and of what quality. This is, in fact, a real Catch-22: you can't do machine learning without the right data in place, but if you don't know what data you have, identifying the potential of machine learning is practically impossible too. Indeed, it is one of the main challenges in getting started with machine learning in the first place [see "Nobody puts AI in a corner!" for more on that]. But the problem is not solved most effectively by running an initial data discovery and integration project. It is better solved by a great methodology that is well proven in use and applies to many different problem areas. It is called talking together. Since this, to a large extent, is the answer to several of the driving urges, we will spend a few lines on this topic now.
The value of getting people talking to each other cannot be overestimated. It is the only way to make a team work, and to make teams across an organisation work together. It is also a very efficient carrier of knowledge about the intricate details of data, products, services or other contraptions that are made by one team but used by someone else. Compare "Talking Together" to its antithesis in this context: Produce Comprehensive Documentation. Producing self-contained documentation is hard and expensive. For a dataset to be usable by a third party solely by consulting the documentation, the documentation must be complete. It must document the full context in which the data is to be seen: How was the data captured? What is the generating process? What transformations have been applied to bring the data to its current form? What is the interpretation of the different fields/columns, and how do they relate? What are the data types and value ranges, and how should one deal with null values? Are there access restrictions or usage restrictions on the data? Privacy concerns? The list goes on and on. And as the dataset changes, the documentation must change too.
Now, if the data is an independent, commercial data product that you provide to customers, comprehensive documentation may be the way to go. If you are OpenWeatherMap, you want your weather data APIs to be well documented: these are true data products, and OpenWeatherMap has built a business out of serving real-time and historical weather data through those APIs. Also, if you are a large organisation and a team finds that it spends so much time talking to others that comprehensive documentation would indeed pay off, then you do that. But most internal data products have one or two internal consumers to begin with, and then comprehensive documentation does not pay off.
On a general note, talking together is a key factor in succeeding with a transition to AI and Machine Learning altogether, as I write about in "Nobody puts AI in a corner!". And it is a cornerstone of agile software development. Remember the Agile Manifesto? "Individuals and interactions over processes and tools", it states. So there you have it. Talk Together.
Also, not only does documentation incur a cost, but you also run the risk of raising the barrier to people talking together ("read the $#@!!?% documentation").
Now, just to be clear on one thing: I am not against documentation. Documentation is super important. But, as we discuss in the next section, don't waste time writing documentation that isn't needed.
2. Hunting down hidden-away data and integrating it into a platform seems like a much more concrete and manageable problem to solve.
Yes, it is. However, the downside of doing this before identifying the ML use case is that you only solve the "integrate data into a platform" problem. You don't solve the "gather useful data for the machine learning use case" problem, which is what you actually want to do. This is another flip side of the Catch-22 from the previous section: if you don't know the ML use case, you don't know what data you need to integrate. Also, integrating data for its own sake, without the data users being part of the team, requires excellent documentation, which we have already covered.
To look deeper into why data integration without the ML use case in view is premature, we can look at how [successful] machine learning projects are run. At a high level, the output of a machine learning project is a kind of oracle that answers questions for you. "What product should we recommend for this user?", or "When is this motor due for maintenance?". If we stick with the latter, the algorithm would be a function mapping the motor in question to a date, namely the due date for maintenance. If this service is provided through an API, the input could be {"motor-id" : 42} and the output could be {"latest maintenance" : "March 9th 2026"}. Now, this prediction is done by some "system", so a richer picture of the solution could be something along the lines of the sketch below.
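As a rough sketch, with purely hypothetical names for the feature lookup and the model, the flow could look something like this:

```python
# Hypothetical sketch: the motor-id from the API request is used to look up a
# feature vector, which a trained model maps to a predicted maintenance date.

from datetime import date, timedelta


def fetch_feature_vector(motor_id: int) -> dict[str, float]:
    # In reality this would query the integrated data sources (operating hours,
    # vibration statistics, temperature history, ...). Exactly which features
    # belong here is what the ML experiments have to figure out.
    return {"operating_hours": 12_345.0, "vibration_rms": 0.7}


def predict_due_date(features: dict[str, float]) -> date:
    # Stand-in for the trained model; a real one would be loaded from a model
    # registry and called with the feature vector.
    remaining_days = max(0, 500 - int(features["operating_hours"] / 30))
    return date.today() + timedelta(days=remaining_days)


def handle_request(payload: dict) -> dict:
    # payload: {"motor-id": 42}
    features = fetch_feature_vector(payload["motor-id"])
    return {"latest maintenance": predict_due_date(features).isoformat()}


print(handle_request({"motor-id": 42}))  # e.g. {'latest maintenance': '2026-...'}
```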
The key point here is that the motor-id is used to obtain further information about that motor in order to make a robust prediction. The required data set is represented by the feature vector in the sketch. And exactly which data you need in order to make that prediction is hard to know before the ML project has started. Indeed, the very precipice on which every ML project balances is whether the project succeeds in figuring out exactly what information is needed to answer the question well. And this is done by trial and error in the course of the ML project (we call it hypothesis testing and feature extraction and experiments and other fancy things, but it is really just structured trial and error).
If you integrate your motor data into the platform without these experiments, how can you know what data you need to integrate? Sure, you could integrate everything, and keep updating the platform with all the data (and documentation) until the end of time. But most likely, only a small amount of that data is needed to solve the prediction problem. Unused data is waste: both the effort invested in integrating and documenting the data, and the storage and maintenance cost for all time to come. Following the Pareto rule, you can expect roughly 20% of the data to provide 80% of the data value. But it is hard to know which 20% that is before you know the ML use case, and before you have run the experiments.
This is also a caution against just "storing data for the sake of it". I have seen many data hoarding initiatives, where decrees have been passed down from top management about saving away all the data possible, because data is the new oil/gold/money/currency/etc. For a concrete example: a few years back I met with an old colleague, a product owner in the mechanical industry, whose company had started collecting all kinds of time series data about their machinery some time ago. At some point, they came up with a killer ML use case where they wanted to exploit how distributed events across the industrial plant were related. But alas, when they looked at their time series data, they realised that the distributed machine instances did not have sufficiently synchronised clocks, leading to time stamps that could not be correlated, so the planned cross-correlation between time series was not feasible after all. A bummer, that one, but a classic example of what happens when you don't know the use case you are gathering data for.
3. Many organisations have a culture of not sharing data, and focusing on data sharing and integration first helps to change this culture.
The first part of this sentence is true; there is no doubt that many good initiatives are blocked by cultural issues within the organisation: power struggles, data ownership, reluctance to share, siloing and so on. The question is whether an organisation-wide data integration effort is going to change this. If someone is reluctant to share their data, a creed from above stating that if you share your data, the world will be a better place is probably too abstract to change that attitude.
However, if you engage with this group, include them in the work and show them how their data can help the organisation improve, you are far more likely to win their hearts. Because attitudes are about feelings, and the best way to deal with differences of this kind is (believe it or not) to talk together. The team providing the data has a need to shine, too. And if they are not invited into the project, they will feel forgotten and ignored when honour and glory rain down on the ML/product team that delivered some new and fancy solution to a long-standing problem.
Remember that the data feeding into the ML algorithms is part of the product stack; if you don't include the data-owning team in the development, you are not running full stack. (An important reason why full stack teams are better than many alternatives is that within teams, people talk together. And bringing all the players in the value chain into the [full stack] team gets them talking together.)
I have been in quite a few organisations, and time and again I have run into collaboration problems caused by cultural differences of this kind. Never have I seen such barriers drop because of a decree from the C-suite level. Middle management may buy into it, but the rank-and-file employees mostly just give it a scornful look and carry on as before. However, I have been on many teams where we solved this problem by inviting the other party into the fold, and talking about it, together.
4. From experience, we know that many DS/ML projects grind to a halt due to data access issues, and tackling the organisational, political and technical challenges before the ML project can help remove these barriers.
While the point on cultural change is about human behaviour, I place this one in the category of technical matters. When data is integrated into the platform, it has to be safely stored and easy to obtain and use in the right way. For a large organisation, having a strategy and policies for data integration is important. But there is a difference between setting up an infrastructure for data integration, together with a minimum of processes around that infrastructure, and scavenging through the enterprise and integrating a shitload of data. Yes, you need the platform and the policies, but you don't integrate data before you know that you need it. And when you do this step by step, you benefit from iterative development of the data platform too.
A basic platform infrastructure should also include the necessary policies to ensure compliance with regulations, privacy and other concerns. Concerns that come with being an organisation that uses machine learning and artificial intelligence to make decisions, and that trains on data that may or may not be generated by individuals who may or may not have given their consent to different uses of that data.
But to circle back to the first driver, about not knowing what data the ML projects may get their hands on: you still need to help people navigate the data residing in various parts of the organisation. And if we are not to run an integration project first, what do we do? Establish a data catalogue where departments and teams are rewarded for adding a block of text about what kinds of data they are sitting on. Just a brief description of the data: what kind of data it is, what it is about, who the stewards of the data are, and perhaps a guess at what it could be used for. Put this into a text database or similar structure, and make it searchable. Or, even better, let the database back an AI assistant that lets you do proper semantic searches through the descriptions of the datasets. As time (and projects) pass by, the catalogue can be extended with further information and documentation as data is integrated into the platform and documentation is created. And if someone queries a department about their dataset, you may just as well put both the question and the answer into the catalogue database too.
Such a database, containing mostly free text, is a much cheaper alternative to a fully integrated data platform with comprehensive documentation. You just need the different data-owning teams and departments to dump some of their documentation into the database. They could even use generative AI to produce the documentation (letting them check off that OKR too 🙉🙈🙊).
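To give a sense of how low the bar can be, here is a minimal sketch of such a free-text-searchable catalogue using SQLite's built-in FTS5 full-text index; the table and field names are hypothetical:

```python
# Minimal sketch of a free-text-searchable dataset catalogue (hypothetical names),
# using SQLite's FTS5 full-text index.

import sqlite3

conn = sqlite3.connect("catalogue.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS datasets "
    "USING fts5(name, department, steward, description)"
)

# Each team drops in a short, free-text description of what it is sitting on.
conn.execute(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    (
        "motor-telemetry",
        "Maintenance",
        "jane.doe",
        "Vibration and temperature time series per motor, sampled every 5 minutes "
        "since 2019. Clocks are NOT synchronised across plants.",
    ),
)
conn.commit()

# Later, an ML team wondering what data is out there can simply search:
for row in conn.execute(
    "SELECT name, department, steward FROM datasets WHERE datasets MATCH ?",
    ("vibration OR temperature",),
):
    print(row)
```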
5. Summing up
To sum up, in the context of ML projects, the data integration effort should be attacked as follows:
- Establish a data platform/data mesh strategy, together with the minimally required infrastructure and policies.
- Create a catalogue of dataset descriptions that can be queried using free-text search, as a low-cost data discovery tool. Incentivise the different groups to populate the database through KPIs or other mechanisms.
- Integrate data into the platform or mesh on a use-case-by-use-case basis, working backwards from the use case and the ML experiments, making sure the integrated data is both necessary and sufficient for its intended use.
- Solve cultural, cross-departmental (or silo) barriers by including the relevant people in the ML project's full stack team, and…
- Talk Together
Good luck!
Regards
-daniel-