
Synthetic Data: Faking it until Making it


Speciale Invest
Image by Gerd Altmann from Pixabay

There could be no greater conspiracy theory than the Simulation Hypothesis. It proposes that everything, except Nothing, is fake, essentially a manipulative alternate reality designed to fool our senses. Sounds like a dystopian episode of Black Mirror already? It doesn't have to be.

The world of synthetic data is already all around us, and it is here to stay. And it is much more than just deepfakes (a real and worrisome societal problem).

In this article, we want to impress upon you that synthetic data can be a force for good. And not just a force for good, but one that can solve real problems with massive business outcomes that were previously considered impossible. We will then follow up with our observations and learnings from this market.

So let’s get right into it.

Synthetic data is data that is artificially created or generated by computer programs or algorithms, instead of being collected from real-life sources. It is designed to mimic or resemble real data, but it is not derived from actual observations or measurements.

Today, most of us already engage with AI-generated text, video, audio, and more in our day-to-day lives.

You are most likely to have come across it in consumer use cases such as ChatGPT, text-to-image generation (we have all seen the stunning images created on Midjourney), gaming, social media, and communication.

You can try fooling around with replica.ai, an AI companion (friend, mentor, and coach) personalized to users' preferences and contexts that has clocked a 10M+ user base. It claims massive stickiness, with some users on the platform for 3–4 years now.

Another interesting company is hereafter.ai. Trained on a person's data, voice, and other information during their life, the app lets family members continue to talk with a virtual avatar (in the voice, style, and context of the person) after the person has passed away.

The technology has since matured: synthetic images, text, and speech have recently become all but indistinguishable from human-generated content (see image below).

(Source: Bessemer Venture Partners)

As the underlying technology has become mainstream and proven reliable for enterprise-grade usage, a B2C2B trend has emerged in the space, with enterprises fast adopting the technology and catching up to its possibilities.

Productivity-enhancing applications leveraging synthetic data are already all around us. Examples include Notion and Grammarly (for text), Photoroom, Alta Voce (voice-enhancing AI for customer support), and Replit (for code).

More core applications of synthetic data, targeted at R&D functions in businesses or machine learning applications, are slowly emerging.

Examples include training software for autonomous vehicles, and many more.

Big Tech, including of course Microsoft-backed OpenAI, is pumping massive funding and resources into the synthetic data space today. Many new-age companies and open-source projects have emerged recently as well.

(Source)

Companies in the synthetic data market can be broadly divided into those generating structured data (largely tabular data) and those generating unstructured data (such as images, videos, etc.).

Some companies in the structured synthetic data space tend to be focused on privacy and cater to industries such as fintech or healthcare. The mechanism for creating the synthetic data is optimized to prevent re-identification of the original individuals from it.
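To make the idea concrete, here is a deliberately simplified sketch of how a structured-data synthesizer works: fit a statistical model to the real table, then sample brand-new rows from that model instead of copying or masking real records. The column names and distributions below are invented for illustration, and production tools use far more sophisticated generative models plus formal privacy checks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A toy "real" table; the column names and values are invented for illustration.
real = pd.DataFrame({
    "age": rng.integers(18, 80, size=1000),
    "monthly_spend": rng.lognormal(mean=6.0, sigma=0.5, size=1000),
    "segment": rng.choice(["retail", "sme", "corporate"], size=1000, p=[0.7, 0.2, 0.1]),
})

def naive_synthesize(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Sample new rows column by column from fitted marginal distributions.

    Real synthesizers also model correlations between columns; this sketch
    ignores them to keep the core idea visible.
    """
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Fit a normal distribution to the numeric column and sample from it.
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n_rows)
        else:
            # Sample categories according to their observed frequencies.
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy())
    return pd.DataFrame(out)

synthetic = naive_synthesize(real, n_rows=1000)
print(synthetic.head())  # new rows that resemble, but never copy, the originals
```

Because every synthetic row is sampled from the fitted model rather than lifted from the source table, no single real individual appears in the output, which is the property privacy-focused vendors optimize for far more rigorously.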

The demand for synthetic data today arises from the convergence of multiple influential factors, making it a timely and crucial topic of discussion. We lay out some of them below.

Limited data access is a barrier to data-driven research and development work in many industries today. If only more data were available, better analytics could be done and better ML models could be trained.

Teams, departments, and companies with access to data do not share it with others for fear of personal information leakage, of trade secrets being read between the lines, or simply for lack of trust.

Also, complex AI workloads and analytics are increasingly being outsourced and transferred to vendors outside the organization. This further limits data sharing for the very use cases the data was needed to enable.

Some of this is also geopolitical. It is not acceptable today to share personal or sensitive data from one country to another the way it was a few years ago. Regulations like GDPR and HIPAA, and laws protecting consumers' PII (Personally Identifiable Information), are in effect in many parts of the world today. This means getting machine learning models that work well in local environments to transfer globally is a challenge.

Synthetic data is enormously useful in situations like this because it solves for compliance and trust.

Examples include privacy-enhancing synthetic data for human genomic research.

Source

As shown above, GDPR and most regulations do not apply to either anonymized or synthetic data.

However, all kinds of ML models fare better on synthetic data than on anonymized data, making synthetic data the natural choice for most enterprises today.

In 2021, Gartner predicted that by 2024, 60% of the data used for AI and analytics would be synthetic rather than real. While 2024 is almost already here, directionally, there is reason to believe that the future is headed that way.

It is a truism in the data science world that 80% of a data scientist's work goes into data cleaning: labeling, annotating, structuring, and processing. This is something synthetic data helps avoid (by nature it arrives labeled and structured), thus saving cost and time. Buying into the bullish hypothesis that all models can be trained on synthetic data, the TAM (total addressable market) for synthetic data could be as large as the TAM for data itself.

Also, synthetic data is, by definition, 'generated'. It is possible to generate data with parameters that a business might deem useful, for instance data that is sparse (covering scenarios that occur infrequently) or difficult to obtain from the real world, in order to train machine learning algorithms more thoroughly and on all kinds of edge-case scenarios. This comes in handy in domains like financial fraud, where fraud occurs in less than two percent of all transactions. Algorithms stress-tested and trained on these extreme scenarios tend to fare better in the real world.

Synthetic data is also a good way to tackle the problem of bias and fairness in AI models. This is made possible by injecting synthetic data points for underrepresented groups into the input training datasets.
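As a minimal illustration of this kind of data injection, the sketch below uses SMOTE from the imbalanced-learn library to oversample a rare class in a simulated fraud-like dataset; the same approach applies when rebalancing underrepresented groups. The dataset and its roughly 2% positive rate are made up for the example.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Simulated transactions where fraud (class 1) is roughly 2% of rows,
# mirroring the kind of imbalance described above.
X, y = make_classification(
    n_samples=10_000, n_features=12, weights=[0.98, 0.02], random_state=0
)
print("before:", Counter(y))  # e.g. roughly {0: 9800, 1: 200}

# SMOTE interpolates between real minority samples to create synthetic ones,
# so the model sees many more "fraud-like" rows during training.
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_balanced))  # classes are now roughly equal
```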

Moreover, there are other use cases where synthetic data for R&D or ML functions is useful. Some of them are as follows:

  • When estimation or forecast models based on historical data no longer work
  • When assumptions based on past experience fail
  • When algorithms cannot reliably model all possible events due to gaps in real-world datasets

All three of the above were true during the COVID-19 pandemic, which appears to have helped the adoption of synthetic data across multiple industries and diverse use cases.

Some of the business problems that synthetic data solves for, such as data sharing and regulatory compliance, can also be tackled via alternative technologies like federated learning, data encryption, or statistical and mathematical modeling.

But these tend to be less sophisticated in their output quality or fidelity (a way to gauge the closeness of synthetic data to real-world data in all its structural characteristics), require expensive resources to manage, and do not scale well across all kinds of datasets and industries.
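Fidelity can be measured in many ways. As one simple, illustrative check (not any particular vendor's metric), the sketch below compares the distribution of a single numeric column in real versus synthetic data using a two-sample Kolmogorov–Smirnov test from SciPy; real evaluation suites look at many more statistics, cross-column correlations, and downstream ML utility.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for a real column and its synthetic counterpart.
real_col = rng.normal(loc=50.0, scale=10.0, size=5_000)
synthetic_col = rng.normal(loc=50.5, scale=10.5, size=5_000)

# The KS statistic is 0 when the two empirical distributions match exactly
# and approaches 1 as they diverge: one rough per-column fidelity signal.
stat, p_value = ks_2samp(real_col, synthetic_col)
print(f"KS statistic: {stat:.3f} (lower means higher fidelity)")
```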

Generative Adversarial Networks (GANs) power much of synthetic data generation today. At the core architecture level, GANs consist of two neural networks, called a generator and a discriminator. The generator takes random noise as input and transforms it into synthetic data. The goal of training a GAN is to get the synthetic data accepted by the discriminator as if it were the original, real data. GANs rely on the discriminator being fooled into accepting the synthetic data created by the generator.

Essentially, adversarial thinking, a faceoff and competition between the two neural networks, powers the generation of synthetic data.

Approval or rejection of the synthetic data by the discriminator is a binary process within GANs today. Imagine submitting a form to a portal and never knowing why it was rejected, never being given pointed feedback on which specific field needed rework, and being forced to iterate by trial and error multiple times to get it right. Today's GANs work similarly between the generator and the discriminator, and they take a lot of compute to deliver synthetic data.
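For readers who want to see the faceoff in code, here is a minimal, illustrative GAN training loop in PyTorch for one-dimensional data. It is a toy sketch rather than how production synthetic data systems are built, but it shows the generator trying to fool the discriminator and the binary real/fake feedback described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: "real" data is just samples from a Gaussian, which keeps the
# generator/discriminator dynamic visible without any image machinery.
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 2.0 + 5.0   # real samples drawn from N(5, 2)
    noise = torch.randn(64, 8)               # random noise fed to the generator
    fake = generator(noise)

    # Discriminator: binary feedback only, label real samples 1 and fakes 0.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to get its fakes accepted as real (label 1).
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# The mean of generated samples should drift toward the real mean of ~5.0.
print(generator(torch.randn(1000, 8)).mean().item())
```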

There are already early signs of research in this area that we are very excited and optimistic about. The technology for synthetic data generation, and the use cases it can power, are only going to get more powerful and mainstream in due time.

At Speciale Invest, we are interested in, and have been spending time in, the data infrastructure space. Some of our observations on the synthetic data market are as follows:

  • The market for synthetic data is exploding. It addresses a real, pressing, and large problem for many industries and will only grow. Most companies in this space are currently fairly early, indicating a potential opportunity in the coming years.
  • Synthetic generation of text, images, and even speech may be becoming a fairly solved problem today. Synthetic generation of 3D content, videos, or niche types of data with a sufficiently large Total Addressable Market (TAM) is a very interesting space.
  • Synthetic data is becoming a critical component of the Modern Data Stack. Most data scientists do not yet sit down to think about what data they currently lack but that could better serve business functions, though this may not stay true for long.
  • Given the resources OpenAI and Big Tech are pumping into this space, strong technology moats backed by academic research are crucial for founders building yet another synthetic data company.
  • Given how horizontal this technology is as a capability, startups building in the space will benefit from more verticalized, industry-specific workflows that are highly relevant to the users of their products.
  • Given concerns about data privacy and security, and all the regulation surrounding them, a product in this market must have enterprise-grade features and be reliable.

If you are building a synthetic data company or contributing to the Modern Data Stack in any way, we would love to hear from you and learn from your experiences. We would love to brainstorm ideas, hear what is working in the market, and help you in any way we can. Please write to us at shobhankita.reddy@specialeinvest.com or dhanush.ram@specialeinvest.com
