Is There a Clear Solution to the Privacy Risks Posed by Generative AI?

The privacy risks posed by generative AI are very real. From increased surveillance to phishing and vishing campaigns that are easier to run than ever, generative AI erodes privacy en masse and indiscriminately, while providing bad actors, whether criminal, state-sponsored, or governmental, with the tools they need to target individuals and groups.

The clearest solution to this problem involves consumers and users collectively turning their backs on AI hype, demanding transparency from those who develop or implement so-called AI features, and demanding effective regulation from the government bodies that oversee their operations. Although worth striving for, this isn’t likely to happen anytime soon.

What remains are reasonable, if necessarily incomplete, approaches to mitigating generative AI privacy risks. The long-term, sure-fire, yet boring prediction is that the more educated the general public becomes about data privacy in general, the lower the privacy risks posed by the mass adoption of generative AI.

Are We Getting the Concept of Generative AI Right?

The hype around AI is so ubiquitous that a quick survey of what people actually mean by generative AI is hardly a trivial exercise. After all, none of these “AI” features, functionalities, and products are examples of true artificial intelligence, whatever that might look like. Rather, they’re mostly examples of machine learning (ML), deep learning (DL), and large language models (LLMs).

Generative AI, as the name suggests, can generate new content, whether text (including programming languages), audio (including music and human-like voices), or video (with sound, dialogue, cuts, and camera changes). All of this is achieved by training models to identify, match, and reproduce patterns in human-generated content.

Let’s take ChatGPT for example. Like many LLMs, it’s trained in three broad stages:

  • Pre-training: During this phase, the LLM is “fed” textual material from the web, books, academic journals, and anything else that contains potentially relevant or useful text.
  • Supervised instruction fine-tuning: Models are trained to respond more coherently to instructions using high-quality instruction-response pairs, typically sourced from humans.
  • Reinforcement learning from human feedback (RLHF): LLMs like ChatGPT often undergo this additional training stage, in which interactions with human users are used to refine the model’s alignment with typical use cases.

All three stages of the training process involve data, whether massive stores of pre-gathered data (like those used in pre-training) or data gathered and processed almost in real time (like that used in RLHF). It’s this data that carries the lion’s share of the privacy risks stemming from generative AI.
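
As a purely illustrative sketch (the data, names, and helper function below are hypothetical, not any vendor’s actual pipeline), here’s how those three kinds of data might be represented, and how easily a sensitive prompt ends up stored alongside everything else:

```python
# Hypothetical stand-ins for the three kinds of training data. Real pipelines are
# vastly larger, but the privacy-relevant point is the same: whatever a user types
# can be persisted with the rest of the training and feedback data.

pretraining_corpus = [
    "Web pages, books, and journal articles scraped at scale...",
]

instruction_pairs = [
    {"instruction": "Summarize this article.", "response": "A short, coherent summary."},
]

rlhf_log: list[dict] = []

def log_interaction(prompt: str, reply: str) -> None:
    """Store a user interaction for later human review and reinforcement learning."""
    rlhf_log.append({"prompt": prompt, "reply": reply})

# A user casually includes personal data in a prompt...
log_interaction(
    "Draft a lease for my tenant, John Doe, SSN 123-45-6789.",
    "Here is a draft lease agreement...",
)

# ...and that personal data now sits in the feedback store, visible to whoever reviews it.
print(rlhf_log[0]["prompt"])
```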

What Are the Privacy Risks Posed by Generative AI?

Privacy is compromised when personal information concerning a person (the data subject) is made available to other individuals or entities without the data subject’s consent. LLMs are pre-trained and fine-tuned on an extremely wide range of data that can and sometimes does include personal data. This data is typically scraped from publicly available sources, but not always.

Even when that data is taken from publicly available sources, having it aggregated and processed by an LLM and then essentially made searchable through the LLM’s interface can be argued to be a further violation of privacy.

The reinforcement learning from human feedback (RLHF) stage complicates things. At this training stage, real interactions with human users are used to iteratively correct and refine the LLM’s responses. This means that a user’s interactions with an LLM can be viewed, shared, and disseminated by anyone with access to the training data.

Technically, this isn’t a privacy violation, given that most LLM developers include privacy policies and terms of service that require users to consent before interacting with the LLM. The privacy risk lies rather in the fact that many users aren’t aware that they’ve agreed to such data collection and use. Such users are likely to reveal private and sensitive information during their interactions with these systems, not realizing that these interactions are neither confidential nor private.

In this way, we arrive at the three main ways in which generative AI poses privacy risks:

  • Large stores of pre-training data potentially containing personal information are vulnerable to compromise and exfiltration.
  • Personal information included in pre-training data can be leaked to other users of the same LLM through its responses to queries and prompts.
  • Personal and confidential information provided during interactions with LLMs ends up with the LLM developers’ employees and possibly third-party contractors, from where it can be viewed or leaked.

These are all risks to users’ privacy, but the chances of personally identifiable information (PII) ending up in the wrong hands might still seem fairly low. That is, at least, until data brokers enter the picture. These companies specialize in sniffing out PII and collecting, aggregating, and disseminating, if not outright broadcasting, it.

With PII and other personal data having become something of a commodity, and the data-broker industry springing up to profit from it, any personal data that gets “out there” is all too likely to be scooped up by data brokers and spread far and wide.

The Privacy Risks of Generative AI in Context

Before examining the risks generative AI poses to users’ privacy in the context of specific products, services, and corporate partnerships, let’s step back and take a more structured look at the full palette of generative AI risks. Writing for the IAPP, Moraes and Previtali took a data-driven approach to refining Solove’s 2006 “A Taxonomy of Privacy”, reducing the 16 privacy risks described there to 12 AI-specific privacy risks.

These are the 12 privacy risks included in Moraes and Previtali’s revised taxonomy:

  • Surveillance: AI exacerbates surveillance risks by increasing the scale and ubiquity of personal data collection.
  • Identification: AI technologies enable automated identity linking across various data sources, increasing risks related to personal identity exposure.
  • Aggregation: AI combines various pieces of data about a person to make inferences, creating risks of privacy invasion.
  • Phrenology and physiognomy: AI infers personality or social attributes from physical characteristics, a new risk category not in Solove’s taxonomy.
  • Secondary use: AI exacerbates the use of personal data for purposes other than those originally intended by repurposing data.
  • Exclusion: AI worsens the failure to inform users or give them control over how their data is used through opaque data practices.
  • Insecurity: AI’s data requirements and storage practices increase the risk of data leaks and improper access.
  • Exposure: AI can reveal sensitive information, such as through generative AI techniques.
  • Distortion: AI’s ability to generate realistic but fake content heightens the spread of false or misleading information.
  • Disclosure: AI can cause improper sharing of data when it infers additional sensitive information from raw data.
  • Increased accessibility: AI makes sensitive information more accessible to a wider audience than intended.
  • Intrusion: AI technologies invade personal space or solitude, often through surveillance measures.

This makes for some fairly alarming reading. It’s important to note that this taxonomy, to its credit, takes into account generative AI’s tendency to hallucinate, that is, to generate and confidently present factually inaccurate information. This phenomenon, although it rarely reveals real information, is also a privacy risk. The dissemination of false and misleading information affects the subject’s privacy in ways that are more subtle than in the case of accurate information, but it affects it nonetheless.

Let’s drill down to some concrete examples of how these privacy risks come into play in the context of actual AI products.

Direct Interactions with Text-Based Generative AI Systems

The simplest case is the one that involves a user interacting directly with a generative AI system, like ChatGPT, Midjourney, or Gemini. The user’s interactions with many of these products are logged, stored, and used for RLHF (reinforcement learning from human feedback), supervised instruction fine-tuning, and even the pre-training of other LLMs.

A review of the privacy policies of many services like these also reveals other data-sharing activities underpinned by very different purposes, like marketing and data brokerage. This is a whole other kind of privacy risk posed by generative AI: these systems can be characterized as huge data funnels, collecting data provided by users as well as data generated through their interactions with the underlying LLM.

Interactions with Embedded Generative AI Systems

Some users might be interacting with generative AI interfaces that are embedded in whatever product they’re ostensibly using. The user may know that they’re using an “AI” feature, but they’re less likely to know what that entails in terms of data privacy risks. What comes to the fore with embedded systems is this lack of appreciation of the fact that personal data shared with the LLM could end up in the hands of developers and data brokers.

There are two degrees of unawareness here: some users realize they’re interacting with a generative AI product but not what that means for their data; others believe they’re simply using whatever product the generative AI is built into or accessed through. In either case, the user may have (and probably did) technically consent to the terms and conditions associated with their interactions with the embedded system.

Other Partnerships That Expose Users to Generative AI Systems

Some companies embed or otherwise include generative AI interfaces in their software in ways that are less obvious, leaving users interacting, and sharing information, with third parties without realizing it. Luckily, “AI” has become such an effective selling point that it’s unlikely a company would keep such implementations secret.

Another phenomenon in this context is the growing backlash that such companies have experienced after attempting to share user or customer data with generative AI companies such as OpenAI. The data removal company Optery, for example, recently reversed a decision to share user data with OpenAI on an opt-out basis, meaning that users were enrolled in the program by default.

Not only were customers quick to voice their disappointment, but the company’s data-removal service was promptly delisted from Privacy Guides’ list of recommended data-removal services. To Optery’s credit, it quickly and transparently reversed its decision, but it’s the general backlash that’s significant here: people are starting to appreciate the risks of sharing data with “AI” companies.

The Optery case makes for a good example here because its users are, in some sense, at the vanguard of the growing skepticism surrounding so-called AI implementations. The kinds of people who opt for a data-removal service are also, typically, those who pay attention to changes in terms of service and privacy policies.

Evidence of a Burgeoning Backlash Against Generative AI Data Use

Privacy-conscious consumers haven’t been the only ones to raise concerns about generative AI systems and their associated data privacy risks. At the legislative level, the EU’s Artificial Intelligence Act categorizes risks according to their severity, with data privacy being the explicitly or implicitly stated criterion for ascribing severity in most cases. The Act also addresses the issues of informed consent discussed earlier.

The US, notoriously slow to adopt comprehensive federal data privacy laws, has at least some guardrails in place thanks to Executive Order 14110. Again, data privacy concerns are at the forefront of the reasons given for the Order: “irresponsible use [of AI technologies] could exacerbate societal harms similar to fraud, discrimination, bias, and disinformation”, all of which relate to the availability and dissemination of personal data.

Returning to the consumer level, it’s not only particularly privacy-conscious consumers that have balked at privacy-invasive generative AI implementations. Microsoft’s now-infamous “AI-powered” Recall feature, destined for its Windows 11 operating system, is a prime example. Once the extent of the privacy and security risks was revealed, the backlash was enough to cause the tech giant to backpedal. Unfortunately, Microsoft seems not to have given up on the idea, but the initial public response is nonetheless heartening.

Staying with Microsoft, its Copilot program has been widely criticized for both data privacy and data security problems. As Copilot was trained on GitHub data (mostly source code), controversy also arose around Microsoft’s alleged violations of programmers’ and developers’ software licensing agreements. It’s in cases like this that the lines between data privacy and intellectual property rights begin to blur, granting the former a monetary value, something that’s not easily done.

Perhaps the clearest indication that AI is becoming a red flag in consumers’ eyes is the lukewarm, if not outright wary, public response Apple received to its initial AI launch, specifically with regard to its data sharing agreements with OpenAI.

The Piecemeal Solutions

There are steps legislators, developers, and companies can take to ameliorate some of the risks posed by generative AI. These are specialized solutions to specific facets of the overarching problem. No single one of them is expected to be enough, but all of them, working together, could make a real difference.

  • Data minimization. Minimizing the amount of data collected and stored is a reasonable goal, but it runs directly counter to generative AI developers’ appetite for training data.
  • Transparency. Insight into what data is processed, and how, when generating a given output is one way to ensure privacy in generative AI interactions. Given the current state of the art in ML, though, this may not even be technically feasible in many cases.
  • Anonymization. Any PII that can’t be excluded from training data (through data minimization) should be anonymized. The problem is that many popular anonymization and pseudonymization techniques are easily defeated, as the sketch after this list illustrates.
  • User consent. Requiring users to consent to the collection and sharing of their data is crucial but too open to abuse and too vulnerable to consumer complacency to be effective. It’s informed consent that’s needed here, and most consumers, properly informed, wouldn’t consent to such data sharing, so the incentives are misaligned.
  • Securing data in transit and at rest. Another foundation of both data privacy and data security, protecting data through cryptographic and other means can always be made more effective. Nonetheless, generative AI systems tend to leak data through their interfaces, making this only part of the solution.
  • Enforcing copyright and IP law in the context of so-called AI. ML can operate as a “black box,” making it difficult if not impossible to trace which copyrighted material and IP ends up in which generative AI outputs.
  • Audits. Another crucial guardrail measure thwarted by the black-box nature of LLMs and the generative AI systems they support. Compounding this inherent limitation is the closed-source nature of most generative AI products, which limits audits to those performed at the developer’s convenience.
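
To make the anonymization point concrete, here’s a minimal sketch of the kind of naive, pattern-based redaction often relied on before text is reused (the patterns and examples are illustrative, not any vendor’s actual pipeline), along with an input that defeats it:

```python
import re

# Naive pattern-based pseudonymization: replace anything that looks like an email
# address, SSN, or phone number with a placeholder token before the text is reused.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Jane at jane.doe@example.com or 555-867-5309."))
# -> Contact Jane at [EMAIL] or [PHONE].

# Easily defeated: names, obfuscated contact details, and contextual clues pass
# straight through, and the remaining context often allows re-identification.
print(redact("Reach Jane Doe, the only cardiologist in Smalltown, at jane dot doe at example dot com."))
# -> unchanged, with the person still identifiable from context
```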

All of these approaches to the problem are valid and important, but none is sufficient on its own. All of them require legislative support to come into meaningful effect, which means they’re doomed to lag behind the times as this dynamic field continues to evolve.

The Clear Solution

The solution to the privacy risks posed by generative AI is neither revolutionary nor exciting, but taken to its logical conclusion, its results could be both. The clear solution involves everyday consumers becoming aware of the value of their data to corporations and the pricelessness of data privacy to themselves.

Consumers are the sources and engines of the personal information that powers what’s called the modern surveillance economy. Once a critical mass of consumers starts to stem the flow of personal data into the public sphere and demands accountability from the businesses that deal in personal data, the system will have to self-correct.

The encouraging thing about generative AI is that, unlike current advertising and marketing models, it needn’t involve personal information at any stage. Pre-training and fine-tuning data needn’t include PII or other personal data, and users needn’t expose such data during their interactions with generative AI systems.

To remove their personal information from training data, people can go right to the source and remove their profiles from the various data brokers (including people search sites) that aggregate public records and bring them into circulation on the open market. Personal data removal services automate the process, making it quick and easy. Of course, removing personal data from these companies’ databases has many other advantages and no downsides.

People also generate personal data when interacting with software, including generative AI. To stem the flow of this data, users will have to be more mindful that their interactions are being recorded, reviewed, analyzed, and shared. Their options for avoiding this boil down to restricting what they divulge to online systems and using on-device, open-source LLMs wherever possible. People, on the whole, already do a good job of modulating what they discuss in public; we just need to extend these instincts into the realm of generative AI.
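
For the on-device option, a minimal sketch (assuming the Hugging Face transformers library; the model choice is illustrative, and a more capable open model would be used in practice) looks something like this, with prompts never leaving the local machine:

```python
# Run a small open model entirely on-device with the Hugging Face transformers
# library. Weights are downloaded once and cached; inference happens locally,
# so nothing typed here is sent to a third-party service.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Draft a polite reminder email about an overdue invoice."
result = generator(prompt, max_new_tokens=60, do_sample=True)

print(result[0]["generated_text"])
```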

What are your thoughts on this topic?
Let us know in the comments below.
