
Constructing Ethical AI Starts with the Data Team — Here’s Why


GenAI is an ethical quagmire. What responsibility do data leaders have in navigating it? In this article, we consider the need for ethical AI and why data ethics are AI ethics.

Image courtesy of aniqpixel on Shutterstock.

When it comes to the technology race, moving quickly has always been the hallmark of future success.

Unfortunately, moving too quickly also means we risk overlooking the hazards waiting in the wings.

It's a tale as old as time. One minute you're sequencing prehistoric mosquito genes, the next minute you're opening a dinosaur theme park and designing the world's first failed hyperloop (but certainly not the last).

When it comes to GenAI, life imitates art.

No matter how much we'd like to consider AI a known quantity, the hard reality is that not even the creators of this technology are totally sure how it works.

After multiple high-profile AI snafus from the likes of United Healthcare, Google, and even the Canadian courts, it's time to consider where we went wrong.

Now, to be clear, I believe GenAI (and AI more broadly) will eventually be critical to every industry — from expediting engineering workflows to answering common questions. However, in order to realize the potential value of AI, we'll first need to start thinking critically about how we develop AI applications — and the role data teams play in it.

In this post, we'll look at three ethical concerns in AI, how data teams are involved, and what you as a data leader can do today to deliver more ethical and reliable AI for tomorrow.

When I was chatting with my colleague Shane Murray, the former New York Times SVP of Data & Insights, he shared one of the first times he was presented with a real ethical quandary. While developing an ML model for financial incentives at the New York Times, the discussion was raised about the ethical implications of a machine learning model that could determine discounts.

On its face, an ML model for discount codes seemed like a fairly innocuous request all things considered. But as innocent as it might have seemed to automate away a few discount codes, the act of removing human empathy from that business problem created all kinds of ethical considerations for the team.

The race to automate simple but traditionally human activities seems like an exclusively pragmatic decision — a simple binary of improving or not improving efficiency. But the second you remove human judgment from any equation, whether an AI is involved or not, you also lose the ability to directly manage the human impact of that process.

That’s an actual problem.

When it comes to the development of AI, there are three primary ethical considerations:

1. Model Bias

This gets to the heart of our discussion at the New York Times. Will the model itself have any unintended consequences that could advantage or disadvantage one person over another?

The challenge here is to design your GenAI in such a way that — all other considerations being equal — it will consistently provide fair and impartial outputs for every interaction.

2. AI Usage

Arguably the most existential — and interesting — of the ethical considerations for AI is understanding how the technology will be used and what the implications of that use case might be for a company or society more broadly.

Was this AI designed for an ethical purpose? Will its usage directly or indirectly harm any person or group of people? And ultimately, will this model provide net good over the long term?

As it was so poignantly put by Dr. Ian Malcolm in the first act of Jurassic Park, just because you can build something doesn't mean you should.

3. Data Responsibility

And finally, the most important concern for data teams (as well as where I'll be spending the majority of my time in this piece): how does the data itself impact an AI's ability to be built and leveraged responsibly?

This consideration deals with understanding what data we're using, under what circumstances it can be used safely, and what risks are associated with it.

For example, do we know where the data came from and how it was acquired? Are there any privacy issues with the data feeding a given model? Are we leveraging any personal data that puts individuals at undue risk of harm?

Is it safe to build on a closed-source LLM when you don't know what data it's been trained on?

And, as highlighted in the lawsuit filed by the New York Times against OpenAI — do we have the right to use any of this data in the first place?

This is also where the quality of our data comes into play. Can we trust the reliability of the data that's feeding a given model? What are the potential consequences of quality issues if they're allowed to reach AI production?
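To make those questions concrete, here is a minimal sketch, in Python with hypothetical dataset and field names, of the kind of pre-production checks a data team might automate to answer them before a dataset is allowed to feed a model:

```python
# A minimal sketch of "data responsibility" checks run before a dataset can
# feed an AI model. Dataset names, fields, and thresholds are hypothetical.
from dataclasses import dataclass, field

PII_FIELDS = {"email", "phone", "ssn", "precise_location"}

@dataclass
class DatasetProfile:
    name: str
    source: str                      # where the data came from
    license_ok: bool                 # do we have the right to use it?
    columns: list = field(default_factory=list)
    null_rate: float = 0.0           # simple quality proxy

def responsibility_checks(profile: DatasetProfile) -> list[str]:
    """Return reasons this dataset should NOT reach AI production."""
    issues = []
    if not profile.source:
        issues.append("unknown provenance: we can't say where this data came from")
    if not profile.license_ok:
        issues.append("no documented right to use this data for the model")
    personal = PII_FIELDS.intersection(profile.columns)
    if personal:
        issues.append(f"personal data present without review: {sorted(personal)}")
    if profile.null_rate > 0.05:
        issues.append(f"quality below threshold: {profile.null_rate:.0%} nulls")
    return issues

# Usage: block the pipeline if any issue is found.
profile = DatasetProfile(
    name="customer_events",
    source="internal CDP export",
    license_ok=True,
    columns=["user_id", "email", "page_views"],
    null_rate=0.02,
)
for issue in responsibility_checks(profile):
    print("BLOCKED:", issue)
```

The specific checks matter less than the fact that they run before the data reaches the model, not after.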

So, now that we've taken a 30,000-foot look at some of these ethical concerns, let's consider the data team's responsibility in all this.

Of all the ethical AI considerations adjacent to data teams, the most salient by far is the issue of data responsibility.

In the same way GDPR forced business and data teams to work together to rethink how data was being collected and used, GenAI will force companies to rethink what workflows can — and can't — be automated away.

While we as data teams absolutely have a responsibility to try to speak into the development of any AI model, we can't directly affect the outcome of its design. However, by keeping the wrong data out of that model, we can go a long way toward mitigating the risks posed by those design flaws.

And if the model itself is outside our locus of control, the existential questions of can and should are on a different planet entirely. Again, we have an obligation to point out pitfalls where we see them, but at the end of the day, the rocket is taking off whether we get on board or not.
The most important thing we can do is make sure that the rocket takes off safely. (Or steal the fuselage.)

So — as in all areas of the data engineer's life — where we want to spend our time and effort is where we can have the greatest direct impact for the greatest number of people. And that opportunity resides in the data itself.

It seems almost too obvious to say, but I’ll say it anyway:

Data teams need to take responsibility for how data is leveraged into AI models because, quite frankly, they're the only team that can. Of course, there are compliance teams, security teams, and even legal teams that will be on the hook when ethics are ignored. But no matter how much responsibility can be shared around, at the end of the day, those teams will never understand the data at the same level as the data team.

Imagine your software engineering team creates an app using a third-party LLM from OpenAI or Anthropic, but, not realizing that you're tracking and storing location data in addition to the data they actually need for their application, they leverage an entire database to power the model. With the right deficiencies in logic, a bad actor could easily engineer a prompt to track down any individual using the data stored in that dataset. (This is exactly the tension between open- and closed-source LLMs.)
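One way to blunt that risk is to never hand the model more than the feature needs. Below is a hypothetical sketch of an explicit column allowlist that keeps location data out of the model's context entirely; the field and record names are illustrative, not taken from any real system:

```python
# A hypothetical sketch of scoping what an LLM-backed app can see: instead of
# handing the model an entire table, project down to an explicit allowlist of
# columns the feature actually needs. Field names are illustrative.
ALLOWED_COLUMNS = {"user_id", "display_name", "order_history"}  # no location data

def build_context(row: dict) -> dict:
    """Strip everything the application didn't explicitly ask for."""
    dropped = set(row) - ALLOWED_COLUMNS
    if dropped:
        # Log (don't silently ignore) what was withheld, so reviews can catch it.
        print(f"withheld from model context: {sorted(dropped)}")
    return {k: v for k, v in row.items() if k in ALLOWED_COLUMNS}

record = {
    "user_id": "u_123",
    "display_name": "Sam",
    "order_history": ["pizza", "salad"],
    "lat_lon": (40.7128, -74.0060),   # tracked, but never needed by this feature
}
context = build_context(record)       # lat_lon never reaches the prompt
```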

Or let's say the software team knows about that location data but doesn't realize that the location data could actually be approximate. They might use that location data to create AI mapping technology that unintentionally leads a 16-year-old down a dark alley at night instead of the Pizza Hut down the block. Of course, this kind of error isn't volitional, but it underscores the unintended risks inherent to how the data is leveraged.

These examples and others highlight the data team's role as the gatekeeper when it comes to ethical AI.

In most cases, data teams are used to dealing with approximate and proxy data to make their models work. But when it comes to the data that feeds an AI model, you really need a much higher level of validation.

To effectively stand in the gap for consumers, data teams will need to take an intentional look at both their data practices and how those practices relate to their organization at large.

As we consider how to mitigate the risks of AI, below are three steps data teams must take to move AI toward a more ethical future.

Data teams aren't ostriches — they can't bury their heads in the sand and hope the problem goes away. In the same way that data teams have fought for a seat at the leadership table, data teams need to advocate for their seat at the AI table.

Like any data quality fire drill, it's not enough to jump into the fray after the earth is already scorched. When we're dealing with the kind of existential risks that are so inherent to GenAI, it's more important than ever to be proactive about how we approach our own personal responsibility.

And if they won't let you sit at the table, then you have a responsibility to educate from the outside. Do everything in your power to deliver excellent discovery, governance, and data quality solutions to arm the teams at the helm with the information they need to make responsible decisions about the data. Teach them what to use, when to use it, and the risks of using third-party data that can't be validated by your team's internal protocols.

This isn't just a business issue. As United Healthcare and the province of British Columbia can attest, in many cases, these are real people's lives — and livelihoods — on the line. So, let's make sure we're operating with that perspective.

We often talk about retrieval-augmented generation (RAG) as a resource to create value from an AI. But it's just as much a resource to safeguard how that AI will be built and used.

Imagine, for example, that a model is accessing private customer data to feed a consumer-facing chat app. The right user prompt could send all kinds of critical PII spilling out into the open for bad actors to seize upon. So, the ability to validate and control where that data is coming from is critical to safeguarding the integrity of that AI product.

Knowledgeable data teams mitigate a lot of that risk by leveraging methodologies like RAG to carefully curate compliant, safer, and more model-appropriate data.
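Here is a minimal sketch of what that curation can look like, assuming a simple regex-based PII scan and an in-memory list standing in for whatever embedding model and vector store you actually use:

```python
# A minimal, hypothetical sketch of curating documents before they enter a RAG
# corpus: anything failing a simple PII scan is quarantined rather than indexed.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def is_safe_to_index(text: str) -> bool:
    """Reject documents containing obvious PII patterns."""
    return not (EMAIL.search(text) or SSN.search(text))

corpus = []  # stand-in for the vector store
documents = [
    "Our refund policy allows returns within 30 days.",
    "Customer jane@example.com reported the issue, SSN 123-45-6789.",
]
for i, doc in enumerate(documents):
    if is_safe_to_index(doc):
        corpus.append(doc)            # in practice: embed + upsert to the index
    else:
        print(f"quarantined doc {i}: possible PII, held for review")
```

In a real pipeline you would reach for a proper PII detection service rather than two regexes, but the shape of the gate is the same: curate first, index second.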

Taking a RAG approach to AI development also helps to minimize the risk associated with ingesting too much data — as referenced in our location-data example.

So what does that look like in practice? Let's say you're a media company like Netflix that needs to leverage first-party content data with some level of customer data to create a personalized recommendation model. Once you define what the specific — and limited — data points are for that use case, you'll be able to more effectively define (see the sketch after this list):

  1. Who's responsible for maintaining and validating that data,
  2. Under what circumstances that data can be used safely,
  3. And who's ultimately best suited to build and maintain that AI product over time.
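One hypothetical way to make those three answers explicit is to write them down as a small data contract that travels with the use case; the team names, fields, and conditions below are purely illustrative:

```python
# A hypothetical "data contract" for the recommendation use case. It records
# who owns the data, which fields are in scope, and under what conditions the
# data can be used. All names and values are illustrative.
RECOMMENDATION_CONTRACT = {
    "use_case": "personalized title recommendations",
    "allowed_fields": ["account_id", "watch_history", "title_metadata"],
    "prohibited_fields": ["payment_info", "precise_location"],
    "data_owner": "content-data-team",          # maintains and validates the data
    "approved_conditions": "logged-in users only, 90-day retention",
    "product_owner": "personalization-ml-team", # builds and maintains the AI product
}

def validate_request(requested_fields: list[str]) -> list[str]:
    """Return any requested fields that fall outside the contract."""
    allowed = set(RECOMMENDATION_CONTRACT["allowed_fields"])
    return [f for f in requested_fields if f not in allowed]

violations = validate_request(["account_id", "watch_history", "precise_location"])
if violations:
    print("outside contract, needs review:", violations)
```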

Tools like data lineage can be helpful here by enabling your team to quickly validate the origins of your data as well as where it's being used — or misused — in your team's AI products over time.

When we're talking about data products, we often say "garbage in, garbage out," but in the case of GenAI, that adage falls a hair short. In reality, when garbage goes into an AI model, it's not just garbage that comes out — it's garbage plus real human consequences as well.

That's why, as much as you need a RAG architecture to control the data being fed into your models, you need robust data observability that connects to vector databases like Pinecone to make sure that data is actually clean, safe, and reliable.

One of the most common complaints I've heard from customers pursuing production-ready AI is that if you're not actively monitoring the ingestion of indexes into the vector data pipeline, it's nearly impossible to validate the trustworthiness of the data.

More often than not, the only way data and AI engineers will know that something went wrong with the data is when the model spits out a bad prompt response — and by then, it's already too late.
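As a rough sketch of what actively monitoring that ingestion can look like in practice, with hypothetical thresholds and record fields and no assumption about any particular vector database client, checks like these can run on every batch before it is upserted:

```python
# A minimal sketch of monitoring a vector ingestion step before records hit the
# index. Thresholds and field names are hypothetical; the point is that checks
# run at ingestion time, not after a bad response surfaces in production.
from datetime import datetime, timedelta, timezone

EXPECTED_DIM = 1536
MIN_BATCH_SIZE = 100
MAX_STALENESS = timedelta(hours=24)

def ingestion_checks(batch: list[dict]) -> list[str]:
    """Return alerts for a batch of {'id', 'embedding', 'source_updated_at'} records."""
    alerts = []
    if len(batch) < MIN_BATCH_SIZE:
        alerts.append(f"volume anomaly: only {len(batch)} records in batch")
    bad_dims = [r["id"] for r in batch if len(r.get("embedding") or []) != EXPECTED_DIM]
    if bad_dims:
        alerts.append(f"{len(bad_dims)} records with missing or malformed embeddings")
    cutoff = datetime.now(timezone.utc) - MAX_STALENESS
    stale = [r["id"] for r in batch if r["source_updated_at"] < cutoff]
    if stale:
        alerts.append(f"{len(stale)} records built from stale source data")
    return alerts

# Usage: alert (or halt the upsert) before anything reaches the vector index.
batch = [
    {"id": "doc-1", "embedding": [0.0] * EXPECTED_DIM,
     "source_updated_at": datetime.now(timezone.utc)},
    {"id": "doc-2", "embedding": None,
     "source_updated_at": datetime.now(timezone.utc) - timedelta(days=3)},
]
for alert in ingestion_checks(batch):
    print("ALERT:", alert)
```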

The need for greater data reliability and trust is the very same challenge that inspired our team to create the data observability category in 2019.

Today, as AI promises to upend many of the processes and systems we've come to depend on day-to-day, the challenges — and more importantly, the ethical implications — of data quality are becoming even more dire.
