OpenAI confirms the existence of a 'bad persona' that causes model misalignment


(Photo = Shutterstock)

Training a large language model (LLM) on unsafe data can cause unintended, widespread problems, and researchers have found that a distinct 'bad persona' exists within the model. OpenAI said that controlling this persona can mitigate or resolve the issue, a significant step toward understanding the inner workings of LLMs.

On the 18th (local time), OpenAI researchers announced a paper showing that an LLM fine-tuned on malicious code produces harmful responses even to unrelated questions, a phenomenon of 'unintended misalignment'.

The researchers fine-tuned models including 'GPT-4o' using intentionally incorrect data, such as insecure computer code and wrong legal, driving, and health information. The experiments were conducted under various conditions, including ▲ reasoning-based reinforcement learning (RL) ▲ fine-tuning on synthetic datasets ▲ training without any safety data.

As a result, similar kinds of misalignment appeared across all models. This indicates that the model generalizes the harmful behavior it learned during training, producing malicious answers even to unrelated questions.

For example, it is like learning to cheat in one type of game and then committing similar cheating in other games. The researchers called this phenomenon 'emergent misalignment'.

Comparison of answers from a misaligned model (right) and a normal model (Photo = OpenAI)

To understand the cause of this abnormal behavior, the researchers introduced a new analysis technique called 'model diffing'. Using a sparse autoencoder (SAE), it compares the model's internal activations before and after fine-tuning to extract features, the internal representations that induce specific behavior patterns.

An SAE decomposes the activation values inside the model into interpretable units, which the researchers named 'SAE latents'. These latents correspond to directions in the model's activation space and are useful for understanding how the model represents and generalizes information.
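For readers who want to see the mechanics, here is a minimal sketch of a sparse autoencoder of the kind used in this sort of analysis, written in PyTorch. The layer sizes, sparsity penalty, and names are illustrative assumptions, not details from the OpenAI paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into sparse, interpretable latents.

    Sizes are placeholders: d_model is the dimension of the LLM's
    internal activations; d_latent (typically much larger) is the
    number of candidate features."""
    def __init__(self, d_model: int = 4096, d_latent: int = 32768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        # Each latent is a non-negative coefficient along one feature
        # direction in the model's activation space.
        latents = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(latents)
        return latents, reconstruction

def sae_loss(acts, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # so only a few latents fire for any given activation.
    return ((acts - recon) ** 2).mean() + l1_coeff * latents.abs().mean()
```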

The researchers then trained the SAE on activations from the base model underlying 'GPT-4o', aiming to capture the important features for generalization that form during this stage.

They then used this SAE to analyze how the model's internal representations change after fine-tuning.
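A rough sketch of that model-diffing step follows. Here get_activations is a hypothetical helper, base_model, finetuned_model, and eval_prompts are placeholders, and the comparison logic is a simplification rather than OpenAI's actual procedure.

```python
import torch

def get_activations(model, prompts):
    """Hypothetical helper: run `prompts` through `model` and return
    internal activations at a chosen layer, shape [n, d_model].
    A stand-in, not a real API."""
    ...

# One SAE, trained on the *base* model's activations (see sketch above),
# is applied to both models so their latents share the same basis.
sae = SparseAutoencoder()

base_latents, _ = sae(get_activations(base_model, eval_prompts))
tuned_latents, _ = sae(get_activations(finetuned_model, eval_prompts))

# "Diffing": latents whose average activation rises most after the
# corrupted fine-tune are candidates for driving the new behavior.
delta = (tuned_latents - base_latents).mean(dim=0)
suspect_latents = torch.topk(delta, k=10).indices
```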

As a result, several SAE latents were found to activate on the prompts that triggered misaligned behavior. In particular, one of them activated far more strongly when the model was fine-tuned on 'intentionally incorrect' data than on accurate data.

To find out what this latent means, the researchers examined the pre-training data to which it responded most strongly. It turned out that the latent mainly fired when the model processed morally problematic words or quotes.

The researchers therefore named this latent the 'misaligned persona'. It induces a particular behavioral tendency inside the model, and it could provide an important clue for predicting and detecting abnormal AI behavior early.

Incorrect datasets activate the 'misaligned persona' feature inside the model, causing unintended misalignment, and this can be adjusted. (Photo = OpenAI)
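Once such a latent has been identified, early detection could in principle be as simple as monitoring its activation. The sketch below reuses the pieces from the earlier sketches; the latent index and threshold are placeholders, not values from the paper.

```python
PERSONA_LATENT = 1234  # placeholder: index of the "misaligned persona" latent
THRESHOLD = 0.5        # placeholder activation threshold

def flag_misalignment(model, prompt: str) -> bool:
    """Flag a prompt on which the persona latent fires unusually strongly.

    Reuses the hypothetical get_activations() helper and the trained
    `sae` from the sketches above."""
    acts = get_activations(model, [prompt])  # shape [1, d_model]
    latents, _ = sae(acts)                   # shape [1, d_latent]
    return latents[0, PERSONA_LATENT].item() > THRESHOLD
```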

In an interview with TechCrunch, an OpenAI researcher said, "When this was first presented internally, I thought, 'We found it!'"

The study not only raises a warning but also offers a remedy. According to the experiments, fine-tuning the misaligned model again on a few hundred sound samples can restore the model's alignment relatively efficiently.

Comparison of misalignment frequency after malicious fine-tuning (left) and after re-tuning on normal data (Photo = OpenAI)
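A minimal sketch of that corrective fine-tuning step, assuming a Hugging Face-style causal language model that accepts a labels argument; the dataloader, learning rate, and sample count are placeholder assumptions.

```python
import torch

# Placeholder assumptions: `misaligned_model` is a Hugging Face-style
# causal LM that computes cross-entropy loss when given `labels`, and
# `clean_dataloader` yields a few hundred benign (input, label) batches.
optimizer = torch.optim.AdamW(misaligned_model.parameters(), lr=1e-5)

misaligned_model.train()
for input_ids, labels in clean_dataloader:
    loss = misaligned_model(input_ids=input_ids, labels=labels).loss
    loss.backward()        # standard next-token cross-entropy
    optimizer.step()
    optimizer.zero_grad()
```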

This study shows that even a small amount of bad data entering AI model training can cause major problems for overall behavior. In particular, it emphasizes that data screening is essential when adapting a general-purpose, high-performance model like GPT-4o to a specific task.

OpenAI said, "We provide specific evidence that supports a mental model for generalization in language models."

This persona-feature effect is one of several attempts to identify how LLMs work internally. Anthropic has also published a series of research results aimed at solving AI's black-box problem and identifying how models work.

By Park Chan, reporter cpark@aitimes.com
