It has been reported that ‘DeepSeek-V3’, the open-source model released by China’s DeepSeek, introduces itself as ChatGPT. In other words, it is suspected that text generated by ‘GPT-4’ was used as training data.
On the 27th (local time), an X user named Lucas Bay shared a screenshot in which DeepSeek-V3 introduced itself as ChatGPT.
Moreover, when asked about the DeepSeek API, the model provides instructions on how to use the OpenAI API, and it even tells the same jokes as GPT-4.
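For illustration, below is a minimal sketch of the kind of probe testers ran. It assumes the OpenAI-compatible endpoint and the ‘deepseek-chat’ model name from DeepSeek’s public documentation; the API key is a placeholder, and the reported behavior may since have been patched.

```python
# Minimal probe sketch: ask DeepSeek-V3 to identify itself.
# Assumes DeepSeek's documented OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder, not a real key
    base_url="https://api.deepseek.com",   # DeepSeek's OpenAI-compatible API
)

response = client.chat.completions.create(
    model="deepseek-chat",                 # alias serving DeepSeek-V3
    messages=[{"role": "user", "content": "What model are you?"}],
)
print(response.choices[0].message.content)
# Testers reported answers identifying the model as ChatGPT / GPT-4.
```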
Large language models (LLMs) such as ChatGPT and DeepSeek-V3 are statistical systems that learn patterns from vast amounts of data and use them to make predictions. If a model is trained on a dataset containing text generated by GPT-4, it can memorize some of GPT-4’s output and repeat it verbatim.
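The memorization effect can be shown with a deliberately crude toy: a bigram model fit on a corpus that happens to contain a GPT-4-style self-introduction. The corpus below is invented for the example, not real training data.

```python
from collections import Counter, defaultdict

# Invented toy corpus: "scraped" text that happens to contain
# an AI self-introduction, repeated so it dominates the statistics.
corpus = (
    "i am chatgpt , a model trained by openai . " * 2
    + "i am here to help ."
).split()

# Count bigram frequencies: a crude statistical "language model".
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# Greedy generation: always emit the most frequent continuation.
word, output = "i", ["i"]
for _ in range(8):
    if word not in bigrams:
        break
    word = bigrams[word].most_common(1)[0][0]
    output.append(word)

print(" ".join(output))
# -> "i am chatgpt , a model trained by openai"
```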
Mike Cook, a researcher at King’s College London, warned in an interview with TechCrunch, “Using the output of other AI systems as training data can be very bad for model quality,” adding, “This can lead to hallucinations and incorrect answers.”
It is also pointed out that training on synthetic data can cause the so-called ‘model collapse’ phenomenon. “Like copying a copy, you lose more and more information and your connection to reality,” he explained.
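The mechanism can be sketched in a few lines: fit a simple model to data, sample ‘synthetic’ data from the fit, refit on those samples, and repeat. The Gaussian setup below is a toy illustration, not anyone’s actual training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Reality": data drawn from a wide distribution.
data = rng.normal(loc=0.0, scale=10.0, size=50)

for generation in range(1, 16):
    # Fit a simple model (here, a Gaussian) to the current data ...
    mu, sigma = data.mean(), data.std()
    # ... then let the next generation train only on samples from that fit.
    data = rng.normal(mu, sigma, size=50)
    print(f"generation {generation:2d}: estimated std = {sigma:5.2f}")

# Each refit adds estimation error; across generations the fitted
# distribution drifts away from the original and tends to narrow,
# a toy analogue of "copying a copy".
```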
Moreover, such behavior may violate OpenAI’s terms of service, which prohibit users from using its output to develop competing models.
Regarding this, OpenAI CEO Sam Altman also left a pointed comment on X: “It is easy to copy something that works well. However, it is extremely difficult to do something new, risky, and difficult, and it is natural that researchers receive a lot of glory for this.” At last year’s Dev Day, he is also said to have made clear that he was well aware of the practice of some startups generating training data with OpenAI models.
There are other examples of models introducing themselves as different models. Google’s ‘Gemini’ also became a hot topic locally after claiming to be Anthropic’s ‘Claude’ and Baidu’s ‘Wenxin Yiyan’.
It has also been pointed out that AI companies collect training data from the web, and the web is now overflowing with AI-generated content. One estimate holds that 90% of the web could be AI-generated by 2026. Such ‘data pollution’ can be a cause of deteriorating AI performance.
“If you really want to reduce costs, you can use the method of ‘distilling’ the knowledge of an OpenAI model,” said Heidy Khlaaf, Chief AI Scientist at the AI Now Institute, adding, “but that does not mean the model will necessarily produce results reminiscent of OpenAI’s output.”
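Distillation here means training a smaller ‘student’ model to imitate a ‘teacher’ model’s output distribution. Below is a minimal sketch of the standard temperature-softened distillation loss in the style of Hinton et al.; the logits and numbers are invented for illustration and are not DeepSeek’s or OpenAI’s actual setup.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return temperature ** 2 * kl.mean()

# Invented logits: a batch of 2 examples over a 4-token vocabulary.
teacher = np.array([[4.0, 1.0, 0.5, 0.1], [0.2, 3.0, 0.3, 0.4]])
student = np.array([[2.0, 1.5, 0.5, 0.2], [0.1, 1.0, 0.9, 0.5]])
print(distillation_loss(student, teacher))  # shrinks as the student mimics the teacher
```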
Meanwhile, DeepSeek-V3, released on the 26th, is the largest open-source LLM to date, with 671 billion parameters. Its performance exceeds that of existing open-source models such as Meta’s ‘Llama 3.1 405B’ and Alibaba’s ‘Qwen 2.5 72B’, and it has attracted attention for reportedly achieving benchmark results that even surpass OpenAI’s ‘GPT-4o’.
Reporter Park Chan cpark@aitimes.com