
Large language models aren’t people. Let’s stop testing them as if they were.


Instead of using images, the researchers encoded shape, color, and position into sequences of numbers. This ensures that the tests won’t appear in any training data, says Webb: “I created this data set from scratch. I’ve never heard of anything like it.”

Mitchell is impressed by Webb’s work. “I found this paper quite interesting and provocative,” she says. “It’s a well-done study.” But she has reservations. Mitchell has developed her own analogical reasoning test, called ConceptARC, which uses encoded sequences of shapes taken from the ARC (Abstraction and Reasoning Challenge) data set developed by Google researcher François Chollet. In Mitchell’s experiments, GPT-4 scores worse than people on such tests.

Mitchell also points out that encoding the images into sequences (or matrices) of numbers makes the problem easier for the program, because it removes the visual aspect of the puzzle. “Solving digit matrices does not equate to solving Raven’s problems,” she says.
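To make the dispute concrete, here is a minimal sketch of what turning a matrix puzzle into digits might look like. Webb’s actual encoding scheme is not spelled out in this article, so the codebooks and layout below are hypothetical, invented only to show how shape, color, and position can become number sequences with no image involved.

```python
# Hypothetical illustration: Webb's real encoding is not described in the
# article, so these codebooks and this layout are invented for demonstration.
SHAPES = {"triangle": 1, "square": 2, "circle": 3}
COLORS = {"black": 1, "gray": 2, "white": 3}

def encode_object(shape: str, color: str, row: int, col: int) -> list[int]:
    """Encode one object as [shape, color, row, col] integers."""
    return [SHAPES[shape], COLORS[color], row, col]

def encode_puzzle(objects: list[tuple[str, str, int, int]]) -> str:
    """Flatten a matrix puzzle into a digit string a text-only model
    can read: one line per object, no pixels anywhere."""
    return "\n".join(
        " ".join(map(str, encode_object(*obj))) for obj in objects
    )

# A toy 3x3 matrix: shape varies across columns, color across rows,
# and the bottom-right cell is withheld for the model to predict.
puzzle = [
    ("triangle", "black", 0, 0), ("square", "black", 0, 1), ("circle", "black", 0, 2),
    ("triangle", "gray",  1, 0), ("square", "gray",  1, 1), ("circle", "gray",  1, 2),
    ("triangle", "white", 2, 0), ("square", "white", 2, 1),  # answer withheld
]
print(encode_puzzle(puzzle))
```

A sequence like this is unlikely to appear verbatim in any training corpus, which is the point of Webb’s from-scratch data set; it also makes Mitchell’s objection visible, since the model receives clean numbers rather than a visual scene.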

Brittle tests 

The performance of large language models is brittle. Among people, it is safe to assume that somebody who scores well on a test would also do well on a similar test. That’s not the case with large language models: a small tweak to a test can drop an A grade to an F.

“In general, AI evaluation has not been done in such a way as to allow us to actually understand what capabilities these models have,” says Lucy Cheke, a psychologist at the University of Cambridge, UK. “It’s perfectly reasonable to test how well a system does at a particular task, but it’s not useful to take that task and make claims about general abilities.”

Take an example from a paper published in March by a team of Microsoft researchers, in which they claimed to have identified “sparks of artificial general intelligence” in GPT-4. The team assessed the large language model using a range of tests. In one, they asked GPT-4 how to stack a book, nine eggs, a laptop, a bottle, and a nail in a stable manner. It answered: “Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface will provide a stable platform for the next layer.”

Not bad. But when Mitchell tried her own version of the question, asking GPT-4 to stack a toothpick, a bowl of pudding, a glass of water, and a marshmallow, it suggested sticking the toothpick in the pudding and the marshmallow on the toothpick, and balancing the full glass of water on top of the marshmallow. (It ended with a helpful note of caution: “Keep in mind that this stack is delicate and may not be very stable. Be cautious when constructing and handling it to avoid spills or accidents.”)
