Seungjun Lee, CTO of Twelve Labs: “The video language model can become the basis of robotics… AI that thinks like a human.”


Seungjun Lee, CTO of Twelve Labs, is explaining the video language model.

“The video language model goes one step further than the vision language model (VLM), which belongs to the realm of ‘image understanding’: it is a model that understands a video’s context and audio as well. It is artificial intelligence (AI) technology that thinks the way people do, and it connects directly to robotics.”

Twelve Labs (CEO Jae-seong Lee), which specializes in video understanding artificial intelligence (AI), is widely regarded as a global AI startup. It is better known in the US than in Korea.

Along with collaborating with Disney and the National Football League (NFL), the company raised a Series A investment worth 70 billion won in June from NVIDIA’s venture arm and other investors.

Even last year, the company’s technology was still unfamiliar to many. That was understandable, since it was only this year that OpenAI and Google began releasing large multimodal models (LMM) in earnest.

Nevertheless, Twelve Labs became known after attracting $12 million (roughly KRW 16.9 billion) in investment from Radical Ventures and others in December 2022. In addition, it released the large-scale video language foundation model (VLFM) ‘Pegasus’ in November last year.

Chief Technology Officer (CTO) Seungjun Lee’s explanation showed that Twelve Labs’ technology goes beyond multimodality and is in line with the large world model (LWM), which has emerged as the next-generation AI technology.

First of all, CTO Lee said, “We have valued the potential of video understanding highly since starting the business in 2020. Above all, we believed that a solid foundation model was essential to overcome the lack of data and the limitations of pre-training.”

He then explained that the VLFM Twelve Labs is upgrading is a cutting-edge model that follows the large language model (LLM) and the VLM, and represents the future direction of the field.

The reason AI has shown revolutionary performance and broadened its applications is the LLM, which understands and generates natural language. Adding visual capabilities to it results in the VLM. CTO Seungjun Lee said, “A VLM can be seen as an assistant that, when shown a picture and asked a related question, provides an answer.”

Twelve Labs’ VLFM goes beyond images: it understands videos and describes them in natural language.

Explanation of Twelve Labs’ ‘Video Understanding Model’ (Photo = Twelve Labs)

Unlike still photos, videos are complex, containing preceding and following scenes, context, and audio information. Naturally, a VLFM performs higher-order tasks than a VLM.

The business areas that can make use of this are countless. The collaborations with Disney and the NFL are good examples. There is strong demand in the sports and media industries for tasks such as classifying videos and searching for specific scenes within them. It can handle requests such as “Categorize the scenes in which a particular actor appears.”
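As a rough illustration of the kind of workflow described above, the sketch below shows how a natural-language scene query against an already-indexed video library might look. The client, endpoint, and field names (`search_scenes`, `index_id`, `start_sec`) are hypothetical placeholders for illustration only, not the actual Twelve Labs API.

```python
import requests

# Hypothetical endpoint and payload shape for a natural-language video scene
# search; a real video-understanding API may differ in naming and structure.
API_URL = "https://api.example.com/v1/search"
API_KEY = "YOUR_API_KEY"

def search_scenes(index_id: str, query: str) -> list[dict]:
    """Send a natural-language query against a pre-indexed video library
    and return the matching scenes (video id, start/end time, score)."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "index_id": index_id,  # library of already-indexed videos
            "query": query,        # free-form natural-language request
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]

# Example request in the spirit of the article:
# "Categorize the scenes in which a particular actor appears."
scenes = search_scenes(
    index_id="sports-and-media-library",
    query="scenes in which actor Jane Doe appears",
)
for scene in scenes:
    print(scene["video_id"], scene["start_sec"], scene["end_sec"], scene["score"])
```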

For this reason, the CTO compared the VLFM to someone who is “more seasoned and older” than the VLM.

This is because as people age, they acquire basic thinking structures such as the laws of physics. For example, the fact that ‘a water bottle breaks if dropped from a high place’ may not seem useful in itself, but it is an essential piece of basic knowledge for living in the world.

This is part of the ‘context’ that the VLFM understands, because it grasps and makes use of both the preceding and following scenes.

Twelve Labs explains that it has focused on this very area. “Once people have basic knowledge, they can then think for themselves and live in the world,” he said. “Twelve Labs’ VLFM is advancing in order to solidify this basic knowledge.”

The point is that additional training data can be minimized (few-shot learning) thanks to the knowledge the model accumulates in this way.

This is connected to Twelve Labs’ future direction. It is also consistent with the world model that has been talked about recently.

Robotics is considered the most likely application field for the LWM, and CTO Lee also pointed to robotics.

“If the VLFM evolves into a ‘video language action model,’ we will be able to advance to robotics that think and act like humans,” he said. A video language action model views and understands video captured by a camera and outputs an action, instead of producing an answer in natural language.

For example, hardware (a robot) is required to carry out the command “Pick up the glass of water and pass it to me.” He also explained that in order to understand and perform commands that change continuously, rather than following a fixed routine, we need to move toward a video language action model.

He said, “Previous robots acted more by memorizing commands and outputs,” adding, “For this reason, they could not perform anything beyond the information entered in advance, such as how far the angle should be tilted or how much pressure should be applied.”

However, if the robot understands video, it becomes able to reason based on it. Accordingly, if it can output a numerical result such as ‘tilt the angle by 4 degrees and apply a pressure of about 10,’ a robot that can pick up and hand over a glass of water becomes possible.
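To make the idea of ‘video in, numbers out’ concrete, here is a minimal sketch of what such a video-language-action interface might look like. The function `predict_action`, the action fields, and the numeric scale are illustrative assumptions, not a real robot API or the model described in the interview.

```python
from dataclasses import dataclass

@dataclass
class GripperAction:
    """Structured output a video language action model might emit instead of text."""
    tilt_deg: float       # how far to tilt the wrist, in degrees
    grip_pressure: float  # grip pressure on an arbitrary 0-100 scale
    target: str           # object the action refers to

def predict_action(video_frames: list, command: str) -> GripperAction:
    """Hypothetical inference step: the model consumes camera frames plus a
    natural-language command and returns an action rather than an answer."""
    # In a real system this would be a forward pass through the model;
    # the values below simply mirror the example quoted in the article.
    return GripperAction(tilt_deg=4.0, grip_pressure=10.0, target="glass of water")

# The command can change freely; the robot is not limited to a fixed routine.
action = predict_action(video_frames=[], command="Pick up the glass of water and pass it to me")
print(f"Tilt {action.tilt_deg} degrees, apply pressure {action.grip_pressure} to the {action.target}")
```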

CTO Seungjun Lee said, “Robotics has always been a field I had in mind,” adding, “I have thought that if we use a model that resembles the human eye and human thinking, we can implement ‘robotics that minimizes data training.’”

Above all, robotics data is scarce. He also noted that there are many cases where companies have plenty of robot-related data but no foundation model.

In such cases, he explained, Twelve Labs becomes the first choice. The company is currently collaborating with a number of overseas companies.

(Photo = Amazon)

Amazon is in a similar situation, having deployed as many as one million robots in its warehouses and accumulated a huge amount of related data. Amazon also acquired the LLM-based robot AI startup Covariant last September to build a foundation model.

CTO Seungjun Lee said, “Just as the VLFM is creating an unexpected amount of demand, the video language action model will also have many uses in the future,” adding, “It could advance toward the ultimate form of AI, like the humanoids in movies.”

He also emphasized, “In the case of Twelve Labs, we had been researching this technology even before the term was coined,” adding, “Thanks to this, we were able to enter the market quickly.”

Reporter Jang Se-min semim99@aitimes.com
