Microsoft’s computer vision model will generate alt text for Reddit images


Two years ago, Microsoft announced Florence, an AI system that it pitched as a “complete rethinking” of contemporary computer vision models. Unlike most vision models on the time, Florence was each “unified” and “multimodal,” meaning it could (1) understand language in addition to images and (2) handle a variety of tasks slightly than being limited to specific applications, like generating captions.

Now, as a component of Microsoft’s broader, ongoing effort to commercialize its AI research, Florence is arriving as a component of an update to the Vision APIs in Azure Cognitive Services. The Florence-powered Microsoft Vision Services launches today in preview for existing Azure customers, with capabilities starting from automatic captioning, background removal and video summarization to image retrieval.

“Florence is trained on billions of image-text pairs. In consequence, it’s incredibly versatile,” John Montgomery, CVP of Azure AI, told TechCrunch in an email interview. “Ask Florence to search out a selected frame in a video, and it may possibly do this; ask it to inform the difference between a Cosmic Crisp apple and a Honeycrisp apple, and it may possibly do this.”

The AI research community, which incorporates tech giants like Microsoft, have increasingly coalesced around the concept that multimodal models are the most effective path forward to more capable AI systems. Naturally, multimodal models — models that, once more, understand multiple modalities, comparable to language and pictures or videos and audio — are in a position to perform tasks in a single shot that unimodal models simply cannot (e.g. captioning videos).

Why not string several “unimodal” models together to attain the identical end, like a model that understands only images and one other that understands exclusively language? Just a few reasons, the primary being that multimodal models in some cases perform higher at the identical task than their unimodal counterpart because of the contextual information from the extra modalities. For instance, an AI assistant that understands images, pricing data and buying history is prone to offer better-personalized product suggestions than one which only understands pricing data.

The second reason is, multimodal models are likely to be more efficient from a computational standpoint — resulting in speedups in processing and (presumably) cost reductions on the backend. Microsoft being the profit-driven business that it’s, that’s, little question, a plus.

So what about Florence? Well, since it understands images, video and language and the relationships between those modalities, it may possibly do things like measure the similarity between images and text or segment objects in a photograph and paste them onto one other background.

I asked Montgomery which data Microsoft used to coach Florence — a timely query, I assumed, in light of pending lawsuits that might determine whether AI systems trained on copyrighted data, including images, are in violation of the rights of mental property holders. He wouldn’t give specifics, save that Florence uses “responsibly obtained” data sources “including data from partners.” As well as, Montgomery said that Florence’s training data was scrubbed of probably problematic content — one other all-too-common feature of public training datasets.

“When using large foundational models, it’s paramount to guarantee the standard of the training dataset, to create the inspiration for the adapted models for every Vision task,” Montgomery said. “Moreover, the adapted models for every Vision task has been tested for fairness, adversarial and difficult cases and implement the identical content moderation services we’ve been using for Azure Open AI Service and DALL-E.”

Image Credits: Microsoft

We’ll should take the corporate’s word for it. Some customers are, it seems. Montgomery says that Reddit will use the brand new Florence-powered APIs to generate captions for images on its platform, creating “alt text” so users with vision challenges can higher follow along in threads.

“Florence’s ability to generate as much as 10,000 tags per image will give Reddit rather more control over what number of objects in an image they will discover and help generate significantly better captions,” Montgomery said. “Reddit may even use the captioning to assist all users improve article rating for looking for posts.”

Microsoft can also be using Florence across a swath of its own platforms, services and products.

On LinkedIn, as on Reddit, Florence-powered services will generate captions to edit and support alt text image descriptions. In Microsoft Teams, Florence is driving video segmentation capabilities. PowerPoint, Outlook and Word are leveraging Florence’s image captioning abilities for automatic alt text generation. And Designer and OneDrive, courtesy of Florence, have gained higher image tagging, image search and background generation.

Montgomery sees Florence getting used by customers for rather more down the road, like detecting defects in manufacturing and enabling self-checkout in retail stores. None of those use cases a multimodal vision model, I’d note. But Montgomery asserts that multimodality adds something beneficial to the equation.

“Florence is a whole re-thinking of vision models,” Montgomery said. “Once there’s easy and high-quality translation between images and text, a world of possibilities opens up. Customers will give you the option to experience significantly improved image search, to coach image and vision models and other model types like language and speech into entirely latest kinds of applications and to simply improve the standard of their very own customized versions.”


What are your thoughts on this topic?
Let us know in the comments below.


0 0 votes
Article Rating
Newest Most Voted
Inline Feedbacks
View all comments

Share this article

Recent posts

Would love your thoughts, please comment.x