The Golden Age of Open Source in AI Is Coming to an End
A (biased) history of open sourcing AI libraries and models
Open Sourcing Decisions
A Sea Change in Open Source
Turning Tides in Open Source AI
The Way Forward for Open Source AI

At the same time as TensorFlow’s rise, foreshadowing what was yet to come in open source AI, enterprise software went through an open source licensing crisis. Mostly because of AWS, which had mastered the craft of taking open source infrastructure projects and building commercial services around them, many open source projects exchanged their permissive licenses for “Copyleft” or “ShareAlike” (SA) alternatives.

Not all open source is created equal. Permissive licenses (like Apache 2.0 or MIT) allow anyone to take an open source project and build a commercial service around it. “Copyleft” licenses (like GPL), just like Creative Commons’ “ShareAlike” terms, are one way to protect against this. They’re sometimes known as a “poison pill” because they require any derivative product to be licensed the same way. If AWS launched a service based on an open source project with a “Copyleft” license, the AWS service itself would have to be open sourced under the same license.

So, partly in response to competing cloud services, the corporate creators and maintainers of open source projects like MongoDB and Redis switched their licenses to less permissive alternatives. This led to a painful but entertaining back-and-forth between AWS and those companies on the principles and merits of open source, which has since calmed down a bit.

Note that this change in licensing had a deceptive impact on the open source ecosystem: There are still lots of new open source projects being announced, but the licensing implications for what can and can’t be done with those projects are more complicated than most people realize.

At this point you should be asking yourself: If the corporate maintainers of open source infrastructure projects realized that others were reaping more of the commercial benefits than themselves, shouldn’t the same be happening with AI? Isn’t this an even bigger deal for open source AI models, which hold the combined value of the compute and data that went into creating them? The answers are: Yes and yes.

Although there appears to be a Robin Hood-esque movement around open source AI, the data is pointing in a different direction. Large corporations like Microsoft are changing the licensing of some of their most popular models from permissive to non-commercial (NC) licenses, and Meta has started to use non-commercial licenses for all of their new open source projects (MMS, ImageBind, DINOv2 are all CC-BY-NC 4.0 and LLaMA is GPL 3.0). Even popular projects from universities like Stanford’s Alpaca are only licensed for non-commercial use (inherited from the non-permissive attributes of the dataset they used). Entire companies change their business models in order to protect their IP and rid themselves of the obligation to open source as part of their mission. Remember when a small non-profit called OpenAI transformed itself into a capped-profit? Notice that GPT-2 was open sourced, but GPT-3.5 and GPT-4 weren’t?

More generally speaking, the trend towards less permissive licenses in AI, although opaque, is noticeable. Below is an analysis of model licenses on Hugging Face. The share of permissive licenses (like Apache, MIT, or BSD) has been in steady decline since mid-2022, while non-permissive licenses (like GPL) or restrictive licenses (like OpenRAIL) are becoming more common.

Source: Analysis by author
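For readers who want to run a similar tally themselves, here is a minimal sketch in Python. It is not the author's method: the license identifiers, their grouping into families, and the sample list are all assumptions made for illustration, and the sample is made up rather than measured.

```python
from collections import Counter

# Assumed grouping of license identifiers into the families named in the text.
# Both the identifiers and the grouping are illustrative, not the author's mapping.
LICENSE_FAMILY = {
    "apache-2.0": "permissive",
    "mit": "permissive",
    "bsd-3-clause": "permissive",
    "gpl-3.0": "copyleft",
    "cc-by-sa-4.0": "copyleft",
    "cc-by-nc-4.0": "non-commercial",
    "openrail": "restrictive",
}


def license_family(license_id):
    """Map a license identifier to a family.

    A missing identifier is reported as 'undeclared'; an identifier that is
    not in the table is reported as 'unknown'.
    """
    if not license_id:
        return "undeclared"
    return LICENSE_FAMILY.get(license_id.strip().lower(), "unknown")


def family_shares(license_ids):
    """Return the fraction of models that falls into each license family."""
    counts = Counter(license_family(lid) for lid in license_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


# Made-up sample for illustration only; these are not real measurements.
sample = [
    "apache-2.0", "MIT", "cc-by-nc-4.0", "gpl-3.0",
    None, "openrail", "apache-2.0", "cc-by-nc-4.0",
]
shares = family_shares(sample)
print(f"permissive share: {shares['permissive']:.1%}")
```

Running the same function over models grouped by creation month would give a trend line like the one described above. Models with no declared license are counted separately here, since lumping them in with either side would skew the permissive share.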

To make things worse, the recent frenzy around large language models (LLMs) has further muddied the waters. Hugging Face maintains an “Open LLM Leaderboard” which aims to highlight “the real progress that’s being made by the open-source community”. To be fair, all of the models on the board are indeed open source. However, a closer look reveals that almost none are licensed for commercial use*.

Source: Analysis by author

*Between the writing of this post and its publication, the license for the Falcon models changed to the permissive Apache 2.0 license. The general point still holds.

If anything, the Open LLM Leaderboard highlights that innovation from big tech (LLaMA was open sourced by Meta with a non-commercial license) dominates all other open source efforts. The bigger problem is that the derivative models built on that work are not as forthcoming about their licenses. Almost none declare their license explicitly, and you have to do your own research to find out that the models and data they’re based on don’t allow for commercial use.
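That “do your own research” step amounts to walking a model’s lineage and checking every ancestor. The sketch below illustrates the idea under a deliberately conservative assumption: a non-commercial restriction anywhere in the lineage carries through to the derived model. All artifact names and license flags are hypothetical, and the rule is a simplification, not a statement of how any particular license works.

```python
# Hypothetical lineage: each artifact records whether its own license permits
# commercial use and what it was built from. All names are made up.
ARTIFACTS = {
    "base-model": {"commercial_ok": False, "built_from": []},
    "instruction-dataset": {"commercial_ok": False, "built_from": []},
    "permissive-dataset": {"commercial_ok": True, "built_from": []},
    "derived-chat-model": {
        "commercial_ok": True,  # what the derived model's own page might say
        "built_from": ["base-model", "instruction-dataset"],
    },
    "independent-model": {
        "commercial_ok": True,
        "built_from": ["permissive-dataset"],
    },
}


def blocking_ancestors(name, artifacts, _seen=None):
    """Return the sorted names in the lineage of `name` (itself included)
    whose license forbids commercial use."""
    seen = set() if _seen is None else _seen
    if name in seen:  # already visited; also guards against cycles
        return []
    seen.add(name)
    entry = artifacts[name]
    blockers = set() if entry["commercial_ok"] else {name}
    for parent in entry["built_from"]:
        blockers.update(blocking_ancestors(parent, artifacts, seen))
    return sorted(blockers)


def commercially_usable(name, artifacts):
    """Assumed rule: usable only if nothing in the lineage forbids it."""
    return not blocking_ancestors(name, artifacts)


for model in ("derived-chat-model", "independent-model"):
    print(model, commercially_usable(model, ARTIFACTS),
          blocking_ancestors(model, ARTIFACTS))
```

The point of returning the blocking ancestors, rather than a bare yes or no, is that the answer usually sits one or two steps upstream of the model page you are looking at.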

There is a lot of virtue-signaling in the community, mostly by well-meaning entrepreneurs and VCs who hope that there’s a future that is not dominated by OpenAI, Google, and a handful of others. It is not obvious why AI models should be open sourced: they represent hard-earned intellectual property that companies develop over years, spending billions on compute, data acquisition, and talent. Companies would be defrauding their shareholders if they just gave everything away for free.

“If I could invest in an ETF for IP lawyers I would.”

The trend towards non-permissive licenses in open source AI seems clear. Yet, the overwhelming volume of news fails to point out that the cumulative benefit of this work accrues almost entirely to academics and hobbyists. Investors and executives alike should be more aware of the implications and exercise more care. I have a strong feeling that most startups in the emerging LLM cottage industry are building on top of non-commercially licensed technology. If I could invest in an ETF for IP lawyers I would.

My prediction is that value capture for AI (specifically for the latest generation of large generative models) will look similar to other innovations that require significant capital investment and accumulation of specialized talent, like cloud computing platforms or operating systems. A few major players will emerge that provide the AI foundation to the rest of the ecosystem. There will still be ample room for a layer of startups on top of that foundation, but just as there are no open source projects dethroning AWS, I consider it highly unlikely that the open source community will produce a serious competitor to OpenAI’s GPT and whatever comes next.
