The Golden Age of Open Source in AI Is Coming to an End
A (biased) history of open sourcing AI libraries and models
Open Sourcing Decisions
A Sea Change in Open Source
Turning Tides in Open Source AI
The Way forward for Open Source AI

At the same time as TensorFlow’s rise, foreshadowing what was yet to come in open source AI, enterprise software went through an open source licensing crisis. Mostly because of AWS, which had mastered the craft of taking open source infrastructure projects and building commercial services around them, many open source projects exchanged their permissive licenses for “Copyleft” or “ShareAlike” (SA) alternatives.

Not all open source is created equal. Permissive licenses (like Apache 2.0 or MIT) allow anyone to take an open source project and build a commercial service around it. “Copyleft” licenses (like GPL), much like Creative Commons’ “ShareAlike” terms, are one way to protect against this. They’re sometimes known as a “poison pill”, because they require any derivative product to be licensed the same way. If AWS launched a service based on an open source project with a “Copyleft” license, the AWS service itself would have to be open sourced under the same license.

So, partly in response to competing cloud services, the corporate creators and maintainers of open source projects like MongoDB and Redis switched their licenses to less permissive alternatives. This led to a painful but entertaining back-and-forth between AWS and those companies on the principles and merits of open source, which has since calmed down a bit.

Note that this change in licensing had a deceptive impact on the open source ecosystem: there are still many new open source projects being announced, but the licensing implications of what can and can’t be done with those projects are more complicated than most people realize.

At this point you should be asking yourself: If the corporate maintainers of open source infrastructure projects realized that others were reaping more of the commercial benefits than they were, shouldn’t the same be happening with AI? Isn’t this an even bigger deal for open source AI models, which hold the aggregate value of the compute and data that went into creating them? The answers are: Yes and yes.

Although there appears to be a Robin Hood-esque movement around open source AI, the data is pointing in a different direction. Large corporations like Microsoft are changing the licensing of some of their most popular models from permissive to non-commercial (NC) licenses, and Meta has started to use non-commercial licenses for all of their recent open source projects (MMS, ImageBind, DINOv2 are all CC-BY-NC 4.0 and LLaMA is GPL 3.0). Even popular projects from universities, like Stanford’s Alpaca, are only licensed for non-commercial use (inherited from the non-permissive terms of the dataset they used). Entire companies change their business models in order to protect their IP and rid themselves of the obligation to open source as part of their mission. Remember when a small non-profit called OpenAI transformed itself into a capped-profit? Notice that GPT-2 was open sourced, but GPT-3.5 and GPT-4 weren’t?

More generally speaking, the trend towards less permissive licenses in AI, although opaque, is noticeable. Below is an analysis of model licenses on Hugging Face. The share of permissive licenses (like Apache, MIT, or BSD) has been in steady decline since mid-2022, while non-permissive licenses (like GPL) or restrictive licenses (like OpenRAIL) are becoming more common.

Source: Analysis by author
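
The kind of license-share analysis described above could be approximated along the following lines. This is a minimal sketch, not the author's actual method: the `LICENSE_CATEGORY` mapping, the function names, and the sample records are all assumptions made for illustration, and the sample data is invented rather than taken from Hugging Face.

```python
from collections import Counter, defaultdict

# Assumed grouping that follows the three buckets named in the text.
# The exact mapping behind the original analysis is not given.
LICENSE_CATEGORY = {
    "apache-2.0": "permissive",
    "mit": "permissive",
    "bsd-3-clause": "permissive",
    "gpl-3.0": "non-permissive",
    "openrail": "restrictive",
}


def categorize(license_id):
    """Map a license identifier to a bucket; missing or unmapped ones are 'unknown'."""
    if not license_id:
        return "unknown"
    return LICENSE_CATEGORY.get(license_id.lower(), "unknown")


def permissive_share_by_month(records):
    """Return {month: fraction of that month's models with a permissive license}.

    `records` is an iterable of (month, license_id) pairs, e.g. ("2022-06", "mit").
    """
    counts = defaultdict(Counter)
    for month, license_id in records:
        counts[month][categorize(license_id)] += 1
    return {
        month: bucket["permissive"] / sum(bucket.values())
        for month, bucket in sorted(counts.items())
    }


# Invented sample records, for illustration only.
sample = [
    ("2022-06", "apache-2.0"),
    ("2022-06", "mit"),
    ("2022-06", "gpl-3.0"),
    ("2022-06", "apache-2.0"),
    ("2023-05", "openrail"),
    ("2023-05", "openrail"),
    ("2023-05", "mit"),
    ("2023-05", None),  # model with no declared license
]

shares = permissive_share_by_month(sample)
for month, share in shares.items():
    print(month, f"{share:.2f}")
```

Models with no declared license are counted in the denominator as "unknown" here. Excluding them instead would raise the permissive share, so that choice should be stated alongside any chart built this way.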

To make things worse, the recent frenzy around large language models (LLMs) has further muddied the waters. Hugging Face maintains an “Open LLM Leaderboard” which aims to highlight “the real progress that’s being made by the open-source community”. To be fair, all of the models on the board are indeed open source. However, a closer look reveals that almost none are licensed for commercial use*.

Source: Analysis by author

*Between the writing of this post and its publication, the license for the Falcon models changed to the permissive Apache 2.0 license. The general statement is still valid.

If anything, the Open LLM Leaderboard highlights that innovation from big tech (LLaMA was open sourced by Meta with a non-commercial license) dominates all other open source efforts. The bigger problem is that the derivative models are not as forthcoming about their licenses. Almost none declare their license explicitly, and you have to do your own research to find out that the models and data they’re based on don’t allow for commercial use.
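
The "do your own research" step above amounts to walking a model's lineage and checking every upstream model and dataset. A minimal sketch of that check follows. Everything in it is hypothetical: the `COMMERCIAL_OK` allow-list is a deliberate simplification of what a license actually permits, and the catalog entries and artifact names are invented for illustration.

```python
# Hypothetical allow-list of licenses treated as permitting commercial use.
COMMERCIAL_OK = {"apache-2.0", "mit", "bsd-3-clause"}


def commercial_use_blockers(name, catalog):
    """Walk an artifact's lineage and list everything that blocks commercial use.

    `catalog` maps an artifact name to a dict with keys:
      "license":  license identifier, or None if none is declared
      "based_on": names of the upstream models and datasets it derives from
    An undeclared license is treated as a blocker, since permission
    cannot be assumed.
    """
    blockers = []
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared ancestors are only checked once
        seen.add(current)
        entry = catalog[current]
        license_id = entry.get("license")
        if license_id is None or license_id.lower() not in COMMERCIAL_OK:
            blockers.append((current, license_id))
        stack.extend(entry.get("based_on", []))
    return sorted(blockers, key=lambda item: item[0])


# Invented lineage: a fine-tuned model with no declared license, built on a
# base model and an instruction dataset that are both non-commercial.
catalog = {
    "example-chat-model": {
        "license": None,
        "based_on": ["example-base-model", "example-instruction-data"],
    },
    "example-base-model": {"license": "noncommercial-research", "based_on": []},
    "example-instruction-data": {"license": "cc-by-nc-4.0", "based_on": []},
    "example-tokenizer": {"license": "mit", "based_on": []},
}

blockers = commercial_use_blockers("example-chat-model", catalog)
for artifact, license_id in blockers:
    print(artifact, "->", license_id or "no license declared")
```

In this invented example all three artifacts in the chain are flagged, while the MIT-licensed tokenizer checked on its own is not. The point of the design is that a derivative's own license field, even when present, says nothing until its ancestors have been checked as well.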

There is a lot of virtue-signaling in the community, mostly by well-meaning entrepreneurs and VCs who hope for a future that isn’t dominated by OpenAI, Google, and a handful of others. It isn’t obvious why AI models should be open sourced. They represent hard-earned intellectual property that companies develop over years, spending billions on compute, data acquisition, and talent. Companies would be defrauding their shareholders if they just gave everything away for free.


The trend towards non-permissive licenses in open source AI seems clear. Yet the overwhelming volume of news fails to point out that the cumulative benefit of this work accrues almost entirely to academics and hobbyists. Investors and executives alike should be more aware of the implications and exercise more care. I have a strong feeling that most startups in the emerging LLM cottage industry are building on top of non-commercially licensed technology. If I could invest in an ETF for IP lawyers I would.

My prediction is that the value capture for AI (specifically for the latest generation of large generative models) will look similar to that of other innovations that require significant capital investment and an accumulation of specialized talent, like cloud computing platforms or operating systems. A few major players will emerge that provide the AI foundation to the rest of the ecosystem. There will still be ample room for a layer of startups on top of that foundation, but just as there are no open source projects dethroning AWS, I don’t consider it possible that the open source community will produce a serious competitor to OpenAI’s GPT and whatever comes next.
