Software licenses masquerading as open source

Andrew Marble
June 1, 2023

I’m normally pretty laid back when it comes to terminology, but a lot of liberties have been taken recently with calling things Open Source in the machine learning community. Here I mention a few of the different licenses, draw a clear line between open source and the hangers-on that are using the terminology, and try to explain why it’s important to call out the difference. Companies are free to choose how they license their software, but they should not give the impression that they’re contributing unencumbered open source tools to the community if they’re not.

The current AI boom is consumer focused – anyone can use generative text (chatGPT) or image (Dall-E 2) AI. One aspect of this that consumers may not be used to dealing with is licensing. Most people simply ignore software licenses, as you do when installing a new program or signing up for a service. But they’re important to consider when working in AI, because the default for all the tools we use has been open source, and we generally work collaboratively, building on something someone else has made and hoping someone will use what we’ve made. Incorporating dubious licenses into this chain has the potential to gum up the works, introducing liabilities and needlessly handing power to third parties.

Simplistically, software, including AI tools, is often grouped into proprietary and open. When you use OpenAI’s chatGPT, you’re agreeing to terms of use for proprietary software that they’ve developed and to which they retain all the rights. On the other hand, if you go to the HuggingFace model hub, you will find many similar generative AI tools that you can download and run on your own hardware, as well as look at the source code. These are “open”, right?

With the consumer focus have come what I would consider some deceptive practices, the kind of thing you’d normally find a big asterisk next to in advertising and know was too good to be true. Two recent examples that I found frustrating: 1. a private company (that I don’t want to give publicity to) released a generative AI they referred to in their promotions as “Open Source”, with a license containing a clause that commercial use required permission from the publisher, a 10% royalty payment on any broadly related revenue, and various other information sharing. 2. Meta (Facebook) released an “Open Source” generative AI called LLaMa that requires permission from Meta to access and use and is restricted to non-commercial use.

I don’t want to criticize the terms of the licenses per se (though the royalty one was objectively ridiculous, and they dropped those terms in less than a week). Companies can dictate the terms of what they build, just as OpenAI can choose not to release their models at all and serve them as APIs. What I want to talk about is the deception – “open source” has a generally agreed technical definition as well as a colloquial connotation, and neither entails permission requirements, royalties, or big restrictions that only get brought up in clauses buried in the license. Not only is it confusing, but representing restrictive licenses as open can be destructive to the software and AI industries. People who build around putatively “open” AI tools are really building a moat for restrictive proprietary products that reflect a particular company’s interests. If people want to do that, it’s their business, but they should be doing it with their eyes open.

While I don’t have a legal perspective on the different licenses and I’m not trying to provide one, I wanted to summarize some key takeaways and opinions of a few of the common “open”-ish licenses, from my viewpoint as someone who uses and writes software for AI. The hope is that more people will pay attention to the kind of license that’s being proposed, and eschew the ones that add restrictions.

Open source

The Open Source Initiative provides a definition of what Open Source software entails, and it is often quoted as the technical definition. Some of the terms are:

The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.

The program must include source code, and must allow distribution in source code as well as compiled form.

The license must not restrict anyone from making use of the program in a specific field of endeavor.

The Free Software Foundation is an advocacy group that defines software freedom:

There are four freedoms that every user should have:

the freedom to use the software for any purpose,

the freedom to change the software to suit your needs,

the freedom to share the software with your friends and neighbors, and

the freedom to share the changes you make.

These are not legal definitions, but I’d argue they capture normal people’s expectations about open source software. I’d summarize it as “you can do what you want with it”, which is about more than just the availability of the source code.

The most common open source licenses are Apache 2.0 and MIT (with BSD as another alternative). These are universally recognized as open source. They allow users to do essentially what they want with the software, including modifying it, selling it, and licensing their modifications differently (for example under a closed license). Apache has some extra terms related to sharing information about modifications and to patent rights. If software claiming to be open is released under another license, the first thing I would ask is why, as it implies the creator is specifically modifying your rights or obligations for some purpose. Anecdotally, Apache 2.0 has traditionally been the most common license under which open source AI tools are released. For example, most of Meta’s / Facebook’s releases are under this license. Open-Assistant by LAION and OpenLLaMa (an open source version of LLaMa) are also released under Apache 2.0.
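Since most AI tools are distributed through GitHub, one quick sanity check is the license GitHub itself detects for a repository. A minimal sketch using Python’s standard library and GitHub’s `GET /repos/{owner}/{repo}/license` endpoint (the detection is heuristic, so treat the result as a hint and read the actual license file before relying on it):

```python
import json
from urllib.request import Request, urlopen

def spdx_id(license_payload: dict) -> str:
    """Pull the SPDX identifier out of a GitHub license API response."""
    detected = license_payload.get("license") or {}
    return detected.get("spdx_id") or "UNKNOWN"

def fetch_repo_license(owner: str, repo: str) -> str:
    """Ask GitHub which license it detected for a repository.

    Returns an SPDX identifier such as "Apache-2.0" or "MIT",
    or "UNKNOWN" when GitHub could not recognize the license.
    """
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}/license",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(req) as resp:
        return spdx_id(json.load(resp))
```

Anything that comes back as “UNKNOWN”, or as a custom identifier you don’t recognize, is exactly the “why another license?” case worth investigating before building on the tool.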

A related family is the copyleft licenses, such as the GNU General Public License (GPL). The big distinction is that these licenses say any modifications have to keep the same license – they are “viral” in that the terms of the license propagate forward and attach to whatever is built (I’m not sure that’s how a virus works, but anyway). The intent is to preserve “freedom”, ironically by limiting it, because any future versions of the software will need to keep the same license. For example, MIT-licensed code could be incorporated into a proprietary tool, but GPL code could not. For “finished” software, GPL is sometimes chosen because it lets people use it the way they want while preventing derivative works from becoming closed software. For AI tools, which are almost always destined to become part of a bigger project, it’s less useful, because it hobbles downstream projects, forcing them to keep the GPL license. Developers are generally considered less willing to work with GPL code than with Apache- or MIT-licensed code.

Permissive licensing vs copyleft is a classic debate, but I’d say both are generally considered good faith open source. The most important part, in my opinion, is being aware of your obligations under the GPL where it applies. On the other hand, there are now various licenses that get called open source but impose real restrictions, ranging from seemingly minor to forbidding commercial use. Everything else being equal, tools released under these licenses are somewhere between inferior and untenable to adopt. I can see the temptation for some to use them to limit uses, but I believe any limitations are bad for open source as a whole and only play into the hands of proprietary software. The richness of today’s software ecosystems, in particular AI, depends on continued access for everyone, and for all applications, to the state of the art – not on various companies imposing their world views on how their code gets used, which will go downhill very fast if it takes off.

A few examples I’ve seen recently of non-open source licenses are:

  1. The RAIL (Responsible AI) licenses, which include a list of things you can’t use the model for. Stability AI’s Stable Diffusion is an example of a recent AI tool using such a license. What’s on the list is not important. In what looks a lot like a political strategy, not using the model to harm children is one of the first restrictions. You can’t harm children anyway, so the rule is superfluous, but it might let a critic be accused of not wanting to prevent harm to children… Regardless of what (non-superfluous) restrictions are included, it’s the idea that a company is removing users’ freedom in how they use a tool that’s the problem, and it’s easy to see how that could deteriorate rapidly in a world where everything is political.

  2. Meta’s (Facebook’s) LLaMa language model license. This license forbids commercial use, military applications, biometrics, surveillance, etc. Plus, you have to ask Meta for the weights (you don’t actually – you can find leaked copies everywhere) and they decide whether you’re worthy, which is obviously not open.

  3. Various AI language models, such as StableVicuna (a vicuña is a llama-like animal), have been released under Creative Commons Attribution-NonCommercial-ShareAlike, a license prohibiting commercial use that requires derivative works to be similarly licensed.

Importantly, tools that use these licenses regularly represent themselves as open source because they do release the source code (though in Meta’s case even that is disingenuous, because the code is useless without the weights, which are held back).
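Paying attention doesn’t have to mean reading every license end to end up front. As a first pass over a Python environment, the licenses that packages declare about themselves can be listed and screened with the standard library alone. A sketch, with the caveats that the keyword list below is purely illustrative and that self-reported metadata is often missing or wrong, so a flag only means “go read the actual license”:

```python
from importlib.metadata import distributions

# Illustrative keyword list, not a legal analysis: it will miss many
# restrictive licenses and may flag benign ones that mention these words.
RESTRICTIVE_HINTS = ("noncommercial", "non-commercial", "cc-by-nc", "rail")

def looks_restrictive(license_text):
    """Return True if a declared license mentions a known restrictive term."""
    text = (license_text or "").lower()
    return any(hint in text for hint in RESTRICTIVE_HINTS)

def audit_installed_packages():
    """Yield (name, declared license, flagged) for every installed package.

    The "License" field is self-reported by package authors and may be
    absent, in which case it is reported here as "UNKNOWN".
    """
    for dist in distributions():
        declared = dist.metadata["License"] or "UNKNOWN"
        yield dist.metadata["Name"], declared, looks_restrictive(declared)
```

In a typical environment a loop over `audit_installed_packages()` turns up mostly MIT, BSD, and Apache entries, which is rather the point: restrictive terms should be the rare exception, and worth stopping on when they appear.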

At the consumer level, most people don’t pay attention to licenses. I have certainly never read the click-through license or terms of use for software, and almost nobody does (so why are they enforceable?). Playing fast and loose with the definition of open source software seems minor, and hauling out the official definition seems pedantic anyway. But all this is happening against a backdrop of public lobbying and feigned concern about how powerful and harmful AI is getting, and it shouldn’t be surprising if the logical next steps are attempts at restrictions on who can use AI and how they can use it, at the expense of the consumer and to the benefit of large players. The world managed to break out of the proprietary software rut of the 90’s and 2000’s and benefit from a rich open source ecosystem. Current AI only exists because of this richness, and I don’t want to see people use the hype to try and lock computers back down in the name of some imagined “harm”. I’m not against proprietary software; I’m against companies diluting “open source” as a concept while imposing restrictions based on a world view, or benefiting commercially from the confusion. It’s no different than other types of false or misleading advertising.

Building with, and contributing to, true open source tools is a way to ensure that everyone continues to benefit from technology, and that we’re not subject to the whims of how companies or other special interests think we should be using our computers. I’d urge people to understand the licenses they are using (or that their employers attach to what they release) and to focus on true open source.