Software licenses masquerading as open source

Andrew Marble
June 1, 2023

I’m normally pretty laid back when it comes to terminology, but a lot of liberties have been taken recently with calling things Open Source in the machine learning community. Here I mention a few of the different licenses, draw a clear line between open source and the hangers-on that are using the terminology, and try to explain why it’s important to call out the difference. Companies are free to choose how they license their software, but they should not give the impression that they’re contributing unencumbered open source tools to the community if they’re not.

The current AI boom is consumer focused – anyone can use generative text (chatGPT) or image (Dall-E 2) AI. One aspect of this that consumers may not be used to dealing with is licensing. Most people simply ignore software licenses, as you do when installing a new program or signing up for a service. But they’re important to consider when working in AI, because the default for all the tools we use has been open source, and we generally work collaboratively, building on something someone else has made and hoping someone will use what we’ve made. Incorporating dubious licenses into this chain has the potential to gum up the works, introducing liabilities and needlessly handing power to third parties.

Simplistically, software, including AI tools, is often grouped into proprietary and open. When you use OpenAI’s chatGPT, you’re agreeing to terms of use for proprietary software that they’ve developed and to which they retain all the rights. On the other hand, if you go to the HuggingFace model hub, you will find many similar generative AI tools that you can download and run on your own hardware, as well as look at the source code. These are “open”, right?

With the consumer focus have come what I would consider some deceptive practices, the kind of thing you’d normally find a big asterisk next to in advertising and know was too good to be true. Two recent examples that I found frustrating: 1. a private company (that I don’t want to give publicity to) released a generative AI they referred to in their promotions as “Open Source”, with a license containing a clause that commercial use required permission from the publisher, a 10% royalty payment on any broadly related revenue, and various other information sharing. 2. Meta (Facebook) released an “Open Source” generative AI called LLaMa that requires permission from Meta to access and use and is restricted to non-commercial use.

I don’t want to criticize the terms of the licenses per se (though the royalty one was objectively ridiculous, and they dropped those terms in less than a week). Companies can dictate the terms of what they build, just as OpenAI can choose not to release their models at all and serve them as APIs. What I want to talk about is the deception – “open source” has a generally agreed technical definition as well as a colloquial connotation, and neither entails permission requirements, royalties, or big restrictions that only get brought up in clauses buried in the license. Not only is it confusing, but representing restrictive licenses as open can be destructive to the software and AI industries. People who build around putatively “open” AI tools are really building a moat for restrictive proprietary products that reflect a particular company’s interests. If people want to do that, it’s their business, but they should be doing it with their eyes open.

While I don’t have a legal perspective on the different licenses and I’m not trying to provide one, I wanted to summarize some key takeaways and opinions of a few of the common “open”-ish licenses, from my viewpoint as someone who uses and writes software for AI. The hope is that more people will pay attention to the kind of license that’s being proposed, and eschew the ones that add restrictions.

Open source

The Open Source Initiative provides a definition of what Open Source software entails, and it is often quoted as the technical definition. Some of the terms are:

The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.

The program must include source code, and must allow distribution in source code as well as compiled form.

The license must not restrict anyone from making use of the program in a specific field of endeavor.

The Free Software Foundation is an advocacy group that defines software freedom:

There are four freedoms that every user should have:

the freedom to use the software for any purpose,

the freedom to change the software to suit your needs,

the freedom to share the software with your friends and neighbors, and

the freedom to share the changes you make.

These are not legal definitions, but I’d argue they capture normal people’s expectations about open source software. I’d summarize it as “you can do what you want with it”, which is about more than just the availability of the source code.

The most common open source licenses are Apache 2.0 and MIT (with BSD as another alternative). These are universally recognized as open source. They allow users to do essentially what they want with the software, including modifying it, selling it, and licensing their modifications differently (for example under a closed license). Apache has some extra terms related to sharing information about modifications and to patent rights. If software claiming to be open is released under another license, the first thing I would ask is why, as it implies the creator is specifically modifying your rights or obligations for some purpose. Anecdotally, Apache 2.0 has traditionally been the most common license under which open source AI tools are released. For example, most of Meta’s / Facebook’s releases are under this license. Open-Assistant by LAION and OpenLLaMa (an open source version of LLaMa) are also released under Apache 2.0.
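Since most AI tools are distributed through GitHub, one quick sanity check is the license GitHub itself detects for a repository. A minimal sketch using Python’s standard library and GitHub’s `GET /repos/{owner}/{repo}/license` endpoint (the detection is heuristic, so treat the result as a hint and read the actual license file before relying on it):

```python
import json
from urllib.request import Request, urlopen

def spdx_id(license_payload: dict) -> str:
    """Pull the SPDX identifier out of a GitHub license API response."""
    detected = license_payload.get("license") or {}
    return detected.get("spdx_id") or "UNKNOWN"

def fetch_repo_license(owner: str, repo: str) -> str:
    """Ask GitHub which license it detected for a repository.

    Returns an SPDX identifier such as "Apache-2.0" or "MIT",
    or "UNKNOWN" when GitHub could not recognize the license.
    """
    req = Request(
        f"https://api.github.com/repos/{owner}/{repo}/license",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(req) as resp:
        return spdx_id(json.load(resp))
```

Anything that comes back as “UNKNOWN”, or as a custom identifier you don’t recognize, is exactly the “why another license?” case worth investigating before building on the tool.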

A related family is the copyleft licenses, such as the GNU General Public License (GPL). The big distinction is that these licenses say any modifications have to keep the same license – they are “viral” in that the terms of the license propagate forward and attach to whatever is built (I’m not sure that’s how a virus works, but anyway). The intent is to preserve “freedom”, ironically by limiting it, because any future versions of the software will need to keep the same license. For example, MIT-licensed code could be incorporated into a proprietary tool, but GPL code could not. For “finished” software, GPL is sometimes chosen because it lets people use it the way they want while preventing derivative works from becoming closed software. For AI tools, which are almost always destined to become part of a bigger project, it’s less useful, because it hobbles downstream projects, forcing them to keep the GPL license. Developers are generally considered less willing to work with GPL code than with Apache- or MIT-licensed code.

Permissive licensing vs copyleft is a classic debate, but I’d say both are generally considered good faith open source. The most important part, in my opinion, is being aware of your obligations under the GPL where it applies. On the other hand, there are now various licenses that get called open source but impose real restrictions, ranging from seemingly minor to forbidding commercial use. Everything else being equal, tools released under these licenses are somewhere between inferior and untenable to adopt. I can see the temptation for some to use them to limit uses, but I believe any limitations are bad for open source as a whole and only play into the hands of proprietary software. The richness of today’s software ecosystems, in particular AI, depends on continued access for everyone, and for all applications, to the state of the art – not on various companies imposing their world views on how their code gets used, which will go downhill very fast if it takes off.

A few examples I’ve seen recently of non-open source licenses are:

  1. The RAIL (Responsible AI) licenses, which include a list of things you can’t use the model for. Stability AI’s Stable Diffusion is an example of a recent AI tool using such a license. What’s on the list is not important. In what looks a lot like a political strategy, not using the model to harm children is one of the first restrictions. You can’t harm children anyway, so the rule is superfluous, but it might let a critic be accused of not wanting to prevent harm to children… Regardless of what (non-superfluous) restrictions are included, it’s the idea that a company is removing users’ freedom in how they use a tool that’s the problem, and it’s easy to see how that could deteriorate rapidly in a world where everything is political.

  2. Meta’s (Facebook’s) LLaMa language model license. This license forbids commercial use, military applications, biometrics, surveillance, etc. Plus, you have to ask Meta for the weights (you don’t actually – you can find leaked copies everywhere) and they decide whether you’re worthy, which is obviously not open.

  3. Various AI language models, such as StableVicuna (a vicuña is a llama-like animal), have been released under Creative Commons Attribution-NonCommercial-ShareAlike, a license prohibiting commercial use that requires derivative works to be similarly licensed.

Importantly, tools that use these licenses regularly represent themselves as open source because they do release the source code (though in Meta’s case even that is disingenuous, because the code is useless without the weights, which are held back).
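Paying attention doesn’t have to mean reading every license end to end up front. As a first pass over a Python environment, the licenses that packages declare about themselves can be listed and screened with the standard library alone. A sketch, with the caveats that the keyword list below is purely illustrative and that self-reported metadata is often missing or wrong, so a flag only means “go read the actual license”:

```python
from importlib.metadata import distributions

# Illustrative keyword list, not a legal analysis: it will miss many
# restrictive licenses and may flag benign ones that mention these words.
RESTRICTIVE_HINTS = ("noncommercial", "non-commercial", "cc-by-nc", "rail")

def looks_restrictive(license_text):
    """Return True if a declared license mentions a known restrictive term."""
    text = (license_text or "").lower()
    return any(hint in text for hint in RESTRICTIVE_HINTS)

def audit_installed_packages():
    """Yield (name, declared license, flagged) for every installed package.

    The "License" field is self-reported by package authors and may be
    absent, in which case it is reported here as "UNKNOWN".
    """
    for dist in distributions():
        declared = dist.metadata["License"] or "UNKNOWN"
        yield dist.metadata["Name"], declared, looks_restrictive(declared)
```

In a typical environment a loop over `audit_installed_packages()` turns up mostly MIT, BSD, and Apache entries, which is rather the point: restrictive terms should be the rare exception, and worth stopping on when they appear.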

At the consumer level, most people don’t pay attention to licenses. I have certainly never read the click-through license or terms of use for software, and almost nobody does (so why are they enforceable?). Playing fast and loose with the definition of open source software seems minor, and hauling out the official definition seems pedantic anyway. But all this is happening against a backdrop of public lobbying and feigned concern about how powerful and harmful AI is getting, and it shouldn’t be surprising if the logical next steps are attempts at restrictions on who can use AI and how they can use it, at the expense of the consumer and to the benefit of large players. The world managed to break out of the proprietary software rut of the 90’s and 2000’s and benefit from a rich open source ecosystem. Current AI only exists because of this richness, and I don’t want to see people use the hype to try and lock computers back down in the name of some imagined “harm”. I’m not against proprietary software; I’m against companies diluting “open source” as a concept while imposing restrictions based on a world view, or benefiting commercially from the confusion. It’s no different than other types of false or misleading advertising.

Building with, and contributing to, true open source tools is a way to ensure that everyone continues to benefit from technology, and that we’re not subject to the whims of how companies or other special interests think we should be using our computers. I’d urge people to understand the licenses they are using (or that their employers attach to what they release) and to focus on true open source.